1738 Commits

Author SHA1 Message Date
Simon Mo
77072e93b3
[docs] governance documents (#24801)
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-12-08 12:06:20 +00:00
wang.yuqi
9e77ffca3f
[Model][7/N] Improve all pooling task | Deprecation as_reward_model. Extract hidden states prefer using new multi-vector retrieval API (#26686)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-12-08 08:10:09 +00:00
Zhiyu
cd00c443d2
[Misc] Rename TensorRT Model Optimizer to Model Optimizer (#30091)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-12-08 07:05:27 +00:00
Isotr0py
b952f4d3c3
[v1] Add PrefixLM support to FlexAttention backend (#27938)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-07 15:51:36 +00:00
Cyrus Leung
e83b7e379c
Revert "[Renderer] Separate out RendererConfig from ModelConfig (#30145)" (#30199) 2025-12-07 00:00:22 -08:00
Cyrus Leung
27f4c2fd46
[Renderer] Separate out RendererConfig from ModelConfig (#30145)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-06 23:15:42 -08:00
jeremyteboul
dce6d229f7
Support multiple image/audio embeddings per requests (#29988)
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
2025-12-07 04:34:24 +00:00
Viacheslav
21bb323542
Gigachat 3 tool parser and tests (#29905)
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com>
2025-12-06 12:04:14 +00:00
redwrasse
6476382384
prefix caching design doc sha256 now default (#29261)
Signed-off-by: redwrasse <mail@redwrasse.io>
2025-12-06 07:39:56 +00:00
Russell Bryant
3633035a3f
[Misc] Rename CohereForAI references to CohereLabs (#30147)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-12-05 19:41:40 +00:00
Yanan Cao
62b3333448
[Frontend] Remove deprecated -O.xx flag (#29991)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-12-05 00:47:22 -08:00
Tiger Xu / Zhonghu Xu
60a66ea2dc
[DOC]: Add kthena to integrations (#29931)
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com>
2025-12-05 08:11:03 +00:00
Hubert de La Jonquiere
befb59e5b1
[Model] Add Holo2 reasoning parser (#30048)
Signed-off-by: hdlj-h <hubert@hcompany.ai>
2025-12-05 10:38:45 +08:00
TimWang
690cc3ef20
docs: update metrics design doc to use new vllm:kv_cache_usage_perc (#30041)
Signed-off-by: Tim <tim.wang03@sap.com>
2025-12-04 23:37:14 +00:00
Tao Yun
6dcb07f676
support qwen3-vl handle requests with embeddings (#30037)
Signed-off-by: taoyun <1069423820@qq.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-12-04 17:34:06 +00:00
Shengqi Chen
990f806473
[Doc] clarify nightly builds in developer docs (#30019)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-05 00:28:37 +08:00
Harry Mellor
9998ea5b57
Delete HF version of Phi 4 MM (#30049)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-04 13:44:50 +00:00
wang.yuqi
74c4d80c6c
[Model][6/N] Improve all pooling task | Support chunked prefill with ALL pooling (#27145)
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-04 13:44:15 +00:00
dtc
842aba501d
[P/D] Introduce Mooncake Transfer Engine as kv_connector (#24718)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: dtc <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2025-12-04 09:51:36 +00:00
CYJiang
fd68e909db
[docs] Remove _total from counter metrics names (#30028)
In Prometheus Counters always expose their actual numeric value with a metric name that ends in _total. We should document the base name, as this what appears in the get_metrics() API.

Signed-off-by: CYJiang <86391540+googs1025@users.noreply.github.com>
2025-12-04 07:46:15 +00:00
Cyrus Leung
9ae2f60374
[Misc] Various cleanups for MM input processing (#29970)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-04 06:22:20 +00:00
bnellnm
2902c34826
[Kernels] Remove BatchedTritonOrDeepGemmExperts and default fallback to Triton (#29929)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-12-03 20:49:00 +00:00
Lumis Chen
9bcf92295a
[Core] Add xxHash as a high-performance hash option for accelerating prefix caching (#29163)
Signed-off-by: LuminolT <lumischen01@gmail.com>
Signed-off-by: Lumis Chen <lumischen01@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-12-03 16:06:57 +00:00
ioana ghiban
1bb17ecb39
[CPU Backend] [Doc]: Update Installation Docs for CPUs (#29868)
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com>
2025-12-03 13:33:50 +00:00
ioana ghiban
15b1511a15
[GPU Backend] [Doc]: Remove duplicate statements on missing GPU wheels. (#29962)
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com>
2025-12-03 12:56:47 +00:00
Amr Mahdi
f5d3d93c40
[docker] Build CUDA kernels in separate Docker stage for faster rebuilds (#29452)
Signed-off-by: Amr Mahdi <amrmahdi@meta.com>
2025-12-03 11:41:53 +00:00
Fadi Arafeh
78f4bb0ba8
[DOC] Add Arm to list of compute resouces providers (#29894)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-12-03 11:36:58 +00:00
Russell Bryant
b08025a83b
[Docs] Discuss api key limitations in security guide (#29922)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-12-02 20:57:28 -08:00
wang.yuqi
2eb4fe9129
[examples] Resettle pooling examples. (#29365)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-02 15:54:28 +00:00
Julien Denize
d8c6210eea
Add Mistral Large 3 and Ministral 3 (#29757)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Mickael Seznec <mickael@mistral.ai>
2025-12-02 10:29:00 +00:00
Louie Tsai
8bbcf8b6e7
[vLLM Benchmark Suite] Add default parameters section and update CPU benchmark cases (#29381)
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
2025-12-02 09:00:23 +00:00
Shengqi Chen
4b612664fd
[CI] Renovation of nightly wheel build & generation (take 2) (#29838)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-01 22:17:10 -08:00
Kevin H. Luu
1336a1ea24
Revert #29787 and #29690 (#29815) 2025-12-01 13:42:03 -08:00
Finbarr Timbers
38caf7fa1a
Update FAQ on interleaving sliding windows support (#29796)
Signed-off-by: Finbarr Timbers <finbarrtimbers@gmail.com>
2025-12-01 19:15:19 +00:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics (#27793)
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

Implementation uses monotonic timestamps for accuracy, 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.

Two new runtime flags are introduced:

--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)

Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
sangbumlikeagod
092bb73b8a
[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation (#24209)
Signed-off-by: sangbumlikeagod <oironese@naver.com>
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com>
2025-12-01 18:19:17 +01:00
Shengqi Chen
36db0a35e4
[CI] Renovation of nightly wheel build & generation (#29690)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-01 21:25:39 +08:00
wang.yuqi
62de4f4257
[Frontend] Resettle pooling entrypoints (#29634)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-12-01 15:30:43 +08:00
Yifei Zhang
1ab8fc8197
Make PyTorch profiler gzip and CUDA time dump configurable (#29568)
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
2025-12-01 04:30:46 +00:00
Cyrus Leung
2afcec4dec
[Misc] Update TokenizerLike interface and move get_cached_tokenizer (#29730)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-30 14:59:47 +08:00
Jinzhen Lin
1656ad3704
[Kernel][Quantization] add w4a8 support for marlin kernel (#24722)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
dublc
f4341f45d3
[Doc]: fix code block rendering (#29728)
Signed-off-by: dublc <jdublc0x@gmail.com>
2025-11-29 13:46:48 +00:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface (#29693)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Yanan Cao
3461e7efd8
[Frontend] Remap -O to -cc commandline flag (#29557)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-11-28 21:51:12 +00:00
Harry Mellor
4332955602
[Docs] Add CLI reference doc for vllm bench sweep plot_pareto (#29689)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-28 08:10:08 -09:00
Wilson Wu
5c2b5cb422
[Docs] Add SPLADE and Ultravox models to supported models documentation (#29659)
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-11-28 01:29:28 -09:00
Cyrus Leung
ccbdf51bd5
[Doc] Reorganize benchmark docs (#29658)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 17:19:25 +08:00
rongfu.leng
480598958e
[Feature][Bench] Add pareto visualization (#29477)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-11-27 23:53:20 -08:00
Wilson Wu
18523b87f6
[Docs] Update supported models for Olmo 3 in tool calling documentation (#29411)
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>
2025-11-28 02:53:55 +00:00
Morrison Turnansky
0838b52e2e
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): Set up -O infrastructure (#26847)
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: adabeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-27 01:55:58 -08:00