1702 Commits

Author SHA1 Message Date
Woosuk Kwon
3e1ad40655
[Model Runner V2] Add apply_temperature option to gumbel_sample (#29276)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-23 14:13:00 -08:00
Woosuk Kwon
62d54ba46d
[Model Runner V2] Optimize CUDA graph capture time (#29275)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-23 11:15:32 -08:00
Woosuk Kwon
b004c00418
[Model Runner V2] Support spec decoding [1/N] (#29274)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-23 10:09:06 -08:00
Woosuk Kwon
7f12c82fa6
[Model Runner V2] Change bookkeeping logic in preparation for spec decoding (#29194)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-23 09:42:52 -08:00
Woosuk Kwon
20ee418adc
[Model Runner V2] Minor fix for cudagraph_utils (#29256) 2025-11-22 20:12:50 -08:00
Yizhou
df78aeef08
Refactor: Move CUDA graph dispatch logic earlier (#27382)
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-11-22 16:10:31 -05:00
Nick Hill
7df331c66b
[BugFix] Fix chunked prompt logprobs + preemption (#29071) 2025-11-22 16:07:18 -05:00
Fadi Arafeh
730bd35378
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-22 09:04:36 -08:00
Nick Hill
d44a63c6d6
[BugFix] Fix returned logprobs with spec decode + prefill chunking (#29216)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-22 22:41:25 +08:00
Nicolò Lucchesi
066209a045
[Attention] Refactor FA block_size limitations to hybrid models only (#29084)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-22 06:38:44 -08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init (#29186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-22 19:34:15 +08:00
Woosuk Kwon
e9056056fb
[Model Runner V2] Limit cudagraph size to max decode batch size (#29221)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 20:21:35 -08:00
Jie Luo
5c8f2adf50
[Bugfix] Fix block size in block_table with PCP (#29094)
Signed-off-by: Livinfly <luojie3m@gmail.com>
2025-11-22 01:34:28 +00:00
Wentao Ye
1d34eb11e0
[CI] Bug: Fix triton import issue (#29202)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 17:14:49 -08:00
Lucas Wilkinson
30d6466238
[BugFix] Fix Eagle IndexError: list index out of range for even num_speculative_tokens (#29102)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-22 00:47:05 +00:00
Woosuk Kwon
e9af6ba62a
[Model Runner V2] Optimize Gumbel Sampling Kernel (#29210)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 15:52:28 -08:00
Mark McLoughlin
c6fa3895e9
[KV Connector] Fix async connector prefix cache metrics (#28585)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-21 17:45:00 -05:00
Julien Denize
57430fc95c
Default model load/config/tokenizer to mistral format if relevant files exist (#28659)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-21 13:58:59 -08:00
Woosuk Kwon
1bed891f72
[Chore] Fix pre-commit error after #25266 (#29190) 2025-11-21 10:21:40 -08:00
Wentao Ye
a42ab317ac
[Log] Optimize startup log (#28948)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-21 08:46:20 -08:00
Woosuk Kwon
30b44a1598
GPU Model Runner V2 (#25266)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 08:20:55 -08:00
Cyrus Leung
d7219bcda3
[Misc] Move dynamic seed initialization to EngineArgs (#29165)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 15:27:44 +00:00
wangxiyuan
4050bae417
[Doc] Update plugin doc (#28532)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-21 14:57:26 +00:00
who who who
fc9f821d20
fix cross attention (#28346)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
2025-11-21 04:55:43 -08:00
Russell Bryant
cca2d2cdbe
[Core] Align whisper closer to other multimodal models (#27292)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-11-21 12:01:54 +00:00
Jialin Ouyang
30b9c67743
Revert "[Redo] #26368 (#28771)" (#29121)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-20 21:27:45 -08:00
Cyrus Leung
56e96b37e4
[V0 Deprecation] Remove best_of (#29090)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 11:40:40 +08:00
Wentao Ye
56669c1f29
[CI] Fix mypy for vllm/v1/worker (#29037)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 11:36:07 +08:00
Xiao Li
ed6ae1e36a
[AITER] [ROCm] Fix crash when loading llama4 model with old aiter version installed, fallback to forward_native implementation (#29124)
Signed-off-by: Xiao Li <ilx@meta.com>
2025-11-20 17:54:35 -08:00
Jee Jee Li
9875be6431
[LoRA][2/2]Remove LoRA extra vocab (#28545)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-21 09:46:43 +08:00
Or Ozeri
647464719b
[KVConnector][Core] Support cross-layer KV blocks (#27743)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-11-20 19:09:59 +01:00
Zhewen Li
93c8672ceb
[Bugfix] Fix spec decode memory regression after #28549 (#28819)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-20 19:05:50 +08:00
Samit
371b1d4c61
[RL] Add Pause and Resume Generation for Asynchronous RL Training (#28037)
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-11-20 03:01:03 -08:00
Or Ozeri
c0c2dd1e0b
[BugFix] kv_offloading: Fix bug in loading of partial cpu blocks (#28951)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-20 18:55:10 +08:00
Pleaplusone
06c20c9904
[ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-20 02:54:01 -08:00
Isotr0py
64192d5624
[Bugfix] Revert custom attention mask for gemma3-mm (#28995)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-20 13:23:22 +08:00
Benjamin Chislett
fcbcba6c70
[Feat] Iteration-level profiling for Torch and CUDA profiler (#28987)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 19:17:48 -08:00
Qiang Zhang
3fb0d90999
[AMD] Use Decoupled Kernel Block Size to Support AITER MLA block_size=1 (#27715)
Signed-off-by: chiangzhang <chiangzhang@tencent.com>
2025-11-20 02:11:52 +00:00
Jialin Ouyang
537cc635c7
[GC Debugger] Simply and improve GC Debugger Utils (#29029)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-20 00:10:22 +00:00
Julien Denize
cdeec2e606
[BugFix] Ray with multiple nodes (#28873)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-11-19 21:20:58 +00:00
Qiu
2fd893b4ce
[Feature] Prefill Context Parallel (PCP) basic support (#28718)
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com>
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com>
Co-authored-by: LookAround <lixushi@huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
2025-11-19 15:52:44 -05:00
Izzy Putterman
02f5903b84
Eagle: MM Cuda Graphs with MRope (#28896)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-19 15:01:05 -05:00
Aleksandr Malyshev
ac10fd3c69
Upstreaming aiter triton attention backend as a new backend (#28701)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-11-19 19:59:30 +00:00
Jialin Ouyang
3319a493fc
[Core] Reuse created spec tokens lists to mitigate GC cost (#28917)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-19 19:20:22 +00:00
Lucas Wilkinson
48fc8b1e59
[BugFix] Fix async-scheduling + FlashAttn MLA (#28990)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-19 10:04:07 -05:00
Matthew Bonanni
4c23690f43
[Attention] FlashAttention ViT support, make default backend (#28763)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-18 20:06:21 -08:00
Jialin Ouyang
40b6b38f2c
[Core] Switch Flat logprob control from environment variable to SamplingParams (#28914)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-11-19 02:10:02 +00:00
Kunshang Ji
2a2d5d2780
Replace torch.cuda.Event with torch.Event for better hardware compatibility (#26985)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-18 11:34:36 -08:00
vllmellm
0af3d4f0df
[FEAT] [AITER] [ROCm] integrate aiter sampling ops (#26084)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-18 17:28:34 +00:00
Nick Hill
da8dadf68b
[Minor] Rename ec_producer field to is_ec_producer (#28884)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-18 17:26:07 +00:00