12170 Commits

Author SHA1 Message Date
Nicolò Lucchesi
066209a045
[Attention] Refactor FA block_size limitations to hybrid models only (#29084)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-22 06:38:44 -08:00
Bram Wasti
5f7209a793
[tiny] Remove unsupported TRITON_MLA backend from batch invariance (#28832)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-22 21:00:50 +08:00
yihong
2d4978a57e
fix: clean up function never use in setup.py (#29061)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-11-22 05:00:04 -08:00
Nandan Vallamdasu
6965a392a4
Fix: Resolve circular import in model_loader/utils.py (#29189)
Signed-off-by: nandan2003 <nandan.vallamdasu@outlook.com>
Signed-off-by: Nandan Vallamdasu  <nandan.vallamdasu@outlook.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-22 04:58:22 -08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init (#29186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-22 19:34:15 +08:00
rasmith
8e22da1d7f
[CI/Build] Don't add FLASHINFER backend in test_cpu_offloading.py (#29229)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-22 11:00:54 +00:00
rasmith
a4fdf2405c
[CI/Build] Skip tests that require libcudart in test_lmcache_integration.py (#29228)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-22 10:59:39 +00:00
Jane (Yuan) Xu
e6309acdba
Simplify from_blob usage in get_cuda_view_from_cpu_tensor (#29027)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-22 10:35:32 +00:00
jinghanhu
988ee66b0d
Handle triton kernel import exception (#29062)
2025-11-22 10:07:50 +00:00
Mads Kildegård
ea38474ac5
[Frontend][Responses API] Multi-turn (with type: "output_text") support for non-harmony requests (#29175)
Signed-off-by: Mads Kildegård <mkildegaard99@gmail.com>
2025-11-22 09:58:22 +00:00
Andrew Xia
742e9ff6b3
[responsesAPI] parse reasoning item input (#28248)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-22 15:42:11 +08:00
Woosuk Kwon
e9056056fb
[Model Runner V2] Limit cudagraph size to max decode batch size (#29221)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 20:21:35 -08:00
Jee Jee Li
1489902b53
[LoRA] Cleanup FusedMoEWithLoRA (#29187)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-22 04:01:30 +00:00
Yanan Cao
933f67ecd8
[Bugfix]Fix a conditional to not check zero value (#28754)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-11-21 19:59:07 -08:00
rasmith
fd65015a14
[CI/Build] Only use supported types and features on ROCm in MoE kernel tests (#29149)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-21 20:34:33 -07:00
Yihua Cheng
77e1c035d0
[chore][LMCache connector] Remove useless logs from lmcache connector (#29069)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-11-22 03:18:00 +00:00
rasmith
6f403501a0
[CI/Build][AMD] Enable Entrypoints Integration Test (Pooling) to run without error on ROCm (#29212)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-22 02:13:18 +00:00
FlintyLemming
052950e5b3
Add fused MoE config for H200 E160 N192 fp8 (#29182)
Signed-off-by: FlintyLemming <admin@flinty.moe>
2025-11-21 17:37:51 -08:00
qli88
1ef9c9e294
[CI/Build] Disable test_gptoss_tp.py in 'LoRA TP Test' group for ROCm platform (#29204)
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-11-21 17:36:19 -08:00
Jie Luo
5c8f2adf50
[Bugfix] Fix block size in block_table with PCP (#29094)
Signed-off-by: Livinfly <luojie3m@gmail.com>
2025-11-22 01:34:28 +00:00
Ryan Rock
ed8e6843cc
[CI/Build] Add terratorch for AMD (#29205)
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2025-11-21 17:31:22 -08:00
Lukas Geiger
d045e22dfe
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (#29217)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-21 17:30:55 -08:00
Wentao Ye
1d34eb11e0
[CI] Bug: Fix triton import issue (#29202)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 17:14:49 -08:00
Charlie Fu
9a3101b2ba
[Rocm][CI] Fix DeepSeek V2-Lite Accuracy CI (#29135)
Signed-off-by: charlifu <charlifu@amd.com>
2025-11-21 17:11:02 -08:00
Angela Yi
d5dbdbfcb2
[docs] Fix cudagraph mode config (#29170)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-11-21 17:10:27 -08:00
Lucas Wilkinson
30d6466238
[BugFix] Fix Eagle IndexError: list index out of range for even num_speculative_tokens (#29102)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-22 00:47:05 +00:00
Woosuk Kwon
e9af6ba62a
[Model Runner V2] Optimize Gumbel Sampling Kernel (#29210)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 15:52:28 -08:00
Mark McLoughlin
c6fa3895e9
[KV Connector] Fix async connector prefix cache metrics (#28585)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-21 17:45:00 -05:00
Varun Sundar Rabindranath
3137991f55
[BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor (#29162)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-21 14:28:17 -08:00
Julien Denize
57430fc95c
Default model load/config/tokenizer to mistral format if relevant files exist (#28659)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-21 13:58:59 -08:00
Lucas Wilkinson
c68c7b403d
[BugFix] Fix missing symbol triggering FA2 fallback on Hopper (#29107)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-21 13:58:32 -08:00
Ning Xie
53a1ba6ec5
[log] add weights loading time log to sharded_state loader (#28628)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-21 21:06:09 +00:00
Lucas Wilkinson
1840c5cb18
[BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case (#27426)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-21 11:41:52 -08:00
Woosuk Kwon
1bed891f72
[Chore] Fix pre-commit error after #25266 (#29190)
2025-11-21 10:21:40 -08:00
Cyrus Leung
ceca060501
[Deprecation] Deprecate seed=None (#29185)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 18:19:25 +00:00
Charlie Fu
75648b16dd
[ROCm][CI] Fix config/test_config_generation.py (#29142)
Signed-off-by: charlifu <charlifu@amd.com>
2025-11-21 17:12:16 +00:00
Chendi.Xue
460d02a417
[NIXL] Fix after virtual block_size for host_buffer with heter kv_layout (#29122)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2025-11-21 08:55:27 -08:00
Mingyuan Ma
b4c8fbaae2
Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod (#28892)
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-21 09:54:11 -07:00
rasmith
e99e467384
[CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_parallel in lighting_attn.py (#29132)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-21 11:53:09 -05:00
Wentao Ye
a42ab317ac
[Log] Optimize startup log (#28948)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-21 08:46:20 -08:00
Aleksandr Malyshev
b7f1f490a6
Upstream triton fp4 weight preshuffle (#28888)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-11-21 11:34:46 -05:00
Woosuk Kwon
30b44a1598
GPU Model Runner V2 (#25266)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-21 08:20:55 -08:00
Wentao Ye
1f400c58b8
[CI] Add batch invariant test to ci (#27842)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 09:20:33 -07:00
rasmith
711241c13c
[CI/Build] Fix illegal memory access and unsupported test in kernels/attention/test_cache.py (#29118)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-21 10:58:38 -05:00
Cyrus Leung
d7219bcda3
[Misc] Move dynamic seed initialization to EngineArgs (#29165)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 15:27:44 +00:00
wangxiyuan
4050bae417
[Doc] Update plugin doc (#28532)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-21 14:57:26 +00:00
skaraban3807
f1805db1a6
[Perf] These changes enhance the NUMA functionality of vllm for systems with more than one NUMA nodes per socket (#25559)
Signed-off-by: Siddappa Karabannavar <siddappa.karabannavar@amd.com>
2025-11-21 14:13:52 +00:00
Julien Denize
434f3d3eb8
Fix mistral config (#29172)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-11-21 14:01:20 +00:00
sfbemerk
2092ce8c39
Tool Call Parser logs should not contain user input / model output except on DEBUG (#29160)
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-11-21 20:57:19 +08:00
who who who
fc9f821d20
fix cross attention (#28346)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
2025-11-21 04:55:43 -08:00