Varun Sundar Rabindranath
74a9a9faad
[Performance][B200] Fix deepgemm prologue ( #27897 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-12 13:13:03 -08:00
PerryZhang01
a1e7fa362a
[EPLB][ROCm]: support EPBL for ROCm backend ( #27731 )
...
Signed-off-by: Perry Zhang <perzhang@amd.com>
Co-authored-by: Perry Zhang <perzhang@amd.com>
2025-11-12 18:16:35 +00:00
Canlin Guo
bc5bd45c7d
[Refactor] Remove redundant TP gather/split in split_qkv in QwenVL ( #28271 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-12 15:56:47 +00:00
Alexander Matveev
f76e85c299
[Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) ( #28492 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-12 10:51:43 -05:00
Harry Mellor
54aecd9ed5
Fix pre-commit (and XPU) on main ( #28556 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 06:13:41 -08:00
Jee Jee Li
a9d18b5107
[Bugfix] Fix gpt_oss packed_modules_mapping ( #28536 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-12 21:02:06 +08:00
wuyaoxuehun
d3ade61e42
[Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. ( #27597 )
...
Signed-off-by: wuao.scotty <wuao.scotty@bytedance.com>
Co-authored-by: wuao.scotty <wuao.scotty@bytedance.com>
2025-11-12 10:14:00 +00:00
yyzxw
1761dea1a8
[BugFix]: --enable-lora with model granite-4.0-micro crash ( #27733 )
...
Signed-off-by: zxw <1020938856@qq.com>
2025-11-12 09:03:56 +00:00
Lukas Geiger
ac0bb2c307
[Core] Cache vllm_is_batch_invariant ( #28304 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 05:03:01 +00:00
Fanli Lin
b9ce9a3013
[BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms ( #28447 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-12 03:13:21 +00:00
Chenguang Zheng
4ccffe561f
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation ( #25233 )
...
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
2025-11-11 18:58:33 -08:00
Lukas Geiger
cbb799e314
[Model][Qwen3VL] Simplify get_mrope_input_positions using numpy ( #28302 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 02:55:10 +00:00
Michael Goin
e5f599d4d1
[Bugfix] Disable shared expert overlap if Marlin MoE is used ( #28410 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 23:16:12 +00:00
Jee Jee Li
9d1c474704
[LoRA][1/N]Remove LoRA extra vocab ( #28382 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-11 11:06:21 -08:00
Lukas Geiger
76e4dcf225
[Misc] Remove unused attention prefix prefill ops functions ( #26971 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-11 18:26:04 +00:00
Fanli Lin
d5edcb8678
[BugFix] Fix Siglip2Attention on XPU ( #28448 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 18:18:02 +00:00
xuebwang-amd
5a1271d83a
[Quantization] fix attention quantization of gpt_oss model ( #27334 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2025-11-11 12:06:00 -05:00
xuebwang-amd
05576df85c
[ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model ( #24239 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-11 12:05:22 -05:00
zhrrr
68c09efc37
[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model ( #27165 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2025-11-11 12:00:31 -05:00
Michael Goin
f9a4087182
Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel ( #28431 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 11:46:04 -05:00
Fanli Lin
b886068056
[BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU ( #28444 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 15:29:33 +00:00
bnellnm
a1448b4b69
[Kernels] Split up fused_moe/layer.py, isolate more modular kernel code ( #28064 )
2025-11-11 07:29:02 -07:00
Cyrus Leung
afffd3cc8a
[Model] Pass mm_features directly into get_mrope_input_positions ( #28399 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-11 21:14:48 +08:00
Matthew Bonanni
b30dfa03c5
[Attention] Refactor CUDA attention backend selection logic ( #24794 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-11 07:40:44 -05:00
Lukas Geiger
9973e6e04a
[Model][Qwen3VL] Slighly speedup fast_pos_embed_interpolate ( #28434 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-11 10:35:10 +00:00
Fanli Lin
c7991269dd
[BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` ( #28387 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 08:45:38 +00:00
Jiangyun Zhu
f0359fffa4
[Bugfix] fix qwen3-next crash ( #28202 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-11 08:24:28 +00:00
Roger Wang
4fd4b743a2
[Bugfix] Fix max image size for PaddleOCR-VL ( #28442 )
...
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-11 08:07:24 +00:00
Robert Shaw
e605e8e323
[Bugfix] Fix Stream Sync for Shared Expert Overlap ( #28430 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-11 05:59:08 +00:00
Wentao Ye
de540c0354
[Feature] Add env var VLLM_MOE_USE_DEEP_GEMM ( #28422 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-11 02:29:48 +00:00
Wentao Ye
35d801f13f
[Feature] Refactor batch invariant fp8 DeepGEMM ( #27606 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-11 00:08:40 +00:00
Ilya Markov
d17ecc6b19
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds ( #24248 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-10 18:33:11 -05:00
Yong Hoon Shin
021143561f
[ROCm] Add missing gemm_a8w8_blockscale import ( #28378 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-11-10 23:13:36 +00:00
Lucas Wilkinson
6dec9f6109
[BugFix] Fix DeepGEMM over-allocating workspace ( #28254 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-10 17:01:17 -05:00
Sage Moore
40d33264c6
[Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled ( #28377 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Sage Moore <sagemoore@utexas.edu>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-10 20:39:19 +00:00
jiahanc
34553b9d27
[Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next ( #27492 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-11-10 12:34:57 -05:00
Varun Sundar Rabindranath
b039bfda8f
[Bugfix] Fix persistent_masked_m_silu_mul_quant tests ( #28366 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-10 09:21:52 -08:00
Cyrus Leung
d0e186c16f
[V0 Deprecation] Remove unused context_len and seq_len from M-RoPE ( #28395 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-11 00:30:06 +08:00
vllmellm
f080a83511
[RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops ( #24490 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-10 08:20:53 -08:00
zejunchen-zejun
b06b9470ca
[Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model ( #27474 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-11-10 10:38:56 -05:00
Ferrebo
912744d066
[Fix] optimize visual token mask with caching and multi-token support ( #28374 )
...
Signed-off-by: Ferrebo <itachi971009@gmail.com>
Signed-off-by: kebo01 <kebo01@baidu.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-10 13:23:49 +00:00
Yu Jiaqi
15be507c86
[bugfix] fix siglip batch text output error ( #28365 )
...
Signed-off-by: piood <2477084691@qq.com>
2025-11-10 21:21:15 +08:00
Xiake Sun
03fa4d3fb3
[Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X ( #28373 )
...
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
Signed-off-by: Xiake Sun <xisun@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-10 04:53:40 +00:00
Jiangyun Zhu
c4768dcf47
[Kernel] Fix fused_gdn_gating ( #28343 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-09 14:26:35 -07:00
Jiangyun Zhu
7ae5a5fb11
[Misc] Add some comments in qwen3-next ( #28267 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-08 23:59:24 -08:00
Yong Hoon Shin
de2b78305f
[ROCm] Add env to enable/disable aiter triton gemm ( #28321 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-11-08 22:27:00 -08:00
Mohammad Miadh Angkad
404d7a9d14
[Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 ( #28345 )
...
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
2025-11-08 15:50:10 -07:00
Robert Shaw
26990d25dc
[Bugfix] Update device name for H200 detection ( #28349 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-08 19:01:11 +00:00
Isotr0py
934a9c3b79
[Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 ( #28101 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-08 05:01:27 +00:00
Michael Goin
0852527647
[Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM ( #28124 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-07 18:20:55 -08:00