Andy Chen
9b94d6ec8f
Enable 4bit bnb prequant MOE ( #21548 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-11 19:02:14 -07:00
Harry Mellor
458e74eb90
Support more parallel styles in Transformers backend TP ( #22651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-11 10:42:48 -07:00
22quinn
807d21b80d
[BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI ( #22611 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-11 10:31:36 -07:00
wang.yuqi
84cf78acee
[Model] Pooling models default to using chunked prefill & prefix caching if supported. ( #20930 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-11 09:41:37 -07:00
Wentao Ye
f7dcce7a4a
[Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale ( #21968 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-11 09:39:08 -07:00
danielafrimi
14a5d903ab
[Model] NemotronH Support ( #22349 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
2025-08-11 04:09:24 -07:00
Cyrus Leung
951b038298
[Misc] Move jsontree to utils ( #22622 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-11 03:49:32 -07:00
JartX
1e55dfa7e5
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm ( #22017 )
2025-08-11 00:13:30 -07:00
Maximilien de Bayser
39052dbca8
Support token_type_ids in V1 with less code changes ( #21985 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-10 22:54:59 -07:00
vllmellm
9c97a1c349
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. ( #22521 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-08-10 22:52:34 -07:00
Zhewen Li
afa5b7ca0b
[Misc][gpt-oss] guard import when triton kernel when not up to date ( #22584 )
...
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-08-10 21:29:35 -07:00
Benji Beck
06da44f0cb
Migrate LlavaImageInputs to TensorSchema ( #21770 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-10 19:29:19 -07:00
Benji Beck
a554991748
Migrate LlavaNextVideoPixelInputs to TensorSchema ( #21843 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com>
2025-08-10 19:29:16 -07:00
ZiTian Zhao
8c50d62f5a
Remove redundant row_indices unsqueeze operation in MiniCPMO ( #22528 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-08-10 09:20:00 -07:00
Benji Beck
b4e2916721
Migrate LlavaNextImageInputs to TensorSchema ( #21774 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-10 09:05:21 -07:00
Isotr0py
b76753f0b5
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel ( #22593 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-10 09:00:36 -07:00
Harry Mellor
00976db0c3
[Docs] Fix warnings in docs build ( #22588 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-10 05:49:51 -07:00
Isotr0py
7e8d685775
[Minor] Fix pre-commit error on main ( #22579 )
...
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-10 00:08:23 -07:00
Harry Mellor
c49848396d
Refactor sliding window configuration to Transformers best practice ( #21927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-09 20:50:48 -07:00
ZiTian Zhao
534c45b962
Improve fast_topk function with type hints and documentation ( #22530 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-08-09 20:25:42 -07:00
Le Chen
3d7363e61c
[Config] add "qwen" as a native eagle3 target supported model ( #22333 )
...
Signed-off-by: lechen <lecself@163.com>
Signed-off-by: LeChen <lecself@163.com>
2025-08-09 20:21:05 -07:00
Jee Jee Li
0c5254b82a
[oss] Init gpt-oss bf16 support ( #22508 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-09 20:19:13 -07:00
TJian
42172ad18f
[FEAT] [Performance] Add triton mrope to replace the torch code path ( #22375 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-09 11:50:03 -07:00
Nicolò Lucchesi
5a16fa614c
[Model] Gemma3n MM ( #20495 )
...
Signed-off-by: ShriKode <shrikode@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: ShriKode <shrikode@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.me>
2025-08-09 09:56:25 -07:00
Yuxuan Zhang
a6022e6fbc
GLM-4.5V with new class name at transformers ( #22520 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-09 00:50:21 -07:00
Kyuyeun Kim
9a0c5ded5a
[TPU] Add support for online w8a8 quantization ( #22425 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2025-08-08 23:12:54 -07:00
Eldar Kurtić
10a02535d4
Fix loading of quantized BigCode models ( #22463 )
...
Signed-off-by: Eldar Kurtic <eldar@neuralmagic.com>
2025-08-08 23:12:12 -07:00
Cyrus Leung
65552b476b
[Misc] Use config definitions from Transformers library ( #21913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-08 23:10:51 -07:00
Thomas Parnell
6ade99eafa
[V1] [Hybrid] Support Minimax-Text-01 in V1 ( #22151 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-08-08 23:08:48 -07:00
Roger Wang
08b751ba74
Implicit language-model-only mode via limit-mm-per-prompt ( #22299 )
...
Signed-off-by: Roger Wang <hey@rogerw.me>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: XIn Li <xinli@nvidia.com>
Signed-off-by: Junhao Li <junhao@ubicloud.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com>
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Signed-off-by: Linkun <github@lkchen.net>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: XIn Li <xinli@nvidia.com>
Co-authored-by: Junhao Li <streaver91@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Yuxuan Zhang <2448370773@qq.com>
Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com>
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Hong Hanh <hanh.usth@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: lkchen <github@lkchen.net>
2025-08-08 22:21:40 -07:00
Isotr0py
429e4e2d42
[Bugfix] Fix ModernBert cuda graph capturing in v1 ( #21901 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-08-08 22:17:22 -07:00
Varun Sundar Rabindranath
f703b923f3
[Misc] DeepGEMM : Avoid JIT generation in the hot-path ( #22215 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-08-08 16:09:59 -07:00
Yongye Zhu
f756a682d9
[gpt-oss] guard import when triton kernel is not installed ( #22529 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-08 11:18:33 -07:00
Yongye Zhu
e789cad6b8
[gpt-oss] triton kernel mxfp4 ( #22421 )
...
Signed-off-by: <zyy1102000@gmail.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-08 08:24:07 -07:00
Po-Han Huang (NVIDIA)
af473f0a85
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 ( #22426 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-08-07 20:25:01 -07:00
Cyrus Leung
157f9c1368
Fix pre-commit ( #22487 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 20:21:54 -07:00
ZiTian Zhao
6f287915d8
Optimize MiniCPMO mask creation with vectorized implementation ( #22464 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com>
2025-08-07 20:18:50 -07:00
Yuxuan Zhang
c152e2a8a0
not tie_word_embeddings for glm-4.5 and glm-4.5v ( #22460 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2025-08-07 19:37:23 -07:00
Chauncey
17eaaef595
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match ( #22065 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-08-07 19:20:21 -07:00
Shu Wang
b2c8ce57c6
Fix Flashinfer CUTLASS MOE Allgather ( #21963 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com>
2025-08-07 19:18:25 -07:00
Shu Wang
a3b9c17b56
Support Tensorrt-LLM MoE fp4 for low-latency ( #21331 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: XIn Li <xinli@nvidia.com>
Co-authored-by: XIn Li <xinli@nvidia.com>
2025-08-07 19:18:22 -07:00
Zhiyu
d57dc2364e
Add ModelOpt Qwen3 nvfp4 support ( #20101 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-08-07 19:18:19 -07:00
Cyrus Leung
8c9da6be22
[Core] Simplify mm processing cache ( #22457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-07 09:47:07 -07:00
fxmarty-amd
7e0b121812
[Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM ( #22352 )
...
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
2025-08-07 06:30:48 -07:00
Woosuk Kwon
136825de75
[Misc] Enhance code formatting in mxfp4.py ( #22423 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-07 00:26:24 -07:00
JaceyShao
c2dba2dba8
Add H20-3e fused MoE kernel tuning configs for GLM-4.5 ( #22433 )
...
Signed-off-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com>
Co-authored-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com>
2025-08-07 00:24:47 -07:00
Ming Yang
82216dc21f
[Misc] Support routing logic simulation ( #21990 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
vllmellm
cbc8457b26
[Model] Switch to Fused RMS norm in Qwen2.5_VL model. ( #22184 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: kf <kuanfu.liu@embeddedllm.com>
2025-08-06 23:05:24 -07:00
WeiQing Chen
4be02a3776
[Bugfix] EPLB load statistics problem ( #22167 )
...
Signed-off-by: ycyaw66 <497410282@qq.com>
Signed-off-by: David Chen <530634352@qq.com>
Co-authored-by: ycyaw66 <497410282@qq.com>
2025-08-07 04:07:54 +00:00
Syed Muhammad Bin Asif
609b533cb6
[Bugfix] Add proper comparison for package versions ( #22314 )
...
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk>
2025-08-06 20:31:03 -07:00