Fanli Lin
f37e8938d2
[XPU] Fix AWQ skipped layer detection in IPEX quantization ( #29774 )
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
2025-12-01 12:00:52 +00:00
Shu Wang
f72a817bdf
[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch ( #27141 )
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-30 16:05:32 -08:00
Omer Ullman Argov
39d28108f4
[Feat] Support non-gated activations in NVFP4 modelopt path ( #29004 )
2025-11-30 11:02:40 -05:00
Isotr0py
47539cfd3e
[Bugfix] Fix mismatched nvfp4 gemm output shape ( #29742 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-30 09:15:01 +00:00
朝
9381b5cde0
[Doc]: Fix typo in fused_moe layer ( #29731 )
Signed-off-by: BowTen <bowten@qq.com>
2025-11-29 22:29:13 -08:00
Isotr0py
e1464c3a08
[Quantization] Enable compressed-tensors AWQ for Turing GPU ( #29732 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-30 06:04:28 +00:00
Xin Yang
a491b0911b
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #29708 )
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-30 10:37:25 +08:00
Jinzhen Lin
1656ad3704
[Kernel][Quantization] add w4a8 support for marlin kernel ( #24722 )
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
Didier Durand
04a797cd0e
[Doc]: fixing typos in various files. ( #29717 )
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-29 01:15:39 -08:00
Huamin Li
3fd1fb0b60
Revert "[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #28971 )" ( #29697 )
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-28 15:26:52 -08:00
Xin Yang
745a3bae1a
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #28971 )
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-28 10:48:28 +08:00
Matthew Bonanni
fc1d8be3dc
[Attention] Update attention imports ( #29540 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-27 11:19:09 -05:00
Jinzhen Lin
a67dec7cba
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel ( #28619 )
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-26 19:02:21 -08:00
HDCharles
df01eda4dc
[Bugfix] Make compressed-tensors MoEs respect ignored layers ( #28878 )
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
2025-11-26 21:35:13 -05:00
Matthew Bonanni
430dd4d9eb
[Attention] Remove imports from vllm/attention/__init__.py ( #29342 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-26 10:53:15 -07:00
HDCharles
e603129505
[refactor] CTConfig methods to static/class methods ( #28870 )
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-26 17:21:58 +00:00
Wentao Ye
0b0aa874e8
[Perf] Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement ( #29345 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-26 09:38:52 -07:00
Huamin Li
70d5953f82
Revert "[Bugfix] Fix GPT-OSS AR+NORM fusion ( #28841 )" ( #29483 )
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-26 22:27:26 +08:00
Xin Yang
53d7f1f601
[Kernel] Use pre-allocated output buffer for triton kernel fused_experts ( #29219 )
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-26 10:21:00 +08:00
Michael Goin
7df0289782
Change warning logs to debug for unimplemented MXFP4 Linear/Attention ( #29441 )
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-25 22:52:31 +00:00
Michael Goin
e502098643
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 ( #29242 )
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-11-25 06:59:07 -08:00
elvischenv
6330f9477d
[Bugfix] Fix GPT-OSS AR+NORM fusion ( #28841 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-11-25 07:59:40 +00:00
Fadi Arafeh
98caeadd54
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up wei ( #29273 )
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-25 15:11:11 +08:00
Isotr0py
92effb07a4
[Model] Add HunyuanOCR support ( #29327 )
Signed-off-by: manayang <jackmanayang@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: sergeywang <sergeywang@tencent.com>
Co-authored-by: manayang <jackmanayang@gmail.com>
Co-authored-by: manayang <manayang@tencent.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-25 03:28:51 +00:00
Michael Goin
6f1355a1b7
[Perf] Disable DeepGEMM MoE by default when TP=8 is used ( #29346 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-24 19:01:40 -07:00
Wentao Ye
699bca76c0
[UX] Raise error for attn backend of batch invariant ( #29348 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-24 17:49:01 -07:00
Michael Goin
c17610e2ba
[Bugfix] Only use triton_kernels for MXFP4 on SM90 and SM100 ( #29339 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-24 18:22:46 -05:00
bnellnm
8f066146c3
[MoE][Refactor] Make select_experts a non-static method ( #29067 )
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-11-24 13:38:04 -05:00
jiahanc
5f96c00c55
[Fix] Add SM check to flashinfer MOE backend ( #29144 )
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-23 00:39:30 +00:00
Federico
f55c76c2b3
chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning ( #29240 )
2025-11-22 08:42:48 -08:00
Bram Wasti
5f7209a793
[tiny] Remove unsupported TRITON_MLA backend from batch invariance ( #28832 )
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-22 21:00:50 +08:00
jinghanhu
988ee66b0d
Handle triton kernel import exception ( #29062 )
2025-11-22 10:07:50 +00:00
FlintyLemming
052950e5b3
Add fused MoE config for H200 E160 N192 fp8 ( #29182 )
Signed-off-by: FlintyLemming <admin@flinty.moe>
2025-11-21 17:37:51 -08:00
Lukas Geiger
d045e22dfe
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s ( #29217 )
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-21 17:30:55 -08:00
Varun Sundar Rabindranath
3137991f55
[BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor ( #29162 )
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-21 14:28:17 -08:00
Lucas Wilkinson
1840c5cb18
[BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case ( #27426 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-21 11:41:52 -08:00
Mingyuan Ma
b4c8fbaae2
Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod ( #28892 )
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-21 09:54:11 -07:00
rasmith
e99e467384
[CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_parallel in lightning_attn.py ( #29132 )
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-21 11:53:09 -05:00
Wentao Ye
a42ab317ac
[Log] Optimize startup log ( #28948 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-21 08:46:20 -08:00
Aleksandr Malyshev
b7f1f490a6
Upstream triton fp4 weight preshuffle ( #28888 )
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-11-21 11:34:46 -05:00
Cyrus Leung
aab0102a26
[V0 deprecation] Remove more V0 references ( #29088 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 11:56:59 +00:00
Hongxia Yang
3f5f36da3f
[ROCm] Fix for import when building with upstream triton for gfx1100 for gpt-oss serving ( #29127 )
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
2025-11-21 03:30:07 +00:00
Wentao Ye
e1eefa4c40
[Bug] Fix torch warning of tf32 usage ( #29112 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 01:54:59 +00:00
Wentao Ye
df44df0143
[Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement ( #28879 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-20 18:41:49 -07:00
Anna Shors
6eb745d9bd
Add truncate arg to yarn to match openai implementation of gpt-oss ( #28244 )
Signed-off-by: ashors1 <ashors@nvidia.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-11-20 18:53:50 +08:00
Wentao Ye
2c52c7fd9a
[Bug] Fix torch dynamo warning Dynamo detected a call to a functools.lru_cache ( #29038 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-20 16:52:23 +08:00
Shengliang Xu
a8c536829c
Consolidate Nvidia ModelOpt quant config handling for all quantization methods ( #28076 )
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
2025-11-19 22:39:36 -05:00
Wentao Ye
5031cd5d55
[Refactor] Optimize select_experts ( #28069 )
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-19 18:53:15 -05:00
JartX
8e38e99829
[Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod ( #28849 )
2025-11-19 18:30:08 -05:00
Max Hu
cb0a7b4bea
[Bugfix] Move flashinfer kernel check into `__init__` function of `FusedMoE` ( #29018 )
Signed-off-by: Max Hu <hyoung2991@gmail.com>
2025-11-19 21:54:15 +00:00