1022 Commits

Author SHA1 Message Date
fxmarty-amd
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models (#17888)
Signed-off-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
2025-07-09 13:19:02 -07:00
Jacob Manning
bf03ff3575
[Kernel] Add Conch backend for mixed-precision linear layer (#19818)
Signed-off-by: Jacob Manning <jmanning+oss@stackav.com>
2025-07-09 13:17:55 -07:00
Tuan, Hoang-Trong
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models (#18218)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Michael Goin
31b96d1c64
Support Llama 4 for cutlass_moe_fp4 (#20453)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-09 15:53:38 -04:00
Duncan Moss
97abeb1daa
[feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes (#20640)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
2025-07-09 11:03:35 +08:00
Ming Yang
c438183e99
[Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE (#20166)
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 23:10:57 +00:00
Ming Yang
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe (#20167)
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 01:07:22 +00:00
Cyrus Leung
c18b3b8e8b
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler (#20527)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 14:01:48 -07:00
Brayden Zhong
cede942b87
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py (#20516)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-07-06 09:20:11 +00:00
Chengji Yao
4548c03c50
[TPU][Bugfix] fix the MoE OOM issue (#20339)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-05 21:19:09 -07:00
Lucia Fang
432870829d
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe (#20509)
Signed-off-by: Lu Fang <fanglu@fb.com>
2025-07-06 12:08:30 +08:00
Lucia Fang
8aeaa910a2
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod (#20507)
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
2025-07-05 14:03:20 +08:00
Duncan Moss
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 (#19757)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Duncan Moss <dmoss@nvidia.com>
2025-07-04 12:58:04 -06:00
Thomas Parnell
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models (#20016)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-07-04 17:46:53 +00:00
Michael Goin
0e3fe896e2
Support Llama 4 for fused_marlin_moe (#20457)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-04 07:55:10 +00:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
bnellnm
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. (#18864)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 14:55:40 -07:00
wang.yuqi
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model (#19978)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-03 13:59:23 +00:00
Jee Jee Li
1819fbda63
[Quantization] Bump to use latest bitsandbytes (#20424)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-03 21:58:46 +08:00
Li, Jiang
0ec3779df7
[Bugfix][CI/CD][CPU] Fix CPU CI tests (#20383)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-02 20:11:36 -07:00
bnellnm
2e25bb12a8
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py (#20381)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 02:07:43 +00:00
bnellnm
c1909e7e8c
[Kernels] MoE refactor (#19636)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2025-07-02 06:08:27 -07:00
Cyrus Leung
1a03dd496b
[Bugfix] Fix dynamic rotary embedding (#20343)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-02 06:31:26 +00:00
Wentao Ye
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON (#20334)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:43 -07:00
Wentao Ye
7058d7dd5d
[Refactor] Remove duplicate find_free_port (#20333)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:07 -07:00
czhu-cohere
3abfe22154
Enable group size 64 for Machete (#20290)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 18:05:44 -07:00
TJian
02cabff207
[V1] [ROCm] Enable EP with AITER Fused MoE (#20270)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-07-01 16:48:30 +00:00
aiyiwang2025
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support (#20114)
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: quinnrong <quinnrong@tencent.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-01 07:28:13 -07:00
Yuxuan Zhang
ed70f3c64f
Add GLM4.1V model (Draft) (#19331)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-01 12:48:26 +00:00
Kyle Sayers
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper (#20046) 2025-07-01 19:20:34 +09:00
Li, Jiang
6cc1e7d96d
[CPU] Update custom ops for the CPU backend (#20255)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-01 07:25:03 +00:00
czhu-cohere
9909726d2a
Enable ZP Support for Machete (#20268)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 07:12:20 +00:00
Alex Kogan
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference (#18768)
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-01 05:44:38 +00:00
Wentao Ye
97d9524fe9
[Refactor] Remove useless pdb comment (#20266)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-30 18:15:24 +00:00
li haoyang
1c50e100a9
[Bugfix] fix quark ptpc (#20251)
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Co-authored-by: Haoyang Li <307790822@qq.com>
2025-06-30 22:24:50 +09:00
Dipika Sikka
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support (#19990)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-29 22:05:40 +00:00
Wentao Ye
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx (#20187)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-28 22:06:38 +00:00
bnellnm
c6c983053d
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. (#20152)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-27 09:42:22 -06:00
Robert Shaw
d1c956dc0f
Gemma3n (Text-only) (#20134)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: Roger Wang <hey@rogerw.me>
2025-06-27 07:16:26 +00:00
Bowen Wang
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) (#18343)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-06-26 15:30:21 -07:00
Wentao Ye
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test (#20094)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-26 22:33:13 +08:00
Li, Jiang
0567c8249f
[CPU] Fix torch version in x86 CPU backend (#19258)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-06-26 03:34:47 -07:00
Dipika Sikka
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 (#19879)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-25 14:28:19 -04:00
bnellnm
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. (#19717)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-24 23:22:58 -07:00
Wentao Ye
879f69bed3
[Refactor] Remove duplicate ceil_div (#20023)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-25 05:19:09 +00:00
Wentao Ye
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm (#19820)" (#20049)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 19:45:22 -07:00
Boyuan Feng
c01d1c5aba
use .dev for version comparison with pytorch nightly release (#20031)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-06-24 21:52:16 +00:00
Wentao Ye
c6e3bba8e6
[Feature] Integrate new deepgemm (#19820)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 12:51:56 -07:00
Vadim Gimpelson
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs (#19939)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-06-23 23:01:26 -07:00
Jun-Howie
dd2ccf8dde
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend (#19395) 2025-06-24 07:23:28 +09:00