fxmarty-amd
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models ( #17888 )
...
Signed-off-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
2025-07-09 13:19:02 -07:00
Jacob Manning
bf03ff3575
[Kernel] Add Conch backend for mixed-precision linear layer ( #19818 )
...
Signed-off-by: Jacob Manning <jmanning+oss@stackav.com>
2025-07-09 13:17:55 -07:00
Tuan, Hoang-Trong
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models ( #18218 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Michael Goin
31b96d1c64
Support Llama 4 for cutlass_moe_fp4 ( #20453 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-09 15:53:38 -04:00
Duncan Moss
97abeb1daa
[feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes ( #20640 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
2025-07-09 11:03:35 +08:00
Ming Yang
c438183e99
[Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE ( #20166 )
...
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 23:10:57 +00:00
Ming Yang
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe ( #20167 )
...
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 01:07:22 +00:00
Cyrus Leung
c18b3b8e8b
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler ( #20527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 14:01:48 -07:00
Brayden Zhong
cede942b87
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py ( #20516 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-07-06 09:20:11 +00:00
Chengji Yao
4548c03c50
[TPU][Bugfix] fix the MoE OOM issue ( #20339 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-05 21:19:09 -07:00
Lucia Fang
432870829d
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe ( #20509 )
...
Signed-off-by: Lu Fang <fanglu@fb.com>
2025-07-06 12:08:30 +08:00
Lucia Fang
8aeaa910a2
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod ( #20507 )
...
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
2025-07-05 14:03:20 +08:00
Duncan Moss
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 ( #19757 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Duncan Moss <dmoss@nvidia.com>
2025-07-04 12:58:04 -06:00
Thomas Parnell
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models ( #20016 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-07-04 17:46:53 +00:00
Michael Goin
0e3fe896e2
Support Llama 4 for fused_marlin_moe ( #20457 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-04 07:55:10 +00:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText ( #20428 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
bnellnm
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. ( #18864 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 14:55:40 -07:00
wang.yuqi
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model ( #19978 )
...
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-03 13:59:23 +00:00
Jee Jee Li
1819fbda63
[Quantization] Bump to use latest bitsandbytes ( #20424 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-03 21:58:46 +08:00
Li, Jiang
0ec3779df7
[Bugfix][CI/CD][CPU] Fix CPU CI tests ( #20383 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-02 20:11:36 -07:00
bnellnm
2e25bb12a8
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py ( #20381 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 02:07:43 +00:00
bnellnm
c1909e7e8c
[Kernels] MoE refactor ( #19636 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2025-07-02 06:08:27 -07:00
Cyrus Leung
1a03dd496b
[Bugfix] Fix dynamic rotary embedding ( #20343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-02 06:31:26 +00:00
Wentao Ye
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON ( #20334 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:43 -07:00
Wentao Ye
7058d7dd5d
[Refactor] Remove duplicate find_free_port ( #20333 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:07 -07:00
czhu-cohere
3abfe22154
Enable group size 64 for Machete ( #20290 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 18:05:44 -07:00
TJian
02cabff207
[V1] [ROCm] Enable EP with AITER Fused MoE ( #20270 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-07-01 16:48:30 +00:00
aiyiwang2025
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support ( #20114 )
...
Signed-off-by: aiyiwang <aiyiwang@tencent.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: quinnrong <quinnrong@tencent.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-01 07:28:13 -07:00
Yuxuan Zhang
ed70f3c64f
Add GLM4.1V model (Draft) ( #19331 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-01 12:48:26 +00:00
Kyle Sayers
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper ( #20046 )
2025-07-01 19:20:34 +09:00
Li, Jiang
6cc1e7d96d
[CPU] Update custom ops for the CPU backend ( #20255 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-01 07:25:03 +00:00
czhu-cohere
9909726d2a
Enable ZP Support for Machete ( #20268 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 07:12:20 +00:00
Alex Kogan
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference ( #18768 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-01 05:44:38 +00:00
Wentao Ye
97d9524fe9
[Refactor] Remove useless pdb comment ( #20266 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-30 18:15:24 +00:00
li haoyang
1c50e100a9
[Bugfix] fix quark ptpc ( #20251 )
...
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Co-authored-by: Haoyang Li <307790822@qq.com>
2025-06-30 22:24:50 +09:00
Dipika Sikka
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support ( #19990 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-29 22:05:40 +00:00
Wentao Ye
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx ( #20187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-28 22:06:38 +00:00
bnellnm
c6c983053d
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. ( #20152 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-27 09:42:22 -06:00
Robert Shaw
d1c956dc0f
Gemma3n (Text-only) ( #20134 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: Roger Wang <hey@rogerw.me>
2025-06-27 07:16:26 +00:00
Bowen Wang
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) ( #18343 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-06-26 15:30:21 -07:00
Wentao Ye
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test ( #20094 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-26 22:33:13 +08:00
Li, Jiang
0567c8249f
[CPU] Fix torch version in x86 CPU backend ( #19258 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-06-26 03:34:47 -07:00
Dipika Sikka
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 ( #19879 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2025-06-25 14:28:19 -04:00
bnellnm
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. ( #19717 )
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-24 23:22:58 -07:00
Wentao Ye
879f69bed3
[Refactor] Remove duplicate ceil_div ( #20023 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-25 05:19:09 +00:00
Wentao Ye
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm ( #19820 )" ( #20049 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 19:45:22 -07:00
Boyuan Feng
c01d1c5aba
use .dev for version comparison with pytorch nightly release ( #20031 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-06-24 21:52:16 +00:00
Wentao Ye
c6e3bba8e6
[Feature] Integrate new deepgemm ( #19820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 12:51:56 -07:00
Vadim Gimpelson
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs ( #19939 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-06-23 23:01:26 -07:00
Jun-Howie
dd2ccf8dde
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend ( #19395 )
2025-06-24 07:23:28 +09:00