| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Jee Jee Li | 04ff4be310 | [Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 (#21700)<br>Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> | 2025-07-27 20:12:18 -07:00 |
| Caleb_Du | 57c22e57f9 | Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934)<br>Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn> | 2025-07-27 07:08:00 -07:00 |
| Wentao Ye | bda9d0535f | [Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor (#21631)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-27 05:25:21 -07:00 |
| Kaixi Hou | de509ae8eb | [NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels (#21411)<br>Signed-off-by: kaixih <kaixih@nvidia.com> | 2025-07-26 07:10:36 -07:00 |
| Wentao Ye | 56e544f24b | [Refactor] Remove moe_align_block_size_triton (#21335)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-26 07:08:29 -07:00 |
| Alex Kogan | 7ae75fa6d0 | [Feature] Add support for MoE models in the calibration-free RTN-based quantization (#20766)<br>Signed-off-by: Alex Kogan <alex.kogan@oracle.com> | 2025-07-25 18:09:34 -07:00 |
| Wentao Ye | 75d29cf4e1 | [Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-25 17:07:07 -07:00 |
| Chih-Chieh Yang | eab2f3980c | [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel (#20839)<br>Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com><br>Signed-off-by: Yu Chin Fabian Lim <fabian.lim@gmail.com><br>Co-authored-by: Yu Chin Fabian Lim <fabian.lim@gmail.com> | 2025-07-25 06:49:36 -07:00 |
| Cyrus Leung | 46d81d6951 | [V1] Get supported tasks from model runner instead of model config (#21585)<br>Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-25 05:36:45 -07:00 |
| Xu Wenqing | 8ed01e32f7 | Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct (#21598)<br>Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com> | 2025-07-25 02:36:55 -07:00 |
| Varun Sundar Rabindranath | 2212cd6cfb | [Bugfix] DeepGemm utils : Fix hardcoded type-cast (#21517)<br>Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com><br>Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> | 2025-07-24 20:17:29 -07:00 |
| Burkhard Ringlein | ce3a9b1378 | [Kernel] adding fused_moe configs for upcoming granite4 (#21332)<br>Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com><br>Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> | 2025-07-24 20:16:59 -07:00 |
| Wentao Ye | 633f6e804b | [Bug] Fix DeepGemm Init Error (#21554)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-24 20:07:22 -07:00 |
| Woosuk Kwon | fe56180c7f | [MoE] More balanced expert sharding (#21497)<br>Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai> | 2025-07-24 15:56:08 -07:00 |
| Shu Wang | 1b25f1fe75 | Update flashinfer CUTLASS MoE Kernel (#21408)<br>Signed-off-by: Shu Wang <shuw@nvidia.com> | 2025-07-24 08:13:31 -07:00 |
| Nick Hill | f0f4de8f26 | [Misc] Fix duplicate FusedMoEConfig debug messages (#21455)<br>Signed-off-by: Nick Hill <nhill@redhat.com> | 2025-07-24 01:27:30 -07:00 |
| Chengji Yao | e74bfc70e4 | [TPU][Bugfix] fix moe layer (#21340)<br>Signed-off-by: Chengji Yao <chengjiyao@google.com><br>Co-authored-by: Simon Mo <simon.mo@hey.com> | 2025-07-24 00:38:39 -07:00 |
| Michael Goin | f002e9a870 | [Cleanup] Only log MoE DP setup warning if DP is enabled (#21315)<br>Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-23 00:02:48 -07:00 |
| Wentao Ye | 774d0c014b | [Perf] Cuda Kernel for Per Token Group Quant (#21083)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-22 07:27:15 -07:00 |
| Duncan Moss | 2c8db17cfd | [feat]: add SM100 support for cutlass FP8 groupGEMM (#20447)<br>Signed-off-by: Duncan Moss <djm.moss@gmail.com><br>Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com><br>Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com><br>Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-22 07:27:12 -07:00 |
| Mickaël Seznec | 4fb56914c5 | [perf] Add fused MLA QKV + strided layernorm (#21116)<br>Signed-off-by: Mickael Seznec <mickael@mistral.ai><br>Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-22 07:07:44 -07:00 |
| Shu Wang | 9e23ad9655 | Update fp4 quantize API (#21327)<br>Signed-off-by: Shu Wang <shuw@nvidia.com> | 2025-07-21 23:40:21 -07:00 |
| Ming Yang | e7b2042681 | Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)" (#21334)<br>Signed-off-by: Ming Yang <minos.future@gmail.com> | 2025-07-21 21:49:01 -07:00 |
| Himanshu Jaju | 0ec82edda5 | [perf] Speed up align sum kernels (#21079)<br>Signed-off-by: Himanshu Jaju <hj@mistral.ai> | 2025-07-21 11:19:23 -07:00 |
| Zhiyu | 6b46c4b653 | Add Nvidia ModelOpt config adaptation (#19815)<br>Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> | 2025-07-21 10:02:58 -04:00 |
| Cyrus Leung | 042af0c8d3 | [Model][1/N] Support multiple poolers at model level (#21227)<br>Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-21 02:22:21 -07:00 |
| Thomas Parnell | 881e3cbe3b | [V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers (#21194)<br>Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2025-07-19 19:27:21 +00:00 |
| Kaixi Hou | 6d0734c562 | [NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency (#20645)<br>Signed-off-by: kaixih <kaixih@nvidia.com><br>Signed-off-by: mgoin <mgoin64@gmail.com><br>Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-19 02:33:01 -07:00 |
| Varun Sundar Rabindranath | dcc6cfb991 | [Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel (#21193)<br>Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com><br>Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> | 2025-07-18 23:09:51 -07:00 |
| Woosuk Kwon | dd572c0ab3 | [V0 Deprecation] Remove V0 Spec Decode workers (#21152)<br>Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2025-07-18 21:47:50 -07:00 |
| Rui Qiao | 217937221b | Elastic Expert Parallel Initial Support (#20775)<br>Signed-off-by: Rui Qiao <ruisearch42@gmail.com> | 2025-07-18 17:46:09 -07:00 |
| Richard Zou | b2eb2b5ad7 | [Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346)<br>Signed-off-by: rzou <zou3519@gmail.com> | 2025-07-18 14:10:21 -04:00 |
| Cyrus Leung | 45badd05d0 | [Core] Set pooling params based on task and model (#21128)<br>Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-18 05:41:17 -07:00 |
| ElizaWszola | 4adc66f64d | [Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121)<br>Signed-off-by: ElizaWszola <ewszola@redhat.com> | 2025-07-18 18:55:52 +08:00 |
| Shu Wang | c7d8724e78 | [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037)<br>Signed-off-by: shuw <shuw@nvidia.com><br>Signed-off-by: mgoin <mgoin64@gmail.com><br>Co-authored-by: mgoin <mgoin64@gmail.com> | 2025-07-17 21:32:45 -07:00 |
| Wentao Ye | 8a8fc94639 | [Log] Debugging Log with more Information (#20770)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-18 00:19:46 +00:00 |
| Woosuk Kwon | 4de7146351 | [V0 deprecation] Remove V0 HPU backend (#21131)<br>Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2025-07-17 16:37:36 -07:00 |
| Cyrus Leung | 90bd2ab6e3 | [Model] Update pooling model interface (#21058)<br>Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-17 16:05:40 +00:00 |
| ElizaWszola | 9fb2d22032 | [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)<br>Signed-off-by: ElizaWszola <ewszola@redhat.com> | 2025-07-17 09:56:44 -04:00 |
| Harry Mellor | fe8a2c544a | [Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117)<br>Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> | 2025-07-17 04:13:00 -07:00 |
| Asher | 5a7fb3ab9e | [Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)<br>Signed-off-by: Asher Zhang <asherszhang@tencent.com> | 2025-07-17 09:10:09 +00:00 |
| Varun Sundar Rabindranath | 11dfdf21bf | [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)<br>Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com><br>Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com> | 2025-07-17 08:10:37 +00:00 |
| Michael Goin | 28a6d5423d | [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066)<br>Signed-off-by: mgoin <mgoin64@gmail.com> | 2025-07-16 19:54:45 -07:00 |
| Kevin_Xiong | c9ba8104ed | [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024)<br>Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com> | 2025-07-16 19:36:36 -07:00 |
| Nir David | 01513a334a | Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)<br>Signed-off-by: Nir David <ndavid@habana.ai><br>Signed-off-by: Uri Livne <ulivne@habana.ai><br>Co-authored-by: Uri Livne <ulivne@habana.ai> | 2025-07-16 15:33:41 -04:00 |
| Cyrus Leung | 1c3198b6c4 | [Model] Consolidate pooler implementations (#20927)<br>Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> | 2025-07-16 13:39:13 +00:00 |
| Peter Pan | 1eb2b9c102 | [CI] update typos config for CI pre-commit and fix some spells (#20919)<br>Signed-off-by: Peter Pan <Peter.Pan@daocloud.io> | 2025-07-15 21:12:40 -07:00 |
| Wentao Ye | 76ddeff293 | [Doc] Remove duplicate docstring (#21012)<br>Signed-off-by: yewentao256 <zhyanwentao@126.com> | 2025-07-15 20:09:13 -07:00 |
| Ming Yang | fcb9f879c1 | [Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… (#20937)<br>Signed-off-by: Ming Yang <minos.future@gmail.com> | 2025-07-15 19:53:42 -07:00 |
| Tuan, Hoang-Trong | f29fd8a7f8 | [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838)<br>Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com><br>Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com> | 2025-07-15 16:08:26 -04:00 |