585 Commits

Author SHA1 Message Date
who who who
b3caeb82e7
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
2025-07-25 06:50:21 -07:00
Ming Yang
2ded067fd2
[Bugfix] Fix CUDA arch flags for MoE permute (#21426)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-24 03:23:59 -07:00
Yang Chen
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-07-23 01:41:43 -07:00
Yu Chin Fabian Lim
32ec9e2f2a
Mamba V2 Test not Asserting Failures. (#21379)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-07-23 01:40:27 -07:00
Wentao Ye
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant (#21083)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 07:27:15 -07:00
Mickaël Seznec
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm (#21116)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:07:44 -07:00
Ming Yang
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-21 21:49:01 -07:00
Woosuk Kwon
6dda13c86b
[Misc] Add sliding window to flashinfer test (#21282)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-21 08:37:49 -07:00
Woosuk Kwon
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-19 13:53:17 -07:00
shixianc
7d94577138
Add torch golden impl for moe_align_block_size kernel test (#20653)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-19 02:32:36 -07:00
shixianc
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-18 04:34:43 +00:00
ElizaWszola
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
2025-07-17 09:56:44 -04:00
Varun Sundar Rabindranath
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-17 08:10:37 +00:00
Peter Pan
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-15 21:12:40 -07:00
Wentao Ye
c1acd6d7d4
[Refactor] Change the way of import triton (#20774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-12 19:39:55 -07:00
Wentao Ye
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant (#20841)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-12 19:38:45 -07:00
Varun Sundar Rabindranath
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility (#20449)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-11 07:51:46 -07:00
Pavani Majety
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00
Wentao Ye
e2de455c34
[Feature] Integrate SM100 DeepGEMM support (#20087) 2025-07-10 20:18:05 -07:00
Varun Sundar Rabindranath
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce (#20648)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-10 14:40:38 -07:00
Varun Sundar Rabindranath
fdadb6f43a
[Bugfix] Fused MoE Modular Kernel chunking loop (#20392)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-10 20:31:10 +00:00
fxmarty-amd
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models (#17888)
Signed-off-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
2025-07-09 13:19:02 -07:00
Tuan, Hoang-Trong
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models (#18218)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Ming Yang
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe (#20167)
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 01:07:22 +00:00
Cyrus Leung
9fb52e523a
[V1] Support any head size for FlexAttention backend (#20467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
Woosuk Kwon
e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends (#20412)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-07-06 08:48:13 -07:00
Vadim Gimpelson
f73d02aadc
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel (#20491)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-07-05 19:38:02 -07:00
Jeremy Reizenstein
c5ebe040ac
test_attention compat with coming xformers change (#20487)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00
Isotr0py
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00
Michael Goin
c108781c85
[CI Bugfix] Fix pre-commit failures on main (#20502) 2025-07-04 14:17:30 -07:00
Duncan Moss
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 (#19757)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Duncan Moss <dmoss@nvidia.com>
2025-07-04 12:58:04 -06:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
bnellnm
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. (#18864)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-07-03 14:55:40 -07:00
bnellnm
c1909e7e8c
[Kernels] MoE refactor (#19636)
Signed-off-by: Bill Nell <bnell@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2025-07-02 06:08:27 -07:00
Wentao Ye
7058d7dd5d
[Refactor] Remove duplicate find_free_port (#20333)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-01 19:03:07 -07:00
czhu-cohere
3abfe22154
Enable group size 64 for Machete (#20290)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 18:05:44 -07:00
TY-AMD
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067)
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-07-01 16:12:19 +08:00
Varun Sundar Rabindranath
08d81f1014
[Bugfix] Fix deepep tests (#20288)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-01 15:29:08 +08:00
czhu-cohere
9909726d2a
Enable ZP Support for Machete (#20268)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-07-01 07:12:20 +00:00
Wentao Ye
551ef1631a
[Unit Test] Add unit test for deep gemm (#20090)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-30 10:26:42 -06:00
Wentao Ye
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx (#20187)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-28 22:06:38 +00:00
Wentao Ye
562308816c
[Refactor] Rename commnication utils (#20091)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-26 22:19:32 +00:00
Wentao Ye
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test (#20094)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-26 22:33:13 +08:00
bnellnm
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. (#19717)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-06-24 23:22:58 -07:00
Wentao Ye
879f69bed3
[Refactor] Remove duplicate ceil_div (#20023)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-25 05:19:09 +00:00
Wentao Ye
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm (#19820)" (#20049)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 19:45:22 -07:00
Wentao Ye
c6e3bba8e6
[Feature] Integrate new deepgemm (#19820)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-06-24 12:51:56 -07:00
Tyler Michael Smith
68aaeb3749
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case (#19885)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-06-23 11:07:47 -07:00
Wentao Ye
ffb2cd6b54
[Perf] Optimize moe_align_block_size CUDA kernel (#19572)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-06-17 11:49:26 -07:00
Driss Guessous
ddfed314f9
Fixes IMA for TP w/ flex-attention (#19712)
Signed-off-by: drisspg <drisspguessous@gmail.com>
2025-06-17 04:01:50 +00:00