403 Commits

Author SHA1 Message Date
Wentao Ye
f7dcce7a4a
[Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale (#21968)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-11 09:39:08 -07:00
Isotr0py
b76753f0b5
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel (#22593)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-08-10 09:00:36 -07:00
Ning Xie
326976291b
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config (#22566)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-10 00:08:48 -07:00
TJian
42172ad18f
[FEAT] [Performance] Add triton mrope to replace the torch code path (#22375)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-09 11:50:03 -07:00
Jee Jee Li
0edc0cd52b
[Bugfix] Fix CI moe kernel failure (#22556)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-08-09 00:03:29 -07:00
Yongye Zhu
e789cad6b8
[gpt-oss] triton kernel mxfp4 (#22421)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-08-08 08:24:07 -07:00
Maximilien de Bayser
f825c6bd22
Support encoder_only attention for FlexAttention (#22273)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-08-06 18:37:14 -07:00
elvischenv
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-08-05 02:45:34 -07:00
Michael Goin
e79a12fc3a
[UX] Fail if an invalid attention backend is specified (#22217)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-08-04 23:54:52 -07:00
Chih-Chieh Yang
b690e34824
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead (#21075)
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com>
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
2025-08-02 01:59:34 -07:00
Wentao Ye
6e8d8c4afb
[Test] Add Unit Test for Batched DeepGEMM (#21559)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-02 10:45:46 +08:00
Michael Goin
88faa466d7
[CI] Initial tests for SM100 Blackwell runner (#21877)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-01 16:18:38 -07:00
Wentao Ye
3700642013
[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependencies of DeepGEMM (#21787)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-01 01:13:27 +00:00
Matthew Bonanni
e360316ab9
Add DeepGEMM to Dockerfile in vllm-base image (#21533)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-31 18:01:55 -07:00
Wentao Ye
0271c2ff2f
[Test] Add Benchmark and Unit Test for per_token_group_quant (#21860)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-30 07:15:02 -07:00
elvischenv
58b11b24a6
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-29 10:34:00 -04:00
lyrisz
c6c9122d50
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning (#20396)
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Duncan Moss <djm.moss@gmail.com>
2025-07-28 23:13:58 +00:00
Caleb_Du
57c22e57f9
Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934)
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn>
2025-07-27 07:08:00 -07:00
who who who
b3caeb82e7
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
2025-07-25 06:50:21 -07:00
Ming Yang
2ded067fd2
[Bugfix] Fix CUDA arch flags for MoE permute (#21426)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-24 03:23:59 -07:00
Yang Chen
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-07-23 01:41:43 -07:00
Yu Chin Fabian Lim
32ec9e2f2a
Mamba V2 Test not Asserting Failures. (#21379)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-07-23 01:40:27 -07:00
Wentao Ye
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant (#21083)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 07:27:15 -07:00
Mickaël Seznec
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm (#21116)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:07:44 -07:00
Ming Yang
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-21 21:49:01 -07:00
Woosuk Kwon
6dda13c86b
[Misc] Add sliding window to flashinfer test (#21282)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-21 08:37:49 -07:00
Woosuk Kwon
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-19 13:53:17 -07:00
shixianc
7d94577138
Add torch golden impl for moe_align_block_size kernel test (#20653)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-19 02:32:36 -07:00
shixianc
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-18 04:34:43 +00:00
ElizaWszola
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
2025-07-17 09:56:44 -04:00
Varun Sundar Rabindranath
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-17 08:10:37 +00:00
Peter Pan
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-15 21:12:40 -07:00
Wentao Ye
c1acd6d7d4
[Refactor] Change the way of import triton (#20774)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-12 19:39:55 -07:00
Wentao Ye
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant (#20841)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-12 19:38:45 -07:00
Varun Sundar Rabindranath
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility (#20449)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-11 07:51:46 -07:00
Pavani Majety
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00
Wentao Ye
e2de455c34
[Feature] Integrate SM100 DeepGEMM support (#20087)
2025-07-10 20:18:05 -07:00
Varun Sundar Rabindranath
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce (#20648)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-10 14:40:38 -07:00
Varun Sundar Rabindranath
fdadb6f43a
[Bugfix] Fused MoE Modular Kernel chunking loop (#20392)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-10 20:31:10 +00:00
fxmarty-amd
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models (#17888)
Signed-off-by: Felix Marty <felmarty@amd.com>
Signed-off-by: Bowen Bao <bowenbao@amd.com>
Signed-off-by: Felix Marty <Felix.Marty@amd.com>
Co-authored-by: Bowen Bao <bowenbao@amd.com>
2025-07-09 13:19:02 -07:00
Tuan, Hoang-Trong
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models (#18218)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Ming Yang
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe (#20167)
Signed-off-by: Ming Yang <yming@meta.com>
2025-07-08 01:07:22 +00:00
Cyrus Leung
9fb52e523a
[V1] Support any head size for FlexAttention backend (#20467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
Woosuk Kwon
e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends (#20412)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-07-06 08:48:13 -07:00
Vadim Gimpelson
f73d02aadc
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel (#20491)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
2025-07-05 19:38:02 -07:00
Jeremy Reizenstein
c5ebe040ac
test_attention compat with coming xformers change (#20487)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00
Isotr0py
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00
Michael Goin
c108781c85
[CI Bugfix] Fix pre-commit failures on main (#20502)
2025-07-04 14:17:30 -07:00
Duncan Moss
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 (#19757)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Duncan Moss <dmoss@nvidia.com>
2025-07-04 12:58:04 -06:00
Jee Jee Li
1caca5a589
[Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00