180 Commits

Author SHA1 Message Date
Pavani Majety
0c0fdae84f
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362) 2025-05-09 16:24:41 -07:00
Gregory Shtrasberg
32aa74c09c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention (#17139)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-05-07 07:12:35 -07:00
Yong Hoon Shin
98c89e16ff
Make key optional for rotary embedding (#17566)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-05-07 00:11:46 -07:00
Wanrui Dai
f80ae5bdcf
[Kernel] Use fused rmsnorm for some models like qwen3 series (#17735)
Signed-off-by: evian <eviantai@u.nus.edu>
Co-authored-by: evian <eviantai@u.nus.edu>
2025-05-06 23:10:02 -07:00
Szymon Ożóg
1a45a61387
[Kernel] GGUF MoeVec kernel (#16780)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-05-06 23:07:23 -07:00
Jinzhen Lin
1d0c9d6b2d
[Kernel] some optimizations for dense marlin and moe marlin (#16850)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-05-05 09:39:30 -07:00
rasmith
e3d0a1d190
[Quantizaton] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm (#17558)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-05-02 21:41:10 -07:00
Zhengyuan Su (苏政渊)
17eb306fcc
[Bugfix] Add contiguous call inside rope kernel wrapper (#17091)
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn>
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn>
2025-04-28 19:24:07 -07:00
Kaixi Hou
ed7a29d9f8
[NVIDIA] Support Cutlass MLA for Blackwell GPUs (#16032)
Signed-off-by: kaixih <kaixih@nvidia.com>
2025-04-27 06:29:21 -07:00
Charlie Fu
188b7f9b8c
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830)
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
2025-04-21 20:46:22 -07:00
Jinzhen Lin
d06ba4ed3f
[Kernel] moe wna16 marlin kernel (#14447)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-04-14 20:05:22 -07:00
DefTruth
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup (#16173)
Signed-off-by: DefTruth <qiustudent_r@163.com>
2025-04-11 06:50:50 -06:00
Aleksandr Malyshev
e73ff24e31
[ROCM][KERNEL] Paged attention for V1 (#15720)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com>
2025-04-02 19:48:00 -07:00
LukasBluebaum
90969fb39a
[Kernel] Add more dtype support for GGUF dequantization (#15879)
Signed-off-by: lukas.bluebaum <lukas.bluebaum@aleph-alpha.com>
2025-04-02 01:58:48 -07:00
bnellnm
e59ca942f5
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-04-01 12:07:43 -04:00
Ilya Markov
b7b7676d67
[Distributed] Add custom allreduce support for ROCM (#14125)
Signed-off-by: ilmarkov <imarkov@redhat.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
2025-03-31 22:49:12 -07:00
ElizaWszola
9239bf718e
[Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
2025-03-27 00:54:44 +00:00
Thien Tran
4f044b1d67
[Kernel][CPU] CPU MLA (#14744)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-03-25 09:34:59 +00:00
Sage Moore
45f3f3f59e
[ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. (#14629)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-12 08:00:28 -04:00
Pavani Majety
debd6bbf09
[Kernel] Add ModelOpt FP4 Checkpoint Support (#12520)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-03-12 05:13:11 +00:00
Szymon Ożóg
e22ee1e7a2
[Kernel] GGUF MoE kernel (#14613)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
2025-03-12 03:33:27 +00:00
Jeff Daily
a1c8f3796c
dynamic distpatch of fp8 kernels (#14245)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-11 10:54:56 -04:00
Jinzhen Lin
90e88ab756
[Kernel] moe wna16 cuda kernel (#13321)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-10 20:12:40 -04:00
Szymon Ożóg
89cdaa83e7
[Kernel] Add more dtype support for GGUF kernels (#14043)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
2025-03-10 07:30:04 -07:00
Jinzhen Lin
d0feea31c7
[Kernel] optimize performance of gptq marlin kernel when n is small (#14138)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-03-07 11:53:38 -05:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing (#13971) 2025-03-02 17:34:51 -08:00
YajieWang
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) 2025-02-28 22:30:59 -08:00
Lucas Wilkinson
f95903909f
[Kernel] FlashMLA integration (#13747)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-02-27 10:35:08 +08:00
Kaixi Hou
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm (#13571) 2025-02-22 05:24:05 -08:00
Lucas Wilkinson
288cc6c234
[Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
Sage Moore
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm (#13235) 2025-02-14 20:30:42 -08:00
Tyler Michael Smith
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00
Kaixi Hou
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization (#12784) 2025-02-12 19:51:51 -08:00
Lucas Wilkinson
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-02-04 18:22:24 -08:00
Yang Chen
95460fc513
[Kernel] port sgl moe_align_block_size kernels (#12574)
sgl_moe_align_block_size is based on:


ded9fcd09a

moe_align_block_size is based on:


ba5112ff69

Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-03 13:09:50 +08:00
Russell Bryant
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
    also be easily used by tools to help manage license compliance.
    
The Linux Foundation runs license scans against the codebase to help
ensure
    we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
Tyler Michael Smith
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587)
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-31 15:29:11 -08:00
Lucas Wilkinson
cabaf4eff3
[Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 23:49:37 -08:00
Lucas Wilkinson
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) 2025-01-30 18:33:00 -08:00
Harry Mellor
823ab79633
Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-27 17:23:08 -07:00
Gregory Shtrasberg
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
cennn
d907be7dc7
[misc] remove python function call for custom activation op (#11885)
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-10 17:18:25 +08:00
youkaichao
869579a702
[optimization] remove python function call for custom op (#11750)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-07 17:04:28 +00:00
Liangfu Chen
628ec6c17b
[Docker] bump up neuron sdk v2.21 (#11593)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2024-12-30 13:46:14 +08:00
youkaichao
f30581c518
[misc][perf] remove old code (#11425)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-23 08:01:08 +00:00
Tyler Michael Smith
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Dipika Sikka
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Luka Govedič
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
kliuae
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm (#10254) 2024-11-22 21:14:49 -08:00
youkaichao
388ee3de66
[torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00