262 Commits

Author SHA1 Message Date
Dipika Sikka
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Luka Govedič
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
Woosuk Kwon
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn (#10989)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-09 12:38:46 -08:00
zhou fan
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
Gregory Shtrasberg
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning (#10060)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-07 09:03:44 +00:00
Tyler Michael Smith
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-26 22:55:32 -08:00
Sanket Kale
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-11-25 18:32:39 -08:00
kliuae
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm (#10254) 2024-11-22 21:14:49 -08:00
Lucas Wilkinson
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
ElizaWszola
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
Manjul Mohan
1ea291a417
Fix: Build error seen on Power Architecture (#10421)
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: ismael-dm <ismaeldm99@gmail.com>
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Angus Wang <wangjadehao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: ismael-dm <ismaeldm99@gmail.com>
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Angus Wang <wangjadehao@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Ricky Xu <rickyx@anyscale.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 09:34:57 -08:00
Lucas Wilkinson
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
Maximilien de Bayser
4a18fd14ba
Support Roberta embedding models (#9387)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-14 21:23:29 +00:00
Pavani Majety
b6dde33019
[Core] Flashinfer - Remove advance step size restriction (#10282) 2024-11-13 16:29:32 +08:00
Aleksandr Malyshev
812c981fa0
Splitting attention kernel file (#10091)
Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2024-11-11 22:55:07 -08:00
Luka Govedič
4f93dfe952
[torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Li, Jiang
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#10108)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 18:42:50 +08:00
Hanzhi Zhou
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>
2024-11-06 23:50:47 -08:00
Li, Jiang
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 (#9911)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 04:43:08 +00:00
Aaron Pham
21063c11c7
[CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
Mor Zusman
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
youkaichao
8549c82660
[core] cudagraph output with tensor weak reference (#9724)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 00:19:28 -07:00
Charlie Fu
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Jee Jee Li
295a061fb3
[Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
Lucas Wilkinson
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487) 2024-10-22 15:41:13 -07:00
bnellnm
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Li, Jiang
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344) 2024-10-17 12:21:04 -04:00
rasmith
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391) 2024-10-17 01:34:06 +00:00
Tyler Michael Smith
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425) 2024-10-16 23:46:06 +00:00
Mor Zusman
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189) 2024-10-16 12:12:43 -04:00
Lucas Wilkinson
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError (#9218) 2024-10-10 17:39:56 +00:00
Lucas Wilkinson
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) (#9212) 2024-10-10 14:16:17 +08:00
ElizaWszola
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
2024-10-04 12:34:44 -06:00
Lucas Wilkinson
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845) 2024-10-03 22:55:25 -04:00
Varun Sundar Rabindranath
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-02 19:44:39 +00:00
Kevin H. Luu
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering (#8976)
Signed-off-by: kevin <kevin@anyscale.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-01 03:33:12 +00:00
Mor Zusman
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) 2024-09-29 17:35:58 -04:00
ElizaWszola
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-28 18:19:40 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
Tyler Michael Smith
71d21c73ab
[Bugfix] Fixup advance_step.cu warning (#8815) 2024-09-26 16:23:45 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests (#8479) 2024-09-25 08:35:52 -06:00
sasha0552
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776) 2024-09-24 21:26:33 -07:00
ElizaWszola
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files (#8661)
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-09-24 09:31:42 -07:00
Hanzhi Zhou
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558) 2024-09-24 01:08:14 -07:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-09-23 13:46:26 -04:00
Tyler Michael Smith
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643) 2024-09-21 23:45:02 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
Tyler Michael Smith
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops (#8526) 2024-09-16 22:57:57 -07:00