Pavani Majety
|
b6dde33019
|
[Core] Flashinfer - Remove advance step size restriction (#10282)
|
2024-11-13 16:29:32 +08:00 |
|
Aleksandr Malyshev
|
812c981fa0
|
Splitting attention kernel file (#10091)
Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
|
2024-11-11 22:55:07 -08:00 |
|
Luka Govedič
|
4f93dfe952
|
[torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
|
2024-11-08 21:20:08 +00:00 |
|
Li, Jiang
|
a6f332d0d9
|
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#10108)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2024-11-07 18:42:50 +08:00 |
|
Hanzhi Zhou
|
6192e9b8fe
|
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>
|
2024-11-06 23:50:47 -08:00 |
|
Li, Jiang
|
a4b3e0c1e9
|
[Hardware][CPU] Update torch 2.5 (#9911)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2024-11-07 04:43:08 +00:00 |
|
Aaron Pham
|
21063c11c7
|
[CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
|
2024-11-06 07:11:55 +00:00 |
|
Mor Zusman
|
9fb12f7848
|
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
|
2024-10-31 20:06:25 +00:00 |
|
youkaichao
|
8549c82660
|
[core] cudagraph output with tensor weak reference (#9724)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-10-27 00:19:28 -07:00 |
|
Charlie Fu
|
59449095ab
|
[Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
|
2024-10-24 15:37:52 -07:00 |
|
Jee Jee Li
|
295a061fb3
|
[Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2024-10-24 16:18:27 +08:00 |
|
Lucas Wilkinson
|
d1e8240875
|
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487)
|
2024-10-22 15:41:13 -07:00 |
|
bnellnm
|
eca2c5f7c0
|
[Bugfix] Fix support for dimension like integers and ScalarType (#9299)
|
2024-10-17 19:08:34 +00:00 |
|
Li, Jiang
|
5eda21e773
|
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support (#9344)
|
2024-10-17 12:21:04 -04:00 |
|
rasmith
|
92d86da217
|
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels (#9391)
|
2024-10-17 01:34:06 +00:00 |
|
Tyler Michael Smith
|
c3fab5f769
|
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel (#9425)
|
2024-10-16 23:46:06 +00:00 |
|
Mor Zusman
|
fb60ae9b91
|
[Kernel][Model] Improve continuous batching for Jamba and Mamba (#9189)
|
2024-10-16 12:12:43 -04:00 |
|
Lucas Wilkinson
|
18511aeda6
|
[Bugfix] Fix Machete unittests failing with NotImplementedError (#9218)
|
2024-10-10 17:39:56 +00:00 |
|
Lucas Wilkinson
|
a64e7b9407
|
[Bugfix] Machete garbage results for some models (large K dim) (#9212)
|
2024-10-10 14:16:17 +08:00 |
|
ElizaWszola
|
05d686432f
|
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
Co-authored-by: Dipika Sikka <ds3822@columbia.edu>
|
2024-10-04 12:34:44 -06:00 |
|
Lucas Wilkinson
|
aeb37c2a72
|
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845)
|
2024-10-03 22:55:25 -04:00 |
|
Varun Sundar Rabindranath
|
afb050b29d
|
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-10-02 19:44:39 +00:00 |
|
Kevin H. Luu
|
aaccca2b4d
|
[CI/Build] Fix machete generated kernel files ordering (#8976)
Signed-off-by: kevin <kevin@anyscale.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-10-01 03:33:12 +00:00 |
|
Mor Zusman
|
f13a07b1f8
|
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533)
|
2024-09-29 17:35:58 -04:00 |
|
ElizaWszola
|
d081da0064
|
[Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-09-28 18:19:40 -07:00 |
|
Varun Sundar Rabindranath
|
c2ec430ab5
|
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-09-27 13:32:07 -07:00 |
|
Tyler Michael Smith
|
71d21c73ab
|
[Bugfix] Fixup advance_step.cu warning (#8815)
|
2024-09-26 16:23:45 -07:00 |
|
bnellnm
|
300da09177
|
[Kernel] Fullgraph and opcheck tests (#8479)
|
2024-09-25 08:35:52 -06:00 |
|
sasha0552
|
b4522474a3
|
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776)
|
2024-09-24 21:26:33 -07:00 |
|
ElizaWszola
|
a928ded995
|
[Kernel] Split Marlin MoE kernels into multiple files (#8661)
Co-authored-by: mgoin <michael@neuralmagic.com>
|
2024-09-24 09:31:42 -07:00 |
|
Hanzhi Zhou
|
cc4325b66a
|
[Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558)
|
2024-09-24 01:08:14 -07:00 |
|
Lucas Wilkinson
|
86e9c8df29
|
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-09-23 13:46:26 -04:00 |
|
Tyler Michael Smith
|
d66ac62854
|
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643)
|
2024-09-21 23:45:02 +00:00 |
|
Charlie Fu
|
9cc373f390
|
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)
|
2024-09-19 17:37:57 +00:00 |
|
Tyler Michael Smith
|
4c34ce8916
|
[Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
Co-authored-by: lwilkinson@neuralmagic.com
|
2024-09-19 01:42:49 +00:00 |
|
Tyler Michael Smith
|
8110e44529
|
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012)
|
2024-09-17 23:44:27 +00:00 |
|
youkaichao
|
99aa4eddaf
|
[torch.compile] register allreduce operations as custom ops (#8526)
|
2024-09-16 22:57:57 -07:00 |
|
Luka Govedič
|
5d73ae49d6
|
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)
|
2024-09-16 11:52:40 -07:00 |
|
sasha0552
|
781e3b9a42
|
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506)
|
2024-09-16 12:15:57 -06:00 |
|
ElizaWszola
|
a091e2da3e
|
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
|
2024-09-16 09:47:19 -06:00 |
|
Isotr0py
|
fc990f9795
|
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357)
|
2024-09-15 16:51:44 -06:00 |
|
Charlie Fu
|
1ef0d2efd0
|
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310)
|
2024-09-13 17:01:11 -07:00 |
|
William Lin
|
a6c0f3658d
|
[multi-step] add flashinfer backend (#7928)
|
2024-09-12 11:16:22 -07:00 |
|
bnellnm
|
73202dbe77
|
[Kernel][Misc] register ops to prevent graph breaks (#6917)
Co-authored-by: Sage Moore <sage@neuralmagic.com>
|
2024-09-11 12:52:19 -07:00 |
|
Li, Jiang
|
0b952af458
|
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257)
|
2024-09-11 09:46:46 -07:00 |
|
Dipika Sikka
|
6cd5e5b07e
|
[Misc] Fused MoE Marlin support for GPTQ (#8217)
|
2024-09-09 23:02:52 -04:00 |
|
Dipika Sikka
|
23f322297f
|
[Misc] Remove SqueezeLLM (#8220)
|
2024-09-06 16:29:03 -06:00 |
|
Mor Zusman
|
fdd9daafa3
|
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)
|
2024-08-28 15:06:52 -07:00 |
|
bnellnm
|
c166e7e43e
|
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886)
|
2024-08-27 23:13:45 -04:00 |
|
Dipika Sikka
|
fc911880cc
|
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
|
2024-08-27 15:07:09 -07:00 |
|