Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Joe
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
Lucas Wilkinson
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
Li, Jiang
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
Tyler Michael Smith
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
Antoni Baum
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Tyler Michael Smith
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Joe Runde
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
Tyler Michael Smith
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
Tyler Michael Smith
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
Chip Kerchner
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-06-26 15:16:00 +00:00
Varun Sundar Rabindranath
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-23 21:07:11 +00:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
Jinzhen Lin
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 13:33:21 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Varun Sundar Rabindranath
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-20 06:37:08 +00:00
Tyler Michael Smith
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Joe Runde
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-06-18 03:54:11 +00:00
Tyler Michael Smith
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Jie Fu (傅杰)
cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA ( #5452 )
2024-06-13 17:22:24 -06:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).
In details, we applied 3 optimizations:
- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Jie Fu (傅杰)
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size ( #5157 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-06-05 10:44:15 -07:00
Yuan
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Divakar Verma
a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer ( #4927 )
...
This PR enables the fused topk_softmax kernel used in moe layer for HIP
2024-06-02 14:13:26 -07:00
Varun Sundar Rabindranath
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-01 08:46:07 +00:00
Tyler Michael Smith
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
Tyler Michael Smith
1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels ( #5168 )
2024-05-31 17:21:38 -07:00
Simon Mo
e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" ( #5149 )
2024-05-30 22:00:26 -07:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-30 19:24:41 -07:00
Alexander Matveev
6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) ( #5136 )
2024-05-30 21:02:11 -05:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00