bnellnm
|
37fd47e780
|
[Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596)
|
2024-08-16 14:00:11 -07:00 |
|
bnellnm
|
7759ae958f
|
[Kernel][Misc] dynamo support for ScalarType (#7594)
|
2024-08-16 13:59:49 -07:00 |
|
Charlie Fu
|
e837b624f2
|
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210)
|
2024-08-16 10:06:30 -07:00 |
|
Lucas Wilkinson
|
6aa33cb2dd
|
[Misc] Use scalar type to dispatch to different gptq_marlin kernels (#7323)
|
2024-08-12 14:40:13 -04:00 |
|
Luka Govedič
|
8d59dbb000
|
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-08-06 18:17:08 +00:00 |
|
Isotr0py
|
360bd67cf0
|
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-05 17:54:23 -06:00 |
|
Tyler Michael Smith
|
6e4852ce28
|
[CI/Build] Suppress divide-by-zero and missing return statement warnings (#7001)
|
2024-08-05 16:00:01 -04:00 |
|
Tyler Michael Smith
|
8571ac4672
|
[Kernel] Update CUTLASS to 3.5.1 (#7085)
|
2024-08-05 15:13:43 -04:00 |
|
Lucas Wilkinson
|
a8d604ca2a
|
[Misc] Disambiguate quantized types via a new ScalarType (#6396)
|
2024-08-02 13:51:58 -07:00 |
|
Jee Jee Li
|
7ecee34321
|
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)
|
2024-07-31 17:12:24 -07:00 |
|
Varun Sundar Rabindranath
|
35e9c12bfa
|
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-31 14:40:32 -07:00 |
|
Varun Sundar Rabindranath
|
93548eb37e
|
[Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-31 14:40:22 -07:00 |
|
HandH1998
|
6512937de1
|
Support W4A8 quantization for vllm (#5218)
|
2024-07-31 07:55:21 -06:00 |
|
Tyler Michael Smith
|
cbbc904470
|
[Kernel] Squash a few more warnings (#6914)
|
2024-07-30 13:50:42 -04:00 |
|
Varun Sundar Rabindranath
|
af647fb8b3
|
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-29 20:24:58 -06:00 |
|
Tyler Michael Smith
|
61a97c32f6
|
[Kernel] Fix marlin divide-by-zero warnings (#6904)
|
2024-07-30 01:26:07 +00:00 |
|
Tyler Michael Smith
|
aae6d36f7e
|
[Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908)
|
2024-07-29 18:01:17 -06:00 |
|
Tyler Michael Smith
|
60d1c6e584
|
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (#6901)
|
2024-07-29 09:59:02 -07:00 |
|
Varun Sundar Rabindranath
|
766435e660
|
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-29 09:42:35 -06:00 |
|
Alexander Matveev
|
75acdaa4b6
|
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)
|
2024-07-27 17:52:33 -04:00 |
|
Joe
|
14dbd5a767
|
[Model] H2O Danube3-4b (#6451)
|
2024-07-26 20:47:50 -07:00 |
|
Lucas Wilkinson
|
55712941e5
|
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852)
|
2024-07-27 02:27:44 +00:00 |
|
Li, Jiang
|
3bbb4936dc
|
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation (#6125)
|
2024-07-26 13:50:10 -07:00 |
|
Tyler Michael Smith
|
50704f52c4
|
[Bugfix][Kernel] Promote another index to int64_t (#6838)
|
2024-07-26 18:41:04 +00:00 |
|
Antoni Baum
|
0e63494cf3
|
Add fp8 support to reshape_and_cache_flash (#6667)
|
2024-07-24 18:36:52 +00:00 |
|
Tyler Michael Smith
|
fea59c7712
|
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649)
|
2024-07-22 14:08:30 -06:00 |
|
Alexander Matveev
|
396d92d5e0
|
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612)
|
2024-07-21 19:41:42 -04:00 |
|
Varun Sundar Rabindranath
|
2e26564259
|
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593)
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
|
2024-07-19 18:15:26 -07:00 |
|
Varun Sundar Rabindranath
|
b5241e41d9
|
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-07-18 01:38:35 +00:00 |
|
Alexander Matveev
|
e76466dde2
|
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338)
|
2024-07-17 14:30:28 -07:00 |
|
Michael Goin
|
978aed5300
|
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081)
|
2024-07-16 15:31:32 -07:00 |
|
Tyler Michael Smith
|
9dad5cc859
|
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
|
2024-07-14 13:37:19 +00:00 |
|
Michael Goin
|
47f0954af0
|
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975)
|
2024-07-03 17:38:00 +00:00 |
|
Joe Runde
|
ba4994443a
|
[Kernel] Add punica dimensions for Granite 3b and 8b (#5930)
Signed-off-by: Joe Runde <joe@joerun.de>
|
2024-06-29 10:48:25 +08:00 |
|
Tyler Michael Smith
|
5d2a1a9cf0
|
Unmark more files as executable (#5962)
|
2024-06-28 17:34:56 -04:00 |
|
Tyler Michael Smith
|
6a2d659d28
|
[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)
|
2024-06-28 17:10:34 +00:00 |
|
Chip Kerchner
|
38a1674abb
|
Support CPU inference with VSX PowerPC ISA (#5652)
|
2024-06-26 21:53:04 +00:00 |
|
Luka Govedič
|
5bfd1bbc98
|
[Kernel] Adding bias epilogue support for cutlass_scaled_mm (#5560)
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2024-06-26 15:16:00 +00:00 |
|
Varun Sundar Rabindranath
|
6c916ac8a8
|
[BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-06-23 21:07:11 +00:00 |
|
Roger Wang
|
bd620b01fb
|
[Kernel][CPU] Add Quick gelu to CPU (#5717)
|
2024-06-21 06:39:40 +00:00 |
|
Jinzhen Lin
|
1f5674218f
|
[Kernel] Add punica dimension for Qwen2 LoRA (#5441)
|
2024-06-20 17:55:41 -07:00 |
|
Tyler Michael Smith
|
3f3b6b2150
|
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715)
|
2024-06-20 18:36:10 +00:00 |
|
Varun Sundar Rabindranath
|
a7dcc62086
|
[Kernel] Update Cutlass int8 kernel configs for SM80 (#5275)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-06-20 13:33:21 +00:00 |
|
Roger Wang
|
ad137cd111
|
[Model] Port over CLIPVisionModel for VLMs (#5591)
|
2024-06-20 11:52:09 +00:00 |
|
Varun Sundar Rabindranath
|
111af1fa2c
|
[Kernel] Update Cutlass int8 kernel configs for SM90 (#5514)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
|
2024-06-20 06:37:08 +00:00 |
|
Tyler Michael Smith
|
b23ce92032
|
[Bugfix] Fix CUDA version check for mma warning suppression (#5642)
|
2024-06-18 23:48:49 +00:00 |
|
sergey-tinkoff
|
07feecde1a
|
[Model] LoRA support added for command-r (#5178)
|
2024-06-18 11:01:21 -07:00 |
|
Joe Runde
|
5002175e80
|
[Kernel] Add punica dimensions for Granite 13b (#5559)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-06-18 03:54:11 +00:00 |
|
Tyler Michael Smith
|
348616ac4b
|
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401)
|
2024-06-14 10:02:00 -07:00 |
|
Tyler Michael Smith
|
703475f6c2
|
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516)
|
2024-06-14 09:30:15 -07:00 |
|