| Cyrus Leung | a84e598e21 | [CI/Build] Reorganize models tests (#7820) | 2024-09-13 10:20:06 -07:00 |
| bnellnm | 73202dbe77 | [Kernel][Misc] register ops to prevent graph breaks (#6917); Co-authored-by: Sage Moore <sage@neuralmagic.com> | 2024-09-11 12:52:19 -07:00 |
| Dipika Sikka | 6cd5e5b07e | [Misc] Fused MoE Marlin support for GPTQ (#8217) | 2024-09-09 23:02:52 -04:00 |
| Elfie Guo | e39ebf5cf5 | [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) | 2024-09-05 05:12:26 +00:00 |
| Pavani Majety | 6b3421567d | [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985); Co-authored-by: Simon Mo <simon.mo@hey.com>; Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> | 2024-08-29 14:53:11 -04:00 |
| youkaichao | ef99a78760 | Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) | 2024-08-28 21:27:06 -07:00 |
| Mor Zusman | fdd9daafa3 | [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) | 2024-08-28 15:06:52 -07:00 |
| rasmith | e5697d161c | [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) | 2024-08-28 15:37:47 -04:00 |
| Pavani Majety | b98cc28f91 | [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-08-28 10:01:22 -07:00 |
| LI MOU | 53328d7536 | [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) | 2024-08-21 08:54:31 -07:00 |
| Lucas Wilkinson | 5288c06aa0 | [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) | 2024-08-20 07:09:33 -06:00 |
| Charlie Fu | e837b624f2 | [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) | 2024-08-16 10:06:30 -07:00 |
| youkaichao | 54bd9a03c4 | register custom op for flash attn and use from torch.ops (#7536) | 2024-08-15 22:38:56 -07:00 |
| jon-chuang | 50b8d08dbd | [Misc/Testing] Use torch.testing.assert_close (#7324) | 2024-08-16 04:24:04 +00:00 |
| jon-chuang | a046f86397 | [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208); Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> | 2024-08-12 22:47:41 +00:00 |
| Luka Govedič | 5fb4a3f678 | [Bugfix][Kernel] Increased atol to fix failing tests (#7305) | 2024-08-08 12:16:13 -04:00 |
| afeldman-nm | fd95e026e0 | [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942); Co-authored-by: Andrew Feldman <afeld2012@gmail.com>; Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-08-06 16:51:47 -04:00 |
| Luka Govedič | 8d59dbb000 | [Kernel] Add per-tensor and per-token AZP epilogues (#5941); Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> | 2024-08-06 18:17:08 +00:00 |
| Lucas Wilkinson | a8d604ca2a | [Misc] Disambiguate quantized types via a new ScalarType (#6396) | 2024-08-02 13:51:58 -07:00 |
| Woosuk Kwon | 805a8a75f2 | [Misc] Support attention logits soft-capping with flash-attn (#7022) | 2024-08-01 13:14:37 -07:00 |
| Jee Jee Li | 7ecee34321 | [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) | 2024-07-31 17:12:24 -07:00 |
| HandH1998 | 6512937de1 | Support W4A8 quantization for vllm (#5218) | 2024-07-31 07:55:21 -06:00 |
| Varun Sundar Rabindranath | af647fb8b3 | [Kernel] Tuned int8 kernels for Ada Lovelace (#6848); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-29 20:24:58 -06:00 |
| Thomas Parnell | 9a7e2d0534 | [Bugfix] Allow vllm to still work if triton is not installed. (#6786); Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> | 2024-07-29 14:51:27 -07:00 |
| Varun Sundar Rabindranath | 766435e660 | [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-29 09:42:35 -06:00 |
| Alexander Matveev | 75acdaa4b6 | [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) | 2024-07-27 17:52:33 -04:00 |
| Joe | 14dbd5a767 | [Model] H2O Danube3-4b (#6451) | 2024-07-26 20:47:50 -07:00 |
| Antoni Baum | 0e63494cf3 | Add fp8 support to reshape_and_cache_flash (#6667) | 2024-07-24 18:36:52 +00:00 |
| Tyler Michael Smith | fea59c7712 | [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) | 2024-07-22 14:08:30 -06:00 |
| Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00 |
| Varun Sundar Rabindranath | 2e26564259 | [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593); Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com> | 2024-07-19 18:15:26 -07:00 |
| Robert Shaw | 4cc24f01b1 | [ Kernel ] Enable Dynamic Per Token fp8 (#6547) | 2024-07-19 23:08:15 +00:00 |
| Varun Sundar Rabindranath | b5241e41d9 | [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511); Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-07-18 01:38:35 +00:00 |
| Michael Goin | 978aed5300 | [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) | 2024-07-16 15:31:32 -07:00 |
| Robert Shaw | b675069d74 | [ Misc ] Refactor Marlin Python Utilities (#6082); Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> | 2024-07-11 15:40:11 +00:00 |
| afeldman-nm | 543aa48573 | [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-07-08 17:12:15 +00:00 |
| Lily Liu | 69ec3ca14c | [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-07-04 16:35:51 -07:00 |
| Michael Goin | 47f0954af0 | [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) | 2024-07-03 17:38:00 +00:00 |
| youkaichao | 482045ee77 | [hardware][misc] introduce platform abstraction (#6080) | 2024-07-02 20:12:22 -07:00 |
| Robert Shaw | 7c008c51a9 | [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970); Co-authored-by: Robert Shaw <rshaw@neuralmagic>; Co-authored-by: Michael Goin <michael@neuralmagic.com> | 2024-07-02 21:54:35 +00:00 |
| Avshalom Manevich | 12a59959ed | [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) | 2024-07-01 21:08:29 +00:00 |
| youkaichao | 614aa51203 | [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) | 2024-06-30 20:07:34 -07:00 |
| Tyler Michael Smith | 6a2d659d28 | [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) | 2024-06-28 17:10:34 +00:00 |
| Ilya Lavrenov | 57f09a419c | [Hardware][Intel] OpenVINO vLLM backend (#5379) | 2024-06-28 13:50:16 +00:00 |
| Luka Govedič | 5bfd1bbc98 | [Kernel] Adding bias epilogue support for cutlass_scaled_mm (#5560); Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>; Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> | 2024-06-26 15:16:00 +00:00 |
| Cyrus Leung | 0e9164b40a | [mypy] Enable type checking for test directory (#5017) | 2024-06-15 04:45:31 +00:00 |
| Tyler Michael Smith | 85657b5607 | [Kernel] Factor out epilogues from cutlass kernels (#5391); Co-authored-by: Michael Goin <michael@neuralmagic.com>; Co-authored-by: youkaichao <youkaichao@gmail.com>; Co-authored-by: zifeitong <zifei.tong@parasail.io>; Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com> | 2024-06-13 11:22:19 -07:00 |
| bnellnm | 5467ac3196 | [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) | 2024-06-09 16:23:30 -04:00 |
| Dipika Sikka | ca3ea51bde | [Kernel] Dynamic Per-Token Activation Quantization (#5037); Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>; Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> | 2024-06-07 09:36:26 -07:00 |
| Woosuk Kwon | 41ca62cf03 | [Misc] Add CustomOp interface for device portability (#5255) | 2024-06-05 09:18:19 -07:00 |