Varun Sundar Rabindranath | afb050b29d | [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645) | 2024-10-02 19:44:39 +00:00
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Lily Liu | 1570203864 | [Spec Decode] (1/2) Remove batch expansion (#8839) | 2024-10-01 16:04:42 -07:00
Varun Sundar Rabindranath | c2ec430ab5 | [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) | 2024-09-27 13:32:07 -07:00
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
youkaichao | a9b15c606f | [torch.compile] use empty tensor instead of None for profiling (#8875) | 2024-09-27 08:11:32 -07:00
Brittany | 8df2dc3c88 | [TPU] Update pallas.py to support trillium (#8871) | 2024-09-27 01:16:55 -07:00
Luka Govedič | 71c60491f2 | [Kernel] Build flash-attn from source (#8245) | 2024-09-20 23:27:10 -07:00
William Lin | 9e5ec35b1f | [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474) | 2024-09-19 20:49:54 -07:00
Charlie Fu | 9cc373f390 | [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) | 2024-09-19 17:37:57 +00:00
Aaron Pham | 9d104b5beb | [CI/Build] Update Ruff version (#8469) | 2024-09-18 11:00:56 +00:00
    Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Cyrus Leung | 6ffa3f314c | [CI/Build] Avoid CUDA initialization (#8534) | 2024-09-18 10:38:11 +00:00
sroy745 | 1009e93c5d | [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) | 2024-09-17 07:35:01 -07:00
Charlie Fu | 1ef0d2efd0 | [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) | 2024-09-13 17:01:11 -07:00
Kunshang Ji | 851725202a | [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) | 2024-09-13 16:54:34 -07:00
    Co-authored-by: Yan Ma <yan.ma@intel.com>
Alexander Matveev | 019877253b | [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) | 2024-09-12 21:01:50 +00:00
William Lin | a6c0f3658d | [multi-step] add flashinfer backend (#7928) | 2024-09-12 11:16:22 -07:00
youkaichao | 7de49aa86c | [torch.compile] hide slicing under custom op for inductor (#8384) | 2024-09-12 00:11:55 -07:00
Alexander Matveev | 22f3a4bc6c | [Bugfix] lookahead block table with cuda graph max capture (#8340) | 2024-09-10 16:00:35 -07:00
    [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340)
Kevin Lin | 5faedf1b62 | [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) | 2024-09-10 13:18:14 -07:00
Elfie Guo | e39ebf5cf5 | [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) | 2024-09-05 05:12:26 +00:00
Pavani Majety | 622f8abff8 | [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) | 2024-08-30 22:18:50 -07:00
Woosuk Kwon | 2684efc467 | [TPU][Bugfix] Fix tpu type api (#8035) | 2024-08-30 09:01:26 -07:00
Richard Liu | 2148441fd3 | [TPU] Support single and multi-host TPUs on GKE (#7613) | 2024-08-30 00:27:40 -07:00
Pavani Majety | 6b3421567d | [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) | 2024-08-29 14:53:11 -04:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
youkaichao | ef99a78760 | Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) | 2024-08-28 21:27:06 -07:00
Pavani Majety | b98cc28f91 | [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) | 2024-08-28 10:01:22 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Cody Yu | 9606c7197d | Revert #7509 (#7887) | 2024-08-27 00:16:31 -07:00
LI MOU | 53328d7536 | [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) | 2024-08-21 08:54:31 -07:00
Antoni Baum | 3b682179dd | [Core] Add AttentionState abstraction (#7663) | 2024-08-20 18:50:45 +00:00
William Lin | f366f6339b | [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) | 2024-08-16 11:41:56 -07:00
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
youkaichao | 54bd9a03c4 | register custom op for flash attn and use from torch.ops (#7536) | 2024-08-15 22:38:56 -07:00
youkaichao | 4d2dc5072b | [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) | 2024-08-13 00:16:42 -07:00
jon-chuang | a046f86397 | [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) | 2024-08-12 22:47:41 +00:00
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Woosuk Kwon | cfba4def5d | [Bugfix] Fix logit soft cap in flash-attn backend (#7425) | 2024-08-12 09:58:28 -07:00
Lily Liu | ec2affa8ae | [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) | 2024-08-12 07:59:17 +00:00
Antoni Baum | 999ef0b917 | [Misc] Add numpy implementation of compute_slot_mapping (#7377) | 2024-08-09 22:52:29 +00:00
Alexander Matveev | e02ac55617 | [Performance] Optimize e2e overheads: Reduce python allocations (#7162) | 2024-08-08 21:34:28 -07:00
Lily Liu | e53dfd3eaf | [Kernel] Fix Flashinfer Correctness (#7284) | 2024-08-07 16:26:52 -07:00
afeldman-nm | fd95e026e0 | [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) | 2024-08-06 16:51:47 -04:00
    Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Cody Yu | ef527be06c | [MISC] Use non-blocking transfer in prepare_input (#7172) | 2024-08-05 23:41:27 +00:00
Zach Zheng | fb2c1c86c1 | [Bugfix] Fix block table for seqs that have prefix cache hits (#7018) | 2024-08-02 22:38:15 -07:00
Lily Liu | 954f7305a1 | [Kernel] Fix input for flashinfer prefill wrapper. (#7008) | 2024-08-01 18:44:16 -07:00
Woosuk Kwon | 805a8a75f2 | [Misc] Support attention logits soft-capping with flash-attn (#7022) | 2024-08-01 13:14:37 -07:00
Thomas Parnell | 9a7e2d0534 | [Bugfix] Allow vllm to still work if triton is not installed. (#6786) | 2024-07-29 14:51:27 -07:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Woosuk Kwon | fad5576c58 | [TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) | 2024-07-27 10:28:33 -07:00
Woosuk Kwon | 52f07e3dec | [Hardware][TPU] Implement tensor parallelism with Ray (#5871) | 2024-07-26 20:54:27 -07:00
Joe | 14dbd5a767 | [Model] H2O Danube3-4b (#6451) | 2024-07-26 20:47:50 -07:00
Cody Yu | 309aaef825 | [Bugfix] Fix decode tokens w. CUDA graph (#6757) | 2024-07-24 22:33:56 -07:00
Antoni Baum | 5448f67635 | [Core] Tweaks to model runner/input builder developer APIs (#6712) | 2024-07-24 12:17:12 -07:00
Antoni Baum | 0e63494cf3 | Add fp8 support to reshape_and_cache_flash (#6667) | 2024-07-24 18:36:52 +00:00
Michael Goin | 9e0b558a09 | [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) | 2024-07-23 04:11:50 +00:00