110 Commits

Author SHA1 Message Date
Cody Yu
9606c7197d
Revert #7509 (#7887) 2024-08-27 00:16:31 -07:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction (#7663) 2024-08-20 18:50:45 +00:00
William Lin
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-16 11:41:56 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) 2024-08-13 00:16:42 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend (#7425) 2024-08-12 09:58:28 -07:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 (#7319) 2024-08-12 07:59:17 +00:00
Antoni Baum
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping (#7377) 2024-08-09 22:52:29 +00:00
Alexander Matveev
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations (#7162) 2024-08-08 21:34:28 -07:00
Lily Liu
e53dfd3eaf
[Kernel] Fix Flashinfer Correctness (#7284) 2024-08-07 16:26:52 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Cody Yu
ef527be06c
[MISC] Use non-blocking transfer in prepare_input (#7172) 2024-08-05 23:41:27 +00:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
Lily Liu
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. (#7008) 2024-08-01 18:44:16 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
Thomas Parnell
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Woosuk Kwon
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) 2024-07-27 10:28:33 -07:00
Woosuk Kwon
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray (#5871) 2024-07-26 20:54:27 -07:00
Joe
14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Cody Yu
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph (#6757) 2024-07-24 22:33:56 -07:00
Antoni Baum
5448f67635
[Core] Tweaks to model runner/input builder developer APIs (#6712) 2024-07-24 12:17:12 -07:00
Antoni Baum
0e63494cf3
Add fp8 support to reshape_and_cache_flash (#6667) 2024-07-24 18:36:52 +00:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
Cody Yu
e0c15758b8
[Core] Modulize prepare input and attention metadata builder (#6596) 2024-07-23 00:45:24 +00:00
Matt Wong
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 2024-07-20 09:39:07 -07:00
Robert Shaw
683e3cb9c4
[ Misc ] fbgemm checkpoints (#6559) 2024-07-20 09:36:57 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
Noam Gat
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) 2024-07-18 07:47:13 +00:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 (#6164) 2024-07-17 09:37:16 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
Thomas Parnell
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-15 13:14:49 -04:00
Thomas Parnell
e1684a766a
[Bugfix] Fix hard-coded value of x in context_attention_fwd (#6373)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-12 18:30:54 -07:00
Woosuk Kwon
f8f9ff57ee
[Bugfix][TPU] Fix megacore setting for v5e-litepod (#6397) 2024-07-12 15:59:47 -07:00
Michael Goin
d59eb98489
[Model][Phi3-Small] Remove scipy from blocksparse_attention (#6343) 2024-07-12 10:47:17 +08:00
Lily Liu
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351) 2024-07-12 01:32:06 +00:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
JGSweets
e58294ddf2
[Bugfix] Add verbose error if scipy is missing for blocksparse attention (#5695) 2024-07-05 10:41:01 -07:00
Lily Liu
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Gregory Shtrasberg
56b325e977
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention (#6043)
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
2024-07-03 22:19:38 -07:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction (#6080) 2024-07-02 20:12:22 -07:00
Antoni Baum
c4059ea54f
[Bugfix] Add explicit end_forward calls to flashinfer (#6044) 2024-07-01 23:08:58 +00:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) 2024-06-30 20:07:34 -07:00
Lily Liu
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628)
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
2024-06-28 15:28:49 -07:00
Michael Goin
4bf35ed9ae
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (#5936) 2024-06-28 21:12:40 +00:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend (#5379) 2024-06-28 13:50:16 +00:00
Woosuk Kwon
f136da15e1
[Hardware][TPU] Optimize KV cache swapping (#5878) 2024-06-27 21:12:13 -07:00
Woosuk Kwon
f5c8628fdc
[Bugfix][TPU] Fix CPU cache allocation (#5869) 2024-06-26 13:42:40 -07:00
Woosuk Kwon
cbc53b6b8d
[Hardware][TPU] Support parallel sampling & Swapping (#5855) 2024-06-26 11:07:49 -07:00