Thomas Parnell
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-10-21 22:29:57 +08:00
Chen Zhang
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
Thomas Parnell
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-18 17:55:48 -07:00
Robert Shaw
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
Removes block manager v1. This is the first step toward a prefix-caching-centric design: we simplify the code path so that only the v2 block manager, which has much higher prefix-caching performance, is used.
2024-10-17 11:38:15 -05:00
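The commit above motivates a prefix-caching-centric design. The core idea can be sketched in a toy example (this is illustrative only, not vLLM's actual implementation; the class and function names are hypothetical): KV-cache blocks are keyed by a hash of the entire token prefix up to and including that block, so requests sharing a prompt prefix map to the same physical blocks and can reuse them instead of recomputing attention.

```python
import hashlib

# Toy sketch of hash-based prefix caching. Assumptions: a small block
# size of 4 tokens (vLLM's real default differs) and a simplified
# cache with no eviction.

BLOCK_SIZE = 4  # tokens per block

def block_hashes(token_ids):
    """One hash per full block; each hash covers the whole prefix so
    far, so a block match implies all earlier blocks match too."""
    hashes = []
    for end in range(BLOCK_SIZE, len(token_ids) + 1, BLOCK_SIZE):
        prefix = ",".join(map(str, token_ids[:end]))
        hashes.append(hashlib.sha256(prefix.encode()).hexdigest())
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}   # prefix hash -> physical block id
        self.next_id = 0

    def allocate(self, token_ids):
        """Reuse cached blocks for the shared prefix; allocate new
        physical blocks only where the prefix diverges."""
        ids, hits = [], 0
        for h in block_hashes(token_ids):
            if h in self.blocks:
                hits += 1
            else:
                self.blocks[h] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[h])
        return ids, hits

cache = PrefixCache()
a, hits_a = cache.allocate([1, 2, 3, 4, 5, 6, 7, 8])   # cold start: no hits
b, hits_b = cache.allocate([1, 2, 3, 4, 5, 6, 9, 10])  # shares the first block
```

Here the second request reuses the first request's initial block (a cache hit) and allocates a fresh block only for the diverging tail, which is the behavior a single, prefix-caching-aware block manager makes easy to optimize.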
Russell Bryant
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-16 22:55:59 +00:00
Woosuk Kwon
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
Lily Liu
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
Xiang Xu
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
Lily Liu
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
Wallas Henrique
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-11 11:18:50 -07:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
Varun Sundar Rabindranath
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-06 12:48:11 -07:00
youkaichao
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
youkaichao
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
Sergey Shlyapnikov
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
Varun Sundar Rabindranath
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-02 19:44:39 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
youkaichao
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
Brittany
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
Luka Govedič
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
William Lin
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
Kunshang Ji
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
Co-authored-by: Yan Ma <yan.ma@intel.com>
2024-09-13 16:54:34 -07:00
Alexander Matveev
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
William Lin
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
youkaichao
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
Alexander Matveev
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
Kevin Lin
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
Woosuk Kwon
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
Richard Liu
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cody Yu
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
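The crash above hinges on the grouped-query-attention "group size": the number of query heads sharing one KV head. A minimal sketch of the condition (illustrative only; the function names are hypothetical and this is not the actual vLLM/FlashInfer code) is:

```python
# Sketch of the group-size condition behind the crash above: the
# kernel path in question only supported group sizes of 1, 2, 4, or 8,
# so other head configurations needed a fallback.

SUPPORTED_GROUP_SIZES = {1, 2, 4, 8}

def attention_group_size(num_query_heads: int, num_kv_heads: int) -> int:
    """GQA group size: query heads per KV head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return num_query_heads // num_kv_heads

def needs_fallback(num_query_heads: int, num_kv_heads: int) -> bool:
    """True if this head configuration cannot use the fast path."""
    group = attention_group_size(num_query_heads, num_kv_heads)
    return group not in SUPPORTED_GROUP_SIZES

# e.g. 64 query heads / 8 KV heads -> group size 8 (supported);
# 40 query heads / 8 KV heads -> group size 5 (needs fallback).
```

A configuration with an unsupported group size hitting the fast path with no guard is exactly the kind of crash the fix addresses.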
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
William Lin
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend ( #7571 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-16 11:41:56 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() ( #7102 )
2024-08-13 00:16:42 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend ( #7425 )
2024-08-12 09:58:28 -07:00