Matthew Bonanni
b30dfa03c5
[Attention] Refactor CUDA attention backend selection logic ( #24794 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-11 07:40:44 -05:00
Lucas Wilkinson
e8697faf03
[V0 deprecation] Remove no longer used get_metadata_cls ( #28370 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-10 14:32:09 +08:00
Chen Zhang
c765f0b443
[FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell ( #27994 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-11-05 09:25:32 -08:00
Yeshwanth N
71b1c8b667
[Chore]:Extract math and argparse utilities to separate modules ( #27188 )
Signed-off-by: Yeshwanth Surya <yeshsurya@gmail.com>
Signed-off-by: Yeshwanth N <yeshsurya@gmail.com>
Signed-off-by: yeshsurya <yeshsurya@gmail.com>
2025-10-26 04:03:32 -07:00
fhl2000
284cc92275
[MISC] cudagraph_capture_sizes related improvements ( #26016 )
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-24 05:11:05 -07:00
Jonathan Chen
ca76486a16
[Chore] Separate out vllm.utils.platform_utils.py ( #27374 )
Signed-off-by: Jonathan <chenleejonathan@gmail.com>
2025-10-23 19:08:06 +00:00
Bram Wasti
b2f78cbad4
[small][batch invariance] Rename the env and internal flags to simplify usage ( #26855 )
Signed-off-by: Bram Wasti <bwasti@meta.com>
2025-10-16 21:40:25 +00:00
rongfu.leng
5afd3276df
[Feature] Add process_weights_after_loading to AttentionImpl ( #26870 )
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-10-16 08:02:30 -07:00
Adrian Abeyta
0a9ef0cfce
Move query quantization to attention layer for Flashinfer & Triton. ( #26534 )
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-15 19:01:38 -04:00
Boyuan Feng
a86b4c58e8
remove attn output view kernel ( #26680 )
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-14 22:53:10 +00:00
Bram Wasti
3263799056
[unrevert] Add batch invariant kernel override for FlashInfer backend [2/n] ( #26373 )
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
2025-10-13 10:24:53 -04:00
Harry Mellor
8fcaaf6a16
Update Optional[x] -> x | None and Union[x, y] to x | y ( #26633 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-12 09:51:31 -07:00
Benjamin Chislett
6e783bc54b
[Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency ( #26499 )
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-09 17:12:34 -04:00
elvischenv
5e49c3e777
Bump Flashinfer to v0.4.0 ( #26326 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-10-08 23:58:44 -07:00
Zhiyuan Li
d24cf322e1
[Hybrid]: Decouple Kernel Block Size from KV Page Size ( #24486 )
Signed-off-by: lizhiyuan <uniartisan2017@gmail.com>
Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
2025-10-08 23:43:39 -07:00
elvischenv
b82f4307c9
[Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters ( #25924 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-10-08 19:54:48 +00:00
Matthew Bonanni
4727a8afa7
[Attention] Remove unused reorder_batch method ( #24463 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-06 13:13:39 -04:00
Harry Mellor
1c0c68202c
Fix per file ruff ignores related to typing ( #26254 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-05 16:37:55 +00:00
Harry Mellor
4e256cadc2
Remove all references to yapf as it's no longer used ( #26251 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-05 09:18:11 -07:00
Harry Mellor
d6953beb91
Convert formatting to use ruff instead of yapf + isort ( #26247 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-05 07:06:22 -07:00
Cyrus Leung
1838cd4860
Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" ( #26220 )
2025-10-04 02:45:08 -07:00
Bram Wasti
2f7dbc9b42
Add batch invariant kernel override for FlashInfer backend [2/n] ( #25769 )
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-03 19:49:30 -07:00
Michael Goin
f1fc2107a3
[Bugfix] Disable cascade attention with FlashInfer ( #26130 )
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-02 16:30:37 -07:00
Yongye Zhu
fa7e254a7f
[New Model] DeepSeek-V3.2 (Rebased to Main) ( #25896 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Co-authored-by: Lucia Fang <fanglu@meta.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com>
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com>
2025-09-30 17:14:41 +08:00
Matthew Bonanni
3468f17ebe
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names ( #25489 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-09-25 17:37:50 +00:00
Benjamin Chislett
c30b405b8f
[Spec Decode] Enable FlashInfer Spec Decoding ( #25196 )
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: lhsjohn <huashuoli@tencent.com>
2025-09-23 22:29:58 -04:00
Benjamin Chislett
1983609239
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen ( #25520 )
2025-09-24 00:19:56 +00:00
nvjullin
b1a63d1b3b
[BugFix] Make FlashInferMetadataBuilder non-blocking ( #25040 )
Signed-off-by: Julien Lin <jullin@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-19 20:36:34 +00:00
elvischenv
e67a79db03
[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic ( #24600 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-09-17 15:36:29 -07:00
Xiaozhu Meng
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention ( #24197 )
Signed-off-by: Xiaozhu <mxz297@gmail.com>
2025-09-11 14:20:09 -07:00
Michael Goin
fba7856581
[Perf] Warmup FlashInfer attention during startup ( #23439 )
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-09-10 15:03:17 -07:00
Chen Zhang
b5e383cd8b
[gpt-oss] raise error for flashinfer backend without trtllm ( #24482 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-09-10 14:33:13 -07:00
Russell Bryant
37e8182bfe
[v1] Add Whisper model support (encoder-decoder) ( #21088 )
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
2025-09-10 13:53:35 -07:00
Thien Tran
a0933c3bd6
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs ( #24577 )
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-09-10 12:33:41 -07:00
elvischenv
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel ( #23647 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-09-08 20:53:07 -07:00
Didier Durand
35bf193864
[Doc]: fix typos in Python comments ( #24294 )
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-09-05 19:41:12 -07:00
Lucas Wilkinson
402759d472
[Attention] FlashAttn MLA ( #14258 )
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2025-09-04 02:47:59 -07:00
co63oc
1bd007f234
fix some typos ( #24071 )
Signed-off-by: co63oc <co63oc@users.noreply.github.com>
2025-09-02 20:44:50 -07:00
Woosuk Kwon
7ffbf27239
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu ( #23737 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-28 14:22:46 -07:00
Hyogeun Oh (오효근)
4e4d017b6f
[Docs] Fix warnings in mkdocs build (continued) ( #23743 )
Signed-off-by: Zerohertz <ohg3417@gmail.com>
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com>
2025-08-27 17:17:29 +00:00
Woosuk Kwon
11eddf02f0
[FlashInfer] Cache hyper params in metadata builder ( #23732 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-27 03:45:04 -07:00
Woosuk Kwon
6578e87365
Optimize input preparation for FlashInfer [2/N] ( #23174 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-27 02:52:45 -07:00
Woosuk Kwon
efc88cf64a
[Misc] Simplify FlashInfer attention metadata ( #23585 )
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
2025-08-25 15:42:29 -07:00
elvischenv
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel ( #22703 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-22 22:09:05 +00:00
Pavani Majety
1d353b6352
[Core] Always use tensor cores for Flashinfer Decode Wrapper ( #23214 )
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-08-21 16:02:11 -04:00
Woosuk Kwon
d6d13bd49e
[Misc] Add max_seq_len to CommonAttentionMetadata ( #23216 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-20 09:05:29 -07:00
Woosuk Kwon
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend ( #23147 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 13:11:51 -07:00
Woosuk Kwon
5b5f350d67
[Misc] Enable yapf for FlashInfer backend ( #23193 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 10:33:47 -07:00
elvischenv
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel ( #21716 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
Michael Goin
000cceca8c
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support ( #23016 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-16 11:16:00 -07:00