Chen Zhang
950751a987
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders ( #17483 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-10 16:12:04 -07:00
Lucas Wilkinson
5e6f939484
[Attention] MLA move rotary embedding to cuda-graph region ( #17668 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-09 11:14:42 +08:00
vllmellm
3c9396a64f
[FEAT][ROCm]: Support AITER MLA on V1 Engine ( #17523 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: qli88 <qiang.li2@amd.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
2025-05-09 10:42:05 +08:00
Chen Zhang
cba31c47c4
[v1] AttentionMetadata for each layer ( #17394 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-06 07:58:37 -07:00
Harry Mellor
d6484ef3c3
Add full API docs and improve the UX of navigating them ( #17485 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-03 19:42:43 -07:00
Lucas Wilkinson
afcb3f8863
[Attention] MLA move o_proj q_proj into cuda-graph region ( #17484 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-02 03:16:26 +00:00
Chen Zhang
24e6ad3f16
[V1] Remove num_input_tokens from attn_metadata ( #17193 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-29 09:28:41 -07:00
Zhengyuan Su (苏政渊)
17eb306fcc
[Bugfix] Add contiguous call inside rope kernel wrapper ( #17091 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn>
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn>
2025-04-28 19:24:07 -07:00
Lucas Wilkinson
d8bccde686
[BugFix] Fix vllm_flash_attn install issues ( #17267 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2025-04-27 17:27:56 -07:00
Michael Goin
986537f1c3
[V1] V1 FlashInfer Attention ( #16684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Aurick Qiao <qiao@aurick.net>
2025-04-22 00:38:41 +00:00
Lucas Wilkinson
183dad7a85
[Attention] Update to lastest FA3 code ( #13111 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-04-17 15:14:07 -07:00
Nick Hill
0377b8310b
[MLA] Simplification to batch P/D reordering ( #16673 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-04-17 16:12:09 -04:00
DefTruth
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup ( #16173 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com>
2025-04-11 06:50:50 -06:00
yihong
04149cce27
[BugFix] fix some typos found by typos. ( #16314 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-04-09 03:43:59 -07:00
Lucas Wilkinson
dccf535f8e
[V1] Enable V1 Fp8 cache for FA3 in the oracle ( #15191 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-23 15:07:04 -07:00
Lehua Ding
91ca929dc7
[V1] Fix wrong import path of get_flash_attn_version ( #15280 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com>
2025-03-21 03:54:11 -07:00
Woosuk Kwon
0c6f5023c3
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface ( #15250 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-20 17:50:43 -07:00
Lucas Wilkinson
9532c49836
[Attention] MLA get rid of materialization ( #14770 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-13 23:39:02 -07:00
Jeff Daily
a1c8f3796c
dynamic distpatch of fp8 kernels ( #14245 )
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-11 10:54:56 -04:00
Simon Mo
fb0acb6c72
[Perf] Improve MLA on V1 ( #14540 )
...
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-10 12:06:58 -07:00
Lucas Wilkinson
db84f5eb3b
[Bugfix] DeepSeek Accuracy ( #14476 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-08 16:47:03 +00:00
Tyler Michael Smith
ca7a2d5f28
Revert "[Perf] Reduce MLA CPU overheads in V1 ( #14384 )" ( #14471 )
2025-03-07 22:18:53 -08:00
Luka Govedič
e1744502c2
[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object ( #14390 )
...
Signed-off-by: luka <luka@neuralmagic.com>
2025-03-07 05:20:16 +00:00
Lucas Wilkinson
dae6896977
[Perf] Reduce MLA CPU overheads in V1 ( #14384 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-06 19:59:14 -08:00
Ying Zhong
9f1710f1ac
Fix mla prefill context performance ( #13897 )
...
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
2025-03-06 09:35:49 -08:00
Lucas Wilkinson
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues ( #14253 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing ( #13971 )
2025-03-02 17:34:51 -08:00
Lucas Wilkinson
2e94b9cfbb
[Attention] Flash MLA for V1 ( #13867 )
...
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
2025-02-27 23:03:41 +00:00
Yang Chen
58d1b2aa77
[Attention] MLA support for V1 ( #13789 )
...
Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-27 13:14:17 -05:00