afeldman-nm
bf7f470b22
[V1] Logits processors extensibility ( #19912 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Signed-off-by: Andrew Feldman <afeld2012@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-16 12:59:17 -07:00
Cyrus Leung
4dff91c93d
[Refactor] Allow optional MultiModalKwargsItem in IPC ( #23022 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-16 11:30:49 +00:00
Nick Hill
ad0297d113
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() ( #22944 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-15 17:00:36 -07:00
Yong Hoon Shin
3e2f7985a2
Support multiple attention groups for KV sharing ( #22672 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-08-15 16:54:10 -07:00
Or Ozeri
c280066f9d
[v1] Move block_hashes from KVCacheManager to Request.block_hashes ( #19728 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-08-15 16:52:52 -07:00
Nick Hill
b9dc9d2607
[BugFix] Handle case where async utility call is cancelled ( #22996 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
2025-08-15 17:38:42 -06:00
fhl2000
74f441f4b5
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer ( #20059 )
...
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-08-15 10:01:39 -04:00
Thomas Parnell
75531a6c13
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) ( #22928 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Daniel Afrimi <danielafrimi8@gmail.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-08-15 12:57:06 +00:00
Asaf Joseph Gardin
3d232dbd19
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn ( #22818 )
...
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-15 06:38:05 +00:00
Nick Hill
ae05a6d83d
[BugFix] Fix port lookup in internal DP LB tests ( #22252 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-15 11:17:11 +08:00
Nick Hill
ebcce2cd36
[Core] Return final response for aborted requests from AsyncLLM.generate ( #22283 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-14 14:49:02 -07:00
Lucas Wilkinson
b8ff05361a
[CI] Temporarily disable flaky test ( #22930 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-14 19:59:16 +00:00
Jialin Ouyang
31a500c86f
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP ( #22437 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-08-13 14:44:06 -07:00
Cyrus Leung
b4b78d6317
[CI/Build] Fix param mismatch in test_eagle_correctness ( #22847 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-13 10:55:25 -07:00
Nicolò Lucchesi
12817a8ac7
[CI] Fix tests/v1/e2e/test_kv_sharing_fast_prefill.py import on test ( #22815 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-13 10:35:50 -07:00
Cyrus Leung
19b927e52d
[Core] Use individual MM items in P0/P1 cache and model runner ( #22570 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-13 07:18:07 -07:00
Nicolò Lucchesi
6b794c756c
[Nixl][CI] Fix tests ( #22806 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-13 06:03:53 -07:00
Giancarlo Delfin
d94e3026de
[V1] Add tree drafting tests for eagle spec decoding ( #22705 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-13 04:11:28 -07:00
Woosuk Kwon
71683ca6f6
[V0 Deprecation] Remove multi-step scheduling ( #22138 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
2025-08-12 20:18:39 -07:00
Nicolò Lucchesi
422f22e012
[CI][Nixl] Check kv cache layout during handshake ( #22745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-12 12:53:52 -07:00
Nicolò Lucchesi
3d9d40efde
[Bugfix][CI] Fix test_remote_decode_lifecycle.py::test_short_prompt_lifecycle ( #22727 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-12 07:30:17 -07:00
phantomlei
bc8372efc3
[Bugfix] Fix erroneous randomly generated cases in bad word testing ( #22170 )
...
Signed-off-by: phantomlei <phantomlei3@gmail.com>
2025-08-12 02:03:22 -07:00
Michael Goin
93d0652433
[CI] Increase timeout for test_completion_with_image_embeds ( #22670 )
...
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-11 20:31:36 -07:00
TJian
65abe111a3
[CI] Skip Tree Attn Test in test_max_len.py to unblock CI ( #22664 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-11 10:36:05 -07:00
22quinn
807d21b80d
[BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI ( #22611 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-11 10:31:36 -07:00
GuanLuo
16fb668b61
fix: NIXL connector transfers partial block to pass full multi-modal context ( #21074 )
...
Signed-off-by: GuanLuo <gluo@nvidia.com>
2025-08-11 09:40:55 -07:00
Nick Hill
5898b135ab
[BugFix] Fix KVConnectorOutput TPU breakage ( #22598 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-10 19:33:48 -07:00
Chengji Yao
2a84fb422f
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block ( #22394 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
2025-08-09 20:49:04 -07:00
Le Chen
3d7363e61c
[Config] add "qwen" as a native eagle3 target supported model ( #22333 )
...
Signed-off-by: lechen <lecself@163.com>
Signed-off-by: LeChen <lecself@163.com>
2025-08-09 20:21:05 -07:00
Kyuyeun Kim
9a0c5ded5a
[TPU] Add support for online w8a8 quantization ( #22425 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2025-08-08 23:12:54 -07:00
Chauncey
17eaaef595
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match ( #22065 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-08-07 19:20:21 -07:00
TJian
1ee5ead5f8
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine ( #21496 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-07 19:13:17 -07:00
Harry Mellor
7e3a8dc906
Remove from_dict from SpeculativeConfig ( #22451 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-07 10:13:04 -07:00
Chen Zhang
4815b00f54
[gpt-oss] Generate ResponseOutputItem from Harmony Message ( #22410 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-08-07 08:33:25 -07:00
Michael Goin
a00d8b236f
Use float32 for test_completion.py ( #22385 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-08-07 11:07:47 +08:00
Lucas Wilkinson
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix ( #21588 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-06 18:40:52 -07:00
Asaf Joseph Gardin
46a13949d5
[v1] - Mamba1 Attention Metadata ( #21249 )
...
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-06 17:03:42 -07:00
Giancarlo Delfin
469b3ffaaa
[V1] port xformers backend to v1 ( #21342 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-05 10:04:46 -07:00
Nicolò Lucchesi
0c275ad5ad
[V0 Deprecation][TPU] Remove V1 flag check from tests ( #22248 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-05 06:53:23 -07:00
Giancarlo Delfin
5ea71ff46f
[V1] reduce block size for tree attention correctness test to fix 'ou… ( #22207 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-04 19:11:06 -07:00
Woosuk Kwon
7175817637
Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." ( #22223 )
2025-08-04 18:37:06 -07:00
PiteXChen
2dffac464c
[Bugfix] V1 Fix the cursor leakage issue during request scheduling. ( #21173 )
...
Signed-off-by: CLFutureX <775523362@qq.com>
2025-08-04 18:34:10 -07:00
22quinn
54de71d0df
[Sampler] Support returning all logprobs or logits ( #21792 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-08-04 03:04:12 -07:00
Tyler Michael Smith
8ecb3e9e93
[CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes ( #22163 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-08-03 22:19:04 -07:00
Giancarlo Delfin
aa7012eb6d
Add tree attention backend for v1 (part 1) ( #20401 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
2025-08-03 22:13:26 -07:00
Abirdcfly
0d7db16a92
[PD] add test for chat completions endpoint ( #21925 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com>
2025-08-03 19:57:03 -07:00
Woosuk Kwon
6d98843b31
[Responses API] Disable response store by default ( #22137 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-03 04:04:21 -07:00
David Ben-David
aefeea0fde
[V1] [P/D] Refactor KV Connector Path ( #21980 )
...
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
2025-08-03 04:03:40 -07:00
Roger Wang
067c34a155
docs: remove deprecated disable-log-requests flag ( #22113 )
...
Signed-off-by: Roger Wang <hey@rogerw.me>
2025-08-02 00:19:48 -07:00
Yong Hoon Shin
8564dc9448
Fix test_kv_sharing_fast_prefill flakiness ( #22038 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-08-01 23:55:34 -07:00