71 Commits

Author SHA1 Message Date
Mark McLoughlin
bb0a311213
Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945) (#18459)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-05-21 10:25:23 -07:00
Ning Xie
420caf7557
[UT] Add ut for none hash (#17892)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-15 13:28:11 +08:00
Chen Zhang
e60f550b38
[v1] Support multiple KV cache groups in GPU model runner (#17945)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-14 18:54:54 -07:00
Chen Zhang
f2ae883b67
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager (#18001)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-13 19:09:39 -07:00
Chen Zhang
f0d610a8ae
[v1][KVCacheManager] Avoid full cache hit by controlling max_length (#17999)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-05-13 06:50:38 +00:00
Robert Shaw
d19110204c
[P/D] NIXL Integration (#17751)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Brent Salisbury <bsalisbu@redhat.com>
2025-05-12 09:46:16 -07:00
Chen Zhang
ca66a1674c
[v1] Rename specialized_manager.py to single_type_kv_cache_manager.py (#17946)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-10 16:14:12 -07:00
Chen Zhang
200da9a517
[v1] Move block management logic from KVCacheManager to SpecializedManager (#17474)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-09 15:25:34 +00:00
Ning Xie
d310e6de98
[BUGFIX]: return fast when request requires prompt logprobs (#17251) 2025-05-08 21:25:41 -07:00
Chen Zhang
aabcd2cae3
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager (#17479)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-06 08:50:34 -07:00
Harry Mellor
d6484ef3c3
Add full API docs and improve the UX of navigating them (#17485)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-03 19:42:43 -07:00
Robert Shaw
c777df79f7
[BugFix] Fix Memory Leak (#17567)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-05-02 01:07:03 -07:00
Chen Zhang
81ecf425f0
[v1][Spec Decode] Make sliding window compatible with eagle prefix caching (#17398)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-30 18:25:53 +00:00
Alec
0be6d05b5e
[V1][Metrics] add support for kv event publishing (#16750)
Signed-off-by: alec-flowers <aflowers@nvidia.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-04-30 07:44:45 -07:00
Marko Rosenmueller
77073c77bc
[Core] Prevent side-channel attacks via cache salting (#17045)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-04-30 20:27:21 +08:00
Lily Liu
20e489eaa1
[V1][Spec Decode] Make eagle compatible with prefix caching. (#17137)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-04-27 09:29:43 -07:00
Ning Xie
fd11a325b8
[MISC] rename interval to max_recent_requests (#14285) 2025-04-26 16:59:18 +00:00
Nick Hill
df6f3ce883
[Core] Remove prompt string from engine core data structures (#17214)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-04-25 23:41:05 -07:00
Mark McLoughlin
340d7b1b21
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics (#16665)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-04-24 08:57:40 -07:00
Rui Qiao
c0dfd97519
[V1][PP] Optimization: continue scheduling prefill chunks (#17080)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-04-24 05:27:08 -07:00
Woosuk Kwon
c4ab9f3e71
[V1] Remove pre-allocation for KV cache (#16941)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-04-22 00:52:18 -07:00
Woosuk Kwon
3a0fba5cf4
[V1][Spec Decode] Handle draft tokens beyond max_model_len (#16087)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-04-21 12:38:50 -07:00
vie-serendipity
d9737ca1c6
[V1][Misc] stop update prefix cache stats when logs_stats is disabled (#16460)
Signed-off-by: vie-serendipity <2733147505@qq.com>
2025-04-19 02:25:19 -07:00
Yihua Cheng
3408e47159
[P/D][V1] KV Connector API V1 (#15960)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Signed-off-by: remi <remi@mistral.ai>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
2025-04-17 13:22:40 -07:00
Lily Liu
f49e5aff11
[V1][Spec Decode] KV cache slots for eagle heads (#16370)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-04-12 19:42:51 -07:00
Michael Goin
aa3b3d76e0
Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True (#16447)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-04-11 08:09:52 +00:00
rongfu.leng
4716377fbc
[Feature] Estimate max-model-len use available KV cache memory (#16168)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-04-08 19:12:51 -07:00
Michael Goin
8e5314a468
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill (#15837)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-04-07 23:24:07 -07:00
Roger Wang
f2ebb6f541
[V1] Scatter and gather placeholders in the model runner (#16076)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
2025-04-08 10:43:41 +08:00
Roger Wang
af51d80fa1
Revert "[V1] Scatter and gather placeholders in the model runner" (#16075) 2025-04-04 14:50:57 -07:00
Cyrus Leung
f5722a5052
[V1] Scatter and gather placeholders in the model runner (#15712)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-04-04 21:26:44 +00:00
Mark McLoughlin
a35a8a8392
[V1][Spec Decode] Avoid logging useless nan metrics (#16023)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-04-04 08:52:41 -07:00
Mark McLoughlin
a79cc68b3a
[V1][Metrics] Initial speculative decoding metrics (#15151)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-04-01 10:45:04 -07:00
Chen Zhang
3a5f0afcd2
[V1] Implement sliding window attention in kv_cache_manager (#14097)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-01 00:33:17 -07:00
Mark McLoughlin
f98a4920f9
[V1][Core] Remove unused speculative config from scheduler (#15818)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-31 19:15:21 +00:00
Cody Yu
54aa619459
[V1] Refactor num_computed_tokens logic (#15307)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-27 04:54:36 +00:00
marko
27df5199d9
Support SHA256 as hash function in prefix caching (#15297)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-03-26 11:11:28 -07:00
Lu Fang
082ab86f5f
[V1] Support long_prefill_token_threshold in v1 scheduler (#15419)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-25 14:22:26 -07:00
Chen Zhang
93a00d7dde
[v1] Refactor KVCacheConfig (#14079)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-21 04:56:27 -07:00
Woosuk Kwon
0c6f5023c3
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-20 17:50:43 -07:00
afeldman-nm
ef64044079
[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (#13949) 2025-03-08 01:48:12 +00:00
Aaron Pham
80e9afb5bc
[V1][Core] Support for Structured Outputs (#12388)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-07 07:19:11 -08:00
Harry Mellor
cf069aa8aa
Update deprecated Python 3.8 typing (#13971) 2025-03-02 17:34:51 -08:00
Chen Zhang
28943d36ce
[v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-28 20:53:31 +00:00
Woosuk Kwon
cd4a72a28d
[V1][Spec decode] Move drafter to model runner (#13363)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00
Lily Liu
80f63a3966
[V1][Spec Decode] Ngram Spec Decode (#12193)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00
Cody Yu
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue (#13064) 2025-02-15 03:59:01 -08:00
Mark McLoughlin
75e6e14516
[V1][Metrics] Add several request timing histograms (#12644)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-11 10:14:00 -05:00
Cody Yu
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) 2025-02-11 08:27:25 +00:00
Woosuk Kwon
3243158336
[V1] Move KV block hashes from Request to KVCacheManager (#12922)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-07 19:14:10 -08:00