150 Commits

Author SHA1 Message Date
Nick Hill
aea94362c9
[Frontend][V1] Online serving performance improvements (#12287) 2025-01-22 22:22:12 +00:00
Cody Yu
7206ce4ce1
[Core] Support reset_prefix_cache (#12284) 2025-01-22 18:52:27 +00:00
youkaichao
68ad4e3a8d
[Core] Support fully transparent sleep mode (#11743)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:39:32 +08:00
Kevin H. Luu
64ea24d0b3
[ci/lint] Add back default arg for pre-commit (#12279)
Signed-off-by: kevin <kevin@anyscale.com>
2025-01-22 01:15:27 +00:00
Ricky Xu
132a132100
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
Signed-off-by: rickyx <rickyx@anyscale.com>
2025-01-21 11:51:13 -08:00
Roger Wang
b197a5ccfd
[V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-21 13:18:43 +00:00
wangxiyuan
86bfb6dba7
[Misc] Pass attention to impl backend (#12218)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-20 23:25:28 +08:00
Chen Zhang
5f0ec3935a
[V1] Remove _get_cache_block_size (#12214)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-20 21:54:16 +08:00
Cyrus Leung
59a0192fb9
[Core] Interface for accessing model from VllmRunner (#10353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:00:59 +08:00
Roger Wang
81763c58a0
[V1] Add V1 support of Qwen2-VL (#12128)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-19 19:52:13 +08:00
youkaichao
6d0e3d3724
[core] clean up executor class hierarchy between v1 and v0 (#12171)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 14:35:15 +08:00
youkaichao
87a0c076af
[core] allow callable in collective_rpc (#12151)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:47:01 +08:00
Chen Zhang
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-17 07:39:35 +00:00
youkaichao
bf53e0c70b
Support torchrun and SPMD-style offline inference (#12071)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 19:58:53 +08:00
Roger Wang
70755e819e
[V1][Core] Autotune encoder cache budget (#11895)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-15 11:29:00 -08:00
Keyun Tong
ad388d25a8
Type-fix: make execute_model output type optional (#12020) 2025-01-15 09:44:56 +00:00
Chen Zhang
994fc655b7
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003) 2025-01-15 07:55:30 +00:00
youkaichao
ad34c0df0f
[core] platform agnostic executor via collective_rpc (#11256)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-15 13:45:21 +08:00
Woosuk Kwon
b7ee940a82
[V1][BugFix] Fix edge case in VLM scheduling (#12065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-14 20:21:28 -08:00
wangxiyuan
2e0e017610
[Platform] Add output for Attention Backend (#11981)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-14 13:27:04 +00:00
youkaichao
89ce62a316
[platform] add ray_device_key (#11948)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-13 16:20:52 +08:00
Chenguang Li
c3f05b09a0
[Misc]Minor Changes about Worker (#11555)
Signed-off-by: Chenguang Li <757486878@qq.com>
2025-01-13 15:47:05 +08:00
Robert Shaw
619ae268c3
[V1] [2/n] Logging and Metrics - OutputProcessor Abstraction (#11973)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-13 04:54:10 +00:00
Robert Shaw
9597a095f2
[V1][Core][1/n] Logging and Metrics (#11962)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-12 21:02:02 +00:00
Roger Wang
b25cfab9a0
[V1] Avoid sending text prompt to core engine (#11963)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-12 06:36:38 +00:00
Joe Runde
ac2f3f7fee
[Bugfix] Validate lora adapters to avoid crashing server (#11727)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-10 15:56:36 +08:00
Chen Zhang
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary (#11677)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-10 13:14:42 +08:00
WangErXiao
47de8821d3
[Misc]add some explanations for BlockHashType (#11847) 2025-01-08 18:21:30 +00:00
WangErXiao
dc71af0a71
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (#11824) 2025-01-08 04:09:25 +00:00
Roger Wang
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (#11685)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 19:58:16 +00:00
Chen Zhang
e20c92bb61
[Kernel] Move attn_type to Attention.__init__() (#11690)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-07 00:11:28 +08:00
Rui Qiao
022c5c6944
[V1] Refactor get_executor_cls (#11754) 2025-01-06 07:59:16 +00:00
Woosuk Kwon
06bfb51963
[V1] Add BlockTable class (#11693)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-06 14:24:42 +09:00
Yan Burman
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (#11233)
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com>
Signed-off-by: Ido Asraff <idoa@atero.ai>
2025-01-04 14:50:16 +08:00
Kunshang Ji
fbf2564554
[V1] Add RayExecutor support for AsyncLLM (api server) (#11712) 2025-01-04 06:41:31 +00:00
WangErXiao
e5d7ed0c53
[V1] log GPU blocks num for MultiprocExecutor (#11656) 2025-01-04 00:13:12 +00:00
Robert Shaw
ad0d567e1c
[V1] Chore: cruft removal (#11724) 2025-01-03 23:25:02 +00:00
Robert Shaw
1543914c04
[V1] Improve TP>1 Error Handling + Stack Trace (#11721)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-03 21:29:11 +00:00
Robert Shaw
80c751e7f6
[V1] Simplify Shutdown (#11659) 2025-01-03 17:25:38 +00:00
Woosuk Kwon
b55ed6ef8a
[V1][Minor] Optimize token_ids_cpu copy (#11692)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-02 12:04:58 -07:00
Cyrus Leung
23c1b10a4c
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#11674)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 17:00:00 +08:00
Woosuk Kwon
73001445fb
[V1] Implement Cascade Attention (#11635)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
Kazuhiro Serizawa
6d70198b17
[Doc] Fix typo (#11666)
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com>
2025-01-01 08:10:10 +00:00
Roger Wang
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. (#11632)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-31 21:17:22 +00:00
Chen Zhang
8c3230d8c1
[V1] Simpify vision block hash for prefix caching by removing offset from hash (#11646) 2024-12-31 08:56:01 +00:00
sakunkun
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (#11565) 2024-12-31 06:29:04 +00:00
Robert Shaw
5886aa496e
[V1] [6/N] API Server: Better Shutdown (#11586) 2024-12-30 15:51:02 +00:00
youkaichao
3682e33f9f
[v1] fix compilation cache (#11598)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 04:24:12 +00:00
Robert Shaw
4fb8e329fd
[V1] [5/N] API Server: unify Detokenizer and EngineCore input (#11545)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-28 20:51:57 +00:00
Woosuk Kwon
42bb201fd6
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-28 13:33:12 +00:00