bk-201
c0cc07e7ee
Merge remote-tracking branch 'origin/main' into mlm-full-lora-support
2025-12-03 15:24:12 +00:00
Yong Hoon Shin
69520bc695
Add logging for cudagraph related info ( #29825 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-12-03 01:01:48 -08:00
Jee Jee Li
83556e9d85
Address conflict
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-03 06:10:36 +00:00
Arpit Khandelwal
d7284a2604
[Core] Rename PassConfig flags as per RFC #27995 ( #29646 )
...
Signed-off-by: arpitkh101 <arpit5khandelwal@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-12-03 03:38:55 +00:00
Lucas Wilkinson
5cdd664509
[BugFix] Fix assert in build_for_cudagraph_capture ( #29893 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-02 16:56:54 -08:00
maang-h
5d91d2b292
[Doc] Add allocate_slots parameter docs ( #29777 )
...
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-12-02 23:23:09 +00:00
Chauncey
0a9caca9f5
[Bugfix] fix --scheduling-policy=priority & n>1 crashes engine ( #29764 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 22:42:28 +00:00
jthomson04
1528e079e2
[Perf] Avoid pageable HtoD transfer in MinTokensLogitsProcessor ( #29826 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
2025-12-02 21:25:52 +00:00
Matthew Bonanni
1d93f11675
[Attention][CUDAGraph] Remove CG padding from attention backends ( #29352 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-02 13:48:08 -05:00
Julien Denize
d8c6210eea
Add Mistral Large 3 and Ministral 3 ( #29757 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Mickael Seznec <mickael@mistral.ai>
2025-12-02 10:29:00 +00:00
Wushi Dong
0037b5746a
[Core] Eliminate redundant is_encoder_decoder lookups (20-40us/step) ( #29800 )
...
Signed-off-by: Wushi Dong <dongws@meta.com>
2025-12-02 07:08:07 +00:00
Cyrus Leung
653591d5e7
[Chore] Move tokenizer initialization methods ( #29793 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-02 13:33:37 +08:00
usberkeley
81fe3f82af
[BugFix] Fix index error in ngram_proposer ( #29779 )
...
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-12-02 04:48:11 +00:00
Seiji Eicher
22274b2184
[Misc] Add ReplicaId to Ray metrics ( #24267 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: rongfu.leng <1275177125@qq.com>
2025-12-02 03:21:44 +00:00
Zhuohan Li
d0cd728907
[Core] Support resetting all running requests' KV while calling reset_prefix_cache ( #28827 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 02:25:05 +00:00
Nick Hill
44822d7ff2
[BugFix] Preserve spec decoding uniform decode when scheduling ( #29759 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-01 17:15:52 -08:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics ( #27793 )
...
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
- vllm:kv_block_lifetime_seconds: total lifetime from allocation to free
- vllm:kv_block_idle_before_evict_seconds: idle duration before eviction
- vllm:kv_block_reuse_gap_seconds: time between consecutive reuses of the same block
These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
The implementation uses monotonic timestamps for accuracy and 1% sampling for minimal overhead (~48 bytes/block). It is fully thread-safe and has zero runtime cost when disabled.
Two new runtime flags are introduced:
- --kv-cache-metrics: enable KV cache residency metrics
- --kv-cache-metrics-sample: control sampling ratio (default: 0.01)
Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
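The sampling approach described in the KV cache residency metrics commit ( #27793 ) can be illustrated with a minimal sketch. This is not the code from that commit: the class name `KVResidencySampler`, its methods, and the plain lists standing in for Prometheus histograms are all hypothetical, and the sketch treats "free" and "eviction" as a single event for brevity. It omits locking, so it does not show the thread safety the commit mentions. It only shows how monotonic timestamps and a sampling ratio (default 0.01, as in the commit message) could combine to record lifetime, idle-before-evict, and reuse-gap observations.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class _BlockState:
    allocated_at: float
    last_access_at: float


@dataclass
class KVResidencySampler:
    """Hypothetical sampler; an illustration, not vLLM's implementation."""

    sample_rate: float = 0.01                     # fraction of blocks to track
    clock: Callable[[], float] = time.monotonic   # monotonic, immune to wall-clock jumps
    rng: Callable[[], float] = random.random
    lifetimes: List[float] = field(default_factory=list)
    idle_before_evict: List[float] = field(default_factory=list)
    reuse_gaps: List[float] = field(default_factory=list)
    _tracked: Dict[int, _BlockState] = field(default_factory=dict)

    def on_alloc(self, block_id: int) -> None:
        # Only a sampled subset of blocks carries any per-block state.
        if self.rng() < self.sample_rate:
            now = self.clock()
            self._tracked[block_id] = _BlockState(now, now)

    def on_access(self, block_id: int) -> None:
        state = self._tracked.get(block_id)
        if state is None:
            return  # block was not sampled
        now = self.clock()
        self.reuse_gaps.append(now - state.last_access_at)
        state.last_access_at = now

    def on_free(self, block_id: int) -> None:
        state = self._tracked.pop(block_id, None)
        if state is None:
            return  # block was not sampled
        now = self.clock()
        self.lifetimes.append(now - state.allocated_at)
        self.idle_before_evict.append(now - state.last_access_at)
```

Because unsampled blocks return early from every hook, the per-event cost for most blocks in this sketch is a single dictionary lookup. Injecting `clock` and `rng` is a design choice made here only so the behavior can be checked deterministically.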
Isotr0py
b95db244ee
[v1] Add real sliding window calculation to FlexAttention direct BlockMask building ( #26015 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-12-01 13:12:51 +00:00
Mickaël Seznec
86e178f7c4
[crashfix] Eagle + multimodal can crash on mm cache miss ( #29750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-12-01 17:29:33 +08:00
Yifei Zhang
1ab8fc8197
Make PyTorch profiler gzip and CUDA time dump configurable ( #29568 )
...
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
2025-12-01 04:30:46 +00:00
Woosuk Kwon
ec38a7368d
[Model Runner V2] Use packed mask for prompt bin counts ( #29756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-30 14:15:42 -08:00
Pleaplusone
8c363ed666
[ROCm][Attention] Sliding window support for AiterFlashAttentionBackend ( #29234 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-11-30 11:31:50 +00:00
Cyrus Leung
64bc09ba27
[Core] Enable inputs_embeds_size separate from hidden_size ( #29741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-30 17:31:12 +08:00
Cyrus Leung
2afcec4dec
[Misc] Update TokenizerLike interface and move get_cached_tokenizer ( #29730 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-30 14:59:47 +08:00
Vensen
66b5840287
[Bugfix][sleepmode][fp8 kv cache]: Fix FP8 KV cache + sleep(level=2) gibberish output ( #28783 )
...
Signed-off-by: vensen <vensenmu@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-11-30 14:24:25 +08:00
Huamin Li
82c795d6f2
Fix AttributeError about _use_fi_prefill ( #29734 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-30 06:04:55 +00:00
Cyrus Leung
fa59fe417f
[Chore] Move detokenizer_utils to vllm/tokenizers ( #29727 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 06:25:17 -08:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface ( #29693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Woosuk Kwon
f223ed4181
[Model Runner V2] Fuse penalties and temperature into single kernel ( #29720 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-29 02:29:16 -08:00
Woosuk Kwon
6afc0ffaf6
[Model Runner V2] Add sample/ directory and reorganize files ( #29719 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-29 00:41:01 -08:00
Jee Jee Li
39e63dec7c
[LoRA] Cleanup LoRA unused code ( #29611 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-28 22:52:58 -08:00
Woosuk Kwon
4a80ad0a25
[Model Runner V2] Don't use UVA buffer for prefill_len ( #29713 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-28 20:27:16 -08:00
Lucas Wilkinson
e23f665d83
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not iterable ( #29698 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-28 20:19:01 -08:00
Woosuk Kwon
ca1b1e7296
[Model Runner V2] Refactor prefill token preparation ( #29712 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-28 19:49:17 -08:00
Woosuk Kwon
1dcafb3dea
[Model Runner V2] Support penalties using bin counts ( #29703 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-28 17:53:17 -08:00
Augusto Yao
9726e64530
bugfix: correct attn output with base 2 or e ( #28840 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>
2025-11-29 07:52:12 +08:00
Benjamin Chislett
1986de1375
[Perf] Optimize EAGLE prepare_inputs_padded with triton kernels ( #28597 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-11-28 22:25:05 +00:00
Cyrus Leung
8d9338fae4
[Chore] Rename Processor to InputProcessor ( #29682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 09:35:41 -08:00
Didier Durand
fae6943068
[Doc]: fixing typos in multiple files. ( #29685 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-28 08:41:41 -08:00
Cyrus Leung
9e6bcda3ac
[mypy] Enable type checking for more directories ( #29674 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 08:39:27 -08:00
Harry Mellor
9eec282cb5
Guard FlashInfer sampler using the same check as FlashInfer attention backend ( #29415 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-28 08:34:48 -08:00
Nick Hill
8e7a891602
[BugFix] Fix spec decoding max_tokens scheduling perf issue ( #29542 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-28 20:52:23 +08:00
Cyrus Leung
953d9c820b
[mypy] Pass type checking for vllm/utils and vllm/v1/pool ( #29666 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 20:40:47 +08:00
maang-h
cc0f2a0e19
[Doc] Improve abnormal information string ( #29655 )
...
Signed-off-by: maang <maang_h@163.com>
2025-11-28 00:12:20 -08:00
wang.yuqi
f4b76056ee
Improve enable chunked_prefill & prefix_caching logic. ( #26623 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-27 22:05:48 -08:00
EanWang211123
37b15e97e8
[Multimodal][Speculative Decoding]Eagle3 mm support, enablement on qwen3vl ( #29594 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-11-27 22:05:45 -08:00
maang-h
c7ba1f6bc7
[BugFix] Fix ValueError in NewRequestData repr methods ( #29392 )
...
Signed-off-by: maang <maang_h@163.com>
2025-11-28 13:42:30 +08:00
Lucas Wilkinson
be493e0b3c
[BugFix] Fix new nightly failures ( #29578 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-27 13:45:38 -08:00
Woosuk Kwon
ae0ce1be27
[Model Runner V2][BugFix] Keep reference to GPU tensors in AsyncOutput ( #29623 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-27 12:38:53 -08:00
Andrii Skliar
a5345bf49d
[BugFix] Fix plan API Mismatch when using latest FlashInfer ( #29426 )
...
Signed-off-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com>
Co-authored-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com>
2025-11-27 11:34:59 -08:00