Harry Mellor
951445a52d
Remove default values from InitVars so that they're not stored ( #29859 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-02 12:16:37 +00:00
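The commit title does not say which mechanism it targets. As a hedged illustration of one way an `InitVar` default can end up "stored" with standard-library dataclasses: writing `name: InitVar[T] = default` leaves `default` behind as a class attribute. Every instance then appears to carry that attribute, even though an `InitVar` is meant to be init-only. The class names below are hypothetical and are not taken from vLLM.

```python
from dataclasses import InitVar, dataclass


@dataclass
class WithDefault:
    value: int = 1
    # InitVar with a default: `scale = 2` stays on the class as a plain attribute.
    scale: InitVar[int] = 2

    def __post_init__(self, scale: int) -> None:
        self.value *= scale


@dataclass
class WithoutDefault:
    # InitVar without a default: no class attribute is created.
    scale: InitVar[int]
    value: int = 1

    def __post_init__(self, scale: int) -> None:
        self.value *= scale


a = WithDefault(value=3, scale=5)
b = WithoutDefault(scale=5, value=3)
print(a.value, a.scale)               # 15 2  (stale class-level default, not the 5 passed in)
print(b.value, hasattr(b, "scale"))   # 15 False
```

In both cases the `InitVar` is excluded from `dataclasses.fields()`; the difference is only the leftover class attribute.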
杰兮
48d15a32aa
[CI] Fix Bad_words test for tokenizer encode/decode asymmetry ( #28193 )
...
Signed-off-by: zhyajie <yajizhan@amd.com>
Co-authored-by: zhyajie <yajizhan@amd.com>
2025-12-02 00:02:12 -08:00
Cyrus Leung
653591d5e7
[Chore] Move tokenizer initialization methods ( #29793 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-02 13:33:37 +08:00
Divakar Verma
e2fbfc955e
[CI][AMD] spec_decode:eagle skip FLASH_ATTN for deepseek on ROCm ( #29827 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-12-02 05:27:46 +00:00
Divakar Verma
a690fb5bd6
[CI][ROCm] Fix test_correctness_sliding_window ( #29243 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-02 04:53:27 +00:00
usberkeley
81fe3f82af
[BugFix] Fix index error in ngram_proposer ( #29779 )
...
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-12-02 04:48:11 +00:00
Zhuohan Li
d0cd728907
[Core] Support resetting all running requests' KV while calling reset_prefix_cache ( #28827 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-02 02:25:05 +00:00
Nick Hill
44822d7ff2
[BugFix] Preserve spec decoding uniform decode when scheduling ( #29759 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-01 17:15:52 -08:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics ( #27793 )
...
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
- vllm:kv_block_lifetime_seconds: total lifetime from allocation to free
- vllm:kv_block_idle_before_evict_seconds: idle duration before eviction
- vllm:kv_block_reuse_gap_seconds: time between consecutive reuses of the same block
These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
The implementation uses monotonic timestamps for accuracy and 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.
Two new runtime flags are introduced:
- --kv-cache-metrics: enable KV cache residency metrics
- --kv-cache-metrics-sample: control the sampling ratio (default: 0.01)
Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
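A minimal sketch of how sampled residency tracking along these lines could work, assuming a hypothetical `KVResidencySampler` with `on_allocate` / `on_access` / `on_evict` hooks; these names are illustrative and are not vLLM's API. Plain lists stand in for the three Prometheus histograms, freeing and eviction are collapsed into a single event, and the allocation itself counts as the first use when measuring reuse gaps.

```python
import random
import threading
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class _BlockTimes:
    allocated_at: float  # monotonic time the block was allocated
    last_used_at: float  # monotonic time of the most recent use


@dataclass
class KVResidencySampler:
    # Hypothetical sketch, not vLLM's implementation.
    sample_rate: float = 0.01                    # fraction of blocks tracked
    clock: Callable[[], float] = time.monotonic  # unaffected by wall-clock jumps
    rng: random.Random = field(default_factory=random.Random)
    # Lists stand in for the lifetime / idle-before-evict / reuse-gap histograms.
    lifetimes: List[float] = field(default_factory=list)
    idle_before_evict: List[float] = field(default_factory=list)
    reuse_gaps: List[float] = field(default_factory=list)
    _tracked: Dict[int, _BlockTimes] = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def on_allocate(self, block_id: int) -> None:
        # Decide once per allocation whether this block is sampled.
        if self.rng.random() >= self.sample_rate:
            return
        now = self.clock()
        with self._lock:
            self._tracked[block_id] = _BlockTimes(now, now)

    def on_access(self, block_id: int) -> None:
        with self._lock:
            rec = self._tracked.get(block_id)
            if rec is None:  # unsampled block: only a dict lookup
                return
            now = self.clock()
            self.reuse_gaps.append(now - rec.last_used_at)
            rec.last_used_at = now

    def on_evict(self, block_id: int) -> None:
        with self._lock:
            rec = self._tracked.pop(block_id, None)
            if rec is None:
                return
            now = self.clock()
            self.idle_before_evict.append(now - rec.last_used_at)
            self.lifetimes.append(now - rec.allocated_at)
```

Sampling keeps per-block timestamps for only about `sample_rate` of blocks, so the histograms become statistical estimates rather than exhaustive records in exchange for lower bookkeeping cost.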
Marcin Ostrowski
5cfa967efa
[Bugfix] TypeError: 'NoneType' object is not callable ( #29414 )
...
Signed-off-by: Marcin Ostrowski <marcinx.ostrowski@intel.com>
2025-12-01 13:16:44 +00:00
Isotr0py
b95db244ee
[v1] Add real sliding window calculation to FlexAttention direct BlockMask building ( #26015 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-12-01 13:12:51 +00:00
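PyTorch FlexAttention expresses masks as a `mask_mod(b, h, q_idx, kv_idx)` predicate and skips work using a block-sparse `BlockMask`. The commit title does not describe how vLLM computes its sliding-window blocks, so the following is only a pure-Python sketch under an assumed convention (a query sees itself and the previous `window - 1` tokens). The helper names are hypothetical; it derives which KV blocks a query block touches directly from the window bounds and checks that against a brute-force scan.

```python
def sliding_window_causal(b, h, q_idx, kv_idx, window):
    """Token-level rule in the (b, h, q_idx, kv_idx) mask_mod convention."""
    return kv_idx <= q_idx and q_idx - kv_idx < window


def kv_blocks_for_q_block(q_block, block_size, window):
    """KV block indices containing at least one key visible to q_block."""
    q_first = q_block * block_size
    q_last = q_first + block_size - 1
    # Earliest visible key for any query in the block is q_first - window + 1;
    # the latest is q_last itself (causality). Every key in between is visible.
    kv_first = max(0, q_first - window + 1)
    return list(range(kv_first // block_size, q_last // block_size + 1))


def kv_blocks_brute_force(q_block, block_size, window):
    # Reference: test every (query, key) pair and collect the KV blocks hit.
    hits = set()
    for q in range(q_block * block_size, (q_block + 1) * block_size):
        for kv in range(q + 1):
            if sliding_window_causal(0, 0, q, kv, window):
                hits.add(kv // block_size)
    return sorted(hits)


print(kv_blocks_for_q_block(2, block_size=4, window=4))  # [1, 2]
```

Computing block ranges from the window bounds avoids evaluating the predicate for every token pair, which is the point of building the block mask directly.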
Cyrus Leung
f0a28bf661
[Misc] Unify tokenizer registration ( #29767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-01 11:34:58 +00:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface ( #29693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Lucas Wilkinson
e23f665d83
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not iterable ( #29698 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-28 20:19:01 -08:00
Benjamin Chislett
1986de1375
[Perf] Optimize EAGLE prepare_inputs_padded with triton kernels ( #28597 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-11-28 22:25:05 +00:00
Cyrus Leung
8d9338fae4
[Chore] Rename Processor to InputProcessor ( #29682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-28 09:35:41 -08:00
Nick Hill
8e7a891602
[BugFix] Fix spec decoding max_tokens scheduling perf issue ( #29542 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-28 20:52:23 +08:00
EanWang211123
37b15e97e8
[Multimodal][Speculative Decoding] Eagle3 mm support, enablement on qwen3vl ( #29594 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn>
Co-authored-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-11-27 22:05:45 -08:00
maang-h
c7ba1f6bc7
[BugFix] Fix ValueError in NewRequestData repr methods ( #29392 )
...
Signed-off-by: maang <maang_h@163.com>
2025-11-28 13:42:30 +08:00
Matthew Bonanni
fc1d8be3dc
[Attention] Update attention imports ( #29540 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-27 11:19:09 -05:00
Ryan Rock
bab438ff3e
[CI/Build] Skip ray tests on ROCm ( #29556 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2025-11-27 07:01:37 -08:00
Cyrus Leung
e6d4f3c254
[Bugfix] Fix pre-commit ( #29601 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-27 02:23:06 -08:00
Morrison Turnansky
0838b52e2e
[Frontend][torch.compile] CompilationConfig Overhaul ( #20283 ): Set up -O infrastructure ( #26847 )
...
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: adabeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-27 01:55:58 -08:00
Micah Williamson
43c5792592
[ROCm][CI] Fix test_cpu_offloading for ROCm ( #29548 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-11-27 07:54:44 +00:00
Lucas Wilkinson
56539cddac
[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building ( #28579 )
2025-11-26 14:07:13 -05:00
Matthew Bonanni
430dd4d9eb
[Attention] Remove imports from vllm/attention/__init__.py ( #29342 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-26 10:53:15 -07:00
Wentao Ye
0b0aa874e8
[Perf] Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement ( #29345 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-26 09:38:52 -07:00
Nick Hill
4e57c6587f
[Core] Support logprobs with spec decode + async scheduling ( #29223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-25 12:55:24 -08:00
Yifan Qiao
48ddb02b79
[Hybrid Allocator] Support KV cache groups with different block_size ( #29143 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-11-25 10:30:57 -05:00
wang.yuqi
67fc16cd8c
[Bugfix] If chunked_prefill is disabled, end the scheduling early. ( #28911 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-11-25 16:06:09 +08:00
Micah Williamson
ef1f7030f0
[ROCm][CI] Fix test_cudagraph_mode failure in AMD CI ( #29367 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-11-25 07:55:09 +00:00
Rémi Delacourt
12c007e288
EAGLE Support DP>1 ( #26086 )
...
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Signed-off-by: remi <remi@mistral.ai>
2025-11-25 07:32:21 +00:00
vllmellm
64deead719
[Bugfix] [ROCm] [UX]: revert Flex attention backend ( #29371 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-25 06:56:06 +00:00
Harry Mellor
316c8492bf
Scheduled removal of guided_* config fields ( #29326 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-25 05:24:05 +00:00
Chen Zhang
71df2a57ef
[Hybrid Allocator] Better layer padding strategy for gpt-oss eagle ( #29303 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-11-24 14:28:32 -08:00
vllmellm
e48b2e6848
[Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic ( #26980 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-24 15:24:49 +00:00
rasmith
3999442f1c
[CI/Build][AMD] Add check for flash_att_varlen_func to test_tree_attention.py ( #29252 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-23 04:45:08 +00:00
rasmith
71362ffab4
[CI/Build][AMD] Skip test_multi_shared_storage_connector_consistency in test_multi_connector.py due to hipErrorLaunchFailure when calling .cpu() ( #29253 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-23 04:42:49 +00:00
Nick Hill
7df331c66b
[BugFix] Fix chunked prompt logprobs + preemption ( #29071 )
2025-11-22 16:07:18 -05:00
Nick Hill
d44a63c6d6
[BugFix] Fix returned logprobs with spec decode + prefill chunking ( #29216 )
...
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-22 22:41:25 +08:00
Nicolò Lucchesi
066209a045
[Attention] Refactor FA block_size limitations to hybrid models only ( #29084 )
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-22 06:38:44 -08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init ( #29186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-22 19:34:15 +08:00
rasmith
8e22da1d7f
[CI/Build] Don't add FLASHINFER backend in test_cpu_offloading.py ( #29229 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-22 11:00:54 +00:00
rasmith
a4fdf2405c
[CI/Build] Skip tests that require libcudart in test_lmcache_integration.py ( #29228 )
...
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-22 10:59:39 +00:00
Mark McLoughlin
c6fa3895e9
[KV Connector] Fix async connector prefix cache metrics ( #28585 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-21 17:45:00 -05:00
Julien Denize
57430fc95c
Default model load/config/tokenizer to mistral format if relevant files exist ( #28659 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-11-21 13:58:59 -08:00
Wentao Ye
1f400c58b8
[CI] Add batch invariant test to ci ( #27842 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-21 09:20:33 -07:00
WeiQing Chen
b34129bf8e
[Misc] remove useless v1 env ( #29164 )
...
Signed-off-by: David Chen <530634352@qq.com>
2025-11-21 01:41:20 -08:00
Jialin Ouyang
30b9c67743
Revert "[Redo] #26368 ( #28771 )" ( #29121 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-20 21:27:45 -08:00
Cyrus Leung
56e96b37e4
[V0 Deprecation] Remove best_of ( #29090 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-21 11:40:40 +08:00