shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics ( #27793 )
...
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block
These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
Implementation uses monotonic timestamps for accuracy, 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.
Two new runtime flags are introduced:
--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)
Signed-off-by: Shivam <shivamprasad91@gmail.com>
2025-12-01 18:27:53 +00:00
Kevin H. Luu
ec7035c9d4
[ci] Make distributed 8 gpus test optional ( #29801 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2025-12-01 10:22:05 -08:00
knlnguyen1802
fc6acc88ca
[Bugfix] Missing cached item in the MultiModalReceiverCache ( #28525 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Chenguang Zheng <645327136@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-01 10:18:07 -08:00
BADAOUI Abdennacer
d0985c5feb
[Hardware][AMD] Remove ROCm skip conditions for transformers backend tests ( #29782 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com>
2025-12-02 02:03:13 +08:00
sangbumlikeagod
092bb73b8a
[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation ( #24209 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com>
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com>
2025-12-01 18:19:17 +01:00
FredericOdermatt
5d43f7372e
[Doc] Update description disable_any_whitespace ( #29784 )
...
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch>
2025-12-01 16:48:33 +00:00
Shengqi Chen
37593deb02
[CI] fix url-encoding behavior in nightly metadata generation ( #29787 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-01 23:17:20 +08:00
Liu Jinyi
f5516039c5
[Doc] fix heading levels ( #29783 )
...
Signed-off-by: KKKZOZ <kkkzoz@qq.com>
2025-12-01 14:49:22 +00:00
Shengqi Chen
36db0a35e4
[CI] Renovation of nightly wheel build & generation ( #29690 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-01 21:25:39 +08:00
Marcin Ostrowski
5cfa967efa
[Bugfix] TypeError: 'NoneType' object is not callable ( #29414 )
...
Signed-off-by: Marcin Ostrowski <marcinx.ostrowski@intel.com>
2025-12-01 13:16:44 +00:00
Isotr0py
b95db244ee
[v1] Add real sliding window calculation to FlexAttention direct BlockMask building ( #26015 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-12-01 13:12:51 +00:00
Zhengxu Chen
ad9d656bfa
[multimodal][test] Reduce memory utilization for test_siglip to avoid OOM ( #29504 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-01 20:41:48 +08:00
Fanli Lin
f37e8938d2
[XPU] Fix AWQ skipped layer detection in IPEX quantization ( #29774 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
2025-12-01 12:00:52 +00:00
Cyrus Leung
f0a28bf661
[Misc] Unify tokenizer registration ( #29767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-01 11:34:58 +00:00
Mickaël Seznec
86e178f7c4
[crashfix] Eagle + multimodal can crash on mm cache miss ( #29750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-12-01 17:29:33 +08:00
daniel-salib
014ece97c7
[Frontend] Add tool filtering support to ToolServer ( #29224 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-01 08:03:57 +00:00
wang.yuqi
62de4f4257
[Frontend] Resettle pooling entrypoints ( #29634 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-12-01 15:30:43 +08:00
Huamin Li
83805a6078
[CI] Skip paddleocr_vl for transformer 4.57.3 ( #29758 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-12-01 04:38:06 +00:00
Yifei Zhang
1ab8fc8197
Make PyTorch profiler gzip and CUDA time dump configurable ( #29568 )
...
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com>
2025-12-01 04:30:46 +00:00
Shu Wang
f72a817bdf
[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch ( #27141 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-30 16:05:32 -08:00
Woosuk Kwon
ec38a7368d
[Model Runner V2] Use packed mask for prompt bin counts ( #29756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-30 14:15:42 -08:00
Xingyu Liu
21c2627934
[Misc]Remove redundant hidden_size property in ModelConfig ( #29749 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-30 17:14:23 +00:00
Omer Ullman Argov
39d28108f4
[Feat] Support non-gated activations in NVFP4 modelopt path ( #29004 )
2025-11-30 11:02:40 -05:00
Harry Mellor
cd719de5cb
Fix RoPE failures in Transformers nightly ( #29700 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-30 14:29:32 +00:00
Pleaplusone
8c363ed666
[ROCm][Attention] Sliding window support for AiterFlashAttentionBackend ( #29234 )
...
Signed-off-by: ganyi <ygan@amd.com>
2025-11-30 11:31:50 +00:00
Cyrus Leung
64bc09ba27
[Core] Enable inputs_embeds_size separate from hidden_size ( #29741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-30 17:31:12 +08:00
Isotr0py
47539cfd3e
[Bugfix] Fix mismatched nvfp4 gemm output shape ( #29742 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-30 09:15:01 +00:00
Cyrus Leung
2afcec4dec
[Misc] Update TokenizerLike interface and move get_cached_tokenizer ( #29730 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-30 14:59:47 +08:00
朝
9381b5cde0
[Doc]: Fix typo in fused_moe layer ( #29731 )
...
Signed-off-by: BowTen <bowten@qq.com>
2025-11-29 22:29:13 -08:00
Vensen
66b5840287
[Bugfix][sleepmode][fp8 kv cache]: Fix FP8 KV cache + sleep(level=2) gibberish output ( #28783 )
...
Signed-off-by: vensen <vensenmu@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-11-30 14:24:25 +08:00
Huamin Li
82c795d6f2
Fix AttributeError about _use_fi_prefill ( #29734 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-30 06:04:55 +00:00
Isotr0py
e1464c3a08
[Quantization] Enable compressed-tensors AWQ for Turing GPU ( #29732 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-30 06:04:28 +00:00
Xin Yang
a491b0911b
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #29708 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-30 10:37:25 +08:00
Jee Jee Li
b9d0504a36
[Bugfix] Revert test_tokenization.py ( #29729 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-29 16:35:15 +00:00
Jinzhen Lin
1656ad3704
[Kernel][Quantization] add w4a8 support for marlin kernel ( #24722 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-11-29 07:19:33 -08:00
Cyrus Leung
fa59fe417f
[Chore] Move detokenizer_utils to vllm/tokenizers ( #29727 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 06:25:17 -08:00
Cyrus Leung
fe3398fab2
[Chore] Enable passing tokenizer=None into MM processor ( #29724 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 06:25:10 -08:00
Chukwuma Nwaugha
ad7f714d62
hfrunner.classify should return list[list[float]] not list[str] ( #29671 )
...
Signed-off-by: Chukwuma Nwaugha <nwaughac@gmail.com>
2025-11-29 13:57:00 +00:00
dublc
f4341f45d3
[Doc]: fix code block rendering ( #29728 )
...
Signed-off-by: dublc <jdublc0x@gmail.com>
2025-11-29 13:46:48 +00:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface ( #29693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 04:02:21 -08:00
Woosuk Kwon
f223ed4181
[Model Runner V2] Fuse penalties and temperature into single kernel ( #29720 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-29 02:29:16 -08:00
Didier Durand
04a797cd0e
[Doc]: fixing typos in various files. ( #29717 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-29 01:15:39 -08:00
Woosuk Kwon
6afc0ffaf6
[Model Runner V2] Add sample/ directory and reorganize files ( #29719 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-29 00:41:01 -08:00
Jee Jee Li
39e63dec7c
[LoRA] Cleanup LoRA unused code ( #29611 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-28 22:52:58 -08:00
Woosuk Kwon
4a80ad0a25
[Model Runner V2] Don't use UVA buffer for prefill_len ( #29713 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-28 20:27:16 -08:00
Angela Yi
4b17ce6815
Add gpu memory wait before test_async_tp ( #28893 )
...
Signed-off-by: angelayi <yiangela7@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-28 20:19:05 -08:00
Lucas Wilkinson
e23f665d83
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not iterable ( #29698 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-28 20:19:01 -08:00
Woosuk Kwon
ca1b1e7296
[Model Runner V2] Refactor prefill token preparation ( #29712 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-11-28 19:49:17 -08:00
Tsukasa OI
762a4a6ca9
[Frontend] Perform offline path replacement to tokenizer ( #29706 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com>
2025-11-28 18:32:08 -08:00
Cyrus Leung
b2c50eda50
[Bugfix] Fix wrong mock attribute ( #29704 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-29 10:30:41 +08:00