vllm/design at 48d15a32aa567dfc59ede46683b01cc2321579cb - vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-03-19 21:47:34 +08:00


[Core][Observability] Add KV cache residency metrics (#27793)

Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

The implementation uses monotonic timestamps for accuracy, samples 1% of blocks by default to keep overhead low (~48 bytes/block), is fully thread-safe, and adds zero runtime cost when disabled.
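As a rough illustration of how sampled, monotonic-clock tracking can produce the three measurements above, here is a minimal sketch. It is not the code from this commit: the class `KVResidencySampler`, its `on_alloc`/`on_access`/`on_evict` hooks, and the plain lists standing in for Prometheus histograms are all hypothetical names chosen for the example, and the sketch treats "free" and "evict" as the same event for simplicity.

```python
import random
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class _BlockTimes:
    allocated_at: float    # monotonic time when the block was allocated
    last_access_at: float  # monotonic time of the most recent (re)use


class KVResidencySampler:
    """Hypothetical sketch of sampled KV block residency tracking."""

    def __init__(
        self,
        sample_rate: float = 0.01,
        clock: Callable[[], float] = time.monotonic,
        rng: Callable[[], float] = random.random,
    ) -> None:
        if not 0.0 <= sample_rate <= 1.0:
            raise ValueError("sample_rate must be in [0, 1]")
        self.sample_rate = sample_rate
        self._clock = clock
        self._rng = rng
        self._lock = threading.Lock()
        self._tracked: Dict[int, _BlockTimes] = {}
        # Plain lists stand in for the three histograms.
        self.lifetimes: List[float] = []
        self.idle_before_evict: List[float] = []
        self.reuse_gaps: List[float] = []

    def on_alloc(self, block_id: int) -> None:
        # Only a sampled fraction of blocks carries timestamps at all.
        if self._rng() < self.sample_rate:
            now = self._clock()
            with self._lock:
                self._tracked[block_id] = _BlockTimes(now, now)

    def on_access(self, block_id: int) -> None:
        with self._lock:
            times = self._tracked.get(block_id)
            if times is None:
                return  # block was not sampled
            now = self._clock()
            self.reuse_gaps.append(now - times.last_access_at)
            times.last_access_at = now

    def on_evict(self, block_id: int) -> None:
        with self._lock:
            times = self._tracked.pop(block_id, None)
            if times is None:
                return  # block was not sampled
            now = self._clock()
            self.lifetimes.append(now - times.allocated_at)
            self.idle_before_evict.append(now - times.last_access_at)
```

Injecting `clock` and `rng` keeps the sketch deterministic under test. With `sample_rate=0.0` no block is ever tracked, so the access and evict hooks reduce to a dictionary lookup that finds nothing.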

Two new runtime flags are introduced:

--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)

Signed-off-by: Shivam <shivamprasad91@gmail.com>

2025-12-01 18:27:53 +00:00

arch_overview.md

…

cuda_graphs.md

[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building (#28579)

2025-11-26 14:07:13 -05:00

dbo.md

…

debug_vllm_compile.md

[Frontend] Remap -O to -cc commandline flag (#29557)

2025-11-28 21:51:12 +00:00

fused_moe_modular_kernel.md