shivampr cabc77cc86
[Core][Observability] Add KV cache residency metrics (#27793)
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:

vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block

These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.

The implementation uses monotonic timestamps for accuracy and 1% sampling to keep overhead minimal (~48 bytes/block); it is fully thread-safe and has zero runtime cost when disabled.
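
A minimal sketch of this recording scheme, assuming a `prometheus_client`-style histogram; the tracker class and hook names below are illustrative only, not vLLM's actual internals:

```python
# Illustrative sketch only, not vLLM's implementation: how sampled,
# monotonic-clock residency tracking could feed a Prometheus histogram.
# The tracker class and hook names are hypothetical.
import random
import threading
import time

from prometheus_client import Histogram

KV_BLOCK_LIFETIME = Histogram(
    "vllm:kv_block_lifetime_seconds",
    "Total lifetime of a KV cache block from allocation to free",
)
# vllm:kv_block_idle_before_evict_seconds and vllm:kv_block_reuse_gap_seconds
# would be recorded with the same pattern.


class BlockResidencyTracker:  # hypothetical helper
    def __init__(self, enabled: bool = False, sample: float = 0.01):
        self._enabled = enabled          # --kv-cache-metrics
        self._sample = sample            # --kv-cache-metrics-sample
        self._alloc_ts: dict[int, float] = {}  # block_id -> monotonic alloc time
        self._lock = threading.Lock()

    def on_alloc(self, block_id: int) -> None:
        # Sample at allocation time so unsampled blocks carry no bookkeeping.
        if not self._enabled or random.random() >= self._sample:
            return
        with self._lock:
            self._alloc_ts[block_id] = time.monotonic()

    def on_free(self, block_id: int) -> None:
        if not self._enabled:
            return
        with self._lock:
            start = self._alloc_ts.pop(block_id, None)
        if start is not None:
            KV_BLOCK_LIFETIME.observe(time.monotonic() - start)
```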

Two new runtime flags are introduced:

--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)
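
Assuming the metrics are scraped from vLLM's existing Prometheus endpoint, the new series behave like any other Prometheus histogram: each exports `_bucket`, `_sum`, and `_count` series, so, for example, the mean block lifetime over the last five minutes can be derived as `rate(vllm:kv_block_lifetime_seconds_sum[5m]) / rate(vllm:kv_block_lifetime_seconds_count[5m])`.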

Signed-off-by: Shivam <shivamprasad91@gmail.com>

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Where to get started with vLLM depends on the type of user. If you are looking to:

  • Run open-source models on vLLM, we recommend starting with the Quickstart Guide
  • Build applications with vLLM, we recommend starting with the User Guide
  • Build vLLM, we recommend starting with the Developer Guide

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data, and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the usage example after this list)
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre, and Huawei Ascend.
  • Prefix caching support
  • Multi-LoRA support
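
As a quick illustration of the OpenAI-compatible API server mentioned above, the official `openai` Python client can talk to a running vLLM server; the base URL and model name below are placeholders for whatever you deploy:

```python
# Minimal sketch: querying a running vLLM OpenAI-compatible server.
# Assumes `vllm serve <model>` is already listening on localhost:8000;
# the model name below is a placeholder for the model you served.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # no real key is required by default
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
)
print(completion.choices[0].message.content)
```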

For more information, check out the following: