From 690cc3ef20eec0d080b8e2fce397bf4f981beaf1 Mon Sep 17 00:00:00 2001
From: TimWang <7367474+haitwang-cloud@users.noreply.github.com>
Date: Fri, 5 Dec 2025 07:37:14 +0800
Subject: [PATCH] docs: update metrics design doc to use new
 vllm:kv_cache_usage_perc (#30041)

Signed-off-by: Tim
---
 docs/design/metrics.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/metrics.md b/docs/design/metrics.md
index 13264f6861b0c..28b5405871ac2 100644
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -62,7 +62,7 @@ The subset of metrics exposed in the Grafana dashboard gives us an indication of
 - `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
 - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
 - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
-- `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
+- `vllm:kv_cache_usage_perc` - Percentage of used cache blocks by vLLM.
 - `vllm:request_prompt_tokens` - Request prompt length.
 - `vllm:request_generation_tokens` - Request generation length.
 - `vllm:request_success` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.