[docs] Update v1 metrics design doc (#27332)

Signed-off-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: atalhens <sneh.lata@nutanix.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: atalhens <sneh.lata@nutanix.com>

@@ -1,12 +1,12 @@
# Metrics

vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.

## Objectives

- Provide comprehensive coverage of engine and request-level metrics to aid production monitoring.
- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
## Background

@@ -17,45 +17,36 @@ Metrics in vLLM can be categorized as follows:

The mental model is that server-level metrics help explain the values of request-level metrics.

### v0 Metrics

In v0, the following metrics were exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:

- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
- `vllm:num_requests_waiting` (Gauge)
- `vllm:gpu_cache_usage_perc` (Gauge)
- `vllm:cpu_cache_usage_perc` (Gauge)
- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
- `vllm:prompt_tokens_total` (Counter)
- `vllm:generation_tokens_total` (Counter)
- `vllm:request_success_total` (Counter)
- `vllm:request_prompt_tokens` (Histogram)
- `vllm:request_generation_tokens` (Histogram)
- `vllm:time_to_first_token_seconds` (Histogram)
- `vllm:time_per_output_token_seconds` (Histogram)
- `vllm:e2e_request_latency_seconds` (Histogram)
- `vllm:request_queue_time_seconds` (Histogram)
- `vllm:request_inference_time_seconds` (Histogram)
- `vllm:request_prefill_time_seconds` (Histogram)
- `vllm:request_decode_time_seconds` (Histogram)
- `vllm:request_max_num_generation_tokens` (Histogram)
- `vllm:num_preemptions_total` (Counter)
- `vllm:cache_config_info` (Gauge)
- `vllm:lora_requests_info` (Gauge)
- `vllm:tokens_total` (Counter)
- `vllm:iteration_tokens_total` (Histogram)
- `vllm:time_in_queue_requests` (Histogram)
- `vllm:model_forward_time_milliseconds` (Histogram)
- `vllm:model_execute_time_milliseconds` (Histogram)
- `vllm:request_params_n` (Histogram)
- `vllm:request_params_max_tokens` (Histogram)
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

### v1 Metrics

In v1, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:

- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
- `vllm:num_requests_waiting` (Gauge) - Number of requests currently waiting.
- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
- `vllm:mm_cache_queries` (Counter) - (For multimodal models) Number of multimodal cache queries.
- `vllm:mm_cache_hits` (Counter) - (For multimodal models) Number of multimodal cache hits.
- `vllm:num_preemptions_total` (Counter) - Number of preemptions.
- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
- `vllm:iteration_tokens_total` (Histogram) - Histogram of tokens processed in each engine step.
- `vllm:cache_config_info` (Gauge) - Information about the cache configuration.
- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
- `vllm:request_params_n` (Histogram) - Histogram of the `n` parameter in requests.
- `vllm:request_params_max_tokens` (Histogram) - Histogram of the `max_tokens` parameter in requests.
- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
- `vllm:request_queue_time_seconds` (Histogram) - Time spent in the queue.
- `vllm:request_inference_time_seconds` (Histogram) - Request inference time.
- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.

These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
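
To make the metric types above concrete, here is a minimal `prometheus_client` sketch of how `vllm:`-prefixed Gauge, Counter, and Histogram metrics can be declared and updated. This is illustrative only, not vLLM's actual implementation; the `model_name` label and the bucket boundaries are assumptions.

```python
# Illustrative sketch only, not vLLM's implementation: how "vllm:"-prefixed
# Gauge / Counter / Histogram metrics can be declared and updated with
# prometheus_client. The model_name label and bucket boundaries are assumptions.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

LABELS = ["model_name"]

num_requests_running = Gauge(
    "vllm:num_requests_running", "Number of requests currently running.", LABELS)
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total", "Total number of prompt tokens processed.", LABELS)
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds", "Time to first token (TTFT) in seconds.",
    LABELS, buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])

if __name__ == "__main__":
    start_http_server(8000)  # serve a Prometheus-compatible /metrics endpoint
    num_requests_running.labels(model_name="demo-model").set(3)
    prompt_tokens_total.labels(model_name="demo-model").inc(128)
    time_to_first_token.labels(model_name="demo-model").observe(0.42)
    time.sleep(60)  # keep the process alive long enough to be scraped
```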

@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul

Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.

During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):

```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
```
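
As a sketch of how this middleware is typically wired into a FastAPI app with `prometheus_fastapi_instrumentator` (not vLLM's exact code; the route below is a placeholder):

```python
# Sketch of the prometheus_fastapi_instrumentator pattern, not vLLM's exact
# wiring; the /v1/ping route is a placeholder handler.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()


@app.get("/v1/ping")
async def ping() -> dict:
    return {"status": "ok"}


# Track http_* metrics for every handler and expose them at /metrics.
# Run with e.g.: uvicorn demo_app:app --port 8000
Instrumentator().instrument(app).expose(app)
```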

@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
### Multi-process Mode

Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.

More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See <https://github.com/vllm-project/vllm/pull/17546> and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
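
For reference, this is roughly what `prometheus_client`'s multiprocess mode involves: each process writes samples into a shared directory, and a collector aggregates them at scrape time. The directory and metric names below are illustrative, not vLLM's code.

```python
# Sketch of prometheus_client multiprocess mode (the mechanism implied by
# --api-server-count > 1). Directory and metric names are illustrative.
import os

os.makedirs("/tmp/prom_multiproc", exist_ok=True)
# Must be set before any metrics are created.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_multiproc")

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

demo_requests = Counter("demo_requests_total", "Requests handled by this process.")
demo_requests.inc()

# At scrape time, aggregate the per-process sample files into one registry.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```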
### Built in Python/Process Metrics

@@ -116,14 +109,15 @@ The following metrics are supported by default by `prometheus_client`, but they

- `process_open_fds`
- `process_max_fds`

Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.

## Metrics Design

The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).

### Legacy PRs

To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:

- <https://github.com/vllm-project/vllm/pull/1890>
- <https://github.com/vllm-project/vllm/pull/2316>

@@ -131,14 +125,9 @@ For background, these are some of the relevant PRs which added the v0 metrics:

- <https://github.com/vllm-project/vllm/pull/4464>
- <https://github.com/vllm-project/vllm/pull/7279>

### Metrics Implementation PRs

For background, here are the relevant PRs relating to the metrics implementation <https://github.com/vllm-project/vllm/issues/10582>:

- <https://github.com/vllm-project/vllm/pull/11962>
- <https://github.com/vllm-project/vllm/pull/11973>

@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F

However, `prometheus_client` has [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We simply use a `Gauge` metric set to 1 and `multiprocess_mode="mostrecent"` instead.
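
A sketch of that pattern (the label set is a simplified assumption rather than the full `vllm:cache_config_info` schema, and `multiprocess_mode="mostrecent"` needs a reasonably recent `prometheus_client`):

```python
# Info-style metric emulated with a Gauge: config values become labels and the
# sample value is always 1. "mostrecent" keeps only the newest process's sample
# in multiprocess mode. The labels here are a simplified assumption.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information about the cache configuration.",
    labelnames=["block_size", "cache_dtype"],
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```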

@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.

This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If we revisit this design and deprecate the old metric, we should coordinate with downstream users so they can migrate before the removal.
### Prefix Cache metrics

@@ -478,22 +466,20 @@ us with:

```python
if seq_group.is_finished():
    if (seq_group.metrics.first_scheduled_time is not None and
            seq_group.metrics.first_token_time is not None):
        time_queue_requests.append(
            seq_group.metrics.first_scheduled_time -
            seq_group.metrics.arrival_time)
    ...
    if seq_group.metrics.time_in_queue is not None:
        time_in_queue_requests.append(
            seq_group.metrics.time_in_queue)
```

This seems duplicative, and one of them should be removed. The latter is used by the Grafana dashboard, so we should deprecate or remove the former.
### Prefix Cache Hit Rate

@@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a
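
In other words, the hit rate is derived from the two counters at query time rather than exported as a gauge. A minimal sketch of that derivation (the PromQL in the comment assumes the v1 counter names listed earlier):

```python
# Deriving a hit rate from 'queries' and 'hits' counters at query/report time.
# In Prometheus this would typically be something like:
#   increase(vllm:prefix_cache_hits[5m]) / increase(vllm:prefix_cache_queries[5m])
def hit_rate(hits: int, queries: int) -> float:
    """Hit rate over an interval; 0.0 when there were no queries."""
    return hits / queries if queries else 0.0


assert hit_rate(hits=75, queries=100) == 0.75
```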
### KV Cache Offloading

Two legacy metrics relate to a "swapped" preemption mode that is no longer relevant in v1:

- `vllm:num_requests_swapped`

@@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU

memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`.

Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The SequenceGroup encapsulated the idea of N Sequences which all shared the same prompt kv blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. CPU

@@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed.

SequenceGroup was removed in V1, although a replacement will be required for "parallel sampling" (`n>1`). [Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore

@@ -537,7 +523,7 @@ better.
### Parallel Sampling

Some legacy metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt.

@@ -556,7 +542,7 @@ also add these metrics.
### Speculative Decoding

Some legacy metrics are specific to "speculative decoding". This is where we generate candidate tokens using a faster, approximate method or model and then validate those tokens with the larger model.

@@ -568,7 +554,7 @@ model and then validate those tokens with the larger model.

There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)" speculative decoding to v1. Other techniques will follow. We should revisit these metrics in this context.
!!! note
    We should probably expose acceptance rate as separate accepted and draft counters.
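
A sketch of what that could look like, with the acceptance rate derived at query time rather than exported directly (metric names follow the legacy list above; the descriptions and helper function are illustrative):

```python
# Illustrative sketch: export accepted/draft token counts as Counters and let
# the acceptance rate be derived at query time, e.g. in Prometheus:
#   increase(vllm:spec_decode_num_accepted_tokens_total[5m])
#     / increase(vllm:spec_decode_num_draft_tokens_total[5m])
from prometheus_client import Counter

accepted_tokens = Counter(
    "vllm:spec_decode_num_accepted_tokens_total",
    "Number of speculative draft tokens accepted by the target model.",
)
draft_tokens = Counter(
    "vllm:spec_decode_num_draft_tokens_total",
    "Number of speculative draft tokens proposed.",
)


def record_spec_decode_step(num_draft: int, num_accepted: int) -> None:
    """Record one speculative decoding step."""
    draft_tokens.inc(num_draft)
    accepted_tokens.inc(num_accepted)
```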

@@ -641,7 +627,7 @@ metrics are often relatively straightforward to add:

   metrics are usually of very limited use unless they can be enabled by default and in production.
3. They have an impact on development and maintenance of the project. Every metric added over time has made this effort more time-consuming, and perhaps not all metrics justify this ongoing investment in their maintenance.

@@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual

requests as they move through different services and components. Both fall under the more general heading of "Observability".

vLLM has support for OpenTelemetry tracing:

- Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we consider the topic of tracing to be quite separate from metrics.
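
For orientation, here is a generic sketch of an OTLP trace exporter pointed at such an endpoint using the OpenTelemetry Python SDK. This is not vLLM's tracing code; the endpoint, service name, and span attribute are assumptions.

```python
# Generic OpenTelemetry OTLP tracing sketch, not vLLM's implementation.
# Endpoint, service name, and attributes are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "vllm-demo"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.request.model", "demo-model")
```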
### OpenTelemetry Model Forward vs Execute Time

The current implementation exposes the following two metrics:

- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent in the model forward pass when this request was in the batch.