[docs] Update v1 metrics design doc (#27332)

Signed-off-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: atalhens <sneh.lata@nutanix.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: atalhens <sneh.lata@nutanix.com>

@@ -1,12 +1,12 @@
# Metrics

vLLM exposes a rich set of metrics to support observability and capacity planning for the V1 engine.

## Objectives

- Provide comprehensive coverage of engine and request-level metrics to aid production monitoring.
- Prioritize Prometheus integrations, as this is what we expect to be used in production environments.
- Offer logging support (i.e. printing metrics to the info log) for ad-hoc testing, debugging, development, and exploratory use cases.
## Background

@@ -17,45 +17,36 @@ Metrics in vLLM can be categorized as follows:

The mental model is that server-level metrics help explain the values of request-level metrics.

### v0 Metrics

In v0, the following metrics were exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:

- `vllm:num_requests_running` (Gauge)
- `vllm:num_requests_swapped` (Gauge)
- `vllm:num_requests_waiting` (Gauge)
- `vllm:gpu_cache_usage_perc` (Gauge)
- `vllm:cpu_cache_usage_perc` (Gauge)
- `vllm:gpu_prefix_cache_hit_rate` (Gauge)
- `vllm:cpu_prefix_cache_hit_rate` (Gauge)
- `vllm:prompt_tokens_total` (Counter)
- `vllm:generation_tokens_total` (Counter)
- `vllm:request_success_total` (Counter)
- `vllm:request_prompt_tokens` (Histogram)
- `vllm:request_generation_tokens` (Histogram)
- `vllm:time_to_first_token_seconds` (Histogram)
- `vllm:time_per_output_token_seconds` (Histogram)
- `vllm:e2e_request_latency_seconds` (Histogram)
- `vllm:request_queue_time_seconds` (Histogram)
- `vllm:request_inference_time_seconds` (Histogram)
- `vllm:request_prefill_time_seconds` (Histogram)
- `vllm:request_decode_time_seconds` (Histogram)
- `vllm:request_max_num_generation_tokens` (Histogram)
- `vllm:num_preemptions_total` (Counter)
- `vllm:cache_config_info` (Gauge)
- `vllm:lora_requests_info` (Gauge)
- `vllm:tokens_total` (Counter)
- `vllm:iteration_tokens_total` (Histogram)
- `vllm:time_in_queue_requests` (Histogram)
- `vllm:model_forward_time_milliseconds` (Histogram)
- `vllm:model_execute_time_milliseconds` (Histogram)
- `vllm:request_params_n` (Histogram)
- `vllm:request_params_max_tokens` (Histogram)
- `vllm:spec_decode_draft_acceptance_rate` (Gauge)
- `vllm:spec_decode_efficiency` (Gauge)
- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)

### v1 Metrics

In v1, the following metrics are exposed via a Prometheus-compatible `/metrics` endpoint using the `vllm:` prefix:

- `vllm:num_requests_running` (Gauge) - Number of requests currently running.
- `vllm:num_requests_waiting` (Gauge) - Number of requests currently waiting.
- `vllm:kv_cache_usage_perc` (Gauge) - Fraction of used KV cache blocks (0–1).
- `vllm:prefix_cache_queries` (Counter) - Number of prefix cache queries.
- `vllm:prefix_cache_hits` (Counter) - Number of prefix cache hits.
- `vllm:mm_cache_queries` (Counter) - (For multimodal models) Number of multimodal cache queries.
- `vllm:mm_cache_hits` (Counter) - (For multimodal models) Number of multimodal cache hits.
- `vllm:num_preemptions_total` (Counter) - Number of preemptions.
- `vllm:prompt_tokens_total` (Counter) - Total number of prompt tokens processed.
- `vllm:generation_tokens_total` (Counter) - Total number of generated tokens.
- `vllm:iteration_tokens_total` (Histogram) - Histogram of tokens processed in each engine step.
- `vllm:cache_config_info` (Gauge) - Information about the cache configuration.
- `vllm:request_success_total` (Counter) - Number of finished requests (by finish reason).
- `vllm:request_prompt_tokens` (Histogram) - Histogram of input prompt token counts.
- `vllm:request_generation_tokens` (Histogram) - Histogram of generation token counts.
- `vllm:request_params_n` (Histogram) - Histogram of the `n` parameter in requests.
- `vllm:request_params_max_tokens` (Histogram) - Histogram of the `max_tokens` parameter in requests.
- `vllm:time_to_first_token_seconds` (Histogram) - Time to first token (TTFT).
- `vllm:inter_token_latency_seconds` (Histogram) - Inter-token latency.
- `vllm:e2e_request_latency_seconds` (Histogram) - End-to-end request latency.
- `vllm:request_queue_time_seconds` (Histogram) - Time spent in the queue.
- `vllm:request_inference_time_seconds` (Histogram) - Request inference time.
- `vllm:request_prefill_time_seconds` (Histogram) - Request prefill time.
- `vllm:request_decode_time_seconds` (Histogram) - Request decode time.

These are documented under [Inferencing and Serving -> Production Metrics](../usage/metrics.md).
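
To make the metric types above concrete, here is a minimal `prometheus_client` sketch of how `vllm:`-prefixed Gauge, Counter, and Histogram metrics can be declared and updated. This is illustrative only, not vLLM's actual implementation; the `model_name` label and the bucket boundaries are assumptions.

```python
# Illustrative sketch only, not vLLM's implementation: how "vllm:"-prefixed
# Gauge / Counter / Histogram metrics can be declared and updated with
# prometheus_client. The model_name label and bucket boundaries are assumptions.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

LABELS = ["model_name"]

num_requests_running = Gauge(
    "vllm:num_requests_running", "Number of requests currently running.", LABELS)
prompt_tokens_total = Counter(
    "vllm:prompt_tokens_total", "Total number of prompt tokens processed.", LABELS)
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds", "Time to first token (TTFT) in seconds.",
    LABELS, buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])

if __name__ == "__main__":
    start_http_server(8000)  # serve a Prometheus-compatible /metrics endpoint
    num_requests_running.labels(model_name="demo-model").set(3)
    prompt_tokens_total.labels(model_name="demo-model").inc(128)
    time_to_first_token.labels(model_name="demo-model").observe(0.42)
    time.sleep(60)  # keep the process alive long enough to be scraped
```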

@@ -86,7 +77,7 @@ See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pul

Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.

During those migrations we briefly lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):

```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
```
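
As a sketch of how this middleware is typically wired into a FastAPI app with `prometheus_fastapi_instrumentator` (not vLLM's exact code; the route below is a placeholder):

```python
# Sketch of the prometheus_fastapi_instrumentator pattern, not vLLM's exact
# wiring; the /v1/ping route is a placeholder handler.
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()


@app.get("/v1/ping")
async def ping() -> dict:
    return {"status": "ok"}


# Track http_* metrics for every handler and expose them at /metrics.
# Run with e.g.: uvicorn demo_app:app --port 8000
Instrumentator().instrument(app).expose(app)
```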

@@ -99,7 +90,9 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
### Multi-process Mode

Historically, metrics were collected in the engine core process and multiprocess mode was used to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.

More recently, metrics are collected in the API server process and multiprocess mode is only used when `--api-server-count > 1`. See <https://github.com/vllm-project/vllm/pull/17546> and details on [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
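
For reference, this is roughly what `prometheus_client`'s multiprocess mode involves: each process writes samples into a shared directory, and a collector aggregates them at scrape time. The directory and metric names below are illustrative, not vLLM's code.

```python
# Sketch of prometheus_client multiprocess mode (the mechanism implied by
# --api-server-count > 1). Directory and metric names are illustrative.
import os

os.makedirs("/tmp/prom_multiproc", exist_ok=True)
# Must be set before any metrics are created.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_multiproc")

from prometheus_client import CollectorRegistry, Counter, generate_latest, multiprocess

demo_requests = Counter("demo_requests_total", "Requests handled by this process.")
demo_requests.inc()

# At scrape time, aggregate the per-process sample files into one registry.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```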
### Built in Python/Process Metrics

@@ -116,14 +109,15 @@ The following metrics are supported by default by `prometheus_client`, but they

- `process_open_fds`
- `process_max_fds`

Therefore, these metrics are unavailable when `--api-server-count > 1`. It's questionable how relevant these are since they do not aggregate these stats for all processes that make up a vLLM instance.

## Metrics Design

The ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature was where much of the metrics design was planned. For example, see where [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).

### Legacy PRs

To help understand the background to the metrics design, here are some of the relevant PRs which added the original, now legacy, metrics:

- <https://github.com/vllm-project/vllm/pull/1890>
- <https://github.com/vllm-project/vllm/pull/2316>

@@ -131,14 +125,9 @@ For background, these are some of the relevant PRs which added the v0 metrics:

- <https://github.com/vllm-project/vllm/pull/4464>
- <https://github.com/vllm-project/vllm/pull/7279>

### Metrics Implementation PRs

For background, here are the relevant PRs relating to the metrics implementation <https://github.com/vllm-project/vllm/issues/10582>:

- <https://github.com/vllm-project/vllm/pull/11962>
- <https://github.com/vllm-project/vllm/pull/11973>

@@ -369,7 +358,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F

However, `prometheus_client` has [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We simply use a `Gauge` metric set to 1 and `multiprocess_mode="mostrecent"` instead.
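
A sketch of that pattern (the label set is a simplified assumption rather than the full `vllm:cache_config_info` schema, and `multiprocess_mode="mostrecent"` needs a reasonably recent `prometheus_client`):

```python
# Info-style metric emulated with a Gauge: config values become labels and the
# sample value is always 1. "mostrecent" keeps only the newest process's sample
# in multiprocess mode. The labels here are a simplified assumption.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information about the cache configuration.",
    labelnames=["block_size", "cache_dtype"],
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```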

@@ -396,9 +385,8 @@ recent metric is used, but only from currently running processes.

This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). If we revisit this design and deprecate the old metric, we should coordinate with downstream users so they can migrate before the removal.
### Prefix Cache metrics

@@ -478,22 +466,20 @@ us with:

```python
if seq_group.is_finished():
    if (seq_group.metrics.first_scheduled_time is not None and
            seq_group.metrics.first_token_time is not None):
        time_queue_requests.append(
            seq_group.metrics.first_scheduled_time -
            seq_group.metrics.arrival_time)
    ...
    if seq_group.metrics.time_in_queue is not None:
        time_in_queue_requests.append(
            seq_group.metrics.time_in_queue)
```

This seems duplicative, and one of them should be removed. The latter is used by the Grafana dashboard, so we should deprecate or remove the former.
### Prefix Cache Hit Rate

@@ -502,7 +488,7 @@ See above - we now expose 'queries' and 'hits' counters rather than a
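
In other words, the hit rate is derived from the two counters at query time rather than exported as a gauge. A minimal sketch of that derivation (the PromQL in the comment assumes the v1 counter names listed earlier):

```python
# Deriving a hit rate from 'queries' and 'hits' counters at query/report time.
# In Prometheus this would typically be something like:
#   increase(vllm:prefix_cache_hits[5m]) / increase(vllm:prefix_cache_queries[5m])
def hit_rate(hits: int, queries: int) -> float:
    """Hit rate over an interval; 0.0 when there were no queries."""
    return hits / queries if queries else 0.0


assert hit_rate(hits=75, queries=100) == 0.75
```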
### KV Cache Offloading

Two legacy metrics relate to a "swapped" preemption mode that is no longer relevant in v1:

- `vllm:num_requests_swapped`

@@ -513,7 +499,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU

memory. This is also known as "KV cache offloading" and is configured with `--swap-space` and `--preemption-mode`.

Historically, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The SequenceGroup encapsulated the idea of N Sequences which all shared the same prompt kv blocks. This enabled KV cache block sharing between requests, and copy-on-write to do branching. CPU

@@ -526,7 +512,7 @@ and the part of the prompt that was evicted can be recomputed.

SequenceGroup was removed in V1, although a replacement will be required for "parallel sampling" (`n>1`). [Beam search was moved out of the core](https://github.com/vllm-project/vllm/issues/8306). There was a lot of complex code for a very uncommon feature.

In V1, with prefix caching being better (zero overhead) and therefore

@@ -537,7 +523,7 @@ better.
### Parallel Sampling

Some legacy metrics are only relevant in the context of "parallel sampling". This is where the `n` parameter in a request is used to request multiple completions from the same prompt.

@@ -556,7 +542,7 @@ also add these metrics.
### Speculative Decoding

Some legacy metrics are specific to "speculative decoding". This is where we generate candidate tokens using a faster, approximate method or model and then validate those tokens with the larger model.

@@ -568,7 +554,7 @@ model and then validate those tokens with the larger model.

There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)" speculative decoding to v1. Other techniques will follow. We should revisit these metrics in this context.
!!! note
    We should probably expose acceptance rate as separate accepted and draft counters.
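
A sketch of what that could look like, with the acceptance rate derived at query time rather than exported directly (metric names follow the legacy list above; the descriptions and helper function are illustrative):

```python
# Illustrative sketch: export accepted/draft token counts as Counters and let
# the acceptance rate be derived at query time, e.g. in Prometheus:
#   increase(vllm:spec_decode_num_accepted_tokens_total[5m])
#     / increase(vllm:spec_decode_num_draft_tokens_total[5m])
from prometheus_client import Counter

accepted_tokens = Counter(
    "vllm:spec_decode_num_accepted_tokens_total",
    "Number of speculative draft tokens accepted by the target model.",
)
draft_tokens = Counter(
    "vllm:spec_decode_num_draft_tokens_total",
    "Number of speculative draft tokens proposed.",
)


def record_spec_decode_step(num_draft: int, num_accepted: int) -> None:
    """Record one speculative decoding step."""
    draft_tokens.inc(num_draft)
    accepted_tokens.inc(num_accepted)
```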

@@ -641,7 +627,7 @@ metrics are often relatively straightforward to add:

   metrics are usually of very limited use unless they can be enabled by default and in production.
3. They have an impact on development and maintenance of the project. Every metric added over time has made this effort more time-consuming, and perhaps not all metrics justify this ongoing investment in their maintenance.

@@ -652,24 +638,24 @@ performance and health. Tracing, on the other hand, tracks individual

requests as they move through different services and components. Both fall under the more general heading of "Observability".

vLLM has support for OpenTelemetry tracing:

- Added by <https://github.com/vllm-project/vllm/pull/4687> and reinstated by <https://github.com/vllm-project/vllm/pull/20372>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md)
- [Blog post](https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f)
- [IBM product docs](https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview)

OpenTelemetry has a [Gen AI Working Group](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md).

Since metrics is a big enough topic on its own, we consider the topic of tracing to be quite separate from metrics.
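
For orientation, here is a generic sketch of an OTLP trace exporter pointed at such an endpoint using the OpenTelemetry Python SDK. This is not vLLM's tracing code; the endpoint, service name, and span attribute are assumptions.

```python
# Generic OpenTelemetry OTLP tracing sketch, not vLLM's implementation.
# Endpoint, service name, and attributes are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "vllm-demo"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo")
with tracer.start_as_current_span("llm_request") as span:
    span.set_attribute("gen_ai.request.model", "demo-model")
```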
### OpenTelemetry Model Forward vs Execute Time

The current implementation exposes the following two metrics:

- `vllm:model_forward_time_milliseconds` (Histogram) - The time spent in the model forward pass when this request was in the batch.