[docs] Remove _total from counter metrics names (#30028)
In Prometheus, Counters always expose their actual numeric value with a metric name that ends in _total. We should document the base name, as this is what appears in the get_metrics() API.
Signed-off-by: CYJiang <86391540+googs1025@users.noreply.github.com>
parent 404fc4bfc0
commit fd68e909db
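As a minimal sketch of the naming behavior described in the commit message, the snippet below registers a counter under one of the base names touched by this change and prints the text exposition. The registry, help text, and the `generate_latest()` call are illustrative only, not vLLM code.

```python
# Sketch only: shows how prometheus_client exposes a Counter registered under
# a base name with a "_total"-suffixed sample name in the scrape output.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
finished_requests = Counter(
    "vllm:request_success",                     # base name, as documented
    "Count of successfully finished requests",  # illustrative help text
    registry=registry,
)
finished_requests.inc()

print(generate_latest(registry).decode())
# The scraped sample carries the "_total" suffix, e.g.:
#   vllm:request_success_total 1.0
```

Scrapers and PromQL queries therefore see `vllm:request_success_total`, while the documentation and the get_metrics() API use the base name.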
@@ -57,15 +57,15 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu
 The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
 
 - `vllm:e2e_request_latency_seconds_bucket` - End to end request latency measured in seconds.
-- `vllm:prompt_tokens_total` - Prompt tokens.
-- `vllm:generation_tokens_total` - Generation tokens.
+- `vllm:prompt_tokens` - Prompt tokens.
+- `vllm:generation_tokens` - Generation tokens.
 - `vllm:time_per_output_token_seconds` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
 - `vllm:time_to_first_token_seconds` - Time to First Token (TTFT) latency in seconds.
 - `vllm:num_requests_running` (also, `_swapped` and `_waiting`) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
 - `vllm:gpu_cache_usage_perc` - Percentage of used cache blocks by vLLM.
 - `vllm:request_prompt_tokens` - Request prompt length.
 - `vllm:request_generation_tokens` - Request generation length.
-- `vllm:request_success_total` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
+- `vllm:request_success` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
 - `vllm:request_queue_time_seconds` - Queue time.
 - `vllm:request_prefill_time_seconds` - Requests prefill time.
 - `vllm:request_decode_time_seconds` - Requests decode time.
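To check which of these dashboard metrics a running server actually exposes, the Prometheus endpoint can be scraped and parsed directly. The sketch below assumes a local OpenAI-compatible server serving metrics at `http://localhost:8000/metrics`, and uses the `prometheus_client` parser rather than any vLLM API.

```python
# Sketch: scrape a running vLLM server's /metrics endpoint and list the vLLM
# metric families found there. The server URL below is an assumption.
import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics", timeout=5).text

for family in text_string_to_metric_families(text):
    if not family.name.startswith("vllm:"):
        continue
    # For counters, family.name is the base name (e.g. "vllm:prompt_tokens"),
    # while the individual sample names carry the "_total" suffix.
    sample_names = sorted({sample.name for sample in family.samples})
    print(family.name, family.type, sample_names)
```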
@@ -571,9 +571,9 @@ model and then validate those tokens with the larger model.
 
 - `vllm:spec_decode_draft_acceptance_rate` (Gauge)
 - `vllm:spec_decode_efficiency` (Gauge)
-- `vllm:spec_decode_num_accepted_tokens_total` (Counter)
-- `vllm:spec_decode_num_draft_tokens_total` (Counter)
-- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
+- `vllm:spec_decode_num_accepted_tokens` (Counter)
+- `vllm:spec_decode_num_draft_tokens` (Counter)
+- `vllm:spec_decode_num_emitted_tokens` (Counter)
 
 There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)"
 speculative decoding to v1. Other techniques will follow. We should
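As a side note on how these metrics relate, the acceptance-rate gauge corresponds to the ratio of accepted to draft tokens; a tiny sketch with placeholder values (not real measurements):

```python
# Sketch: the draft acceptance rate is the ratio of accepted to proposed draft
# tokens. The values below are placeholders standing in for the two counters.
num_accepted_tokens = 730.0   # vllm:spec_decode_num_accepted_tokens_total
num_draft_tokens = 1000.0     # vllm:spec_decode_num_draft_tokens_total

acceptance_rate = (
    num_accepted_tokens / num_draft_tokens if num_draft_tokens else 0.0
)
print(f"draft acceptance rate: {acceptance_rate:.2%}")  # -> 73.00%
```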