[Doc] Reorganize benchmark docs (#29658)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung 2025-11-28 17:19:25 +08:00 committed by GitHub
parent 5f5521bd5d
commit ccbdf51bd5
5 changed files with 291 additions and 292 deletions


@ -52,6 +52,11 @@ nav:
- Plugins:
- design/*plugin*.md
- design/*
- Benchmarking:
- benchmarking/README.md
- benchmarking/cli.md
- benchmarking/sweeps.md
- benchmarking/dashboard.md
- API Reference:
- api/README.md
- api/vllm


@ -0,0 +1,7 @@
# Benchmark Suites
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
- **[Benchmark CLI](./cli.md)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing.
- **[Parameter Sweeps](./sweeps.md)**: Automate `vllm bench` runs for multiple configurations, useful for [optimization and tuning](../configuration/optimization.md).
- **[Performance Dashboard](./dashboard.md)**: Automated CI that publishes benchmarks on each commit.


@ -1,22 +1,10 @@
---
toc_depth: 4
---
# Benchmark CLI
# Benchmark Suites
This section guides you through running benchmark tests with the extensive datasets supported on vLLM.
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
It's a living document, updated as new features and datasets become available.
- **[Benchmark CLI](#benchmark-cli)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
- **[Parameter sweeps](#parameter-sweeps)**: Automate `vllm bench` runs for multiple configurations
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development
## Benchmark CLI
This section guides you through running benchmark tests with the extensive
datasets supported on vLLM. It's a living document, updated as new features and datasets
become available.
### Dataset Overview
## Dataset Overview
<style>
th {
@ -59,9 +47,9 @@ Legend:
--dataset-path /datasets/VisionArena-Chat/ --hf-name lmarena-ai/VisionArena-Chat
```
### Examples
## Examples
#### 🚀 Online Benchmark
### 🚀 Online Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
@ -112,7 +100,7 @@ P99 ITL (ms): 8.39
==================================================
```
##### Custom Dataset
#### Custom Dataset
If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl
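For illustration, a minimal `data.jsonl` could be created as follows (the prompts are placeholders, not part of the original example):

```bash
# Each line is a JSON object with a "prompt" field.
cat > data.jsonl << 'EOF'
{"prompt": "What is the capital of France?"}
{"prompt": "Summarize the plot of Hamlet in two sentences."}
{"prompt": "Write a Python function that reverses a string."}
EOF
```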
@ -145,7 +133,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \
You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.
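For example, a sketch of a benchmark run over the custom dataset with chat-template skipping enabled (the model name, port, and paths are placeholders, and `--dataset-name custom` is assumed to select `CustomDataset`):

```bash
# Serve any chat model on port 9001, then benchmark against data.jsonl
# without re-applying the chat template to each prompt.
vllm bench serve --port 9001 --save-result --save-detailed \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name custom \
  --dataset-path ./data.jsonl \
  --custom-skip-chat-template \
  --num-prompts 100
```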
##### VisionArena Benchmark for Vision Language Models
#### VisionArena Benchmark for Vision Language Models
```bash
# need a model with vision capability here
@ -163,7 +151,7 @@ vllm bench serve \
--num-prompts 1000
```
##### InstructCoder Benchmark with Speculative Decoding
#### InstructCoder Benchmark with Speculative Decoding
``` bash
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
@ -180,7 +168,7 @@ vllm bench serve \
--num-prompts 2048
```
##### Spec Bench Benchmark with Speculative Decoding
#### Spec Bench Benchmark with Speculative Decoding
``` bash
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
@ -217,7 +205,7 @@ vllm bench serve \
--spec-bench-category "summarization"
```
##### Other HuggingFaceDataset Examples
#### Other HuggingFaceDataset Examples
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct
@ -283,7 +271,7 @@ vllm bench serve \
--blazedit-max-distance 0.99
```
##### Running With Sampling Parameters
#### Running With Sampling Parameters
When using OpenAI-compatible backends such as `vllm`, optional sampling
parameters can be specified. Example client command:
@ -301,7 +289,7 @@ vllm bench serve \
--num-prompts 10
```
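A minimal sketch of such a command, assuming the standard sampling flags of `vllm bench serve` (`--temperature`, `--top-p`, `--top-k`); the model and dataset values are placeholders:

```bash
# Sampling parameters are forwarded to the OpenAI-compatible backend.
vllm bench serve \
  --backend vllm \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --temperature 0.5 \
  --top-p 0.9 \
  --top-k 10 \
  --num-prompts 10
```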
##### Running With Ramp-Up Request Rate
#### Running With Ramp-Up Request Rate
The benchmark tool also supports ramping up the request rate over the
duration of the benchmark run. This can be useful for stress testing the
@ -318,11 +306,11 @@ The following arguments can be used to control the ramp-up:
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
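For example, a sketch that ramps the request rate linearly from 1 RPS to 20 RPS over the run (the `linear` strategy value and the model/dataset shown are assumptions for illustration):

```bash
# Gradually increase load to observe when latency targets start to slip.
vllm bench serve \
  --backend vllm \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 20
```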
##### Load Pattern Configuration
#### Load Pattern Configuration
vLLM's benchmark serving script provides sophisticated load pattern simulation capabilities through three key parameters that control request generation and concurrency behavior:
###### Load Pattern Control Parameters
##### Load Pattern Control Parameters
- `--request-rate`: Controls the target request generation rate (requests per second). Set to `inf` for maximum throughput testing or finite values for controlled load simulation.
- `--burstiness`: Controls traffic variability using a Gamma distribution (range: > 0). Lower values create bursty traffic, higher values create uniform traffic.
@ -387,7 +375,7 @@ Using KV cache metrics for load pattern configuration:
</details>
#### 📈 Offline Throughput Benchmark
### 📈 Offline Throughput Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
@ -408,7 +396,7 @@ Total num prompt tokens: 5014
Total num output tokens: 1500
```
##### VisionArena Benchmark for Vision Language Models
#### VisionArena Benchmark for Vision Language Models
```bash
vllm bench throughput \
@ -428,7 +416,7 @@ Total num prompt tokens: 14527
Total num output tokens: 1280
```
##### InstructCoder Benchmark with Speculative Decoding
#### InstructCoder Benchmark with Speculative Decoding
``` bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
@ -451,7 +439,7 @@ Total num prompt tokens: 261136
Total num output tokens: 204800
```
##### Other HuggingFaceDataset Examples
#### Other HuggingFaceDataset Examples
`lmms-lab/LLaVA-OneVision-Data`:
@ -509,20 +497,20 @@ vllm bench throughput \
</details>
#### 🛠️ Structured Output Benchmark
### 🛠️ Structured Output Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
Benchmark the performance of structured output generation (JSON, grammar, regex).
##### Server Setup
#### Server Setup
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
```
##### JSON Schema Benchmark
#### JSON Schema Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@ -534,7 +522,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000
```
##### Grammar-based Generation Benchmark
#### Grammar-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@ -546,7 +534,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000
```
##### Regex-based Generation Benchmark
#### Regex-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@ -557,7 +545,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000
```
##### Choice-based Generation Benchmark
#### Choice-based Generation Benchmark
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@ -568,7 +556,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \
--num-prompts 1000
```
##### XGrammar Benchmark Dataset
#### XGrammar Benchmark Dataset
```bash
python3 benchmarks/benchmark_serving_structured_output.py \
@ -581,14 +569,14 @@ python3 benchmarks/benchmark_serving_structured_output.py \
</details>
#### 📚 Long Document QA Benchmark
### 📚 Long Document QA Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
Benchmark the performance of long document question-answering with prefix caching.
##### Basic Long Document QA Test
#### Basic Long Document QA Test
```bash
python3 benchmarks/benchmark_long_document_qa_throughput.py \
@ -600,7 +588,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
--repeat-count 5
```
##### Different Repeat Modes
#### Different Repeat Modes
```bash
# Random mode (default) - shuffle prompts randomly
@ -633,14 +621,14 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
</details>
#### 🗂️ Prefix Caching Benchmark
### 🗂️ Prefix Caching Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
Benchmark the efficiency of automatic prefix caching.
##### Fixed Prompt with Prefix Caching
#### Fixed Prompt with Prefix Caching
```bash
python3 benchmarks/benchmark_prefix_caching.py \
@ -651,7 +639,7 @@ python3 benchmarks/benchmark_prefix_caching.py \
--input-length-range 128:256
```
##### ShareGPT Dataset with Prefix Caching
#### ShareGPT Dataset with Prefix Caching
```bash
# download dataset
@ -682,14 +670,14 @@ vllm bench serve \
</details>
#### ⚡ Request Prioritization Benchmark
### ⚡ Request Prioritization Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
Benchmark the performance of request prioritization in vLLM.
##### Basic Prioritization Test
#### Basic Prioritization Test
```bash
python3 benchmarks/benchmark_prioritization.py \
@ -700,7 +688,7 @@ python3 benchmarks/benchmark_prioritization.py \
--scheduling-policy priority
```
##### Multiple Sequences per Prompt
#### Multiple Sequences per Prompt
```bash
python3 benchmarks/benchmark_prioritization.py \
@ -714,14 +702,14 @@ python3 benchmarks/benchmark_prioritization.py \
</details>
#### 👁️ Multi-Modal Benchmark
### 👁️ Multi-Modal Benchmark
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
Benchmark the performance of multi-modal requests in vLLM.
##### Images (ShareGPT4V)
#### Images (ShareGPT4V)
Start vLLM:
@ -747,7 +735,7 @@ vllm bench serve \
--endpoint /v1/chat/completions
```
##### Videos (ShareGPT4Video)
#### Videos (ShareGPT4Video)
Start vLLM:
@ -773,7 +761,7 @@ vllm bench serve \
--endpoint /v1/chat/completions
```
##### Synthetic Random Images (random-mm)
#### Synthetic Random Images (random-mm)
Generate synthetic image inputs alongside random text prompts to stress-test vision models without external datasets.
@ -846,14 +834,14 @@ This should be seen as an edge case, and if this behavior can be avoided by sett
</details>
#### Embedding Benchmark
### Embedding Benchmark
Benchmark the performance of embedding requests in vLLM.
<details class="admonition abstract" markdown="1">
<summary>Show more</summary>
##### Text Embeddings
#### Text Embeddings
Unlike generative models, which use the Completions API or Chat Completions API,
you should set `--backend openai-embeddings` and `--endpoint /v1/embeddings` to use the Embeddings API.
@ -879,7 +867,7 @@ vllm bench serve \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json
```
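Putting the pieces together, a sketch of a full text-embedding benchmark (the embedding model shown is only an example; any embedding model served by vLLM can be substituted):

```bash
# Serve an embedding model first, e.g.:
#   vllm serve intfloat/e5-mistral-7b-instruct
vllm bench serve \
  --backend openai-embeddings \
  --endpoint /v1/embeddings \
  --model intfloat/e5-mistral-7b-instruct \
  --dataset-name sharegpt \
  --dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100
```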
##### Multi-modal Embeddings
#### Multi-modal Embeddings
Unlike generative models, which use the Completions API or Chat Completions API,
you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backend to use depends on the model:
@ -944,7 +932,7 @@ vllm bench serve \
</details>
#### Reranker Benchmark
### Reranker Benchmark
Benchmark the performance of rerank requests in vLLM.
@ -988,240 +976,3 @@ to account for the extra prompt which is the query. The token accounting to repo
throughput numbers correctly is also adjusted.
</details>
## Parameter Sweeps
### Online Benchmark
[`vllm/benchmarks/sweep/serve.py`](../../vllm/benchmarks/sweep/serve.py) automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
Follow these steps to run the script:
1. Construct the base command to `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command to `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
```json
[
{
"max_num_seqs": 32,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 4096
},
{
"max_num_seqs": 256,
"max_num_batched_tokens": 4096
}
]
```
4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
- Example: Using different input/output lengths for random dataset:
```json
[
{
"random_input_len": 128,
"random_output_len": 32
},
{
"random_input_len": 256,
"random_output_len": 64
},
{
"random_input_len": 512,
"random_output_len": 128
}
]
```
5. Determine where you want to save the results, and pass that to `--output-dir`.
Example command:
```bash
vllm bench sweep serve \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
```
!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
You can use `--dry-run` to preview the commands to be run.
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
If you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
### SLA Auto-Tuner
[`vllm/benchmarks/sweep/serve_sla.py`](../../vllm/benchmarks/sweep/serve_sla.py) is a wrapper over [`vllm/benchmarks/sweep/serve.py`](../../vllm/benchmarks/sweep/serve.py) that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
For example, to ensure E2E latency within different target values for 99% of requests:
```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
Example command:
```bash
vllm bench sweep serve_sla \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
-o benchmarks/results
```
The algorithm for adjusting the SLA variable is as follows:
1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
- For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
### Visualizer
[`vllm/benchmarks/sweep/plot.py`](../../vllm/benchmarks/sweep/plot.py) can be used to plot performance curves from parameter sweep results.
Example command:
```bash
vllm bench sweep plot benchmarks/results/<timestamp> \
--var-x max_concurrency \
--row-by random_input_len \
--col-by random_output_len \
--curve-by api_server_count,max_num_batched_tokens \
--filter-by 'max_concurrency<=1024'
```
!!! tip
You can use `--dry-run` to preview the figures to be plotted.
### Pareto visualization (tokens/s/user vs tokens/s/GPU)
`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.
Higher concurrency or batch size can raise GPU efficiency (per-GPU) but adds per-user latency; lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.
- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise `gpu_count` is TP×PP×DP).
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
- Labels: use `--label-by` to show the configuration used for each data point (default: `max_concurrency,gpu_count`).
Example:
```bash
vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```
## Performance Benchmarks
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
### Manually Trigger the benchmark
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with the vLLM benchmark suite.
For a CPU environment, use the image with the "-cpu" suffix.
Here is an example `docker run` command for CPU:
```bash
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
```
Then, run the command below inside the Docker container.
```bash
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
When run, the benchmark script generates results under the **benchmark/results** folder, along with `benchmark_results.md` and `benchmark_results.json`.
#### Runtime environment variables
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
For more on visualizing the results, see [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).
### Continuous Benchmarking
Continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.
#### How It Works
Continuous benchmarking is triggered via a [GitHub CI workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:
- **Serving tests**: Measure request handling and API performance
- **Throughput tests**: Evaluate token generation rates
- **Latency tests**: Assess response time characteristics
#### Benchmark Configuration
The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:
1. Navigate to the appropriate GPU directory in the benchmarks configuration
2. Add your model specifications to the corresponding configuration files
3. The new models will be included in the next scheduled benchmark run
#### Viewing Results
All continuous benchmarking results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).


@ -0,0 +1,58 @@
# Performance Dashboard
The performance dashboard is used to confirm whether new changes improve or degrade performance under various workloads.
It is updated by triggering benchmark runs on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
The results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
## Manually Trigger the benchmark
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with the vLLM benchmark suite.
For a CPU environment, use the image with the "-cpu" suffix.
Here is an example `docker run` command for CPU:
```bash
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
```
Then, run the command below inside the Docker container.
```bash
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```
When run, the benchmark script generates results under the **benchmark/results** folder, along with `benchmark_results.md` and `benchmark_results.json`.
### Runtime environment variables
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
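For instance, a sketch of running the script with some of these variables set inline (the values are illustrative):

```bash
# Run the CPU benchmarks against a remote vLLM service with a custom serving config.
ON_CPU=1 \
SERVING_JSON=/path/to/custom-serving-tests.json \
REMOTE_HOST=10.0.0.5 REMOTE_PORT=8000 \
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
```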
For more on visualizing the results, see [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).
## Continuous Benchmarking
Continuous benchmarking provides automated performance monitoring for vLLM across different models and GPU devices. This helps track vLLM's performance characteristics over time and identify any performance regressions or improvements.
### How It Works
Continuous benchmarking is triggered via a [GitHub CI workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) in the PyTorch infrastructure repository, which runs automatically every 4 hours. The workflow executes three types of performance tests:
- **Serving tests**: Measure request handling and API performance
- **Throughput tests**: Evaluate token generation rates
- **Latency tests**: Assess response time characteristics
### Benchmark Configuration
The benchmarking currently runs on a predefined set of models configured in the [vllm-benchmarks directory](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks). To add new models for benchmarking:
1. Navigate to the appropriate GPU directory in the benchmarks configuration
2. Add your model specifications to the corresponding configuration files
3. The new models will be included in the next scheduled benchmark run

docs/benchmarking/sweeps.md (new file)

@ -0,0 +1,178 @@
# Parameter Sweeps
## Online Benchmark
### Basic
`vllm bench sweep serve` automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
Follow these steps to run the script:
1. Construct the base command to `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command to `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
```json
[
{
"max_num_seqs": 32,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 1024
},
{
"max_num_seqs": 64,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 2048
},
{
"max_num_seqs": 128,
"max_num_batched_tokens": 4096
},
{
"max_num_seqs": 256,
"max_num_batched_tokens": 4096
}
]
```
4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
- Example: Using different input/output lengths for random dataset:
```json
[
{
"random_input_len": 128,
"random_output_len": 32
},
{
"random_input_len": 256,
"random_output_len": 64
},
{
"random_input_len": 512,
"random_output_len": 128
}
]
```
5. Determine where you want to save the results, and pass that to `--output-dir`.
Example command:
```bash
vllm bench sweep serve \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
-o benchmarks/results
```
!!! important
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
You can use `--dry-run` to preview the commands to be run.
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
If you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
!!! note
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
!!! tip
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
### SLA auto-tuner
`vllm bench sweep serve_sla` is a wrapper over `vllm bench sweep serve` that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
For example, to ensure E2E latency within different target values for 99% of requests:
```json
[
{
"p99_e2el_ms": "<=200"
},
{
"p99_e2el_ms": "<=500"
},
{
"p99_e2el_ms": "<=1000"
},
{
"p99_e2el_ms": "<=2000"
}
]
```
Example command:
```bash
vllm bench sweep serve_sla \
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
--serve-params benchmarks/serve_hparams.json \
--bench-params benchmarks/bench_hparams.json \
--sla-params benchmarks/sla_hparams.json \
--sla-variable max_concurrency \
-o benchmarks/results
```
The algorithm for adjusting the SLA variable is as follows:
1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
- For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
## Visualization
### Basic
`vllm bench sweep plot` can be used to plot performance curves from parameter sweep results.
Example command:
```bash
vllm bench sweep plot benchmarks/results/<timestamp> \
--var-x max_concurrency \
--row-by random_input_len \
--col-by random_output_len \
--curve-by api_server_count,max_num_batched_tokens \
--filter-by 'max_concurrency<=1024'
```
!!! tip
You can use `--dry-run` to preview the figures to be plotted.
### Pareto chart
`vllm bench sweep plot_pareto` helps pick configurations that balance per-user and per-GPU throughput.
Higher concurrency or batch size can raise GPU efficiency (per-GPU) but adds per-user latency; lower concurrency improves the per-user rate but underutilizes GPUs. The Pareto frontier shows the best achievable pairs across your runs.
- x-axis: tokens/s/user = `output_throughput` ÷ concurrency (`--user-count-var`, default `max_concurrency`, fallback `max_concurrent_requests`).
- y-axis: tokens/s/GPU = `output_throughput` ÷ GPU count (`--gpu-count-var` if set; otherwise `gpu_count` is TP×PP×DP).
- Output: a single figure at `OUTPUT_DIR/pareto/PARETO.png`.
- Labels: use `--label-by` to show the configuration used for each data point (default: `max_concurrency,gpu_count`).
Example:
```bash
vllm bench sweep plot_pareto benchmarks/results/<timestamp> \
--label-by max_concurrency,tensor_parallel_size,pipeline_parallel_size
```