# Benchmark KV Cache Offloading with Multi-Turn Conversations

The pip requirements for `benchmark_serving_multi_turn.py` can be found in `requirements.txt`.

First, start serving your model:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests
```

The variable `MODEL_PATH` should be a path to the model files (e.g. downloaded from Hugging Face).
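
Once the server is up, you can sanity-check it before benchmarking (assuming vLLM's default port 8000; adjust if you pass `--port`):

```bash
# List the served models; the response should include "Llama"
curl http://localhost:8000/v1/models
```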
## Synthetic Multi-Turn Conversations

Download the following text file (used to generate the synthetic conversations):

```bash
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
```

The filename `pg1184.txt` is used in `generate_multi_turn.json` (see `"text_files"`), but you may use other text files if you prefer (this specific file is not required).
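
For reference, the relevant entry in `generate_multi_turn.json` looks something like this (a sketch of just this field, not the whole file):

```json
{
    "text_files": ["pg1184.txt"]
}
```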

Then run the benchmarking script:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
    --input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6
```

You can edit `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.), as described in the sections below.

If successful, you will see output similar to the following:

```bash
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------
```

If you run with `--warmup-step`, the summary will also include `warmup_runtime_sec` and `total_runtime_incl_warmup_sec` (while `runtime_sec` continues to reflect the benchmark-only runtime, so the reported throughput stays comparable).
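
For example (a sketch reusing the command from above; `--warmup-step` is shown as a bare flag here, check the script's `--help` for its exact usage):

```bash
python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
    --input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6 \
    --warmup-step
```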

### JSON configuration file for synthetic conversation generation

The flag `--input-file` determines the input conversations for the benchmark.<br/>
When the input is a JSON file with the field `"filetype": "generate_conversations"`, the tool will generate synthetic multi-turn (question and answer) conversations.

The file `generate_multi_turn.json` is an example of such a file.

The file must contain the sections `prompt_input` and `prompt_output` (a sketch of a full file appears after the field descriptions below).

The `prompt_input` section must contain `num_turns`, `prefix_num_tokens` and `num_tokens`:

* `num_turns` - Total number of turns in the conversation (both user & assistant).<br/>
The final value will always be rounded to an even number so each user turn has a reply.
* `prefix_num_tokens` - Tokens added at the start of only the **first user turn** in a conversation (unique per conversation).
* `num_tokens` - Total token length of each **user** message (one turn).

The `prompt_output` section must contain `num_tokens`:

* `num_tokens` - Total token length of each **assistant** message (one turn).
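
Putting it together, a minimal sketch of such a file might look like this (the distribution values are arbitrary; see `generate_multi_turn.json` for a complete, authoritative example):

```json
{
    "filetype": "generate_conversations",
    "text_files": ["pg1184.txt"],
    "prompt_input": {
        "num_turns": {"distribution": "uniform", "min": 12, "max": 18},
        "prefix_num_tokens": {"distribution": "lognormal", "average": 1000, "max": 5000},
        "num_tokens": {"distribution": "uniform", "min": 120, "max": 160}
    },
    "prompt_output": {
        "num_tokens": {"distribution": "uniform", "min": 80, "max": 120}
    }
}
```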

### Random distributions for synthetic conversation generation

When creating an input JSON file (such as `generate_multi_turn.json`),<br/>
every numeric field (such as `num_turns` or `num_tokens`) requires a distribution.<br/>
The distribution determines how values for the field are randomly sampled.

The available distributions are listed below.

**Note:** The optional `max` field (for lognormal, zipf, and poisson) caps sampled values at an upper bound.<br/>
It can be used to make sure that the total number of tokens in every request does not exceed `--max-model-len`.

#### constant

```json
{
    "distribution": "constant",
    "value": 500
}
```

* `value` - the fixed integer value (always returns the same number).

#### uniform

```json
{
    "distribution": "uniform",
    "min": 12,
    "max": 18
}
```

* `min` - minimum value (inclusive).
* `max` - maximum value (inclusive); must be greater than or equal to `min`.

#### lognormal

```json
{
    "distribution": "lognormal",
    "average": 1000,
    "max": 5000
}
```

You can parameterize the lognormal distribution in one of two ways:

Using the average and optional median ratio:

* `average` - target average value of the distribution.
* `median_ratio` - the ratio of the median to the average; controls the skewness. Must be in the range (0, 1).

Using the parameters of the underlying normal distribution:

* `mean` - mean of the underlying normal distribution.
* `sigma` - standard deviation of the underlying normal distribution.
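
For example, the second parameterization might look like this (the values are illustrative, not equivalent to the example above):

```json
{
    "distribution": "lognormal",
    "mean": 6.5,
    "sigma": 0.8,
    "max": 5000
}
```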

#### zipf

```json
{
    "distribution": "zipf",
    "alpha": 1.2,
    "max": 100
}
```

* `alpha` - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.
#### poisson
|
|
|
|
```json
|
|
{
|
|
"distribution": "poisson",
|
|
"alpha": 10,
|
|
"max": 50
|
|
}
|
|
```
|
|
|
|
* `alpha` - expected value (λ). Also the variance of the distribution.
|
|
|
|
## ShareGPT Conversations

To run with the ShareGPT data, download the following ShareGPT dataset:
`https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json`

Use the `convert_sharegpt_to_openai.py` script to convert the dataset to the format expected by `benchmark_serving_multi_turn.py`:

```bash
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
```

The script converts the ShareGPT dataset to a dataset with the standard user/assistant roles.
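
The converted conversations use the familiar OpenAI-style chat layout, roughly like the sketch below (the exact top-level structure is defined by `convert_sharegpt_to_openai.py`):

```json
[
    {
        "messages": [
            {"role": "user", "content": "..."},
            {"role": "assistant", "content": "..."}
        ]
    }
]
```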

The flag `--max-items=128` samples 128 conversations from the original dataset (change as needed).

Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.