# Benchmark KV Cache Offloading with Multi-Turn Conversations

The Python requirements for `benchmark_serving_multi_turn.py` are listed in `requirements.txt`; install them with `pip` before running the benchmark.
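For example (assuming you run from the directory that contains the benchmark script and its `requirements.txt`):

```bash
# Install the benchmark dependencies
pip install -r requirements.txt
```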

First, start serving your model:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests
```

The variable `MODEL_PATH` should point to the model files (e.g., downloaded from Hugging Face). The `--served-model-name` flag sets the name (`Llama` here) under which the model is exposed to clients.
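Once the server is up, you can optionally verify that it exposes the expected model name. This is a quick check assuming `vllm serve` runs on its default host and port (`localhost:8000`):

```bash
# List the models exposed by the OpenAI-compatible endpoint.
# The response should contain "Llama", the name set via --served-model-name.
curl http://localhost:8000/v1/models
```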

## Synthetic Multi-Turn Conversations

Download the following text file (used to generate the synthetic conversations):

```bash
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
```

The filename `pg1184.txt` is referenced in `generate_multi_turn.json` (see the `text_files` field). You may use other text files if you prefer; this specific file is not required (see the sketch below for one way to swap it).
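One way to point the generator at a different file is to rewrite the `text_files` entry in place. This is a hypothetical example: it assumes `jq` is installed and that `text_files` is a top-level key in `generate_multi_turn.json`; the filename `my_other_text.txt` is a placeholder and the rest of the file is left untouched.

```bash
# Replace the text source used for synthetic conversation generation
jq '.text_files = ["my_other_text.txt"]' generate_multi_turn.json > tmp.json \
    && mv tmp.json generate_multi_turn.json
```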

Then run the benchmarking script:

```bash
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6
```

You can edit `generate_multi_turn.json` to change the conversation parameters (number of turns, etc.).
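After editing, a quick way to catch syntax mistakes is to validate the file with the standard-library JSON parser:

```bash
# Prints an error and exits nonzero if generate_multi_turn.json is not valid JSON
python -m json.tool generate_multi_turn.json > /dev/null && echo "JSON OK"
```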

If successful, you will see output similar to the following (`ttft_ms` is time to first token, `tpot_ms` is time per output token; all latencies are in milliseconds):

```text
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------
```
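As a sanity check, the summary numbers are internally consistent: `requests_per_sec` is the request count divided by the runtime.

```bash
# 166 completed requests over 215.810 seconds
python -c "print(166 / 215.810)"  # prints ~0.769, matching requests_per_sec
```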

## ShareGPT Conversations

To run with real conversations, download the following ShareGPT dataset: <https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json>

Use the `convert_sharegpt_to_openai.py` script to convert the dataset to the format supported by `benchmark_serving_multi_turn.py`:

```bash
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
```

The script converts the ShareGPT dataset into one that uses the standard `user`/`assistant` roles. The flag `--max-items=128` samples 128 conversations from the original dataset (change as needed), and `--seed=99` fixes the random seed so the sample is reproducible.

Use the output JSON file `sharegpt_conv_128.json` as the `--input-file` for `benchmark_serving_multi_turn.py`.
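To spot-check the converted file, you can print the first conversation. This is a sketch that assumes the output is a JSON list of conversations whose messages use OpenAI-style `role`/`content` fields (as implied by the script name); the exact layout may differ.

```bash
# Print the first converted conversation (truncated for readability)
python -c "import json; print(json.dumps(json.load(open('sharegpt_conv_128.json'))[0], indent=2)[:800])"
```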