vLLM Benchmark suite improvement (#22119)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>
Louie Tsai 2025-08-14 00:12:17 -07:00 committed by GitHub
parent a353bd083d
commit 00e3f9da46
10 changed files with 447 additions and 147 deletions

View File

@ -7,7 +7,7 @@ This directory contains two sets of benchmark for vllm.
- Performance benchmark: benchmark vllm's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
- Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.

See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for the latest nightly benchmark results.

## Performance benchmark quick overview
@ -138,28 +138,20 @@ The raw benchmarking results (in the format of json files) are in the `Artifacts
The `compare-json-results.py` script compares benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json` files.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratios, e.g. for Output Tput, Median TTFT and Median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` compares the different TP and PP configurations within that file instead.
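For that single-file mode, the invocation only needs one `-f` argument; the path below is illustrative:

`python3 compare-json-results.py -f results/benchmark_results.json`

The script splits the file by (TP Size, PP Size) into per-configuration folders under `splits/` and then compares those configurations against each other.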
Here is an example using the script to compare result_a and result_b with Model, Dataset Name, input/output length, max concurrency and qps.

`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

|    | Model                                 | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------------|-----------|------------|----------------------|-----|----------------------------------|----------------------------------|------------|
| 0  | meta-llama/Meta-Llama-3.1-8B-Instruct | random       | 128       | 128        | 1000                 | 1   | 142.633982                       | 156.526018                       | 1.097396   |
| 1  | meta-llama/Meta-Llama-3.1-8B-Instruct | random       | 128       | 128        | 1000                 | inf | 241.620334                       | 294.018783                       | 1.216863   |

A comparison diagram will be generated below the table.
Here is an example comparing 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3:

<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />

## Nightly test details

View File

@ -1,24 +1,38 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import json
import os

import pandas as pd


def compare_data_columns(
    files, name_column, data_column, info_cols, drop_column, debug=False
):
    print("\ncompare_data_column: " + data_column)
    frames = []
    raw_data_cols = []
    compare_frames = []
    for file in files:
        data_df = pd.read_json(file)
        serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
        # Show all info columns in the first couple columns
        if not frames:
            for col in info_cols:
                if col not in serving_df.columns:
                    print(f"Skipping missing column: {col}")
                    continue
                frames.append(serving_df[col])
        # only show test name under debug mode
        if debug is True:
            serving_df = serving_df.rename(columns={name_column: file + "_name"})
            frames.append(serving_df[file + "_name"])
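        # label the data column with the parent folder of this result file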
        file = "/".join(file.split("/")[:-1])
        serving_df = serving_df.rename(columns={data_column: file})
        frames.append(serving_df[file])
        raw_data_cols.append(file)
        compare_frames.append(serving_df[file])
        if len(compare_frames) >= 2:
            # Compare numbers among two files
@ -27,7 +41,68 @@ def compare_data_columns(
            compare_frames.pop(1)

    concat_df = pd.concat(frames, axis=1)
    print(raw_data_cols)
    return concat_df, raw_data_cols


def split_json_by_tp_pp(
    input_file: str = "benchmark_results.json", output_root: str = "."
) -> list[str]:
    """
    Split a benchmark JSON into separate folders by (TP Size, PP Size).
    Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
    Returns: list of file paths written.
    """
    # Load JSON data into DataFrame
    with open(input_file, encoding="utf-8") as f:
        data = json.load(f)

    # If the JSON is a dict with a list under common keys, use that list
    if isinstance(data, dict):
        for key in ("results", "serving_results", "benchmarks", "data"):
            if isinstance(data.get(key), list):
                data = data[key]
                break

    df = pd.DataFrame(data)

    # Handle alias column names
    rename_map = {
        "tp_size": "TP Size",
        "tensor_parallel_size": "TP Size",
        "pp_size": "PP Size",
        "pipeline_parallel_size": "PP Size",
    }
    df.rename(
        columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
    )

    # Ensure TP/PP columns exist (default to 1 if missing)
    if "TP Size" not in df.columns:
        df["TP Size"] = 1
    if "PP Size" not in df.columns:
        df["PP Size"] = 1

    # make sure TP/PP are numeric ints with no NaN
    df["TP Size"] = (
        pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
    )
    df["PP Size"] = (
        pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
    )

    # Split into separate folders
    saved_paths: list[str] = []
    for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
        folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
        os.makedirs(folder_name, exist_ok=True)
        filepath = os.path.join(folder_name, "benchmark_results.json")
        group_df.to_json(filepath, orient="records", indent=2, force_ascii=False)
        print(f"Saved: {filepath}")
        saved_paths.append(filepath)

    return saved_paths


if __name__ == "__main__":
@ -36,31 +111,105 @@ if __name__ == "__main__":
        "-f", "--file", action="append", type=str, help="input file name"
    )
    parser.add_argument(
        "--debug", action="store_true", help="show all information for debugging"
    )
    parser.add_argument(
        "--plot",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="plot perf diagrams or not --no-plot --plot",
    )
    parser.add_argument(
        "-x",
        "--xaxis",
        type=str,
        default="# of max concurrency.",
        help="column name to use as X Axis in comparison graph",
    )
    args = parser.parse_args()
    drop_column = "P99"
    name_column = "Test name"
    info_cols = [
        "Model",
        "Dataset Name",
        "Input Len",
        "Output Len",
        "TP Size",
        "PP Size",
        "# of max concurrency.",
        "qps",
    ]
    data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
    html_msgs_for_data_cols = [
        "Compare Output Tokens /n",
        "Median TTFT /n",
        "Median TPOT /n",
    ]

    if len(args.file) == 1:
        files = split_json_by_tp_pp(args.file[0], output_root="splits")
        info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
    else:
        files = args.file
    print("comparing : " + ", ".join(files))

    debug = args.debug
    plot = args.plot
    # For Plot feature, assign y axis from one of info_cols
    y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6

    with open("perf_comparison.html", "w") as text_file:
        for i in range(len(data_cols_to_compare)):
            output_df, raw_data_cols = compare_data_columns(
                files,
                name_column,
                data_cols_to_compare[i],
                info_cols,
                drop_column,
                debug=debug,
            )

            # For Plot feature, insert y axis from one of info_cols
            raw_data_cols.insert(0, info_cols[y_axis_index])
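            # group rows by every info column except the last two (max concurrency
            # and qps), so each group collects one concurrency/qps sweep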
            filtered_info_cols = info_cols[:-2]
            existing_group_cols = [
                c for c in filtered_info_cols if c in output_df.columns
            ]
            if not existing_group_cols:
                raise ValueError(
                    f"No valid group-by columns "
                    f"Expected subset: {filtered_info_cols}, "
                    f"but DataFrame has: {list(output_df.columns)}"
                )
            output_df_sorted = output_df.sort_values(by=existing_group_cols)
            output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
            for name, group in output_groups:
                html = group.to_html()
                text_file.write(html_msgs_for_data_cols[i])
                text_file.write(html)

                if plot is True:
                    import pandas as pd
                    import plotly.express as px

                    df = group[raw_data_cols]
                    df_sorted = df.sort_values(by=info_cols[y_axis_index])
                    # Melt DataFrame for plotting
                    df_melted = df_sorted.melt(
                        id_vars=info_cols[y_axis_index],
                        var_name="Configuration",
                        value_name=data_cols_to_compare[i],
                    )
                    title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
                    # Create Plotly line chart
                    fig = px.line(
                        df_melted,
                        x=info_cols[y_axis_index],
                        y=data_cols_to_compare[i],
                        color="Configuration",
                        title=title,
                        markers=True,
                    )
                    # Export to HTML
                    text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))

View File

@ -1,17 +1,19 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import json
import os
import re
import shlex
from importlib import util
from pathlib import Path
from typing import Any

import pandas as pd
import psutil
from tabulate import tabulate

# latency results and the keys that will be printed into markdown
latency_results = []
latency_column_mapping = {
@ -42,14 +44,22 @@ throughput_results_column_mapping = {
serving_results = []
serving_column_mapping = {
    "test_name": "Test name",
    "model_id": "Model",
    "dataset_name": "Dataset Name",
    "input_len": "Input Len",
    "output_len": "Output Len",
    "tp_size": "TP Size",
    "pp_size": "PP Size",
    "dtype": "dtype",
    "gpu_type": "GPU",
    "completed": "# of req.",
    "qps": "qps",
    "max_concurrency": "# of max concurrency.",
    "request_throughput": "Tput (req/s)",
    "total_token_throughput": "Total Token Tput (tok/s)",
    "output_throughput": "Output Tput (tok/s)",
    # "total_input_tokens": "Total input tokens",
    # "total_output_tokens": "Total output tokens",
    "mean_ttft_ms": "Mean TTFT (ms)",
    "median_ttft_ms": "Median TTFT (ms)",
    "p99_ttft_ms": "P99 TTFT (ms)",
@ -94,7 +104,104 @@ def get_size_with_unit(bytes, suffix="B"):
        bytes /= factor


def _coerce(val: str) -> Any:
    """Best-effort type coercion from string to Python types."""
    low = val.lower()
    if low == "null":
        return None
    if low == "true":
        return True
    if low == "false":
        return False
    # integers
    if re.fullmatch(r"[+-]?\d+", val):
        try:
            return int(val)
        except ValueError:
            pass
    # floats (keep 'inf'/'-inf'/'nan' as strings)
    if re.fullmatch(r"[+-]?\d*\.\d+", val):
        try:
            return float(val)
        except ValueError:
            pass
    return val


def parse_client_command(cmd: str) -> dict[str, Any]:
    """Parse the client_command shell string into {executable, script, args}."""
    toks = shlex.split(cmd)
    if len(toks) < 2:
        raise ValueError("client_command must include an executable and a script")
    executable, script = toks[0], toks[1]
    args: dict[str, Any] = {}

    i = 2
    while i < len(toks):
        t = toks[i]
        if t.startswith("--"):
            # --key=value or --key (value) or boolean flag
            if "=" in t:
                key, val = t.split("=", 1)
                if key == "--metadata":
                    md = {}
                    if val:
                        if "=" in val:
                            k, v = val.split("=", 1)
                            md[k] = _coerce(v)
                        else:
                            md[val] = True
                    args[key] = md
                else:
                    args[key] = _coerce(val)
                i += 1
                continue

            key = t

            # Special: consume metadata k=v pairs until next --flag
            if key == "--metadata":
                i += 1
                md = {}
                while i < len(toks) and not toks[i].startswith("--"):
                    pair = toks[i]
                    if "=" in pair:
                        k, v = pair.split("=", 1)
                        md[k] = _coerce(v)
                    else:
                        md[pair] = True
                    i += 1
                args[key] = md
                continue

            # Standard: check if next token is a value (not a flag)
            if i + 1 < len(toks) and not toks[i + 1].startswith("--"):
                args[key] = _coerce(toks[i + 1])
                i += 2
            else:
                # lone flag -> True
                args[key] = True
                i += 1
        else:
            # unexpected positional; skip
            i += 1

    return {"executable": executable, "script": script, "args": args}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-r",
        "--result",
        type=str,
        default="results",
        help="Folder name for benchmark output results.",
    )
    args = parser.parse_args()
    results_folder = Path(args.result)
    if not results_folder.exists():
        raise FileNotFoundError(f"results folder does not exist: {results_folder}")

    # collect results
    for test_file in results_folder.glob("*.json"):
        with open(test_file) as f:
@ -102,7 +209,6 @@ if __name__ == "__main__":
if "serving" in str(test_file): if "serving" in str(test_file):
# this result is generated via `vllm bench serve` command # this result is generated via `vllm bench serve` command
# attach the benchmarking command to raw_result # attach the benchmarking command to raw_result
try: try:
with open(test_file.with_suffix(".commands")) as f: with open(test_file.with_suffix(".commands")) as f:
@ -110,12 +216,44 @@ if __name__ == "__main__":
            except OSError as e:
                print(e)
                continue

            # Parse Server Command Arg
            out: dict[str, Any] = {
                "server_command": parse_client_command(command["server_command"])
            }
            parse_args = [
                "--tensor-parallel-size",
                "--pipeline-parallel-size",
                "--dtype",
            ]
            col_mapping = ["tp_size", "pp_size", "dtype"]
            for index, arg in enumerate(parse_args):
                if arg in out["server_command"]["args"]:
                    raw_result.update(
                        {col_mapping[index]: out["server_command"]["args"][arg]}
                    )

            # Parse Client Command Arg
            out: dict[str, Any] = {
                "client_command": parse_client_command(command["client_command"])
            }
            parse_args = [
                "--dataset-name",
                "--random-input-len",
                "--random-output-len",
                "--request-rate",
            ]
            col_mapping = ["dataset_name", "input_len", "output_len", "qps"]
            for index, arg in enumerate(parse_args):
                if arg in out["client_command"]["args"]:
                    raw_result.update(
                        {col_mapping[index]: out["client_command"]["args"][arg]}
                    )

            # Add Server, Client command
            raw_result.update(command)

            # update the test name of this result
            raw_result.update({"test_name": test_file.stem})

            # add the result to raw_result
            serving_results.append(raw_result)
            continue
@ -205,7 +343,10 @@ if __name__ == "__main__":
            columns=latency_column_mapping
        )
    if not serving_results.empty:
        valid_columns = [
            col for col in serving_column_mapping if col in serving_results.columns
        ]
        serving_results = serving_results[valid_columns].rename(
            columns=serving_column_mapping
        )
    if not throughput_results.empty:
@ -245,7 +386,9 @@ if __name__ == "__main__":
    )

    # document the result
    md_file = "benchmark_results.md"
    json_file = "benchmark_results.json"
    with open(results_folder / md_file, "w") as f:
        results = read_markdown(
            "../.buildkite/nightly-benchmarks/"
            + "performance-benchmarks-descriptions.md"
@ -260,7 +403,7 @@ if __name__ == "__main__":
        f.write(results)

    # document benchmarking results in json
    with open(results_folder / json_file, "w") as f:
        results = (
            latency_results.to_dict(orient="records")
            + throughput_results.to_dict(orient="records")

View File

@ -194,9 +194,11 @@ run_latency_tests() {
  # check if there is enough GPU to run the test
  tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
  if [ "$ON_CPU" == "1" ]; then
    pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size')
    world_size=$(($tp*$pp))
    if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
      echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
      continue
    fi
  else
@ -261,9 +263,11 @@ run_throughput_tests() {
  # check if there is enough GPU to run the test
  tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
  if [ "$ON_CPU" == "1" ]; then
    pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size')
    world_size=$(($tp*$pp))
    if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
      echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
      continue
    fi
  else
@ -329,12 +333,21 @@ run_serving_tests() {
  qps_list=$(echo "$params" | jq -r '.qps_list')
  qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
  echo "Running over qps list $qps_list"

  max_concurrency_list=$(echo "$params" | jq -r '.max_concurrency_list')
  if [[ -z "$max_concurrency_list" || "$max_concurrency_list" == "null" ]]; then
    num_prompts=$(echo "$client_params" | jq -r '.num_prompts')
    max_concurrency_list="[$num_prompts]"
  fi
  max_concurrency_list=$(echo "$max_concurrency_list" | jq -r '.[] | @sh')
  echo "Running over max concurrency list $max_concurrency_list"
  # check if there is enough resources to run the test
  tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
  if [ "$ON_CPU" == "1" ]; then
    pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size')
    world_size=$(($tp*$pp))
    if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
      echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
      continue
    fi
  else
@ -390,35 +403,39 @@ run_serving_tests() {
      echo "now qps is $qps"
    fi

    # iterate over different max_concurrency
    for max_concurrency in $max_concurrency_list; do
      new_test_name=$test_name"_qps_"$qps"_concurrency_"$max_concurrency
      echo " new test name $new_test_name"

      # pass the tensor parallel size to the client so that it can be displayed
      # on the benchmark dashboard
      client_command="vllm bench serve \
        --save-result \
        --result-dir $RESULTS_FOLDER \
        --result-filename ${new_test_name}.json \
        --request-rate $qps \
        --max-concurrency $max_concurrency \
        --metadata "tensor_parallel_size=$tp" \
        $client_args $client_remote_args "

      echo "Running test case $test_name with qps $qps"
      echo "Client command: $client_command"

      bash -c "$client_command"

      # record the benchmarking commands
      jq_output=$(jq -n \
        --arg server "$server_command" \
        --arg client "$client_command" \
        --arg gpu "$gpu_type" \
        '{
          server_command: $server,
          client_command: $client,
          gpu_type: $gpu
        }')
      echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
    done
  done

  # clean up

View File

@ -6,7 +6,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
@ -20,7 +20,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "load_format": "dummy",
            "num_iters_warmup": 5,

View File

@ -1,7 +1,8 @@
[
    {
        "test_name": "serving_llama8B_tp1_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -10,7 +11,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -23,17 +24,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp2_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -42,7 +43,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 2,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -55,17 +56,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp4_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -74,7 +75,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -87,17 +88,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp1_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -106,7 +107,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -120,19 +121,19 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 1000
        }
    },
    {
        "test_name": "serving_llama8B_tp2_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -141,7 +142,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 2,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -155,19 +156,19 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 1000
        }
    },
    {
        "test_name": "serving_llama8B_tp4_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -176,7 +177,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -190,13 +191,11 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "num_prompts": 1000
        }
    }

View File

@ -1,7 +1,8 @@
[
    {
        "test_name": "serving_llama8B_pp1_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -10,7 +11,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "pipeline_parallel_size": 1,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -23,17 +24,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_pp3_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -42,7 +43,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "pipeline_parallel_size": 3,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -55,17 +56,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp2pp3_sharegpt",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -74,7 +75,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 3,
            "dtype": "bfloat16",
@ -88,17 +89,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_pp1_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -107,7 +108,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "pipeline_parallel_size": 1,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -121,28 +122,28 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 1000
        }
    },
    {
        "test_name": "serving_llama8B_pp3_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
            "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
            "VLLM_CPU_SGL_KERNEL": 1,
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "pipeline_parallel_size": 3,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -156,19 +157,19 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 1000
        }
    },
    {
        "test_name": "serving_llama8B_tp2pp3_random_128_128",
        "qps_list": ["inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -177,7 +178,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 2,
            "pipeline_parallel_size": 3,
            "dtype": "bfloat16",
@ -192,13 +193,12 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 128,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 1000
        }
    }

View File

@ -2,6 +2,7 @@
    {
        "test_name": "serving_llama8B_tp1_sharegpt",
        "qps_list": [1, 4, 16, "inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -10,7 +11,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -23,17 +24,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp2_sharegpt",
        "qps_list": [1, 4, 16, "inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -42,7 +43,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 2,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -55,17 +56,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp4_sharegpt",
        "qps_list": [1, 4, 16, "inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -74,7 +75,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -87,17 +88,17 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "sharegpt",
            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
            "num_prompts": 200
        }
    },
    {
        "test_name": "serving_llama8B_tp4_random_1024_128",
        "qps_list": [1, 4, 16, "inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -106,7 +107,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -120,19 +121,19 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 1024,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 100
        }
    },
    {
        "test_name": "serving_llama8B_pp6_random_1024_128",
        "qps_list": [1, 4, 16, "inf"],
        "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
        "server_environment_variables": {
            "VLLM_RPC_TIMEOUT": 100000,
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@ -141,7 +142,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "server_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "pipeline_parallel_size": 6,
            "dtype": "bfloat16",
            "distributed_executor_backend": "mp",
@ -155,13 +156,12 @@
            "load_format": "dummy"
        },
        "client_parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "backend": "vllm",
            "dataset_name": "random",
            "random-input-len": 1024,
            "random-output-len": 128,
            "ignore-eos": "",
            "num_prompts": 100
        }
    }

View File

@ -6,7 +6,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
@ -21,7 +21,7 @@
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "load_format": "dummy",
            "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",

View File

@ -11,7 +11,7 @@ vLLM contains two sets of benchmarks:
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.

The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).

More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).