mirror of https://git.datalinker.icu/vllm-project/vllm.git
synced 2025-12-10 07:15:01 +08:00

vLLM Benchmark suite improvement (#22119)

Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com>

This commit is contained in:
parent a353bd083d
commit 00e3f9da46
@@ -7,7 +7,7 @@ This directory contains two sets of benchmark for vllm.
 
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
+See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
@@ -138,28 +138,20 @@ The raw benchmarking results (in the format of json files) are in the `Artifacts
 
 The `compare-json-results.py` script helps compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
 When run, the benchmark script generates results under the `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
 `compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratios, e.g. for Output Tput, Median TTFT and Median TPOT.
+If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
 
-Here is an example using the script to compare result_a and result_b without detail test name.
-`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json --ignore_test_name`
-
-|    | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
-|----|----------------------------------------|----------------------------------------|----------|
-| 0  | 142.633982 | 156.526018 | 1.097396 |
-| 1  | 241.620334 | 294.018783 | 1.216863 |
-| 2  | 218.298905 | 262.664916 | 1.203235 |
-| 3  | 242.743860 | 299.816190 | 1.235113 |
-
-Here is an example using the script to compare result_a and result_b with detail test name.
+Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
 `python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
 
-|   | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
-|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
-| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
-| 1 | serving_llama8B_tp1_sharegpt_qps_16 | 241.620334 | serving_llama8B_tp1_sharegpt_qps_16 | 294.018783 | 1.216863 |
-| 2 | serving_llama8B_tp1_sharegpt_qps_4 | 218.298905 | serving_llama8B_tp1_sharegpt_qps_4 | 262.664916 | 1.203235 |
-| 3 | serving_llama8B_tp1_sharegpt_qps_inf | 242.743860 | serving_llama8B_tp1_sharegpt_qps_inf | 299.816190 | 1.235113 |
-| 4 | serving_llama8B_tp2_random_1024_128_qps_1 | 96.613390 | serving_llama8B_tp4_random_1024_128_qps_1 | 108.404853 | 1.122048 |
+|   | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
+|---|-------|--------------|-----------|------------|----------------------|-----|----------------------------------|----------------------------------|------------|
+| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1   | 142.633982 | 156.526018 | 1.097396 |
+| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf | 241.620334 | 294.018783 | 1.216863 |
+
+A comparison diagram will be generated below the table.
+Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3:
+<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
 
 ## Nightly test details
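The perf_ratio column in the tables above is a plain element-wise division of the candidate metric by the baseline metric; a minimal pandas sketch, where `a` and `b` are hypothetical stand-ins for the metric columns loaded from the two result files:

```python
import pandas as pd

# Hypothetical Output Tput values for the same tests from two result files
a = pd.Series([142.633982, 241.620334], name="results_a")  # baseline
b = pd.Series([156.526018, 294.018783], name="results_b")  # candidate

# perf_ratio > 1 means the candidate run achieves higher throughput
perf_ratio = b / a
print(perf_ratio.tolist())
```

The series index plays the role of the test row, so aligned rows must describe the same test configuration in both files.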
@@ -1,24 +1,38 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import argparse
+import json
+import os
 
 import pandas as pd
 
 
 def compare_data_columns(
-    files, name_column, data_column, drop_column, ignore_test_name=False
+    files, name_column, data_column, info_cols, drop_column, debug=False
 ):
     print("\ncompare_data_column: " + data_column)
     frames = []
+    raw_data_cols = []
     compare_frames = []
     for file in files:
         data_df = pd.read_json(file)
         serving_df = data_df.dropna(subset=[drop_column], ignore_index=True)
-        if ignore_test_name is False:
+        # Show all info columns in the first couple columns
+        if not frames:
+            for col in info_cols:
+                if col not in serving_df.columns:
+                    print(f"Skipping missing column: {col}")
+                    continue
+                frames.append(serving_df[col])
+        # only show test name under debug mode
+        if debug is True:
             serving_df = serving_df.rename(columns={name_column: file + "_name"})
             frames.append(serving_df[file + "_name"])
+
+        file = "/".join(file.split("/")[:-1])
         serving_df = serving_df.rename(columns={data_column: file})
         frames.append(serving_df[file])
+        raw_data_cols.append(file)
         compare_frames.append(serving_df[file])
         if len(compare_frames) >= 2:
             # Compare numbers among two files
@@ -27,7 +41,68 @@ def compare_data_columns(
             compare_frames.pop(1)
 
     concat_df = pd.concat(frames, axis=1)
-    return concat_df
+    print(raw_data_cols)
+    return concat_df, raw_data_cols
+
+
+def split_json_by_tp_pp(
+    input_file: str = "benchmark_results.json", output_root: str = "."
+) -> list[str]:
+    """
+    Split a benchmark JSON into separate folders by (TP Size, PP Size).
+
+    Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
+    Returns: list of file paths written.
+    """
+    # Load JSON data into DataFrame
+    with open(input_file, encoding="utf-8") as f:
+        data = json.load(f)
+
+    # If the JSON is a dict with a list under common keys, use that list
+    if isinstance(data, dict):
+        for key in ("results", "serving_results", "benchmarks", "data"):
+            if isinstance(data.get(key), list):
+                data = data[key]
+                break
+
+    df = pd.DataFrame(data)
+
+    # Handle alias column names
+    rename_map = {
+        "tp_size": "TP Size",
+        "tensor_parallel_size": "TP Size",
+        "pp_size": "PP Size",
+        "pipeline_parallel_size": "PP Size",
+    }
+    df.rename(
+        columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
+    )
+
+    # Ensure TP/PP columns exist (default to 1 if missing)
+    if "TP Size" not in df.columns:
+        df["TP Size"] = 1
+    if "PP Size" not in df.columns:
+        df["PP Size"] = 1
+
+    # make sure TP/PP are numeric ints with no NaN
+    df["TP Size"] = (
+        pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
+    )
+    df["PP Size"] = (
+        pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
+    )
+
+    # Split into separate folders
+    saved_paths: list[str] = []
+    for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
+        folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
+        os.makedirs(folder_name, exist_ok=True)
+        filepath = os.path.join(folder_name, "benchmark_results.json")
+        group_df.to_json(filepath, orient="records", indent=2, force_ascii=False)
+        print(f"Saved: {filepath}")
+        saved_paths.append(filepath)
+
+    return saved_paths
 
 
 if __name__ == "__main__":
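The grouping step at the heart of `split_json_by_tp_pp` can be exercised in isolation; this condensed, self-contained sketch (record values are made up for illustration) writes one folder per (TP, PP) pair, mirroring the function's output layout:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical benchmark records with differing parallelism settings
records = [
    {"Test name": "a", "TP Size": 1, "PP Size": 1, "Output Tput (tok/s)": 100.0},
    {"Test name": "b", "TP Size": 2, "PP Size": 1, "Output Tput (tok/s)": 180.0},
]
df = pd.DataFrame(records)

out_root = tempfile.mkdtemp()
saved = []
# One folder per (TP Size, PP Size) combination, as in split_json_by_tp_pp
for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
    folder = os.path.join(out_root, f"tp{int(tp)}_pp{int(pp)}")
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, "benchmark_results.json")
    group_df.to_json(path, orient="records", indent=2, force_ascii=False)
    saved.append(path)

print(saved)  # one benchmark_results.json per tp/pp combination
```

Each split file can then be fed back to `compare-json-results.py` as if it were an independent run, which is how the single-file comparison mode works.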
@@ -36,31 +111,105 @@ if __name__ == "__main__":
         "-f", "--file", action="append", type=str, help="input file name"
     )
     parser.add_argument(
-        "--ignore_test_name", action="store_true", help="ignore_test_name or not"
+        "--debug", action="store_true", help="show all information for debugging"
+    )
+    parser.add_argument(
+        "--plot",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="plot perf diagrams or not (--plot / --no-plot)",
+    )
+    parser.add_argument(
+        "-x",
+        "--xaxis",
+        type=str,
+        default="# of max concurrency.",
+        help="column name to use as X axis in comparison graph",
     )
     args = parser.parse_args()
-    files = args.file
-    print("comparing : " + ", ".join(files))
 
     drop_column = "P99"
     name_column = "Test name"
+    info_cols = [
+        "Model",
+        "Dataset Name",
+        "Input Len",
+        "Output Len",
+        "TP Size",
+        "PP Size",
+        "# of max concurrency.",
+        "qps",
+    ]
     data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
     html_msgs_for_data_cols = [
         "Compare Output Tokens /n",
         "Median TTFT /n",
         "Median TPOT /n",
     ]
-    ignore_test_name = args.ignore_test_name
+
+    if len(args.file) == 1:
+        files = split_json_by_tp_pp(args.file[0], output_root="splits")
+        info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
+    else:
+        files = args.file
+    print("comparing : " + ", ".join(files))
+
+    debug = args.debug
+    plot = args.plot
+    # For the plot feature, pick the x-axis column from info_cols
+    y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6
     with open("perf_comparison.html", "w") as text_file:
         for i in range(len(data_cols_to_compare)):
-            output_df = compare_data_columns(
+            output_df, raw_data_cols = compare_data_columns(
                 files,
                 name_column,
                 data_cols_to_compare[i],
+                info_cols,
                 drop_column,
-                ignore_test_name=ignore_test_name,
+                debug=debug,
             )
-            print(output_df)
-            html = output_df.to_html()
-            text_file.write(html_msgs_for_data_cols[i])
-            text_file.write(html)
+
+            # For the plot feature, insert the x-axis column into the raw data columns
+            raw_data_cols.insert(0, info_cols[y_axis_index])
+
+            filtered_info_cols = info_cols[:-2]
+            existing_group_cols = [
+                c for c in filtered_info_cols if c in output_df.columns
+            ]
+            if not existing_group_cols:
+                raise ValueError(
+                    f"No valid group-by columns. "
+                    f"Expected subset: {filtered_info_cols}, "
+                    f"but DataFrame has: {list(output_df.columns)}"
+                )
+
+            output_df_sorted = output_df.sort_values(by=existing_group_cols)
+            output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
+            for name, group in output_groups:
+                html = group.to_html()
+                text_file.write(html_msgs_for_data_cols[i])
+                text_file.write(html)
+
+                if plot is True:
+                    import pandas as pd
+                    import plotly.express as px
+
+                    df = group[raw_data_cols]
+                    df_sorted = df.sort_values(by=info_cols[y_axis_index])
+                    # Melt DataFrame for plotting
+                    df_melted = df_sorted.melt(
+                        id_vars=info_cols[y_axis_index],
+                        var_name="Configuration",
+                        value_name=data_cols_to_compare[i],
+                    )
+                    title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
+                    # Create Plotly line chart
+                    fig = px.line(
+                        df_melted,
+                        x=info_cols[y_axis_index],
+                        y=data_cols_to_compare[i],
+                        color="Configuration",
+                        title=title,
+                        markers=True,
+                    )
+                    # Export to HTML
+                    text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))
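The melt step that feeds the Plotly chart above can be checked without plotly installed; the column names below follow the script's conventions, but the values are made up:

```python
import pandas as pd

# One metric column per result folder, indexed by the x-axis value
df = pd.DataFrame(
    {
        "# of max concurrency.": [12, 24],
        "results_a": [100.0, 150.0],
        "results_b": [110.0, 170.0],
    }
)

# Melt to long form: one row per (x value, configuration) pair,
# which is the shape px.line expects for a multi-line chart
df_melted = df.melt(
    id_vars="# of max concurrency.",
    var_name="Configuration",
    value_name="Output Tput (tok/s)",
)
print(df_melted.shape)  # (4, 3)
```

Each distinct value in the `Configuration` column becomes one line in the resulting chart.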
@@ -1,17 +1,19 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
+import argparse
 import json
 import os
+import re
+import shlex
 from importlib import util
 from pathlib import Path
+from typing import Any
 
 import pandas as pd
 import psutil
 from tabulate import tabulate
 
-results_folder = Path("results/")
-
 # latency results and the keys that will be printed into markdown
 latency_results = []
 latency_column_mapping = {
@@ -42,14 +44,22 @@ throughput_results_column_mapping = {
 serving_results = []
 serving_column_mapping = {
     "test_name": "Test name",
+    "model_id": "Model",
+    "dataset_name": "Dataset Name",
+    "input_len": "Input Len",
+    "output_len": "Output Len",
+    "tp_size": "TP Size",
+    "pp_size": "PP Size",
+    "dtype": "dtype",
     "gpu_type": "GPU",
     "completed": "# of req.",
+    "qps": "qps",
     "max_concurrency": "# of max concurrency.",
     "request_throughput": "Tput (req/s)",
     "total_token_throughput": "Total Token Tput (tok/s)",
     "output_throughput": "Output Tput (tok/s)",
-    "total_input_tokens": "Total input tokens",
+    # "total_input_tokens": "Total input tokens",
-    "total_output_tokens": "Total output tokens",
+    # "total_output_tokens": "Total output tokens",
     "mean_ttft_ms": "Mean TTFT (ms)",
     "median_ttft_ms": "Median TTFT (ms)",
     "p99_ttft_ms": "P99 TTFT (ms)",
@@ -94,7 +104,104 @@ def get_size_with_unit(bytes, suffix="B"):
         bytes /= factor
 
 
+def _coerce(val: str) -> Any:
+    """Best-effort type coercion from string to Python types."""
+    low = val.lower()
+    if low == "null":
+        return None
+    if low == "true":
+        return True
+    if low == "false":
+        return False
+    # integers
+    if re.fullmatch(r"[+-]?\d+", val):
+        try:
+            return int(val)
+        except ValueError:
+            pass
+    # floats (keep 'inf'/'-inf'/'nan' as strings)
+    if re.fullmatch(r"[+-]?\d*\.\d+", val):
+        try:
+            return float(val)
+        except ValueError:
+            pass
+    return val
+
+
+def parse_client_command(cmd: str) -> dict[str, Any]:
+    """Parse the client_command shell string into {executable, script, args}."""
+    toks = shlex.split(cmd)
+    if len(toks) < 2:
+        raise ValueError("client_command must include an executable and a script")
+    executable, script = toks[0], toks[1]
+    args: dict[str, Any] = {}
+
+    i = 2
+    while i < len(toks):
+        t = toks[i]
+        if t.startswith("--"):
+            # --key=value or --key (value) or boolean flag
+            if "=" in t:
+                key, val = t.split("=", 1)
+                if key == "--metadata":
+                    md = {}
+                    if val:
+                        if "=" in val:
+                            k, v = val.split("=", 1)
+                            md[k] = _coerce(v)
+                        else:
+                            md[val] = True
+                    args[key] = md
+                else:
+                    args[key] = _coerce(val)
+                i += 1
+                continue
+
+            key = t
+
+            # Special: consume metadata k=v pairs until next --flag
+            if key == "--metadata":
+                i += 1
+                md = {}
+                while i < len(toks) and not toks[i].startswith("--"):
+                    pair = toks[i]
+                    if "=" in pair:
+                        k, v = pair.split("=", 1)
+                        md[k] = _coerce(v)
+                    else:
+                        md[pair] = True
+                    i += 1
+                args[key] = md
+                continue
+
+            # Standard: check if next token is a value (not a flag)
+            if i + 1 < len(toks) and not toks[i + 1].startswith("--"):
+                args[key] = _coerce(toks[i + 1])
+                i += 2
+            else:
+                # lone flag -> True
+                args[key] = True
+                i += 1
+        else:
+            # unexpected positional; skip
+            i += 1
+
+    return {"executable": executable, "script": script, "args": args}
+
+
 if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "-r",
+        "--result",
+        type=str,
+        default="results",
+        help="Folder name for benchmark output results.",
+    )
+    args = parser.parse_args()
+    results_folder = Path(args.result)
+    if not results_folder.exists():
+        raise FileNotFoundError(f"results folder does not exist: {results_folder}")
     # collect results
     for test_file in results_folder.glob("*.json"):
         with open(test_file) as f:
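The flag-walking logic in `parse_client_command` can be sketched in a condensed, self-contained form; this simplified version (my own reduction for illustration, without the `--metadata` special case or type coercion) shows the three token shapes it handles:

```python
import shlex
from typing import Any


def parse_flags(cmd: str) -> dict[str, Any]:
    # Simplified sketch of parse_client_command: handles --key value,
    # --key=value, and lone boolean flags; no --metadata handling.
    toks = shlex.split(cmd)
    args: dict[str, Any] = {}
    i = 2  # skip executable and script, as the real parser does
    while i < len(toks):
        t = toks[i]
        if t.startswith("--"):
            if "=" in t:
                key, val = t.split("=", 1)
                args[key] = val
                i += 1
            elif i + 1 < len(toks) and not toks[i + 1].startswith("--"):
                args[t] = toks[i + 1]  # flag followed by a value token
                i += 2
            else:
                args[t] = True  # lone flag -> True
                i += 1
        else:
            i += 1  # unexpected positional; skip

    return args


cmd = "vllm bench serve --dataset-name random --request-rate 4 --save-result"
print(parse_flags(cmd))
```

Using `shlex.split` rather than `str.split` keeps quoted arguments (such as metadata strings) intact as single tokens.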
@@ -102,7 +209,6 @@ if __name__ == "__main__":
 
         if "serving" in str(test_file):
             # this result is generated via `vllm bench serve` command
-
             # attach the benchmarking command to raw_result
             try:
                 with open(test_file.with_suffix(".commands")) as f:
@@ -110,12 +216,44 @@ if __name__ == "__main__":
             except OSError as e:
                 print(e)
                 continue
+
+            # Parse server command args
+            out: dict[str, Any] = {
+                "server_command": parse_client_command(command["server_command"])
+            }
+            parse_args = [
+                "--tensor-parallel-size",
+                "--pipeline-parallel-size",
+                "--dtype",
+            ]
+            col_mapping = ["tp_size", "pp_size", "dtype"]
+            for index, arg in enumerate(parse_args):
+                if arg in out["server_command"]["args"]:
+                    raw_result.update(
+                        {col_mapping[index]: out["server_command"]["args"][arg]}
+                    )
+
+            # Parse client command args
+            out: dict[str, Any] = {
+                "client_command": parse_client_command(command["client_command"])
+            }
+            parse_args = [
+                "--dataset-name",
+                "--random-input-len",
+                "--random-output-len",
+                "--request-rate",
+            ]
+            col_mapping = ["dataset_name", "input_len", "output_len", "qps"]
+
+            for index, arg in enumerate(parse_args):
+                if arg in out["client_command"]["args"]:
+                    raw_result.update(
+                        {col_mapping[index]: out["client_command"]["args"][arg]}
+                    )
+            # Add server and client commands
             raw_result.update(command)
 
             # update the test name of this result
             raw_result.update({"test_name": test_file.stem})
 
             # add the result to raw_result
             serving_results.append(raw_result)
             continue
@@ -205,7 +343,10 @@ if __name__ == "__main__":
             columns=latency_column_mapping
         )
     if not serving_results.empty:
-        serving_results = serving_results[list(serving_column_mapping.keys())].rename(
+        valid_columns = [
+            col for col in serving_column_mapping if col in serving_results.columns
+        ]
+        serving_results = serving_results[valid_columns].rename(
             columns=serving_column_mapping
         )
     if not throughput_results.empty:
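The `valid_columns` guard above keeps the column selection from raising `KeyError` when older result files lack the newly added fields; a minimal reproduction with a shortened, hypothetical mapping:

```python
import pandas as pd

# Shortened stand-in for the script's serving_column_mapping
serving_column_mapping = {
    "test_name": "Test name",
    "qps": "qps",
    "tp_size": "TP Size",
}
# An older result file that predates the tp_size field
serving_results = pd.DataFrame({"test_name": ["t1"], "qps": [4]})

# Keep only mapping keys that actually exist in the DataFrame
valid_columns = [c for c in serving_column_mapping if c in serving_results.columns]
renamed = serving_results[valid_columns].rename(columns=serving_column_mapping)
print(list(renamed.columns))  # ['Test name', 'qps']
```

Selecting with the full `list(serving_column_mapping.keys())` would raise on the missing `tp_size` column; `rename` itself silently ignores keys it does not find.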
@@ -245,7 +386,9 @@ if __name__ == "__main__":
         )
 
     # document the result
-    with open(results_folder / "benchmark_results.md", "w") as f:
+    md_file = "benchmark_results.md"
+    json_file = "benchmark_results.json"
+    with open(results_folder / md_file, "w") as f:
         results = read_markdown(
             "../.buildkite/nightly-benchmarks/"
             + "performance-benchmarks-descriptions.md"
@@ -260,7 +403,7 @@ if __name__ == "__main__":
         f.write(results)
 
     # document benchmarking results in json
-    with open(results_folder / "benchmark_results.json", "w") as f:
+    with open(results_folder / json_file, "w") as f:
         results = (
             latency_results.to_dict(orient="records")
             + throughput_results.to_dict(orient="records")
@@ -194,9 +194,11 @@ run_latency_tests() {
 
         # check if there is enough GPU to run the test
         tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
-        if [ "$ON_CPU" == "1" ];then
+        if [ "$ON_CPU" == "1" ]; then
-            if [[ $numa_count -lt $tp ]]; then
-                echo "Required tensor-parallel-size $tp but only $numa_count NUMA nodes found. Skip testcase $test_name."
+            pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size')
+            world_size=$(($tp*$pp))
+            if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
+                echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
                 continue
             fi
         else
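The skip condition above boils down to comparing world_size = tp × pp against the available NUMA node count, with an escape hatch when a remote host is configured; a sketch of the same check in Python (argument values are illustrative):

```python
def should_skip(tp: int, pp: int, numa_count: int, remote_host: str = "") -> bool:
    # Mirror of the bash check: skip when the world size exceeds the
    # available NUMA nodes and no remote host is configured.
    world_size = tp * pp
    return numa_count < world_size and not remote_host


print(should_skip(tp=2, pp=3, numa_count=4))  # True: world size 6 > 4 nodes
print(should_skip(tp=2, pp=3, numa_count=4, remote_host="node2"))  # False
```

The `REMOTE_HOST` exemption lets multi-node CPU runs request a larger world size than a single machine can provide.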
@@ -261,9 +263,11 @@ run_throughput_tests() {
 
         # check if there is enough GPU to run the test
         tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
-        if [ "$ON_CPU" == "1" ];then
+        if [ "$ON_CPU" == "1" ]; then
-            if [[ $numa_count -lt $tp ]]; then
-                echo "Required tensor-parallel-size $tp but only $numa_count NUMA nodes found. Skip testcase $test_name."
+            pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size')
+            world_size=$(($tp*$pp))
+            if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
+                echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
                 continue
             fi
         else
@@ -329,12 +333,21 @@ run_serving_tests() {
         qps_list=$(echo "$params" | jq -r '.qps_list')
         qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
         echo "Running over qps list $qps_list"
+        max_concurrency_list=$(echo "$params" | jq -r '.max_concurrency_list')
+        if [[ -z "$max_concurrency_list" || "$max_concurrency_list" == "null" ]]; then
+            num_prompts=$(echo "$client_params" | jq -r '.num_prompts')
+            max_concurrency_list="[$num_prompts]"
+        fi
+        max_concurrency_list=$(echo "$max_concurrency_list" | jq -r '.[] | @sh')
+        echo "Running over max concurrency list $max_concurrency_list"
 
         # check if there is enough resources to run the test
         tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
-        if [ "$ON_CPU" == "1" ];then
+        if [ "$ON_CPU" == "1" ]; then
-            if [[ $numa_count -lt $tp ]]; then
-                echo "Required tensor-parallel-size $tp but only $numa_count NUMA nodes found. Skip testcase $test_name."
+            pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size')
+            world_size=$(($tp*$pp))
+            if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
+                echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
                 continue
             fi
         else
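The `max_concurrency_list` fallback above (default to `[num_prompts]` when the key is absent or null) can be sketched in Python, with a hypothetical params dict standing in for the jq queries; in the actual script, `num_prompts` comes from the client parameters rather than the same object:

```python
import json

# Hypothetical merged test parameters; max_concurrency_list is absent
params = json.loads('{"qps_list": [1, "inf"], "num_prompts": 200}')

max_concurrency_list = params.get("max_concurrency_list")
if not max_concurrency_list:  # missing, null, or empty
    # Fall back to a single concurrency equal to the prompt count,
    # which is effectively "no extra concurrency cap"
    max_concurrency_list = [params["num_prompts"]]

print(max_concurrency_list)  # [200]
```

With the fallback in place, older test configuration files that never mention `max_concurrency_list` keep their previous behavior.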
@@ -390,35 +403,39 @@ run_serving_tests() {
                 echo "now qps is $qps"
             fi
 
-            new_test_name=$test_name"_qps_"$qps
-
-            # pass the tensor parallel size to the client so that it can be displayed
-            # on the benchmark dashboard
-            client_command="vllm bench serve \
-                --save-result \
-                --result-dir $RESULTS_FOLDER \
-                --result-filename ${new_test_name}.json \
-                --request-rate $qps \
-                --metadata "tensor_parallel_size=$tp" \
-                $client_args $client_remote_args "
-
-            echo "Running test case $test_name with qps $qps"
-            echo "Client command: $client_command"
-
-            bash -c "$client_command"
-
-            # record the benchmarking commands
-            jq_output=$(jq -n \
-                --arg server "$server_command" \
-                --arg client "$client_command" \
-                --arg gpu "$gpu_type" \
-                '{
-                server_command: $server,
-                client_command: $client,
-                gpu_type: $gpu
-                }')
-            echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
+            # iterate over different max_concurrency
+            for max_concurrency in $max_concurrency_list; do
+                new_test_name=$test_name"_qps_"$qps"_concurrency_"$max_concurrency
+                echo "  new test name $new_test_name"
+                # pass the tensor parallel size to the client so that it can be displayed
+                # on the benchmark dashboard
+                client_command="vllm bench serve \
+                    --save-result \
+                    --result-dir $RESULTS_FOLDER \
+                    --result-filename ${new_test_name}.json \
+                    --request-rate $qps \
+                    --max-concurrency $max_concurrency \
+                    --metadata "tensor_parallel_size=$tp" \
+                    $client_args $client_remote_args "
+
+                echo "Running test case $test_name with qps $qps"
+                echo "Client command: $client_command"
+
+                bash -c "$client_command"
+
+                # record the benchmarking commands
+                jq_output=$(jq -n \
+                    --arg server "$server_command" \
+                    --arg client "$client_command" \
+                    --arg gpu "$gpu_type" \
+                    '{
+                    server_command: $server,
+                    client_command: $client,
+                    gpu_type: $gpu
+                    }')
+                echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
+            done
         done
 
         # clean up
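Under the nested loop above, one serving test entry fans out into one result file per (qps, max_concurrency) pair; the naming scheme can be sketched as:

```python
# Values taken from the serving test config shown below in this diff
test_name = "serving_llama8B_tp1_sharegpt"
qps_list = ["inf"]
max_concurrency_list = [12, 16, 24]

# One result file per (qps, max_concurrency) combination
names = [
    f"{test_name}_qps_{qps}_concurrency_{mc}"
    for qps in qps_list
    for mc in max_concurrency_list
]
print(names[0])  # serving_llama8B_tp1_sharegpt_qps_inf_concurrency_12
print(len(names))  # 3
```

The `.json` result and `.commands` metadata files for each run share this stem, which is how the markdown converter pairs them back up.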
@@ -6,7 +6,7 @@
             "VLLM_CPU_KVCACHE_SPACE": 40
         },
         "parameters": {
-            "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+            "model": "meta-llama/Llama-3.1-8B-Instruct",
             "tensor_parallel_size": 1,
             "load_format": "dummy",
             "num_iters_warmup": 5,
@ -20,7 +20,7 @@
|
|||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"parameters": {
|
"parameters": {
|
||||||
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
"tensor_parallel_size": 4,
|
"tensor_parallel_size": 4,
|
||||||
"load_format": "dummy",
|
"load_format": "dummy",
|
||||||
"num_iters_warmup": 5,
|
"num_iters_warmup": 5,
|
||||||
|
```diff
@@ -1,7 +1,8 @@
 [
 {
 "test_name": "serving_llama8B_tp1_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -10,7 +11,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -23,17 +24,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp2_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -42,7 +43,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 2,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -55,17 +56,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp4_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -74,7 +75,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 4,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -87,17 +88,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp1_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -106,7 +107,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -120,19 +121,19 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 },
 {
 "test_name": "serving_llama8B_tp2_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -141,7 +142,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 2,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -155,19 +156,19 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 },
 {
 "test_name": "serving_llama8B_tp4_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -176,7 +177,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 4,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -190,13 +191,11 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
-"ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 }
```
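These serving configs switch from sweeping request rates (`"qps_list": [1, 4, 16, "inf"]`) to pinning the rate at `"inf"` and sweeping `max_concurrency_list` instead, so load is bounded by the number of in-flight requests rather than by arrival rate, and the scalar `max_concurrency` in `client_parameters` becomes redundant. A minimal sketch of how a runner might expand one such config into client runs (the loop and variable names are illustrative, not the actual harness):

```shell
# Expand one serving test into one client run per (qps, max_concurrency) pair.
# The list values mirror the JSON above but are written as shell word lists.
qps_list="inf"
max_concurrency_list="12 16 24 32 64 128 200"
total=0
for qps in $qps_list; do
  for max_concurrency in $max_concurrency_list; do
    # A real harness would launch something like:
    #   vllm bench serve --request-rate $qps --max-concurrency $max_concurrency ...
    total=$((total + 1))
  done
done
echo "planned client runs: $total"
```

With `qps_list` reduced to a single `"inf"` entry, the run count scales only with the length of `max_concurrency_list` (seven runs per test here).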
```diff
@@ -1,7 +1,8 @@
 [
 {
 "test_name": "serving_llama8B_pp1_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -10,7 +11,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "pipeline_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -23,17 +24,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_pp3_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -42,7 +43,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "pipeline_parallel_size": 3,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -55,17 +56,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
-"test_name": "serving_llama8B_tp2pp6_sharegpt",
-"qps_list": [1, 4, 16, "inf"],
+"test_name": "serving_llama8B_tp2pp3_sharegpt",
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -74,7 +75,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 2,
 "pipeline_parallel_size": 3,
 "dtype": "bfloat16",
@@ -88,17 +89,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_pp1_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -107,7 +108,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "pipeline_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -121,28 +122,28 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 },
 {
 "test_name": "serving_llama8B_pp3_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
 "VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
-"VLLM_CPU_SGL_KERNEL:": 1,
+"VLLM_CPU_SGL_KERNEL": 1,
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "pipeline_parallel_size": 3,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -156,19 +157,19 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 },
 {
 "test_name": "serving_llama8B_tp2pp3_random_128_128",
-"qps_list": [1, 4, 16, "inf"],
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -177,7 +178,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 2,
 "pipeline_parallel_size": 3,
 "dtype": "bfloat16",
@@ -192,13 +193,12 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 1000,
 "num_prompts": 1000
 }
 }
```
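One of the fixes above drops a stray colon from the `VLLM_CPU_SGL_KERNEL` key. That matters because a key with a trailing colon is not a valid shell identifier, so a harness that exports each entry of `server_environment_variables` would presumably fail to set the flag and the server would never see it. A hedged sketch of the difference (variable values are illustrative):

```shell
# With the corrected key, the variable exports normally and is visible
# to any child process (such as a vllm server launched afterwards).
export "VLLM_CPU_SGL_KERNEL=1"
got=${VLLM_CPU_SGL_KERNEL:-unset}

# With the typo'd key, "VLLM_CPU_SGL_KERNEL:" is not a valid identifier,
# so the export fails with a nonzero status (uncomment to see the error):
# export "VLLM_CPU_SGL_KERNEL:=1"
echo "VLLM_CPU_SGL_KERNEL=$got"
```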
```diff
@@ -2,6 +2,7 @@
 {
 "test_name": "serving_llama8B_tp1_sharegpt",
 "qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -10,7 +11,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -23,17 +24,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp2_sharegpt",
 "qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -42,7 +43,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 2,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -55,17 +56,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp4_sharegpt",
 "qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -74,7 +75,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 4,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -87,17 +88,17 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"max_concurrency": 60,
 "num_prompts": 200
 }
 },
 {
 "test_name": "serving_llama8B_tp4_random_1024_128",
 "qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -106,7 +107,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 4,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -120,19 +121,19 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 1024,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 100,
 "num_prompts": 100
 }
 },
 {
 "test_name": "serving_llama8B_pp6_random_1024_128",
 "qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -141,7 +142,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "server_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "pipeline_parallel_size": 6,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
@@ -155,13 +156,12 @@
 "load_format": "dummy"
 },
 "client_parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
 "random-input-len": 1024,
 "random-output-len": 128,
 "ignore-eos": "",
-"max_concurrency": 100,
 "num_prompts": 100
 }
 }
```
```diff
@@ -6,7 +6,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 1,
 "load_format": "dummy",
 "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
@@ -21,7 +21,7 @@
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "parameters": {
-"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+"model": "meta-llama/Llama-3.1-8B-Instruct",
 "tensor_parallel_size": 4,
 "load_format": "dummy",
 "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
```
```diff
@@ -11,7 +11,7 @@ vLLM contains two sets of benchmarks:
 
 The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
 
-The latest performance results are hosted on the public [vLLM Performance Dashboard](https://perf.vllm.ai).
+The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
 
 More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
```