[doc] improve readability for long commands (#19920)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
parent a6e6604d32
commit 53243e5c42
@@ -30,13 +30,21 @@ Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example
 #### OpenAI Server
 
 ```bash
-VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
+VLLM_TORCH_PROFILER_DIR=./vllm_profile \
+python -m vllm.entrypoints.openai.api_server \
+    --model meta-llama/Meta-Llama-3-70B
 ```
 
 benchmark_serving.py:
 
 ```bash
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Meta-Llama-3-70B \
+    --dataset-name sharegpt \
+    --dataset-path sharegpt.json \
+    --profile \
+    --num-prompts 2
 ```
 
 ## Profile with NVIDIA Nsight Systems
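Supplementary to the torch-profiler commands in the hunk above (not part of this diff): when `VLLM_TORCH_PROFILER_DIR` is set, recent vLLM servers expose `/start_profile` and `/stop_profile` endpoints, which `benchmark_serving.py --profile` drives for you. Assuming your version has those endpoints and the server listens on the default port 8000, a trace can also be captured by hand, roughly like this:

```bash
# Assumes the OpenAI-compatible server from the hunk above is running on localhost:8000
# and was started with VLLM_TORCH_PROFILER_DIR set; endpoint availability may vary by version.
curl -X POST http://localhost:8000/start_profile   # begin recording torch profiler traces

# Generate some load while the profiler is active.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3-70B", "prompt": "Hello", "max_tokens": 8}'

curl -X POST http://localhost:8000/stop_profile    # flush traces to ./vllm_profile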
@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
 The following is an example using the `benchmarks/benchmark_latency.py` script:
 
 ```bash
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    python benchmarks/benchmark_latency.py \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-iters-warmup 5 \
+    --num-iters 1 \
+    --batch-size 16 \
+    --input-len 512 \
+    --output-len 8
 ```
 
 #### OpenAI Server
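Not part of this diff, but as a quick way to sanity-check the report produced by the `benchmark_latency.py` example in the hunk above without opening the GUI, the Nsight Systems CLI can print summary tables (assuming an `nsys` build that ships the `stats` subcommand):

```bash
# Summarize the report generated above: CUDA API call times, kernel durations, memory ops.
nsys stats report.nsys-rep
```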
@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `
 
 ```bash
 # server
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    --delay 30 \
+    --duration 60 \
+    vllm serve meta-llama/Llama-3.1-8B-Instruct
 
 # client
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-prompts 1 \
+    --dataset-name random \
+    --random-input 1024 \
+    --random-output 512
 ```
 
 In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
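The stop command itself falls outside this hunk. As an assumption about the Nsight Systems CLI (not taken from this diff), a capture started with a long `--duration` can typically be ended early by stopping the active collection session:

```bash
# List active Nsight Systems sessions, then stop the one attached to the vLLM server.
# Session names vary; substitute the name printed by the first command.
nsys sessions list
nsys stop --session=<session-name>
```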
@@ -79,7 +79,9 @@ Currently, there are no pre-built CPU wheels.
 ??? Commands
 
 ```console
-$ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
+$ docker build -f docker/Dockerfile.cpu \
+    --tag vllm-cpu-env \
+    --target vllm-openai .
 
 # Launching OpenAI server
 $ docker run --rm \
@@ -188,13 +190,19 @@ vllm serve facebook/opt-125m
 - Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
 
 ```console
-VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
+vllm serve meta-llama/Llama-2-7b-chat-hf \
+    -tp=2 \
+    --distributed-executor-backend mp
 ```
 
 or using default auto thread binding:
 
 ```console
-VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+VLLM_CPU_KVCACHE_SPACE=40 \
+vllm serve meta-llama/Llama-2-7b-chat-hf \
+    -tp=2 \
+    --distributed-executor-backend mp
 ```
 
 - For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
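As a supplementary sketch (not from this diff): the `0-31|32-63` ranges in the hunk above assume two NUMA nodes with 32 cores each. One way to check which CPU ids actually belong to which NUMA node on your machine, assuming `numactl` is installed, is:

```bash
# Show NUMA nodes and the CPU ids attached to each, to pick the ranges for
# VLLM_CPU_OMP_THREADS_BIND (one "|"-separated list per tensor-parallel rank).
lscpu | grep -i numa
numactl --hardware
```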
@@ -134,7 +134,10 @@ NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
 If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
 
 ```console
-NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
+NCCL_DEBUG=TRACE torchrun --nnodes 2 \
+    --nproc-per-node=2 \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR test.py
 ```
 
 If the script runs successfully, you should see the message `sanity check is successful!`.
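For completeness (an assumption, not part of this diff): the torchrun command above has to be launched on every node, with `MASTER_ADDR` pointing at the same master node on each of them, for example:

```bash
# Run this on each of the two nodes; 198.51.100.10 is a placeholder master-node IP
# reachable from all nodes -- replace it with your own.
export MASTER_ADDR=198.51.100.10
NCCL_DEBUG=TRACE torchrun --nnodes 2 \
    --nproc-per-node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR test.py
```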