diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md
index 6d6366741aae..20f4867057d3 100644
--- a/docs/contributing/profiling.md
+++ b/docs/contributing/profiling.md
@@ -30,13 +30,21 @@ Refer to for an example
 #### OpenAI Server
 
 ```bash
-VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
+VLLM_TORCH_PROFILER_DIR=./vllm_profile \
+    python -m vllm.entrypoints.openai.api_server \
+    --model meta-llama/Meta-Llama-3-70B
 ```
 
 benchmark_serving.py:
 
 ```bash
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Meta-Llama-3-70B \
+    --dataset-name sharegpt \
+    --dataset-path sharegpt.json \
+    --profile \
+    --num-prompts 2
 ```
 
 ## Profile with NVIDIA Nsight Systems
@@ -64,7 +72,16 @@ For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fo
 The following is an example using the `benchmarks/benchmark_latency.py` script:
 
 ```bash
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    python benchmarks/benchmark_latency.py \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-iters-warmup 5 \
+    --num-iters 1 \
+    --batch-size 16 \
+    --input-len 512 \
+    --output-len 8
 ```
 
 #### OpenAI Server
@@ -73,10 +90,21 @@ To profile the server, you will want to prepend your `vllm serve` command with `
 
 ```bash
 # server
-nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
+nsys profile -o report.nsys-rep \
+    --trace-fork-before-exec=true \
+    --cuda-graph-trace=node \
+    --delay 30 \
+    --duration 60 \
+    vllm serve meta-llama/Llama-3.1-8B-Instruct
 
 # client
-python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
+python benchmarks/benchmark_serving.py \
+    --backend vllm \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --num-prompts 1 \
+    --dataset-name random \
+    --random-input 1024 \
+    --random-output 512
 ```
 
 In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
diff --git a/docs/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md
index 3f75d1aef300..5d7019e5a867 100644
--- a/docs/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -79,7 +79,9 @@ Currently, there are no pre-built CPU wheels.
 ??? Commands
 
     ```console
-    $ docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
+    $ docker build -f docker/Dockerfile.cpu \
+        --tag vllm-cpu-env \
+        --target vllm-openai .
 
     # Launching OpenAI server
     $ docker run --rm \
@@ -188,13 +190,19 @@ vllm serve facebook/opt-125m
 - Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
 
     ```console
-    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+    VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
+        vllm serve meta-llama/Llama-2-7b-chat-hf \
+        -tp=2 \
+        --distributed-executor-backend mp
     ```
 
     or using default auto thread binding:
 
     ```console
-    VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
+    VLLM_CPU_KVCACHE_SPACE=40 \
+        vllm serve meta-llama/Llama-2-7b-chat-hf \
+        -tp=2 \
+        --distributed-executor-backend mp
     ```
 
 - For each thread id list in `VLLM_CPU_OMP_THREADS_BIND`, users should guarantee threads in the list belong to a same NUMA node.
diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index 9403abfad85f..631c8c40cfec 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -134,7 +134,10 @@ NCCL_DEBUG=TRACE torchrun --nproc-per-node= test.py
 If you are testing with multi-nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
 
 ```console
-NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR test.py
+NCCL_DEBUG=TRACE torchrun --nnodes 2 \
+    --nproc-per-node=2 \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR test.py
 ```
 
 If the script runs successfully, you should see the message `sanity check is successful!`.
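The troubleshooting hunk above launches a `test.py` sanity-check script with `torchrun`, but the script itself is not part of this diff. For context, here is a minimal sketch of what such a script could look like, assuming a plain `torch.distributed` NCCL all-reduce check; it is illustrative only and may differ from the `test.py` referenced in `docs/usage/troubleshooting.md`.

```python
# Illustrative sketch only: a minimal NCCL sanity check in the spirit of the
# `test.py` referenced above (the real script in the docs may differ).
import torch
import torch.distributed as dist

if __name__ == "__main__":
    # torchrun sets RANK, WORLD_SIZE, and the rendezvous variables in the
    # environment, so the default env:// init method picks them up.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # All-reduce a vector of ones: every element should sum to world_size.
    data = torch.ones(128, device="cuda")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    world_size = dist.get_world_size()
    assert data.mean().item() == world_size, "NCCL all-reduce gave an unexpected result"

    if dist.get_rank() == 0:
        print("sanity check is successful!")
    dist.destroy_process_group()
```

When launched with the `torchrun` command shown in the hunk, each rank all-reduces a tensor of ones, verifies the result matches the world size, and rank 0 prints the expected success message.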