[Doc] Add documents for multi-node distributed serving with MP backend (#30509)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Isotr0py 2025-12-14 02:02:29 +08:00 committed by GitHub
parent ddbfbe5278
commit 7c16f3fbcc
2 changed files with 24 additions and 4 deletions


@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
### What is Ray?
Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
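To make the Ray description concrete, here is a minimal, self-contained Ray sketch (illustrative only, not part of this commit) showing the remote-task pattern that the Ray backend builds on:

```python
# Minimal Ray sketch: tasks decorated with @ray.remote can be scheduled
# on any node of the cluster. Illustrative only, not from the commit.
import ray

ray.init()  # connect to an existing cluster, or start a local one


@ray.remote
def square(x: int) -> int:
    return x * x


# .remote() submits tasks asynchronously; ray.get() waits for the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```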
@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
--distributed-executor-backend ray
```
### Running vLLM with multiprocessing
Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here is an example of deploying a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
Choose one node as the head node and run:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 0 \
--master-addr <HEAD_NODE_IP>
```
On the other worker node, run:
```bash
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 --pipeline-parallel-size 2 \
--nnodes 2 --node-rank 1 \
--master-addr <HEAD_NODE_IP> --headless
```
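Once both commands are running, the head node serves the usual OpenAI-compatible API. A quick sanity check (not part of this commit) might look like the sketch below; the port (vLLM's default of 8000) and the model name matching the path passed to `vllm serve` are assumptions to adjust for your deployment:

```python
# Sanity-check the multi-node deployment through the OpenAI-compatible API.
# Assumes the default port 8000 and that the served model name equals the
# path given to `vllm serve`; both are deployment-specific.
import requests

HEAD_NODE_IP = "10.0.0.1"  # placeholder for <HEAD_NODE_IP>

resp = requests.post(
    f"http://{HEAD_NODE_IP}:8000/v1/completions",
    json={
        "model": "/path/to/the/model/in/the/container",
        "prompt": "Hello, my name is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```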
## Optimizing network communication for tensor parallelism
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.


@@ -124,9 +124,7 @@ class MultiprocExecutor(Executor):
# Set multiprocessing envs
set_multiprocessing_worker_envs()
# use the loopback address get_loopback_ip() for communication.
distributed_init_method = get_distributed_init_method(
    get_loopback_ip(), get_open_port()
)
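For context, the init method built here is a TCP rendezvous URL of the kind `torch.distributed` accepts, constructed from the loopback address and a free port. A rough sketch of what these helpers might do (assumed implementations, not the actual vLLM source):

```python
# Rough sketch of the helpers used above (assumed, not the vLLM source).
import socket


def get_loopback_ip() -> str:
    # IPv4 loopback; reachable by every worker process on the same node.
    return "127.0.0.1"


def get_open_port() -> int:
    # Ask the OS for a free port by binding to port 0.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def get_distributed_init_method(ip: str, port: int) -> str:
    # torch.distributed accepts a "tcp://host:port" init_method string.
    return f"tcp://{ip}:{port}"


print(get_distributed_init_method(get_loopback_ip(), get_open_port()))
# e.g. tcp://127.0.0.1:51234
```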