mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2025-12-18 05:45:01 +08:00)
[Doc] Add documents for multi-node distributed serving with MP backend (#30509)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

parent ddbfbe5278
commit 7c16f3fbcc
@@ -62,7 +62,7 @@ If a single node lacks sufficient GPUs to hold the model, deploy vLLM across mul
 
 ### What is Ray?
 
-Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments require Ray as the runtime engine.
+Ray is a distributed computing framework for scaling Python programs. Multi-node vLLM deployments can use Ray as the runtime engine.
 
 vLLM uses Ray to manage the distributed execution of tasks across multiple nodes and control where execution happens.
 
@@ -130,6 +130,28 @@ vllm serve /path/to/the/model/in/the/container \
     --distributed-executor-backend ray
 ```
 
+### Running vLLM with MultiProcessing
+
+Besides Ray, multi-node vLLM deployments can also use `multiprocessing` as the runtime engine. Here's an example that deploys a model across 2 nodes (8 GPUs per node) with `tp_size=8` and `pp_size=2`.
+
+Choose one node as the head node and run:
+
+```bash
+vllm serve /path/to/the/model/in/the/container \
+    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
+    --nnodes 2 --node-rank 0 \
+    --master-addr <HEAD_NODE_IP>
+```
+
+On the other worker node, run:
+
+```bash
+vllm serve /path/to/the/model/in/the/container \
+    --tensor-parallel-size 8 --pipeline-parallel-size 2 \
+    --nnodes 2 --node-rank 1 \
+    --master-addr <HEAD_NODE_IP> --headless
+```
+
 ## Optimizing network communication for tensor parallelism
 
 Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
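The flags in the commands above imply a world size of `tp_size * pp_size = 16` workers split evenly across the two nodes. As a sanity check on that arithmetic, here is a small Python sketch of a tp-major rank layout; this is illustrative only, not vLLM's actual placement code, and `layout` is a hypothetical helper:

```python
# Illustrative sketch: how 16 workers (tp=8, pp=2) could map onto 2 nodes
# with 8 GPUs each, assuming the common tp-major rank ordering.
# NOT vLLM's actual placement code.

TP_SIZE = 8          # --tensor-parallel-size
PP_SIZE = 2          # --pipeline-parallel-size
NNODES = 2           # --nnodes
GPUS_PER_NODE = 8

world_size = TP_SIZE * PP_SIZE

def layout(rank: int) -> dict:
    """Map a global rank to its node, local GPU, tp rank, and pp rank."""
    return {
        "node": rank // GPUS_PER_NODE,      # which machine hosts the worker
        "local_gpu": rank % GPUS_PER_NODE,  # GPU index on that machine
        "tp_rank": rank % TP_SIZE,          # position in its tensor-parallel group
        "pp_rank": rank // TP_SIZE,         # pipeline stage
    }

assert world_size == NNODES * GPUS_PER_NODE
# With tp=8 and 8 GPUs per node, each pipeline stage fits on exactly one node:
assert all(layout(r)["node"] == layout(r)["pp_rank"] for r in range(world_size))
```

Under this layout, node-rank 0 holds pipeline stage 0 and node-rank 1 holds stage 1, so all tensor-parallel traffic stays node-local and only pipeline activations cross the network.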
@@ -124,9 +124,7 @@ class MultiprocExecutor(Executor):
         # Set multiprocessing envs
         set_multiprocessing_worker_envs()
 
-        # Multiprocessing-based executor does not support multi-node setting.
-        # Since it only works for single node, we can use the loopback address
-        # get_loopback_ip() for communication.
+        # use the loopback address get_loopback_ip() for communication.
         distributed_init_method = get_distributed_init_method(
             get_loopback_ip(), get_open_port()
         )
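For context on the code hunk above: `get_distributed_init_method` builds the `torch.distributed` init string from an IP and a port. A minimal sketch of the idea, assuming the conventional `tcp://host:port` form (the `_sketch` helper below is hypothetical; vLLM's real helper may differ in detail):

```python
def get_distributed_init_method_sketch(ip: str, port: int) -> str:
    """Sketch: build a torch.distributed TCP init string from ip and port.

    Brackets guard IPv6 literals, which contain ':' characters themselves.
    """
    host = f"[{ip}]" if ":" in ip else ip
    return f"tcp://{host}:{port}"

# The executor pairs a loopback address with a free port, e.g.:
print(get_distributed_init_method_sketch("127.0.0.1", 29500))
# -> tcp://127.0.0.1:29500
```

The comment change in the hunk tracks the doc change: the multiprocessing executor is no longer described as single-node only, but each node's local workers still rendezvous over the loopback address.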