[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA (#15897)
Signed-off-by: Ernest Wong <chwong719@gmail.com>
This commit is contained in:
parent a2f14dc8f9
commit bd5038af07
@@ -100,7 +100,50 @@ vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16
```
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP sockets, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses InfiniBand with GPUDirect RDMA, which is efficient.
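For example, a launch sequence might look like the following sketch. It assumes the `run_cluster.sh` invocation shown earlier in this guide; the image name, head-node IP, cache path, and log location are placeholders to adapt to your cluster.

```bash
# Sketch only: pass the InfiniBand-related flags as extra Docker arguments
# when creating the cluster, so every node's container gets them.
bash run_cluster.sh \
    vllm/vllm-openai \
    <head_node_ip> \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    --privileged -e NCCL_IB_HCA=mlx5

# Inside the container, start vLLM with NCCL tracing and keep the logs around:
NCCL_DEBUG=TRACE vllm serve /path/to/the/model/in/the/container \
    --tensor-parallel-size 16 2>&1 | tee /tmp/vllm_nccl_trace.log

# In another shell, check which transport NCCL picked:
#   "via NET/IB/GDRDMA" -> InfiniBand with GPUDirect RDMA (efficient)
#   "via NET/Socket"    -> plain TCP sockets (slow for cross-node tensor parallelism)
grep "via NET" /tmp/vllm_nccl_trace.log
```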
### GPUDirect RDMA
To enable GPUDirect RDMA with vLLM, specific configuration tweaks are needed. This setup ensures:

- `IPC_LOCK` Security Context: Add the `IPC_LOCK` capability to the container’s security context to lock memory pages and prevent swapping to disk.
- Shared Memory with `/dev/shm`: Mount `/dev/shm` in the pod spec to provide shared memory for IPC.

When using Docker, you can set up the container as follows:
```bash
docker run --gpus all \
    --ipc=host \
    --shm-size=16G \
    -v /dev/shm:/dev/shm \
    vllm/vllm-openai
```
When using Kubernetes, you can set up the pod spec as follows:
```yaml
...
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      resources:
        limits:
          nvidia.com/gpu: 8
        requests:
          nvidia.com/gpu: 8
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
...
```
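Once the pod is running, a quick sanity check is to confirm both settings from inside the container. This is a sketch: `<pod-name>` is a placeholder for the actual pod name, and decoding the capability mask requires `capsh` from the libcap tools.

```bash
# /dev/shm should be a tmpfs backed by the emptyDir above, not Docker's small 64M default.
kubectl exec <pod-name> -- df -h /dev/shm

# The effective capability mask of the main process should include CAP_IPC_LOCK.
# Decode the hex mask with `capsh --decode=<mask>` (from the libcap tools).
kubectl exec <pod-name> -- grep CapEff /proc/1/status
```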
!!! warning
    After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for processes on the same node, not for processes on other nodes. Setting environment variables when you create the cluster is the recommended way, as shown in the sketch below. See <gh-issue:6803> for more information.
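For instance, a sketch that reuses the `run_cluster.sh` invocation from earlier (placeholders as before, and `eth0` is only an example interface name):

```bash
# Exporting in a shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only reaches
# processes on that single node. Passing the variable when creating the cluster makes
# it visible inside the container on every node:
bash run_cluster.sh \
    vllm/vllm-openai \
    <head_node_ip> \
    --head \
    /path/to/the/huggingface/home/in/this/node \
    -e NCCL_SOCKET_IFNAME=eth0
```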
@@ -273,6 +273,27 @@ But you are sure that the model is in the [list of supported models][supported-models]
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
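For example (a sketch; the model path is a placeholder), rerunning with debug logging makes the platform-detection steps visible in the output:

```bash
# Sketch: enable verbose logging so the device/platform detection steps are printed.
VLLM_LOGGING_LEVEL=DEBUG vllm serve /path/to/the/model 2>&1 | tee /tmp/vllm_debug.log

# Then search the captured log for platform-related messages to see what was (or wasn't) detected.
grep -i platform /tmp/vllm_debug.log
```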
## NCCL error: unhandled system error during `ncclCommInitRank`
If your serving workload uses GPUDirect RDMA for distributed serving across multiple nodes and encounters an error during `ncclCommInitRank`, with no clear error message even with `NCCL_DEBUG=INFO` set, it might look like this:
```text
Error executing method 'init_device'. This might cause deadlock in distributed execution.
Traceback (most recent call last):
...
  File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
    self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 277, in ncclCommInitRank
    self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
  File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 256, in NCCL_CHECK
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
...
```
This indicates vLLM failed to initialize the NCCL communicator, possibly due to a missing `IPC_LOCK` Linux capability or an unmounted `/dev/shm`. Refer to [Distributed Inference and Serving](../serving/distributed_serving.md#running-vllm-on-multiple-nodes) for guidance on properly configuring the environment for distributed serving.
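Two quick checks from inside the affected container can confirm whether those prerequisites are met. This is a sketch; `capsh` ships with libcap (e.g. the `libcap2-bin` package on Debian/Ubuntu) and may need to be installed first.

```bash
# 1. /dev/shm should be mounted and reasonably large; Docker's 64M default is often too small.
df -h /dev/shm

# 2. The decoded effective capability set should include cap_ipc_lock.
grep CapEff /proc/self/status
capsh --decode="$(awk '/CapEff/ {print $2}' /proc/self/status)"
```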
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).