From 8c5a747246515995aff188a3e6320b0643083d5c Mon Sep 17 00:00:00 2001
From: youkaichao
Date: Thu, 11 Sep 2025 11:09:38 +0800
Subject: [PATCH] [distributed] update known issues (#24624)

Signed-off-by: youkaichao
---
 docs/usage/troubleshooting.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index a82d97ea222f7..6e700d1faaa9c 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -324,3 +324,4 @@ This indicates vLLM failed to initialize the NCCL communicator, possibly due to
 
 - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
 - To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations.
+- On some PCIe machines (e.g. machines without NVLink), if you see an error like `transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'`, it is likely caused by a driver bug. See [this issue](https://github.com/NVIDIA/nccl/issues/1838) for more details. In that case, you can set `NCCL_CUMEM_HOST_ENABLE=0` to disable the feature, or upgrade your driver to the latest version.
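
As a minimal sketch of applying the workaround described in the new bullet (assuming vLLM's offline `LLM` API; the model name and `tensor_parallel_size` below are illustrative placeholders, not part of the patch), the environment variable has to be set before vLLM initializes its NCCL communicators:

```python
# Sketch only: apply the NCCL_CUMEM_HOST_ENABLE=0 workaround from the bullet above.
# Model and parallelism settings are placeholders for illustration.
import os

# NCCL reads this at communicator initialization, so set it before
# vLLM spins up its distributed workers.
os.environ["NCCL_CUMEM_HOST_ENABLE"] = "0"

from vllm import LLM

# Multi-GPU example, since NCCL is only involved when more than one device is used.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```

The same effect can be achieved by exporting the variable in the shell before launching `vllm serve`; the Python form is shown only because it keeps the setting next to the code that needs it.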