diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index b92c6cef4a3f..4945927e3d78 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -295,4 +295,4 @@ This indicates vLLM failed to initialize the NCCL communicator, possibly due to
 ## Known Issues
 
 - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
-- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable `NCCL_CUMEM_ENABLE=0` to disable NCCL's `cuMem` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
+- To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations.
diff --git a/vllm/env_override.py b/vllm/env_override.py
index ef425d433320..b06703a2fbf9 100644
--- a/vllm/env_override.py
+++ b/vllm/env_override.py
@@ -13,24 +13,6 @@ logger = init_logger(__name__)
 # that interact with vllm workers.
 # they are executed whenever `import vllm` is called.
 
-if os.environ.get('NCCL_CUMEM_ENABLE', '0') != '0':
-    logger.warning(
-        "NCCL_CUMEM_ENABLE is set to %s, skipping override. "
-        "This may increase memory overhead with cudagraph+allreduce: "
-        "https://github.com/NVIDIA/nccl/issues/1234",
-        os.environ['NCCL_CUMEM_ENABLE'])
-elif not os.path.exists('/dev/nvidia-caps-imex-channels'):
-    # NCCL requires NCCL_CUMEM_ENABLE to work with
-    # multi-node NVLink, typically on GB200-NVL72 systems.
-    # The ultimate way to detect multi-node NVLink is to use
-    # NVML APIs, which are too expensive to call here.
-    # As an approximation, we check the existence of
-    # /dev/nvidia-caps-imex-channels, used by
-    # multi-node NVLink to communicate across nodes.
-    # This will still cost some GPU memory, but it is worthwhile
-    # because we can get very fast cross-node bandwidth with NVLink.
-    os.environ['NCCL_CUMEM_ENABLE'] = '0'
-
 # see https://github.com/vllm-project/vllm/pull/15951
 # it avoids unintentional cuda initialization from torch.cuda.is_available()
 os.environ['PYTORCH_NVML_BASED_CUDA_CHECK'] = '1'
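
The updated documentation bullet above says external processes that join a NCCL communicator with vLLM workers had to mirror `NCCL_CUMEM_ENABLE=0` on affected vLLM versions, and should leave it alone on newer ones. Below is a minimal sketch, not part of this PR, of how such an external process (e.g. an RLHF trainer) could keep the setting consistent based on the installed vLLM version. It assumes `vllm` and `packaging` are importable in that process's environment and that the variable is set before the NCCL process group is created; otherwise the version range would have to be hard-coded.

```python
# Sketch: keep NCCL_CUMEM_ENABLE consistent with the vLLM side of a shared
# NCCL communicator. Mismatched settings across ranks are what caused the
# hangs/crashes referenced in the doc bullet above.
import os
from importlib.metadata import version

from packaging.version import Version

vllm_version = Version(version("vllm"))

# vLLM >= 0.4.3, <= 0.10.1.1 exports NCCL_CUMEM_ENABLE=0 on import, so
# external ranks in the same communicator should match that setting.
if Version("0.4.3") <= vllm_version <= Version("0.10.1.1"):
    os.environ.setdefault("NCCL_CUMEM_ENABLE", "0")
# With this PR and later releases, no override is applied and NCCL's
# default cuMem behavior is left untouched on both sides.
```

The check must run before any NCCL initialization in the external process, since NCCL reads the environment variable when the communicator is created.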