[doc] update doc on testing and debugging (#8514)

2026-03-17 16:17:14 +08:00 · 2024-09-16 12:10:23 -07:00 · 2024-09-16 12:10:23 -07:00 · 2759a43a26
commit 2759a43a26
parent 5d73ae49d6
1 changed files with 7 additions and 0 deletions
--- a/docs/source/getting_started/debugging.rst
+++ b/docs/source/getting_started/debugging.rst
@ -98,6 +98,13 @@ Here are some common issues that can cause hangs:

    If the script runs successfully, you should see the message ``sanity check is successful!``.

+    Note that multi-node environment is more complicated than single-node. If you see errors such as ``torch.distributed.DistNetworkError``, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
+
+    - In the first node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py``.
+    - In the second node, run ``NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py``.
+
+    Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup. The difference is that you need to execute different commands (with different ``--node-rank``) on different nodes.
+
 If the problem persists, feel free to `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_, with a detailed description of the issue, your environment, and the logs.

 Some known issues: