From b9e0951f964e1b8adfebb973c30462c0e0417c1f Mon Sep 17 00:00:00 2001
From: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Wed, 10 Dec 2025 17:15:54 -0500
Subject: [PATCH] [docs] Improve wide-EP performance + benchmarking documentation (#27933)

Signed-off-by: Seiji Eicher
---
 docs/serving/data_parallel_deployment.md   | 14 ++++++++++-
 docs/serving/expert_parallel_deployment.md | 28 +++++++++++++++++++++-
 tools/ep_kernels/README.md                 |  4 ++--
 3 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/docs/serving/data_parallel_deployment.md b/docs/serving/data_parallel_deployment.md
index eff9c5d5e4efa..e5954917cd790 100644
--- a/docs/serving/data_parallel_deployment.md
+++ b/docs/serving/data_parallel_deployment.md
@@ -24,7 +24,7 @@ There are two distinct modes supported for online deployments - self-contained w
 
 vLLM supports "self-contained" data parallel deployments that expose a single API endpoint.
 
-It can be configured by simply including e.g. `--data-parallel-size=4` in the vllm serve command line arguments. This will require 4 GPUs. It can be combined with tensor parallel, for example `--data-parallel-size=4 --tensor-parallel-size=2`, which would require 8 GPUs.
+It can be configured by simply including e.g. `--data-parallel-size=4` in the vllm serve command line arguments. This will require 4 GPUs. It can be combined with tensor parallel, for example `--data-parallel-size=4 --tensor-parallel-size=2`, which would require 8 GPUs. When sizing DP deployments, remember that `--max-num-seqs` applies per DP rank.
 
 Running a single data parallel deployment across multiple nodes requires a different `vllm serve` to be run on each node, specifying which DP ranks should run on that node. In this case, there will still be a single HTTP entrypoint - the API server(s) will run only on one node, but it doesn't necessarily need to be co-located with the DP ranks.
 
@@ -80,6 +80,18 @@ When deploying large DP sizes using this method, the API server process can beco
 
 ![DP Internal LB Diagram](../assets/deployment/dp_internal_lb.png)
 
+## Hybrid Load Balancing
+
+Hybrid load balancing sits between the internal and external approaches. Each node runs its own API server(s) that only queue requests to the data-parallel engines co-located on that node. An upstream load balancer (for example, an ingress controller or traffic router) spreads user requests across those per-node endpoints.
+
+Enable this mode with `--data-parallel-hybrid-lb` while still launching every node with the global data-parallel size. The key differences from internal load balancing are:
+
+- You must provide `--data-parallel-size-local` and `--data-parallel-start-rank` so each node knows which ranks it owns.
+- Not compatible with `--headless`, since every node exposes an API endpoint.
+- Scale `--api-server-count` per node based on the number of local ranks.
+
+In this configuration, each node keeps scheduling decisions local, which reduces cross-node traffic and avoids single-node bottlenecks at larger DP sizes.
+
 ## External Load Balancing
 
 For larger scale deployments especially, it can make sense to handle the orchestration and load balancing of data parallel ranks externally.
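The hybrid load-balancing section above names the relevant flags; the sketch below shows how they might fit together on two nodes. It is illustrative only: it assumes two 8-GPU nodes (global DP size 16) and reuses the placeholder model, IP address, and RPC port from the expert-parallel examples in this patch, so the exact invocation should be checked against the current `vllm serve` CLI.

```bash
# Node 1: owns DP ranks 0-7 and exposes its own API endpoint (default port 8000)
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 0 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --api-server-count 8

# Node 2: owns DP ranks 8-15 and also exposes its own API endpoint
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --data-parallel-hybrid-lb \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address 192.168.1.100 \
    --data-parallel-rpc-port 13345 \
    --api-server-count 8

# An upstream load balancer (for example a Kubernetes ingress) then spreads
# user requests across the two per-node endpoints.
```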
diff --git a/docs/serving/expert_parallel_deployment.md b/docs/serving/expert_parallel_deployment.md
index 98f242ab8b892..923020dc88c91 100644
--- a/docs/serving/expert_parallel_deployment.md
+++ b/docs/serving/expert_parallel_deployment.md
@@ -83,7 +83,7 @@ vllm serve deepseek-ai/DeepSeek-V3-0324 \
     --data-parallel-size-local 8 \ # Local DP size on this node (8 GPUs per node)
     --data-parallel-address 192.168.1.100 \ # Replace with actual IP of Node 1
     --data-parallel-rpc-port 13345 \ # RPC communication port, can be any port as long as reachable by all nodes
-    --api-server-count=8 # Number of API servers for load handling (scaling this out to total ranks are recommended)
+    --api-server-count=8 # Number of API servers for load handling (scaling this out to the number of local ranks is recommended)
 
 # Node 2 (Secondary - headless mode, no API server)
 vllm serve deepseek-ai/DeepSeek-V3-0324 \
@@ -184,6 +184,26 @@ vllm serve deepseek-ai/DeepSeek-V3-0324 \
 
 For multi-node deployment, add these EPLB flags to each node's command. We recommend setting `--eplb-config '{"num_redundant_experts":32}'` to 32 in large scale use cases so the most popular experts are always available.
 
+## Advanced Configuration
+
+### Performance Optimization
+
+- **DeepEP kernels**: The `high_throughput` and `low_latency` kernels are optimized for disaggregated serving and may show poor performance for mixed workloads.
+- **Dual Batch Overlap**: Use `--enable-dbo` to overlap all-to-all communication with compute. See [Dual Batch Overlap](../design/dbo.md) for more details.
+- **Async scheduling (experimental)**: Try `--async-scheduling` to overlap scheduling with model execution.
+
+### Troubleshooting
+
+- **`non-zero status: 7 cannot register cq buf`**: When using InfiniBand/RoCE, make sure the host VM and pods report `ulimit -l` as "unlimited".
+- **`init failed for transport: IBGDA`**: The NVIDIA driver is not configured for InfiniBand GPUDirect Async (IBGDA). Run `tools/ep_kernels/configure_system_drivers.sh` on each GPU node and reboot. This also fixes the error `NVSHMEM API called before NVSHMEM initialization has completed`.
+- **NVSHMEM peer disconnect**: Usually a networking misconfiguration. If deploying via Kubernetes, verify that every pod runs with `hostNetwork: true` and `securityContext.privileged: true` so it can access InfiniBand.
+
+### Benchmarking
+
+- Set the environment variables `VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random` and `VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1` so that token routing is balanced across EP ranks.
+
+- Increasing `VLLM_MOE_DP_CHUNK_SIZE` may increase throughput by increasing the maximum batch size for inter-rank token transfers. This may cause DeepEP to throw `assert self.nvshmem_qp_depth >= (num_max_dispatch_tokens_per_rank + 1) * 2`, which can be fixed by increasing the `NVSHMEM_QP_DEPTH` environment variable.
+
 ## Disaggregated Serving (Prefill/Decode Split)
 
 For production deployments requiring strict SLA guarantees for time-to-first-token and inter-token latency, disaggregated serving allows independent scaling of prefill and decode operations.
@@ -274,3 +294,9 @@ except Exception as e:
     print(f"❌ Error during disaggregated serving: {e}")
     print("Check that both prefill and decode instances are running and accessible")
 ```
+
+### Benchmarking
+
+- To simulate the decode deployment of disaggregated serving, pass `--kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}'` to the `vllm serve` invocation. The connector populates the KV cache with random values so decode can be profiled in isolation.
+
+- **CUDAGraph capture**: Use `--compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'` to capture CUDA graphs for decode only, conserving memory for the KV cache.
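Tying the benchmarking additions above together, a decode-side profiling run might combine the routing-simulation environment variables with the `DecodeBenchConnector` and decode-only CUDA graphs. The sketch below is illustrative only, assuming a single 8-GPU node, the placeholder model used throughout these docs, and the flag spellings that appear in this patch.

```bash
# Balance token routing across EP ranks while benchmarking
export VLLM_MOE_ROUTING_SIMULATION_STRATEGY=uniform_random
export VLLM_RANDOMIZE_DP_DUMMY_INPUTS=1

# Decode-only profiling: DecodeBenchConnector fills the KV cache with random
# values, and full CUDA graphs are captured for decode batches only
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --enable-expert-parallel \
    --data-parallel-size 8 \
    --kv-transfer-config '{"kv_connector":"DecodeBenchConnector","kv_role":"kv_both"}' \
    --compilation_config '{"cudagraph_mode": "FULL_DECODE_ONLY"}'
```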
diff --git a/tools/ep_kernels/README.md b/tools/ep_kernels/README.md
index 85e9d2a4f8129..ab0e358802bf8 100644
--- a/tools/ep_kernels/README.md
+++ b/tools/ep_kernels/README.md
@@ -7,7 +7,7 @@ Here we break down the requirements in 2 steps:
 1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this.
 2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine.
 
-2 is necessary for multi-node deployment.
+Step 2 is necessary for multi-node deployment.
 
 All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`.
 
@@ -23,6 +23,6 @@ TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh
 Additional step for multi-node deployment:
 
 ```bash
-sudo bash configure_system_drivers.sh
+sudo bash configure_system_drivers.sh # update-initramfs can take several minutes
 sudo reboot # Reboot is required to load the new driver
 ```
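One detail the README states but does not show is the positional workspace argument accepted by the scripts. The example below is a small illustrative sketch: the scratch path is arbitrary, and the final `ulimit -l` check mirrors the troubleshooting notes added to expert_parallel_deployment.md above.

```bash
# Stage the build in a custom workspace instead of the default
# $(pwd)/ep_kernels_workspace (the path here is an arbitrary example)
TORCH_CUDA_ARCH_LIST="10.0" bash install_python_libraries.sh /mnt/scratch/ep_kernels_workspace

# Multi-node only: configure the NVIDIA driver for IBGDA, then reboot
sudo bash configure_system_drivers.sh /mnt/scratch/ep_kernels_workspace
sudo reboot

# After the reboot, confirm the locked-memory limit is unlimited
ulimit -l
```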