[Docs] Update EPLB docs (#30426)

Signed-off-by: mgoin <mgoin64@gmail.com>
Michael Goin 2025-12-10 15:56:51 -05:00 committed by GitHub
parent 6ccb7baeb1
commit fcb894222f


@@ -40,10 +40,12 @@ EP_SIZE = TP_SIZE × DP_SIZE
Where:

- `TP_SIZE`: Tensor parallel size
- `DP_SIZE`: Data parallel size
- `EP_SIZE`: Expert parallel size (computed automatically)
When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
### Example Command
The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on an H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
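A minimal sketch of such an invocation, assuming vLLM's standard serve flags (`--tensor-parallel-size`, `--data-parallel-size`, `--enable-expert-parallel`) rather than the doc's verbatim command:

```bash
# Sketch: 1-way TP, 8-way (attention) DP, 8-way EP on a single 8-GPU node.
# Flag spellings assumed from vLLM's CLI; check `vllm serve --help` to confirm.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel
```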
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu
Enable EPLB with the `--enable-eplb` flag.
!!! note "Model Support"
Currently only DeepSeek V3 architecture is supported.
When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.
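A minimal sketch of enabling it, assuming the single-node expert-parallel serve command from the earlier section:

```bash
# Sketch: same expert-parallel serve command as above, with EPLB enabled.
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-eplb
```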
### EPLB Parameters
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
| `step_interval` | Frequency of rebalancing (every N engine steps) | 3000 |
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
| `policy` | The policy type for expert parallel load balancing | `"default"` |
For example:
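A minimal sketch, assuming the DeepSeek-V3 serve command from the earlier section and illustrative values for the keys listed in the table above:

```bash
# Sketch: EPLB with an illustrative --eplb-config JSON string
# (keys taken from the parameter table above; values are examples only).
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --enable-eplb \
    --eplb-config '{"step_interval": 3000, "num_redundant_experts": 0, "log_balancedness": true}'
```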