mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-03-17 10:37:06 +08:00
[Docs] Update EPLB docs (#30426)
Signed-off-by: mgoin <mgoin64@gmail.com>
Parent: 6ccb7baeb1
Commit: fcb894222f
@@ -40,10 +40,12 @@ EP_SIZE = TP_SIZE × DP_SIZE

 Where:

-- `TP_SIZE`: Tensor parallel size (always 1 for now)
+- `TP_SIZE`: Tensor parallel size
 - `DP_SIZE`: Data parallel size
 - `EP_SIZE`: Expert parallel size (computed automatically)

+When EP is enabled, MoE layers use expert parallelism instead of tensor parallelism, while attention layers continue to use tensor parallelism if `TP_SIZE > 1`.
+
 ### Example Command

 The following command serves a `DeepSeek-V3-0324` model with 1-way tensor parallel, 8-way (attention) data parallel, and 8-way expert parallel. The attention weights are replicated across all GPUs, while the expert weights are split across GPUs. It will work on a H200 (or H20) node with 8 GPUs. For H100, you can try to serve a smaller model or refer to the multi-node deployment section.
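The sizing rule in this hunk (`EP_SIZE = TP_SIZE × DP_SIZE`) can be checked with a small sketch. The `ParallelLayout` class below is a hypothetical helper written for illustration only; it is not part of vLLM and does not mirror its internal configuration objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper (not a vLLM class) that mirrors the documented rule."""

    tp_size: int  # tensor parallel size
    dp_size: int  # data parallel size

    @property
    def ep_size(self) -> int:
        # EP_SIZE is derived, never set directly: EP_SIZE = TP_SIZE x DP_SIZE
        return self.tp_size * self.dp_size


# The layout described in the example command: 1-way TP, 8-way DP.
layout = ParallelLayout(tp_size=1, dp_size=8)
print(layout.ep_size)  # 8
```

With `tp_size=2, dp_size=4` the same rule also gives an expert parallel size of 8; per the line added in this commit, the attention layers would then use 2-way tensor parallelism while the MoE layers use expert parallelism.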
@@ -119,9 +121,6 @@ While MoE models are typically trained so that each expert receives a similar nu

 Enable EPLB with the `--enable-eplb` flag.

-!!! note "Model Support"
-    Currently only DeepSeek V3 architecture is supported.
-
 When enabled, vLLM collects load statistics with every forward pass and periodically rebalances expert distribution.

 ### EPLB Parameters
@@ -134,6 +133,8 @@ Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. T
 | `step_interval`| Frequency of rebalancing (every N engine steps) | 3000 |
 | `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
 | `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
+| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
+| `policy` | The policy type for expert parallel load balancing | `"default"` |

 For example:

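As a rough illustration of the table rows in this hunk, the Python sketch below builds a JSON string for `--eplb-config` from the listed keys and defaults, and computes the balancedness ratio as the table defines it (average tokens per expert divided by maximum tokens per expert). Both function names are hypothetical, the treatment of an all-zero load is an assumption, and none of this is taken from vLLM's implementation.

```python
import json
from typing import Sequence


def build_eplb_config(
    step_interval: int = 3000,
    log_balancedness: bool = False,
    num_redundant_experts: int = 0,
    use_async: bool = False,
    policy: str = "default",
) -> str:
    """Hypothetical helper: serialize the documented keys into a JSON string."""
    return json.dumps(
        {
            "step_interval": step_interval,
            "log_balancedness": log_balancedness,
            "num_redundant_experts": num_redundant_experts,
            "use_async": use_async,
            "policy": policy,
        }
    )


def balancedness(tokens_per_expert: Sequence[int]) -> float:
    """Average tokens per expert divided by the maximum tokens per expert."""
    if not tokens_per_expert:
        raise ValueError("need at least one expert")
    peak = max(tokens_per_expert)
    if peak == 0:
        # Assumption: with no tokens routed at all, treat the load as balanced.
        return 1.0
    return (sum(tokens_per_expert) / len(tokens_per_expert)) / peak


config_json = build_eplb_config(step_interval=1000, log_balancedness=True)
print(f"--enable-eplb --eplb-config '{config_json}'")
print(balancedness([10, 10, 10, 10]))  # 1.0 (perfectly even load)
print(balancedness([0, 0, 0, 40]))     # 0.25 (one expert takes everything)
```

A value near 1.0 means the experts receive similar token counts; values close to `1 / num_experts` mean a single expert is absorbing most of the traffic.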