From 8b83b937397ca921ca59eb14515da0ebf0d73b6d Mon Sep 17 00:00:00 2001
From: Tyler Michael Smith
Date: Wed, 10 Sep 2025 09:09:49 -0400
Subject: [PATCH] [Docs] Document the extra memory footprint overhead when
 using EPLB (#24537)

Signed-off-by: Tyler Michael Smith
---
 docs/serving/expert_parallel_deployment.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/serving/expert_parallel_deployment.md b/docs/serving/expert_parallel_deployment.md
index 7bf87b151e6af..f8701870864dc 100644
--- a/docs/serving/expert_parallel_deployment.md
+++ b/docs/serving/expert_parallel_deployment.md
@@ -156,6 +156,13 @@ vllm serve Qwen/Qwen3-30B-A3B \
 - **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
 - **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
 
+### Memory Footprint Overhead
+
+EPLB uses redundant experts that need to fit in GPU memory. This means that EPLB may not be a good fit for memory-constrained environments or when KV cache space is at a premium.
+
+This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * NUM_REDUNDANT_EXPERTS ÷ NUM_EP_RANKS` per EP rank.
+For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per rank.
+
 ### Example Command
 
 Single node deployment with EPLB enabled:
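As a sanity check on the `2.4 GB` figure, here is a minimal Python sketch of the overhead formula above. The DeepSeekV3 parameters used (58 MoE layers, FP8 expert weights, hidden size 7168, per-expert intermediate size 2048) and the helper name `eplb_overhead_bytes` are illustrative assumptions for this sketch, not values or APIs taken from vLLM.

```python
# Minimal sketch of the EPLB redundancy overhead formula:
#   NUM_MOE_LAYERS * BYTES_PER_EXPERT * NUM_REDUNDANT_EXPERTS / NUM_EP_RANKS
# The DeepSeekV3 figures below are assumptions from the public model config.

def eplb_overhead_bytes(
    num_moe_layers: int,
    bytes_per_expert: int,
    num_redundant_experts: int,
    num_ep_ranks: int,
) -> float:
    """Extra per-rank GPU memory required by EPLB's redundant experts."""
    return num_moe_layers * bytes_per_expert * num_redundant_experts / num_ep_ranks

# One FP8 expert holds gate, up, and down projections at 1 byte per parameter.
hidden_size = 7168            # DeepSeekV3 hidden size (assumed)
moe_intermediate_size = 2048  # DeepSeekV3 per-expert intermediate size (assumed)
bytes_per_expert = 3 * hidden_size * moe_intermediate_size  # ~44 MB per expert

overhead = eplb_overhead_bytes(
    num_moe_layers=58,        # DeepSeekV3 MoE layers (assumed)
    bytes_per_expert=bytes_per_expert,
    num_redundant_experts=8,  # one redundant expert per rank ...
    num_ep_ranks=8,           # ... on an 8-rank EP deployment
)
print(f"{overhead / 2**30:.2f} GiB per rank")  # ~2.38 GiB, i.e. roughly 2.4 GB
```

Note that with one redundant expert per rank (`NUM_REDUNDANT_EXPERTS == NUM_EP_RANKS`), the ratio cancels and the overhead reduces to `NUM_MOE_LAYERS * BYTES_PER_EXPERT`.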