vllm/examples/online_serving/disaggregated_encoder
Chenguang Zheng 4ccffe561f
[Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233)
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
2025-11-11 18:58:33 -08:00
..

Disaggregated Encoder

These example scripts that demonstrate the disaggregated encoder (EPD) features of vLLM.

For a detailed explanation of the EPD features, please refer to the Disaggregated Encoder Feature Documentation.

Files

  • disagg_epd_proxy.py - Proxy script that demonstrates the XeYpZd setup (X encode instances, Y prefill instances, Z decode instances). Currently stable for the 1e1p1d configuration.

  • disagg_1e1p1d_example.sh - Sets up the 1e1p1d configuration, runs the VisionArena benchmark, and processes a single request with a local image.

  • disagg_1e1pd_example.sh - Sets up the 1e1pd configuration, runs the VisionArena benchmark, and processes a single request with a local image.

Custom Configuration

# Use specific GPUs
GPU_E=0 GPU_PD=1 GPU_P=1 GPU_D=2 bash disagg_1e1p1d_example.sh

# Use specific ports
ENDPOINT_PORT=10001 bash disagg_1e1p1d_example.sh

# Use specific model
MODEL="Qwen/Qwen2.5-VL-3B-Instruct" bash disagg_1e1p1d_example.sh

# Use specific storage path
EC_SHARED_STORAGE_PATH="/tmp/my_ec_cache" bash disagg_1e1p1d_example.sh

Encoder Instances

Encoder engines should be launched with the following flags:

  • --enforce-eager (required) The current EPD implementation is only compatible with encoder instances running in this mode.

  • --no-enable-prefix-caching (required) Encoder instances do not consume KV cache; prefix caching is disabled to avoid conflicts with other features.

  • --max-num-batched-tokens=<large value> (default: 2048) This flag controls the token scheduling budget per decoding step and is irrelevant to encoder-only instances. Set it to a very high value (effectively unlimited) to bypass scheduler limitations. The actual token budget is managed by the encoder cache manager.

Local media inputs

To support local image inputs (from your MEDIA_PATH directory), add the following flag to the encoder instance:

--allowed-local-media-path $MEDIA_PATH

The vllm instances and disagg_encoder_proxy supports local URIs with {"url": "file://'"$MEDIA_PATH_FILENAME"'} as multimodal inputs. Each URI is passed unchanged from the disagg_encoder_proxy to the encoder instance so that the encoder can load the media locally.

EC connector and KV transfer

The ECSharedStorageConnector is used to store the encoder cache on local disk and facilitate transfer. To enable the encoder disaggregation feature, add the following configuration:

# Add to encoder instance: 
--ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_producer",
    "ec_connector_extra_config": {
        "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
    }
}' 

# Add to prefill/prefill+decode instance: 
--ec-transfer-config '{
    "ec_connector": "ECSharedStorageConnector",
    "ec_role": "ec_consumer",
    "ec_connector_extra_config": {
        "shared_storage_path": "'"$EC_SHARED_STORAGE_PATH"'"
    }
}' 

$EC_SHARED_STORAGE_PATH is the path where the EC connector temporarily stores the cache.

If you enable prefill instance (--prefill-servers-urls not disabled), you will need --kv-transfer-config to facilitate the PD disaggregation. Currently, we use the NixlConnector for this purpose. Refer to tests/v1/kv_connector/nixl_integration for more example codes on PD disaggregation with Nixl.

# Add to prefill instance:    
--kv-transfer-config '{
    "kv_connector": "NixlConnector",
    "kv_role": "kv_producer"
}' 

# Add to decode instance:
--kv-transfer-config '{
    "kv_connector": "NixlConnector",
    "kv_role": "kv_consumer"
}' 

Proxy Instance Flags (disagg_epd_proxy.py)

Flag Description
--encode-servers-urls Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion.
--prefill-servers-urls Comma-separated list of prefill endpoints. Set to disable, none, or "" to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode).
--decode-servers-urls Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list.
--host, --port Bind address for the proxy itself (defaults: 0.0.0.0:8000).

Example usage: For E + PD setup:

$ python disagg_encoder_proxy.py \
      --encode-servers-urls "http://e1:8001,http://e2:8002" \
      --prefill-servers-urls "disable" \
      --decode-servers-urls "http://pd1:8003,http://pd2:8004"

For E + P + D setup:

$ python disagg_encoder_proxy.py \
      --encode-servers-urls "http://e1:8001,http://e2:8001" \
      --prefill-servers-urls "http://p1:8003,http://p2:8004" \ 
      --decode-servers-urls "http://d1:8005,http://d2:8006"