
Disaggregated Encoder

A disaggregated encoder runs the vision-encoder stage of a multimodal LLM in a process separate from the prefill/decode stage. Deploying the two stages as independent vLLM instances brings three practical benefits:

  1. Independent, fine-grained scaling
  2. Lower time-to-first-token (TTFT)
  3. Cross-process reuse and caching of encoder outputs

Design doc: https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE


1 Motivation

1. Independent, fine-grained scaling

  • Vision encoders are lightweight, while language models are orders of magnitude larger.
  • The language model can be parallelised without affecting the encoder fleet.
  • Encoder nodes can be added or removed independently.

2. Lower time-to-first-token (TTFT)

  • Language-only requests bypass the vision encoder entirely.
  • Encoder output is injected only at required attention layers, shortening the prefill critical path.

3. Cross-process reuse and caching

  • In-process encoders confine reuse to a single worker.
  • A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.

2 Usage Example

The current reference pathway is SharedStorageConnector.
The ready-to-run scripts below show the workflow:

1 Encoder instance + 1 PD instance: examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh

1 Encoder instance + 1 Prefill instance + 1 Decode instance: examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh
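
Once the instances are up, clients send requests to the serving endpoint exactly as with a monolithic deployment; the encoder hop is transparent. A minimal client sketch, assuming an OpenAI-compatible endpoint on localhost:8000 and a placeholder model name (the example scripts configure the actual ports and model):

```python
# Minimal client sketch: send a multimodal request to the serving endpoint.
# The endpoint URL and model name are placeholders; see the example scripts
# for the actual ports and model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```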


3 Test Script

Please refer to the directory tests/v1/ec_connector

4 Development

Disaggregated encoding is implemented by running two parts:

  • Encoder instance: a vLLM instance that performs vision encoding.
  • Prefill/Decode (PD) instance(s): vLLM instance(s) that run language prefill and decode.
    • PD can run either as a single regular instance with disagg_encoder_example.sh (E->PD) or as disaggregated Prefill and Decode instances with disagg_epd_example.sh (E->P->D)

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance.
All related code is under vllm/distributed/ec_transfer.

Key abstractions

  • ECConnector: the interface for retrieving EC caches produced by the encoder.
    • Scheduler role: checks cache existence and schedules loads.
    • Worker role: loads the embeddings into memory.
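
The split between the two roles can be pictured as follows. This is a simplified sketch with illustrative names and signatures, not the actual API under vllm/distributed/ec_transfer:

```python
# Simplified sketch of the ECConnector role split; names and signatures are
# illustrative only -- see vllm/distributed/ec_transfer for the real API.
from abc import ABC, abstractmethod

import torch


class ECConnectorSketch(ABC):
    """One connector class, instantiated in two roles."""

    # --- Scheduler role: runs in the scheduler process ---
    @abstractmethod
    def has_cache(self, mm_hash: str) -> bool:
        """Check whether encoder embeddings for this multimodal input
        already exist in the shared store (so encoding can be skipped)."""

    @abstractmethod
    def schedule_load(self, mm_hash: str) -> None:
        """Record that the worker must load this entry before prefill."""

    # --- Worker role: runs in the model-runner process ---
    @abstractmethod
    def load_cache(self, mm_hash: str) -> torch.Tensor:
        """Fetch the encoder embeddings into device memory so the
        language model can consume them at the injection layers."""
```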

Here is a figure illustrating the disaggregated encoder flow:

Disaggregated Encoder Flow

For the PD disaggregation part, the Prefill instance receives the encoder cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 output token) and then transfers the KV cache to the Decode instance, which performs the remaining execution. The KV transfer happens entirely after the Prefill instance's execution step.

docs/features/disagg_prefill.md gives a brief overview of disaggregated prefill (v0).

The example setup uses the NixlConnector from vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py and follows tests/v1/kv_connector/nixl_integration/toy_proxy_server.py to facilitate the KV transfer between P and D.
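
At its core, the proxy forwards each request to the Prefill instance for the single prefill step and then replays it against the Decode instance for the rest of generation. The sketch below condenses that control flow; the endpoints, ports, and payload fields are illustrative placeholders, and toy_proxy_server.py is the working reference:

```python
# Condensed sketch of the E->P->D proxy control flow; endpoints, ports, and
# payload fields are illustrative -- toy_proxy_server.py is the working
# reference implementation.
import copy

import requests

PREFILL_URL = "http://localhost:8100/v1/completions"  # placeholder port
DECODE_URL = "http://localhost:8200/v1/completions"   # placeholder port


def proxy_request(payload: dict) -> dict:
    # Step 1: the Prefill instance runs one step (prefill -> 1 token) and,
    # via the KV connector (e.g. NixlConnector), makes its KV cache
    # available to the Decode instance.
    prefill_payload = copy.deepcopy(payload)
    prefill_payload["max_tokens"] = 1
    requests.post(PREFILL_URL, json=prefill_payload).raise_for_status()

    # Step 2: the Decode instance pulls the transferred KV cache and
    # generates the remaining tokens.
    resp = requests.post(DECODE_URL, json=payload)
    resp.raise_for_status()
    return resp.json()
```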