# Disaggregated Encoder
A **disaggregated encoder** runs the vision-encoder stage of a multimodal LLM in a process that is separate from the pre-fill / decoder stage. Deploying these two stages in independent vLLM instances brings three practical benefits:

1. **Independent, fine-grained scaling**
2. **Lower time-to-first-token (TTFT)**
3. **Cross-process reuse and caching of encoder outputs**

Design doc: <https://docs.google.com/document/d/1aed8KtC6XkXtdoV87pWT0a8OJlZ-CpnuLLzmR8l9BAE>
---

## 1 Motivation
### 1. Independent, fine-grained scaling

* Vision encoders are lightweight, while language models are orders of magnitude larger.
* The language model can be parallelised without affecting the encoder fleet.
* Encoder nodes can be added or removed independently.
### 2. Lower time-to-first-token (TTFT)

* Language-only requests bypass the vision encoder entirely.
* Encoder output is injected only at the required attention layers, shortening the pre-fill critical path.
### 3. Cross-process reuse and caching

* In-process encoders confine reuse to a single worker.
* A remote, shared cache lets any worker retrieve existing embeddings, eliminating redundant computation.

---
## 2 Usage Example

The current reference pathway is **SharedStorageConnector**. The ready-to-run scripts below show the workflow:

* 1 Encoder instance + 1 PD instance:
  `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_encoder_example.sh`
* 1 Encoder instance + 1 Prefill instance + 1 Decode instance:
  `examples/online_serving/disaggregated_encoder/shared_storage_connector/disagg_epd_example.sh`
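Once the instances and the proxy from either script are up, requests go to the fronting endpoint like any other OpenAI-compatible server. Below is a minimal client sketch; the port (`8000`), model name, and image URL are illustrative assumptions, and the actual values are configured inside the scripts:

```python
# Hypothetical client sketch: the proxy port, model name, and image URL are
# assumptions; the example scripts configure the actual values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
# The image is encoded on the encoder instance; the PD side loads the cached
# embeddings instead of re-running the vision encoder.
print(response.choices[0].message.content)
```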
---

## 3 Test Script

Please refer to the directory `tests/v1/ec_connector`.

## 4 Development
Disaggregated encoding is implemented by running two parts:

* **Encoder instance** – a vLLM instance that performs vision encoding.
* **Prefill/Decode (PD) instance(s)** – runs language pre-fill and decode.
  * PD can run either as a single instance (`disagg_encoder_example.sh`, E->PD) or as separate Prefill and Decode instances (`disagg_epd_example.sh`, E->P->D).

A connector transfers encoder-cache (EC) embeddings from the encoder instance to the PD instance. All related code is under `vllm/distributed/ec_transfer`.
### Key abstractions

* **ECConnector** – interface for retrieving EC caches produced by the encoder (sketched below).
  * *Scheduler role* – checks cache existence and schedules loads.
  * *Worker role* – loads the embeddings into memory.
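A minimal interface sketch of this split follows. It illustrates the division of labour only; the method names and signatures are assumptions, not the actual API in `vllm/distributed/ec_transfer`:

```python
# Illustrative sketch only: method names and signatures are assumptions,
# not the real ECConnector API from vllm/distributed/ec_transfer.
from abc import ABC, abstractmethod

import torch


class ECConnectorSketch(ABC):
    """Moves encoder-cache (EC) embeddings between vLLM instances."""

    # Scheduler role: runs in the scheduler process of the PD instance.
    @abstractmethod
    def has_cache(self, mm_hash: str) -> bool:
        """Return True if embeddings for this multimodal input already exist."""

    @abstractmethod
    def schedule_load(self, mm_hash: str) -> None:
        """Queue the cached embeddings to be loaded before pre-fill runs."""

    # Worker role: runs in the model-executor workers.
    @abstractmethod
    def load_cache(self, mm_hash: str) -> torch.Tensor:
        """Fetch the queued embeddings into device memory for injection."""
```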
Here is a figure illustrating the disaggregated encoder flow:

![disaggregated encoder](../../assets/features/disagg_encoder_workflow.png)
For the PD-disaggregated setup (E->P->D), the Prefill instance receives the encoder cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> first output token) and then transfers its KV cache to the Decode instance, which performs the remaining decoding. The KV transfer happens only after the Prefill instance has finished executing.

`docs/features/disagg_prefill.md` gives a brief overview of disaggregated prefill (v0).

The example setup uses the **NixlConnector** from `vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py` for the KV transfer between P and D, with a proxy modeled on `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py`.
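As a rough sketch of the KV-transfer configuration involved (a minimal example assuming vLLM's `KVTransferConfig` API; the exact fields may differ across versions, and `disagg_epd_example.sh` is authoritative):

```python
# Rough sketch: assumes vLLM's KVTransferConfig API; exact fields may differ
# across versions -- disagg_epd_example.sh is authoritative.
from vllm.config import KVTransferConfig

# Both the Prefill and the Decode instance are started with a NixlConnector
# config; NIXL negotiates the transfer itself, so each side uses "kv_both".
nixl_config = KVTransferConfig(
    kv_connector="NixlConnector",
    kv_role="kv_both",
)
```

On the command line this corresponds to passing `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'` to each `vllm serve` invocation, while the proxy routes each request to P first and then hands it off to D.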