Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Signed-off-by: dtc <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
# MooncakeConnector Usage Guide

## About Mooncake

Mooncake aims to improve the inference efficiency of large language models (LLMs), especially in environments with slow object storage, by building a multi-level cache pool on DRAM/SSD resources connected by high-speed interconnects. Unlike traditional caching systems, Mooncake uses (GPUDirect) RDMA to transfer data directly with zero copies, while making full use of the multiple NICs on a single machine.

For more details, see the [Mooncake project](https://github.com/kvcache-ai/Mooncake) and the [Mooncake documentation](https://kvcache-ai.github.io/Mooncake/).

## Prerequisites

### Installation

Install Mooncake with `uv pip install mooncake-transfer-engine`.

Refer to the [official Mooncake repository](https://github.com/kvcache-ai/Mooncake) for more installation instructions.

## Usage

### Prefiller Node (192.168.0.2)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
```

### Decoder Node (192.168.0.3)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8020 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
```

### Proxy

```bash
python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py --prefiller-host 192.168.0.2 --prefiller-port 8010 --decoder-host 192.168.0.3 --decoder-port 8020
```

> NOTE: The Mooncake Connector currently uses the proxy from `nixl_integration`. A dedicated proxy will replace it in the future.

You can now send requests to the proxy server on port 8000.
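
As a minimal sketch of such a request, the snippet below posts a completion to the proxy with `curl`. It assumes the proxy is reachable at `localhost` and forwards the OpenAI-compatible `/v1/completions` route; neither is stated in this guide, so adjust `PROXY_URL` and the route to your deployment. The helper names `completion_payload` and `send_completion` are hypothetical.

```bash
# Assumption: the proxy runs on this machine; override PROXY_URL otherwise.
PROXY_URL="${PROXY_URL:-http://localhost:8000}"

# Print the JSON body of a short completion request.
# The model name matches the one passed to `vllm serve` above.
completion_payload() {
  printf '%s\n' '{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt": "San Francisco is a", "max_tokens": 32}'
}

# POST the request to the proxy (assumed route: /v1/completions).
send_completion() {
  curl -sS "${PROXY_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload)"
}
```

Calling `send_completion` once the prefiller, decoder, and proxy are all running should print the JSON response relayed by the proxy.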

## Environment Variables

- `VLLM_MOONCAKE_BOOTSTRAP_PORT`: Port for the Mooncake bootstrap server
    - Default: 8998
    - Required only for prefiller instances
    - Each vLLM worker needs a unique port on its host; using the same port number across different hosts is fine
    - For TP/DP deployments, each worker's port on a node is computed as `base_port + dp_rank * tp_size + tp_rank`
    - Used by the decoder to notify the prefiller

- `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT`: (Optional) Timeout, in seconds, after which the prefiller automatically releases the KV cache it holds for a request
    - Default: 480
    - If a request is aborted and the decoder has not yet notified the prefiller, the prefiller releases its KV-cache blocks after this timeout to avoid holding them indefinitely.
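
The port rule for `VLLM_MOONCAKE_BOOTSTRAP_PORT` can be sketched in a few lines of shell. The helper name `worker_bootstrap_port` and the example layout (TP size 2 with two DP ranks on one node) are assumptions for illustration; only the formula and the default of 8998 come from this guide.

```bash
# Hypothetical helper: bootstrap port of one worker on a node, following
# base_port + dp_rank * tp_size + tp_rank.
worker_bootstrap_port() {
  # $1 = base port, $2 = dp_rank, $3 = tp_size, $4 = tp_rank
  echo $(( $1 + $2 * $3 + $4 ))
}

# Base port: VLLM_MOONCAKE_BOOTSTRAP_PORT if set, else the documented default.
base_port="${VLLM_MOONCAKE_BOOTSTRAP_PORT:-8998}"
tp_size=2  # assumed tensor-parallel size

# List the port of every worker for two assumed DP ranks on this node.
for dp_rank in 0 1; do
  for tp_rank in 0 1; do
    port=$(worker_bootstrap_port "$base_port" "$dp_rank" "$tp_size" "$tp_rank")
    echo "dp_rank=${dp_rank} tp_rank=${tp_rank} port=${port}"
  done
done
```

With the default base port, this layout uses ports 8998 through 9001, so under these assumptions all four should be free on the prefiller node.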

## KV Role Options

- **kv_producer**: For prefiller instances that generate KV caches
- **kv_consumer**: For decoder instances that consume KV caches from the prefiller
- **kv_both**: Enables symmetric functionality where the connector can act as both producer and consumer. This provides flexibility for experimental setups and scenarios where the role distinction is not predetermined.
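
Since the three roles differ only in the `kv_role` field of the JSON passed to `--kv-transfer-config`, a small wrapper can help avoid quoting mistakes. This is a sketch only: `kv_transfer_config` is a hypothetical shell helper and not part of vLLM or Mooncake.

```bash
# Hypothetical helper: print the --kv-transfer-config JSON for a given role.
# Rejects any role other than the three listed in this guide.
kv_transfer_config() {
  case "$1" in
    kv_producer|kv_consumer|kv_both) ;;
    *) echo "unknown kv_role: $1" >&2; return 1 ;;
  esac
  printf '{"kv_connector":"MooncakeConnector","kv_role":"%s"}\n' "$1"
}

# Print the config string for each supported role.
for role in kv_producer kv_consumer kv_both; do
  kv_transfer_config "$role"
done
```

The output can then be passed to the server, for example `vllm serve Qwen/Qwen2.5-7B-Instruct --port 8010 --kv-transfer-config "$(kv_transfer_config kv_both)"`.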