mirror of https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-03-16 14:27:19 +08:00

[Docs] Merge design docs for a V1 only future (#21832)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

parent 04e38500ee
commit f693b067a2
@@ -56,9 +56,7 @@ nav:
     - contributing/model/tests.md
     - contributing/model/multimodal.md
   - CI: contributing/ci
- - Design Documents:
-   - V0: design
-   - V1: design/v1
+ - Design Documents: design
 - API Reference:
   - Summary: api/README.md
   - Contents:
@@ -1,40 +0,0 @@
# Automatic Prefix Caching

The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory, eliminating memory fragmentation by allocating memory on demand.

To automatically cache the KV cache, we rely on the following key observation: each KV block can be uniquely identified by the tokens within the block together with the tokens in the prefix before the block.

```text
                    Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```

In the example above, the KV cache in the first block can be uniquely identified by the tokens "A gentle breeze stirred". The third block can be uniquely identified by the tokens in the block, "laughed in the distance", together with the prefix tokens "A gentle breeze stirred the leaves as children". Therefore, we can build the following one-to-one mapping:

```text
hash(prefix tokens + block tokens) <--> KV Block
```

With this mapping, we can add another indirection in vLLM's KV cache management. Previously, each sequence in vLLM maintained a mapping from its logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash values and maintain a global hash table of all the physical blocks. In this way, all KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
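As an illustration, the chained hashing scheme can be sketched in a few lines of Python. This is a hypothetical helper, not vLLM's actual code; a block size of 4 tokens is assumed to match the example above.

```python
# Illustrative sketch of hash(prefix tokens + block tokens) -> block identity.
# Assumption: BLOCK_SIZE = 4 to match the "A gentle breeze stirred" example.
from typing import Optional

BLOCK_SIZE = 4

def block_hashes(tokens: list[str]) -> list[int]:
    """Compute a hash for each full block, chaining in the hash of the prefix."""
    hashes: list[int] = []
    prev_hash: Optional[int] = None
    for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        block = tuple(tokens[start:start + BLOCK_SIZE])
        # Hashing (prev_hash, block) is equivalent to hashing prefix + block tokens,
        # since prev_hash already uniquely identifies the prefix.
        h = hash((prev_hash, block))
        hashes.append(h)
        prev_hash = h
    return hashes

tokens = "A gentle breeze stirred the leaves as children laughed in the distance".split()
other = "A gentle breeze stirred through the quiet old town".split()
# Requests sharing the first 4 tokens map to the same hash -> same physical block.
assert block_hashes(tokens)[0] == block_hashes(other)[0]
```

Two requests that share a prefix therefore produce identical hashes for the shared blocks, which is what lets the global hash table map them to one physical block.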
This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all blocks are independent of each other and each can be allocated and freed on its own, which enables us to manage the KV cache like an ordinary cache in an operating system.

## Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with a previous request, the KV cache of the shared prompt can be used directly for the new request without recomputation. However, the total KV cache space is limited, and we have to decide which KV blocks to keep or evict when the cache is full.

Managing the KV cache with a hash table allows us to implement flexible caching policies. As an example, current vLLM implements the following eviction policy:

* When there are no free blocks left, we evict a KV block whose reference count (i.e., the number of current requests using the block) equals 0.
* If there are multiple blocks with a reference count of 0, we evict the least recently used (LRU) block first.
* If there are multiple blocks whose last access times are the same, we evict the block at the end of the longest prefix (i.e., the one with the maximum number of blocks before it) first.

Note that when applied to models with full attention, this eviction policy effectively implements the same policy as [RadixAttention](https://lmsys.org/blog/2024-01-17-sglang/), which evicts reference-count-zero and least recently used leaf nodes in the prefix tree first.
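The three rules above can be sketched as a single victim-selection function. This is an illustrative sketch only; the `Block` fields and function names are assumptions, not vLLM's actual data structures.

```python
# Sketch of the eviction policy: refcount 0, then LRU, then longest prefix.
from dataclasses import dataclass

@dataclass
class Block:
    ref_count: int          # number of current requests using the block
    last_access: float      # timestamp of the last access
    num_prefix_blocks: int  # how many blocks precede this one in its sequence

def pick_victim(blocks: list[Block]) -> Block:
    """Pick the block to evict when no free blocks remain."""
    candidates = [b for b in blocks if b.ref_count == 0]
    if not candidates:
        raise RuntimeError("no evictable block: all blocks are in use")
    # Least recently used first; ties broken by evicting the block at the
    # end of the longest prefix (most blocks before it).
    return min(candidates, key=lambda b: (b.last_access, -b.num_prefix_blocks))

blocks = [
    Block(ref_count=1, last_access=1.0, num_prefix_blocks=0),  # in use, ineligible
    Block(ref_count=0, last_access=2.0, num_prefix_blocks=0),
    Block(ref_count=0, last_access=2.0, num_prefix_blocks=5),  # same age, longer prefix
]
assert pick_victim(blocks) is blocks[2]
```

Because the candidate set is recomputed from reference counts each time, blocks become evictable the moment their last request finishes, with no tree bookkeeping required.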
However, hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and to implement eviction policies beyond the one above:

* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply include the LoRA ID the request is querying in the hash of each KV block to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities, for example, perceptual hashing for images so that similar input images share cache entries.
@@ -1,4 +1,4 @@
-# Integration with HuggingFace
+# Integration with Hugging Face

This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
@@ -1,8 +1,10 @@
# P2P NCCL Connector

An implementation of xPyD with dynamic scaling based on point-to-point communication, partly inspired by Dynamo.

-# Detailed Design
+## Detailed Design

-## Overall Process
+### Overall Process

As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow:

1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface.
@@ -15,7 +17,7 @@ As shown in Figure 1, the overall process of this **PD disaggregation** solution

![image1](https://github.com/user-attachments/assets/fb01bde6-755b-49f7-935b-25a85e48e587)

-## Proxy/Router (Demo)
+### Proxy/Router (Demo)

A simple HTTP service acts as the entry point for client requests and starts a background thread to listen for P/D instances reporting their HTTP IP and PORT, as well as ZMQ IP and PORT. It maintains a dictionary of `http_addr -> zmq_addr`. The `http_addr` is the IP:PORT for the vLLM instance's request, while the `zmq_addr` is the address for KV cache handshake and metadata reception.
@@ -29,13 +31,13 @@ Currently, to quickly verify whether xPyD can work, a round-robin selection of 1

Each P/D instance periodically sends a heartbeat packet to the Proxy/Router (currently every 3 seconds) to register (i.e., report `http_addr -> zmq_addr`) and keep the connection alive. If an instance crashes and fails to send a ping for a certain period of time, the Proxy/Router will remove the timed-out instance (this feature has not yet been developed).

-## KV Cache Transfer Methods
+### KV Cache Transfer Methods

There are three methods for KV cache transfer: PUT, GET, and PUT_ASYNC. These methods can be specified using the `--kv-transfer-config` and `kv_connector_extra_config` parameters, specifically through the `send_type` field. Both PUT and PUT_ASYNC involve the P instance actively sending the KV cache to the D instance. The difference is that PUT is a synchronous transfer method that blocks the main process, while PUT_ASYNC is asynchronous: it uses a dedicated thread for sending the KV cache, so it does not block the main process. In contrast, with the GET method, the P instance saves the KV cache to a memory buffer after computing the prefill, and the D instance actively retrieves the computed KV cache from the P instance once it has allocated space for it.

Experimental results have shown that the performance of these methods, from highest to lowest, is: PUT_ASYNC → GET → PUT.
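A sketch of how the `send_type` field fits into the config: the snippet below builds a `--kv-transfer-config` value as described in the text. The field names come from this document and the commands later in this page; the specific values are examples, not a verified configuration.

```python
# Illustrative: assembling a --kv-transfer-config JSON with an explicit
# send_type (PUT, GET, or PUT_ASYNC) inside kv_connector_extra_config.
import json

config = {
    "kv_connector": "P2pNcclConnector",
    "kv_role": "kv_producer",
    "kv_buffer_size": "1e1",
    "kv_port": "21001",
    "kv_connector_extra_config": {
        "send_type": "PUT_ASYNC",  # one of PUT, GET, PUT_ASYNC
        "proxy_ip": "10.0.1.1",
        "proxy_port": "30001",
        "http_port": "20001",
    },
}
# The serialized JSON is what gets passed to vllm serve:
print("--kv-transfer-config", json.dumps(config))
```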
-## P2P Communication via ZMQ & NCCL
+### P2P Communication via ZMQ & NCCL

As long as the address of the counterpart is known, point-to-point KV cache transfer (using NCCL) can be performed, without being constrained by rank or world size. This supports dynamic scaling (expansion and contraction) of instances with PD disaggregation, meaning that adding or removing P/D instances does not require a full system restart.
@@ -43,7 +45,7 @@ Each P/D instance only needs to create a single `P2pNcclEngine` instance. This i

When a P instance and a D instance transmit the KV cache for the first time, they need to establish a ZMQ connection and an NCCL group. For subsequent KV cache transmissions, this ZMQ connection and NCCL group are reused. The NCCL group consists of only two ranks, meaning the world size is 2. This design is intended to support dynamic scaling, so adding or removing P/D instances does not require a full system restart. As long as the address of the counterpart is known, point-to-point KV cache transmission can be performed, without being restricted by rank or world size.

-## NCCL Group Topology
+### NCCL Group Topology

Currently, only symmetric TP (Tensor Parallelism) methods are supported for KV cache transmission. Asymmetric TP and PP (Pipeline Parallelism) methods will be supported in the future. Figure 2 illustrates the 1P2D setup, where each instance has a TP degree of 2. There are a total of 7 NCCL groups: the three vLLM instances each have one internal NCCL group with TP=2. Additionally, the 0th GPU card of the P instance establishes an NCCL group with the 0th GPU card of each D instance, and similarly, the 1st GPU card of the P instance establishes an NCCL group with the 1st GPU card of each D instance.
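The group count above can be checked with a small calculation. This is a sketch that simply encodes the description for symmetric TP; the helper name is ours, not part of vLLM.

```python
# Counting NCCL groups for xPyD with symmetric TP, per the topology described:
# one internal TP group per instance, plus one 2-rank transfer group for each
# (P GPU rank, matching D GPU rank) pair.
def num_nccl_groups(num_p: int, num_d: int, tp: int) -> int:
    internal = num_p + num_d       # one TP group inside each vLLM instance
    transfer = num_p * num_d * tp  # rank i of each P pairs with rank i of each D
    return internal + transfer

# The 1P2D example with TP=2: 3 internal groups + 4 transfer groups = 7.
assert num_nccl_groups(num_p=1, num_d=2, tp=2) == 7
```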
@@ -51,7 +53,7 @@ Currently, only symmetric TP (Tensor Parallelism) methods are supported for KVCa

Each NCCL group occupies a certain amount of GPU memory buffer for communication, the size of which is primarily influenced by the `NCCL_MAX_NCHANNELS` environment variable. When `NCCL_MAX_NCHANNELS=16`, an NCCL group typically occupies 100 MB, while with `NCCL_MAX_NCHANNELS=8` it usually takes up 52 MB. For large-scale xPyD configurations, such as DeepSeek's 96P144D, this implementation is currently not feasible. Moving forward, we are considering using RDMA for point-to-point communication and are also keeping an eye on UCCL.
-## GPU Memory Buffer and Tensor Memory Pool
+### GPU Memory Buffer and Tensor Memory Pool

The trade-off in the size of the memory buffer is as follows: for P instances, the memory buffer is not required in PUT and PUT_ASYNC modes, but it is necessary in GET mode. For D instances, a memory buffer is needed in all three modes. The memory buffer for D instances should not be too large, and likewise for P instances in GET mode. The memory buffer of D instances is used to temporarily store the KV cache sent by P instances. If it is too large, it reduces the KV cache space available for normal inference on D instances, thereby decreasing the inference batch size and ultimately reducing output throughput. The size of the memory buffer is configured by the `kv_buffer_size` parameter, measured in bytes, and is typically set to 5%-10% of the GPU memory size.
@@ -59,15 +61,15 @@ If the `--max-num-seqs` parameter for P instances is set to a large value, due t

To address the above issues, I have designed and developed a local tensor memory pool for storing the KV cache, inspired by the buddy system used in the Linux memory subsystem. Since host memory is sufficiently large, typically in the TB range on servers, there is no need to consider prefix caching or block-based designs to reuse memory and save space. When the memory buffer is insufficient, the KV cache can be stored directly in the tensor memory pool, and D instances can subsequently retrieve it from there. The read and write speed is that of PCIe; PCIe 4.0 runs at approximately 21 GB/s, which is usually faster than the prefill speed (otherwise, solutions like Mooncake and LMCache would not be necessary). The tensor memory pool acts as a flood diversion area, typically unused except during sudden traffic surges. In the worst-case scenario, my solution performs no worse than the normal situation with a cache store.
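To make the buddy-system idea concrete, here is a toy power-of-two buddy allocator over a flat byte range. It is purely illustrative, modeled on the classic buddy scheme the text references, and is not vLLM's tensor memory pool implementation.

```python
class BuddyPool:
    """Toy power-of-two buddy allocator (offsets into one large buffer)."""

    def __init__(self, total_size: int, min_block: int = 256):
        assert total_size & (total_size - 1) == 0, "size must be a power of two"
        self.total = total_size
        self.min_block = min_block
        # block size -> set of free block offsets at that size
        self.free = {s: set() for s in self._sizes()}
        self.free[total_size].add(0)

    def _sizes(self):
        s = self.min_block
        while s <= self.total:
            yield s
            s *= 2

    def alloc(self, size: int) -> int:
        need = max(self.min_block, 1 << (size - 1).bit_length())  # round up to 2^k
        s = need
        while s <= self.total and not self.free[s]:  # find smallest block that fits
            s *= 2
        if s > self.total:
            raise MemoryError("pool exhausted")
        off = self.free[s].pop()
        while s > need:                 # split down, keeping each upper buddy free
            s //= 2
            self.free[s].add(off + s)
        return off

    def free_block(self, off: int, size: int) -> None:
        s = max(self.min_block, 1 << (size - 1).bit_length())
        while s < self.total:
            buddy = off ^ s             # buddy offset differs in exactly one bit
            if buddy not in self.free[s]:
                break
            self.free[s].remove(buddy)  # coalesce with the free buddy
            off = min(off, buddy)
            s *= 2
        self.free[s].add(off)

pool = BuddyPool(1024, min_block=256)
a = pool.alloc(300)    # rounds up to 512, carved from the 1024 block
b = pool.alloc(200)    # rounds up to 256
pool.free_block(b, 200)
pool.free_block(a, 300)
assert pool.free[1024] == {0}  # fully coalesced back into one block
```

The appeal for a KV cache overflow pool is the same as in the kernel: allocation and free are O(log n), and freed neighbors coalesce automatically, so burst traffic does not fragment the pool.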
-# Install vLLM
+## Install vLLM

```shell
pip install "vllm>=0.9.2"
```
-# Run xPyD
+## Run xPyD

-## Instructions
+### Instructions

- The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model.
- Pay attention to the setting of `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the KV cache size: if it is too small, the GPU memory buffer for temporarily storing the received KV cache will overflow, causing the KV cache to be stored in the tensor memory pool and increasing latency; if it is too large, the KV cache available for inference will be reduced, leading to a smaller batch size and decreased throughput.
- For Prefill instances, when using non-GET mode, `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive a KV cache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the KV cache sent to the D instance.

@@ -79,16 +81,16 @@ pip install "vllm>=0.9.2"

- Supports multiple nodes; you just need to modify the `proxy_ip` and `proxy_port` in `--kv-transfer-config`.
- In the following examples, it is assumed that **the proxy's IP is 10.0.1.1**.
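The 10% rule of thumb above can be sketched as a one-line calculation. The 10% figure is the empirical value from this document; 80 GB matches the A800 used in these examples.

```python
# Sketch: picking kv_buffer_size as ~10% of GPU memory, per the guidance above.
GiB = 1024 ** 3

def kv_buffer_size(gpu_mem_bytes: int, fraction: float = 0.10) -> int:
    return int(gpu_mem_bytes * fraction)

size = kv_buffer_size(80 * GiB)      # A800 80GB
assert abs(size - 8e9) / 8e9 < 0.1   # close to the 8e9 used for D instances below
```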
-## Run 1P3D
+### Run 1P3D

-### Proxy (e.g. 10.0.1.1)
+#### Proxy (e.g. 10.0.1.1)

```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```

-### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
+#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)

??? console "Command"

@@ -110,7 +112,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```

-### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
+#### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)

??? console "Command"

@@ -132,7 +134,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```

-### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
+#### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)

??? console "Command"

@@ -154,7 +156,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```

-### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
+#### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)

??? console "Command"

@@ -176,16 +178,16 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
-## Run 3P1D
+### Run 3P1D

-### Proxy (e.g. 10.0.1.1)
+#### Proxy (e.g. 10.0.1.1)

```shell
cd {your vllm directory}/examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
python3 disagg_proxy_p2p_nccl_xpyd.py &
```

-### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
+#### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)

??? console "Command"

@@ -207,7 +209,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20001"}}' > /var/vllm.log 2>&1 &
```

-### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
+#### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)

??? console "Command"

@@ -229,7 +231,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20002"}}' > /var/vllm.log 2>&1 &
```

-### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
+#### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)

??? console "Command"

@@ -251,7 +253,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_buffer_size":"1e1","kv_port":"23001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20003"}}' > /var/vllm.log 2>&1 &
```

-### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
+#### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)

??? console "Command"

@@ -273,7 +275,7 @@ python3 disagg_proxy_p2p_nccl_xpyd.py &
'{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"8e9","kv_port":"24001","kv_connector_extra_config":{"proxy_ip":"10.0.1.1","proxy_port":"30001","http_port":"20004"}}' > /var/vllm.log 2>&1 &
```
-# Single request
+## Single request

```shell
curl -X POST -s http://10.0.1.1:10001/v1/completions \
@@ -286,7 +288,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
}'
```

-# Benchmark
+## Benchmark

??? console "Command"

@@ -310,14 +312,14 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
--num-prompts 1000
```

-# Shut down
+## Shut down

```shell
pgrep python | xargs kill -9 && pkill -f python
```

-# Test data
+## Test data

-## **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s
+### **Scenario**: 1K input & 200 output tokens, E2E P99 latency ~2s

![image2](https://github.com/user-attachments/assets/f2146a2c-9c1c-4d33-a912-9cd3d1651acb)
@@ -1,5 +1,9 @@
# vLLM Paged Attention

+!!! warning
+    This document is being kept in the vLLM documentation for historical purposes.
+    It no longer describes the code used in vLLM today.
+
Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
This kernel is designed to be compatible with