diff --git a/docs/assets/kernel/k_vecs.png b/docs/assets/design/paged_attention/k_vecs.png
similarity index 100%
rename from docs/assets/kernel/k_vecs.png
rename to docs/assets/design/paged_attention/k_vecs.png
diff --git a/docs/assets/kernel/key.png b/docs/assets/design/paged_attention/key.png
similarity index 100%
rename from docs/assets/kernel/key.png
rename to docs/assets/design/paged_attention/key.png
diff --git a/docs/assets/kernel/logits_vec.png b/docs/assets/design/paged_attention/logits_vec.png
similarity index 100%
rename from docs/assets/kernel/logits_vec.png
rename to docs/assets/design/paged_attention/logits_vec.png
diff --git a/docs/assets/kernel/q_vecs.png b/docs/assets/design/paged_attention/q_vecs.png
similarity index 100%
rename from docs/assets/kernel/q_vecs.png
rename to docs/assets/design/paged_attention/q_vecs.png
diff --git a/docs/assets/kernel/query.png b/docs/assets/design/paged_attention/query.png
similarity index 100%
rename from docs/assets/kernel/query.png
rename to docs/assets/design/paged_attention/query.png
diff --git a/docs/assets/kernel/v_vec.png b/docs/assets/design/paged_attention/v_vec.png
similarity index 100%
rename from docs/assets/kernel/v_vec.png
rename to docs/assets/design/paged_attention/v_vec.png
diff --git a/docs/assets/kernel/value.png b/docs/assets/design/paged_attention/value.png
similarity index 100%
rename from docs/assets/kernel/value.png
rename to docs/assets/design/paged_attention/value.png
diff --git a/docs/design/paged_attention.md b/docs/design/paged_attention.md
index ef525e8c60412..fb991a35caf30 100644
--- a/docs/design/paged_attention.md
+++ b/docs/design/paged_attention.md
@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention
 
 !!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
     It no longer describes the code used in vLLM today.
 
 Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
- ![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+ ![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
 </div>
 Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.
 </div>
- ![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+ ![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
 </div>
 ```cpp
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block, assigned head
 and assigned token.
 </div>
- ![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+ ![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
 </div>
 The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one thread group)
 separately.
 </div>
- ![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+ ![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
 </div>
 ```cpp
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
 ## Value
 </div>
- ![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
+ ![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
 </div>
- ![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
+ ![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
 </div>
- ![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
+ ![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
 </div>
 Now we need to retrieve the value data and perform dot multiplication
@@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
 Finally, we need to iterate over different assigned head positions
 and write out the corresponding accumulated result based on the
 `out_ptr`.
+
+## Citation
+
+```bibtex
+@inproceedings{kwon2023efficient,
+    title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+    author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+    booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+    year={2023}
+}
+```
diff --git a/docs/design/plugin_system.md b/docs/design/plugin_system.md
index 23a05ac719ce2..ca1c2c2305d91 100644
--- a/docs/design/plugin_system.md
+++ b/docs/design/plugin_system.md
@@ -1,4 +1,4 @@
-# vLLM's Plugin System
+# Plugin System
 
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
diff --git a/docs/design/torch_compile.md b/docs/design/torch_compile.md
index 2d76e7f3adc5c..47ac4958dbf7f 100644
--- a/docs/design/torch_compile.md
+++ b/docs/design/torch_compile.md
@@ -1,4 +1,4 @@
-# vLLM's `torch.compile` integration
+# `torch.compile` integration
 
 In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.