[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung 2025-07-30 21:36:34 +08:00 committed by GitHub
parent d979dd6beb
commit fcfd1eb9c5
10 changed files with 22 additions and 11 deletions

7 image files changed (binary, sizes unchanged: 27, 109, 17, 41, 32, 42, and 167 KiB). Per the path updates in the diff below, the paged attention figures were moved from `assets/kernel/` to `assets/design/paged_attention/`.

@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention
 
 !!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
     It no longer describes the code used in vLLM today.
 
 Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
 
 <figure markdown="span">
-![](../../assets/kernel/query.png){ align="center" alt="query" width="70%" }
+![](../assets/design/paged_attention/query.png){ align="center" alt="query" width="70%" }
 </figure>
 
 Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
 total of 128 elements divided into 128 / 4 = 32 vecs.
 
 <figure markdown="span">
-![](../../assets/kernel/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
+![](../assets/design/paged_attention/q_vecs.png){ align="center" alt="q_vecs" width="70%" }
 </figure>
 
 ```cpp
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block,
 assigned head and assigned token.
 
 <figure markdown="span">
-![](../../assets/kernel/key.png){ align="center" alt="key" width="70%" }
+![](../assets/design/paged_attention/key.png){ align="center" alt="key" width="70%" }
 </figure>
 
 The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one
 thread group) separately.
 
 <figure markdown="span">
-![](../../assets/kernel/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
+![](../assets/design/paged_attention/k_vecs.png){ align="center" alt="k_vecs" width="70%" }
 </figure>
 
 ```cpp
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
 ## Value
 
 <figure markdown="span">
-![](../../assets/kernel/value.png){ align="center" alt="value" width="70%" }
+![](../assets/design/paged_attention/value.png){ align="center" alt="value" width="70%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
+![](../assets/design/paged_attention/logits_vec.png){ align="center" alt="logits_vec" width="50%" }
 </figure>
 
 <figure markdown="span">
-![](../../assets/kernel/v_vec.png){ align="center" alt="v_vec" width="70%" }
+![](../assets/design/paged_attention/v_vec.png){ align="center" alt="v_vec" width="70%" }
 </figure>
 
 Now we need to retrieve the value data and perform dot multiplication
@@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
 Finally, we need to iterate over different assigned head positions
 and write out the corresponding accumulated result based on the
 `out_ptr`.
+
+## Citation
+
+```bibtex
+@inproceedings{kwon2023efficient,
+  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
+  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
+  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
+  year={2023}
+}
+```
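The hunks above quote only fragments of the kernel walk-through, so here is a minimal host-side C++ sketch of the pointer arithmetic named in the hunk header (`const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;`) and of the `128 / 4 = 32` vec split mentioned in the query section. The buffer shape, `VEC_SIZE`, and the toy indices are assumptions made for illustration only; they are not taken from the kernel's actual launch configuration.

```cpp
// Host-side sketch (not the vLLM kernel): how a thread's q_ptr is derived and
// how the HEAD_SIZE elements behind it split into vecs.
#include <cstdio>
#include <vector>

using scalar_t = float;

int main() {
    constexpr int HEAD_SIZE = 128;                  // head dimension from the example
    constexpr int VEC_SIZE  = 4;                    // assumed elements per vec
    constexpr int NUM_VECS  = HEAD_SIZE / VEC_SIZE; // 128 / 4 = 32

    const int num_heads = 8;                        // arbitrary toy value
    const int num_seqs  = 2;                        // arbitrary toy value
    const int q_stride  = num_heads * HEAD_SIZE;    // elements per sequence in q

    std::vector<scalar_t> q(num_seqs * q_stride, 0.0f);

    const int seq_idx  = 1;
    const int head_idx = 3;

    // Same arithmetic as the quoted kernel line: each thread gets a q_ptr
    // pointing at the query head it was assigned.
    const scalar_t* q_ptr = q.data() + seq_idx * q_stride + head_idx * HEAD_SIZE;

    // The HEAD_SIZE elements behind q_ptr are read as NUM_VECS contiguous
    // chunks of VEC_SIZE elements ("vecs").
    scalar_t checksum = 0.0f;
    for (int vec = 0; vec < NUM_VECS; ++vec) {
        const scalar_t* vec_ptr = q_ptr + vec * VEC_SIZE;
        for (int i = 0; i < VEC_SIZE; ++i) {
            checksum += vec_ptr[i];
        }
    }

    std::printf("%d vecs of %d elements cover a head of size %d (checksum %.1f)\n",
                NUM_VECS, VEC_SIZE, HEAD_SIZE, checksum);
    return 0;
}
```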

@@ -1,4 +1,4 @@
-# vLLM's Plugin System
+# Plugin System
 
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.

@@ -1,4 +1,4 @@
-# vLLM's `torch.compile` integration
+# `torch.compile` integration
 
 In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.