[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Seven documentation images are touched by this change; their dimensions and file sizes (27, 109, 17, 41, 32, 42, and 167 KiB) are unchanged.
@@ -1,7 +1,7 @@
-# vLLM Paged Attention
+# Paged Attention

!!! warning
-    This document is being kept in the vLLM documentation for historical purposes.
+    This is a historical document based on the [original paper for vLLM](https://arxiv.org/abs/2309.06180).
    It no longer describes the code used in vLLM today.

Currently, vLLM utilizes its own implementation of a multi-head query
@@ -140,7 +140,7 @@ const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```

<figure markdown="span">
{ align="center" alt="query" width="70%" }
</figure>

Each thread defines its own `q_ptr` which points to the assigned
@@ -149,7 +149,7 @@ and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.

<figure markdown="span">
{ align="center" alt="q_vecs" width="70%" }
</figure>
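
To make the vec partitioning concrete, here is a minimal standalone C++ sketch (not the vLLM kernel itself) of how one thread group could split a 128-element query head into 4-element vecs; `HEAD_SIZE`, `VEC_SIZE`, and `THREAD_GROUP_SIZE` follow the walkthrough, while the loop shape and the printed layout are illustrative assumptions.

```cpp
#include <cstdio>

// Constants taken from the walkthrough; everything else here is a sketch.
constexpr int HEAD_SIZE = 128;         // elements per attention head
constexpr int VEC_SIZE = 4;            // elements fetched together as one vec
constexpr int THREAD_GROUP_SIZE = 2;   // threads cooperating on one query token
constexpr int NUM_VECS = HEAD_SIZE / VEC_SIZE;                     // 32 vecs total
constexpr int NUM_VECS_PER_THREAD = NUM_VECS / THREAD_GROUP_SIZE;  // 16 vecs each

int main() {
  for (int thread_group_offset = 0; thread_group_offset < THREAD_GROUP_SIZE;
       ++thread_group_offset) {
    std::printf("thread %d loads vecs:", thread_group_offset);
    // Threads in a group take interleaved vecs, so neighbouring threads read
    // neighbouring chunks of the query head (coalesced access on the GPU).
    for (int i = 0; i < NUM_VECS_PER_THREAD; ++i) {
      const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
      std::printf(" %d (q_ptr + %d)", vec_idx, vec_idx * VEC_SIZE);
    }
    std::printf("\n");
  }
  return 0;
}
```

With these numbers, thread 0 ends up with the even-numbered vecs and thread 1 with the odd-numbered ones, which matches the kind of interleaving the q_vecs diagram illustrates.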

```cpp
@@ -188,7 +188,7 @@ points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.

<figure markdown="span">
{ align="center" alt="key" width="70%" }
</figure>

The diagram above illustrates the memory layout for key data. It
@@ -203,7 +203,7 @@ elements for one token) that will be processed by 2 threads (one
thread group) separately.

<figure markdown="span">
{ align="center" alt="k_vecs" width="70%" }
</figure>
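
The offset arithmetic behind this layout can be sketched on the CPU. The snippet below assumes the `[num_blocks, num_kv_heads, head_size/x, block_size, x]` key-cache shape that the full walkthrough describes, with `x = 16 / sizeof(scalar_t)`; the helper name and the concrete sizes are illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

using scalar_t = uint16_t;                // e.g. fp16 storage
constexpr int HEAD_SIZE = 128;
constexpr int BLOCK_SIZE = 16;            // tokens per KV-cache block (assumed)
constexpr int NUM_KV_HEADS = 8;           // assumed
constexpr int X = 16 / sizeof(scalar_t);  // elements packed contiguously per token

// Offset (in elements) of element `elem` of the key vector for token
// `token_in_block`, head `head_idx`, inside cache block `block_idx`.
size_t key_offset(int block_idx, int head_idx, int token_in_block, int elem) {
  const int x_outer = elem / X;  // which X-sized chunk of the head
  const int x_inner = elem % X;  // position inside that chunk
  return ((((size_t)block_idx * NUM_KV_HEADS + head_idx) * (HEAD_SIZE / X) +
           x_outer) * BLOCK_SIZE + token_in_block) * X + x_inner;
}

int main() {
  // Elements 0..X-1 of one token are contiguous; the next chunk of the same
  // token is BLOCK_SIZE * X elements further on, which is why a thread group
  // walks a token's key in X-sized pieces.
  std::printf("elem 0 -> offset %zu\n", key_offset(3, 2, 5, 0));
  std::printf("elem %d -> offset %zu\n", X, key_offset(3, 2, 5, X));
  return 0;
}
```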

```cpp
@@ -362,15 +362,15 @@ later steps. Now, it should store the normalized softmax result of
## Value

<figure markdown="span">
{ align="center" alt="value" width="70%" }
</figure>

<figure markdown="span">
{ align="center" alt="logits_vec" width="50%" }
</figure>

<figure markdown="span">
{ align="center" alt="v_vec" width="70%" }
</figure>

Now we need to retrieve the value data and perform dot multiplication
@@ -499,3 +499,14 @@ for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
Finally, we need to iterate over different assigned head positions
and write out the corresponding accumulated result based on the
`out_ptr`.
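
As a rough CPU-side illustration of this value step, the sketch below multiplies the normalized softmax logits by the value vectors and accumulates the result for one head. The tiny sizes and array names are assumptions, and the per-thread row partitioning, cross-thread reduction, and the final store through `out_ptr` in the real kernel are collapsed into plain loops here.

```cpp
#include <cstdio>

int main() {
  constexpr int HEAD_SIZE = 4;   // tiny sizes so the output stays readable
  constexpr int NUM_TOKENS = 3;

  // logits[t]: softmax weight of token t (assumed already normalized).
  const float logits[NUM_TOKENS] = {0.2f, 0.3f, 0.5f};
  // value[t][e]: element e of the value vector for token t.
  const float value[NUM_TOKENS][HEAD_SIZE] = {
      {1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};

  // accs plays the role of the per-thread accumulators; per the walkthrough,
  // each kernel thread owns NUM_ROWS_PER_THREAD rows and the partial sums are
  // reduced across threads before being written out.
  float accs[HEAD_SIZE] = {0};
  for (int t = 0; t < NUM_TOKENS; ++t) {
    for (int e = 0; e < HEAD_SIZE; ++e) {
      accs[e] += logits[t] * value[t][e];
    }
  }

  // Writing out the accumulated result, standing in for the out_ptr store.
  for (int e = 0; e < HEAD_SIZE; ++e) {
    std::printf("%g ", accs[e]);
  }
  std::printf("\n");
  return 0;
}
```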

## Citation

```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```
@@ -1,4 +1,4 @@
-# vLLM's Plugin System
+# Plugin System

The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
@@ -1,4 +1,4 @@
-# vLLM's `torch.compile` integration
+# `torch.compile` integration

In vLLM's V1 architecture, `torch.compile` is enabled by default and is a critical part of the framework. This document gives a simple walk-through example to show how to understand the `torch.compile` usage.