. To
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
```
-```{note}
+:::{note}
By default vLLM will build for all GPU types for the widest distribution. If you are only building for the
GPU type of the machine you are building on, you can add the argument `--build-arg torch_cuda_arch_list=""`
so that vLLM detects the current GPU type and builds for it.
If you are using Podman instead of Docker, you might need to disable SELinux labeling by
adding `--security-opt label=disable` when running the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
-```
+:::
## Building for Arm64/aarch64
A docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At the time of this writing, this requires the use
of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
-```{note}
+:::{note}
Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
flags to speed up the build. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
Keep an eye on memory usage with parallel jobs, as it can be substantial (see the example below).
-```
+:::
```console
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
@@ -85,6 +85,6 @@ $ docker run --runtime nvidia --gpus all \
The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
-```{note}
+:::{note}
**For version 0.4.1 and 0.4.2 only** - the vLLM docker images under these versions are supposed to be run under the root user, since a library under the root user's home directory (`/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`) is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all of the parent directories) to allow the user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
-```
+:::
diff --git a/docs/source/deployment/frameworks/cerebrium.md b/docs/source/deployment/frameworks/cerebrium.md
index 5787c4a407bfb..b20c95137b6e7 100644
--- a/docs/source/deployment/frameworks/cerebrium.md
+++ b/docs/source/deployment/frameworks/cerebrium.md
@@ -2,11 +2,11 @@
# Cerebrium
-```{raw} html
+:::{raw} html
-```
+:::
vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
diff --git a/docs/source/deployment/frameworks/dstack.md b/docs/source/deployment/frameworks/dstack.md
index b42a34125c6d7..a16e28f2d8983 100644
--- a/docs/source/deployment/frameworks/dstack.md
+++ b/docs/source/deployment/frameworks/dstack.md
@@ -2,11 +2,11 @@
# dstack
-```{raw} html
+:::{raw} html
-```
+:::
vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.
@@ -97,6 +97,6 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-```{note}
+:::{note}
dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`. The `Task` is for development purposes only. For more hands-on material on serving vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
-```
+:::
diff --git a/docs/source/deployment/frameworks/helm.md b/docs/source/deployment/frameworks/helm.md
index 18ed293191468..e4fc5e1313079 100644
--- a/docs/source/deployment/frameworks/helm.md
+++ b/docs/source/deployment/frameworks/helm.md
@@ -38,213 +38,213 @@ chart **including persistent volumes** and deletes the release.
## Architecture
-```{image} /assets/deployment/architecture_helm_deployment.png
-```
+:::{image} /assets/deployment/architecture_helm_deployment.png
+:::
## Values
-```{list-table}
+:::{list-table}
:widths: 25 25 25 25
:header-rows: 1
-* - Key
- - Type
- - Default
- - Description
-* - autoscaling
- - object
- - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
- - Autoscaling configuration
-* - autoscaling.enabled
- - bool
- - false
- - Enable autoscaling
-* - autoscaling.maxReplicas
- - int
- - 100
- - Maximum replicas
-* - autoscaling.minReplicas
- - int
- - 1
- - Minimum replicas
-* - autoscaling.targetCPUUtilizationPercentage
- - int
- - 80
- - Target CPU utilization for autoscaling
-* - configs
- - object
- - {}
- - Configmap
-* - containerPort
- - int
- - 8000
- - Container port
-* - customObjects
- - list
- - []
- - Custom Objects configuration
-* - deploymentStrategy
- - object
- - {}
- - Deployment strategy configuration
-* - externalConfigs
- - list
- - []
- - External configuration
-* - extraContainers
- - list
- - []
- - Additional containers configuration
-* - extraInit
- - object
- - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
- - Additional configuration for the init container
-* - extraInit.pvcStorage
- - string
- - "50Gi"
- - Storage size of the s3
-* - extraInit.s3modelpath
- - string
- - "relative_s3_model_path/opt-125m"
- - Path of the model on the s3 which hosts model weights and config files
-* - extraInit.awsEc2MetadataDisabled
- - boolean
- - true
- - Disables the use of the Amazon EC2 instance metadata service
-* - extraPorts
- - list
- - []
- - Additional ports configuration
-* - gpuModels
- - list
- - ["TYPE_GPU_USED"]
- - Type of gpu used
-* - image
- - object
- - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
- - Image configuration
-* - image.command
- - list
- - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
- - Container launch command
-* - image.repository
- - string
- - "vllm/vllm-openai"
- - Image repository
-* - image.tag
- - string
- - "latest"
- - Image tag
-* - livenessProbe
- - object
- - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
- - Liveness probe configuration
-* - livenessProbe.failureThreshold
- - int
- - 3
- - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
-* - livenessProbe.httpGet
- - object
- - {"path":"/health","port":8000}
- - Configuration of the Kubelet http request on the server
-* - livenessProbe.httpGet.path
- - string
- - "/health"
- - Path to access on the HTTP server
-* - livenessProbe.httpGet.port
- - int
- - 8000
- - Name or number of the port to access on the container, on which the server is listening
-* - livenessProbe.initialDelaySeconds
- - int
- - 15
- - Number of seconds after the container has started before liveness probe is initiated
-* - livenessProbe.periodSeconds
- - int
- - 10
- - How often (in seconds) to perform the liveness probe
-* - maxUnavailablePodDisruptionBudget
- - string
- - ""
- - Disruption Budget Configuration
-* - readinessProbe
- - object
- - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
- - Readiness probe configuration
-* - readinessProbe.failureThreshold
- - int
- - 3
- - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
-* - readinessProbe.httpGet
- - object
- - {"path":"/health","port":8000}
- - Configuration of the Kubelet http request on the server
-* - readinessProbe.httpGet.path
- - string
- - "/health"
- - Path to access on the HTTP server
-* - readinessProbe.httpGet.port
- - int
- - 8000
- - Name or number of the port to access on the container, on which the server is listening
-* - readinessProbe.initialDelaySeconds
- - int
- - 5
- - Number of seconds after the container has started before readiness probe is initiated
-* - readinessProbe.periodSeconds
- - int
- - 5
- - How often (in seconds) to perform the readiness probe
-* - replicaCount
- - int
- - 1
- - Number of replicas
-* - resources
- - object
- - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
- - Resource configuration
-* - resources.limits."nvidia.com/gpu"
- - int
- - 1
- - Number of gpus used
-* - resources.limits.cpu
- - int
- - 4
- - Number of CPUs
-* - resources.limits.memory
- - string
- - "16Gi"
- - CPU memory configuration
-* - resources.requests."nvidia.com/gpu"
- - int
- - 1
- - Number of gpus used
-* - resources.requests.cpu
- - int
- - 4
- - Number of CPUs
-* - resources.requests.memory
- - string
- - "16Gi"
- - CPU memory configuration
-* - secrets
- - object
- - {}
- - Secrets configuration
-* - serviceName
- - string
- -
- - Service name
-* - servicePort
- - int
- - 80
- - Service port
-* - labels.environment
- - string
- - test
- - Environment name
-* - labels.release
- - string
- - test
- - Release name
-```
+- * Key
+ * Type
+ * Default
+ * Description
+- * autoscaling
+ * object
+ * {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
+ * Autoscaling configuration
+- * autoscaling.enabled
+ * bool
+ * false
+ * Enable autoscaling
+- * autoscaling.maxReplicas
+ * int
+ * 100
+ * Maximum replicas
+- * autoscaling.minReplicas
+ * int
+ * 1
+ * Minimum replicas
+- * autoscaling.targetCPUUtilizationPercentage
+ * int
+ * 80
+ * Target CPU utilization for autoscaling
+- * configs
+ * object
+ * {}
+ * Configmap
+- * containerPort
+ * int
+ * 8000
+ * Container port
+- * customObjects
+ * list
+ * []
+ * Custom Objects configuration
+- * deploymentStrategy
+ * object
+ * {}
+ * Deployment strategy configuration
+- * externalConfigs
+ * list
+ * []
+ * External configuration
+- * extraContainers
+ * list
+ * []
+ * Additional containers configuration
+- * extraInit
+ * object
+ * {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
+ * Additional configuration for the init container
+- * extraInit.pvcStorage
+ * string
+ * "50Gi"
+ * Storage size of the s3
+- * extraInit.s3modelpath
+ * string
+ * "relative_s3_model_path/opt-125m"
+ * Path of the model on the s3 which hosts model weights and config files
+- * extraInit.awsEc2MetadataDisabled
+ * boolean
+ * true
+ * Disables the use of the Amazon EC2 instance metadata service
+- * extraPorts
+ * list
+ * []
+ * Additional ports configuration
+- * gpuModels
+ * list
+ * ["TYPE_GPU_USED"]
+ * Type of gpu used
+- * image
+ * object
+ * {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
+ * Image configuration
+- * image.command
+ * list
+ * ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
+ * Container launch command
+- * image.repository
+ * string
+ * "vllm/vllm-openai"
+ * Image repository
+- * image.tag
+ * string
+ * "latest"
+ * Image tag
+- * livenessProbe
+ * object
+ * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
+ * Liveness probe configuration
+- * livenessProbe.failureThreshold
+ * int
+ * 3
+ * Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
+- * livenessProbe.httpGet
+ * object
+ * {"path":"/health","port":8000}
+ * Configuration of the Kubelet http request on the server
+- * livenessProbe.httpGet.path
+ * string
+ * "/health"
+ * Path to access on the HTTP server
+- * livenessProbe.httpGet.port
+ * int
+ * 8000
+ * Name or number of the port to access on the container, on which the server is listening
+- * livenessProbe.initialDelaySeconds
+ * int
+ * 15
+ * Number of seconds after the container has started before liveness probe is initiated
+- * livenessProbe.periodSeconds
+ * int
+ * 10
+ * How often (in seconds) to perform the liveness probe
+- * maxUnavailablePodDisruptionBudget
+ * string
+ * ""
+ * Disruption Budget Configuration
+- * readinessProbe
+ * object
+ * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
+ * Readiness probe configuration
+- * readinessProbe.failureThreshold
+ * int
+ * 3
+ * Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
+- * readinessProbe.httpGet
+ * object
+ * {"path":"/health","port":8000}
+ * Configuration of the Kubelet http request on the server
+- * readinessProbe.httpGet.path
+ * string
+ * "/health"
+ * Path to access on the HTTP server
+- * readinessProbe.httpGet.port
+ * int
+ * 8000
+ * Name or number of the port to access on the container, on which the server is listening
+- * readinessProbe.initialDelaySeconds
+ * int
+ * 5
+ * Number of seconds after the container has started before readiness probe is initiated
+- * readinessProbe.periodSeconds
+ * int
+ * 5
+ * How often (in seconds) to perform the readiness probe
+- * replicaCount
+ * int
+ * 1
+ * Number of replicas
+- * resources
+ * object
+ * {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
+ * Resource configuration
+- * resources.limits."nvidia.com/gpu"
+ * int
+ * 1
+ * Number of gpus used
+- * resources.limits.cpu
+ * int
+ * 4
+ * Number of CPUs
+- * resources.limits.memory
+ * string
+ * "16Gi"
+ * CPU memory configuration
+- * resources.requests."nvidia.com/gpu"
+ * int
+ * 1
+ * Number of gpus used
+- * resources.requests.cpu
+ * int
+ * 4
+ * Number of CPUs
+- * resources.requests.memory
+ * string
+ * "16Gi"
+ * CPU memory configuration
+- * secrets
+ * object
+ * {}
+ * Secrets configuration
+- * serviceName
+ * string
+ *
+ * Service name
+- * servicePort
+ * int
+ * 80
+ * Service port
+- * labels.environment
+ * string
+ * test
+ * Environment name
+- * labels.release
+ * string
+ * test
+ * Release name
+:::
diff --git a/docs/source/deployment/frameworks/index.md b/docs/source/deployment/frameworks/index.md
index 964782763f6b3..cb758d3e6d2e4 100644
--- a/docs/source/deployment/frameworks/index.md
+++ b/docs/source/deployment/frameworks/index.md
@@ -1,6 +1,6 @@
# Using other frameworks
-```{toctree}
+:::{toctree}
:maxdepth: 1
bentoml
@@ -11,4 +11,4 @@ lws
modal
skypilot
triton
-```
+:::
diff --git a/docs/source/deployment/frameworks/skypilot.md b/docs/source/deployment/frameworks/skypilot.md
index 051fc2f2a8d4e..5e101b9001033 100644
--- a/docs/source/deployment/frameworks/skypilot.md
+++ b/docs/source/deployment/frameworks/skypilot.md
@@ -2,11 +2,11 @@
# SkyPilot
-```{raw} html
+:::{raw} html
-```
+:::
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
@@ -104,10 +104,10 @@ service:
max_completion_tokens: 1
```
-```{raw} html
+:::{raw} html
Click to see the full recipe YAML
-```
+:::
```yaml
service:
@@ -153,9 +153,9 @@ run: |
2>&1 | tee api_server.log
```
-```{raw} html
+:::{raw} html
-```
+:::
Start serving the Llama-3 8B model on multiple replicas:
@@ -169,10 +169,10 @@ Wait until the service is ready:
watch -n10 sky serve status vllm
```
-```{raw} html
+:::{raw} html
Example outputs:
-```
+:::
```console
Services
@@ -185,9 +185,9 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
```
-```{raw} html
+:::{raw} html
-```
+:::
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
@@ -223,10 +223,10 @@ service:
This will scale the service up when the QPS exceeds 2 for each replica.
-```{raw} html
+:::{raw} html
Click to see the full recipe YAML
-```
+:::
```yaml
service:
@@ -275,9 +275,9 @@ run: |
2>&1 | tee api_server.log
```
-```{raw} html
+:::{raw} html
-```
+:::
To update the service with the new config:
@@ -295,10 +295,10 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
-```{raw} html
+:::{raw} html
Click to see the full GUI YAML
-```
+:::
```yaml
envs:
@@ -328,9 +328,9 @@ run: |
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
-```{raw} html
+:::{raw} html
-```
+:::
1. Start the chat web UI:
diff --git a/docs/source/deployment/integrations/index.md b/docs/source/deployment/integrations/index.md
index d47ede8967547..c286edb4d7bc1 100644
--- a/docs/source/deployment/integrations/index.md
+++ b/docs/source/deployment/integrations/index.md
@@ -1,9 +1,9 @@
# External Integrations
-```{toctree}
+:::{toctree}
:maxdepth: 1
kserve
kubeai
llamastack
-```
+:::
diff --git a/docs/source/deployment/nginx.md b/docs/source/deployment/nginx.md
index a58f791c2997b..87feb48856853 100644
--- a/docs/source/deployment/nginx.md
+++ b/docs/source/deployment/nginx.md
@@ -105,9 +105,9 @@ docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-si
docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
-```{note}
+:::{note}
If you are behind a proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-```
+:::
(nginxloadbalancer-nginx-launch-nginx)=
diff --git a/docs/source/design/arch_overview.md b/docs/source/design/arch_overview.md
index cec503ef2f77d..04886e5981eef 100644
--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -4,19 +4,19 @@
This document provides an overview of the vLLM architecture.
-```{contents} Table of Contents
+:::{contents} Table of Contents
:depth: 2
:local: true
-```
+:::
## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.
-```{image} /assets/design/arch_overview/entrypoints.excalidraw.png
+:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
:alt: Entrypoints Diagram
-```
+:::
### LLM Class
@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.
-```{image} /assets/design/arch_overview/llm_engine.excalidraw.png
+:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
:alt: LLMEngine Diagram
-```
+:::
### LLMEngine
@@ -144,11 +144,11 @@ configurations affect the class we ultimately get.
The following figure shows the class hierarchy of vLLM:
-> ```{figure} /assets/design/hierarchy.png
+> :::{figure} /assets/design/hierarchy.png
> :align: center
> :alt: query
> :width: 100%
-> ```
+> :::
There are several important design choices behind this class hierarchy:
@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.
-````{note}
+:::{note}
To support this change, all vLLM models' signatures have been updated to:
```python
@@ -215,7 +215,7 @@ else:
```
This way, the model can work with both old and new versions of vLLM.
-````
+:::
3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the
diff --git a/docs/source/design/kernel/paged_attention.md b/docs/source/design/kernel/paged_attention.md
index f896f903c78f5..5f2582877260a 100644
--- a/docs/source/design/kernel/paged_attention.md
+++ b/docs/source/design/kernel/paged_attention.md
@@ -139,26 +139,26 @@
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```
- ```{figure} ../../assets/kernel/query.png
+ :::{figure} ../../assets/kernel/query.png
:align: center
:alt: query
:width: 70%
Query data of one token at one head
- ```
+ :::
- Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.
- ```{figure} ../../assets/kernel/q_vecs.png
+ :::{figure} ../../assets/kernel/q_vecs.png
:align: center
:alt: q_vecs
:width: 70%
`q_vecs` for one thread group
- ```
+ :::
```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -195,13 +195,13 @@
points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.
- ```{figure} ../../assets/kernel/key.png
+ :::{figure} ../../assets/kernel/key.png
:align: center
:alt: key
:width: 70%
Key data of all context tokens at one head
- ```
+ :::
- The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -214,13 +214,13 @@
elements for one token) that will be processed by 2 threads (one
thread group) separately.
- ```{figure} ../../assets/kernel/k_vecs.png
+ :::{figure} ../../assets/kernel/k_vecs.png
:align: center
:alt: k_vecs
:width: 70%
`k_vecs` for one thread
- ```
+ :::
```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -289,14 +289,14 @@
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.
- ```{math}
+ :::{math}
:nowrap: true
\begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*}
- ```
+ :::
### `qk_max` and `logits`
@@ -379,29 +379,29 @@
## Value
-```{figure} ../../assets/kernel/value.png
+:::{figure} ../../assets/kernel/value.png
:align: center
:alt: value
:width: 70%
Value data of all context tokens at one head
-```
+:::
-```{figure} ../../assets/kernel/logits_vec.png
+:::{figure} ../../assets/kernel/logits_vec.png
:align: center
:alt: logits_vec
:width: 50%
`logits_vec` for one thread
-```
+:::
-```{figure} ../../assets/kernel/v_vec.png
+:::{figure} ../../assets/kernel/v_vec.png
:align: center
:alt: v_vec
:width: 70%
List of `v_vec` for one thread
-```
+:::
- Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group
diff --git a/docs/source/design/multiprocessing.md b/docs/source/design/multiprocessing.md
index c2cdb75ea08a7..55dae0bb92d4e 100644
--- a/docs/source/design/multiprocessing.md
+++ b/docs/source/design/multiprocessing.md
@@ -7,9 +7,9 @@ page for information on known issues and how to solve them.
## Introduction
-```{important}
+:::{important}
The source code references are to the state of the code at the time of writing, in December 2024.
-```
+:::
The use of Python multiprocessing in vLLM is complicated by:
diff --git a/docs/source/design/v1/prefix_caching.md b/docs/source/design/v1/prefix_caching.md
new file mode 100644
index 0000000000000..dc8432baef9d9
--- /dev/null
+++ b/docs/source/design/v1/prefix_caching.md
@@ -0,0 +1,228 @@
+# Automatic Prefix Caching
+
+Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computation. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic) and most open-source LLM inference frameworks (e.g., SGLang).
+
+While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:
+
+```text
+ Block 1 Block 2 Block 3
+ [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
+Block 1: |<--- block tokens ---->|
+Block 2: |<------- prefix ------>| |<--- block tokens --->|
+Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
+```
+
+In the example above, the KV cache in the first block can be uniquely identified by the tokens “A gentle breeze stirred”. The third block can be uniquely identified by the tokens in the block, “laughed in the distance”, together with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the block hash as `hash(tuple[components])`, where the components are:
+
+* Parent hash value: The hash value of the parent block.
+* Block tokens: A tuple of the tokens in this block. The exact tokens are included to reduce potential hash collisions.
+* Extra hashes: Other values required to make this block unique, such as LoRA IDs and multi-modality input hashes (see the example below).
+
+Note 1: We only cache full blocks.
+
+Note 2: The above hash key structure is not 100% collision free. Theoretically it’s still possible for different prefix tokens to have the same hash value, but this should be nearly impossible in practice. Of course, contributions are welcome if you have an awesome idea to eliminate collisions entirely.
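This construction can be sketched in a few lines. The helper below is an illustration only, not vLLM's actual implementation; the real hashing differs in details:

```python
from typing import Optional, Tuple

def block_hash(parent_hash: Optional[int],
               block_tokens: Tuple[int, ...],
               extra_hashes: Tuple = ()) -> int:
    # Chaining the parent hash makes each block's hash depend on every
    # prefix token; including the raw tokens reduces collision risk;
    # extra hashes (LoRA IDs, image hashes) keep otherwise-identical
    # token blocks apart.
    return hash((parent_hash, block_tokens, extra_hashes))

# Three full blocks: each hash transitively covers all prefix tokens.
h1 = block_hash(None, (101, 102, 103, 104))
h2 = block_hash(h1, (105, 106, 107, 108))
h3 = block_hash(h2, (109, 110, 111, 112))
```

Because the parent hash is part of the key, identical token blocks at different prompt positions hash differently, which is exactly what prefix matching needs.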
+
+**A hashing example with multi-modality inputs**
+In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages:
+
+```text
+messages = [
+ {"role": "user",
+ "content": [
+ {"type": "text",
+ "text": "What's in this image?"
+ },
+ {"type": "image_url",
+ "image_url": {"url": image_url},
+ },
+ ]},
+]
+```
+
+It will become the following prompt:
+
+```text
+Prompt:
+ [INST]What's in this image?\n[IMG][/INST]
+
+Tokenized prompt:
+ [1, 3, 7493, 1681, 1294, 1593, 3937, 9551, 10, 4]
+
+Prompt with placeholders (<P>):
+    [1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <P>, <P>, ..., <P>, 4]
+```
+
+As we can see, after tokenization, the `[IMG]` is replaced by a sequence of placeholder tokens, and these placeholders are replaced by image embeddings during prefill. The challenge for prefix caching to support this case is that we need to differentiate images from the placeholders. To address this problem, we encode the image hash generated by the frontend image processor. For example, the hashes of the blocks in the above prompt would be (assuming block size 16, and 41 placeholder tokens):
+
+```text
+Block 0
+  Parent hash: None
+  Token IDs: 1, 3, 7493, 1681, 1294, 1593, 3937, 9551, <P>, ..., <P>
+  Extra hash: <image hash>
+Block 1
+  Parent hash: Block 0 hash
+  Token IDs: <P>, ..., <P>
+  Extra hash: <image hash>
+Block 2
+  Parent hash: Block 1 hash
+  Token IDs: <P>, ..., <P>
+  Extra hash: <image hash>
+Block 3
+  Parent hash: Block 2 hash
+  Token IDs: <P>, ..., <P>, 4
+  Extra hash: <image hash>
+```
+
+In the rest of this document, we first introduce the data structure used for prefix caching in vLLM v1, followed by the prefix caching workflow of the major KV cache operations (e.g., allocate, append, free, eviction). Finally, we use an example to illustrate the end-to-end prefix caching workflow.
+
+## Data Structure
+
+The prefix caching in vLLM v1 is implemented in the KV cache manager. The basic building block is the “Block” data class (simplified):
+
+```python
+class KVCacheBlock:
+ # The block ID (immutable)
+ block_id: int
+ # The block hash (will be assigned when the block is full,
+ # and will be reset when the block is evicted).
+ block_hash: BlockHashType
+ # The number of requests using this block now.
+ ref_cnt: int
+
+ # The pointers to form a doubly linked list for the free queue.
+ prev_free_block: Optional["KVCacheBlock"] = None
+ next_free_block: Optional["KVCacheBlock"] = None
+```
+
+There are two design points to highlight:
+
+1. We allocate all KVCacheBlocks as a block pool when initializing the KV cache manager. This avoids Python object creation overheads and makes it easy to track all blocks at all times.
+2. We introduce the doubly linked list pointers directly in the KVCacheBlock, so that we can construct a free queue directly. This gives us two benefits:
+   1. Moving elements in the middle of the queue to the tail takes O(1) time.
+   2. We avoid introducing another Python queue (e.g., `deque`), which would wrap the elements.
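A minimal sketch of such a free queue, with hypothetical names and simplified from the design above:

```python
from typing import List, Optional

class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.prev_free_block: Optional["Block"] = None
        self.next_free_block: Optional["Block"] = None

class FreeBlockQueue:
    """Doubly linked free list: O(1) unlink from the middle, O(1) append."""

    def __init__(self, blocks: List[Block]):
        self.head: Optional[Block] = blocks[0] if blocks else None
        self.tail: Optional[Block] = blocks[-1] if blocks else None
        for a, b in zip(blocks, blocks[1:]):
            a.next_free_block, b.prev_free_block = b, a

    def remove(self, block: Block) -> None:
        # Unlink a block anywhere in the list without scanning it.
        if block.prev_free_block:
            block.prev_free_block.next_free_block = block.next_free_block
        else:
            self.head = block.next_free_block
        if block.next_free_block:
            block.next_free_block.prev_free_block = block.prev_free_block
        else:
            self.tail = block.prev_free_block
        block.prev_free_block = block.next_free_block = None

    def popleft(self) -> Block:
        # The head is the least recently freed block: the LRU candidate.
        block = self.head
        self.remove(block)
        return block

    def append(self, block: Block) -> None:
        # Freed blocks rejoin at the tail (most recently used end).
        if self.tail:
            self.tail.next_free_block = block
            block.prev_free_block = self.tail
        else:
            self.head = block
        self.tail = block

blocks = [Block(i) for i in range(4)]
queue = FreeBlockQueue(blocks)
queue.remove(blocks[2])   # a cache hit "touches" block 2: O(1) unlink
lru = queue.popleft()     # allocation pops the LRU block (block 0)
queue.append(blocks[2])   # block 2 is freed again, goes to the tail
```

The point of embedding the pointers in the block itself is that a "touch" on a cache hit can remove the block from the middle of the queue without any search.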
+
+As a result, we will have the following components when the KV cache manager is initialized:
+
+:::{image} /assets/design/v1/prefix_caching/overview.png
+:alt: Component Overview
+:::
+
+* Block Pool: A list of KVCacheBlock.
+* Free Block Queue: Only stores the head and tail block pointers for manipulation.
+* Cache blocks: Mapping from hash key to block IDs.
+* Request blocks: Mapping from request ID to allocated block IDs.
+
+## Operations
+
+### Block Allocation
+
+**New request:** Workflow for the scheduler to schedule a new request with KV cache block allocation:
+
+1. The scheduler calls `kv_cache_manager.get_computed_blocks()` to get a sequence of blocks that have already been computed. This is done by hashing the prompt tokens in the request and looking up the Cache Blocks mapping.
+2. The scheduler calls `kv_cache_manager.allocate_slots()`. It does the following steps:
+    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
+    2. “Touch” the computed blocks. It increases the reference count of each computed block by one, and removes the block from the free queue if the block wasn’t used by other requests. This is to avoid these computed blocks being evicted. See the example in the next section for illustration.
+    3. Allocate new blocks by popping them from the head of the free queue. If a head block is a cached block, this also “evicts” it, so that no other requests can reuse it from now on.
+    4. If an allocated block is already full of tokens, we immediately add it to the Cache Blocks mapping, so that it can be reused by other requests in the same batch.
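Step 1 of this flow (matching the prompt's full blocks against cached hashes) could be sketched as follows; the names and the plain `hash()` are illustrative, not vLLM's actual API:

```python
BLOCK_SIZE = 4  # tokens per kv-cache block (vLLM's real default differs)

def get_computed_blocks(cached_blocks: dict, token_ids: list) -> list:
    """Walk the prompt's full blocks and stop at the first cache miss."""
    computed, parent = [], None
    num_full = len(token_ids) // BLOCK_SIZE * BLOCK_SIZE  # full blocks only
    for start in range(0, num_full, BLOCK_SIZE):
        h = hash((parent, tuple(token_ids[start:start + BLOCK_SIZE])))
        if h not in cached_blocks:
            break
        computed.append(cached_blocks[h])  # block ID of the cache hit
        parent = h
    return computed

# Two cached blocks covering tokens 0..7; token 8 sits in a partial block.
h0 = hash((None, (0, 1, 2, 3)))
h1 = hash((h0, (4, 5, 6, 7)))
cached_blocks = {h0: 0, h1: 1}
hits = get_computed_blocks(cached_blocks, list(range(9)))  # → [0, 1]
```

Note that matching stops at the first miss: because each hash chains its parent, a miss at block *i* implies every later block also misses.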
+
+**Running request:** Workflow for the scheduler to schedule a running request with KV cache block allocation:
+
+1. The scheduler calls `kv_cache_manager.append_slots()`. It does the following steps:
+    1. Compute the number of new blocks required, and return if there are not enough blocks to allocate.
+    2. Allocate new blocks by popping them from the head of the free queue. If a head block is a cached block, this also “evicts” it, so that no other requests can reuse it from now on.
+    3. Append token IDs to the slots in existing blocks as well as the new blocks. If a block becomes full, we add it to the Cache Blocks mapping to cache it.
+
+**Duplicated blocks**
+Assuming the block size is 4 and you send a request (Request 1) with prompt ABCDEF and decoding length 3:
+
+```text
+Prompt: [A, B, C, D, E, F]
+Output: [G, H, I]
+
+Time 0:
+ Tokens: [A, B, C, D, E, F, G]
+ Block Table: [0 (ABCD), 1 (EFG)]
+ Cache Blocks: 0
+Time 1:
+ Tokens: [A, B, C, D, E, F, G, H]
+ Block Table: [0 (ABCD), 1 (EFGH)]
+ Cache Blocks: 0, 1
+Time 2:
+ Tokens: [A, B, C, D, E, F, G, H, I]
+ Block Table: [0 (ABCD), 1 (EFGH), 2 (I)]
+ Cache Blocks: 0, 1
+```
+
+Now blocks 0 and 1 are cached. Suppose we send the same request again (Request 2) with greedy sampling, so that it produces exactly the same outputs as Request 1:
+
+```text
+Prompt: [A, B, C, D, E, F]
+Output: [G, H, I]
+
+Time 0:
+ Tokens: [A, B, C, D, E, F, G]
+ Block Table: [0 (ABCD), 3 (EFG)]
+ Cache Blocks: 0, 1
+Time 1:
+ Tokens: [A, B, C, D, E, F, G, H]
+ Block Table: [0 (ABCD), 3 (EFGH)]
+ Cache Blocks: 0, 1, 3
+```
+
+As can be seen, block 3 is a new full block and is cached. However, it is redundant with block 1, meaning that we cached the same content twice. In v0, when block 3 is detected as a duplicate, we free it and let Request 2 use block 1 instead, so its block table becomes `[0, 1]` at Time 1. However, the block table in vLLM v1 is append-only, meaning that changing the block table from `[0, 3]` to `[0, 1]` is not allowed. As a result, we will have duplicated blocks for the hash key E-H. This duplication is eliminated when the request is freed.
+
+### Free
+
+When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free Request 1 and blocks 2, 3, 4, and 8 associated with it. Note that the freed blocks are added to the tail of the free queue in *reverse* order. This is because the last block of a request always hashes more tokens and is therefore less likely to be reused by other requests, so it should be evicted first.
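This free path can be sketched as follows (a minimal illustration with hypothetical names). Appending in reverse order places the request's last block closest to the head of the queue, so it is evicted first:

```python
def free_request(free_queue, request_blocks, ref_count):
    # Walk the request's blocks back-to-front; blocks that nobody else uses
    # go back to the tail of the LRU free queue.
    for block_id in reversed(request_blocks):
        ref_count[block_id] -= 1
        if ref_count[block_id] == 0:
            free_queue.append(block_id)

free_queue = [5, 6]                       # blocks that were already free
ref_count = {2: 1, 3: 1, 4: 1, 8: 1}
free_request(free_queue, [2, 3, 4, 8], ref_count)
```

After freeing, block 8 (the request's deepest block) sits nearest the eviction head among the newly freed blocks, while block 2 is the last to be evicted.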
+
+:::{image} /assets/design/v1/prefix_caching/free.png
+:alt: Free Queue after Free a Request
+:::
+
+### Eviction (LRU)
+
+When the head block (least recently used block) of the free queue is cached, we have to evict the block to prevent it from being used by other requests. Specifically, eviction involves the following steps:
+
+1. Pop the block from the head of the free queue. This is the LRU block to be evicted.
+2. Remove the block ID from the Cache Blocks.
+3. Remove the block hash.
+
+## Example
+
+In this example, we assume the block size is 4 (each block can cache 4 tokens), and we have 10 blocks in the KV-cache manager in total.
+
+**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 2 of 4 tokens.
+
+:::{image} /assets/design/v1/prefix_caching/example-time-1.png
+:alt: Example Time 1
+:::
+
+**Time 3: Request 0 makes block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
+
+:::{image} /assets/design/v1/prefix_caching/example-time-3.png
+:alt: Example Time 3
+:::
+
+**Time 4: Request 1 comes in with 14 prompt tokens, where the first 11 tokens are the same as Request 0's.** Only 2 blocks (8 tokens) hit the cache, because the 3rd block matches only 3 of its 4 tokens.
+
+:::{image} /assets/design/v1/prefix_caching/example-time-4.png
+:alt: Example Time 4
+:::
+
+**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are still being used by Request 1.
+
+:::{image} /assets/design/v1/prefix_caching/example-time-5.png
+:alt: Example Time 5
+:::
+
+**Time 6: Request 1 is finished and freed.**
+
+:::{image} /assets/design/v1/prefix_caching/example-time-6.png
+:alt: Example Time 6
+:::
+
+**Time 7: Request 2 comes in with 33 prompt tokens, where the first 16 tokens are the same as Request 0's.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache-hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, and 3 (evicted).
+
+:::{image} /assets/design/v1/prefix_caching/example-time-7.png
+:alt: Example Time 7
+:::
diff --git a/docs/source/features/automatic_prefix_caching.md b/docs/source/features/automatic_prefix_caching.md
index 3d70cbb29c385..59016d7fcf6b3 100644
--- a/docs/source/features/automatic_prefix_caching.md
+++ b/docs/source/features/automatic_prefix_caching.md
@@ -6,9 +6,9 @@
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
-```{note}
+:::{note}
Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
-```
+:::
## Enabling APC in vLLM
diff --git a/docs/source/features/compatibility_matrix.md b/docs/source/features/compatibility_matrix.md
index 47ab616b30686..b0018ebccf5ba 100644
--- a/docs/source/features/compatibility_matrix.md
+++ b/docs/source/features/compatibility_matrix.md
@@ -4,13 +4,13 @@
The tables below show mutually exclusive features and the support on some hardware.
-```{note}
+:::{note}
Check the '✗' with links to see tracking issue for unsupported feature/hardware combination.
-```
+:::
## Feature x Feature
-```{raw} html
+:::{raw} html
-```
+:::
-```{list-table}
- :header-rows: 1
- :stub-columns: 1
- :widths: auto
+:::{list-table}
+:header-rows: 1
+:stub-columns: 1
+:widths: auto
- * - Feature
- - [CP](#chunked-prefill)
- - [APC](#automatic-prefix-caching)
- - [LoRA](#lora-adapter)
- - prmpt adptr
- - [SD](#spec_decode)
- - CUDA graph
- - pooling
- - enc-dec
- - logP
- - prmpt logP
- - async output
- - multi-step
- - mm
- - best-of
- - beam-search
- - guided dec
- * - [CP](#chunked-prefill)
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - [APC](#automatic-prefix-caching)
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - [LoRA](#lora-adapter)
- - [✗](gh-pr:9057)
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - prmpt adptr
- - ✅
- - ✅
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - [SD](#spec_decode)
- - ✅
- - ✅
- - ✗
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - CUDA graph
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - pooling
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
- -
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - enc-dec
- - ✗
- - [✗](gh-issue:7366)
- - ✗
- - ✗
- - [✗](gh-issue:7366)
- - ✅
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- -
- * - logP
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✗
- - ✅
- -
- -
- -
- -
- -
- -
- -
- -
- * - prmpt logP
- - ✅
- - ✅
- - ✅
- - ✅
- - [✗](gh-pr:8199)
- - ✅
- - ✗
- - ✅
- - ✅
- -
- -
- -
- -
- -
- -
- -
- * - async output
- - ✅
- - ✅
- - ✅
- - ✅
- - ✗
- - ✅
- - ✗
- - ✗
- - ✅
- - ✅
- -
- -
- -
- -
- -
- -
- * - multi-step
- - ✗
- - ✅
- - ✗
- - ✅
- - ✗
- - ✅
- - ✗
- - ✗
- - ✅
- - [✗](gh-issue:8198)
- - ✅
- -
- -
- -
- -
- -
- * - mm
- - ✅
- - [✗](gh-pr:8348)
- - [✗](gh-pr:7199)
- - ?
- - ?
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ?
- -
- -
- -
- -
- * - best-of
- - ✅
- - ✅
- - ✅
- - ✅
- - [✗](gh-issue:6137)
- - ✅
- - ✗
- - ✅
- - ✅
- - ✅
- - ?
- - [✗](gh-issue:7968)
- - ✅
- -
- -
- -
- * - beam-search
- - ✅
- - ✅
- - ✅
- - ✅
- - [✗](gh-issue:6137)
- - ✅
- - ✗
- - ✅
- - ✅
- - ✅
- - ?
- - [✗](gh-issue:7968>)
- - ?
- - ✅
- -
- -
- * - guided dec
- - ✅
- - ✅
- - ?
- - ?
- - [✗](gh-issue:11484)
- - ✅
- - ✗
- - ?
- - ✅
- - ✅
- - ✅
- - [✗](gh-issue:9893)
- - ?
- - ✅
- - ✅
- -
-
-```
+- * Feature
+ * [CP](#chunked-prefill)
+ * [APC](#automatic-prefix-caching)
+ * [LoRA](#lora-adapter)
+ * prmpt adptr
+ * [SD](#spec_decode)
+ * CUDA graph
+ * pooling
+ * enc-dec
+ * logP
+ * prmpt logP
+ * async output
+ * multi-step
+ * mm
+ * best-of
+ * beam-search
+ * guided dec
+- * [CP](#chunked-prefill)
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * [APC](#automatic-prefix-caching)
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * [LoRA](#lora-adapter)
+ * [✗](gh-pr:9057)
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * prmpt adptr
+ * ✅
+ * ✅
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * [SD](#spec_decode)
+ * ✅
+ * ✅
+ * ✗
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * CUDA graph
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * pooling
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * enc-dec
+ * ✗
+ * [✗](gh-issue:7366)
+ * ✗
+ * ✗
+ * [✗](gh-issue:7366)
+ * ✅
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * logP
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✗
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * prmpt logP
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-pr:8199)
+ * ✅
+ * ✗
+ * ✅
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+ *
+- * async output
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✗
+ * ✅
+ * ✗
+ * ✗
+ * ✅
+ * ✅
+ *
+ *
+ *
+ *
+ *
+ *
+- * multi-step
+ * ✗
+ * ✅
+ * ✗
+ * ✅
+ * ✗
+ * ✅
+ * ✗
+ * ✗
+ * ✅
+ * [✗](gh-issue:8198)
+ * ✅
+ *
+ *
+ *
+ *
+ *
+- * mm
+ * ✅
+ * [✗](gh-pr:8348)
+ * [✗](gh-pr:7199)
+ * ?
+ * ?
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ?
+ *
+ *
+ *
+ *
+- * best-of
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-issue:6137)
+ * ✅
+ * ✗
+ * ✅
+ * ✅
+ * ✅
+ * ?
+ * [✗](gh-issue:7968)
+ * ✅
+ *
+ *
+ *
+- * beam-search
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-issue:6137)
+ * ✅
+ * ✗
+ * ✅
+ * ✅
+ * ✅
+ * ?
+ * [✗](gh-issue:7968)
+ * ?
+ * ✅
+ *
+ *
+- * guided dec
+ * ✅
+ * ✅
+ * ?
+ * ?
+ * [✗](gh-issue:11484)
+ * ✅
+ * ✗
+ * ?
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-issue:9893)
+ * ?
+ * ✅
+ * ✅
+ *
+:::
(feature-x-hardware)=
## Feature x Hardware
-```{list-table}
- :header-rows: 1
- :stub-columns: 1
- :widths: auto
+:::{list-table}
+:header-rows: 1
+:stub-columns: 1
+:widths: auto
- * - Feature
- - Volta
- - Turing
- - Ampere
- - Ada
- - Hopper
- - CPU
- - AMD
- * - [CP](#chunked-prefill)
- - [✗](gh-issue:2729)
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - [APC](#automatic-prefix-caching)
- - [✗](gh-issue:3687)
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - [LoRA](#lora-adapter)
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - prmpt adptr
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - [✗](gh-issue:8475)
- - ✅
- * - [SD](#spec_decode)
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - CUDA graph
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✗
- - ✅
- * - pooling
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ?
- * - enc-dec
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✗
- * - mm
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - logP
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - prmpt logP
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - async output
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✗
- - ✗
- * - multi-step
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - [✗](gh-issue:8477)
- - ✅
- * - best-of
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - beam-search
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- * - guided dec
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
- - ✅
-```
+- * Feature
+ * Volta
+ * Turing
+ * Ampere
+ * Ada
+ * Hopper
+ * CPU
+ * AMD
+- * [CP](#chunked-prefill)
+ * [✗](gh-issue:2729)
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * [APC](#automatic-prefix-caching)
+ * [✗](gh-issue:3687)
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * [LoRA](#lora-adapter)
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * prmpt adptr
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-issue:8475)
+ * ✅
+- * [SD](#spec_decode)
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * CUDA graph
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✗
+ * ✅
+- * pooling
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ?
+- * enc-dec
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✗
+- * mm
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * logP
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * prmpt logP
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * async output
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✗
+ * ✗
+- * multi-step
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * [✗](gh-issue:8477)
+ * ✅
+- * best-of
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * beam-search
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+- * guided dec
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+ * ✅
+:::
diff --git a/docs/source/features/disagg_prefill.md b/docs/source/features/disagg_prefill.md
index efa2efc66192e..52d253b9c2b18 100644
--- a/docs/source/features/disagg_prefill.md
+++ b/docs/source/features/disagg_prefill.md
@@ -4,9 +4,9 @@
This page introduces you to the disaggregated prefilling feature in vLLM.
-```{note}
+:::{note}
This feature is experimental and subject to change.
-```
+:::
## Why disaggregated prefilling?
@@ -15,9 +15,9 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request, resulting in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it is hard to figure out the correct chunk size value, so disaggregated prefilling is a much more reliable way to control tail ITL.
-```{note}
+:::{note}
Disaggregated prefill DOES NOT improve throughput.
-```
+:::
## Usage example
@@ -39,21 +39,21 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two APIs: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL: `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drops it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
-```{note}
+:::{note}
`insert` is a non-blocking operation, while `drop_select` is a blocking operation.
-```
+:::
Here is a figure illustrating how the above 3 abstractions are organized:
-```{image} /assets/features/disagg_prefill/abstraction.jpg
+:::{image} /assets/features/disagg_prefill/abstraction.jpg
:alt: Disaggregated prefilling abstractions
-```
+:::
The workflow of disaggregated prefilling is as follows:
-```{image} /assets/features/disagg_prefill/overview.jpg
+:::{image} /assets/features/disagg_prefill/overview.jpg
:alt: Disaggregated prefilling workflow
-```
+:::
The `buffer` corresponds to `insert` API in LookupBuffer, and the `drop_select` corresponds to `drop_select` API in LookupBuffer.
diff --git a/docs/source/features/lora.md b/docs/source/features/lora.md
index b00d05147bb32..fb5a7a0d519cb 100644
--- a/docs/source/features/lora.md
+++ b/docs/source/features/lora.md
@@ -60,9 +60,9 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
-```{note}
+:::{note}
The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-```
+:::
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
diff --git a/docs/source/features/quantization/auto_awq.md b/docs/source/features/quantization/auto_awq.md
index 404505eb3890e..30735b1161ff3 100644
--- a/docs/source/features/quantization/auto_awq.md
+++ b/docs/source/features/quantization/auto_awq.md
@@ -2,11 +2,11 @@
# AutoAWQ
-```{warning}
+:::{warning}
Please note that AWQ support in vLLM is under-optimized at the moment. We would recommend using the unquantized version of the model for better
accuracy and higher throughput. Currently, you can use AWQ as a way to reduce memory footprint. As of now, it is more suitable for low-latency
inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
-```
+:::
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
diff --git a/docs/source/features/quantization/fp8.md b/docs/source/features/quantization/fp8.md
index 1398e8a324201..a62e0124b7706 100644
--- a/docs/source/features/quantization/fp8.md
+++ b/docs/source/features/quantization/fp8.md
@@ -14,10 +14,10 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-```{note}
+:::{note}
FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-```
+:::
## Quick Start with Online Dynamic Quantization
@@ -32,9 +32,9 @@ model = LLM("facebook/opt-125m", quantization="fp8")
result = model.generate("Hello, my name is")
```
-```{warning}
+:::{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-```
+:::
## Installation
@@ -110,9 +110,9 @@ model.generate("Hello my name is")
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
-```{note}
+:::{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-```
+:::
```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -137,10 +137,10 @@ If you encounter any issues or have feature requests, please open an issue on th
## Deprecated Flow
-```{note}
+:::{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
-```
+:::
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).
diff --git a/docs/source/features/quantization/gguf.md b/docs/source/features/quantization/gguf.md
index 640997cf4bc39..65c181900f9be 100644
--- a/docs/source/features/quantization/gguf.md
+++ b/docs/source/features/quantization/gguf.md
@@ -2,13 +2,13 @@
# GGUF
-```{warning}
+:::{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-```
+:::
-```{warning}
+:::{warning}
+Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge the files into a single-file model.
-```
+:::
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
@@ -25,9 +25,9 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
-```{warning}
+:::{warning}
We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocab size.
-```
+:::
You can also use the GGUF model directly through the LLM entrypoint:
diff --git a/docs/source/features/quantization/index.md b/docs/source/features/quantization/index.md
index 56ccdb5f00c34..1c98620aa2145 100644
--- a/docs/source/features/quantization/index.md
+++ b/docs/source/features/quantization/index.md
@@ -4,7 +4,7 @@
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
-```{toctree}
+:::{toctree}
:caption: Contents
:maxdepth: 1
@@ -12,7 +12,8 @@ supported_hardware
auto_awq
bnb
gguf
+int4
int8
fp8
quantized_kvcache
-```
+:::
diff --git a/docs/source/features/quantization/int4.md b/docs/source/features/quantization/int4.md
new file mode 100644
index 0000000000000..f8939e5bf0150
--- /dev/null
+++ b/docs/source/features/quantization/int4.md
@@ -0,0 +1,166 @@
+(int4)=
+
+# INT4 W4A16
+
+vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
+
+Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
+
+:::{note}
+INT4 computation is supported on NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
+:::
+
+## Prerequisites
+
+To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
+
+```console
+pip install llmcompressor
+```
+
+## Quantization Process
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+### 1. Loading the Model
+
+Load your model and tokenizer using the standard `transformers` AutoModel classes:
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(
+ MODEL_ID, device_map="auto", torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+```
+
+### 2. Preparing Calibration Data
+
+When quantizing weights to INT4, you need sample data to estimate the weight updates and calibrated scales.
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
+
+```python
+from datasets import load_dataset
+
+NUM_CALIBRATION_SAMPLES = 512
+MAX_SEQUENCE_LENGTH = 2048
+
+# Load and preprocess the dataset
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+def preprocess(example):
+ return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ds = ds.map(preprocess)
+
+def tokenize(sample):
+ return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ds = ds.map(tokenize, remove_columns=ds.column_names)
+```
+
+### 3. Applying Quantization
+
+Now, apply the quantization algorithms:
+
+```python
+from llmcompressor.transformers import oneshot
+from llmcompressor.modifiers.quantization import GPTQModifier
+from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+
+# Configure the quantization algorithms
+recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+
+# Apply quantization
+oneshot(
+ model=model,
+ dataset=ds,
+ recipe=recipe,
+ max_seq_length=MAX_SEQUENCE_LENGTH,
+ num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+)
+
+# Save the compressed model
+SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
+model.save_pretrained(SAVE_DIR, save_compressed=True)
+tokenizer.save_pretrained(SAVE_DIR)
+```
+
+This process creates a W4A16 model with weights quantized to 4-bit integers.
+
+### 4. Evaluating Accuracy
+
+After quantization, you can load and run the model in vLLM:
+
+```python
+from vllm import LLM
+model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
+```
+
+To evaluate accuracy, you can use `lm_eval`:
+
+```console
+$ lm_eval --model vllm \
+ --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
+ --tasks gsm8k \
+ --num_fewshot 5 \
+ --limit 250 \
+ --batch_size 'auto'
+```
+
+:::{note}
+Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
+:::
+
+## Best Practices
+
+- Start with 512 samples for calibration data, and increase if accuracy drops
+- Ensure the calibration data contains a high variety of samples to prevent overfitting towards a specific use case
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration
+- Tune key hyperparameters to the quantization algorithm:
+ - `dampening_frac` sets how much influence the GPTQ algorithm has. Lower values can improve accuracy, but can lead to numerical instabilities that cause the algorithm to fail.
+ - `actorder` sets the activation ordering. When compressing a layer's weights, the order in which its channels are quantized matters. Setting `actorder="weight"` can improve accuracy without added latency.
+
+The following is an example of an expanded quantization recipe you can tune to your own use case:
+
+```python
+from compressed_tensors.quantization import (
+ QuantizationArgs,
+ QuantizationScheme,
+ QuantizationStrategy,
+ QuantizationType,
+)
+recipe = GPTQModifier(
+ targets="Linear",
+ config_groups={
+ "config_group": QuantizationScheme(
+ targets=["Linear"],
+ weights=QuantizationArgs(
+ num_bits=4,
+ type=QuantizationType.INT,
+ strategy=QuantizationStrategy.GROUP,
+ group_size=128,
+ symmetric=True,
+ dynamic=False,
+ actorder="weight",
+ ),
+ ),
+ },
+ ignore=["lm_head"],
+ update_size=NUM_CALIBRATION_SAMPLES,
+ dampening_frac=0.01
+)
+```
+
+## Troubleshooting and Support
+
+If you encounter any issues or have feature requests, please open an issue on the [`vllm-project/llm-compressor`](https://github.com/vllm-project/llm-compressor) GitHub repository. The full INT4 quantization example in `llm-compressor` is available [here](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).
diff --git a/docs/source/features/quantization/int8.md b/docs/source/features/quantization/int8.md
index 592a60d3988b2..b381f34bccd34 100644
--- a/docs/source/features/quantization/int8.md
+++ b/docs/source/features/quantization/int8.md
@@ -7,9 +7,9 @@ This quantization method is particularly useful for reducing model size while ma
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
-```{note}
-INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
-```
+:::{note}
+INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
+:::
## Prerequisites
@@ -119,9 +119,9 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
-```{note}
+:::{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-```
+:::
## Best Practices
@@ -132,4 +132,4 @@ Quantized models can be sensitive to the presence of the `bos` token. Make sure
## Troubleshooting and Support
-If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
+If you encounter any issues or have feature requests, please open an issue on the [`vllm-project/llm-compressor`](https://github.com/vllm-project/llm-compressor) GitHub repository.
diff --git a/docs/source/features/quantization/supported_hardware.md b/docs/source/features/quantization/supported_hardware.md
index f5c0a95ea426e..555ed4ce4c8db 100644
--- a/docs/source/features/quantization/supported_hardware.md
+++ b/docs/source/features/quantization/supported_hardware.md
@@ -4,128 +4,129 @@
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-```{list-table}
+:::{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8
-* - Implementation
- - Volta
- - Turing
- - Ampere
- - Ada
- - Hopper
- - AMD GPU
- - Intel GPU
- - x86 CPU
- - AWS Inferentia
- - Google TPU
-* - AWQ
- - ✗
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✅︎
- - ✅︎
- - ✗
- - ✗
-* - GPTQ
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✅︎
- - ✅︎
- - ✗
- - ✗
-* - Marlin (GPTQ/AWQ/FP8)
- - ✗
- - ✗
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
-* - INT8 (W8A8)
- - ✗
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✅︎
- - ✗
- - ✗
-* - FP8 (W8A8)
- - ✗
- - ✗
- - ✗
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
-* - AQLM
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
-* - bitsandbytes
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
-* - DeepSpeedFP
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
- - ✗
-* - GGUF
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✅︎
- - ✗
- - ✗
- - ✗
- - ✗
-```
+- * Implementation
+ * Volta
+ * Turing
+ * Ampere
+ * Ada
+ * Hopper
+ * AMD GPU
+ * Intel GPU
+ * x86 CPU
+ * AWS Inferentia
+ * Google TPU
+- * AWQ
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+- * GPTQ
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+- * Marlin (GPTQ/AWQ/FP8)
+ * ✗
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+- * INT8 (W8A8)
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✅︎
+ * ✗
+ * ✗
+- * FP8 (W8A8)
+ * ✗
+ * ✗
+ * ✗
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+- * AQLM
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+- * bitsandbytes
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+- * DeepSpeedFP
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+- * GGUF
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✅︎
+ * ✗
+ * ✗
+ * ✗
+ * ✗
+
+:::
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.
-```{note}
+:::{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please consult with the vLLM development team.
-```
+:::
diff --git a/docs/source/features/reasoning_outputs.md b/docs/source/features/reasoning_outputs.md
new file mode 100644
index 0000000000000..e39bbacf1138d
--- /dev/null
+++ b/docs/source/features/reasoning_outputs.md
@@ -0,0 +1,151 @@
+(reasoning-outputs)=
+
+# Reasoning Outputs
+
+vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
+
+Reasoning models return an additional `reasoning_content` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.
+
+## Supported Models
+
+vLLM currently supports the following reasoning models:
+
+- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) (`deepseek_r1`, which looks for `<think> ... </think>`)
+
+## Quickstart
+
+To use reasoning models, you need to specify the `--enable-reasoning` and `--reasoning-parser` flags when launching the server. The `--reasoning-parser` flag selects the parser used to extract reasoning content from the model output.
+
+```bash
+vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
+ --enable-reasoning --reasoning-parser deepseek_r1
+```
+
+Next, make a request to the model; the response should include the reasoning content.
+
+```python
+from openai import OpenAI
+
+# Modify OpenAI's API key and API base to use vLLM's API server.
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+
+client = OpenAI(
+ api_key=openai_api_key,
+ base_url=openai_api_base,
+)
+
+models = client.models.list()
+model = models.data[0].id
+
+# Round 1
+messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
+response = client.chat.completions.create(model=model, messages=messages)
+
+reasoning_content = response.choices[0].message.reasoning_content
+content = response.choices[0].message.content
+
+print("reasoning_content:", reasoning_content)
+print("content:", content)
+```
+
+The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
+
+## Streaming chat completions
+
+Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
+
+```json
+{
+ "id": "chatcmpl-123",
+ "object": "chat.completion.chunk",
+ "created": 1694268190,
+ "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+ "system_fingerprint": "fp_44709d6fcb",
+ "choices": [
+ {
+ "index": 0,
+ "delta": {
+ "role": "assistant",
+        "reasoning_content": "is"
+ },
+ "logprobs": null,
+ "finish_reason": null
+ }
+ ]
+}
+```
+
+Please note that streaming `reasoning_content` is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests instead.
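As a sketch of how such a stream can be consumed, the helper below accumulates `reasoning_content` and `content` deltas from SSE `data:` lines (the field names follow the chunk example above; the function name is an assumption, not part of vLLM):

```python
import json

def collect_stream(sse_lines):
    """Accumulate reasoning_content and content deltas from SSE data lines."""
    reasoning, content = [], []
    for line in sse_lines:
        # Skip non-data lines and the terminating sentinel.
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("reasoning_content"):
            reasoning.append(delta["reasoning_content"])
        if delta.get("content"):
            content.append(delta["content"])
    return "".join(reasoning), "".join(content)
```

With `requests`, the lines can come from `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`, where `url` and `payload` match the chat completion endpoint and request body shown earlier.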
+
+## How to support a new reasoning model
+
+You can add a new `ReasoningParser` similar to `vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py`.
+
+```python
+# import the required packages
+
+from typing import Optional, Sequence, Tuple, Union
+
+from vllm.entrypoints.openai.reasoning_parsers.abs_reasoning_parsers import (
+    ReasoningParser, ReasoningParserManager)
+from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
+                                              DeltaMessage)
+from vllm.transformers_utils.tokenizer import AnyTokenizer
+
+# define a reasoning parser and register it to vllm
+# the name list in register_module can be used
+# in --reasoning-parser.
+@ReasoningParserManager.register_module(["example"])
+class ExampleParser(ReasoningParser):
+ def __init__(self, tokenizer: AnyTokenizer):
+ super().__init__(tokenizer)
+
+ def extract_reasoning_content_streaming(
+ self,
+ previous_text: str,
+ current_text: str,
+ delta_text: str,
+ previous_token_ids: Sequence[int],
+ current_token_ids: Sequence[int],
+ delta_token_ids: Sequence[int],
+ ) -> Union[DeltaMessage, None]:
+ """
+ Instance method that should be implemented for extracting reasoning
+ from an incomplete response; for use when handling reasoning calls and
+ streaming. Has to be an instance method because it requires state -
+ the current tokens/diffs, but also the information about what has
+ previously been parsed and extracted (see constructor)
+ """
+
+ def extract_reasoning_content(
+ self, model_output: str, request: ChatCompletionRequest
+ ) -> Tuple[Optional[str], Optional[str]]:
+ """
+ Extract reasoning content from a complete model-generated string.
+
+ Used for non-streaming responses where we have the entire model response
+ available before sending to the client.
+
+ Parameters:
+ model_output: str
+ The model-generated string to extract reasoning content from.
+
+ request: ChatCompletionRequest
+ The request object that was used to generate the model_output.
+
+ Returns:
+ Tuple[Optional[str], Optional[str]]
+ A tuple containing the reasoning content and the content.
+ """
+```
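For illustration, the non-streaming extraction step can be sketched as a plain split on the model's reasoning delimiters (here assuming `<think>`/`</think>` tags as used by DeepSeek R1; the helper name and defaults are hypothetical):

```python
from typing import Optional, Tuple

def split_reasoning(model_output: str,
                    start: str = "<think>",
                    end: str = "</think>") -> Tuple[Optional[str], Optional[str]]:
    """Split a complete model output into (reasoning_content, content)."""
    if start in model_output and end in model_output:
        before, _, rest = model_output.partition(start)
        reasoning, _, after = rest.partition(end)
        # Any text outside the delimiters is treated as final content.
        content = (before + after).strip() or None
        return reasoning.strip() or None, content
    # No reasoning delimiters found: everything is final content.
    return None, model_output
```

A real `extract_reasoning_content` implementation would also need to handle token-level edge cases, but the string-level logic is essentially this split.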
+
+After defining the reasoning parser, you can use it by specifying the `--reasoning-parser` flag when launching the server.
+
+```bash
+vllm serve \
+ --enable-reasoning --reasoning-parser example
+```
+
+## Limitations
+
+- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
+- It is not compatible with the [`structured_outputs`](#structured_outputs) and [`tool_calling`](#tool_calling) features.
+- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.
diff --git a/docs/source/features/spec_decode.md b/docs/source/features/spec_decode.md
index ab7b2f302bd13..da87127057dc5 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/source/features/spec_decode.md
@@ -2,15 +2,15 @@
# Speculative Decoding
-```{warning}
+:::{warning}
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here:
-```
+:::
-```{warning}
+:::{warning}
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-```
+:::
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
diff --git a/docs/source/features/structured_outputs.md b/docs/source/features/structured_outputs.md
index 1d77c7339a33f..90c880e8cfa46 100644
--- a/docs/source/features/structured_outputs.md
+++ b/docs/source/features/structured_outputs.md
@@ -95,10 +95,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-```{tip}
+:::{tip}
While not strictly necessary, it's usually better to indicate in the prompt that JSON output is expected, and to describe which fields the LLM should fill and how.
This can notably improve the results in most cases.
-```
+:::
Finally, we have `guided_grammar`, which is probably the most difficult one to use but is really powerful, as it allows us to define complete languages, such as SQL queries.
It works by using a context-free EBNF grammar, which, for example, we can use to define a specific format of simplified SQL queries, like in the example below:
diff --git a/docs/source/generate_examples.py b/docs/source/generate_examples.py
index aaa13d0fb6d3f..9d4de18a3b79d 100644
--- a/docs/source/generate_examples.py
+++ b/docs/source/generate_examples.py
@@ -1,3 +1,5 @@
+# SPDX-License-Identifier: Apache-2.0
+
import itertools
import re
from dataclasses import dataclass, field
@@ -57,9 +59,9 @@ class Index:
def generate(self) -> str:
content = f"# {self.title}\n\n{self.description}\n\n"
- content += "```{toctree}\n"
+ content += ":::{toctree}\n"
content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
- content += "\n".join(self.documents) + "\n```\n"
+ content += "\n".join(self.documents) + "\n:::\n"
return content
diff --git a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
index ae42dd0c0d08f..f3b0d6dc9bdc8 100644
--- a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -2,6 +2,10 @@
This tab provides instructions on running vLLM with Intel Gaudi devices.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: Ubuntu 22.04 LTS
@@ -86,9 +90,9 @@ docker build -f Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
```
-```{tip}
+:::{tip}
If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-```
+:::
## Extra information
@@ -155,30 +159,30 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
-```{list-table} vLLM execution modes
+:::{list-table} vLLM execution modes
:widths: 25 25 50
:header-rows: 1
-* - `PT_HPU_LAZY_MODE`
- - `enforce_eager`
- - execution mode
-* - 0
- - 0
- - torch.compile
-* - 0
- - 1
- - PyTorch eager mode
-* - 1
- - 0
- - HPU Graphs
-* - 1
- - 1
- - PyTorch lazy mode
-```
+- * `PT_HPU_LAZY_MODE`
+ * `enforce_eager`
+ * execution mode
+- * 0
+ * 0
+ * torch.compile
+- * 0
+ * 1
+ * PyTorch eager mode
+- * 1
+ * 0
+ * HPU Graphs
+- * 1
+ * 1
+ * PyTorch lazy mode
+:::
-```{warning}
+:::{warning}
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
-```
+:::
(gaudi-bucketing-mechanism)=
@@ -187,9 +191,9 @@ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently, this is achieved by "bucketing" the model's forward pass across two dimensions - `batch_size` and `sequence_length`.
-```{note}
+:::{note}
Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
-```
+:::
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
@@ -222,15 +226,15 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
-```{warning}
+:::{warning}
If a request exceeds the maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenarios.
-```
+:::
As an example, if a request of 3 sequences, with a max sequence length of 412, comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (the closest batch size dimension higher than 3), and the max sequence length will be padded to 512 (the closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
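The padding rule described above can be sketched as a small rounding helper (a simplified illustration only; vLLM's actual bucketing logic is more involved, e.g. for small batch sizes):

```python
import math

def find_bucket(value: int, bucket_min: int, step: int, bucket_max: int) -> int:
    """Round `value` up to the nearest bucket boundary on a min/step/max grid."""
    padded = max(bucket_min, math.ceil(value / step) * step)
    if padded > bucket_max:
        # Requests beyond the largest bucket are processed without padding.
        raise ValueError(f"{value} exceeds the maximum bucket {bucket_max}")
    return padded
```

For the sequence length range logged above (`min = 128, step = 128, max = 512`), a length of 412 pads to 512.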
-```{note}
+:::{note}
Bucketing is transparent to the client -- padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.
-```
+:::
### Warmup
@@ -252,9 +256,9 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
-```{tip}
+:::{tip}
Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do so, you may face graph compilations when executing a given bucket for the first time. Disabling warmup is fine for development, but it is highly recommended to enable it in deployment.
-```
+:::
### HPU Graph capture
@@ -269,9 +273,9 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
The environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), 30% of the usable graph memory is reserved for prefill graphs and 70% for decode graphs.
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
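The interaction of these two variables can be sketched with some illustrative arithmetic (the function name and example numbers are hypothetical, not vLLM internals):

```python
def graph_memory_split(usable_mem_gib: float,
                       graph_reserved_mem: float = 0.1,
                       graph_prompt_ratio: float = 0.3):
    """Split usable memory into KV cache, prefill-graph, and decode-graph pools."""
    graph_pool = usable_mem_gib * graph_reserved_mem      # VLLM_GRAPH_RESERVED_MEM
    prefill_graphs = graph_pool * graph_prompt_ratio      # VLLM_GRAPH_PROMPT_RATIO
    decode_graphs = graph_pool - prefill_graphs
    kv_cache = usable_mem_gib - graph_pool
    return kv_cache, prefill_graphs, decode_graphs
```

For example, with 40 GiB of usable memory and the defaults, 4 GiB is reserved for graphs, split into 1.2 GiB for prefill graphs and 2.8 GiB for decode graphs.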
-```{note}
+:::{note}
`gpu_memory_utilization` does not correspond to the absolute memory usage across the HPU. It specifies the memory margin after loading the model and performing a profile run. If the device has 100 GiB of total memory and 50 GiB of free memory after loading the model weights and executing the profiling run, `gpu_memory_utilization` at its default value will mark 90% of the 50 GiB as usable, leaving a 5 GiB margin, regardless of total device memory.
-```
+:::
Users can also configure the strategy for capturing HPU Graphs separately for the prompt and decode stages. The strategy affects the order in which graphs are captured. There are two strategies implemented:
\- `max_bs` - the graph capture queue is sorted in descending order by batch size. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`); the default strategy for decode
@@ -279,9 +283,9 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When there is a large number of requests pending, the vLLM scheduler will attempt to fill the maximum decode batch size as soon as possible. When a request is finished, the decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size back to its previous state. This means that in a full-load scenario, the decode batch size is often at its maximum, which makes large-batch-size HPU Graphs crucial to capture, as reflected by the `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
-```{note}
+:::{note}
`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on the memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up the entirety of the usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, and will then attempt to do the same for decode graphs and the usable decode graph memory pool. If one stage is fully captured and there is unused memory left within the usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding the reserved memory pool. The behavior of this mechanism can be observed in the example below.
-```
+:::
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
@@ -352,13 +356,13 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- `VLLM_{phase}_{dim}_BUCKET_{param}` - collection of 12 environment variables configuring ranges of bucketing mechanism
- - `{phase}` is either `PROMPT` or `DECODE`
+ * `{phase}` is either `PROMPT` or `DECODE`
- - `{dim}` is either `BS`, `SEQ` or `BLOCK`
+ * `{dim}` is either `BS`, `SEQ` or `BLOCK`
- - `{param}` is either `MIN`, `STEP` or `MAX`
+ * `{param}` is either `MIN`, `STEP` or `MAX`
- - Default values:
+ * Default values:
- Prompt:
- batch size min (`VLLM_PROMPT_BS_BUCKET_MIN`): `1`
diff --git a/docs/source/getting_started/installation/ai_accelerator/index.md b/docs/source/getting_started/installation/ai_accelerator/index.md
index a6c4c44305a4c..01793572fee7c 100644
--- a/docs/source/getting_started/installation/ai_accelerator/index.md
+++ b/docs/source/getting_started/installation/ai_accelerator/index.md
@@ -2,374 +2,375 @@
vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor-specific instructions:
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
+:selected:
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+:::::
+
## Requirements
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "## Requirements"
:end-before: "## Configure a new environment"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "## Requirements"
+:end-before: "## Configure a new environment"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "## Requirements"
+:end-before: "## Configure a new environment"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "## Requirements"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+:::::
+
## Configure a new environment
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "## Configure a new environment"
:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} ../python_env_setup.inc.md
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "## Configure a new environment"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "## Configure a new environment"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} ../python_env_setup.inc.md
+:::
+
+::::
+
+:::::
+
## Set up using Python
### Pre-built wheels
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "### Pre-built wheels"
+:end-before: "### Build wheel from source"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "### Pre-built wheels"
+:end-before: "### Build wheel from source"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "### Pre-built wheels"
+:end-before: "### Build wheel from source"
+:::
+
+::::
+
+:::::
+
### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+:::::
+
## Set up using Docker
### Pre-built images
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "### Pre-built images"
+:end-before: "### Build image from source"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "### Pre-built images"
+:end-before: "### Build image from source"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "### Pre-built images"
+:end-before: "### Build image from source"
+:::
+
+::::
+
+:::::
+
### Build image from source
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "### Build image from source"
:end-before: "## Extra information"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-```
-
:::
::::
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "### Build image from source"
+:end-before: "## Extra information"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "### Build image from source"
+:end-before: "## Extra information"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "### Build image from source"
+:end-before: "## Extra information"
+:::
+
+::::
+
+:::::
+
## Extra information
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} TPU
+::::{tab-item} Google TPU
:sync: tpu
-```{include} tpu.inc.md
+:::{include} tpu.inc.md
:start-after: "## Extra information"
-```
-
-:::
-
-:::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-```{include} hpu-gaudi.inc.md
-:start-after: "## Extra information"
-```
-
-:::
-
-:::{tab-item} Neuron
-:sync: neuron
-
-```{include} neuron.inc.md
-:start-after: "## Extra information"
-```
-
-:::
-
-:::{tab-item} OpenVINO
-:sync: openvino
-
-```{include} openvino.inc.md
-:start-after: "## Extra information"
-```
-
:::
::::
+
+::::{tab-item} Intel Gaudi
+:sync: hpu-gaudi
+
+:::{include} hpu-gaudi.inc.md
+:start-after: "## Extra information"
+:::
+
+::::
+
+::::{tab-item} AWS Neuron
+:sync: neuron
+
+:::{include} neuron.inc.md
+:start-after: "## Extra information"
+:::
+
+::::
+
+::::{tab-item} OpenVINO
+:sync: openvino
+
+:::{include} openvino.inc.md
+:start-after: "## Extra information"
+:::
+
+::::
+
+:::::
diff --git a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md b/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
index 575a9f9c2e2f0..f149818acafb8 100644
--- a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
@@ -4,6 +4,10 @@ vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Infere
Paged Attention and Chunked Prefill are currently in development and will be available soon.
Data types currently supported in Neuron SDK are FP16 and BF16.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: Linux
@@ -67,9 +71,9 @@ Currently, there are no pre-built Neuron wheels.
### Build wheel from source
-```{note}
+:::{note}
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
-```
+:::
Following instructions are applicable to Neuron SDK 2.16 and beyond.
diff --git a/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
index a7867472583d6..112e8d4d9b256 100644
--- a/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/openvino.inc.md
@@ -2,6 +2,10 @@
vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)).
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: Linux
diff --git a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md b/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
index 6a911cc6b9eba..c0d50feafce56 100644
--- a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
+++ b/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
@@ -30,6 +30,10 @@ For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tp
You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options).
+:::{attention}
+There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
+:::
+
## Requirements
- Google Cloud TPU VM
@@ -47,10 +51,10 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
-```{note}
+:::{note}
In all of the following commands, replace the ALL CAPS parameter names with
appropriate values. See the parameter descriptions table for more information.
-```
+:::
### Provision Cloud TPUs with GKE
@@ -75,33 +79,33 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT
```
-```{list-table} Parameter descriptions
+:::{list-table} Parameter descriptions
:header-rows: 1
-* - Parameter name
- - Description
-* - QUEUED_RESOURCE_ID
- - The user-assigned ID of the queued resource request.
-* - TPU_NAME
- - The user-assigned name of the TPU which is created when the queued
+- * Parameter name
+ * Description
+- * QUEUED_RESOURCE_ID
+ * The user-assigned ID of the queued resource request.
+- * TPU_NAME
+ * The user-assigned name of the TPU which is created when the queued
resource request is allocated.
-* - PROJECT_ID
- - Your Google Cloud project
-* - ZONE
- - The GCP zone where you want to create your Cloud TPU. The value you use
+- * PROJECT_ID
+ * Your Google Cloud project
+- * ZONE
+ * The GCP zone where you want to create your Cloud TPU. The value you use
depends on the version of TPUs you are using. For more information, see
`TPU regions and zones `_
-* - ACCELERATOR_TYPE
- - The TPU version you want to use. Specify the TPU version, for example
+- * ACCELERATOR_TYPE
+ * The TPU version you want to use. Specify the TPU version, for example
`v5litepod-4` specifies a v5e TPU with 4 cores. For more information,
see `TPU versions `_.
-* - RUNTIME_VERSION
- - The TPU VM runtime version to use. For more information see `TPU VM images `_.
-* - SERVICE_ACCOUNT
- - The email address for your service account. You can find it in the IAM
+- * RUNTIME_VERSION
+ * The TPU VM runtime version to use. For more information see `TPU VM images `_.
+- * SERVICE_ACCOUNT
+ * The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@.iam.gserviceaccount.com`
-```
+:::
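+
+For example, a request for a four-core v5e TPU might look like the following (all values are illustrative; substitute your own):
+
+```console
+gcloud alpha compute tpus queued-resources create my-queued-resource \
+  --node-id my-tpu \
+  --project my-project \
+  --zone us-central1-a \
+  --accelerator-type v5litepod-4 \
+  --runtime-version v2-alpha-tpuv5-lite \
+  --service-account tpu-service-account@my-project.iam.gserviceaccount.com
+```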
Connect to your TPU using SSH:
@@ -178,15 +182,15 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
-```{note}
+:::{note}
Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
possible input shapes and compiles an XLA graph for each shape. The
compilation time may take 20~30 minutes in the first run. However, the
compilation time reduces to ~5 minutes afterwards because the XLA graphs are
cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
-```
+:::
-````{tip}
+:::{tip}
If you encounter the following error:
```console
@@ -198,9 +202,10 @@ file or directory
Install OpenBLAS with the following command:
```console
-$ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
```
-````
+
+:::
## Extra information
diff --git a/docs/source/getting_started/installation/cpu/apple.inc.md b/docs/source/getting_started/installation/cpu/apple.inc.md
index 56545253b1ef7..3bf1d47fa0ff9 100644
--- a/docs/source/getting_started/installation/cpu/apple.inc.md
+++ b/docs/source/getting_started/installation/cpu/apple.inc.md
@@ -4,6 +4,10 @@ vLLM has experimental support for macOS with Apple silicon. For now, users shall
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: `macOS Sonoma` or later
@@ -25,9 +29,9 @@ pip install -r requirements-cpu.txt
pip install -e .
```
-```{note}
+:::{note}
On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
-```
+:::
#### Troubleshooting
diff --git a/docs/source/getting_started/installation/cpu/arm.inc.md b/docs/source/getting_started/installation/cpu/arm.inc.md
index 08a764e1a25f4..a661a0ca5adc7 100644
--- a/docs/source/getting_started/installation/cpu/arm.inc.md
+++ b/docs/source/getting_started/installation/cpu/arm.inc.md
@@ -4,6 +4,10 @@ vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CP
ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: Linux
diff --git a/docs/source/getting_started/installation/cpu/index.md b/docs/source/getting_started/installation/cpu/index.md
index 4ec907c0e9fda..d53430403583c 100644
--- a/docs/source/getting_started/installation/cpu/index.md
+++ b/docs/source/getting_started/installation/cpu/index.md
@@ -2,86 +2,87 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} x86
+::::{tab-item} Intel/AMD x86
+:selected:
:sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} ARM
-:sync: arm
-
-```{include} arm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} Apple silicon
-:sync: apple
-
-```{include} apple.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
:::
::::
+::::{tab-item} ARM AArch64
+:sync: arm
+
+:::{include} arm.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+::::{tab-item} Apple silicon
+:sync: apple
+
+:::{include} apple.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+:::::
+
## Requirements
- Python: 3.9 -- 3.12
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} x86
+::::{tab-item} Intel/AMD x86
:sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} ARM
-:sync: arm
-
-```{include} arm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} Apple silicon
-:sync: apple
-
-```{include} apple.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-```
-
:::
::::
+::::{tab-item} ARM AArch64
+:sync: arm
+
+:::{include} arm.inc.md
+:start-after: "## Requirements"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+::::{tab-item} Apple silicon
+:sync: apple
+
+:::{include} apple.inc.md
+:start-after: "## Requirements"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+:::::
+
## Set up using Python
### Create a new Python environment
-```{include} ../python_env_setup.inc.md
-```
+:::{include} ../python_env_setup.inc.md
+:::
### Pre-built wheels
@@ -89,41 +90,41 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} x86
+::::{tab-item} Intel/AMD x86
:sync: x86
-```{include} x86.inc.md
+:::{include} x86.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} ARM
-:sync: arm
-
-```{include} arm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} Apple silicon
-:sync: apple
-
-```{include} apple.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
:::
::::
+::::{tab-item} ARM AArch64
+:sync: arm
+
+:::{include} arm.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+::::{tab-item} Apple silicon
+:sync: apple
+
+:::{include} apple.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+:::::
+
## Set up using Docker
### Pre-built images
@@ -142,9 +143,9 @@ $ docker run -it \
vllm-cpu-env
```
-:::{tip}
+::::{tip}
For ARM or Apple silicon, use `Dockerfile.arm`
-:::
+::::
## Supported features
diff --git a/docs/source/getting_started/installation/cpu/x86.inc.md b/docs/source/getting_started/installation/cpu/x86.inc.md
index e4f99d3cebdf2..1dafc3660060e 100644
--- a/docs/source/getting_started/installation/cpu/x86.inc.md
+++ b/docs/source/getting_started/installation/cpu/x86.inc.md
@@ -2,12 +2,20 @@
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
+:::{tip}
+[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware.
+:::
+
## Set up using Python
### Pre-built wheels
@@ -17,10 +25,10 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
:::{include} build.inc.md
:::
-```{note}
-- AVX512_BF16 is an extension ISA provides native BF16 data type conversion and vector product instructions, will brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
+:::{note}
+- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
-```
+:::
## Set up using Docker
@@ -29,7 +37,3 @@ vLLM initially supports basic model inferencing and serving on x86 CPU platform,
### Build image from source
## Extra information
-
-## Intel Extension for PyTorch
-
-- [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date features optimizations for an extra performance boost on Intel hardware.
diff --git a/docs/source/getting_started/installation/gpu/cuda.inc.md b/docs/source/getting_started/installation/gpu/cuda.inc.md
index 4cce65278c069..5c2ea30dbfde1 100644
--- a/docs/source/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/source/getting_started/installation/gpu/cuda.inc.md
@@ -10,9 +10,9 @@ vLLM contains pre-compiled C++ and CUDA (12.1) binaries.
### Create a new Python environment
-```{note}
+:::{note}
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details.
-```
+:::
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
@@ -100,10 +100,10 @@ pip install --editable .
You can find more information about vLLM's wheels in .
-```{note}
+:::{note}
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to for instructions on how to install a specified wheel.
-```
+:::
#### Full build (with compilation)
@@ -115,7 +115,7 @@ cd vllm
pip install -e .
```
-```{tip}
+:::{tip}
Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
@@ -123,7 +123,7 @@ As long as `which ccache` command can find the `ccache` binary, it will be used
[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
-```
+:::
##### Use an existing PyTorch installation
diff --git a/docs/source/getting_started/installation/gpu/index.md b/docs/source/getting_started/installation/gpu/index.md
index 6c007382b2c3d..f82c4bda28620 100644
--- a/docs/source/getting_started/installation/gpu/index.md
+++ b/docs/source/getting_started/installation/gpu/index.md
@@ -2,299 +2,300 @@
vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
+:selected:
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "# Installation"
:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "# Installation"
+:end-before: "## Requirements"
+:::
+
+::::
+
+:::::
+
## Requirements
- OS: Linux
- Python: 3.9 -- 3.12
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "## Requirements"
:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "## Requirements"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "## Requirements"
+:end-before: "## Set up using Python"
+:::
+
+::::
+
+:::::
+
## Set up using Python
### Create a new Python environment
-```{include} ../python_env_setup.inc.md
-```
-
-::::{tab-set}
-:sync-group: device
-
-:::{tab-item} CUDA
-:sync: cuda
-
-```{include} cuda.inc.md
-:start-after: "## Create a new Python environment"
-:end-before: "### Pre-built wheels"
-```
-
+:::{include} ../python_env_setup.inc.md
:::
-:::{tab-item} ROCm
+:::::{tab-set}
+:sync-group: device
+
+::::{tab-item} NVIDIA CUDA
+:sync: cuda
+
+:::{include} cuda.inc.md
+:start-after: "## Create a new Python environment"
+:end-before: "### Pre-built wheels"
+:::
+
+::::
+
+::::{tab-item} AMD ROCm
:sync: rocm
There is no extra information on creating a new Python environment for this device.
-:::
+::::
-:::{tab-item} XPU
+::::{tab-item} Intel XPU
:sync: xpu
There is no extra information on creating a new Python environment for this device.
-:::
-
::::
+:::::
+
### Pre-built wheels
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "### Pre-built wheels"
:end-before: "### Build wheel from source"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "### Pre-built wheels"
+:end-before: "### Build wheel from source"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "### Pre-built wheels"
+:end-before: "### Build wheel from source"
+:::
+
+::::
+
+:::::
+
(build-from-source)=
### Build wheel from source
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "### Build wheel from source"
:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "### Build wheel from source"
+:end-before: "## Set up using Docker"
+:::
+
+::::
+
+:::::
+
## Set up using Docker
### Pre-built images
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "### Pre-built images"
:end-before: "### Build image from source"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "### Pre-built images"
+:end-before: "### Build image from source"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "### Pre-built images"
+:end-before: "### Build image from source"
+:::
+
+::::
+
+:::::
+
### Build image from source
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "### Build image from source"
:end-before: "## Supported features"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-```
-
:::
::::
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "### Build image from source"
+:end-before: "## Supported features"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "### Build image from source"
+:end-before: "## Supported features"
+:::
+
+::::
+
+:::::
+
## Supported features
-::::{tab-set}
+:::::{tab-set}
:sync-group: device
-:::{tab-item} CUDA
+::::{tab-item} NVIDIA CUDA
:sync: cuda
-```{include} cuda.inc.md
+:::{include} cuda.inc.md
:start-after: "## Supported features"
-```
-
-:::
-
-:::{tab-item} ROCm
-:sync: rocm
-
-```{include} rocm.inc.md
-:start-after: "## Supported features"
-```
-
-:::
-
-:::{tab-item} XPU
-:sync: xpu
-
-```{include} xpu.inc.md
-:start-after: "## Supported features"
-```
-
:::
::::
+
+::::{tab-item} AMD ROCm
+:sync: rocm
+
+:::{include} rocm.inc.md
+:start-after: "## Supported features"
+:::
+
+::::
+
+::::{tab-item} Intel XPU
+:sync: xpu
+
+:::{include} xpu.inc.md
+:start-after: "## Supported features"
+:::
+
+::::
+
+:::::
diff --git a/docs/source/getting_started/installation/gpu/rocm.inc.md b/docs/source/getting_started/installation/gpu/rocm.inc.md
index 69238f6e36fb2..c8fd11415cfda 100644
--- a/docs/source/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/source/getting_started/installation/gpu/rocm.inc.md
@@ -2,6 +2,10 @@
vLLM supports AMD GPUs with ROCm 6.2.
+:::{attention}
+There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
+:::
+
## Requirements
- GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100)
@@ -13,14 +17,6 @@ vLLM supports AMD GPUs with ROCm 6.2.
Currently, there are no pre-built ROCm wheels.
-However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
-docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
-
-```{tip}
-Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
-for instructions on how to use this prebuilt docker image.
-```
-
### Build wheel from source
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
@@ -47,9 +43,9 @@ for instructions on how to use this prebuilt docker image.
cd ../..
```
- ```{note}
- - If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
- ```
+ :::{note}
+ If you see an HTTP issue related to downloading packages while building Triton, please try again, as the HTTP error is intermittent.
+ :::
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
@@ -67,9 +63,9 @@ for instructions on how to use this prebuilt docker image.
cd ..
```
- ```{note}
- - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
- ```
+ :::{note}
+ You might need to downgrade the "ninja" version to 1.10, as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
+ :::
3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
@@ -95,23 +91,30 @@ for instructions on how to use this prebuilt docker image.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
- ```{tip}
+
+ :::{tip}
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
- ```
+ :::
-```{tip}
+:::{tip}
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
-```
+:::
## Set up using Docker
### Pre-built images
-Currently, there are no pre-built ROCm images.
+The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
+docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
+
+:::{tip}
+Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
+for instructions on how to use this prebuilt docker image.
+:::
### Build image from source
diff --git a/docs/source/getting_started/installation/gpu/xpu.inc.md b/docs/source/getting_started/installation/gpu/xpu.inc.md
index 577986eba74fd..ef02d9a078a1b 100644
--- a/docs/source/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/source/getting_started/installation/gpu/xpu.inc.md
@@ -2,6 +2,10 @@
vLLM initially supports basic model inferencing and serving on Intel GPU platform.
+:::{attention}
+There are no pre-built wheels or images for this device, so you must build vLLM from source.
+:::
+
## Requirements
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
@@ -30,10 +34,10 @@ pip install -v -r requirements-xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install
```
-```{note}
+:::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
- type will be supported in the future.
-```
+ type is supported on Intel Data Center GPU but is not yet supported on Intel Arc GPU.
+:::
## Set up using Docker
diff --git a/docs/source/getting_started/installation/index.md b/docs/source/getting_started/installation/index.md
index bc1d268bf0c7e..c64c3a7208eeb 100644
--- a/docs/source/getting_started/installation/index.md
+++ b/docs/source/getting_started/installation/index.md
@@ -4,10 +4,25 @@
vLLM supports the following hardware platforms:
-```{toctree}
+:::{toctree}
:maxdepth: 1
+:hidden:
gpu/index
cpu/index
ai_accelerator/index
-```
+:::
+
+-
+ - NVIDIA CUDA
+ - AMD ROCm
+ - Intel XPU
+-
+ - Intel/AMD x86
+ - ARM AArch64
+ - Apple silicon
+-
+ - Google TPU
+ - Intel Gaudi
+ - AWS Neuron
+ - OpenVINO
diff --git a/docs/source/getting_started/installation/python_env_setup.inc.md b/docs/source/getting_started/installation/python_env_setup.inc.md
index 25cfac5f58aa7..cb73914c9c75e 100644
--- a/docs/source/getting_started/installation/python_env_setup.inc.md
+++ b/docs/source/getting_started/installation/python_env_setup.inc.md
@@ -6,9 +6,9 @@ conda create -n myenv python=3.12 -y
conda activate myenv
```
-```{note}
+:::{note}
[PyTorch has deprecated the conda release channel](https://github.com/pytorch/pytorch/issues/138506). If you use `conda`, please only use it to create Python environment rather than installing packages.
-```
+:::
Or you can create a new Python environment using [uv](https://docs.astral.sh/uv/), a very fast Python environment manager. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following command:
diff --git a/docs/source/getting_started/quickstart.md b/docs/source/getting_started/quickstart.md
index 8ac80e5e5c553..f4682ee45a48e 100644
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/source/getting_started/quickstart.md
@@ -32,9 +32,9 @@ conda activate myenv
pip install vllm
```
-```{note}
+:::{note}
For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
-```
+:::
(quickstart-offline)=
@@ -69,9 +69,9 @@ The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](
llm = LLM(model="facebook/opt-125m")
```
-```{note}
+:::{note}
By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
-```
+:::
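+
+For example, a minimal sketch of loading a model via ModelScope (the model must also be published on ModelScope):
+
+```python
+import os
+# Must be set before the engine is initialized
+os.environ["VLLM_USE_MODELSCOPE"] = "True"
+
+from vllm import LLM
+llm = LLM(model="facebook/opt-125m")  # now resolved via ModelScope instead of HuggingFace
+```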
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -97,10 +97,10 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
-```{note}
+:::{note}
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here](#chat-template).
-```
+:::
This server can be queried in the same format as OpenAI API. For example, to list the models:
diff --git a/docs/source/getting_started/troubleshooting.md b/docs/source/getting_started/troubleshooting.md
index 7bfe9b4036adf..2f41fa3b6b19e 100644
--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/source/getting_started/troubleshooting.md
@@ -4,9 +4,9 @@
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-```{note}
+:::{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
-```
+:::
## Hangs downloading a model
@@ -18,9 +18,9 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
-```{note}
+:::{note}
To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
-```
+:::
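+
+For example (the model name is illustrative):
+
+```console
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --load-format dummy
+```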
## Out of memory
@@ -132,14 +132,14 @@ If the script runs successfully, you should see the message `sanity check is suc
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
-```{note}
+:::{note}
A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
-```
+:::
(troubleshooting-python-multiprocessing)=
diff --git a/docs/source/index.md b/docs/source/index.md
index d7a1117df9c27..ee25678e2c418 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -1,13 +1,13 @@
# Welcome to vLLM
-```{figure} ./assets/logos/vllm-logo-text-light.png
+:::{figure} ./assets/logos/vllm-logo-text-light.png
:align: center
:alt: vLLM
:class: no-scaled-link
:width: 60%
-```
+:::
-```{raw} html
+:::{raw} html