[DOC] [ROCm] Add ROCm quickstart guide (#26505)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
vllmellm 2025-10-22 11:10:48 +08:00 committed by GitHub
parent 09a7e6f617
commit 265ecb05fb


@@ -12,6 +12,8 @@ This guide will help you quickly get started with vLLM to perform:
## Installation
=== "NVIDIA CUDA"
If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
@@ -39,6 +41,28 @@ pip install --upgrade uv
uv pip install vllm --torch-backend=auto
```
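For reference, the full flow looks roughly like the sketch below (the Python version and the default `.venv` location are assumptions; adjust them to your setup):

```bash
# Create and activate a fresh uv-managed environment (Python 3.12 assumed)
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install vLLM; --torch-backend=auto selects a PyTorch build matching your CUDA driver
uv pip install vllm --torch-backend=auto
```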
=== "AMD ROCm"
Use a pre-built Docker image from Docker Hub. The public stable image is [rocm/vllm:latest](https://hub.docker.com/r/rocm/vllm); a development image is also available at [rocm/vllm-dev](https://hub.docker.com/r/rocm/vllm-dev).
The `-v` flag in the `docker run` command below mounts a local directory into the container. Replace `<path/to/your/models>` with the path on your host machine to the directory containing your models. The models will then be accessible inside the container at `/app/models`.
???+ console "Commands"
```bash
docker pull rocm/vllm-dev:nightly # to get the latest image
docker run -it --rm \
--network=host \
--group-add=video \
--ipc=host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v <path/to/your/models>:/app/models \
-e HF_HOME="/app/models" \
rocm/vllm-dev:nightly
```
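Once the container is running, a quick sanity check might look like the sketch below (the model name is only an example; any model cached under `/app/models` or available on the Hugging Face Hub works, and `rocm-smi` is assumed to be present in the image):

```bash
# Confirm the GPUs are visible to ROCm inside the container
rocm-smi

# Start an OpenAI-compatible server; downloaded weights are cached under HF_HOME (/app/models)
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```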
!!! note
For more detail, and for non-CUDA platforms, please refer to the [installation instructions](installation/README.md) for how to install vLLM.
@@ -246,7 +270,17 @@ Alternatively, you can use the `openai` Python package:
Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options:
- On NVIDIA CUDA: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
- On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
For AMD ROCm, you can further control the specific Attention implementation using the following environment variable combinations (see the sketch after this list):
- Triton Unified Attention: `VLLM_ROCM_USE_AITER=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
- AITER Unified Attention: `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
- Triton Prefill-Decode Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0`
- AITER Multi-head Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=1`
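Putting this together, a sketch of pinning one combination when launching the server (the model name is only an example, and setting `VLLM_ATTENTION_BACKEND` explicitly alongside the flags is an assumption; the flag values follow the Triton Unified Attention entry above):

```bash
# Force the Triton unified attention path on ROCm (see the list above)
export VLLM_ATTENTION_BACKEND=TRITON_ATTN
export VLLM_ROCM_USE_AITER=0
export VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve Qwen/Qwen2.5-1.5B-Instruct
```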
!!! warning
There are no pre-built vLLM wheels containing FlashInfer, so you must install it in your environment first. Refer to the [FlashInfer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
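A hedged sketch of one way to install it into the active environment (the `flashinfer-python` package name is an assumption and the right wheel depends on your CUDA and PyTorch versions; treat the FlashInfer docs as authoritative):

```bash
# Install FlashInfer into the current environment; verify the package name and
# the wheel's CUDA/PyTorch compatibility against the FlashInfer documentation
uv pip install flashinfer-python
```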