[Docs] Remove references to VLLM_ATTENTION_BACKEND (#30564)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
commit f5dfbbd8e9 (parent fc0119425c)
@@ -281,17 +281,27 @@ Alternatively, you can use the `openai` Python package:
 
 Currently, vLLM supports multiple backends for efficient Attention computation across different platforms and accelerator architectures. It automatically selects the most performant backend compatible with your system and model specifications.
 
-If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options:
+If desired, you can also manually set the backend of your choice using the `--attention-backend` CLI argument:
+
+```bash
+# For online serving
+vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend FLASH_ATTN
+
+# For offline inference
+python script.py --attention-backend FLASHINFER
+```
+
+Some of the available backend options include:
 
 - On NVIDIA CUDA: `FLASH_ATTN` or `FLASHINFER`.
 - On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.
 
-For AMD ROCm, you can further control the specific Attention implementation using the following variables:
+For AMD ROCm, you can further control the specific Attention implementation using the following options:
 
-- Triton Unified Attention: `VLLM_ROCM_USE_AITER=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
-- AITER Unified Attention: `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
-- Triton Prefill-Decode Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0`
-- AITER Multi-head Attention: `VLLM_ROCM_USE_AITER=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=1`
+- Triton Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=0 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
+- AITER Unified Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
+- Triton Prefill-Decode Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=0` and pass `--attention-config.use_prefill_decode_attention=true` as a CLI argument.
+- AITER Multi-head Attention: Set the environment variables `VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=1` and pass `--attention-config.use_prefill_decode_attention=false` as a CLI argument.
 
 !!! warning
     There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
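A few illustrative commands for the options touched by this change. These are sketches, not part of the diff above: the model name reuses `Qwen/Qwen2.5-1.5B-Instruct` from the documentation's own example, and backend availability depends on your hardware and vLLM build. The first shows explicit backend selection on AMD ROCm with the `--attention-backend` flag:

```bash
# Sketch: explicitly pick one of the ROCm backends listed in the docs.
# TRITON_ATTN is taken from that list; whether it is selectable depends on
# your GPU and how vLLM was built.
vllm serve Qwen/Qwen2.5-1.5B-Instruct --attention-backend TRITON_ATTN
```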
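The ROCm bullets in the diff combine environment variables with the `--attention-config.use_prefill_decode_attention` flag; a minimal sketch of how the two compose on one command line:

```bash
# Sketch: select the AITER Multi-head Attention path on ROCm by setting the
# documented environment variables inline and passing the attention-config
# flag from the same bullet list.
VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MHA=1 \
  vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --attention-config.use_prefill_decode_attention=false
```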
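For the FlashInfer warning, a hedged install sketch: it assumes the PyPI package name `flashinfer-python`, so check the FlashInfer docs for the wheel matching your CUDA and PyTorch versions:

```bash
# Assumption: FlashInfer is published on PyPI as `flashinfer-python`.
# Verify the exact package/wheel for your CUDA and torch versions against
# the official FlashInfer documentation before relying on this.
pip install flashinfer-python
```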