mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2025-12-09 20:28:42 +08:00

[Doc] cleanup TPU documentation and remove outdated examples (#29048)

Signed-off-by: Rob Mulla <rob.mulla@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

2025-11-21 00:05:59 +00:00

Quantization

Quantization trades model precision for a smaller memory footprint, which lets large models run on a wider range of devices.
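To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only and ignores quantization metadata (scales, zero points), activations, and the KV cache, so real usage will differ. The helper name `weight_memory_gib` and the 7-billion-parameter model size are hypothetical, chosen only for illustration.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width and
    no per-group scales or zero points are counted.
    """
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical model size used only for illustration.
params = 7_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bit width halves the estimated weight storage, which is why a 4-bit model can fit on a device that cannot hold its 16-bit counterpart.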

Supported Hardware

The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

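| Implementation        | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU |
|-----------------------|-------|--------|--------|-----|--------|---------|-----------|-------------|---------|
| AWQ                   | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ❌          | ✅︎      |
| GPTQ                  | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ❌          | ✅︎      |
| Marlin (GPTQ/AWQ/FP8) | ❌    | ❌     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      |
| INT8 (W8A8)           | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ✅︎      |
| FP8 (W8A8)            | ❌    | ❌     | ❌     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌          | ❌      |
| BitBLAS               | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      |
| BitBLAS (GPTQ)        | ❌    | ❌     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      |
| bitsandbytes          | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      |
| DeepSpeedFP           | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      |
| GGUF                  | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌          | ❌      |
| INC (W8A8)            | ❌    | ❌     | ❌     | ❌  | ❌     | ❌      | ❌        | ✅︎          | ❌      |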

- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
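As a reading aid, the NVIDIA columns of the table can be expressed as a small lookup keyed by compute capability. This is only a sketch: the dictionaries are transcribed by hand from the table and legend above, not read from vLLM, and the names `SM_TO_ARCH`, `SUPPORT`, and `supported_methods` are hypothetical. Because the chart may change, treat the result as a snapshot rather than an authoritative check.

```python
# Compute capability (major, minor) -> architecture column, per the legend above.
SM_TO_ARCH = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

ALL_NVIDIA = {"Volta", "Turing", "Ampere", "Ada", "Hopper"}
AMPERE_UP = {"Ampere", "Ada", "Hopper"}

# NVIDIA columns of the table, transcribed by hand (hypothetical helper data).
SUPPORT = {
    "AWQ": {"Turing", "Ampere", "Ada", "Hopper"},
    "GPTQ": ALL_NVIDIA,
    "Marlin (GPTQ/AWQ/FP8)": AMPERE_UP,
    "INT8 (W8A8)": {"Turing", "Ampere", "Ada", "Hopper"},
    "FP8 (W8A8)": {"Ada", "Hopper"},
    "BitBLAS": ALL_NVIDIA,
    "BitBLAS (GPTQ)": AMPERE_UP,
    "bitsandbytes": ALL_NVIDIA,
    "DeepSpeedFP": ALL_NVIDIA,
    "GGUF": ALL_NVIDIA,
    "INC (W8A8)": set(),  # Intel Gaudi only in the table
}


def supported_methods(major: int, minor: int) -> list[str]:
    """Return the table rows marked as supported for an NVIDIA compute capability.

    Capabilities not named in the legend return an empty list, because the
    table says nothing about them.
    """
    arch = SM_TO_ARCH.get((major, minor))
    if arch is None:
        return []
    return sorted(name for name, archs in SUPPORT.items() if arch in archs)


print(supported_methods(8, 9))  # rows marked for Ada (SM 8.9)
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed to this helper.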

!!! note
    For information on quantization support on Google TPU, please refer to the TPU-Inference Recommended Models and Features documentation.

!!! note
    This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

    For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.
