# Quantization
Quantization trades model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
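For a quick orientation, here is a minimal sketch of loading a quantized checkpoint through vLLM's offline `LLM` API. The model name is illustrative, and the explicit `quantization` argument is often optional, since vLLM can usually infer the method from a pre-quantized checkpoint's config; see the per-method pages below for specifics.

```python
# Minimal sketch: serving a quantized model with vLLM's offline API.
# The checkpoint below is an example AWQ-quantized model; substitute any
# checkpoint quantized with one of the methods listed under Contents.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",               # explicit; often inferred automatically
)

outputs = llm.generate(
    ["What is quantization?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```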
Contents:

- [Supported Hardware](supported_hardware.md)
- [AutoAWQ](auto_awq.md)
- [AutoRound](auto_round.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [INC](inc.md)
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
- [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [AMD Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)