[Docs] Update the AWQ documentation to highlight performance issue (#1883)
This commit is contained in:
parent f86bd6190a
commit 4cefa9b49b
@@ -3,6 +3,12 @@
 AutoAWQ
 ==================
 
+.. warning::
+
+   Please note that AWQ support in vLLM is under-optimized at the moment. We recommend using the unquantized version of the model for better
+   accuracy and higher throughput. Currently, you can use AWQ as a way to reduce the memory footprint; it is more suitable for low-latency
+   inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
+
 To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
 Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
 The main benefits are lower latency and memory usage.
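For context, the workflow the updated page points to looks roughly like the sketch below: quantize a model to 4-bit AWQ with AutoAWQ, then load the saved checkpoint in vLLM with ``quantization="awq"``. This is a minimal illustration, not part of the commit; the model name, output directory, and quantization settings are placeholder assumptions, and in practice the two steps would normally run as separate jobs.

.. code-block:: python

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    from vllm import LLM, SamplingParams

    # Placeholder model and output path -- substitute your own checkpoint.
    model_path = "facebook/opt-125m"
    quant_path = "opt-125m-awq"

    # Typical AutoAWQ settings: 4-bit weights, group size 128.
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Step 1: quantize with AutoAWQ and save the INT4 checkpoint.
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

    # Step 2: load the AWQ checkpoint in vLLM. Per the warning above, expect a
    # smaller memory footprint but lower throughput than the unquantized model.
    llm = LLM(model=quant_path, quantization="awq")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=32))
    print(outputs[0].outputs[0].text)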