[Docs] Update the AWQ documentation to highlight performance issue (#1883)
This commit is contained in:
parent f86bd6190a
commit 4cefa9b49b
@@ -3,6 +3,12 @@
 AutoAWQ
 ==================
 
+.. warning::
+
+   Please note that AWQ support in vLLM is under-optimized at the moment. We recommend using the unquantized version of the model for better
+   accuracy and higher throughput. Currently, you can use AWQ as a way to reduce the memory footprint; it is more suitable for low-latency
+   inference with a small number of concurrent requests. vLLM's AWQ implementation has lower throughput than the unquantized version.
+
 To create a new 4-bit quantized model, you can leverage `AutoAWQ <https://github.com/casper-hansen/AutoAWQ>`_.
 Quantizing reduces the model's precision from FP16 to INT4 which effectively reduces the file size by ~70%.
 The main benefits are lower latency and memory usage.
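For context, the workflow the updated page points to looks roughly like the sketch below: quantize a model to 4-bit AWQ with AutoAWQ, then load the saved checkpoint in vLLM with ``quantization="awq"``. This is a minimal illustration, not part of the commit; the model name, output directory, and quantization settings are placeholder assumptions, and in practice the two steps would normally run as separate jobs.

.. code-block:: python

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    from vllm import LLM, SamplingParams

    # Placeholder model and output path -- substitute your own checkpoint.
    model_path = "facebook/opt-125m"
    quant_path = "opt-125m-awq"

    # Typical AutoAWQ settings: 4-bit weights, group size 128.
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Step 1: quantize with AutoAWQ and save the INT4 checkpoint.
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

    # Step 2: load the AWQ checkpoint in vLLM. Per the warning above, expect a
    # smaller memory footprint but lower throughput than the unquantized model.
    llm = LLM(model=quant_path, quantization="awq")
    outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, max_tokens=32))
    print(outputs[0].outputs[0].text)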