diff --git a/docs/source/index.rst b/docs/source/index.rst
index 54e480635457..4e79871e6e78 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -107,6 +107,7 @@ Documentation
quantization/supported_hardware
quantization/auto_awq
quantization/bnb
+ quantization/int8
quantization/fp8
quantization/fp8_e5m2_kvcache
quantization/fp8_e4m3_kvcache
diff --git a/docs/source/quantization/fp8.rst b/docs/source/quantization/fp8.rst
index 7f796fc3ab45..d7d9b21b4b94 100644
--- a/docs/source/quantization/fp8.rst
+++ b/docs/source/quantization/fp8.rst
@@ -1,6 +1,6 @@
.. _fp8:
-FP8
+FP8 W8A8
==================
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
@@ -15,6 +15,11 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and ``nan``.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
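+
+For a quick check of these ranges, you can query PyTorch, which exposes both formats (in recent releases) as ``torch.float8_e4m3fn`` and ``torch.float8_e5m2``:
+
+.. code-block:: python
+
+    import torch
+
+    # E4M3: higher precision, largest finite value is 448 (no inf representation)
+    print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
+
+    # E5M2: wider dynamic range (up to 57344) at the cost of one mantissa bit
+    print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
+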
+.. note::
+
+ FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+ FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+
Quick Start with Online Dynamic Quantization
--------------------------------------------
@@ -33,10 +38,122 @@ In this mode, all Linear modules (except for the final ``lm_head``) have their w
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-Offline Quantization
+Installation
+------------
+
+To produce performant FP8 quantized models with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor>`_ library:
+
+.. code-block:: console
+
+ $ pip install llmcompressor==0.1.0
+
+Quantization Process
--------------------
-For offline quantization to FP8, please install the `AutoFP8 library `_.
+The quantization process involves three main steps:
+
+1. Loading the model
+2. Applying quantization
+3. Evaluating accuracy in vLLM
+
+1. Loading the Model
+^^^^^^^^^^^^^^^^^^^^
+
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
+
+.. code-block:: python
+
+ from llmcompressor.transformers import SparseAutoModelForCausalLM
+ from transformers import AutoTokenizer
+
+ MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+
+ model = SparseAutoModelForCausalLM.from_pretrained(
+ MODEL_ID, device_map="auto", torch_dtype="auto")
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+2. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+For FP8 quantization, we can recover accuracy with simple RTN (round-to-nearest) quantization. We recommend targeting all ``Linear`` layers using the ``FP8_DYNAMIC`` scheme, which uses:
+
+- Static, per-channel quantization on the weights
+- Dynamic, per-token quantization on the activations
+
+Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
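+
+For intuition, here is a rough sketch in plain PyTorch (not vLLM's or llm-compressor's actual kernels) of what dynamic per-token activation quantization computes at runtime:
+
+.. code-block:: python
+
+    import torch
+
+    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0
+
+    def quantize_activations_per_token(x: torch.Tensor):
+        # One scale per token (row), derived from that row's max magnitude
+        scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / FP8_MAX
+        x_fp8 = (x / scales).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
+        return x_fp8, scales
+
+    x = torch.randn(4, 4096, dtype=torch.bfloat16)
+    x_fp8, scales = quantize_activations_per_token(x)
+    # Dequantize to inspect the rounding error introduced by FP8
+    error = (x_fp8.to(torch.bfloat16) * scales - x).abs().max()
+
+The actual recipe used by ``llm-compressor`` only needs the modifier configuration shown below: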
+
+.. code-block:: python
+
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import QuantizationModifier
+
+ # Configure the simple PTQ quantization
+ recipe = QuantizationModifier(
+ targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
+
+ # Apply the quantization algorithm.
+ oneshot(model=model, recipe=recipe)
+
+ # Save the model.
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
+ model.save_pretrained(SAVE_DIR)
+ tokenizer.save_pretrained(SAVE_DIR)
+
+3. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+
+Install ``vllm`` and ``lm-evaluation-harness``:
+
+.. code-block:: console
+
+ $ pip install vllm lm_eval==0.4.3
+
+Load and run the model in ``vllm``:
+
+.. code-block:: python
+
+ from vllm import LLM
+ model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
+ model.generate("Hello my name is")
+
+Evaluate accuracy with ``lm_eval`` (for example on 250 samples of ``gsm8k``):
+
+.. note::
+
+ Quantized models can be sensitive to the presence of the ``bos`` token. ``lm_eval`` does not add a ``bos`` token by default, so make sure to include the ``add_bos_token=True`` argument when running your evaluations.
+
+.. code-block:: console
+
+ $ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+ $ lm_eval \
+ --model vllm \
+ --model_args pretrained=$MODEL,add_bos_token=True \
+ --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
+
+Here's an example of the resulting scores:
+
+.. code-block:: text
+
+ |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
+ |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
+ |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.768|± |0.0268|
+ | | |strict-match | 5|exact_match|↑ |0.768|± |0.0268|
+
+Troubleshooting and Support
+---------------------------
+
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
+
+
+Deprecated Flow
+------------------
+
+.. note::
+
+ The following information is preserved for reference and search purposes.
+ The quantization method described below is deprecated in favor of the ``llmcompressor`` method described above.
+
+For static per-tensor offline quantization to FP8, please install the `AutoFP8 library <https://github.com/neuralmagic/AutoFP8>`_.
.. code-block:: bash
@@ -45,94 +162,10 @@ For offline quantization to FP8, please install the `AutoFP8 library `_
-FP8 kv-cache scaling factors are contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
-
-.. code-block:: text
-
- model.layers.0.self_attn.kv_scale < F32
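+
+For reference, the deprecated flow looked roughly like the following. This is a hedged sketch based on older AutoFP8 releases; the exact class and method names may differ between versions:
+
+.. code-block:: python
+
+    from transformers import AutoTokenizer
+    from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
+
+    MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+
+    # Static activation scales need a small set of calibration prompts
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+    examples = tokenizer(["The capital of France is"], return_tensors="pt").to("cuda")
+
+    quantize_config = BaseQuantizeConfig(
+        quant_method="fp8", activation_scheme="static")
+
+    model = AutoFP8ForCausalLM.from_pretrained(MODEL_ID, quantize_config)
+    model.quantize(examples)
+    model.save_quantized(MODEL_ID.split("/")[1] + "-FP8")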
diff --git a/docs/source/quantization/int8.rst b/docs/source/quantization/int8.rst
new file mode 100644
index 000000000000..04fa30844950
--- /dev/null
+++ b/docs/source/quantization/int8.rst
@@ -0,0 +1,145 @@
+.. _int8:
+
+INT8 W8A8
+==================
+
+vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
+This quantization method is particularly useful for reducing model size while preserving accuracy.
+
+Please visit the HF collection of `quantized INT8 checkpoints of popular LLMs ready to use with vLLM `_.
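+
+These checkpoints can be loaded directly; for example (the model name below is only an illustrative placeholder, substitute any checkpoint from the collection):
+
+.. code-block:: python
+
+    from vllm import LLM
+
+    # Any W8A8 INT8 checkpoint from the collection works here
+    llm = LLM("neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8")
+    print(llm.generate("Hello my name is"))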
+
+.. note::
+
+ INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
+
+Prerequisites
+-------------
+
+To use INT8 quantization with vLLM, you'll need to install the `llm-compressor <https://github.com/vllm-project/llm-compressor>`_ library:
+
+.. code-block:: console
+
+ $ pip install llmcompressor==0.1.0
+
+Quantization Process
+--------------------
+
+The quantization process involves four main steps:
+
+1. Loading the model
+2. Preparing calibration data
+3. Applying quantization
+4. Evaluating accuracy in vLLM
+
+1. Loading the Model
+^^^^^^^^^^^^^^^^^^^^
+
+Use ``SparseAutoModelForCausalLM``, which wraps ``AutoModelForCausalLM``, for saving and loading quantized models:
+
+.. code-block:: python
+
+ from llmcompressor.transformers import SparseAutoModelForCausalLM
+ from transformers import AutoTokenizer
+
+ MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
+ model = SparseAutoModelForCausalLM.from_pretrained(
+ MODEL_ID, device_map="auto", torch_dtype="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+2. Preparing Calibration Data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When quantizing activations to INT8, you need sample data to estimate the activation scales.
+It's best to use calibration data that closely matches your deployment data.
+For a general-purpose instruction-tuned model, you can use a dataset like ``ultrachat``:
+
+.. code-block:: python
+
+ from datasets import load_dataset
+
+ NUM_CALIBRATION_SAMPLES = 512
+ MAX_SEQUENCE_LENGTH = 2048
+
+ # Load and preprocess the dataset
+ ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+ ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
+
+ def preprocess(example):
+ return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
+ ds = ds.map(preprocess)
+
+ def tokenize(sample):
+ return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
+ ds = ds.map(tokenize, remove_columns=ds.column_names)
+
+3. Applying Quantization
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Now, apply the quantization algorithms:
+
+.. code-block:: python
+
+ from llmcompressor.transformers import oneshot
+ from llmcompressor.modifiers.quantization import GPTQModifier
+ from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
+
+ # Configure the quantization algorithms
+ recipe = [
+ SmoothQuantModifier(smoothing_strength=0.8),
+ GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
+ ]
+
+ # Apply quantization
+ oneshot(
+ model=model,
+ dataset=ds,
+ recipe=recipe,
+ max_seq_length=MAX_SEQUENCE_LENGTH,
+ num_calibration_samples=NUM_CALIBRATION_SAMPLES,
+ )
+
+ # Save the compressed model
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
+ model.save_pretrained(SAVE_DIR, save_compressed=True)
+ tokenizer.save_pretrained(SAVE_DIR)
+
+This process creates a W8A8 model with weights and activations quantized to 8-bit integers.
+
+4. Evaluating Accuracy
+^^^^^^^^^^^^^^^^^^^^^^
+
+After quantization, you can load and run the model in vLLM:
+
+.. code-block:: python
+
+ from vllm import LLM
+ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
+ model.generate("Hello my name is")
+
+To evaluate accuracy, you can use ``lm_eval``:
+
+.. code-block:: console
+
+ $ lm_eval --model vllm \
+ --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=True \
+ --tasks gsm8k \
+ --num_fewshot 5 \
+ --limit 250 \
+ --batch_size 'auto'
+
+.. note::
+
+ Quantized models can be sensitive to the presence of the ``bos`` token. Make sure to include the ``add_bos_token=True`` argument when running evaluations.
+
+Best Practices
+--------------
+
+- Start with 512 samples for calibration data (increase if accuracy drops)
+- Use a sequence length of 2048 as a starting point
+- Employ the chat template or instruction template that the model was trained with
+- If you've fine-tuned a model, consider using a sample of your training data for calibration, as sketched below
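+
+As a sketch of that last point, you could load a sample of your own training data as the calibration set in step 2 (the file path and ``messages`` column below are hypothetical placeholders):
+
+.. code-block:: python
+
+    from datasets import load_dataset
+
+    NUM_CALIBRATION_SAMPLES = 512
+
+    # Point this at a JSONL sample of your fine-tuning data
+    ds = load_dataset("json", data_files="finetuning_sample.jsonl", split="train")
+    ds = ds.shuffle(seed=42).select(range(min(NUM_CALIBRATION_SAMPLES, len(ds))))
+
+    # Reuse the same chat-template preprocessing as in step 2
+    ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})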
+
+Troubleshooting and Support
+---------------------------
+
+If you encounter any issues or have feature requests, please open an issue on the ``vllm-project/llm-compressor`` GitHub repository.
\ No newline at end of file