[CI] docfix (#5410)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: ywang96 <ywang@roblox.com>
This commit is contained in:
parent 8bab4959be
commit 246598a6b1
@@ -20,7 +20,8 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:

Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.

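For reference, a minimal sketch of wiring these engine arguments through the ``LLM`` constructor; the parameter names other than ``image_input_shape`` and the LLaVA-1.5 values below are assumptions about this release's VLM support, not a definitive recipe:

.. code-block:: python

    from vllm import LLM

    # Hypothetical LLaVA-1.5 setup: input images are always resized to the
    # static image_input_shape, per the limitation above.
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )
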
We are continuously improving the user and developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.

Offline Batched Inference

@@ -13,7 +13,7 @@ The FP8 types typically supported in hardware have two distinct representations,

- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.

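As a quick numeric check of these formats, a short sketch (assumes a PyTorch build that exposes the FP8 dtypes, e.g. 2.1 or later):

.. code-block:: python

    import torch

    # E5M2 trades mantissa bits for exponent bits: wide range, coarse precision.
    print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
    # E4M3 (the "fn" variant, which drops inf): narrower range, more precision.
    print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
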
Quick Start with Online Dynamic Quantization
--------------------------------------------

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.

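For instance, a minimal sketch of enabling it from Python; the model name is an illustrative assumption:

.. code-block:: python

    from vllm import LLM

    # Weights are quantized to FP8 at load time; no calibration data is needed.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")
    outputs = llm.generate("The capital of France is")
    print(outputs[0].outputs[0].text)
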
@@ -173,25 +173,28 @@ Here we detail the structure for the FP8 checkpoints.

The following must be present in the model's ``config.json``:

.. code-block:: text

"quantization_config": {
|
"quantization_config": {
|
||||||
"quant_method": "fp8",
|
"quant_method": "fp8",
|
||||||
"activation_scheme": "static" or "dynamic"
|
"activation_scheme": "static" or "dynamic"
|
||||||
},
|
}
|
||||||
|
|
||||||
|
|
||||||
Each quantized layer in the state_dict will have these tensors:

* If the config has ``"activation_scheme": "static"``:

.. code-block:: text

    model.layers.0.mlp.down_proj.weight < F8_E4M3
    model.layers.0.mlp.down_proj.input_scale < F32
    model.layers.0.mlp.down_proj.weight_scale < F32

* If the config has ``"activation_scheme": "dynamic"``:

.. code-block:: text

    model.layers.0.mlp.down_proj.weight < F8_E4M3
    model.layers.0.mlp.down_proj.weight_scale < F32

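To spot-check this layout on a local checkpoint, a sketch along these lines can help (assumes a single ``model.safetensors`` shard and a PyTorch build with FP8 dtype support):

.. code-block:: python

    from safetensors import safe_open

    # Walk the serialized tensors and report the dtype of each down_proj entry.
    with safe_open("model.safetensors", framework="pt") as f:
        for name in f.keys():
            if "down_proj" in name:
                print(name, f.get_tensor(name).dtype)
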
@@ -199,4 +202,5 @@ Each quantized layer in the state_dict will have these tensors:

Additionally, there can be `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:

.. code-block:: text

    model.layers.0.self_attn.kv_scale < F32