From 398521ad199d0ca8f822a98a004fef0e5914753a Mon Sep 17 00:00:00 2001
From: Ilya Lavrenov
Date: Tue, 20 Aug 2024 17:33:56 +0400
Subject: [PATCH] [OpenVINO] Updated documentation (#7687)

---
 docs/source/getting_started/openvino-installation.rst | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docs/source/getting_started/openvino-installation.rst b/docs/source/getting_started/openvino-installation.rst
index d8f27c4328a5..b67e0410f744 100644
--- a/docs/source/getting_started/openvino-installation.rst
+++ b/docs/source/getting_started/openvino-installation.rst
@@ -70,7 +70,7 @@ vLLM OpenVINO backend uses the following environment variables to control behavi
 
 - ``VLLM_OPENVINO_CPU_KV_CACHE_PRECISION=u8`` to control KV cache precision. By default, FP16 / BF16 is used depending on platform.
 
-- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off.
+- ``VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS=ON`` to enable U8 weights compression during model loading stage. By default, compression is turned off. You can also export the model with different compression techniques using ``optimum-cli`` and pass the exported folder as ``<model_id>``.
 
 To enable better TPOT / TTFT latency, you can use vLLM's chunked prefill feature (``--enable-chunked-prefill``). Based on the experiments, the recommended batch size is ``256`` (``--max-num-batched-tokens``)
 
@@ -91,5 +91,3 @@ Limitations
 - Only LLM models are currently supported. LLaVa and encoder-decoder models are not currently enabled in vLLM OpenVINO integration.
 
 - Tensor and pipeline parallelism are not currently enabled in vLLM integration.
-
-- Speculative sampling is not tested within vLLM integration.
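
As a usage sketch (not part of the patch itself), the settings touched by the hunk above might be combined as follows. Assumptions are labeled in the comments: it presumes a vLLM build with the OpenVINO backend installed, the model id is illustrative, and ``enable_chunked_prefill`` / ``max_num_batched_tokens`` are used here as the Python-API counterparts of the CLI flags named in the document::

    # Minimal sketch: OpenVINO env vars plus chunked prefill, per the doc above.
    # Assumptions: vLLM with the OpenVINO backend is installed; the model id
    # below is illustrative and could also be a folder exported by optimum-cli.
    import os

    # Keep the KV cache in U8 instead of the platform default FP16 / BF16.
    os.environ["VLLM_OPENVINO_CPU_KV_CACHE_PRECISION"] = "u8"
    # Enable U8 weight compression at model load time (off by default).
    os.environ["VLLM_OPENVINO_ENABLE_QUANTIZED_WEIGHTS"] = "ON"

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-2-7b-chat-hf",  # illustrative model id
        enable_chunked_prefill=True,            # better TPOT / TTFT latency
        max_num_batched_tokens=256,             # batch size recommended above
    )

    outputs = llm.generate(["What is vLLM?"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

Setting the environment variables before the engine is constructed mirrors how they would be exported in a shell before launching a vLLM server.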