[Doc]: fix typos in various files (#29010)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
parent da2f6800e0
commit 09540cd918
@@ -4,7 +4,7 @@
 <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
 </p>

-vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
+vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).

 ## Prerequisites

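For context, the SkyPilot flow referenced in that paragraph can be pictured with SkyPilot's Python API. This is a minimal sketch, not the documented recipe: the model name, accelerator, and cluster name are placeholders, and multi-replica serving is handled by SkyPilot Serve rather than the single launch shown here.

```python
# Minimal sketch: launch one vLLM server on a cloud VM via SkyPilot.
# Model, accelerator, and cluster name below are placeholders.
import sky

task = sky.Task(
    setup="pip install vllm",
    run="vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000",
)
task.set_resources(sky.Resources(accelerators="A100:1"))

# Provisions a cluster on whichever cloud is available and runs the task.
sky.launch(task, cluster_name="vllm-demo")
```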
@@ -1,6 +1,6 @@
 # Automatic Prefix Caching

-Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic, etc) and most open source LLM inference frameworks (e.g., SGLang).
+Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic, etc.) and most open source LLM inference frameworks (e.g., SGLang).

 While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:

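To make the hash-based scheme concrete, here is a toy sketch of chained block hashing. vLLM's real implementation hashes token ids together with extra keys inside its KV cache manager; the helper below only illustrates the idea that two requests sharing a prefix produce identical hashes for the shared full blocks.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per kv-cache block (illustrative value)

def hash_block(parent_hash: str | None, block_tokens: list[int]) -> str:
    """Hash a block by its own tokens plus the hash of the prefix before it."""
    key = (parent_hash or "") + "," + ",".join(map(str, block_tokens))
    return hashlib.sha256(key.encode()).hexdigest()

def block_hashes(tokens: list[int]) -> list[str]:
    hashes, parent = [], None
    # Only full blocks are hashed; a trailing partial block is not cached.
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        parent = hash_block(parent, tokens[i:i + BLOCK_SIZE])
        hashes.append(parent)
    return hashes

prompt_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
prompt_b = [1, 2, 3, 4, 5, 6, 7, 8, 42]  # same 2-block prefix, different tail

# The first two block hashes match, so the second request could reuse those blocks.
print(block_hashes(prompt_a)[:2] == block_hashes(prompt_b)[:2])  # True
```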
@@ -158,7 +158,7 @@ python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \

 ## Experimental Feature

-### Heterogenuous KV Layout support
+### Heterogeneous KV Layout support

 Support use case: Prefill with 'HND' and decode with 'NHD' with experimental configuration

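As a rough illustration of what the two layouts mean, the sketch below assumes 'NHD' orders a block as (num_tokens, num_heads, head_dim) and 'HND' as (num_heads, num_tokens, head_dim), which is the usual reading of these acronyms; it is not the connector's actual transfer code.

```python
import torch

# Toy KV block: 16 tokens, 8 KV heads, head_dim 128 (values are placeholders).
num_tokens, num_heads, head_dim = 16, 8, 128

# Assumed 'NHD' layout: (num_tokens, num_heads, head_dim).
kv_nhd = torch.randn(num_tokens, num_heads, head_dim)

# Assumed 'HND' layout: (num_heads, num_tokens, head_dim), as a prefill worker might produce.
kv_hnd = kv_nhd.permute(1, 0, 2).contiguous()

# A decode worker expecting 'NHD' can permute blocks received in 'HND' back.
assert torch.equal(kv_hnd.permute(1, 0, 2), kv_nhd)
```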
@@ -286,7 +286,7 @@ If desired, you can also manually set the backend of your choice by configuring
 - On NVIDIA CUDA: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
 - On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.

-For AMD ROCm, you can futher control the specific Attention implementation using the following variables:
+For AMD ROCm, you can further control the specific Attention implementation using the following variables:

 - Triton Unified Attention: `VLLM_ROCM_USE_AITER=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
 - AITER Unified Attention: `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
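A small sketch of how one of the variable combinations listed in that hunk might be applied from Python. The model name is a placeholder; the only point being illustrated is that the environment must be set before the engine is created.

```python
import os

# Select the Triton Unified Attention path on ROCm, using the combination
# shown in the doc above. These must be set before vLLM reads its environment.
os.environ["VLLM_ROCM_USE_AITER"] = "0"
os.environ["VLLM_V1_USE_PREFILL_DECODE_ATTENTION"] = "0"
os.environ["VLLM_ROCM_USE_AITER_MHA"] = "0"

from vllm import LLM  # imported after the environment is configured

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Hello, world"])[0].outputs[0].text)
```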
@@ -113,7 +113,7 @@ Quick sanity check:

 - Outputs differ between baseline and disagg
 - Server startup fails
-- Encoder cache not found (should fallback to local execution)
+- Encoder cache not found (should fall back to local execution)
 - Proxy routing errors

 ## Notes
@@ -185,7 +185,7 @@ def recompute_mrope_positions(

 Args:
     input_ids: (N,) All input tokens of the prompt (entire sequence).
-    multimodal_positions: List of mrope positsions for each media.
+    multimodal_positions: List of mrope positions for each media.
     mrope_positions: Existing mrope positions (4, N) for entire sequence.
     num_computed_tokens: A number of computed tokens so far.
     vision_start_token_id: Token indicating start of vision media.
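The argument shapes in that docstring can be pictured with dummy tensors. Everything below is made up purely to show the expected dimensions; none of it calls the real function.

```python
import torch

# Dummy inputs mirroring the documented shapes; all values are placeholders.
N = 12                                                        # total prompt tokens
input_ids = torch.randint(0, 32_000, (N,))                    # (N,) token ids
mrope_positions = torch.zeros(4, N, dtype=torch.long)         # (4, N) existing positions
multimodal_positions = [torch.zeros(4, 3, dtype=torch.long)]  # one media item spanning 3 tokens
num_computed_tokens = 8                                       # tokens processed so far
vision_start_token_id = 151652                                # placeholder vision-start token id
```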