[Docs] Reduce custom syntax used in docs (#27009)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor 2025-10-17 04:05:34 +01:00 committed by GitHub
parent 965c5f4914
commit 4ffd6e8942
65 changed files with 381 additions and 402 deletions


@@ -23,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
!!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
-You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
## Quantization
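To make the sharded-checkpoint note above concrete, here is a hedged sketch of loading such a checkpoint offline. The `load_format="sharded_state"` value and the local path are assumptions; the conversion itself is done beforehand with the linked script.

```python
# Hedged sketch: load a checkpoint previously converted with
# examples/offline_inference/save_sharded_state.py. The load_format value and
# the path are assumptions; check the script's docstring for specifics.
from vllm import LLM

llm = LLM(
    model="/path/to/sharded/checkpoint",  # output directory of the conversion script
    load_format="sharded_state",
    tensor_parallel_size=2,
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```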


@@ -174,14 +174,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
Known supported models (with corresponding benchmarks):
-- dots_ocr (<gh-pr:25466>)
+- dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
-- GLM-4.1V or above (<gh-pr:23168>)
+- GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
-- InternVL (<gh-pr:23909>)
+- InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
-- Kimi-VL (<gh-pr:23817>)
+- Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
-- Llama4 (<gh-pr:18368>)
+- Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
-- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>)
+- MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
-- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>)
+- Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
-- Step3 (<gh-pr:22697>)
+- Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
## Input Processing


@@ -96,7 +96,7 @@ Although it's common to do this with GPUs, don't try to fragment 2 or 8 differ
### Tune your workloads
-Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
+Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
### Future Topics We'll Cover


@@ -22,7 +22,7 @@ Unsure on where to start? Check out the following links for tasks to work on:
## License
-See <gh-file:LICENSE>.
+See [LICENSE](../../LICENSE).
## Developing
@@ -54,7 +54,7 @@ For more details about installing from source and installing for other hardware,
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
!!! tip
-vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](gh-file:docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
+vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](../../docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
@@ -88,7 +88,7 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
### Documentation
-MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, <gh-file:mkdocs.yaml>.
+MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, [mkdocs.yaml](../../mkdocs.yaml).
Get started with:
@@ -152,7 +152,7 @@ pytest -s -v tests/test_logger.py
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! important
-If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
+If you discover a security vulnerability, please follow the instructions [here](../../SECURITY.md).
## Pull Requests & Code Reviews
@@ -162,7 +162,7 @@ code quality and improve the efficiency of the review process.
### DCO and Signed-off-by
-When contributing changes to this project, you must agree to the <gh-file:DCO>.
+When contributing changes to this project, you must agree to the [DCO](../../DCO).
Commits must include a `Signed-off-by:` header which certifies agreement with
the terms of the DCO.


@@ -822,7 +822,7 @@ you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backen
- CLIP: `--backend openai-embeddings-clip`
- VLM2Vec: `--backend openai-embeddings-vlm2vec`
-For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
+For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
@@ -962,7 +962,7 @@ For more results visualization, check the [visualizing the results](https://gith
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
-More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
+More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
### Continuous Benchmarking
@@ -996,4 +996,4 @@ These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lm
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
-More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
+More information on the nightly benchmarks and their parameters can be found [here](../../.buildkite/nightly-benchmarks/nightly-descriptions.md).


@@ -64,7 +64,7 @@ Download the full log file from Buildkite locally.
Strip timestamps and colorization:
-<gh-file:.buildkite/scripts/ci-clean-log.sh>
+[.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
```bash
./ci-clean-log.sh ci.log
@@ -87,7 +87,7 @@ tail -525 ci_build.log | wl-copy
CI test failures may be flaky. Use a bash loop to run repeatedly:
-<gh-file:.buildkite/scripts/rerun-test.sh>
+[.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
```bash
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]


@@ -5,7 +5,7 @@ release in CI/CD. It is standard practice to submit a PR to update the
PyTorch version as early as possible when a new [PyTorch stable
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
This process is non-trivial due to the gap between PyTorch
-releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
+releases. Using <https://github.com/vllm-project/vllm/pull/16859> as an example, this document outlines common steps to achieve this
update along with a list of potential issues and how to address them.
## Test PyTorch release candidates (RCs)
@@ -85,7 +85,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
it doesn't populate the cache, so re-running it to warm up the cache
is ineffective.
-While ongoing efforts like [#17419](gh-issue:17419)
+While ongoing efforts like <https://github.com/vllm-project/vllm/issues/17419>
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
when manually triggering a build on Buildkite. This branch accomplishes two things:
@@ -138,5 +138,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
for different platforms in vLLM CI/CD allows us to selectively choose
which platforms to update. For instance, updating XPU requires the corresponding
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
-While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
+While <https://github.com/vllm-project/vllm/pull/16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
-<gh-pr:17444> completed the update for XPU.
+<https://github.com/vllm-project/vllm/pull/17444> completed the update for XPU.


@@ -1,6 +1,6 @@
# Dockerfile
-We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
+We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:


@@ -5,7 +5,7 @@ This guide walks you through the steps to implement a basic vLLM model.
## 1. Bring your model code
First, clone the PyTorch model code from the source repository.
-For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
+For instance, vLLM's [OPT model](../../../vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
!!! warning
@@ -83,7 +83,7 @@ def forward(
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
-For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
+For reference, check out our [Llama implementation](../../../vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out [vllm/model_executor/models](../../../vllm/model_executor/models) for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support
@@ -130,22 +130,22 @@ We consider 3 different scenarios:
2. Models that combine Mamba layers (either Mamba-1 or Mamba-2) together with attention layers.
3. Models that combine Mamba-like mechanisms (e.g., Linear Attention, ShortConv) together with attention layers.
-For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
+For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](../../../vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](../../../vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
-For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
+For the mamba layers themselves, please use the [`MambaMixer`](../../../vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](../../../vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
V0-only classes and code will be removed in the very near future.
-The model should also be added to the `MODELS_CONFIG_MAP` dictionary in <gh-file:vllm/model_executor/models/config.py> to ensure that the runtime defaults are optimized.
+The model should also be added to the `MODELS_CONFIG_MAP` dictionary in [vllm/model_executor/models/config.py](../../../vllm/model_executor/models/config.py) to ensure that the runtime defaults are optimized.
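To make case (1) above a little more concrete, a hypothetical skeleton of the two class methods the text requires is sketched below; the class name, config access, and return values are illustrative only, and the authoritative signatures live in the referenced `mamba.py`/`mamba2.py` files.

```python
# Hypothetical case (1) model skeleton; copy the real signatures from
# vllm/model_executor/models/mamba2.py rather than relying on this sketch.
import torch
from torch import nn


class MyMamba2ForCausalLM(nn.Module):  # the real class also inherits the IsAttentionFree protocol
    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config) -> tuple[torch.dtype, ...]:
        # Derive the conv/SSM state dtypes from the config (illustrative values).
        return (torch.float32, torch.float32)

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config) -> tuple[tuple[int, ...], ...]:
        # Derive per-layer conv and SSM state shapes from the model config
        # (placeholder arithmetic, not the real formulas).
        hf_config = vllm_config.model_config.hf_config
        conv_state = (hf_config.hidden_size, 4)
        ssm_state = (hf_config.hidden_size, 16)
        return (conv_state, ssm_state)
```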
-For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
+For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](../../../vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](../../../vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
These models should follow the same instructions as case (1), but they should inherit protocol `IsHybrid` (instead of `IsAttentionFree`) and it is *not* necessary to add them to the `MODELS_CONFIG_MAP` (their runtime defaults will be inferred from the protocol).
-For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](gh-file:vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](gh-file:vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
+For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](../../../vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](../../../vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
Please follow the same guidelines as case (2) for implementing these models.
We use "mamba-like" to refer to layers that possess a state that is updated in-place, rather than being appended-to (like KV cache for attention).
For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype`, `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`.
It is also necessary to implement the "attention meta-data" class which handles the meta-data that is common across all layers.
-Please see [`LinearAttentionMetadata`](gh-file:vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](gh-file:v1/attention/backends/short_conv_attn.py) for examples of this.
+Please see [`LinearAttentionMetadata`](../../../vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](../../../vllm/v1/attention/backends/short_conv_attn.py) for examples of this.
Finally, if one wants to support torch compile and CUDA graphs, it is necessary to wrap the call to the mamba-like layer inside a custom op and register it.
-Please see the calls to `direct_register_custom_op` in <gh-file:vllm/model_executor/models/minimax_text_01.py> or <gh-file:vllm/model_executor/layers/mamba/short_conv.py> for examples of this.
+Please see the calls to `direct_register_custom_op` in [vllm/model_executor/models/minimax_text_01.py](../../../vllm/model_executor/models/minimax_text_01.py) or [vllm/model_executor/layers/mamba/short_conv.py](../../../vllm/model_executor/layers/mamba/short_conv.py) for examples of this.
-The new custom op should then be added to the list `_attention_ops` in <gh-file:vllm/config/compilation.py> to ensure that piecewise CUDA graphs works as intended.
+The new custom op should then be added to the list `_attention_ops` in [vllm/config/compilation.py](../../../vllm/config/compilation.py) to ensure that piecewise CUDA graphs works as intended.
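Similarly, a rough, assumption-laden sketch of a custom mamba-like layer exposing the methods listed above; inheritance, properties, and signatures are guesses here, so treat `MambaBase` and the existing layers as the source of truth.

```python
# Hypothetical custom "mamba-like" layer; method names come from the text,
# everything else (signatures, return types) is illustrative only.
import torch
from torch import nn


class MyShortConvLayer(nn.Module):  # the real layer would inherit from MambaBase
    def __init__(self, conv_dim: int, kernel_size: int):
        super().__init__()
        self.conv_dim = conv_dim
        self.kernel_size = kernel_size

    @property
    def mamba_type(self) -> str:
        # Identifier used to match the layer with its metadata builder (assumed).
        return "my_short_conv"

    def get_state_dtype(self) -> tuple[torch.dtype, ...]:
        return (torch.float32,)

    def get_state_shape(self) -> tuple[tuple[int, ...], ...]:
        # Shape of the in-place state kept per request (placeholder values).
        return ((self.conv_dim, self.kernel_size),)

    def get_attn_backend(self):
        # Return the attention backend whose metadata class (analogous to
        # ShortConvAttentionMetadata) handles this layer; omitted in this sketch.
        ...
```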


@@ -507,7 +507,7 @@ return a schema of the tensors outputted by the HF processor that are related to
```
!!! note
-Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
+Our [actual code](../../../vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
=== "With postprocessing: Fuyu"
@@ -569,7 +569,7 @@ return a schema of the tensors outputted by the HF processor that are related to
```
!!! note
-Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
+Our [actual code](../../../vllm/model_executor/models/fuyu.py) has special handling
for text-only inputs to prevent unnecessary warnings from HF processor.
!!! note
@@ -828,8 +828,8 @@ Some HF processors directly insert feature tokens without replacing anything in
Examples:
-- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
+- BLIP-2 (insert at start of prompt): [vllm/model_executor/models/blip2.py](../../../vllm/model_executor/models/blip2.py)
-- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
+- Molmo (insert after `<|endoftext|>` token): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Handling prompt updates unrelated to multi-modal data
@@ -837,9 +837,9 @@ Examples:
Examples:
-- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
+- Chameleon (appends `sep_token`): [vllm/model_executor/models/chameleon.py](../../../vllm/model_executor/models/chameleon.py)
-- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
+- Fuyu (appends `boa_token`): [vllm/model_executor/models/fuyu.py](../../../vllm/model_executor/models/fuyu.py)
-- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
+- Molmo (applies chat template which is not defined elsewhere): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Custom HF processor
@@ -847,6 +847,6 @@ Some models don't define an HF processor class on HF Hub. In that case, you can
Examples:
-- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
+- DeepSeek-VL2: [vllm/model_executor/models/deepseek_vl2.py](../../../vllm/model_executor/models/deepseek_vl2.py)
-- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
+- InternVL: [vllm/model_executor/models/internvl.py](../../../vllm/model_executor/models/internvl.py)
-- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>
+- Qwen-VL: [vllm/model_executor/models/qwen_vl.py](../../../vllm/model_executor/models/qwen_vl.py)


@@ -11,8 +11,8 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model.
-After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
+After you have implemented your model (see [tutorial](basic.md)), put it into the [vllm/model_executor/models](../../../vllm/model_executor/models) directory.
-Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
+Then, add your model class to `_VLLM_MODELS` in [vllm/model_executor/models/registry.py](../../../vllm/model_executor/models/registry.py) so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important


@@ -9,7 +9,7 @@ Without them, the CI for your PR will fail.
### Model loading
-Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
+Include an example HuggingFace repository for your model in [tests/models/registry.py](../../../tests/models/registry.py).
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! important
@@ -26,18 +26,18 @@ Passing these tests provides more confidence that your implementation is correct
### Model correctness
-These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of <gh-dir:tests/models>.
+These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of [tests/models](../../../tests/models).
#### Generative models
-For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>:
+For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in [tests/models/utils.py](../../../tests/models/utils.py):
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
#### Pooling models
-For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in <gh-file:tests/models/utils.py>.
+For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
[](){ #mm-processing-tests }
@@ -45,7 +45,7 @@ For [pooling models](../../models/pooling_models.md), we simply check the cosine
#### Common tests
-Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> verifies that the following input combinations result in the same outputs:
+Adding your model to [tests/models/multimodal/processing/test_common.py](../../../tests/models/multimodal/processing/test_common.py) verifies that the following input combinations result in the same outputs:
- Text + multi-modal data
- Tokens + multi-modal data
@@ -54,6 +54,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py>
#### Model-specific tests
-You can add a new file under <gh-dir:tests/models/multimodal/processing> to run tests that only apply to your model.
+You can add a new file under [tests/models/multimodal/processing](../../../tests/models/multimodal/processing) to run tests that only apply to your model.
-For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in <gh-file:tests/models/multimodal/processing/test_phi3v.py>.
+For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in [tests/models/multimodal/processing/test_phi3v.py](../../../tests/models/multimodal/processing/test_phi3v.py).


@@ -248,9 +248,9 @@ No extra registration is required beyond having your model class available via t
## Examples in-tree
-- Whisper encoder-decoder (audio-only): <gh-file:vllm/model_executor/models/whisper.py>
+- Whisper encoder-decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
-- Voxtral decoder-only (audio embeddings + LLM): <gh-file:vllm/model_executor/models/voxtral.py>
+- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py)
-- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
+- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
@@ -278,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
http://localhost:8000/v1/audio/translations
```
-Or check out more examples in <gh-file:examples/online_serving>.
+Or check out more examples in [examples/online_serving](../../../examples/online_serving).
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.
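As a hedged illustration of exercising the transcription endpoint mentioned above from Python, the sketch below points the official `openai` client at a local vLLM server; the model name and audio path are placeholders, and the server must already be running with a transcription-capable model.

```python
# Hypothetical client-side check against /v1/audio/transcriptions on a local
# vLLM server. Model name and audio file are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # use whatever model you served
        file=audio,
    )

print(transcription.text)
```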


@@ -33,7 +33,7 @@ Traces can be visualized using <https://ui.perfetto.dev/>.
#### Offline Inference
-Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
+Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py) for an example.
#### OpenAI Server
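Staying with the offline example referenced above (not the OpenAI server case), a hedged sketch of what such a script typically does follows; the `VLLM_TORCH_PROFILER_DIR` variable and the `start_profile`/`stop_profile` calls are assumptions here, so defer to `simple_profiling.py` for the exact usage.

```python
# Hedged sketch of offline profiling; env var and profiler calls are assumptions.
import os

os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"  # set before vLLM starts

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
llm.start_profile()
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()
```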


@@ -37,7 +37,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
memory to share data between processes under the hood, particularly for tensor parallel inference.
!!! note
-Optional dependencies are not included in order to avoid licensing issues (e.g. <gh-issue:8030>).
+Optional dependencies are not included in order to avoid licensing issues (e.g. <https://github.com/vllm-project/vllm/issues/8030>).
If you need to use those dependencies (having accepted the license terms),
create a custom Dockerfile on top of the base image with an extra layer that installs them:
@@ -66,7 +66,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
## Building vLLM's Docker Image from Source
-You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:
+You can build and run vLLM from source via the provided [docker/Dockerfile](../../docker/Dockerfile). To build vLLM:
```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2


@@ -5,7 +5,7 @@
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
-without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like <gh-file:examples/online_serving/run_cluster.sh>.
+without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like [examples/online_serving/run_cluster.sh](../../../examples/online_serving/run_cluster.sh).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).


@@ -36,7 +36,7 @@ pip install -U vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
-1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py>
+1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_langchain.py](../../../examples/online_serving/retrieval_augmented_generation_with_langchain.py)
1. Run the script
@@ -74,7 +74,7 @@ pip install vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
-1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py>
+1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_llamaindex.py](../../../examples/online_serving/retrieval_augmented_generation_with_llamaindex.py)
1. Run the script:


@@ -20,7 +20,7 @@ pip install vllm streamlit openai
vllm serve Qwen/Qwen1.5-0.5B-Chat
```
-1. Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py>
+1. Use the script: [examples/online_serving/streamlit_openai_chatbot_webserver.py](../../../examples/online_serving/streamlit_openai_chatbot_webserver.py)
1. Start the streamlit web UI and start to chat:


@@ -49,7 +49,7 @@ Here is a sample of `LLM` class usage:
More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
-The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
+The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](../../vllm/entrypoints/llm.py).
### OpenAI-Compatible API Server
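For reference, the `LLM` usage mentioned at the top of the hunk above looks roughly like the minimal sketch below; the model name is just a placeholder.

```python
# Minimal offline-inference sketch of the `LLM` class referenced above.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=32)

for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
```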
@@ -60,7 +60,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model>
```
-The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>.
+The code for the `vllm` CLI can be found in [vllm/entrypoints/cli/main.py](../../vllm/entrypoints/cli/main.py).
Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example:
@@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
`python -m vllm.entrypoints.openai.api_server` is deprecated
and may become unsupported in a future release.
-That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
+That code can be found in [vllm/entrypoints/openai/api_server.py](../../vllm/entrypoints/openai/api_server.py).
More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
@@ -101,7 +101,7 @@ processing.
- **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text.
-The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.
+The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py](../../vllm/engine/llm_engine.py).
### AsyncLLMEngine
@@ -111,9 +111,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients.
The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
-API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.
+API server that serves as a simpler example in [vllm/entrypoints/api_server.py](../../vllm/entrypoints/api_server.py).
-The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.
+The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py](../../vllm/engine/async_llm_engine.py).
## Worker


@@ -17,7 +17,7 @@ In this document we will discuss the:
In this document, we refer to pure decode (`max_query_len=1`) or speculative decode (`max_query_len=1+num_spec_tokens`) as **uniform decode** batches, and the opposite would be **non-uniform** batches (i.e., prefill or mixed prefill-decode batches).
!!! note
-The following contents are mostly based on the last commit of <gh-pr:20059>.
+The following contents are mostly based on the last commit of <https://github.com/vllm-project/vllm/pull/20059>.
## Motivation
@@ -92,7 +92,7 @@ where `num_tokens` can be the padded token length, and `uniform_decode` is deter
The goal of this structure is to uniquely identify a (padded) batch with minimal possible items corresponding to a CUDA Graphs item. We are safe to exclude items like `uniform_query_len` because it is a constant at runtime for a certain setup currently. For example, it should be either `1` for a commonly pure decode or `1+num_spec_tokens` for a validation phase of speculative decode.
!!! note
-The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<gh-pr:23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs).
+The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<https://github.com/vllm-project/vllm/pull/23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs).
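A hedged sketch of the dispatch key described above, carrying only the two fields the text names; the real `BatchDescriptor` in vLLM may have different fields, defaults, and helper methods.

```python
# Illustrative-only sketch of a CUDA Graphs dispatch key; not vLLM's actual class.
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    num_tokens: int               # padded token count of the batch
    uniform_decode: bool = False  # True for pure / speculative decode batches

    def non_uniform(self) -> "BatchDescriptor":
        # Fall back to the non-uniform variant with the same padded size,
        # e.g. when only a general graph was captured for this size (assumed helper).
        return self._replace(uniform_decode=False)
```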
### `CudagraphDispatcher`


@@ -2,7 +2,7 @@
## Introduction
-FusedMoEModularKernel is implemented [here](gh-file:/vllm/model_executor/layers/fused_moe/modular_kernel.py)
+FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py)
Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.
@@ -44,7 +44,7 @@ FusedMoEModularKernel splits the FusedMoE operation into 3 parts,
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`. But some implementations choose to do it in `FusedMoEPrepareAndFinalize`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class.
-Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
+Please find the implementations of TopKWeightAndReduce [here](../../vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
`FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
@@ -138,7 +138,7 @@ Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & C
#### Step 1: Add an All2All manager
-The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).
+The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](../../vllm/distributed/device_communicators/all2all.py).
#### Step 2: Add a FusedMoEPrepareAndFinalize Type
@@ -213,29 +213,29 @@ Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vl
### How To Unit Test
-We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).
+We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py).
The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and if they are
compatible, runs some correctness tests.
If you are adding some `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations,
-1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
+1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](../../tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`,
-`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py)
+`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](../../tests/kernels/moe/modular_kernel_tools/common.py)
Doing this will add the new implementation to the test suite.
### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
-The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
+The unit test file [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked
with incompatible types, the script will error.
### How To Profile
-Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
+Please take a look at [profile_modular_kernel.py](../../tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`


@@ -6,7 +6,7 @@ When performing an inference with IO Processor plugins, the prompt type is defin
## Writing an IO Processor Plugin
-IO Processor plugins implement the `IOProcessor` interface (<gh-file:vllm/plugins/io_processors/interface.py>):
+IO Processor plugins implement the [`IOProcessor`][vllm.plugins.io_processors.interface.IOProcessor] interface:
```python
IOProcessorInput = TypeVar("IOProcessorInput")
@@ -67,9 +67,9 @@ The `parse_request` method is used for validating the user prompt and converting
The `pre_process*` methods take the validated plugin input to generate vLLM's model prompts for regular inference.
The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output.
-The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/io_processor_pooling` serving endpoint is available here <gh-file:vllm/entrypoints/openai/serving_pooling_with_io_plugin.py>.
+The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/openai/serving_pooling.py).
-An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/christian-pinto/prithvi_io_processor_plugin). Please, also refer to our online (<gh-file:examples/online_serving/prithvi_geospatial_mae.py>) and offline (<gh-file:examples/offline_inference/prithvi_geospatial_mae_io_processor.py>) inference examples.
+An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/christian-pinto/prithvi_io_processor_plugin). Please, also refer to our online ([examples/online_serving/prithvi_geospatial_mae.py](../../examples/online_serving/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/prithvi_geospatial_mae_io_processor.py)) inference examples.
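A hedged skeleton of what such a plugin can look like, using only the method names the text above mentions; the base-class location, generics, and signatures are assumptions, so copy the exact interface from `vllm/plugins/io_processors/interface.py`.

```python
# Hypothetical IO Processor plugin skeleton; names of input/output types and
# all method signatures are illustrative assumptions, not vLLM's real API.
from dataclasses import dataclass


@dataclass
class MyPluginInput:
    image_url: str


@dataclass
class MyPluginOutput:
    geotiff_b64: str


class MyIOProcessor:  # would subclass IOProcessor[MyPluginInput, MyPluginOutput]
    def parse_request(self, request) -> MyPluginInput:
        # Validate the raw user prompt and convert it to the plugin input type.
        return MyPluginInput(image_url=request["image_url"])

    def pre_process(self, prompt: MyPluginInput, **kwargs):
        # Turn the plugin input into the model prompt(s) vLLM expects.
        ...

    def post_process(self, model_output, **kwargs) -> MyPluginOutput:
        # Turn PoolingRequestOutput objects into the plugin output type.
        ...

    def output_to_response(self, plugin_output: MyPluginOutput):
        # Online serving only: wrap the plugin output in an IOProcessorResponse.
        ...
```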
## Using an IO Processor plugin


@ -80,13 +80,13 @@ The subset of metrics exposed in the Grafana dashboard gives us an indication of
- `vllm:request_decode_time_seconds` - Requests decode time. - `vllm:request_decode_time_seconds` - Requests decode time.
- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group. - `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.
See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here. See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pull/2316) for interesting and useful background on the choices made here.
### Prometheus Client Library ### Prometheus Client Library
Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs. Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657): With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
```bash ```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*' $ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@ -99,7 +99,7 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
### Multi-process Mode ### Multi-process Mode
In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <gh-pr:7279>. In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
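As an illustrative sketch of what `prometheus_client` multiprocess collection generally looks like (this is generic wiring, not vLLM's actual code), the process serving `/metrics` aggregates the per-process metric files via a `MultiProcessCollector`:

```python
# Illustrative prometheus_client multiprocess wiring (not vLLM's actual code).
# PROMETHEUS_MULTIPROC_DIR must point at a writable directory and be set
# before metrics are created in any of the processes.
import os

os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prometheus_multiproc")

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# In the process that serves /metrics: aggregate what the other processes wrote.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```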
### Built in Python/Process Metrics ### Built in Python/Process Metrics
@ -125,32 +125,32 @@ vLLM instance.
For background, these are some of the relevant PRs which added the v0 metrics: For background, these are some of the relevant PRs which added the v0 metrics:
- <gh-pr:1890> - <https://github.com/vllm-project/vllm/pull/1890>
- <gh-pr:2316> - <https://github.com/vllm-project/vllm/pull/2316>
- <gh-pr:2730> - <https://github.com/vllm-project/vllm/pull/2730>
- <gh-pr:4464> - <https://github.com/vllm-project/vllm/pull/4464>
- <gh-pr:7279> - <https://github.com/vllm-project/vllm/pull/7279>
Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781). Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
## v1 Design ## v1 Design
### v1 PRs ### v1 PRs
For background, here are the relevant v1 PRs relating to the v1 For background, here are the relevant v1 PRs relating to the v1
metrics issue <gh-issue:10582>: metrics issue <https://github.com/vllm-project/vllm/issues/10582>:
- <gh-pr:11962> - <https://github.com/vllm-project/vllm/pull/11962>
- <gh-pr:11973> - <https://github.com/vllm-project/vllm/pull/11973>
- <gh-pr:10907> - <https://github.com/vllm-project/vllm/pull/10907>
- <gh-pr:12416> - <https://github.com/vllm-project/vllm/pull/12416>
- <gh-pr:12478> - <https://github.com/vllm-project/vllm/pull/12478>
- <gh-pr:12516> - <https://github.com/vllm-project/vllm/pull/12516>
- <gh-pr:12530> - <https://github.com/vllm-project/vllm/pull/12530>
- <gh-pr:12561> - <https://github.com/vllm-project/vllm/pull/12561>
- <gh-pr:12579> - <https://github.com/vllm-project/vllm/pull/12579>
- <gh-pr:12592> - <https://github.com/vllm-project/vllm/pull/12592>
- <gh-pr:12644> - <https://github.com/vllm-project/vllm/pull/12644>
### Metrics Collection ### Metrics Collection
@ -369,7 +369,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
However, `prometheus_client` has However, `prometheus_client` has
[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) - [never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
for [unclear reasons](gh-pr:7279#discussion_r1710417152). We for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead. `multiprocess_mode="mostrecent"` instead.
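A minimal sketch of that pattern (the metric and label names here are examples, not vLLM's exact ones, and `multiprocess_mode="mostrecent"` requires a reasonably recent `prometheus_client`):

```python
# Illustrative Gauge-set-to-1 stand-in for an Info metric; names are examples.
from prometheus_client import Gauge

cache_config_info = Gauge(
    "vllm_cache_config_info",
    "Static information about the KV cache configuration",
    labelnames=["block_size", "cache_dtype"],
    multiprocess_mode="mostrecent",
)
# All the interesting data lives in the labels; the value is always 1.
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```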
@ -394,7 +394,7 @@ distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes. recent metric is used, but only from currently running processes.
This was added in <gh-pr:9477> and there is This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54). [at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
If we revisit this design and deprecate the old metric, we should reduce If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in the need for a significant deprecation period by making the change in
@ -402,7 +402,7 @@ v0 also and asking this project to move to the new metric.
### Prefix Cache metrics ### Prefix Cache metrics
The discussion in <gh-issue:10582> about adding prefix cache metrics yielded The discussion in <https://github.com/vllm-project/vllm/issues/10582> about adding prefix cache metrics yielded
some interesting points which may be relevant to how we approach some interesting points which may be relevant to how we approach
future metrics. future metrics.
@ -439,8 +439,8 @@ suddenly (from their perspective) when it is removed, even if there is
an equivalent metric for them to use. an equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](gh-pr:2764) (with a comment in the code), [deprecated](https://github.com/vllm-project/vllm/pull/2764) (with a comment in the code),
[removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218). [removed](https://github.com/vllm-project/vllm/pull/12383), and then [noticed by a user](https://github.com/vllm-project/vllm/issues/13218).
In general: In general:
@ -460,20 +460,20 @@ the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total` ### Unimplemented - `vllm:tokens_total`
Added by <gh-pr:4464>, but apparently never implemented. This can just be Added by <https://github.com/vllm-project/vllm/pull/4464>, but apparently never implemented. This can just be
removed. removed.
### Duplicated - Queue Time ### Duplicated - Queue Time
The `vllm:time_in_queue_requests` Histogram metric was added by The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is: <https://github.com/vllm-project/vllm/pull/9659> and its calculation is:
```python ```python
self.metrics.first_scheduled_time = now self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time self.metrics.time_in_queue = now - self.metrics.arrival_time
``` ```
Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving Two weeks later, <https://github.com/vllm-project/vllm/pull/4464> added `vllm:request_queue_time_seconds` leaving
us with: us with:
```python ```python
@ -513,7 +513,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`. with `--swap-space` and `--preemption-mode`.
In v0, [vLLM has long supported beam search](gh-issue:6226). The In v0, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
SequenceGroup encapsulated the idea of N Sequences which SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU sharing between requests, and copy-on-write to do branching. CPU
@ -526,7 +526,7 @@ and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`). required for "parallel sampling" (`n>1`).
[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a [Beam search was moved out of the core (in V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
lot of complex code for a very uncommon feature. lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero overhead) and therefore In V1, with prefix caching being better (zero overhead) and therefore
@ -541,7 +541,7 @@ Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt. request multiple completions from the same prompt.
As part of adding parallel sampling support in <gh-pr:10980>, we should As part of adding parallel sampling support in <https://github.com/vllm-project/vllm/pull/10980>, we should
also add these metrics. also add these metrics.
- `vllm:request_params_n` (Histogram) - `vllm:request_params_n` (Histogram)
@ -566,7 +566,7 @@ model and then validate those tokens with the larger model.
- `vllm:spec_decode_num_draft_tokens_total` (Counter) - `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter) - `vllm:spec_decode_num_emitted_tokens_total` (Counter)
There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)" There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context. revisit the v0 metrics in this context.
@ -587,7 +587,7 @@ see:
- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk) - [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ) - [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf) - [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>. - <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.
This is a non-trivial topic. Consider this comment from Rob: This is a non-trivial topic. Consider this comment from Rob:
@ -654,7 +654,7 @@ fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing: v0 has support for OpenTelemetry tracing:
- Added by <gh-pr:4687> - Added by <https://github.com/vllm-project/vllm/pull/4687>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces` - Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/) - [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md) - [User-facing docs](../examples/online_serving/opentelemetry.md)
@ -685,7 +685,7 @@ documentation for this option states:
> use of possibly costly and or blocking operations and hence might > use of possibly costly and or blocking operations and hence might
> have a performance impact. > have a performance impact.
The metrics were added by <gh-pr:7089> and show up in an OpenTelemetry trace The metrics were added by <https://github.com/vllm-project/vllm/pull/7089> and show up in an OpenTelemetry trace
as: as:
```text ```text

View File

@ -60,7 +60,7 @@ With the help of dummy text and automatic prompt updating, our multi-modal proce
## Processor Output Caching ## Processor Output Caching
Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238). To alleviate this problem, we cache the multi-modal outputs of the HF processor to avoid processing the same multi-modal input (e.g. image) again. Some HF processors, such as the one for Qwen2-VL, are [very slow](https://github.com/vllm-project/vllm/issues/9238). To alleviate this problem, we cache the multi-modal outputs of the HF processor to avoid processing the same multi-modal input (e.g. image) again.
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache. When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
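A simplified sketch of that check-then-batch flow (the `hash_item` and `hf_processor` helpers are hypothetical stand-ins, not vLLM internals):

```python
# Simplified sketch of the caching flow described above; hash_item and
# hf_processor are hypothetical stand-ins, not vLLM's real internals.
from typing import Any

cache: dict[str, Any] = {}

def process_with_cache(items: list[Any], hash_item, hf_processor) -> list[Any]:
    # Find which multi-modal items are not cached yet.
    missing = [item for item in items if hash_item(item) not in cache]
    if missing:
        # Run the HF processor once over all missing items and cache the outputs.
        for item, output in zip(missing, hf_processor(missing)):
            cache[hash_item(item)] = output
    # Merge: return an output for every input, cached or freshly processed.
    return [cache[hash_item(item)] for item in items]
```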

View File

@ -82,7 +82,7 @@ There are other miscellaneous places hard-coding the use of `spawn`:
Related PRs: Related PRs:
- <gh-pr:8823> - <https://github.com/vllm-project/vllm/pull/8823>
## Prior State in v1 ## Prior State in v1

View File

@ -19,8 +19,8 @@ vLLM will take all the available factors into consideration, and decide a direct
The factors considered include: The factors considered include:
- All the related configs (see the `compute_hash` functions in their respective configs in the [config folder](gh-file:vllm/config)) - All the related configs (see the `compute_hash` functions in their respective configs in the [config folder](../../vllm/config))
- PyTorch configs (see the `compute_hash` functions in the [compiler_interface.py](gh-file:vllm/compilation/compiler_interface.py)) - PyTorch configs (see the `compute_hash` functions in the [compiler_interface.py](../../vllm/compilation/compiler_interface.py))
- The model's forward function and the relevant functions called by the forward function (see below) - The model's forward function and the relevant functions called by the forward function (see below)
With all these factors taken into consideration, usually we can guarantee that the cache is safe to use, and will not cause any unexpected behavior. Therefore, the cache is enabled by default. If you want to debug the compilation process, or if you suspect the cache is causing some issues, you can disable it by setting the environment variable `VLLM_DISABLE_COMPILE_CACHE=1`. With all these factors taken into consideration, usually we can guarantee that the cache is safe to use, and will not cause any unexpected behavior. Therefore, the cache is enabled by default. If you want to debug the compilation process, or if you suspect the cache is causing some issues, you can disable it by setting the environment variable `VLLM_DISABLE_COMPILE_CACHE=1`.
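For example, in an offline run the cache can be disabled like this (a sketch; the environment variable just needs to be set before the engine is created):

```python
# Disable vLLM's compilation cache for debugging; set the variable before
# the engine is constructed. The model name is only an example.
import os

os.environ["VLLM_DISABLE_COMPILE_CACHE"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```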

View File

@ -44,15 +44,15 @@ th:not(:first-child) {
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | | | [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | | [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | | | <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | | | <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | |
| multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | | | multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | |
| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194)<sup>^</sup> | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | | | [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](https://github.com/vllm-project/vllm/pull/4194)<sup>^</sup> | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | |
| best-of | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | | | | best-of | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](https://github.com/vllm-project/vllm/issues/7968) | ✅ | ✅ | | |
| beam-search | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ | | | beam-search | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](https://github.com/vllm-project/vllm/issues/7968) | ❔ | ✅ | ✅ | |
| [prompt-embeds](prompt_embeds.md) | ✅ | [](gh-issue:25096) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❔ | ❔ | ❌ | ❔ | ❔ | ✅ | | [prompt-embeds](prompt_embeds.md) | ✅ | [](https://github.com/vllm-project/vllm/issues/25096) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❔ | ❔ | ❌ | ❔ | ❔ | ✅ |
\* Chunked prefill and prefix caching are only applicable to last-token pooling. \* Chunked prefill and prefix caching are only applicable to last-token pooling.
<sup>^</sup> LoRA is only applicable to the language backbone of multimodal models. <sup>^</sup> LoRA is only applicable to the language backbone of multimodal models.
@ -63,18 +63,18 @@ th:not(:first-child) {
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | Intel GPU | | Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | Intel GPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| ------------| |-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| ------------|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [CP][chunked-prefill] | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [APC](automatic_prefix_caching.md) | [](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](gh-issue:26963) | | [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26963) |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [](gh-issue:26970) | | CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ | | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](gh-issue:26965) | | [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26965) |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ | | <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8477) | ✅ | ❌ | ✅ | | multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ❌ | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | | beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ? | [](gh-issue:25097) | ✅ | | [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ? | [](https://github.com/vllm-project/vllm/issues/25097) | ✅ |

View File

@ -11,7 +11,7 @@ Automatic Prefix Caching (APC in short) caches the KV cache of existing queries,
Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example: Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example:
<gh-file:examples/offline_inference/automatic_prefix_caching.py> [examples/offline_inference/automatic_prefix_caching.py](../../examples/offline_inference/automatic_prefix_caching.py)
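A minimal offline sketch of that flag (the model name is just an example):

```python
# Minimal sketch: enable Automatic Prefix Caching for offline inference.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

shared_prefix = "You are given the following long report. " * 50
outputs = llm.generate(
    [shared_prefix + "Summarize it.", shared_prefix + "List its key risks."],
    SamplingParams(temperature=0.0, max_tokens=32),
)
# The second request reuses the KV cache computed for the shared prefix.
```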
## Example workloads ## Example workloads

View File

@ -17,14 +17,14 @@ Two main reasons:
## Usage example ## Usage example
Please refer to <gh-file:examples/online_serving/disaggregated_prefill.sh> for the example usage of disaggregated prefilling. Please refer to [examples/online_serving/disaggregated_prefill.sh](../../examples/online_serving/disaggregated_prefill.sh) for the example usage of disaggregated prefilling.
Now supports 5 types of connectors: Now supports 5 types of connectors:
- **SharedStorageConnector**: refer to <gh-file:examples/offline_inference/disaggregated-prefill-v1/run.sh> for the example usage of SharedStorageConnector disaggregated prefilling. - **SharedStorageConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of SharedStorageConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to <gh-file:examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh> for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission. - **LMCacheConnectorV1**: refer to [examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md). - **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to <gh-file:examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh> for the example usage of P2pNcclConnector disaggregated prefilling. - **P2pNcclConnector**: refer to [examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as: - **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as:
```bash ```bash
@ -45,7 +45,7 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
## Benchmarks ## Benchmarks
Please refer to <gh-file:benchmarks/disagg_benchmarks> for disaggregated prefilling benchmarks. Please refer to [benchmarks/disagg_benchmarks](../../benchmarks/disagg_benchmarks) for disaggregated prefilling benchmarks.
## Development ## Development

View File

@ -47,7 +47,7 @@ the third parameter is the path to the LoRA adapter.
) )
``` ```
Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options. Check out [examples/offline_inference/multilora_inference.py](../../examples/offline_inference/multilora_inference.py) for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
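For orientation, here is a condensed sketch of that pattern (the model name and adapter path are placeholders):

```python
# Condensed sketch of offline LoRA usage; model and adapter paths are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```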
## Serving LoRA Adapters ## Serving LoRA Adapters

View File

@ -3,7 +3,7 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM. This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
!!! note !!! note
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes, We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests. and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
!!! tip !!! tip
@ -129,7 +129,7 @@ You can pass a single image to the `'image'` field of the multi-modal dictionary
print(generated_text) print(generated_text)
``` ```
Full example: <gh-file:examples/offline_inference/vision_language.py> Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
To substitute multiple images inside the same text prompt, you can pass in a list of images instead: To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@ -162,7 +162,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
print(generated_text) print(generated_text)
``` ```
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py> Full example: [examples/offline_inference/vision_language_multi_image.py](../../examples/offline_inference/vision_language_multi_image.py)
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings: If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
@ -346,13 +346,13 @@ Instead of NumPy arrays, you can also pass `'torch.Tensor'` instances, as shown
!!! note !!! note
'process_vision_info' is only applicable to Qwen2.5-VL and similar models. 'process_vision_info' is only applicable to Qwen2.5-VL and similar models.
Full example: <gh-file:examples/offline_inference/vision_language.py> Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
### Audio Inputs ### Audio Inputs
You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary. You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
Full example: <gh-file:examples/offline_inference/audio_language.py> Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)
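A short sketch of the offline API (the model name and prompt format below are placeholders; see the linked example for the exact prompt your audio model expects):

```python
# Sketch: pass audio as an (array, sampling_rate) tuple. The model name and
# prompt format are illustrative placeholders.
import numpy as np
from vllm import LLM

llm = LLM(model="fixie-ai/ultravox-v0_4")

audio = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
outputs = llm.generate({
    "prompt": "<|audio|>\nWhat can you hear in this clip?",
    "multi_modal_data": {"audio": (audio, 16000)},
})
```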
### Embedding Inputs ### Embedding Inputs
@ -434,11 +434,11 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
A chat template is **required** to use Chat Completions API. A chat template is **required** to use Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`. For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>. If no default chat template is available, we will first look for a built-in fallback in [vllm/transformers_utils/chat_templates/registry.py](../../vllm/transformers_utils/chat_templates/registry.py).
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument. If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:examples>. For certain models, we provide alternative chat templates inside [examples](../../examples).
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec_phi3v.jinja> which is different from the default one for Phi-3-Vision. For example, VLM2Vec uses [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja) which is different from the default one for Phi-3-Vision.
### Image Inputs ### Image Inputs
@ -524,7 +524,7 @@ Then, you can use the OpenAI client as follows:
print("Chat completion output:", chat_response.choices[0].message.content) print("Chat completion output:", chat_response.choices[0].message.content)
``` ```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! tip !!! tip
Loading from local file paths is also supported by vLLM: you can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine, Loading from local file paths is also supported by vLLM: you can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
@ -595,7 +595,7 @@ Then, you can use the OpenAI client as follows:
print("Chat completion output from image url:", result) print("Chat completion output from image url:", result)
``` ```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note !!! note
By default, the timeout for fetching videos through HTTP URL is `30` seconds. By default, the timeout for fetching videos through HTTP URL is `30` seconds.
@ -719,7 +719,7 @@ Alternatively, you can pass `audio_url`, which is the audio counterpart of `imag
print("Chat completion output from audio url:", result) print("Chat completion output from audio url:", result)
``` ```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py> Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note !!! note
By default, the timeout for fetching audios through HTTP URL is `10` seconds. By default, the timeout for fetching audios through HTTP URL is `10` seconds.

View File

@ -9,7 +9,7 @@ NixlConnector is a high-performance KV cache transfer connector for vLLM's disag
As a quick start, install the NIXL library with `uv pip install nixl`. As a quick start, install the NIXL library with `uv pip install nixl`.
- Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions - Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
- The specified required NIXL version can be found in [requirements/kv_connectors.txt](gh-file:requirements/kv_connectors.txt) and other relevant config files - The specified required NIXL version can be found in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files
For non-CUDA platforms, please install NIXL with UCX built from source, as instructed below. For non-CUDA platforms, please install NIXL with UCX built from source, as instructed below.
@ -170,6 +170,6 @@ Support use case: Prefill with 'HND' and decode with 'NHD' with experimental con
Refer to these example scripts in the vLLM repository: Refer to these example scripts in the vLLM repository:
- [run_accuracy_test.sh](gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) - [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
- [toy_proxy_server.py](gh-file:tests/v1/kv_connector/nixl_integration/toy_proxy_server.py) - [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
- [test_accuracy.py](gh-file:tests/v1/kv_connector/nixl_integration/test_accuracy.py) - [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)

View File

@ -16,7 +16,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples: You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
<gh-file:examples/offline_inference/prompt_embed_inference.py> [examples/offline_inference/prompt_embed_inference.py](../../examples/offline_inference/prompt_embed_inference.py)
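A sketch of how such embeddings can be produced and passed in (how the embeddings are built here is illustrative, and the engine is assumed to be created with `enable_prompt_embeds=True`):

```python
# Sketch of offline prompt-embedding input; the way the embeddings are
# produced here is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

token_ids = tokenizer("Tell me about the capital of France.",
                      return_tensors="pt").input_ids
with torch.no_grad():
    # Shape: (sequence_length, hidden_size)
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

llm = LLM(model=model_name, enable_prompt_embeds=True)
outputs = llm.generate({"prompt_embeds": prompt_embeds})
```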
## Online Serving ## Online Serving
@ -37,4 +37,4 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
Then, you can use the OpenAI client as follows: Then, you can use the OpenAI client as follows:
<gh-file:examples/online_serving/prompt_embed_inference_with_openai_client.py> [examples/online_serving/prompt_embed_inference_with_openai_client.py](../../examples/online_serving/prompt_embed_inference_with_openai_client.py)

View File

@ -64,4 +64,4 @@ th:not(:first-child) {
!!! note !!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods. This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team. For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.

View File

@ -196,7 +196,7 @@ The reasoning content is also available when both tool calling and the reasoning
print(f"Arguments: {tool_call.arguments}") print(f"Arguments: {tool_call.arguments}")
``` ```
For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>. For more examples, please refer to [examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py](../../examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py).
## Limitations ## Limitations
@ -204,7 +204,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
## How to support a new reasoning model ## How to support a new reasoning model
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>. You can add a new `ReasoningParser` similar to [vllm/reasoning/deepseek_r1_reasoning_parser.py](../../vllm/reasoning/deepseek_r1_reasoning_parser.py).
??? code ??? code
@ -264,7 +264,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
""" """
``` ```
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>. Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in [vllm/reasoning/deepseek_r1_reasoning_parser.py](../../vllm/reasoning/deepseek_r1_reasoning_parser.py).
??? code ??? code

View File

@ -3,7 +3,7 @@
!!! warning !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630> The work to optimize it is ongoing and can be followed here: <https://github.com/vllm-project/vllm/issues/4630>
!!! warning !!! warning
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism. Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
@ -183,7 +183,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models ## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py). an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
??? code ??? code
@ -218,8 +218,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
A few important things to consider when using the EAGLE based draft models: A few important things to consider when using the EAGLE based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should 1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
be able to be loaded and used directly by vLLM after <gh-pr:12304>. be able to be loaded and used directly by vLLM after <https://github.com/vllm-project/vllm/pull/12304>.
If you are using a vLLM version before <gh-pr:12304>, please use the If you are using a vLLM version before <https://github.com/vllm-project/vllm/pull/12304>, please use the
[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model, [script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue. and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
@ -229,7 +229,7 @@ A few important things to consider when using the EAGLE based draft models:
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is 3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: <gh-issue:9565>. investigation and tracked here: <https://github.com/vllm-project/vllm/issues/9565>.
4. When using EAGLE-3 based draft model, option "method" must be set to "eagle3". 4. When using EAGLE-3 based draft model, option "method" must be set to "eagle3".
That is, to specify `"method": "eagle3"` in `speculative_config`. That is, to specify `"method": "eagle3"` in `speculative_config`.
@ -267,7 +267,7 @@ speculative decoding, breaking down the guarantees into three key areas:
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252) > distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling > - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler, > without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e> > provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](../../tests/spec_decode/e2e)
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291) > verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
3. **vLLM Logprob Stability** 3. **vLLM Logprob Stability**
@ -289,4 +289,4 @@ For mitigation strategies, please refer to the FAQ entry *Can the output of a pr
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4) - [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a) - [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8) - [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](gh-issue:4565) - [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)

View File

@ -298,7 +298,7 @@ Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equa
Answer: x = -29/8 Answer: x = -29/8
``` ```
An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/structured_outputs> An example of using `structural_tag` can be found here: [examples/online_serving/structured_outputs](../../examples/online_serving/structured_outputs)
## Offline Inference ## Offline Inference

View File

@ -151,9 +151,9 @@ Known issues:
much shorter than what vLLM generates. Since an exception is thrown when this condition much shorter than what vLLM generates. Since an exception is thrown when this condition
is not met, the following additional chat templates are provided: is not met, the following additional chat templates are provided:
* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that * [examples/tool_chat_template_mistral.jinja](../../examples/tool_chat_template_mistral.jinja) - this is the "official" Mistral chat template, but tweaked so that
it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits) it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt * [examples/tool_chat_template_mistral_parallel.jinja](../../examples/tool_chat_template_mistral_parallel.jinja) - this is a "better" version that adds a tool-use system prompt
when tools are provided, that results in much better reliability when working with parallel tool calling. when tools are provided, that results in much better reliability when working with parallel tool calling.
Recommended flags: Recommended flags:
@ -187,16 +187,16 @@ Known issues:
VLLM provides two JSON-based chat templates for Llama 3.1 and 3.2: VLLM provides two JSON-based chat templates for Llama 3.1 and 3.2:
* <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1 * [examples/tool_chat_template_llama3.1_json.jinja](../../examples/tool_chat_template_llama3.1_json.jinja) - this is the "official" chat template for the Llama 3.1
models, but tweaked so that it works better with vLLM. models, but tweaked so that it works better with vLLM.
* <gh-file:examples/tool_chat_template_llama3.2_json.jinja> - this extends upon the Llama 3.1 chat template by adding support for * [examples/tool_chat_template_llama3.2_json.jinja](../../examples/tool_chat_template_llama3.2_json.jinja) - this extends upon the Llama 3.1 chat template by adding support for
images. images.
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}` Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
VLLM also provides a pythonic and JSON-based chat template for Llama 4, but pythonic tool calling is recommended: VLLM also provides a pythonic and JSON-based chat template for Llama 4, but pythonic tool calling is recommended:
* <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models. * [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja) - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
For Llama 4 models, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`. For Llama 4 models, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
@ -212,7 +212,7 @@ Supported models:
Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja` Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Hugging Face. Parallel function calls are supported. [examples/tool_chat_template_granite.jinja](../../examples/tool_chat_template_granite.jinja): this is a modified chat template from the original on Hugging Face. Parallel function calls are supported.
* `ibm-granite/granite-3.1-8b-instruct` * `ibm-granite/granite-3.1-8b-instruct`
@ -224,7 +224,7 @@ Supported models:
Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja` Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported. [examples/tool_chat_template_granite_20b_fc.jinja](../../examples/tool_chat_template_granite_20b_fc.jinja): this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`) ### InternLM Models (`internlm`)
@ -282,8 +282,8 @@ Flags: `--tool-call-parser hermes`
Supported models: Supported models:
* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>) * `MiniMaxAi/MiniMax-M1-40k` (use with [examples/tool_chat_template_minimax_m1.jinja](../../examples/tool_chat_template_minimax_m1.jinja))
* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>) * `MiniMaxAi/MiniMax-M1-80k` (use with [examples/tool_chat_template_minimax_m1.jinja](../../examples/tool_chat_template_minimax_m1.jinja))
Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja` Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja`
@ -291,8 +291,8 @@ Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_m
Supported models: Supported models:
* `deepseek-ai/DeepSeek-V3-0324` (use with <gh-file:examples/tool_chat_template_deepseekv3.jinja>) * `deepseek-ai/DeepSeek-V3-0324` (use with [examples/tool_chat_template_deepseekv3.jinja](../../examples/tool_chat_template_deepseekv3.jinja))
* `deepseek-ai/DeepSeek-R1-0528` (use with <gh-file:examples/tool_chat_template_deepseekr1.jinja>) * `deepseek-ai/DeepSeek-R1-0528` (use with [examples/tool_chat_template_deepseekr1.jinja](../../examples/tool_chat_template_deepseekr1.jinja))
Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}` Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
@ -300,7 +300,7 @@ Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
Supported models: Supported models:
* `deepseek-ai/DeepSeek-V3.1` (use with <gh-file:examples/tool_chat_template_deepseekv31.jinja>) * `deepseek-ai/DeepSeek-V3.1` (use with [examples/tool_chat_template_deepseekv31.jinja](../../examples/tool_chat_template_deepseekv31.jinja))
Flags: `--tool-call-parser deepseek_v31 --chat-template {see_above}` Flags: `--tool-call-parser deepseek_v31 --chat-template {see_above}`
@ -379,12 +379,12 @@ Limitations:
Example supported models: Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>) * `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with [examples/tool_chat_template_llama3.2_pythonic.jinja](../../examples/tool_chat_template_llama3.2_pythonic.jinja))
* `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>) * `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with [examples/tool_chat_template_llama3.2_pythonic.jinja](../../examples/tool_chat_template_llama3.2_pythonic.jinja))
* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>) * `Team-ACE/ToolACE-8B` (use with [examples/tool_chat_template_toolace.jinja](../../examples/tool_chat_template_toolace.jinja))
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>) * `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with [examples/tool_chat_template_toolace.jinja](../../examples/tool_chat_template_toolace.jinja))
* `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>) * `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja))
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>) * `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja))
Flags: `--tool-call-parser pythonic --chat-template {see_above}` Flags: `--tool-call-parser pythonic --chat-template {see_above}`
@ -393,7 +393,7 @@ Flags: `--tool-call-parser pythonic --chat-template {see_above}`
## How to Write a Tool Parser Plugin ## How to Write a Tool Parser Plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>. A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in [vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py](../../vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py).
Here is a summary of a plugin file: Here is a summary of a plugin file:

View File

@@ -4,19 +4,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
 === "Intel/AMD x86"
-    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
+    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"
 === "ARM AArch64"
-    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
+    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"
 === "Apple silicon"
-    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
+    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"
 === "IBM Z (S390X)"
-    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
+    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
 ## Requirements
@@ -24,19 +24,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
 === "Intel/AMD x86"
-    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"
 === "ARM AArch64"
-    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"
 === "Apple silicon"
-    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"
 === "IBM Z (S390X)"
-    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"
 ## Set up using Python
@@ -52,19 +52,19 @@ Currently, there are no pre-built CPU wheels.
 === "Intel/AMD x86"
-    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"
 === "ARM AArch64"
-    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"
 === "Apple silicon"
-    --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"
 === "IBM Z (s390x)"
-    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"
 ## Set up using Docker
@@ -72,24 +72,24 @@ Currently, there are no pre-built CPU wheels.
 === "Intel/AMD x86"
-    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
+    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
 ### Build image from source
 === "Intel/AMD x86"
-    --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"
 === "ARM AArch64"
-    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
 === "Apple silicon"
-    --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
 === "IBM Z (S390X)"
-    --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"
 ## Related runtime environment variables


@@ -157,7 +157,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
 ### Build image from source
-You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
+You can use [docker/Dockerfile.tpu](../../../docker/Dockerfile.tpu) to build a Docker image with TPU support.
 ```bash
 docker build -f docker/Dockerfile.tpu -t vllm-tpu .


@@ -11,7 +11,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
 # --8<-- [start:set-up-using-python]
 !!! note
-    PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
+    PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
 In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.


@@ -4,15 +4,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:installation"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:installation"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
 ## Requirements
@@ -24,15 +24,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:requirements"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:requirements"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
 ## Set up using Python
@@ -42,29 +42,29 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:set-up-using-python"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:set-up-using-python"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:set-up-using-python"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
 ### Pre-built wheels
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-wheels"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-wheels"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
 [](){ #build-from-source }
@@ -72,15 +72,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-wheel-from-source"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-wheel-from-source"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
 ## Set up using Docker
@@ -88,40 +88,40 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-images"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
 ### Build image from source
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-image-from-source"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-image-from-source"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
 ## Supported features
 === "NVIDIA CUDA"
-    --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
+    --8<-- "docs/getting_started/installation/gpu.cuda.inc.md:supported-features"
 === "AMD ROCm"
-    --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
+    --8<-- "docs/getting_started/installation/gpu.rocm.inc.md:supported-features"
 === "Intel XPU"
-    --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"
+    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"


@@ -146,7 +146,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO
 #### (Optional) Build an image with ROCm software stack
-Build a docker image from <gh-file:docker/Dockerfile.rocm_base> which setup ROCm software stack needed by the vLLM.
+Build a docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base) which setup ROCm software stack needed by the vLLM.
 **This step is optional as this rocm_base image is usually prebuilt and store at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up user experience.**
 If you choose to build this rocm_base image yourself, the steps are as follows.
@@ -170,7 +170,7 @@ DOCKER_BUILDKIT=1 docker build \
 #### Build an image with vLLM
-First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
+First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
 It is important that the user kicks off the docker build using buildkit. Either the user put `DOCKER_BUILDKIT=1` as environment variable when calling docker build command, or the user needs to set up buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
 ```bash
@@ -181,10 +181,10 @@ It is important that the user kicks off the docker build using buildkit. Either
 }
 ```
-<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
+[docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
 It provides flexibility to customize the build of docker image using the following arguments:
-- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:docker/Dockerfile.rocm_base>
+- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base)
 - `ARG_PYTORCH_ROCM_ARCH`: Allows to override the gfx architecture values from the base docker image
 Their values can be passed in when running `docker build` with `--build-arg` options.


@@ -75,7 +75,7 @@ vllm serve facebook/opt-13b \
     -tp=8
 ```
-By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
+By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
 # --8<-- [end:supported-features]
 # --8<-- [start:distributed-backend]


@@ -46,7 +46,7 @@ uv pip install vllm --torch-backend=auto
 ## Offline Batched Inference
-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py>
+With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
 The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
@@ -201,7 +201,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
     print("Completion result:", completion)
 ```
-A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
+A more detailed client example can be found here: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
 ### OpenAI Chat Completions API with vLLM
@@ -253,4 +253,4 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
 If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
 !!! warning
-    There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it.
+    There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.
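For illustration, a minimal sketch of pinning the attention backend described above; the model name is an arbitrary placeholder and the environment variable must be set before the engine is created:

```python
import os

# "FLASH_ATTN", "FLASHINFER" and "XFORMERS" are the options listed above
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder model
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```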


@@ -137,13 +137,20 @@ class Example:
             gh_file = (self.main_file.parent / relative_path).resolve()
             gh_file = gh_file.relative_to(ROOT_DIR)
-            return f"[{link_text}](gh-file:{gh_file})"
+            # Make GitHub URL
+            url = "https://github.com/vllm-project/vllm/"
+            url += "tree/main" if self.path.is_dir() else "blob/main"
+            gh_url = f"{url}/{gh_file}"
+            return f"[{link_text}]({gh_url})"
         return re.sub(link_pattern, replace_link, content)
     def generate(self) -> str:
         content = f"# {self.title}\n\n"
-        content += f"Source <gh-file:{self.path.relative_to(ROOT_DIR)}>.\n\n"
+        url = "https://github.com/vllm-project/vllm/"
+        url += "tree/main" if self.path.is_dir() else "blob/main"
+        content += f"Source <{url}/{self.path.relative_to(ROOT_DIR)}>.\n\n"
         # Use long code fence to avoid issues with
         # included files containing code fences too
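A quick standalone illustration of the URL scheme used by the generator change above; the checkout path and helper name here are hypothetical, not part of the real script:

```python
from pathlib import Path

ROOT_DIR = Path("/workspace/vllm")  # hypothetical checkout location

def gh_url(path: Path) -> str:
    # Directories link to tree/main, files to blob/main, mirroring the change above
    slug = "tree/main" if path.is_dir() else "blob/main"
    return f"https://github.com/vllm-project/vllm/{slug}/{path.relative_to(ROOT_DIR)}"

print(gh_url(ROOT_DIR / "examples" / "offline_inference" / "basic" / "basic.py"))
# https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic/basic.py
```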


@@ -1,123 +1,95 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 """
-This is basically a port of MyST parsers external URL resolution mechanism
-(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
-to work with MkDocs.
-It allows Markdown authors to use GitHub shorthand links like:
-- [Text](gh-issue:123)
-- <gh-pr:456>
-- [File](gh-file:path/to/file.py#L10)
-These are automatically rewritten into fully qualified GitHub URLs pointing to
-issues, pull requests, files, directories, or projects in the
-`vllm-project/vllm` repository.
+MkDocs hook to enable the following links to render correctly:
+- Relative file links outside of the `docs/` directory, e.g.:
+    - [Text](../some_file.py)
+    - [Directory](../../some_directory/)
+- GitHub URLs for issues, pull requests, and projects, e.g.:
+    - Adds GitHub icon before links
+    - Replaces raw links with descriptive text,
+      e.g. <...pull/123> -> [Pull Request #123](.../pull/123)
+    - Works for external repos too by including the `owner/repo` in the link title
 The goal is to simplify cross-referencing common GitHub resources
 in project docs.
 """
+from pathlib import Path
 import regex as re
 from mkdocs.config.defaults import MkDocsConfig
 from mkdocs.structure.files import Files
 from mkdocs.structure.pages import Page
+ROOT_DIR = Path(__file__).parent.parent.parent.parent.resolve()
+DOC_DIR = ROOT_DIR / "docs"
+gh_icon = ":octicons-mark-github-16:"
+# Regex pieces
+TITLE = r"(?P<title>[^\[\]<>]+?)"
+REPO = r"(?P<repo>.+?/.+?)"
+TYPE = r"(?P<type>issues|pull|projects)"
+NUMBER = r"(?P<number>\d+)"
+FRAGMENT = r"(?P<fragment>#[^\s]+)?"
+URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"
+RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)"
+# Common titles to use for GitHub links when none is provided in the link.
+TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
+# Regex to match GitHub issue, PR, and project links with optional titles.
+github_link = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")
+# Regex to match relative file links with optional titles.
+relative_link = re.compile(rf"\[{TITLE}\]\({RELATIVE}\)")
 def on_page_markdown(
     markdown: str, *, page: Page, config: MkDocsConfig, files: Files
 ) -> str:
-    """
-    Custom MkDocs plugin hook to rewrite special GitHub reference links
-    in Markdown.
-    This function scans the given Markdown content for specially formatted
-    GitHub shorthand links, such as:
-    - `[Link text](gh-issue:123)`
-    - `<gh-pr:456>`
-    And rewrites them into fully-qualified GitHub URLs with GitHub icons:
-    - `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
-    - `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
-    Supported shorthand types:
-    - `gh-issue`
-    - `gh-pr`
-    - `gh-project`
-    - `gh-dir`
-    - `gh-file`
-    Args:
-        markdown (str): The raw Markdown content of the page.
-        page (Page): The MkDocs page object being processed.
-        config (MkDocsConfig): The MkDocs site configuration.
-        files (Files): The collection of files in the MkDocs build.
-    Returns:
-        str: The updated Markdown content with GitHub shorthand links replaced.
-    """
-    gh_icon = ":octicons-mark-github-16:"
-    gh_url = "https://github.com"
-    repo_url = f"{gh_url}/vllm-project/vllm"
-    org_url = f"{gh_url}/orgs/vllm-project"
-    # Mapping of shorthand types to their corresponding GitHub base URLs
-    urls = {
-        "issue": f"{repo_url}/issues",
-        "pr": f"{repo_url}/pull",
-        "project": f"{org_url}/projects",
-        "dir": f"{repo_url}/tree/main",
-        "file": f"{repo_url}/blob/main",
-    }
-    # Default title prefixes for auto links
-    titles = {
-        "issue": "Issue #",
-        "pr": "Pull Request #",
-        "project": "Project #",
-        "dir": "",
-        "file": "",
-    }
-    # Regular expression to match GitHub shorthand links
-    scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
-    inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
-    auto_link = re.compile(f"<{scheme}>")
-    def replace_inline_link(match: re.Match) -> str:
-        """
-        Replaces a matched inline-style GitHub shorthand link
-        with a full Markdown link.
-        Example:
-            [My issue](gh-issue:123) [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
-        """
-        url = f"{urls[match.group('type')]}/{match.group('path')}"
-        if fragment := match.group("fragment"):
-            url += f"#{fragment}"
-        return f"[{gh_icon} {match.group('title')}]({url})"
-    def replace_auto_link(match: re.Match) -> str:
-        """
-        Replaces a matched autolink-style GitHub shorthand
-        with a full Markdown link.
-        Example:
-            <gh-pr:456> [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
-        """
-        type = match.group("type")
+    def replace_relative_link(match: re.Match) -> str:
+        """Replace relative file links with URLs if they point outside the docs dir."""
+        title = match.group("title")
         path = match.group("path")
-        title = f"{titles[type]}{path}"
-        url = f"{urls[type]}/{path}"
-        if fragment := match.group("fragment"):
-            url += f"#{fragment}"
+        path = (Path(page.file.abs_src_path).parent / path).resolve()
+        # Check if the path exists and is outside the docs dir
+        if not path.exists() or path.is_relative_to(DOC_DIR):
+            return match.group(0)
+        # Files and directories have different URL schemes on GitHub
+        slug = "tree/main" if path.is_dir() else "blob/main"
+        path = path.relative_to(ROOT_DIR)
+        url = f"https://github.com/vllm-project/vllm/{slug}/{path}"
         return f"[{gh_icon} {title}]({url})"
-    # Replace both inline and autolinks
-    markdown = inline_link.sub(replace_inline_link, markdown)
-    markdown = auto_link.sub(replace_auto_link, markdown)
+    def replace_github_link(match: re.Match) -> str:
+        """Replace GitHub issue, PR, and project links with enhanced Markdown links."""
+        repo = match.group("repo")
+        type = match.group("type")
+        number = match.group("number")
+        # Title and fragment could be None
+        title = match.group("title") or ""
+        fragment = match.group("fragment") or ""
+        # Use default titles for raw links
+        if not title:
+            title = TITLES[type]
+            if "vllm-project" not in repo:
+                title += repo
+        title += f"#{number}"
+        url = f"https://github.com/{repo}/{type}/{number}{fragment}"
+        return f"[{gh_icon} {title}]({url})"
+    markdown = relative_link.sub(replace_relative_link, markdown)
+    markdown = github_link.sub(replace_github_link, markdown)
+    if "interface" in str(page.file.abs_src_path):
+        print(markdown)
     return markdown
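For illustration, a standalone sketch of the autolink path handled by `replace_github_link` above. It uses a simplified pattern and the stdlib `re` instead of the `regex` package, and the sample string is made up:

```python
import re

TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
URL = r"https://github\.com/(?P<repo>[^/\s]+/[^/\s]+)/(?P<type>issues|pull|projects)/(?P<number>\d+)"
github_link = re.compile(rf"<{URL}>")

def replace(match: re.Match) -> str:
    repo, type, number = match.group("repo", "type", "number")
    # Raw links get a default title; external repos keep their owner/repo in it
    title = TITLES[type] + ("" if "vllm-project" in repo else repo) + f"#{number}"
    return f"[:octicons-mark-github-16: {title}](https://github.com/{repo}/{type}/{number})"

sample = "See <https://github.com/vllm-project/vllm/pull/12622> for details."
print(github_link.sub(replace, sample))
# See [:octicons-mark-github-16: Pull Request #12622](https://github.com/vllm-project/vllm/pull/12622) for details.
```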


@@ -82,7 +82,7 @@ vllm serve /path/to/sharded/model \
     --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
 ```
-To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
+To create sharded model files, you can use the script provided in [examples/offline_inference/save_sharded_state.py](../../../examples/offline_inference/save_sharded_state.py). This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
 The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:


@@ -59,7 +59,7 @@ for output in outputs:
 By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
 However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
-A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
+A code example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
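A minimal sketch of the option described above; the model name is an arbitrary placeholder:

```python
from vllm import LLM

# Use vLLM's neutral sampling defaults instead of the model's generation_config.json
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", generation_config="vllm")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```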
 ### `LLM.beam_search`
@@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
+A code example can be found here: [examples/offline_inference/basic/chat.py](../../examples/offline_inference/basic/chat.py)
 If the model doesn't have a chat template or you want to specify another one,
 you can explicitly pass a chat template:


@@ -9,7 +9,7 @@ before returning them.
 !!! note
     We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
-    We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!
+    We are now planning to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
 ## Configuration
@@ -98,7 +98,7 @@ embeds = output.outputs.embedding
 print(f"Embeddings: {embeds!r} (size={len(embeds)})")
 ```
-A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>
+A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)
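A minimal `LLM.embed` sketch along the same lines; the model name and the `runner="pooling"` engine argument are assumptions for illustration:

```python
from vllm import LLM

# Placeholder embedding model; pooling models use the same LLM entry point
llm = LLM(model="intfloat/e5-small-v2", runner="pooling")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
print(f"Embeddings: {len(embeds)} dimensions")
```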
 ### `LLM.classify`
@@ -115,7 +115,7 @@ probs = output.outputs.probs
 print(f"Class Probabilities: {probs!r} (size={len(probs)})")
 ```
-A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>
+A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)
 ### `LLM.score`
@@ -139,7 +139,7 @@ score = output.outputs.score
 print(f"Score: {score}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
+A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
 ### `LLM.reward`
@@ -156,7 +156,7 @@ data = output.outputs.data
 print(f"Data: {data!r}")
 ```
-A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py>
+A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)
 ### `LLM.encode`
@@ -234,7 +234,7 @@ outputs = llm.embed(
 print(outputs[0].outputs)
 ```
-A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py>
+A code example can be found here: [examples/offline_inference/pooling/embed_matryoshka_fy.py](../../examples/offline_inference/pooling/embed_matryoshka_fy.py)
 ### Online Inference
@@ -264,4 +264,4 @@ Expected output:
 {"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
 ```
-An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py>
+An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)


@@ -9,7 +9,7 @@ Alongside each architecture, we include some popular models that use it.
 ### vLLM
-If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
+If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
 These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
@@ -116,7 +116,7 @@ Here is what happens in the background when this model is loaded:
 1. The config is loaded.
 2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
-3. `MyModel` is loaded into one of the Transformers backend classes in <gh-file:vllm/model_executor/models/transformers.py> which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
+3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers.py](../../vllm/model_executor/models/transformers.py) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
 That's it!
@@ -543,7 +543,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
 ```
 !!! note
-    Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/pooling/qwen3_reranker.py>.
+    Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/offline_inference/pooling/qwen3_reranker.py](../../examples/offline_inference/pooling/qwen3_reranker.py).
     ```bash
     vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
@@ -581,7 +581,7 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
 | `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
 !!! note
-    Named Entity Recognition (NER) usage, please refer to <gh-file:examples/offline_inference/pooling/ner.py>, <gh-file:examples/online_serving/pooling/ner_client.py>.
+    Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py), [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
 [](){ #supported-mm-models }
@@ -776,7 +776,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 !!! note
     The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
-    For more details, please see: <gh-pr:4087#issuecomment-2250397630>
+    For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
 !!! warning
     Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
@@ -856,5 +856,5 @@ We have the following levels of testing for models:
 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
-3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
+3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](../../tests) and [examples](../../examples) for the models that have passed this test.
 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.


@@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
 In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
-This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see <gh-file:examples/offline_inference/data_parallel.py>.
+This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
 There are two distinct modes supported for online deployments - self-contained with internal load balancing, or externally per-rank process deployment and load balancing.


@@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
 ## Verify inter-node GPU communication
-After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to <gh-file:examples/online_serving/run_cluster.sh>, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>.
+After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
 ## No available node types can fulfill resource request
-The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in <gh-file:examples/online_serving/run_cluster.sh> (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>.
+The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
 ## Ray observability


@@ -8,9 +8,9 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
 Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
-1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
+1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
 2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
-3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
+3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
 ### Backend Selection Guide
@@ -195,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
 ### Setup Steps
-1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](gh-file:tools/install_nixl_from_source_ubuntu.py) script.
+1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
 2. **Configure Both Instances**: Add this flag to both prefill and decode instances `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}`. Noted, you may also specify one or multiple NIXL_Backend. Such as: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`


@@ -92,7 +92,7 @@ and all chat requests will error.
 vllm serve <model> --chat-template ./path-to-chat-template.jinja
 ```
-vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.
+vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.
 With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
 both a `type` and a `text` field. An example is provided below:
@@ -181,7 +181,7 @@ with `--enable-request-id-headers`.
 Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
-Code example: <gh-file:examples/online_serving/openai_completion_client.py>
+Code example: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
 #### Extra parameters
@@ -214,7 +214,7 @@ see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more inf
 - *Note: `image_url.detail` parameter is not supported.*
-Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
+Code example: [examples/online_serving/openai_chat_completion_client.py](../../examples/online_serving/openai_chat_completion_client.py)
 #### Extra parameters
@@ -241,7 +241,7 @@ The following extra parameters are supported:
 Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
 you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
-Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
+Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
 If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
 which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
@@ -289,7 +289,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
 to run this model in embedding mode instead of text generation mode.
 The custom chat template is completely different from the original one for this model,
-and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja>
+and can be found here: [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja)
 Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
@@ -336,13 +336,13 @@ and passing a list of `messages` in the request. Refer to the examples below for
 Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
 Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
+by a custom chat template: [examples/template_dse_qwen2_vl.jinja](../../examples/template_dse_qwen2_vl.jinja)
 !!! important
     `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
     example below for details.
-Full example: <gh-file:examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py>
+Full example: [examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py](../../examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py)
 #### Extra parameters
@@ -379,7 +379,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
 !!! note
     To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
-Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
+Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)
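For illustration, a minimal client sketch for the Transcriptions API mentioned above; the server URL, API key, audio file, and served model name are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:  # placeholder audio file
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3-turbo",  # placeholder served model
        file=audio_file,
    )
print(transcription.text)
```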
 #### API Enforced Limits
@@ -496,7 +496,7 @@ Please mind that the popular `openai/whisper-large-v3-turbo` model does not supp
 !!! note
     To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
-Code example: <gh-file:examples/online_serving/openai_translation_client.py>
+Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)
 #### Extra Parameters
@@ -530,7 +530,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
 The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
-Code example: <gh-file:examples/online_serving/pooling/openai_pooling_client.py>
+Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
 [](){ #classification-api }
@@ -540,7 +540,7 @@ Our Classification API directly supports Hugging Face sequence-classification mo
 We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
-Code example: <gh-file:examples/online_serving/pooling/openai_classification_client.py>
+Code example: [examples/online_serving/pooling/openai_classification_client.py](../../examples/online_serving/pooling/openai_classification_client.py)
 #### Example Requests
@@ -658,7 +658,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
 You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
-Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
+Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
 #### Single inference
@@ -839,7 +839,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
 print("Scoring output:", response_json["data"][0]["score"])
 print("Scoring output:", response_json["data"][1]["score"])
 ```
-Full example: <gh-file:examples/online_serving/openai_cross_encoder_score_for_multimodal.py>
+Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
 #### Extra parameters
@@ -871,7 +871,7 @@ endpoints are compatible with both [Jina AI's re-rank API interface](https://jin
 [Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
 popular open-source tools.
-Code example: <gh-file:examples/online_serving/pooling/jinaai_rerank_client.py>
+Code example: [examples/online_serving/pooling/jinaai_rerank_client.py](../../examples/online_serving/pooling/jinaai_rerank_client.py)
 #### Example Request
@@ -949,6 +949,6 @@ Key capabilities:
 - Scales from a single GPU to a multi-node cluster without code changes.
 - Provides observability and autoscaling policies through Ray dashboards and metrics.
-The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: <gh-file:examples/online_serving/ray_serve_deepseek.py>.
+The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).
 Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html).


@@ -72,7 +72,7 @@ For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.htm
 ### Ray cluster setup with containers
-The helper script <gh-file:examples/online_serving/run_cluster.sh> starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
+The helper script [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
 Choose one node as the head node and run:
@@ -132,7 +132,7 @@ vllm serve /path/to/the/model/in/the/container \
 Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
 To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
-<gh-file:examples/online_serving/run_cluster.sh> helper script.
+[examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
 Contact your system administrator for more information about the required flags.
 ## Enabling GPUDirect RDMA

View File

@ -6,7 +6,7 @@ reproducible results:
- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`. - For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
- For V0: Set the global seed (see below). - For V0: Set the global seed (see below).
Example: <gh-file:examples/offline_inference/reproducibility.py> Example: [examples/offline_inference/reproducibility.py](../../examples/offline_inference/reproducibility.py)
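As a minimal sketch of the settings above (the model name and prompt are illustrative placeholders, not taken from the referenced example file):

```python
# Disable V1 engine multiprocessing *before* importing vLLM so that
# scheduling is deterministic across runs.
import os
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

# A fixed seed pins the sampling random state (V1 already defaults to seed=0).
llm = LLM(model="facebook/opt-125m", seed=0)
params = SamplingParams(temperature=0.8, top_p=0.95, seed=0)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

Running this script twice should produce identical text, provided the environment (hardware and library versions) stays the same.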
!!! warning !!! warning
@ -39,7 +39,7 @@ In V1, the `seed` parameter defaults to `0` which sets the random state for each
It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
for workflows such as speculative decoding. for workflows such as speculative decoding.
For more information, see: <gh-pr:17929> For more information, see: <https://github.com/vllm-project/vllm/pull/17929>
### Locality of random state ### Locality of random state

View File

@ -24,7 +24,7 @@ If the model is too large to fit in a single GPU, you will get an out-of-memory
## Generation quality changed ## Generation quality changed
In v0.8.0, the source of default sampling parameters was changed in <gh-pr:12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator. In v0.8.0, the source of default sampling parameters was changed in <https://github.com/vllm-project/vllm/pull/12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance. In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
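If you want request behavior that does not depend on whatever defaults ship in a model's `generation_config.json`, one option is to pass sampling parameters explicitly. A minimal sketch, with an assumed model name and illustrative values:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # placeholder model

# Explicit values take precedence over the model creator's defaults for this request.
neutral_params = SamplingParams(temperature=1.0, top_p=1.0, top_k=-1, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], neutral_params)
print(outputs[0].outputs[0].text)
```

Recent vLLM versions also expose a `generation_config` engine argument that can be set to `"vllm"` to fall back to vLLM's neutral defaults; check the engine arguments of your installed version before relying on it.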
@ -238,7 +238,7 @@ if __name__ == '__main__':
## `torch.compile` Error ## `torch.compile` Error
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script: vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
??? code ??? code
@ -257,7 +257,7 @@ vLLM heavily depends on `torch.compile` to optimize the model for better perform
print(f(x)) print(f(x))
``` ```
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <gh-issue:12219> for an example. If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <https://github.com/vllm-project/vllm/issues/12219> for an example.
## Model failed to be inspected ## Model failed to be inspected
@ -297,7 +297,7 @@ But you are sure that the model is in the [list of supported models](../models/s
## Failed to infer device type ## Failed to infer device type
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue. If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](../../vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](https://github.com/vllm-project/vllm/pull/14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
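A small sketch for inspecting what vLLM detects, assuming `current_platform` is importable from `vllm.platforms` as in recent releases:

```python
# Turn on verbose logging before importing vLLM so platform detection is logged.
import os
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"

from vllm.platforms import current_platform

print(type(current_platform).__name__)  # e.g. CudaPlatform, RocmPlatform, CpuPlatform
print(current_platform.device_type)     # assumed attribute, e.g. "cuda" or "cpu"
```

If the detected platform is not what you expect, the debug logs should show which detection step failed.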
## NCCL error: unhandled system error during `ncclCommInitRank` ## NCCL error: unhandled system error during `ncclCommInitRank`
@ -322,6 +322,6 @@ This indicates vLLM failed to initialize the NCCL communicator, possibly due to
## Known Issues ## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759). - In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759).
- To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations. - To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations.
- In some PCIe machines (e.g. machines without NVLink), if you see an error like `transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'`, it's likely caused by a driver bug. See [this issue](https://github.com/NVIDIA/nccl/issues/1838) for more details. In that case, you can try to set `NCCL_CUMEM_HOST_ENABLE=0` to disable the feature, or upgrade your driver to the latest version. - In some PCIe machines (e.g. machines without NVLink), if you see an error like `transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'`, it's likely caused by a driver bug. See [this issue](https://github.com/NVIDIA/nccl/issues/1838) for more details. In that case, you can try to set `NCCL_CUMEM_HOST_ENABLE=0` to disable the feature, or upgrade your driver to the latest version.

View File

@ -6,7 +6,7 @@ A subset of the data, after cleaning and aggregation, will be publicly released
## What data is collected? ## What data is collected?
The list of data collected by the latest version of vLLM can be found here: <gh-file:vllm/usage/usage_lib.py> The list of data collected by the latest version of vLLM can be found here: [vllm/usage/usage_lib.py](../../vllm/usage/usage_lib.py)
Here is an example as of v0.4.0: Here is an example as of v0.4.0:

View File

@ -2,7 +2,7 @@
!!! announcement !!! announcement
We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details. We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack). V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
@ -94,8 +94,8 @@ See below for the status of models that are not yet supported or have more featu
The initial basic support is now functional. The initial basic support is now functional.
Later, we will consider using [hidden states processor](gh-issue:12249), Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
which is based on [global logits processor](gh-pr:13360) which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
to enable simultaneous generation and embedding using the same engine instance in V1. to enable simultaneous generation and embedding using the same engine instance in V1.
#### Mamba Models #### Mamba Models
@ -124,13 +124,13 @@ encoder and decoder (e.g., `BartForConditionalGeneration`,
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> | | **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **LoRA** | <nobr>🚀 Optimized</nobr> | | **LoRA** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> | | **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<gh-pr:15191>)</nobr>| | **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<https://github.com/vllm-project/vllm/pull/15191>)</nobr>|
| **Spec Decode** | <nobr>🚀 Optimized</nobr> | | **Spec Decode** | <nobr>🚀 Optimized</nobr> |
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](gh-issue:13414))</nobr>| | **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> | | **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> | | **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](gh-issue:13361))</nobr>| | **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](gh-pr:13360))</nobr> | | **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> | | **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
!!! note !!! note
@ -168,11 +168,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha
##### Sampling features ##### Sampling features
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361). - **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom - **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, the design is moving toward supporting **global logits feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360). processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
##### KV Cache features ##### KV Cache features

View File

@ -14,7 +14,7 @@ python examples/offline_inference/pooling/convert_model_to_seq_cls.py --model_na
## Embed jina_embeddings_v3 usage ## Embed jina_embeddings_v3 usage
Only the text matching task is supported for now. See <gh-pr:16120> Only the text matching task is supported for now. See <https://github.com/vllm-project/vllm/pull/16120>
```bash ```bash
python examples/offline_inference/pooling/embed_jina_embeddings_v3.py python examples/offline_inference/pooling/embed_jina_embeddings_v3.py