[Docs] Reduce custom syntax used in docs (#27009)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor 2025-10-17 04:05:34 +01:00 committed by GitHub
parent 965c5f4914
commit 4ffd6e8942
65 changed files with 381 additions and 402 deletions

View File

@ -23,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
!!! note
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
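    For reference, loading such a pre-sharded checkpoint is a small change to the `LLM` constructor. A minimal sketch, assuming the sharded checkpoint was written to a local directory by the script above (the path is a placeholder):
    ```python
    from vllm import LLM

    # Load a checkpoint that was pre-sharded with save_sharded_state.py.
    # Each tensor-parallel rank then reads only its own shard from disk.
    llm = LLM(
        model="/path/to/sharded/checkpoint",  # placeholder output directory
        load_format="sharded_state",
        tensor_parallel_size=2,
    )
    ```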
## Quantization

View File

@ -174,14 +174,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
Known supported models (with corresponding benchmarks):
- dots_ocr (<gh-pr:25466>)
- GLM-4.1V or above (<gh-pr:23168>)
- InternVL (<gh-pr:23909>)
- Kimi-VL (<gh-pr:23817>)
- Llama4 (<gh-pr:18368>)
- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>)
- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>)
- Step3 (<gh-pr:22697>)
- dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
- GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
- InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
- Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
- Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
- MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
- Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
- Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
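For orientation, enabling this mode is a single engine argument. A minimal sketch (the model name is a placeholder from the supported families listed above; adjust it and the parallel sizes to your deployment):
```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder; any supported model above works
    tensor_parallel_size=2,
    mm_encoder_tp_mode="data",  # run the multi-modal encoder with data parallelism
)
```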
## Input Processing

View File

@ -96,7 +96,7 @@ Although it's common to do this with GPUs, don't try to fragment 2 or 8 differ
### Tune your workloads
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
### Future Topics We'll Cover

View File

@ -22,7 +22,7 @@ Unsure on where to start? Check out the following links for tasks to work on:
## License
See <gh-file:LICENSE>.
See [LICENSE](../../LICENSE).
## Developing
@ -54,7 +54,7 @@ For more details about installing from source and installing for other hardware,
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
!!! tip
vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](gh-file:docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](../../docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
@ -88,7 +88,7 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
### Documentation
MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, <gh-file:mkdocs.yaml>.
MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, [mkdocs.yaml](../../mkdocs.yaml).
Get started with:
@ -152,7 +152,7 @@ pytest -s -v tests/test_logger.py
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
!!! important
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
If you discover a security vulnerability, please follow the instructions [here](../../SECURITY.md).
## Pull Requests & Code Reviews
@ -162,7 +162,7 @@ code quality and improve the efficiency of the review process.
### DCO and Signed-off-by
When contributing changes to this project, you must agree to the <gh-file:DCO>.
When contributing changes to this project, you must agree to the [DCO](../../DCO).
Commits must include a `Signed-off-by:` header which certifies agreement with
the terms of the DCO.

View File

@ -822,7 +822,7 @@ you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backen
- CLIP: `--backend openai-embeddings-clip`
- VLM2Vec: `--backend openai-embeddings-vlm2vec`
For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
@ -962,7 +962,7 @@ For more results visualization, check the [visualizing the results](https://gith
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
### Continuous Benchmarking
@ -996,4 +996,4 @@ These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lm
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
More information on the nightly benchmarks and their parameters can be found [here](../../.buildkite/nightly-benchmarks/nightly-descriptions.md).

View File

@ -64,7 +64,7 @@ Download the full log file from Buildkite locally.
Strip timestamps and colorization:
<gh-file:.buildkite/scripts/ci-clean-log.sh>
[.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
```bash
./ci-clean-log.sh ci.log
@ -87,7 +87,7 @@ tail -525 ci_build.log | wl-copy
CI test failures may be flaky. Use a bash loop to run repeatedly:
<gh-file:.buildkite/scripts/rerun-test.sh>
[.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
```bash
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]

View File

@ -5,7 +5,7 @@ release in CI/CD. It is standard practice to submit a PR to update the
PyTorch version as early as possible when a new [PyTorch stable
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
This process is non-trivial due to the gap between PyTorch
releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
releases. Using <https://github.com/vllm-project/vllm/pull/16859> as an example, this document outlines common steps to achieve this
update along with a list of potential issues and how to address them.
## Test PyTorch release candidates (RCs)
@ -85,7 +85,7 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
it doesn't populate the cache, so re-running it to warm up the cache
is ineffective.
While ongoing efforts like [#17419](gh-issue:17419)
While ongoing efforts like <https://github.com/vllm-project/vllm/issues/17419>
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
when manually triggering a build on Buildkite. This branch accomplishes two things:
@ -138,5 +138,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
for different platforms in vLLM CI/CD allows us to selectively choose
which platforms to update. For instance, updating XPU requires the corresponding
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
<gh-pr:17444> completed the update for XPU.
While <https://github.com/vllm-project/vllm/pull/16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
<https://github.com/vllm-project/vllm/pull/17444> completed the update for XPU.

View File

@ -1,6 +1,6 @@
# Dockerfile
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible server with vLLM.
More information about deploying with Docker can be found [here](../../deployment/docker.md).
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

View File

@ -5,7 +5,7 @@ This guide walks you through the steps to implement a basic vLLM model.
## 1. Bring your model code
First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
For instance, vLLM's [OPT model](../../../vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
!!! warning
@ -83,7 +83,7 @@ def forward(
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
For reference, check out our [Llama implementation](../../../vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out [vllm/model_executor/models](../../../vllm/model_executor/models) for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support
@ -130,22 +130,22 @@ We consider 3 different scenarios:
2. Models that combine Mamba layers (either Mamba-1 or Mamba-2) together with attention layers.
3. Models that combine Mamba-like mechanisms (e.g., Linear Attention, ShortConv) together with attention layers.
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](gh-file:vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](gh-file:vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
For case (1), we recommend looking at the implementation of [`MambaForCausalLM`](../../../vllm/model_executor/models/mamba.py) (for Mamba-1) or [`Mamba2ForCausalLM`](../../../vllm/model_executor/models/mamba2.py) (for Mamba-2) as a reference.
The model should inherit protocol `IsAttentionFree` and also implement class methods `get_mamba_state_dtype_from_config` and `get_mamba_state_shape_from_config` to calculate the state shapes and data types from the config.
For the mamba layers themselves, please use the [`MambaMixer`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](gh-file:vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
For the mamba layers themselves, please use the [`MambaMixer`](../../../vllm/model_executor/layers/mamba/mamba_mixer.py) (for Mamba-1) or [`MambaMixer2`](../../../vllm/model_executor/layers/mamba/mamba_mixer2.py) (for Mamba-2) classes.
Please *do not* use the `MambaCacheManager` (deprecated in V1) or replicate any of the V0-specific code paths in the existing model implementations.
V0-only classes and code will be removed in the very near future.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in <gh-file:vllm/model_executor/models/config.py> to ensure that the runtime defaults are optimized.
The model should also be added to the `MODELS_CONFIG_MAP` dictionary in [vllm/model_executor/models/config.py](../../../vllm/model_executor/models/config.py) to ensure that the runtime defaults are optimized.
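To give a feel for the shape of such a model class, here is a rough, non-authoritative skeleton; the method names come from the description above, but the exact signatures, return types, and config access should be taken from `mamba.py`/`mamba2.py` rather than from this sketch:
```python
import torch


class MyMambaForCausalLM:  # the real class would also inherit the IsAttentionFree protocol
    @classmethod
    def get_mamba_state_dtype_from_config(cls, vllm_config):
        # Derive the dtypes of the conv and SSM states from the config (illustrative values).
        return (torch.float32, torch.float32)

    @classmethod
    def get_mamba_state_shape_from_config(cls, vllm_config):
        # Derive the per-layer conv and SSM state shapes from the config (illustrative values).
        hf_config = vllm_config.model_config.hf_config
        return ((hf_config.hidden_size, 4), (hf_config.hidden_size, hf_config.state_size))
```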
For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](gh-file:vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](gh-file:vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
For case (2), we recommend using as a reference the implementation of [`JambaForCausalLM`](../../../vllm/model_executor/models/jamba.py) (for an example of a model that uses Mamba-1 and attention together) or [`BambaForCausalLM`](../../../vllm/model_executor/models/bamba.py) (for an example of a model that uses Mamba-2 and attention together).
These models should follow the same instructions as case (1), but they should inherit protocol `IsHybrid` (instead of `IsAttentionFree`) and it is *not* necessary to add them to the `MODELS_CONFIG_MAP` (their runtime defaults will be inferred from the protocol).
For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](gh-file:vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](gh-file:vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
For case (3), we recommend looking at the implementation of [`MiniMaxText01ForCausalLM`](../../../vllm/model_executor/models/minimax_text_01.py) or [`Lfm2ForCausalLM`](../../../vllm/model_executor/models/lfm2.py) as a reference, which use custom "mamba-like" layers `MiniMaxText01LinearAttention` and `ShortConv` respectively.
Please follow the same guidelines as case (2) for implementing these models.
We use "mamba-like" to refer to layers that posses a state that is updated in-place, rather than being appended-to (like KV cache for attention).
For implementing new custom mamba-like layers, one should inherit from `MambaBase` and implement the methods `get_state_dtype`, `get_state_shape` to calculate the data types and state shapes at runtime, as well as `mamba_type` and `get_attn_backend`.
It is also necessary to implement the "attention meta-data" class which handles the meta-data that is common across all layers.
Please see [`LinearAttentionMetadata`](gh-file:vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](gh-file:v1/attention/backends/short_conv_attn.py) for examples of this.
Please see [`LinearAttentionMetadata`](../../../vllm/v1/attention/backends/linear_attn.py) or [`ShortConvAttentionMetadata`](../../../vllm/v1/attention/backends/short_conv_attn.py) for examples of this.
Finally, if one wants to support torch compile and CUDA graphs, it is necessary to wrap the call to the mamba-like layer inside a custom op and register it.
Please see the calls to `direct_register_custom_op` in <gh-file:vllm/model_executor/models/minimax_text_01.py> or <gh-file:vllm/model_executor/layers/mamba/short_conv.py> for examples of this.
The new custom op should then be added to the list `_attention_ops` in <gh-file:vllm/config/compilation.py> to ensure that piecewise CUDA graphs works as intended.
Please see the calls to `direct_register_custom_op` in [vllm/model_executor/models/minimax_text_01.py](../../../vllm/model_executor/models/minimax_text_01.py) or [vllm/model_executor/layers/mamba/short_conv.py](../../../vllm/model_executor/layers/mamba/short_conv.py) for examples of this.
The new custom op should then be added to the list `_attention_ops` in [vllm/config/compilation.py](../../../vllm/config/compilation.py) to ensure that piecewise CUDA graphs work as intended.
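The registration itself follows a small, repeated pattern. Below is a hedged sketch of that pattern: the function and op names are hypothetical, the import path may differ between vLLM versions, and the files referenced above remain the authoritative examples.
```python
import torch

from vllm.utils import direct_register_custom_op  # import path may vary by vLLM version


def my_linear_attn(hidden_states: torch.Tensor, output: torch.Tensor) -> None:
    # Real code would call into the mamba-like layer's forward and write into `output`.
    output.copy_(hidden_states)


def my_linear_attn_fake(hidden_states: torch.Tensor, output: torch.Tensor) -> None:
    # Fake (meta) implementation used while tracing for torch.compile.
    return


direct_register_custom_op(
    op_name="my_linear_attn",
    op_func=my_linear_attn,
    mutates_args=["output"],
    fake_impl=my_linear_attn_fake,
)
# "my_linear_attn" would then also be appended to `_attention_ops` in vllm/config/compilation.py.
```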

View File

@ -507,7 +507,7 @@ return a schema of the tensors outputted by the HF processor that are related to
```
!!! note
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
Our [actual code](../../../vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
=== "With postprocessing: Fuyu"
@ -569,7 +569,7 @@ return a schema of the tensors outputted by the HF processor that are related to
```
!!! note
Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
Our [actual code](../../../vllm/model_executor/models/fuyu.py) has special handling
for text-only inputs to prevent unnecessary warnings from HF processor.
!!! note
@ -828,8 +828,8 @@ Some HF processors directly insert feature tokens without replacing anything in
Examples:
- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
- BLIP-2 (insert at start of prompt): [vllm/model_executor/models/blip2.py](../../../vllm/model_executor/models/blip2.py)
- Molmo (insert after `<|endoftext|>` token): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Handling prompt updates unrelated to multi-modal data
@ -837,9 +837,9 @@ Examples:
Examples:
- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
- Chameleon (appends `sep_token`): [vllm/model_executor/models/chameleon.py](../../../vllm/model_executor/models/chameleon.py)
- Fuyu (appends `boa_token`): [vllm/model_executor/models/fuyu.py](../../../vllm/model_executor/models/fuyu.py)
- Molmo (applies chat template which is not defined elsewhere): [vllm/model_executor/models/molmo.py](../../../vllm/model_executor/models/molmo.py)
### Custom HF processor
@ -847,6 +847,6 @@ Some models don't define an HF processor class on HF Hub. In that case, you can
Examples:
- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>
- DeepSeek-VL2: [vllm/model_executor/models/deepseek_vl2.py](../../../vllm/model_executor/models/deepseek_vl2.py)
- InternVL: [vllm/model_executor/models/internvl.py](../../../vllm/model_executor/models/internvl.py)
- Qwen-VL: [vllm/model_executor/models/qwen_vl.py](../../../vllm/model_executor/models/qwen_vl.py)

View File

@ -11,8 +11,8 @@ This page provides detailed instructions on how to do so.
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](basic.md)), put it into the <gh-dir:vllm/model_executor/models> directory.
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
After you have implemented your model (see [tutorial](basic.md)), put it into the [vllm/model_executor/models](../../../vllm/model_executor/models) directory.
Then, add your model class to `_VLLM_MODELS` in [vllm/model_executor/models/registry.py](../../../vllm/model_executor/models/registry.py) so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](../../models/supported_models.md) to promote your model!
!!! important

View File

@ -9,7 +9,7 @@ Without them, the CI for your PR will fail.
### Model loading
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
Include an example HuggingFace repository for your model in [tests/models/registry.py](../../../tests/models/registry.py).
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
!!! important
@ -26,18 +26,18 @@ Passing these tests provides more confidence that your implementation is correct
### Model correctness
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of <gh-dir:tests/models>.
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of [tests/models](../../../tests/models).
#### Generative models
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>:
For [generative models](../../models/generative_models.md), there are two levels of correctness tests, as defined in [tests/models/utils.py](../../../tests/models/utils.py):
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
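A typical correctness test therefore ends up looking roughly like the sketch below; the fixture names (`hf_runner`, `vllm_runner`, `example_prompts`) and the helper come from the existing test suite, while the model repository, generation parameters, and relative import depth are placeholders:
```python
import pytest

from ..utils import check_logprobs_close  # defined in tests/models/utils.py; adjust the relative depth


@pytest.mark.parametrize("model", ["my-org/my-new-model"])  # hypothetical HF repo
def test_models(hf_runner, vllm_runner, example_prompts, model):
    with hf_runner(model) as hf_model:
        hf_outputs = hf_model.generate_greedy_logprobs_limit(
            example_prompts, max_tokens=32, num_logprobs=5
        )
    with vllm_runner(model) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy_logprobs(
            example_prompts, max_tokens=32, num_logprobs=5
        )

    # vLLM's logprobs should fall within HF's top-k, and vice versa.
    check_logprobs_close(
        outputs_0_lst=hf_outputs,
        outputs_1_lst=vllm_outputs,
        name_0="hf",
        name_1="vllm",
    )
```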
#### Pooling models
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in <gh-file:tests/models/utils.py>.
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
[](){ #mm-processing-tests }
@ -45,7 +45,7 @@ For [pooling models](../../models/pooling_models.md), we simply check the cosine
#### Common tests
Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> verifies that the following input combinations result in the same outputs:
Adding your model to [tests/models/multimodal/processing/test_common.py](../../../tests/models/multimodal/processing/test_common.py) verifies that the following input combinations result in the same outputs:
- Text + multi-modal data
- Tokens + multi-modal data
@ -54,6 +54,6 @@ Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py>
#### Model-specific tests
You can add a new file under <gh-dir:tests/models/multimodal/processing> to run tests that only apply to your model.
You can add a new file under [tests/models/multimodal/processing](../../../tests/models/multimodal/processing) to run tests that only apply to your model.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in <gh-file:tests/models/multimodal/processing/test_phi3v.py>.
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in [tests/models/multimodal/processing/test_phi3v.py](../../../tests/models/multimodal/processing/test_phi3v.py).

View File

@ -248,9 +248,9 @@ No extra registration is required beyond having your model class available via t
## Examples in-tree
- Whisper encoder-decoder (audio-only): <gh-file:vllm/model_executor/models/whisper.py>
- Voxtral decoder-only (audio embeddings + LLM): <gh-file:vllm/model_executor/models/voxtral.py>
- Gemma3n decoder-only with fixed instruction prompt: <gh-file:vllm/model_executor/models/gemma3n_mm.py>
- Whisper encoder-decoder (audio-only): [vllm/model_executor/models/whisper.py](../../../vllm/model_executor/models/whisper.py)
- Voxtral decoder-only (audio embeddings + LLM): [vllm/model_executor/models/voxtral.py](../../../vllm/model_executor/models/voxtral.py)
- Gemma3n decoder-only with fixed instruction prompt: [vllm/model_executor/models/gemma3n_mm.py](../../../vllm/model_executor/models/gemma3n_mm.py)
## Test with the API
@ -278,7 +278,7 @@ Once your model implements `SupportsTranscription`, you can test the endpoints (
http://localhost:8000/v1/audio/translations
```
Or check out more examples in <gh-file:examples/online_serving>.
Or check out more examples in [examples/online_serving](../../../examples/online_serving).
!!! note
- If your model handles chunking internally (e.g., via its processor or encoder), set `min_energy_split_window_size=None` in the returned `SpeechToTextConfig` to disable server-side chunking.

View File

@ -33,7 +33,7 @@ Traces can be visualized using <https://ui.perfetto.dev/>.
#### Offline Inference
Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
Refer to [examples/offline_inference/simple_profiling.py](../../examples/offline_inference/simple_profiling.py) for an example.
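The essence of that example is shown below; the model name and output directory are placeholders, and the environment variable must be set before vLLM is imported:
```python
import os

# Tell vLLM's torch profiler where to write traces (must be set before importing vllm).
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
llm.start_profile()
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()
```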
#### OpenAI Server

View File

@ -37,7 +37,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
memory to share data between processes under the hood, particularly for tensor parallel inference.
!!! note
Optional dependencies are not included in order to avoid licensing issues (e.g. <gh-issue:8030>).
Optional dependencies are not included in order to avoid licensing issues (e.g. <https://github.com/vllm-project/vllm/issues/8030>).
If you need to use those dependencies (having accepted the license terms),
create a custom Dockerfile on top of the base image with an extra layer that installs them:
@ -66,7 +66,7 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:
You can build and run vLLM from source via the provided [docker/Dockerfile](../../docker/Dockerfile). To build vLLM:
```bash
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2

View File

@ -5,7 +5,7 @@
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like <gh-file:examples/online_serving/run_cluster.sh>.
without the operational overhead of maintaining Kubernetes control planes, configuring autoscalers, managing observability stacks, or manually managing head and worker nodes with helper scripts like [examples/online_serving/run_cluster.sh](../../../examples/online_serving/run_cluster.sh).
When serving large language models with vLLM, Anyscale can rapidly provision [production-ready HTTPS endpoints](https://docs.anyscale.com/examples/deploy-ray-serve-llms) or [fault-tolerant batch inference jobs](https://docs.anyscale.com/examples/ray-data-llm).

View File

@ -36,7 +36,7 @@ pip install -U vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_langchain.py>
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_langchain.py](../../../examples/online_serving/retrieval_augmented_generation_with_langchain.py)
1. Run the script
@ -74,7 +74,7 @@ pip install vllm \
vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
```
1. Use the script: <gh-file:examples/online_serving/retrieval_augmented_generation_with_llamaindex.py>
1. Use the script: [examples/online_serving/retrieval_augmented_generation_with_llamaindex.py](../../../examples/online_serving/retrieval_augmented_generation_with_llamaindex.py)
1. Run the script:

View File

@ -20,7 +20,7 @@ pip install vllm streamlit openai
vllm serve Qwen/Qwen1.5-0.5B-Chat
```
1. Use the script: <gh-file:examples/online_serving/streamlit_openai_chatbot_webserver.py>
1. Use the script: [examples/online_serving/streamlit_openai_chatbot_webserver.py](../../../examples/online_serving/streamlit_openai_chatbot_webserver.py)
1. Start the streamlit web UI and start to chat:

View File

@ -49,7 +49,7 @@ Here is a sample of `LLM` class usage:
More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](../../vllm/entrypoints/llm.py).
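For orientation, a minimal use of the class looks roughly like this (the model name is a placeholder):
```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```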
### OpenAI-Compatible API Server
@ -60,7 +60,7 @@ This server can be started using the `vllm serve` command.
vllm serve <model>
```
The code for the `vllm` CLI can be found in <gh-file:vllm/entrypoints/cli/main.py>.
The code for the `vllm` CLI can be found in [vllm/entrypoints/cli/main.py](../../vllm/entrypoints/cli/main.py).
Sometimes you may see the API server entrypoint used directly instead of via the
`vllm` CLI command. For example:
@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>
`python -m vllm.entrypoints.openai.api_server` is deprecated
and may become unsupported in a future release.
That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
That code can be found in [vllm/entrypoints/openai/api_server.py](../../vllm/entrypoints/openai/api_server.py).
More details on the API server can be found in the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) document.
@ -101,7 +101,7 @@ processing.
- **Output Processing**: Processes the outputs generated by the model, decoding the
token IDs from a language model into human-readable text.
The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.
The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py](../../vllm/engine/llm_engine.py).
### AsyncLLMEngine
@ -111,9 +111,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
can handle multiple concurrent requests and stream outputs to clients.
The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.
API server that serves as a simpler example in [vllm/entrypoints/api_server.py](../../vllm/entrypoints/api_server.py).
The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.
The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py](../../vllm/engine/async_llm_engine.py).
## Worker

View File

@ -17,7 +17,7 @@ In this document we will discuss the:
In this document, we refer to pure decode (`max_query_len=1`) or speculative decode (`max_query_len=1+num_spec_tokens`) as **uniform decode** batches, and the opposite would be **non-uniform** batches (i.e., prefill or mixed prefill-decode batches).
!!! note
The following contents are mostly based on the last commit of <gh-pr:20059>.
The following contents are mostly based on the last commit of <https://github.com/vllm-project/vllm/pull/20059>.
## Motivation
@ -92,7 +92,7 @@ where `num_tokens` can be the padded token length, and `uniform_decode` is deter
The goal of this structure is to uniquely identify a (padded) batch with the minimal set of items corresponding to a CUDA Graphs item. It is safe to exclude items like `uniform_query_len` because it is currently a runtime constant for a given setup. For example, it should be either `1` for ordinary pure decode or `1+num_spec_tokens` for the validation phase of speculative decode.
!!! note
The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<gh-pr:23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs).
The prototype of `BatchDescriptor` may be extended for more general situations in the future, e.g., include more items, like `uniform_query_len` to support multiple different uniform decode lengths settings (<https://github.com/vllm-project/vllm/pull/23679>), or other modifications needed to support CUDA Graphs for models whose inputs are not necessarily token length aware (for example, some multi-modal inputs).
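Schematically, the structure described above boils down to a small, hashable key; the sketch below is illustrative and omits details of the real class:
```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Minimal key used to look up a captured CUDA graph (illustrative sketch)."""

    num_tokens: int               # padded token count of the batch
    uniform_decode: bool = False  # True for pure/speculative decode batches
```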
### `CudagraphDispatcher`

View File

@ -2,7 +2,7 @@
## Introduction
FusedMoEModularKernel is implemented [here](gh-file:/vllm/model_executor/layers/fused_moe/modular_kernel.py)
FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)
Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.
@ -44,7 +44,7 @@ FusedMoEModularKernel splits the FusedMoE operation into 3 parts,
The TopK Weight Application and Reduction components happen right after the Unpermute operation and before the All2All Combine. Note that the `FusedMoEPermuteExpertsUnpermute` is responsible for the Unpermute and `FusedMoEPrepareAndFinalize` is responsible for the All2All Combine. There is value in doing the TopK Weight Application and Reduction in the `FusedMoEPermuteExpertsUnpermute`. But some implementations choose to do it in `FusedMoEPrepareAndFinalize`. In order to enable this flexibility, we have a TopKWeightAndReduce abstract class.
Please find the implementations of TopKWeightAndReduce [here](gh-file:vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
Please find the implementations of TopKWeightAndReduce [here](../../vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py).
`FusedMoEPrepareAndFinalize::finalize()` method accepts a `TopKWeightAndReduce` argument that is invoked inside the method.
The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExpertsUnpermute` and `FusedMoEPrepareAndFinalize` implementations to determine where the TopK Weight Application and Reduction happens.
@ -138,7 +138,7 @@ Typically a FusedMoEPrepareAndFinalize type is backed by an All2All Dispatch & C
#### Step 1: Add an All2All manager
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](gh-file:vllm/distributed/device_communicators/all2all.py).
The purpose of the All2All Manager is to set up the All2All kernel implementations. The `FusedMoEPrepareAndFinalize` implementations typically fetch a kernel-implementation "handle" from the All2All Manager to invoke the Dispatch and Combine functions. Please look at the All2All Manager implementations [here](../../vllm/distributed/device_communicators/all2all.py).
#### Step 2: Add a FusedMoEPrepareAndFinalize Type
@ -213,29 +213,29 @@ Please take a look at [init_prepare_finalize](https://github.com/vllm-project/vl
### How To Unit Test
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py).
We have `FusedMoEModularKernel` unit tests at [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py).
The unit test iterates through all combinations of `FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types and if they are
compatible, runs some correctness tests.
If you are adding some `FusedMoEPrepareAndFinalize` / `FusedMoEPermuteExpertsUnpermute` implementations,
1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](gh-file:tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
1. Add the implementation type to `MK_ALL_PREPARE_FINALIZE_TYPES` and `MK_FUSED_EXPERT_TYPES` in [mk_objects.py](../../tests/kernels/moe/modular_kernel_tools/mk_objects.py) respectively.
2. Update `Config::is_batched_prepare_finalize()`, `Config::is_batched_fused_experts()`, `Config::is_standard_fused_experts()`,
`Config::is_fe_16bit_supported()`, `Config::is_fe_fp8_supported()`, `Config::is_fe_block_fp8_supported()`,
`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](gh-file:tests/kernels/moe/modular_kernel_tools/common.py)
`Config::is_fe_supports_chunking()` methods in [/tests/kernels/moe/modular_kernel_tools/common.py](../../tests/kernels/moe/modular_kernel_tools/common.py)
Doing this will add the new implementation to the test suite.
### How To Check `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` Compatibility
The unit test file [test_modular_kernel_combinations.py](gh-file:tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
The unit test file [test_modular_kernel_combinations.py](../../tests/kernels/moe/test_modular_kernel_combinations.py) can also be executed as a standalone script.
Example: `python3 -m tests.kernels.moe.test_modular_kernel_combinations --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`
As a side effect, this script can be used to test `FusedMoEPrepareAndFinalize` & `FusedMoEPermuteExpertsUnpermute` compatibility. When invoked
with incompatible types, the script will error.
### How To Profile
Please take a look at [profile_modular_kernel.py](gh-file:tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
Please take a look at [profile_modular_kernel.py](../../tests/kernels/moe/modular_kernel_tools/profile_modular_kernel.py)
The script can be used to generate Torch traces for a single `FusedMoEModularKernel::forward()` call for any compatible
`FusedMoEPrepareAndFinalize` and `FusedMoEPermuteExpertsUnpermute` types.
Example: `python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts`

View File

@ -6,7 +6,7 @@ When performing an inference with IO Processor plugins, the prompt type is defin
## Writing an IO Processor Plugin
IO Processor plugins implement the `IOProcessor` interface (<gh-file:vllm/plugins/io_processors/interface.py>):
IO Processor plugins implement the [`IOProcessor`][vllm.plugins.io_processors.interface.IOProcessor] interface:
```python
IOProcessorInput = TypeVar("IOProcessorInput")
@ -67,9 +67,9 @@ The `parse_request` method is used for validating the user prompt and converting
The `pre_process*` methods take the validated plugin input to generate vLLM's model prompts for regular inference.
The `post_process*` methods take `PoolingRequestOutput` objects as input and generate a custom plugin output.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/io_processor_pooling` serving endpoint is available here <gh-file:vllm/entrypoints/openai/serving_pooling_with_io_plugin.py>.
The `output_to_response` method is used only for online serving and converts the plugin output to the `IOProcessorResponse` type that is then returned by the API Server. The implementation of the `/pooling` serving endpoint is available here [vllm/entrypoints/openai/serving_pooling.py](../../vllm/entrypoints/openai/serving_pooling.py).
An example implementation of a plugin that enables generating geotiff images with the PrithviGeospatialMAE model is available [here](https://github.com/christian-pinto/prithvi_io_processor_plugin). Please, also refer to our online (<gh-file:examples/online_serving/prithvi_geospatial_mae.py>) and offline (<gh-file:examples/offline_inference/prithvi_geospatial_mae_io_processor.py>) inference examples.
An example implementation of a plugin that enables generating GeoTIFF images with the PrithviGeospatialMAE model is available [here](https://github.com/christian-pinto/prithvi_io_processor_plugin). Please also refer to our online ([examples/online_serving/prithvi_geospatial_mae.py](../../examples/online_serving/prithvi_geospatial_mae.py)) and offline ([examples/offline_inference/prithvi_geospatial_mae_io_processor.py](../../examples/offline_inference/prithvi_geospatial_mae_io_processor.py)) inference examples.
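To make the surface area concrete, a bare-bones plugin skeleton is sketched below; the method names follow the description above, but the signatures and return types are simplified, and the real `IOProcessor` interface should be treated as authoritative:
```python
class MyIOProcessor:  # the real plugin would subclass IOProcessor from the interface module
    def parse_request(self, request):
        # Validate the raw request and convert it into the plugin's input type.
        return request

    def pre_process(self, plugin_input):
        # Turn the validated plugin input into regular vLLM model prompts.
        return [plugin_input]

    def post_process(self, pooling_outputs):
        # Convert PoolingRequestOutput objects into the plugin's custom output type.
        return pooling_outputs

    def output_to_response(self, plugin_output):
        # Online serving only: wrap the plugin output in an IOProcessorResponse.
        return plugin_output
```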
## Using an IO Processor plugin

View File

@ -80,13 +80,13 @@ The subset of metrics exposed in the Grafana dashboard gives us an indication of
- `vllm:request_decode_time_seconds` - Requests decode time.
- `vllm:request_max_num_generation_tokens` - Max generation tokens in a sequence group.
See [the PR which added this Dashboard](gh-pr:2316) for interesting and useful background on the choices made here.
See [the PR which added this Dashboard](https://github.com/vllm-project/vllm/pull/2316) for interesting and useful background on the choices made here.
### Prometheus Client Library
Prometheus support was initially added [using the aioprometheus library](gh-pr:1890), but a switch was made quickly to [prometheus_client](gh-pr:2730). The rationale is discussed in both linked PRs.
Prometheus support was initially added [using the aioprometheus library](https://github.com/vllm-project/vllm/pull/1890), but a switch was made quickly to [prometheus_client](https://github.com/vllm-project/vllm/pull/2730). The rationale is discussed in both linked PRs.
With the switch to `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](gh-pr:15657):
With the switch away from `aioprometheus`, we lost a `MetricsMiddleware` to track HTTP metrics, but this was reinstated [using prometheus_fastapi_instrumentator](https://github.com/vllm-project/vllm/pull/15657):
```bash
$ curl http://0.0.0.0:8000/metrics 2>/dev/null | grep -P '^http_(?!.*(_bucket|_created|_sum)).*'
@ -99,7 +99,7 @@ http_request_duration_seconds_count{handler="/v1/completions",method="POST"} 201
### Multi-process Mode
In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <gh-pr:7279>.
In v0, metrics are collected in the engine core process and we use multiprocess mode to make them available in the API server process. See <https://github.com/vllm-project/vllm/pull/7279>.
### Built in Python/Process Metrics
@ -125,32 +125,32 @@ vLLM instance.
For background, these are some of the relevant PRs which added the v0 metrics:
- <gh-pr:1890>
- <gh-pr:2316>
- <gh-pr:2730>
- <gh-pr:4464>
- <gh-pr:7279>
- <https://github.com/vllm-project/vllm/pull/1890>
- <https://github.com/vllm-project/vllm/pull/2316>
- <https://github.com/vllm-project/vllm/pull/2730>
- <https://github.com/vllm-project/vllm/pull/4464>
- <https://github.com/vllm-project/vllm/pull/7279>
Also note the ["Even Better Observability"](gh-issue:3616) feature where e.g. [a detailed roadmap was laid out](gh-issue:3616#issuecomment-2030858781).
Also note the ["Even Better Observability"](https://github.com/vllm-project/vllm/issues/3616) feature where e.g. [a detailed roadmap was laid out](https://github.com/vllm-project/vllm/issues/3616#issuecomment-2030858781).
## v1 Design
### v1 PRs
For background, here are the relevant v1 PRs relating to the v1
metrics issue <gh-issue:10582>:
metrics issue <https://github.com/vllm-project/vllm/issues/10582>:
- <gh-pr:11962>
- <gh-pr:11973>
- <gh-pr:10907>
- <gh-pr:12416>
- <gh-pr:12478>
- <gh-pr:12516>
- <gh-pr:12530>
- <gh-pr:12561>
- <gh-pr:12579>
- <gh-pr:12592>
- <gh-pr:12644>
- <https://github.com/vllm-project/vllm/pull/11962>
- <https://github.com/vllm-project/vllm/pull/11973>
- <https://github.com/vllm-project/vllm/pull/10907>
- <https://github.com/vllm-project/vllm/pull/12416>
- <https://github.com/vllm-project/vllm/pull/12478>
- <https://github.com/vllm-project/vllm/pull/12516>
- <https://github.com/vllm-project/vllm/pull/12530>
- <https://github.com/vllm-project/vllm/pull/12561>
- <https://github.com/vllm-project/vllm/pull/12579>
- <https://github.com/vllm-project/vllm/pull/12592>
- <https://github.com/vllm-project/vllm/pull/12644>
### Metrics Collection
@ -369,7 +369,7 @@ vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="F
However, `prometheus_client` has
[never supported Info metrics in multiprocessing mode](https://github.com/prometheus/client_python/pull/300) -
for [unclear reasons](gh-pr:7279#discussion_r1710417152). We
for [unclear reasons](https://github.com/vllm-project/vllm/pull/7279#discussion_r1710417152). We
simply use a `Gauge` metric set to 1 and
`multiprocess_mode="mostrecent"` instead.
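In `prometheus_client` terms, that workaround looks roughly like this (label names trimmed for brevity, and a recent `prometheus_client` release is assumed for the `mostrecent` mode):
```python
from prometheus_client import Gauge

# Emulate an Info metric with a Gauge pinned to 1, as described above.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype"],  # subset of labels for illustration
    multiprocess_mode="mostrecent",
)
cache_config_info.labels(block_size="16", cache_dtype="auto").set(1)
```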
@ -394,7 +394,7 @@ distinguish between per-adapter counts. This should be revisited.
Note that `multiprocess_mode="livemostrecent"` is used - the most
recent metric is used, but only from currently running processes.
This was added in <gh-pr:9477> and there is
This was added in <https://github.com/vllm-project/vllm/pull/9477> and there is
[at least one known user](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54).
If we revisit this design and deprecate the old metric, we should reduce
the need for a significant deprecation period by making the change in
@ -402,7 +402,7 @@ v0 also and asking this project to move to the new metric.
### Prefix Cache metrics
The discussion in <gh-issue:10582> about adding prefix cache metrics yielded
The discussion in <https://github.com/vllm-project/vllm/issues/10582> about adding prefix cache metrics yielded
some interesting points which may be relevant to how we approach
future metrics.
@ -439,8 +439,8 @@ suddenly (from their perspective) when it is removed, even if there is
an equivalent metric for them to use.
As an example, see how `vllm:avg_prompt_throughput_toks_per_s` was
[deprecated](gh-pr:2764) (with a comment in the code),
[removed](gh-pr:12383), and then [noticed by a user](gh-issue:13218).
[deprecated](https://github.com/vllm-project/vllm/pull/2764) (with a comment in the code),
[removed](https://github.com/vllm-project/vllm/pull/12383), and then [noticed by a user](https://github.com/vllm-project/vllm/issues/13218).
In general:
@ -460,20 +460,20 @@ the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
Added by <gh-pr:4464>, but apparently never implemented. This can just be
Added by <https://github.com/vllm-project/vllm/pull/4464>, but apparently never implemented. This can just be
removed.
### Duplicated - Queue Time
The `vllm:time_in_queue_requests` Histogram metric was added by
<gh-pr:9659> and its calculation is:
<https://github.com/vllm-project/vllm/pull/9659> and its calculation is:
```python
self.metrics.first_scheduled_time = now
self.metrics.time_in_queue = now - self.metrics.arrival_time
```
Two weeks later, <gh-pr:4464> added `vllm:request_queue_time_seconds` leaving
Two weeks later, <https://github.com/vllm-project/vllm/pull/4464> added `vllm:request_queue_time_seconds` leaving
us with:
```python
@ -513,7 +513,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
memory. This is also known as "KV cache offloading" and is configured
with `--swap-space` and `--preemption-mode`.
In v0, [vLLM has long supported beam search](gh-issue:6226). The
In v0, [vLLM has long supported beam search](https://github.com/vllm-project/vllm/issues/6226). The
SequenceGroup encapsulated the idea of N Sequences which
all shared the same prompt kv blocks. This enabled KV cache block
sharing between requests, and copy-on-write to do branching. CPU
@ -526,7 +526,7 @@ and the part of the prompt that was evicted can be recomputed.
SequenceGroup was removed in V1, although a replacement will be
required for "parallel sampling" (`n>1`).
[Beam search was moved out of the core (in V0)](gh-issue:8306). There was a
[Beam search was moved out of the core (in V0)](https://github.com/vllm-project/vllm/issues/8306). There was a
lot of complex code for a very uncommon feature.
In V1, with prefix caching being better (zero overhead) and therefore
@ -541,7 +541,7 @@ Some v0 metrics are only relevant in the context of "parallel
sampling". This is where the `n` parameter in a request is used to
request multiple completions from the same prompt.
As part of adding parallel sampling support in <gh-pr:10980>, we should
As part of adding parallel sampling support in <https://github.com/vllm-project/vllm/pull/10980>, we should
also add these metrics.
- `vllm:request_params_n` (Histogram)
@ -566,7 +566,7 @@ model and then validate those tokens with the larger model.
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
There is a PR under review (<https://github.com/vllm-project/vllm/pull/12193>) to add "prompt lookup (ngram)"
speculative decoding to v1. Other techniques will follow. We should
revisit the v0 metrics in this context.
@ -587,7 +587,7 @@ see:
- [Standardizing Large Model Server Metrics in Kubernetes](https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk)
- [Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes](https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ)
- [Inference Perf](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf)
- <gh-issue:5041> and <gh-pr:12726>.
- <https://github.com/vllm-project/vllm/issues/5041> and <https://github.com/vllm-project/vllm/pull/12726>.
This is a non-trivial topic. Consider this comment from Rob:
@ -654,7 +654,7 @@ fall under the more general heading of "Observability".
v0 has support for OpenTelemetry tracing:
- Added by <gh-pr:4687>
- Added by <https://github.com/vllm-project/vllm/pull/4687>
- Configured with `--otlp-traces-endpoint` and `--collect-detailed-traces`
- [OpenTelemetry blog post](https://opentelemetry.io/blog/2024/llm-observability/)
- [User-facing docs](../examples/online_serving/opentelemetry.md)
@ -685,7 +685,7 @@ documentation for this option states:
> use of possibly costly and or blocking operations and hence might
> have a performance impact.
The metrics were added by <gh-pr:7089> and who up in an OpenTelemetry trace
The metrics were added by <https://github.com/vllm-project/vllm/pull/7089> and show up in an OpenTelemetry trace
as:
```text

View File

@ -60,7 +60,7 @@ With the help of dummy text and automatic prompt updating, our multi-modal proce
## Processor Output Caching
Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238). To alleviate this problem, we cache the multi-modal outputs of HF processor to avoid processing the same multi-modal input (e.g. image) again.
Some HF processors, such as the one for Qwen2-VL, are [very slow](https://github.com/vllm-project/vllm/issues/9238). To alleviate this problem, we cache the multi-modal outputs of HF processor to avoid processing the same multi-modal input (e.g. image) again.
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
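The check-then-batch flow can be pictured with a short, hedged sketch (this is not vLLM's actual cache code; `key` and `hf_process_batch` are stand-ins for the real hashing and processing logic):
```python
def process_with_cache(items, cache, hf_process_batch):
    # Find the multi-modal items whose processed outputs are not cached yet.
    missing = [item for item in items if item.key not in cache]
    if missing:
        # Run all cache misses through the HF processor in a single batch, then cache them.
        for item, output in zip(missing, hf_process_batch(missing)):
            cache[item.key] = output
    # Merge cached and freshly processed outputs, preserving the original order.
    return [cache[item.key] for item in items]
```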

View File

@ -82,7 +82,7 @@ There are other miscellaneous places hard-coding the use of `spawn`:
Related PRs:
- <gh-pr:8823>
- <https://github.com/vllm-project/vllm/pull/8823>
## Prior State in v1

View File

@ -19,8 +19,8 @@ vLLM will take all the available factors into consideration, and decide a direct
The factors considered include:
- All the related configs (see the `compute_hash` functions in their respective configs in the [config folder](gh-file:vllm/config))
- PyTorch configs (see the `compute_hash` functions in the [compiler_interface.py](gh-file:vllm/compilation/compiler_interface.py))
- All the related configs (see the `compute_hash` functions in their respective configs in the [config folder](../../vllm/config))
- PyTorch configs (see the `compute_hash` functions in the [compiler_interface.py](../../vllm/compilation/compiler_interface.py))
- The model's forward function and the relevant functions called by the forward function (see below)
With all these factors taken into consideration, usually we can guarantee that the cache is safe to use, and will not cause any unexpected behavior. Therefore, the cache is enabled by default. If you want to debug the compilation process, or if you suspect the cache is causing some issues, you can disable it by setting the environment variable `VLLM_DISABLE_COMPILE_CACHE=1`.
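For example, disabling the cache for a single debugging run can be done from the environment before vLLM is imported (the model name is a placeholder):
```python
import os

# Skip the compilation cache for this run, e.g. while debugging a compile issue.
os.environ["VLLM_DISABLE_COMPILE_CACHE"] = "1"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
```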

View File

@ -44,15 +44,15 @@ th:not(:first-child) {
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | |
| [pooling](../models/pooling_models.md) | 🟠\* | 🟠\* | ✅ | ❌ | ✅ | ✅ | | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](gh-issue:7366) | ❌ | [](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | | |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ❌ | [](https://github.com/vllm-project/vllm/issues/7366) | ✅ | ✅ | ✅ | | | | | | | | |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | | |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | | |
| multi-step | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | | |
| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](gh-pr:4194)<sup>^</sup> | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | |
| best-of | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ✅ | ✅ | | |
| beam-search | ✅ | ✅ | ✅ | [](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](gh-issue:7968) | ❔ | ✅ | ✅ | |
| [prompt-embeds](prompt_embeds.md) | ✅ | [](gh-issue:25096) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❔ | ❔ | ❌ | ❔ | ❔ | ✅ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | [🟠](https://github.com/vllm-project/vllm/pull/4194)<sup>^</sup> | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | | |
| best-of | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](https://github.com/vllm-project/vllm/issues/7968) | ✅ | ✅ | | |
| beam-search | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [](https://github.com/vllm-project/vllm/issues/7968) | ❔ | ✅ | ✅ | |
| [prompt-embeds](prompt_embeds.md) | ✅ | [](https://github.com/vllm-project/vllm/issues/25096) | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❔ | ❔ | ❌ | ❔ | ❔ | ✅ |
\* Chunked prefill and prefix caching are only applicable to last-token pooling.
<sup>^</sup> LoRA is only applicable to the language backbone of multimodal models.
@ -63,18 +63,18 @@ th:not(:first-child) {
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | Intel GPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| ------------|
| [CP][chunked-prefill] | [](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [CP][chunked-prefill] | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](gh-issue:26963) |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [](gh-issue:26970) |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26963) |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](gh-issue:26965) |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26965) |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](gh-issue:8477) | ✅ | ❌ | ✅ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ❌ | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ? | [](gh-issue:25097) | ✅ |
| [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ? | [](https://github.com/vllm-project/vllm/issues/25097) | ✅ |

View File

@ -11,7 +11,7 @@ Automatic Prefix Caching (APC in short) caches the KV cache of existing queries,
Set `enable_prefix_caching=True` in vLLM engine to enable APC. Here is an example:
<gh-file:examples/offline_inference/automatic_prefix_caching.py>
[examples/offline_inference/automatic_prefix_caching.py](../../examples/offline_inference/automatic_prefix_caching.py)
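For reference, a minimal sketch of the pattern (the model name and prompts are illustrative):

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching so requests that share a long common prefix
# can reuse its KV cache instead of recomputing it.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. Here is a long reference document: ..."
prompts = [shared_prefix + "\nQuestion 1: ...", shared_prefix + "\nQuestion 2: ..."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=64))
```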
## Example workloads

View File

@ -17,14 +17,14 @@ Two main reasons:
## Usage example
Please refer to <gh-file:examples/online_serving/disaggregated_prefill.sh> for the example usage of disaggregated prefilling.
Please refer to [examples/online_serving/disaggregated_prefill.sh](../../examples/online_serving/disaggregated_prefill.sh) for the example usage of disaggregated prefilling.
Now supports 5 types of connectors:
- **SharedStorageConnector**: refer to <gh-file:examples/offline_inference/disaggregated-prefill-v1/run.sh> for the example usage of SharedStorageConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to <gh-file:examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh> for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to <gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh> for the example usage of NixlConnector disaggregated prefilling which support fully async send/recv. For detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to <gh-file:examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh> for the example usage of P2pNcclConnector disaggregated prefilling.
- **SharedStorageConnector**: refer to [examples/offline_inference/disaggregated-prefill-v1/run.sh](../../examples/offline_inference/disaggregated-prefill-v1/run.sh) for the example usage of SharedStorageConnector disaggregated prefilling.
- **LMCacheConnectorV1**: refer to [examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh](../../examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_example_nixl.sh) for the example usage of LMCacheConnectorV1 disaggregated prefilling which uses NIXL as the underlying KV transmission.
- **NixlConnector**: refer to [tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh) for the example usage of NixlConnector disaggregated prefilling, which supports fully async send/recv. For a detailed usage guide, see [NixlConnector Usage Guide](nixl_connector_usage.md).
- **P2pNcclConnector**: refer to [examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh](../../examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_example_p2p_nccl_xpyd.sh) for the example usage of P2pNcclConnector disaggregated prefilling.
- **MultiConnector**: takes advantage of the `kv_connector_extra_config: dict[str, Any]` already present in `KVTransferConfig` to stash all the connectors we want in an ordered list of kwargs, such as:
```bash
@ -45,7 +45,7 @@ For NixlConnector, you may also specify one or multiple NIXL_Backend. Such as:
## Benchmarks
Please refer to <gh-file:benchmarks/disagg_benchmarks> for disaggregated prefilling benchmarks.
Please refer to [benchmarks/disagg_benchmarks](../../benchmarks/disagg_benchmarks) for disaggregated prefilling benchmarks.
## Development

View File

@ -47,7 +47,7 @@ the third parameter is the path to the LoRA adapter.
)
```
Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
Check out [examples/offline_inference/multilora_inference.py](../../examples/offline_inference/multilora_inference.py) for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters

View File

@ -3,7 +3,7 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
!!! note
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
!!! tip
@ -129,7 +129,7 @@ You can pass a single image to the `'image'` field of the multi-modal dictionary
print(generated_text)
```
Full example: <gh-file:examples/offline_inference/vision_language.py>
Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
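As a hedged sketch of the single-image pattern described above (the model is illustrative and the `<image>` placeholder is model-specific):

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # illustrative model

image = Image.open("example.jpg")
outputs = llm.generate({
    # The "<image>" placeholder below follows LLaVA-style prompts;
    # check your model's chat template for the correct format.
    "prompt": "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```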
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@ -162,7 +162,7 @@ To substitute multiple images inside the same text prompt, you can pass in a lis
print(generated_text)
```
Full example: <gh-file:examples/offline_inference/vision_language_multi_image.py>
Full example: [examples/offline_inference/vision_language_multi_image.py](../../examples/offline_inference/vision_language_multi_image.py)
If using the [LLM.chat](../models/generative_models.md#llmchat) method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:
@ -346,13 +346,13 @@ Instead of NumPy arrays, you can also pass `'torch.Tensor'` instances, as shown
!!! note
'process_vision_info' is only applicable to Qwen2.5-VL and similar models.
Full example: <gh-file:examples/offline_inference/vision_language.py>
Full example: [examples/offline_inference/vision_language.py](../../examples/offline_inference/vision_language.py)
### Audio Inputs
You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
Full example: <gh-file:examples/offline_inference/audio_language.py>
Full example: [examples/offline_inference/audio_language.py](../../examples/offline_inference/audio_language.py)
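A similar hedged sketch for audio (the model name and audio placeholder are illustrative):

```python
import librosa
from vllm import LLM

llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b")  # illustrative audio-capable model

# librosa returns (waveform, sampling_rate), which matches the expected tuple format.
audio, sample_rate = librosa.load("speech.wav", sr=None)
outputs = llm.generate({
    "prompt": "<|audio|>\nTranscribe the audio clip.",  # placeholder token is model-specific
    "multi_modal_data": {"audio": (audio, sample_rate)},
})
print(outputs[0].outputs[0].text)
```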
### Embedding Inputs
@ -434,11 +434,11 @@ Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions
A chat template is **required** to use Chat Completions API.
For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
If no default chat template is available, we will first look for a built-in fallback in <gh-file:vllm/transformers_utils/chat_templates/registry.py>.
If no default chat template is available, we will first look for a built-in fallback in [vllm/transformers_utils/chat_templates/registry.py](../../vllm/transformers_utils/chat_templates/registry.py).
If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
For certain models, we provide alternative chat templates inside <gh-dir:examples>.
For example, VLM2Vec uses <gh-file:examples/template_vlm2vec_phi3v.jinja> which is different from the default one for Phi-3-Vision.
For certain models, we provide alternative chat templates inside [examples](../../examples).
For example, VLM2Vec uses [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja) which is different from the default one for Phi-3-Vision.
### Image Inputs
@ -524,7 +524,7 @@ Then, you can use the OpenAI client as follows:
print("Chat completion output:", chat_response.choices[0].message.content)
```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! tip
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
@ -595,7 +595,7 @@ Then, you can use the OpenAI client as follows:
print("Chat completion output from image url:", result)
```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
@ -719,7 +719,7 @@ Alternatively, you can pass `audio_url`, which is the audio counterpart of `imag
print("Chat completion output from audio url:", result)
```
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
Full example: [examples/online_serving/openai_chat_completion_client_for_multimodal.py](../../examples/online_serving/openai_chat_completion_client_for_multimodal.py)
!!! note
By default, the timeout for fetching audios through HTTP URL is `10` seconds.

View File

@ -9,7 +9,7 @@ NixlConnector is a high-performance KV cache transfer connector for vLLM's disag
As a quick start, install the NIXL library with `uv pip install nixl`.
- Refer to [NIXL official repository](https://github.com/ai-dynamo/nixl) for more installation instructions
- The specified required NIXL version can be found in [requirements/kv_connectors.txt](gh-file:requirements/kv_connectors.txt) and other relevant config files
- The specified required NIXL version can be found in [requirements/kv_connectors.txt](../../requirements/kv_connectors.txt) and other relevant config files
For non-CUDA platforms, please install NIXL with UCX built from source, as instructed below.
@ -170,6 +170,6 @@ Support use case: Prefill with 'HND' and decode with 'NHD' with experimental con
Refer to these example scripts in the vLLM repository:
- [run_accuracy_test.sh](gh-file:tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
- [toy_proxy_server.py](gh-file:tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
- [test_accuracy.py](gh-file:tests/v1/kv_connector/nixl_integration/test_accuracy.py)
- [run_accuracy_test.sh](../../tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh)
- [toy_proxy_server.py](../../tests/v1/kv_connector/nixl_integration/toy_proxy_server.py)
- [test_accuracy.py](../../tests/v1/kv_connector/nixl_integration/test_accuracy.py)
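For quick reference, a hedged sketch of how such an instance is typically launched (model name illustrative; exact config fields may vary between versions):

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
```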

View File

@ -16,7 +16,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.EmbedsPrompt][]:
You can pass prompt embeddings from Hugging Face Transformers models to the `'prompt_embeds'` field of the prompt embedding dictionary, as shown in the following examples:
<gh-file:examples/offline_inference/prompt_embed_inference.py>
[examples/offline_inference/prompt_embed_inference.py](../../examples/offline_inference/prompt_embed_inference.py)
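A minimal sketch of that pattern, assuming a vLLM build with prompt-embedding support (the model name and the `enable_prompt_embeds` flag are assumptions to verify against your version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

model_name = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

token_ids = tokenizer("Please tell me about the capital of France.",
                      return_tensors="pt").input_ids
with torch.no_grad():
    # Turn token IDs into input embeddings using the HF model's embedding table.
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

llm = LLM(model=model_name, enable_prompt_embeds=True)  # flag name assumed
outputs = llm.generate({"prompt_embeds": prompt_embeds})
print(outputs[0].outputs[0].text)
```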
## Online Serving
@ -37,4 +37,4 @@ vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
Then, you can use the OpenAI client as follows:
<gh-file:examples/online_serving/prompt_embed_inference_with_openai_client.py>
[examples/online_serving/prompt_embed_inference_with_openai_client.py](../../examples/online_serving/prompt_embed_inference_with_openai_client.py)

View File

@ -64,4 +64,4 @@ th:not(:first-child) {
!!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.

View File

@ -196,7 +196,7 @@ The reasoning content is also available when both tool calling and the reasoning
print(f"Arguments: {tool_call.arguments}")
```
For more examples, please refer to <gh-file:examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py>.
For more examples, please refer to [examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py](../../examples/online_serving/openai_chat_completion_tool_calls_with_reasoning.py).
## Limitations
@ -204,7 +204,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
## How to support a new reasoning model
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
You can add a new `ReasoningParser` similar to [vllm/reasoning/deepseek_r1_reasoning_parser.py](../../vllm/reasoning/deepseek_r1_reasoning_parser.py).
??? code
@ -264,7 +264,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
"""
```
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in [vllm/reasoning/deepseek_r1_reasoning_parser.py](../../vllm/reasoning/deepseek_r1_reasoning_parser.py).
??? code

View File

@ -3,7 +3,7 @@
!!! warning
Please note that speculative decoding in vLLM is not yet optimized and does
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
The work to optimize it is ongoing and can be followed here: <gh-issue:4630>
The work to optimize it is ongoing and can be followed here: <https://github.com/vllm-project/vllm/issues/4630>
!!! warning
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
@ -183,7 +183,7 @@ A variety of speculative models of this type are available on HF hub:
## Speculating using EAGLE based draft models
The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](../../examples/offline_inference/spec_decode.py).
??? code
@ -218,8 +218,8 @@ an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https
A few important things to consider when using the EAGLE based draft models:
1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) should
be able to be loaded and used directly by vLLM after <gh-pr:12304>.
If you are using vllm version before <gh-pr:12304>, please use the
be able to be loaded and used directly by vLLM after <https://github.com/vllm-project/vllm/pull/12304>.
If you are using vllm version before <https://github.com/vllm-project/vllm/pull/12304>, please use the
[script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) to convert the speculative model,
and specify `"model": "path/to/modified/eagle/model"` in `speculative_config`. If weight-loading problems still occur when using the latest version of vLLM, please leave a comment or raise an issue.
@ -229,7 +229,7 @@ A few important things to consider when using the EAGLE based draft models:
3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
investigation and tracked here: <gh-issue:9565>.
investigation and tracked here: <https://github.com/vllm-project/vllm/issues/9565>.
4. When using EAGLE-3 based draft model, option "method" must be set to "eagle3".
That is, to specify `"method": "eagle3"` in `speculative_config`.
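A hedged offline sketch of such a configuration (target and draft model names are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",            # target model (illustrative)
    speculative_config={
        "method": "eagle3",
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",   # draft model (illustrative)
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate("The future of AI is", SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```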
@ -267,7 +267,7 @@ speculative decoding, breaking down the guarantees into three key areas:
> distribution. [View Test Code](https://github.com/vllm-project/vllm/blob/47b65a550866c7ffbd076ecb74106714838ce7da/tests/samplers/test_rejection_sampler.py#L252)
> - **Greedy Sampling Equality**: Confirms that greedy sampling with speculative decoding matches greedy sampling
> without it. This verifies that vLLM's speculative decoding framework, when integrated with the vLLM forward pass and the vLLM rejection sampler,
> provides a lossless guarantee. Almost all of the tests in <gh-dir:tests/spec_decode/e2e>.
> provides a lossless guarantee. Almost all of the tests in [tests/spec_decode/e2e](../../tests/spec_decode/e2e).
> verify this property using [this assertion implementation](https://github.com/vllm-project/vllm/blob/b67ae00cdbbe1a58ffc8ff170f0c8d79044a684a/tests/spec_decode/e2e/conftest.py#L291)
3. **vLLM Logprob Stability**
@ -289,4 +289,4 @@ For mitigation strategies, please refer to the FAQ entry *Can the output of a pr
- [A Hacker's Guide to Speculative Decoding in vLLM](https://www.youtube.com/watch?v=9wNAgpX6z_4)
- [What is Lookahead Scheduling in vLLM?](https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a)
- [Information on batch expansion](https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8)
- [Dynamic speculative decoding](gh-issue:4565)
- [Dynamic speculative decoding](https://github.com/vllm-project/vllm/issues/4565)

View File

@ -298,7 +298,7 @@ Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equa
Answer: x = -29/8
```
An example of using `structural_tag` can be found here: <gh-file:examples/online_serving/structured_outputs>
An example of using `structural_tag` can be found here: [examples/online_serving/structured_outputs](../../examples/online_serving/structured_outputs)
## Offline Inference

View File

@ -151,9 +151,9 @@ Known issues:
much shorter than what vLLM generates. Since an exception is thrown when this condition
is not met, the following additional chat templates are provided:
* <gh-file:examples/tool_chat_template_mistral.jinja> - this is the "official" Mistral chat template, but tweaked so that
* [examples/tool_chat_template_mistral.jinja](../../examples/tool_chat_template_mistral.jinja) - this is the "official" Mistral chat template, but tweaked so that
it works with vLLM's tool call IDs (provided `tool_call_id` fields are truncated to the last 9 digits)
* <gh-file:examples/tool_chat_template_mistral_parallel.jinja> - this is a "better" version that adds a tool-use system prompt
* [examples/tool_chat_template_mistral_parallel.jinja](../../examples/tool_chat_template_mistral_parallel.jinja) - this is a "better" version that adds a tool-use system prompt
when tools are provided, that results in much better reliability when working with parallel tool calling.
Recommended flags:
@ -187,16 +187,16 @@ Known issues:
vLLM provides two JSON-based chat templates for Llama 3.1 and 3.2:
* <gh-file:examples/tool_chat_template_llama3.1_json.jinja> - this is the "official" chat template for the Llama 3.1
* [examples/tool_chat_template_llama3.1_json.jinja](../../examples/tool_chat_template_llama3.1_json.jinja) - this is the "official" chat template for the Llama 3.1
models, but tweaked so that it works better with vLLM.
* <gh-file:examples/tool_chat_template_llama3.2_json.jinja> - this extends upon the Llama 3.1 chat template by adding support for
* [examples/tool_chat_template_llama3.2_json.jinja](../../examples/tool_chat_template_llama3.2_json.jinja) - this extends upon the Llama 3.1 chat template by adding support for
images.
Recommended flags: `--tool-call-parser llama3_json --chat-template {see_above}`
vLLM also provides a pythonic and JSON-based chat template for Llama 4, but pythonic tool calling is recommended:
* <gh-file:examples/tool_chat_template_llama4_pythonic.jinja> - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
* [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja) - this is based on the [official chat template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/) for the Llama 4 models.
For Llama 4 model, use `--tool-call-parser llama4_pythonic --chat-template examples/tool_chat_template_llama4_pythonic.jinja`.
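As a hedged sketch, a complete serve command for this parser might look like the following (model name illustrative):

```bash
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama4_pythonic \
    --chat-template examples/tool_chat_template_llama4_pythonic.jinja
```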
@ -212,7 +212,7 @@ Supported models:
Recommended flags: `--tool-call-parser granite --chat-template examples/tool_chat_template_granite.jinja`
<gh-file:examples/tool_chat_template_granite.jinja>: this is a modified chat template from the original on Hugging Face. Parallel function calls are supported.
[examples/tool_chat_template_granite.jinja](../../examples/tool_chat_template_granite.jinja): this is a modified chat template from the original on Hugging Face. Parallel function calls are supported.
* `ibm-granite/granite-3.1-8b-instruct`
@ -224,7 +224,7 @@ Supported models:
Recommended flags: `--tool-call-parser granite-20b-fc --chat-template examples/tool_chat_template_granite_20b_fc.jinja`
<gh-file:examples/tool_chat_template_granite_20b_fc.jinja>: this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
[examples/tool_chat_template_granite_20b_fc.jinja](../../examples/tool_chat_template_granite_20b_fc.jinja): this is a modified chat template from the original on Hugging Face, which is not vLLM-compatible. It blends function description elements from the Hermes template and follows the same system prompt as "Response Generation" mode from [the paper](https://arxiv.org/abs/2407.00121). Parallel function calls are supported.
### InternLM Models (`internlm`)
@ -282,8 +282,8 @@ Flags: `--tool-call-parser hermes`
Supported models:
* `MiniMaxAi/MiniMax-M1-40k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
* `MiniMaxAi/MiniMax-M1-80k` (use with <gh-file:examples/tool_chat_template_minimax_m1.jinja>)
* `MiniMaxAi/MiniMax-M1-40k` (use with [examples/tool_chat_template_minimax_m1.jinja](../../examples/tool_chat_template_minimax_m1.jinja))
* `MiniMaxAi/MiniMax-M1-80k` (use with [examples/tool_chat_template_minimax_m1.jinja](../../examples/tool_chat_template_minimax_m1.jinja))
Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_minimax_m1.jinja`
@ -291,8 +291,8 @@ Flags: `--tool-call-parser minimax --chat-template examples/tool_chat_template_m
Supported models:
* `deepseek-ai/DeepSeek-V3-0324` (use with <gh-file:examples/tool_chat_template_deepseekv3.jinja>)
* `deepseek-ai/DeepSeek-R1-0528` (use with <gh-file:examples/tool_chat_template_deepseekr1.jinja>)
* `deepseek-ai/DeepSeek-V3-0324` (use with [examples/tool_chat_template_deepseekv3.jinja](../../examples/tool_chat_template_deepseekv3.jinja))
* `deepseek-ai/DeepSeek-R1-0528` (use with [examples/tool_chat_template_deepseekr1.jinja](../../examples/tool_chat_template_deepseekr1.jinja))
Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
@ -300,7 +300,7 @@ Flags: `--tool-call-parser deepseek_v3 --chat-template {see_above}`
Supported models:
* `deepseek-ai/DeepSeek-V3.1` (use with <gh-file:examples/tool_chat_template_deepseekv31.jinja>)
* `deepseek-ai/DeepSeek-V3.1` (use with [examples/tool_chat_template_deepseekv31.jinja](../../examples/tool_chat_template_deepseekv31.jinja))
Flags: `--tool-call-parser deepseek_v31 --chat-template {see_above}`
@ -379,12 +379,12 @@ Limitations:
Example supported models:
* `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama3.2_pythonic.jinja>)
* `Team-ACE/ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with <gh-file:examples/tool_chat_template_toolace.jinja>)
* `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with <gh-file:examples/tool_chat_template_llama4_pythonic.jinja>)
* `meta-llama/Llama-3.2-1B-Instruct` ⚠️ (use with [examples/tool_chat_template_llama3.2_pythonic.jinja](../../examples/tool_chat_template_llama3.2_pythonic.jinja))
* `meta-llama/Llama-3.2-3B-Instruct` ⚠️ (use with [examples/tool_chat_template_llama3.2_pythonic.jinja](../../examples/tool_chat_template_llama3.2_pythonic.jinja))
* `Team-ACE/ToolACE-8B` (use with [examples/tool_chat_template_toolace.jinja](../../examples/tool_chat_template_toolace.jinja))
* `fixie-ai/ultravox-v0_4-ToolACE-8B` (use with [examples/tool_chat_template_toolace.jinja](../../examples/tool_chat_template_toolace.jinja))
* `meta-llama/Llama-4-Scout-17B-16E-Instruct` ⚠️ (use with [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja))
* `meta-llama/Llama-4-Maverick-17B-128E-Instruct` ⚠️ (use with [examples/tool_chat_template_llama4_pythonic.jinja](../../examples/tool_chat_template_llama4_pythonic.jinja))
Flags: `--tool-call-parser pythonic --chat-template {see_above}`
@ -393,7 +393,7 @@ Flags: `--tool-call-parser pythonic --chat-template {see_above}`
## How to Write a Tool Parser Plugin
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in <gh-file:vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py>.
A tool parser plugin is a Python file containing one or more ToolParser implementations. You can write a ToolParser similar to the `Hermes2ProToolParser` in [vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py](../../vllm/entrypoints/openai/tool_parsers/hermes_tool_parser.py).
Here is a summary of a plugin file:

View File

@ -4,19 +4,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:installation"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:installation"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:installation"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:installation"
## Requirements
@ -24,19 +24,19 @@ vLLM is a Python library that supports the following CPU variants. Select your C
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:requirements"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:requirements"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:requirements"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:requirements"
## Set up using Python
@ -52,19 +52,19 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-wheel-from-source"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-wheel-from-source"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.apple.inc.md:build-wheel-from-source"
=== "IBM Z (s390x)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-wheel-from-source"
## Set up using Docker
@ -72,24 +72,24 @@ Currently, there are no pre-built CPU wheels.
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:pre-built-images"
### Build image from source
=== "Intel/AMD x86"
--8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.x86.inc.md:build-image-from-source"
=== "ARM AArch64"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "Apple silicon"
--8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.arm.inc.md:build-image-from-source"
=== "IBM Z (S390X)"
--8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/cpu.s390x.inc.md:build-image-from-source"
## Related runtime environment variables

View File

@ -157,7 +157,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
### Build image from source
You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.
You can use [docker/Dockerfile.tpu](../../../docker/Dockerfile.tpu) to build a Docker image with TPU support.
```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu .

View File

@ -11,7 +11,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
# --8<-- [start:set-up-using-python]
!!! note
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <gh-issue:8420> for more details.
PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.

View File

@ -4,15 +4,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:installation"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:installation"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
## Requirements
@ -24,15 +24,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:requirements"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:requirements"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
## Set up using Python
@ -42,29 +42,29 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:set-up-using-python"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:set-up-using-python"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:set-up-using-python"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
### Pre-built wheels
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-wheels"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-wheels"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
[](){ #build-from-source }
@ -72,15 +72,15 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-wheel-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-wheel-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
## Set up using Docker
@ -88,40 +88,40 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:pre-built-images"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:pre-built-images"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
### Build image from source
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:build-image-from-source"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:build-image-from-source"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
## Supported features
=== "NVIDIA CUDA"
--8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.cuda.inc.md:supported-features"
=== "AMD ROCm"
--8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.rocm.inc.md:supported-features"
=== "Intel XPU"
--8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"

View File

@ -146,7 +146,7 @@ Building the Docker image from source is the recommended way to use vLLM with RO
#### (Optional) Build an image with ROCm software stack
Build a docker image from <gh-file:docker/Dockerfile.rocm_base> which setup ROCm software stack needed by the vLLM.
Build a docker image from [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base) which sets up the ROCm software stack needed by vLLM.
**This step is optional as this rocm_base image is usually prebuilt and stored at [Docker Hub](https://hub.docker.com/r/rocm/vllm-dev) under tag `rocm/vllm-dev:base` to speed up the user experience.**
If you choose to build this rocm_base image yourself, the steps are as follows.
@ -170,7 +170,7 @@ DOCKER_BUILDKIT=1 docker build \
#### Build an image with vLLM
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
First, build a docker image from [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) and launch a docker container from the image.
It is important that the user kicks off the docker build using BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the `docker build` command, or set up BuildKit in the Docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
```bash
@ -181,10 +181,10 @@ It is important that the user kicks off the docker build using buildkit. Either
}
```
<gh-file:docker/Dockerfile.rocm> uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
[docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm) uses ROCm 6.3 by default, but also supports ROCm 5.7, 6.0, 6.1, and 6.2, in older vLLM branches.
It provides flexibility to customize the build of the Docker image using the following arguments:
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using <gh-file:docker/Dockerfile.rocm_base>
- `BASE_IMAGE`: specifies the base image used when running `docker build`. The default value `rocm/vllm-dev:base` is an image published and maintained by AMD. It is being built using [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base)
- `ARG_PYTORCH_ROCM_ARCH`: Allows overriding the gfx architecture values from the base Docker image
Their values can be passed in when running `docker build` with `--build-arg` options.

View File

@ -75,7 +75,7 @@ vllm serve facebook/opt-13b \
-tp=8
```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the <gh-file:examples/online_serving/run_cluster.sh> helper script.
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
# --8<-- [end:supported-features]
# --8<-- [start:distributed-backend]

View File

@ -46,7 +46,7 @@ uv pip install vllm --torch-backend=auto
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py>
With vLLM installed, you can start generating texts for a list of input prompts (i.e. offline batch inference). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
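For context, the example boils down to a pattern like this (prompts and model are illustrative):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```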
@ -201,7 +201,7 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
print("Completion result:", completion)
```
A more detailed client example can be found here: <gh-file:examples/online_serving/openai_completion_client.py>
A more detailed client example can be found here: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
### OpenAI Chat Completions API with vLLM
@ -253,4 +253,4 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
!!! warning
There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see <gh-file:docker/Dockerfile> for instructions on how to install it.
There are no pre-built vLLM wheels containing FlashInfer, so you must install it in your environment first. Refer to the [FlashInfer official docs](https://docs.flashinfer.ai/) or see [docker/Dockerfile](../../docker/Dockerfile) for instructions on how to install it.

View File

@ -137,13 +137,20 @@ class Example:
gh_file = (self.main_file.parent / relative_path).resolve()
gh_file = gh_file.relative_to(ROOT_DIR)
return f"[{link_text}](gh-file:{gh_file})"
# Make GitHub URL
url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
gh_url = f"{url}/{gh_file}"
return f"[{link_text}]({gh_url})"
return re.sub(link_pattern, replace_link, content)
def generate(self) -> str:
content = f"# {self.title}\n\n"
content += f"Source <gh-file:{self.path.relative_to(ROOT_DIR)}>.\n\n"
url = "https://github.com/vllm-project/vllm/"
url += "tree/main" if self.path.is_dir() else "blob/main"
content += f"Source <{url}/{self.path.relative_to(ROOT_DIR)}>.\n\n"
# Use long code fence to avoid issues with
# included files containing code fences too

View File

@ -1,123 +1,95 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
This is basically a port of MyST parsers external URL resolution mechanism
(https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#customising-external-url-resolution)
to work with MkDocs.
MkDocs hook to enable the following links to render correctly:
It allows Markdown authors to use GitHub shorthand links like:
- [Text](gh-issue:123)
- <gh-pr:456>
- [File](gh-file:path/to/file.py#L10)
These are automatically rewritten into fully qualified GitHub URLs pointing to
issues, pull requests, files, directories, or projects in the
`vllm-project/vllm` repository.
- Relative file links outside of the `docs/` directory, e.g.:
- [Text](../some_file.py)
- [Directory](../../some_directory/)
- GitHub URLs for issues, pull requests, and projects, e.g.:
- Adds GitHub icon before links
- Replaces raw links with descriptive text,
e.g. <...pull/123> -> [Pull Request #123](.../pull/123)
- Works for external repos too by including the `owner/repo` in the link title
The goal is to simplify cross-referencing common GitHub resources
in project docs.
"""
from pathlib import Path
import regex as re
from mkdocs.config.defaults import MkDocsConfig
from mkdocs.structure.files import Files
from mkdocs.structure.pages import Page
ROOT_DIR = Path(__file__).parent.parent.parent.parent.resolve()
DOC_DIR = ROOT_DIR / "docs"
gh_icon = ":octicons-mark-github-16:"
# Regex pieces
TITLE = r"(?P<title>[^\[\]<>]+?)"
REPO = r"(?P<repo>.+?/.+?)"
TYPE = r"(?P<type>issues|pull|projects)"
NUMBER = r"(?P<number>\d+)"
FRAGMENT = r"(?P<fragment>#[^\s]+)?"
URL = f"https://github.com/{REPO}/{TYPE}/{NUMBER}{FRAGMENT}"
RELATIVE = r"(?!(https?|ftp)://|#)(?P<path>[^\s]+?)"
# Common titles to use for GitHub links when none is provided in the link.
TITLES = {"issues": "Issue ", "pull": "Pull Request ", "projects": "Project "}
# Regex to match GitHub issue, PR, and project links with optional titles.
github_link = re.compile(rf"(\[{TITLE}\]\(|<){URL}(\)|>)")
# Regex to match relative file links with optional titles.
relative_link = re.compile(rf"\[{TITLE}\]\({RELATIVE}\)")
def on_page_markdown(
markdown: str, *, page: Page, config: MkDocsConfig, files: Files
) -> str:
"""
Custom MkDocs plugin hook to rewrite special GitHub reference links
in Markdown.
This function scans the given Markdown content for specially formatted
GitHub shorthand links, such as:
- `[Link text](gh-issue:123)`
- `<gh-pr:456>`
And rewrites them into fully-qualified GitHub URLs with GitHub icons:
- `[:octicons-mark-github-16: Link text](https://github.com/vllm-project/vllm/issues/123)`
- `[:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)`
Supported shorthand types:
- `gh-issue`
- `gh-pr`
- `gh-project`
- `gh-dir`
- `gh-file`
Args:
markdown (str): The raw Markdown content of the page.
page (Page): The MkDocs page object being processed.
config (MkDocsConfig): The MkDocs site configuration.
files (Files): The collection of files in the MkDocs build.
Returns:
str: The updated Markdown content with GitHub shorthand links replaced.
"""
gh_icon = ":octicons-mark-github-16:"
gh_url = "https://github.com"
repo_url = f"{gh_url}/vllm-project/vllm"
org_url = f"{gh_url}/orgs/vllm-project"
# Mapping of shorthand types to their corresponding GitHub base URLs
urls = {
"issue": f"{repo_url}/issues",
"pr": f"{repo_url}/pull",
"project": f"{org_url}/projects",
"dir": f"{repo_url}/tree/main",
"file": f"{repo_url}/blob/main",
}
# Default title prefixes for auto links
titles = {
"issue": "Issue #",
"pr": "Pull Request #",
"project": "Project #",
"dir": "",
"file": "",
}
# Regular expression to match GitHub shorthand links
scheme = r"gh-(?P<type>.+?):(?P<path>.+?)(#(?P<fragment>.+?))?"
inline_link = re.compile(r"\[(?P<title>[^\[]+?)\]\(" + scheme + r"\)")
auto_link = re.compile(f"<{scheme}>")
def replace_inline_link(match: re.Match) -> str:
"""
Replaces a matched inline-style GitHub shorthand link
with a full Markdown link.
Example:
[My issue](gh-issue:123) -> [:octicons-mark-github-16: My issue](https://github.com/vllm-project/vllm/issues/123)
"""
url = f"{urls[match.group('type')]}/{match.group('path')}"
if fragment := match.group("fragment"):
url += f"#{fragment}"
return f"[{gh_icon} {match.group('title')}]({url})"
def replace_auto_link(match: re.Match) -> str:
"""
Replaces a matched autolink-style GitHub shorthand
with a full Markdown link.
Example:
<gh-pr:456> -> [:octicons-mark-github-16: Pull Request #456](https://github.com/vllm-project/vllm/pull/456)
"""
type = match.group("type")
def replace_relative_link(match: re.Match) -> str:
"""Replace relative file links with URLs if they point outside the docs dir."""
title = match.group("title")
path = match.group("path")
title = f"{titles[type]}{path}"
url = f"{urls[type]}/{path}"
if fragment := match.group("fragment"):
url += f"#{fragment}"
path = (Path(page.file.abs_src_path).parent / path).resolve()
# Check if the path exists and is outside the docs dir
if not path.exists() or path.is_relative_to(DOC_DIR):
return match.group(0)
# Files and directories have different URL schemes on GitHub
slug = "tree/main" if path.is_dir() else "blob/main"
path = path.relative_to(ROOT_DIR)
url = f"https://github.com/vllm-project/vllm/{slug}/{path}"
return f"[{gh_icon} {title}]({url})"
# Replace both inline and autolinks
markdown = inline_link.sub(replace_inline_link, markdown)
markdown = auto_link.sub(replace_auto_link, markdown)
def replace_github_link(match: re.Match) -> str:
"""Replace GitHub issue, PR, and project links with enhanced Markdown links."""
repo = match.group("repo")
type = match.group("type")
number = match.group("number")
# Title and fragment could be None
title = match.group("title") or ""
fragment = match.group("fragment") or ""
# Use default titles for raw links
if not title:
title = TITLES[type]
if "vllm-project" not in repo:
title += repo
title += f"#{number}"
url = f"https://github.com/{repo}/{type}/{number}{fragment}"
return f"[{gh_icon} {title}]({url})"
markdown = relative_link.sub(replace_relative_link, markdown)
markdown = github_link.sub(replace_github_link, markdown)
if "interface" in str(page.file.abs_src_path):
print(markdown)
return markdown

View File

@ -82,7 +82,7 @@ vllm serve /path/to/sharded/model \
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
```
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
To create sharded model files, you can use the script provided in [examples/offline_inference/save_sharded_state.py](../../../examples/offline_inference/save_sharded_state.py). This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:

View File

@ -59,7 +59,7 @@ for output in outputs:
By default, vLLM will use the sampling parameters recommended by the model creator, by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results if [SamplingParams][vllm.SamplingParams] is not specified.
However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
A code example can be found here: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
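For example, a minimal sketch (model name illustrative):

```python
from vllm import LLM

# Ignore the model repository's generation_config.json and fall back to
# vLLM's own default sampling parameters.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", generation_config="vllm")
```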
### `LLM.beam_search`
@ -121,7 +121,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>
A code example can be found here: [examples/offline_inference/basic/chat.py](../../examples/offline_inference/basic/chat.py)
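A hedged minimal sketch of the `LLM.chat` pattern (model name illustrative):

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # illustrative

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a haiku about GPUs."},
]
outputs = llm.chat(conversation)
print(outputs[0].outputs[0].text)
```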
If the model doesn't have a chat template or you want to specify another one,
you can explicitly pass a chat template:

View File

@ -9,7 +9,7 @@ before returning them.
!!! note
We currently support pooling models primarily as a matter of convenience. This is not guaranteed to have any performance improvement over using HF Transformers / Sentence Transformers directly.
We are now planning to optimize pooling models in vLLM. Please comment on <gh-issue:21796> if you have any suggestions!
We are now planning to optimize pooling models in vLLM. Please comment on <https://github.com/vllm-project/vllm/issues/21796> if you have any suggestions!
## Configuration
@ -98,7 +98,7 @@ embeds = output.outputs.embedding
print(f"Embeddings: {embeds!r} (size={len(embeds)})")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>
A code example can be found here: [examples/offline_inference/basic/embed.py](../../examples/offline_inference/basic/embed.py)
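A hedged minimal sketch (model name illustrative):

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct")  # illustrative embedding model

(output,) = llm.embed("What is the capital of France?")
embeds = output.outputs.embedding
print(f"Embedding dimensions: {len(embeds)}")
```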
### `LLM.classify`
@ -115,7 +115,7 @@ probs = output.outputs.probs
print(f"Class Probabilities: {probs!r} (size={len(probs)})")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>
A code example can be found here: [examples/offline_inference/basic/classify.py](../../examples/offline_inference/basic/classify.py)
### `LLM.score`
@ -139,7 +139,7 @@ score = output.outputs.score
print(f"Score: {score}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>
A code example can be found here: [examples/offline_inference/basic/score.py](../../examples/offline_inference/basic/score.py)
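A hedged minimal sketch (model name illustrative):

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3")  # illustrative cross-encoder

(output,) = llm.score(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
print(f"Score: {output.outputs.score}")
```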
### `LLM.reward`
@ -156,7 +156,7 @@ data = output.outputs.data
print(f"Data: {data!r}")
```
A code example can be found here: <gh-file:examples/offline_inference/basic/reward.py>
A code example can be found here: [examples/offline_inference/basic/reward.py](../../examples/offline_inference/basic/reward.py)
### `LLM.encode`
@ -234,7 +234,7 @@ outputs = llm.embed(
print(outputs[0].outputs)
```
A code example can be found here: <gh-file:examples/offline_inference/pooling/embed_matryoshka_fy.py>
A code example can be found here: [examples/offline_inference/pooling/embed_matryoshka_fy.py](../../examples/offline_inference/pooling/embed_matryoshka_fy.py)
### Online Inference
@ -264,4 +264,4 @@ Expected output:
{"id":"embd-5c21fc9a5c9d4384a1b021daccaf9f64","object":"list","created":1745476417,"model":"jinaai/jina-embeddings-v3","data":[{"index":0,"object":"embedding","embedding":[-0.3828125,-0.1357421875,0.03759765625,0.125,0.21875,0.09521484375,-0.003662109375,0.1591796875,-0.130859375,-0.0869140625,-0.1982421875,0.1689453125,-0.220703125,0.1728515625,-0.2275390625,-0.0712890625,-0.162109375,-0.283203125,-0.055419921875,-0.0693359375,0.031982421875,-0.04052734375,-0.2734375,0.1826171875,-0.091796875,0.220703125,0.37890625,-0.0888671875,-0.12890625,-0.021484375,-0.0091552734375,0.23046875]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0,"prompt_tokens_details":null}}
```
An OpenAI client example can be found here: <gh-file:examples/online_serving/pooling/openai_embedding_matryoshka_fy.py>
An OpenAI client example can be found here: [examples/online_serving/pooling/openai_embedding_matryoshka_fy.py](../../examples/online_serving/pooling/openai_embedding_matryoshka_fy.py)

View File

@ -9,7 +9,7 @@ Alongside each architecture, we include some popular models that use it.
### vLLM
If vLLM natively supports a model, its implementation can be found in <gh-file:vllm/model_executor/models>.
If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
@ -116,7 +116,7 @@ Here is what happens in the background when this model is loaded:
1. The config is loaded.
2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
3. `MyModel` is loaded into one of the Transformers backend classes in <gh-file:vllm/model_executor/models/transformers.py> which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
3. `MyModel` is loaded into one of the Transformers backend classes in [vllm/model_executor/models/transformers.py](../../vllm/model_executor/models/transformers.py) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
That's it!
@ -543,7 +543,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
```
!!! note
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: <gh-file:examples/offline_inference/pooling/qwen3_reranker.py>.
Load the official original `Qwen3 Reranker` by using the following command. More information can be found at: [examples/offline_inference/pooling/qwen3_reranker.py](../../examples/offline_inference/pooling/qwen3_reranker.py).
```bash
vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
@ -581,7 +581,7 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |
!!! note
Named Entity Recognition (NER) usage, please refer to <gh-file:examples/offline_inference/pooling/ner.py>, <gh-file:examples/online_serving/pooling/ner_client.py>.
For Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py) and [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models }
@ -776,7 +776,7 @@ Some models are supported only via the [Transformers backend](#transformers). Th
!!! note
The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
For more details, please see: <gh-pr:4087#issuecomment-2250397630>
For more details, please see: <https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630>
!!! warning
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
@ -856,5 +856,5 @@ We have the following levels of testing for models:
1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](../../tests) and [examples](../../examples) for the models that have passed this test.
4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.

View File

@ -16,7 +16,7 @@ For MoE models, when any requests are in progress in any rank, we must ensure th
In all cases, it is beneficial to load-balance requests between DP ranks. For online deployments, this balancing can be optimized by taking into account the state of each DP engine - in particular its currently scheduled and waiting (queued) requests, and KV cache state. Each DP engine has an independent KV cache, and the benefit of prefix caching can be maximized by directing prompts intelligently.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see <gh-file:examples/offline_inference/data_parallel.py>.
This document focuses on online deployments (with the API server). DP + EP is also supported for offline usage (via the LLM class), for an example see [examples/offline_inference/data_parallel.py](../../examples/offline_inference/data_parallel.py).
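As a rough sketch of the offline path (assuming `data_parallel_size` is accepted directly by the `LLM` constructor; the linked example shows the complete, tested launching pattern):

```python
from vllm import LLM, SamplingParams

# Assumes data_parallel_size is forwarded as an engine argument here;
# see the linked example for the full per-rank launching pattern.
llm = LLM(model="facebook/opt-125m", data_parallel_size=2)
outputs = llm.generate(["Hello, my name is"] * 8, SamplingParams(max_tokens=16))
print(len(outputs))
```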
There are two distinct modes supported for online deployments: self-contained with internal load balancing, or external per-rank process deployment and load balancing.

View File

@ -4,11 +4,11 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to <gh-file:examples/online_serving/run_cluster.sh>, for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <gh-issue:6803>.
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in <gh-file:examples/online_serving/run_cluster.sh> (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <gh-issue:7815>.
The error message `Error: No available node types can fulfill resource request` can appear even when the cluster has enough GPUs. The issue often occurs when nodes have multiple IP addresses and vLLM can't select the correct one. Ensure that vLLM and Ray use the same IP address by setting `VLLM_HOST_IP` in [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) (with a different value on each node). Use `ray status` and `ray list nodes` to verify the chosen IP address. For more information, see <https://github.com/vllm-project/vllm/issues/7815>.
## Ray observability

View File

@ -8,9 +8,9 @@ EP is typically coupled with Data Parallelism (DP). While DP can be used indepen
Before using EP, you need to install the necessary dependencies. We are actively working on making this easier in the future:
1. **Install DeepEP and pplx-kernels**: Set up host environment following vLLM's guide for EP kernels [here](gh-file:tools/ep_kernels).
1. **Install DeepEP and pplx-kernels**: Set up the host environment by following vLLM's guide for EP kernels [here](../../tools/ep_kernels).
2. **Install DeepGEMM library**: Follow the [official instructions](https://github.com/deepseek-ai/DeepGEMM#installation).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](gh-file:tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
3. **For disaggregated serving**: Install `gdrcopy` by running the [`install_gdrcopy.sh`](../../tools/install_gdrcopy.sh) script (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/).
### Backend Selection Guide
@ -195,7 +195,7 @@ For production deployments requiring strict SLA guarantees for time-to-first-tok
### Setup Steps
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](gh-file:tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. For non-cuda platform to install nixl with non-cuda UCX build, run the [install_nixl_from_source_ubuntu.py](gh-file:tools/install_nixl_from_source_ubuntu.py) script.
1. **Install gdrcopy/ucx/nixl**: For maximum performance, run the [install_gdrcopy.sh](../../tools/install_gdrcopy.sh) script to install `gdrcopy` (e.g., `install_gdrcopy.sh "${GDRCOPY_OS_VERSION}" "12.8" "x64"`). You can find available OS versions [here](https://developer.download.nvidia.com/compute/redist/gdrcopy/CUDA%2012.8/). If `gdrcopy` is not installed, things will still work with a plain `pip install nixl`, just with lower performance. `nixl` and `ucx` are installed as dependencies via pip. On non-CUDA platforms, to install NIXL with a non-CUDA UCX build, run the [install_nixl_from_source_ubuntu.py](../../tools/install_nixl_from_source_ubuntu.py) script.
2. **Configure Both Instances**: Add this flag to both prefill and decode instances: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`. Note that you may also specify one or more NIXL backends, for example: `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both", "kv_connector_extra_config":{"backends":["UCX", "GDS"]}}'`

View File

@ -92,7 +92,7 @@ and all chat requests will error.
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```
vLLM community provides a set of chat templates for popular models. You can find them under the <gh-dir:examples> directory.
The vLLM community provides a set of chat templates for popular models. You can find them under the [examples](../../examples) directory.
With the inclusion of multi-modal chat APIs, the OpenAI spec now accepts chat messages in a new format which specifies
both a `type` and a `text` field. An example is provided below:
@ -181,7 +181,7 @@ with `--enable-request-id-headers`.
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/openai_completion_client.py>
Code example: [examples/online_serving/openai_completion_client.py](../../examples/online_serving/openai_completion_client.py)
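For instance, a minimal sketch with the official client (the model name is whatever you passed to `vllm serve`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="Qwen/Qwen3-0.6B",  # assumed served model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```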
#### Extra parameters
@ -214,7 +214,7 @@ see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more inf
- *Note: `image_url.detail` parameter is not supported.*
Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
Code example: [examples/online_serving/openai_chat_completion_client.py](../../examples/online_serving/openai_chat_completion_client.py)
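For instance, a minimal sketch with the official client (again, the model name must match the served model):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
chat = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",  # assumed served model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print(chat.choices[0].message.content)
```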
#### Extra parameters
@ -241,7 +241,7 @@ The following extra parameters are supported:
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
Code example: <gh-file:examples/online_serving/pooling/openai_embedding_client.py>
Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
@ -289,7 +289,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec_phi3v.jinja>
and can be found here: [examples/template_vlm2vec_phi3v.jinja](../../examples/template_vlm2vec_phi3v.jinja)
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
@ -336,13 +336,13 @@ and passing a list of `messages` in the request. Refer to the examples below for
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
by a custom chat template: [examples/template_dse_qwen2_vl.jinja](../../examples/template_dse_qwen2_vl.jinja)
!!! important
`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
example below for details.
Full example: <gh-file:examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py>
Full example: [examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py](../../examples/online_serving/pooling/openai_chat_embedding_client_for_multimodal.py)
#### Extra parameters
@ -379,7 +379,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
!!! note
To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
Code example: [examples/online_serving/openai_transcription_client.py](../../examples/online_serving/openai_transcription_client.py)
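For instance, a minimal sketch, assuming a Whisper-style model is being served (e.g. `vllm serve openai/whisper-large-v3`) and `audio.wav` is a local file:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("audio.wav", "rb") as audio_file:  # path to your own audio file
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # assumed served model
        file=audio_file,
    )
print(transcription.text)
```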
#### API Enforced Limits
@ -496,7 +496,7 @@ Please mind that the popular `openai/whisper-large-v3-turbo` model does not supp
!!! note
To use the Translation API, please install with extra audio dependencies using `pip install vllm[audio]`.
Code example: <gh-file:examples/online_serving/openai_translation_client.py>
Code example: [examples/online_serving/openai_translation_client.py](../../examples/online_serving/openai_translation_client.py)
#### Extra Parameters
@ -530,7 +530,7 @@ Our Pooling API encodes input prompts using a [pooling model](../models/pooling_
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: <gh-file:examples/online_serving/pooling/openai_pooling_client.py>
Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
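Since the OpenAI client has no pooling endpoint, a minimal sketch uses the lower-level `requests` library (the `/pooling` route and response shape are assumed from vLLM's OpenAI-compatible server):

```python
import requests

response = requests.post(
    "http://localhost:8000/pooling",
    json={"model": "jinaai/jina-embeddings-v3", "input": "vLLM is great!"},
)
response.raise_for_status()
# Each item carries a (possibly nested) list under "data" rather than a flat embedding.
print(response.json()["data"][0]["data"])
```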
[](){ #classification-api }
@ -540,7 +540,7 @@ Our Classification API directly supports Hugging Face sequence-classification mo
We automatically wrap any other transformer via `as_seq_cls_model()`, which pools on the last token, attaches a `RowParallelLinear` head, and applies a softmax to produce per-class probabilities.
Code example: <gh-file:examples/online_serving/pooling/openai_classification_client.py>
Code example: [examples/online_serving/pooling/openai_classification_client.py](../../examples/online_serving/pooling/openai_classification_client.py)
#### Example Requests
@ -658,7 +658,7 @@ Usually, the score for a sentence pair refers to the similarity between two sent
You can find the documentation for cross encoder models at [sbert.net](https://www.sbert.net/docs/package_reference/cross_encoder/cross_encoder.html).
Code example: <gh-file:examples/online_serving/openai_cross_encoder_score.py>
Code example: [examples/online_serving/openai_cross_encoder_score.py](../../examples/online_serving/openai_cross_encoder_score.py)
#### Single inference
@ -839,7 +839,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
print("Scoring output:", response_json["data"][0]["score"])
print("Scoring output:", response_json["data"][1]["score"])
```
Full example: <gh-file:examples/online_serving/openai_cross_encoder_score_for_multimodal.py>
Full example: [examples/online_serving/openai_cross_encoder_score_for_multimodal.py](../../examples/online_serving/openai_cross_encoder_score_for_multimodal.py)
#### Extra parameters
@ -871,7 +871,7 @@ endpoints are compatible with both [Jina AI's re-rank API interface](https://jin
[Cohere's re-rank API interface](https://docs.cohere.com/v2/reference/rerank) to ensure compatibility with
popular open-source tools.
Code example: <gh-file:examples/online_serving/pooling/jinaai_rerank_client.py>
Code example: [examples/online_serving/pooling/jinaai_rerank_client.py](../../examples/online_serving/pooling/jinaai_rerank_client.py)
#### Example Request
@ -949,6 +949,6 @@ Key capabilities:
- Scales from a single GPU to a multi-node cluster without code changes.
- Provides observability and autoscaling policies through Ray dashboards and metrics.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: <gh-file:examples/online_serving/ray_serve_deepseek.py>.
The following example shows how to deploy a large model like DeepSeek R1 with Ray Serve LLM: [examples/online_serving/ray_serve_deepseek.py](../../examples/online_serving/ray_serve_deepseek.py).
Learn more about Ray Serve LLM with the official [Ray Serve LLM documentation](https://docs.ray.io/en/latest/serve/llm/serving-llms.html).

View File

@ -72,7 +72,7 @@ For details, see the [Ray documentation](https://docs.ray.io/en/latest/index.htm
### Ray cluster setup with containers
The helper script <gh-file:examples/online_serving/run_cluster.sh> starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
The helper script [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) starts containers across nodes and initializes Ray. By default, the script runs Docker without administrative privileges, which prevents access to the GPU performance counters when profiling or tracing. To enable admin privileges, add the `--cap-add=CAP_SYS_ADMIN` flag to the Docker command.
Choose one node as the head node and run:
@ -132,7 +132,7 @@ vllm serve /path/to/the/model/in/the/container \
Efficient tensor parallelism requires fast inter-node communication, preferably through high-speed network adapters such as InfiniBand.
To set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the
<gh-file:examples/online_serving/run_cluster.sh> helper script.
[examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh) helper script.
Contact your system administrator for more information about the required flags.
## Enabling GPUDirect RDMA

View File

@ -6,7 +6,7 @@ reproducible results:
- For V1: Turn off multiprocessing to make the scheduling deterministic by setting `VLLM_ENABLE_V1_MULTIPROCESSING=0`.
- For V0: Set the global seed (see below).
Example: <gh-file:examples/offline_inference/reproducibility.py>
Example: [examples/offline_inference/reproducibility.py](../../examples/offline_inference/reproducibility.py)
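Putting these settings together, a minimal V1-oriented sketch looks like this (the model name is illustrative):

```python
import os

# Set before importing vLLM so the engine runs with deterministic scheduling.
os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=0)  # illustrative model
params = SamplingParams(temperature=0.8, top_p=0.95, seed=0)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```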
!!! warning
@ -39,7 +39,7 @@ In V1, the `seed` parameter defaults to `0` which sets the random state for each
It is impossible to un-specify a seed for V1 because different workers need to sample the same outputs
for workflows such as speculative decoding.
For more information, see: <gh-pr:17929>
For more information, see: <https://github.com/vllm-project/vllm/pull/17929>
### Locality of random state

View File

@ -24,7 +24,7 @@ If the model is too large to fit in a single GPU, you will get an out-of-memory
## Generation quality changed
In v0.8.0, the source of default sampling parameters was changed in <gh-pr:12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
In v0.8.0, the source of default sampling parameters was changed in <https://github.com/vllm-project/vllm/pull/12622>. Prior to v0.8.0, the default sampling parameters came from vLLM's set of neutral defaults. From v0.8.0 onwards, the default sampling parameters come from the `generation_config.json` provided by the model creator.
In most cases, this should lead to higher quality responses, because the model creator is likely to know which sampling parameters are best for their model. However, in some cases the defaults provided by the model creator can lead to degraded performance.
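If the model-provided defaults hurt quality for your workload, one option (a sketch, with illustrative values) is to pin the sampling parameters explicitly instead of relying on any defaults:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # illustrative model
# Explicit values chosen by you, rather than whatever generation_config.json
# or vLLM would otherwise supply for these fields.
params = SamplingParams(temperature=0.7, top_p=1.0, repetition_penalty=1.0)
print(llm.generate(["Hello!"], params)[0].outputs[0].text)
```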
@ -238,7 +238,7 @@ if __name__ == '__main__':
## `torch.compile` Error
vLLM heavily depends on `torch.compile` to optimize the model for better performance, which introduces the dependency on the `torch.compile` functionality and the `triton` library. By default, we use `torch.compile` to [optimize some functions](gh-pr:10406) in the model. Before running vLLM, you can check if `torch.compile` is working as expected by running the following script:
vLLM relies heavily on `torch.compile` to optimize the model for better performance, which introduces a dependency on `torch.compile` and the `triton` library. By default, we use `torch.compile` to [optimize some functions](https://github.com/vllm-project/vllm/pull/10406) in the model. Before running vLLM, you can check whether `torch.compile` is working as expected by running the following script:
??? code
@ -257,7 +257,7 @@ vLLM heavily depends on `torch.compile` to optimize the model for better perform
print(f(x))
```
If it raises errors from `torch/_inductor` directory, usually it means you have a custom `triton` library that is not compatible with the version of PyTorch you are using. See <gh-issue:12219> for example.
If it raises errors from the `torch/_inductor` directory, it usually means you have a custom `triton` library that is not compatible with your version of PyTorch. See <https://github.com/vllm-project/vllm/issues/12219> for an example.
## Model failed to be inspected
@ -297,7 +297,7 @@ But you are sure that the model is in the [list of supported models](../models/s
## Failed to infer device type
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](gh-file:vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](gh-pr:14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
If you see an error like `RuntimeError: Failed to infer device type`, it means that vLLM failed to infer the device type of the runtime environment. You can check [the code](../../vllm/platforms/__init__.py) to see how vLLM infers the device type and why it is not working as expected. After [this PR](https://github.com/vllm-project/vllm/pull/14195), you can also set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to see more detailed logs to help debug the issue.
## NCCL error: unhandled system error during `ncclCommInitRank`
@ -322,6 +322,6 @@ This indicates vLLM failed to initialize the NCCL communicator, possibly due to
## Known Issues
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000), which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759).
- To address a memory overhead issue in older NCCL versions (see [bug](https://github.com/NVIDIA/nccl/issues/1234)), vLLM versions `>= 0.4.3, <= 0.10.1.1` would set the environment variable `NCCL_CUMEM_ENABLE=0`. External processes connecting to vLLM also needed to set this variable to prevent hangs or crashes. Since the underlying NCCL bug was fixed in NCCL 2.22.3, this override was removed in newer vLLM versions to allow for NCCL performance optimizations.
- On some PCIe machines (e.g., machines without NVLink), if you see an error like `transport/shm.cc:590 NCCL WARN Cuda failure 217 'peer access is not supported between these two devices'`, it is likely caused by a driver bug. See [this issue](https://github.com/NVIDIA/nccl/issues/1838) for more details. In that case, you can try setting `NCCL_CUMEM_HOST_ENABLE=0` to disable the feature, or upgrade your driver to the latest version.

View File

@ -6,7 +6,7 @@ A subset of the data, after cleaning and aggregation, will be publicly released
## What data is collected?
The list of data collected by the latest version of vLLM can be found here: <gh-file:vllm/usage/usage_lib.py>
The list of data collected by the latest version of vLLM can be found here: [vllm/usage/usage_lib.py](../../vllm/usage/usage_lib.py)
Here is an example as of v0.4.0:

View File

@ -2,7 +2,7 @@
!!! announcement
We have started the process of deprecating V0. Please read [RFC #18571](gh-issue:18571) for more details.
We have started the process of deprecating V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
@ -94,8 +94,8 @@ See below for the status of models that are not yet supported or have more featu
The initial basic support is now functional.
Later, we will consider using [hidden states processor](gh-issue:12249),
which is based on [global logits processor](gh-pr:13360)
Later, we will consider using [hidden states processor](https://github.com/vllm-project/vllm/issues/12249),
which is based on [global logits processor](https://github.com/vllm-project/vllm/pull/13360)
to enable simultaneous generation and embedding using the same engine instance in V1.
#### Mamba Models
@ -124,13 +124,13 @@ encoder and decoder (e.g., `BartForConditionalGeneration`,
| **Chunked Prefill** | <nobr>🚀 Optimized</nobr> |
| **LoRA** | <nobr>🚀 Optimized</nobr> |
| **Logprobs Calculation** | <nobr>🟢 Functional</nobr> |
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<gh-pr:15191>)</nobr>|
| **FP8 KV Cache** | <nobr>🟢 Functional on Hopper devices (<https://github.com/vllm-project/vllm/pull/15191>)</nobr>|
| **Spec Decode** | <nobr>🚀 Optimized</nobr> |
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](gh-issue:13414))</nobr>|
| **Prompt Logprobs with Prefix Caching** | <nobr>🟡 Planned ([RFC #13414](https://github.com/vllm-project/vllm/issues/13414))</nobr>|
| **Structured Output Alternative Backends** | <nobr>🟢 Functional</nobr> |
| **Request-level Structured Output Backend** | <nobr>🔴 Deprecated</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](gh-issue:13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](gh-pr:13360))</nobr> |
| **best_of** | <nobr>🔴 Deprecated ([RFC #13361](https://github.com/vllm-project/vllm/issues/13361))</nobr>|
| **Per-Request Logits Processors** | <nobr>🔴 Deprecated ([RFC #13360](https://github.com/vllm-project/vllm/pull/13360))</nobr> |
| **GPU <> CPU KV Cache Swapping** | <nobr>🔴 Deprecated</nobr> |
!!! note
@ -168,11 +168,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha
##### Sampling features
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361).
- **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom
processing functions to adjust logits on a per-request basis. In vLLM V1, this
feature has been deprecated. Instead, the design is moving toward supporting **global logits
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360).
processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](https://github.com/vllm-project/vllm/pull/13360).
##### KV Cache features

View File

@ -14,7 +14,7 @@ python examples/offline_inference/pooling/convert_model_to_seq_cls.py --model_na
## Embed jina_embeddings_v3 usage
Only text matching task is supported for now. See <gh-pr:16120>
Only the text matching task is supported for now. See <https://github.com/vllm-project/vllm/pull/16120>.
```bash
python examples/offline_inference/pooling/embed_jina_embeddings_v3.py