[Docs] Fix warnings in docs build (#22588)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor authored on 2025-08-10 13:49:51 +01:00; committed by GitHub
parent d411df0296
commit 00976db0c3
GPG Key ID: B5690EEEBB952194 (no known key found for this signature in database)
10 changed files with 80 additions and 90 deletions


@ -1,7 +1,5 @@
 # Summary
-[](){ #configuration }
 ## Configuration
 API documentation for vLLM's configuration classes.


@ -96,7 +96,7 @@ Although its common to do this with GPUs, don't try to fragment 2 or 8 differ
 ### Tune your workloads
-Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
+Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
 ### Future Topics We'll Cover


@ -540,8 +540,10 @@ return a schema of the tensors outputted by the HF processor that are related to
 The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
 `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
-In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
-we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+In order to support the use of
+[MultiModalFieldConfig.batched][vllm.multimodal.inputs.MultiModalFieldConfig.batched]
+like in LLaVA, we remove the extra batch dimension by overriding
+[BaseMultiModalProcessor._call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor]:
 ??? code
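The collapsed code block is not expanded in this diff; a rough sketch of the kind of override described above, with an approximate method signature and a hypothetical class name (the actual Fuyu processor in vLLM may differ):

```python
from vllm.multimodal.processing import BaseMultiModalProcessor


# Hypothetical subclass name; the signature of _call_hf_processor is approximate.
class MyFuyuMultiModalProcessor(BaseMultiModalProcessor):

    def _call_hf_processor(self, prompt, mm_data, mm_kwargs):
        processed_outputs = super()._call_hf_processor(prompt, mm_data, mm_kwargs)
        image_patches = processed_outputs.get("image_patches")
        if image_patches is not None:
            # Drop the extra leading batch dimension:
            # (1, num_images, num_patches, ...) -> (num_images, num_patches, ...)
            processed_outputs["image_patches"] = image_patches[0]
        return processed_outputs
```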
@ -816,7 +818,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
 After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
 [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
 and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
-decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.processing.MultiModalRegistry.register_processor]
+decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
 to register them to the multi-modal registry:
 ```diff
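For reference, a sketch of what this registration typically looks like, assuming `MyMultiModalProcessor`, `MyProcessingInfo`, and `MyDummyInputsBuilder` are the classes defined in Steps 2-4 (treat the decorator arguments as illustrative):

```python
from torch import nn

from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY


# Placeholder model class; the decorator wires the processing classes
# from Steps 2-4 into the multi-modal registry.
@MULTIMODAL_REGISTRY.register_processor(MyMultiModalProcessor,
                                        info=MyProcessingInfo,
                                        dummy_inputs=MyDummyInputsBuilder)
class MyModelForConditionalGeneration(nn.Module, SupportsMultiModal):
    ...
```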


@ -4,7 +4,7 @@ vLLM provides first-class support for generative models, which covers most of LL
 In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
+which are then passed through [Sampler][vllm.model_executor.layers.sampler.Sampler] to obtain the final text.
 ## Configuration
@ -19,7 +19,7 @@ Run a model in generation mode via the option `--runner generate`.
 ## Offline Inference
 The [LLM][vllm.LLM] class provides various methods for offline inference.
-See [configuration][configuration] for a list of options when initializing the model.
+See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
 ### `LLM.generate`
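As a quick illustration of the offline API referenced here (the model name is only an example; any supported generative model works):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # example model, for illustration only
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, max_tokens=32))
print(outputs[0].outputs[0].text)
```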


@ -81,7 +81,7 @@ which takes priority over both the model's and Sentence Transformers's defaults.
 ## Offline Inference
 The [LLM][vllm.LLM] class provides various methods for offline inference.
-See [configuration][configuration] for a list of options when initializing the model.
+See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
 ### `LLM.embed`
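A minimal sketch of the embedding path, assuming the `runner="pooling"` keyword mirrors the `--runner` CLI option mentioned earlier and using an arbitrary example model:

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")  # example model
(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))  # dimensionality of the returned vector
```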


@ -770,7 +770,7 @@ The following table lists those that are tested in vLLM.
 Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
 These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
-| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][parallelism-scaling] | [V1](gh-issue:8779) |
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) | [V1](gh-issue:8779) |
 |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
 | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
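A minimal sketch of scoring a query against two candidate documents with the `LLM.score` API mentioned above, using an arbitrary example reranker and assuming the same `runner="pooling"` keyword as before:

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")  # example model
outputs = llm.score("What is the capital of France?",
                    ["Paris is the capital of France.",
                     "The Eiffel Tower is in Paris."])
for output in outputs:
    print(output.outputs.score)
```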


@ -1,10 +1,11 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-from .data import (DecoderOnlyInputs, EmbedsInputs, EncoderDecoderInputs,
-                   ExplicitEncoderDecoderPrompt, ProcessorInputs, PromptType,
-                   SingletonInputs, SingletonPrompt, TextPrompt, TokenInputs,
-                   TokensPrompt, build_explicit_enc_dec_prompt, embeds_inputs,
+from .data import (DecoderOnlyInputs, EmbedsInputs, EmbedsPrompt,
+                   EncoderDecoderInputs, ExplicitEncoderDecoderPrompt,
+                   ProcessorInputs, PromptType, SingletonInputs,
+                   SingletonPrompt, TextPrompt, TokenInputs, TokensPrompt,
+                   build_explicit_enc_dec_prompt, embeds_inputs,
                    to_enc_dec_tuple_list, token_inputs, zip_enc_dec_prompts)
 from .registry import (DummyData, InputContext, InputProcessingContext,
                        InputRegistry)
@ -24,6 +25,7 @@ __all__ = [
"ExplicitEncoderDecoderPrompt", "ExplicitEncoderDecoderPrompt",
"TokenInputs", "TokenInputs",
"EmbedsInputs", "EmbedsInputs",
"EmbedsPrompt",
"token_inputs", "token_inputs",
"embeds_inputs", "embeds_inputs",
"DecoderOnlyInputs", "DecoderOnlyInputs",


@ -103,113 +103,89 @@ class SamplingParams(
     Overall, we follow the sampling parameters from the OpenAI text completion
     API (https://platform.openai.com/docs/api-reference/completions/create).
     In addition, we support beam search, which is not supported by OpenAI.
-    Args:
-        n: Number of output sequences to return for the given prompt.
-        best_of: Number of output sequences that are generated from the prompt.
-            From these `best_of` sequences, the top `n` sequences are returned.
-            `best_of` must be greater than or equal to `n`. By default,
-            `best_of` is set to `n`. Warning, this is only supported in V0.
-        presence_penalty: Float that penalizes new tokens based on whether they
-            appear in the generated text so far. Values > 0 encourage the model
-            to use new tokens, while values < 0 encourage the model to repeat
-            tokens.
-        frequency_penalty: Float that penalizes new tokens based on their
-            frequency in the generated text so far. Values > 0 encourage the
-            model to use new tokens, while values < 0 encourage the model to
-            repeat tokens.
-        repetition_penalty: Float that penalizes new tokens based on whether
-            they appear in the prompt and the generated text so far. Values > 1
-            encourage the model to use new tokens, while values < 1 encourage
-            the model to repeat tokens.
-        temperature: Float that controls the randomness of the sampling. Lower
-            values make the model more deterministic, while higher values make
-            the model more random. Zero means greedy sampling.
-        top_p: Float that controls the cumulative probability of the top tokens
-            to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
-        top_k: Integer that controls the number of top tokens to consider. Set
-            to 0 (or -1) to consider all tokens.
-        min_p: Float that represents the minimum probability for a token to be
-            considered, relative to the probability of the most likely token.
-            Must be in [0, 1]. Set to 0 to disable this.
-        seed: Random seed to use for the generation.
-        stop: list of strings that stop the generation when they are generated.
-            The returned output will not contain the stop strings.
-        stop_token_ids: list of tokens that stop the generation when they are
-            generated. The returned output will contain the stop tokens unless
-            the stop tokens are special tokens.
-        bad_words: list of words that are not allowed to be generated.
-            More precisely, only the last token of a corresponding
-            token sequence is not allowed when the next generated token
-            can complete the sequence.
-        include_stop_str_in_output: Whether to include the stop strings in
-            output text. Defaults to False.
-        ignore_eos: Whether to ignore the EOS token and continue generating
-            tokens after the EOS token is generated.
-        max_tokens: Maximum number of tokens to generate per output sequence.
-        min_tokens: Minimum number of tokens to generate per output sequence
-            before EOS or stop_token_ids can be generated
-        logprobs: Number of log probabilities to return per output token.
-            When set to None, no probability is returned. If set to a non-None
-            value, the result includes the log probabilities of the specified
-            number of most likely tokens, as well as the chosen tokens.
-            Note that the implementation follows the OpenAI API: The API will
-            always return the log probability of the sampled token, so there
-            may be up to `logprobs+1` elements in the response.
-            When set to -1, return all `vocab_size` log probabilities.
-        prompt_logprobs: Number of log probabilities to return per prompt token.
-        detokenize: Whether to detokenize the output. Defaults to True.
-        skip_special_tokens: Whether to skip special tokens in the output.
-        spaces_between_special_tokens: Whether to add spaces between special
-            tokens in the output. Defaults to True.
-        logits_processors: list of functions that modify logits based on
-            previously generated tokens, and optionally prompt tokens as
-            a first argument.
-        truncate_prompt_tokens: If set to -1, will use the truncation size
-            supported by the model. If set to an integer k, will use only
-            the last k tokens from the prompt (i.e., left truncation).
-            Defaults to None (i.e., no truncation).
-        guided_decoding: If provided, the engine will construct a guided
-            decoding logits processor from these parameters. Defaults to None.
-        logit_bias: If provided, the engine will construct a logits processor
-            that applies these logit biases. Defaults to None.
-        allowed_token_ids: If provided, the engine will construct a logits
-            processor which only retains scores for the given token ids.
-            Defaults to None.
-        extra_args: Arbitrary additional args, that can be used by custom
-            sampling implementations, plugins, etc. Not used by any in-tree
-            sampling implementations.
     """
     n: int = 1
+    """Number of output sequences to return for the given prompt."""
     best_of: Optional[int] = None
+    """Number of output sequences that are generated from the prompt. From
+    these `best_of` sequences, the top `n` sequences are returned. `best_of`
+    must be greater than or equal to `n`. By default, `best_of` is set to `n`.
+    Warning, this is only supported in V0."""
     _real_n: Optional[int] = None
     presence_penalty: float = 0.0
+    """Penalizes new tokens based on whether they appear in the generated text
+    so far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     frequency_penalty: float = 0.0
+    """Penalizes new tokens based on their frequency in the generated text so
+    far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     repetition_penalty: float = 1.0
+    """Penalizes new tokens based on whether they appear in the prompt and the
+    generated text so far. Values > 1 encourage the model to use new tokens,
+    while values < 1 encourage the model to repeat tokens."""
     temperature: float = 1.0
+    """Controls the randomness of the sampling. Lower values make the model
+    more deterministic, while higher values make the model more random. Zero
+    means greedy sampling."""
     top_p: float = 1.0
+    """Controls the cumulative probability of the top tokens to consider. Must
+    be in (0, 1]. Set to 1 to consider all tokens."""
     top_k: int = 0
+    """Controls the number of top tokens to consider. Set to 0 (or -1) to
+    consider all tokens."""
     min_p: float = 0.0
+    """Represents the minimum probability for a token to be considered,
+    relative to the probability of the most likely token. Must be in [0, 1].
+    Set to 0 to disable this."""
     seed: Optional[int] = None
+    """Random seed to use for the generation."""
     stop: Optional[Union[str, list[str]]] = None
+    """String(s) that stop the generation when they are generated. The returned
+    output will not contain the stop strings."""
     stop_token_ids: Optional[list[int]] = None
+    """Token IDs that stop the generation when they are generated. The returned
+    output will contain the stop tokens unless the stop tokens are special
+    tokens."""
     ignore_eos: bool = False
+    """Whether to ignore the EOS token and continue generating
+    tokens after the EOS token is generated."""
     max_tokens: Optional[int] = 16
+    """Maximum number of tokens to generate per output sequence."""
     min_tokens: int = 0
+    """Minimum number of tokens to generate per output sequence before EOS or
+    `stop_token_ids` can be generated"""
     logprobs: Optional[int] = None
+    """Number of log probabilities to return per output token. When set to
+    `None`, no probability is returned. If set to a non-`None` value, the
+    result includes the log probabilities of the specified number of most
+    likely tokens, as well as the chosen tokens. Note that the implementation
+    follows the OpenAI API: The API will always return the log probability of
+    the sampled token, so there may be up to `logprobs+1` elements in the
+    response. When set to -1, return all `vocab_size` log probabilities."""
     prompt_logprobs: Optional[int] = None
+    """Number of log probabilities to return per prompt token."""
     # NOTE: This parameter is only exposed at the engine level for now.
     # It is not exposed in the OpenAI API server, as the OpenAI API does
     # not support returning only a list of token IDs.
     detokenize: bool = True
+    """Whether to detokenize the output."""
     skip_special_tokens: bool = True
+    """Whether to skip special tokens in the output."""
     spaces_between_special_tokens: bool = True
+    """Whether to add spaces between special tokens in the output."""
     # Optional[list[LogitsProcessor]] type. We use Any here because
     # Optional[list[LogitsProcessor]] type is not supported by msgspec.
     logits_processors: Optional[Any] = None
+    """Functions that modify logits based on previously generated tokens, and
+    optionally prompt tokens as a first argument."""
     include_stop_str_in_output: bool = False
+    """Whether to include the stop strings in output text."""
     truncate_prompt_tokens: Optional[Annotated[int, msgspec.Meta(ge=1)]] = None
+    """If set to -1, will use the truncation size supported by the model. If
+    set to an integer k, will use only the last k tokens from the prompt
+    (i.e., left truncation). If set to `None`, truncation is disabled."""
     output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
     # The below fields are not supposed to be used as an input.
@ -219,12 +195,24 @@ class SamplingParams(
     # Fields used to construct logits processors
     guided_decoding: Optional[GuidedDecodingParams] = None
+    """If provided, the engine will construct a guided decoding logits
+    processor from these parameters."""
     logit_bias: Optional[dict[int, float]] = None
+    """If provided, the engine will construct a logits processor that applies
+    these logit biases."""
     allowed_token_ids: Optional[list[int]] = None
+    """If provided, the engine will construct a logits processor which only
+    retains scores for the given token ids."""
     extra_args: Optional[dict[str, Any]] = None
+    """Arbitrary additional args, that can be used by custom sampling
+    implementations, plugins, etc. Not used by any in-tree sampling
+    implementations."""
     # Fields used for bad words
     bad_words: Optional[list[str]] = None
+    """Words that are not allowed to be generated. More precisely, only the
+    last token of a corresponding token sequence is not allowed when the next
+    generated token can complete the sequence."""
     _bad_words_token_ids: Optional[list[list[int]]] = None
     @staticmethod
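Taken together, the relocated docstrings read directly at the call site; an illustrative construction with arbitrary values:

```python
from vllm import SamplingParams

# Arbitrary illustrative values; omitted fields keep the defaults shown above.
params = SamplingParams(
    n=1,
    temperature=0.8,         # 0 would mean greedy sampling
    top_p=0.95,              # keep the top 95% of the probability mass
    repetition_penalty=1.1,  # >1 discourages repeated tokens
    max_tokens=128,
    stop=["\n\n"],           # the output will not contain the stop string
    seed=42,
)
```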