[Docs] Fix warnings in docs build (#22588)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Parent: d411df0296
Commit: 00976db0c3
@@ -1,7 +1,5 @@
 # Summary
 
-[](){ #configuration }
-
 ## Configuration
 
 API documentation for vLLM's configuration classes.
@@ -96,7 +96,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ
 
 ### Tune your workloads
 
-Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
+Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
 
 ### Future Topics We'll Cover
 
@@ -540,8 +540,10 @@ return a schema of the tensors outputted by the HF processor that are related to
 The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
 `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
 
-In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
-we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+In order to support the use of
+[MultiModalFieldConfig.batched][vllm.multimodal.inputs.MultiModalFieldConfig.batched]
+like in LLaVA, we remove the extra batch dimension by overriding
+[BaseMultiModalProcessor._call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor]:
 
 ??? code
 
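For readers skimming this hunk, the override it refers to looks roughly like the following sketch. The class name and the exact `_call_hf_processor` signature are assumptions made for illustration, not the code from the vLLM docs page being edited:

```python
# Hedged sketch only: the MyFuyuMultiModalProcessor name and the exact
# _call_hf_processor signature below are assumptions for illustration.
from collections.abc import Mapping
from typing import Any

from vllm.multimodal.processing import BaseMultiModalProcessor


class MyFuyuMultiModalProcessor(BaseMultiModalProcessor):

    def _call_hf_processor(
        self,
        prompt: str,
        mm_data: Mapping[str, object],
        mm_kwargs: Mapping[str, Any],
    ):
        processed_outputs = super()._call_hf_processor(
            prompt=prompt,
            mm_data=mm_data,
            mm_kwargs=mm_kwargs,
        )

        image_patches = processed_outputs.get("image_patches")
        if image_patches is not None:
            # The HF processor returns a tensor shaped
            # (1, num_images, num_patches, patch_width * patch_height * num_channels);
            # dropping the leading singleton batch dim leaves one entry per image,
            # which is what MultiModalFieldConfig.batched expects.
            processed_outputs["image_patches"] = image_patches[0]

        return processed_outputs
```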
@@ -816,7 +818,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
 After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
 [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
 and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
-decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.processing.MultiModalRegistry.register_processor]
+decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
 to register them to the multi-modal registry:
 
 ```diff
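For context, the decorator mentioned in that hunk is typically applied along these lines. The `My*` names are hypothetical placeholders for the classes defined in Steps 2-4, and the import path for `MULTIMODAL_REGISTRY` is assumed, so treat this as a sketch rather than the snippet from the docs:

```python
# Illustrative sketch; the My* names are hypothetical placeholders for the
# Step 2-4 classes, assumed to be defined elsewhere in your model file.
from torch import nn

from vllm.multimodal import MULTIMODAL_REGISTRY  # import path assumed


@MULTIMODAL_REGISTRY.register_processor(
    MyMultiModalProcessor,              # Step 4: the multi-modal processor
    info=MyProcessingInfo,              # Step 2: processing info
    dummy_inputs=MyDummyInputsBuilder,  # Step 3: dummy inputs for profiling
)
class MyModelForConditionalGeneration(nn.Module):
    ...
```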
@@ -4,7 +4,7 @@ vLLM provides first-class support for generative models, which covers most of LL
 
 In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
+which are then passed through [Sampler][vllm.model_executor.layers.sampler.Sampler] to obtain the final text.
 
 ## Configuration
 
@ -19,7 +19,7 @@ Run a model in generation mode via the option `--runner generate`.
|
|||||||
## Offline Inference
|
## Offline Inference
|
||||||
|
|
||||||
The [LLM][vllm.LLM] class provides various methods for offline inference.
|
The [LLM][vllm.LLM] class provides various methods for offline inference.
|
||||||
See [configuration][configuration] for a list of options when initializing the model.
|
See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
|
||||||
|
|
||||||
### `LLM.generate`
|
### `LLM.generate`
|
||||||
|
|
||||||
|
|||||||
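As a quick reminder of the offline-inference API this page documents, `LLM.generate` is used roughly like the following minimal sketch; the model name is only an example:

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch; "facebook/opt-125m" is only an example model.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```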
@@ -81,7 +81,7 @@ which takes priority over both the model's and Sentence Transformers's defaults.
 ## Offline Inference
 
 The [LLM][vllm.LLM] class provides various methods for offline inference.
-See [configuration][configuration] for a list of options when initializing the model.
+See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
 
 ### `LLM.embed`
 
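For reference, `LLM.embed` follows the same offline pattern. A minimal sketch, assuming an embedding model; the model name is only an example and, depending on the vLLM version, you may need to select the pooling runner/task explicitly:

```python
from vllm import LLM

# Minimal embedding sketch; the model name is only an example and extra
# pooling configuration may be needed depending on the vLLM version.
llm = LLM(model="intfloat/e5-mistral-7b-instruct")

outputs = llm.embed(["What is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # dimensionality of the embedding
```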
@@ -770,7 +770,7 @@ The following table lists those that are tested in vLLM.
 Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
 These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
 
-| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][parallelism-scaling] | [V1](gh-issue:8779) |
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) | [V1](gh-issue:8779) |
 |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
 | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
 
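The `LLM.score` API referenced above pairs one query with several candidate texts. A rough sketch follows; the model name is taken from the table as an example (it may need extra multimodal setup in practice), and the exact output fields can vary by version:

```python
from vllm import LLM

# Rough sketch of the cross-encoder / reranker flow; the model name comes
# from the table above and may require additional multimodal setup.
llm = LLM(model="jinaai/jina-reranker-m0")

outputs = llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "The Eiffel Tower is in Paris."],
)
for output in outputs:
    print(output.outputs.score)
```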
vllm/attention/layers/__init__.py (new empty file, 0 lines)
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
-from .data import (DecoderOnlyInputs, EmbedsInputs, EncoderDecoderInputs,
-                   ExplicitEncoderDecoderPrompt, ProcessorInputs, PromptType,
-                   SingletonInputs, SingletonPrompt, TextPrompt, TokenInputs,
-                   TokensPrompt, build_explicit_enc_dec_prompt, embeds_inputs,
+from .data import (DecoderOnlyInputs, EmbedsInputs, EmbedsPrompt,
+                   EncoderDecoderInputs, ExplicitEncoderDecoderPrompt,
+                   ProcessorInputs, PromptType, SingletonInputs,
+                   SingletonPrompt, TextPrompt, TokenInputs, TokensPrompt,
+                   build_explicit_enc_dec_prompt, embeds_inputs,
                    to_enc_dec_tuple_list, token_inputs, zip_enc_dec_prompts)
 from .registry import (DummyData, InputContext, InputProcessingContext,
                        InputRegistry)
@@ -24,6 +25,7 @@ __all__ = [
     "ExplicitEncoderDecoderPrompt",
     "TokenInputs",
     "EmbedsInputs",
+    "EmbedsPrompt",
     "token_inputs",
     "embeds_inputs",
     "DecoderOnlyInputs",
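The `vllm.inputs` package whose exports are extended above provides the prompt types used for offline inference. A small construction sketch, where the field names match the exported types but the tensor shape for `EmbedsPrompt` is purely illustrative:

```python
import torch

from vllm.inputs import EmbedsPrompt, TextPrompt, TokensPrompt

# Plain text prompt.
text_prompt = TextPrompt(prompt="Hello, my name is")

# Pre-tokenized prompt.
tokens_prompt = TokensPrompt(prompt_token_ids=[1, 2, 3, 4])

# Prompt given directly as embeddings; the (num_tokens, hidden_size) shape
# used here is illustrative, not prescribed by vLLM.
embeds_prompt = EmbedsPrompt(prompt_embeds=torch.zeros(4, 4096))
```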
vllm/model_executor/warmup/__init__.py (new empty file, 0 lines)
@@ -103,113 +103,89 @@ class SamplingParams(
     Overall, we follow the sampling parameters from the OpenAI text completion
     API (https://platform.openai.com/docs/api-reference/completions/create).
     In addition, we support beam search, which is not supported by OpenAI.
 
-    Args:
-        n: Number of output sequences to return for the given prompt.
-        best_of: Number of output sequences that are generated from the prompt.
-            From these `best_of` sequences, the top `n` sequences are returned.
-            `best_of` must be greater than or equal to `n`. By default,
-            `best_of` is set to `n`. Warning, this is only supported in V0.
-        presence_penalty: Float that penalizes new tokens based on whether they
-            appear in the generated text so far. Values > 0 encourage the model
-            to use new tokens, while values < 0 encourage the model to repeat
-            tokens.
-        frequency_penalty: Float that penalizes new tokens based on their
-            frequency in the generated text so far. Values > 0 encourage the
-            model to use new tokens, while values < 0 encourage the model to
-            repeat tokens.
-        repetition_penalty: Float that penalizes new tokens based on whether
-            they appear in the prompt and the generated text so far. Values > 1
-            encourage the model to use new tokens, while values < 1 encourage
-            the model to repeat tokens.
-        temperature: Float that controls the randomness of the sampling. Lower
-            values make the model more deterministic, while higher values make
-            the model more random. Zero means greedy sampling.
-        top_p: Float that controls the cumulative probability of the top tokens
-            to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
-        top_k: Integer that controls the number of top tokens to consider. Set
-            to 0 (or -1) to consider all tokens.
-        min_p: Float that represents the minimum probability for a token to be
-            considered, relative to the probability of the most likely token.
-            Must be in [0, 1]. Set to 0 to disable this.
-        seed: Random seed to use for the generation.
-        stop: list of strings that stop the generation when they are generated.
-            The returned output will not contain the stop strings.
-        stop_token_ids: list of tokens that stop the generation when they are
-            generated. The returned output will contain the stop tokens unless
-            the stop tokens are special tokens.
-        bad_words: list of words that are not allowed to be generated.
-            More precisely, only the last token of a corresponding
-            token sequence is not allowed when the next generated token
-            can complete the sequence.
-        include_stop_str_in_output: Whether to include the stop strings in
-            output text. Defaults to False.
-        ignore_eos: Whether to ignore the EOS token and continue generating
-            tokens after the EOS token is generated.
-        max_tokens: Maximum number of tokens to generate per output sequence.
-        min_tokens: Minimum number of tokens to generate per output sequence
-            before EOS or stop_token_ids can be generated
-        logprobs: Number of log probabilities to return per output token.
-            When set to None, no probability is returned. If set to a non-None
-            value, the result includes the log probabilities of the specified
-            number of most likely tokens, as well as the chosen tokens.
-            Note that the implementation follows the OpenAI API: The API will
-            always return the log probability of the sampled token, so there
-            may be up to `logprobs+1` elements in the response.
-            When set to -1, return all `vocab_size` log probabilities.
-        prompt_logprobs: Number of log probabilities to return per prompt token.
-        detokenize: Whether to detokenize the output. Defaults to True.
-        skip_special_tokens: Whether to skip special tokens in the output.
-        spaces_between_special_tokens: Whether to add spaces between special
-            tokens in the output. Defaults to True.
-        logits_processors: list of functions that modify logits based on
-            previously generated tokens, and optionally prompt tokens as
-            a first argument.
-        truncate_prompt_tokens: If set to -1, will use the truncation size
-            supported by the model. If set to an integer k, will use only
-            the last k tokens from the prompt (i.e., left truncation).
-            Defaults to None (i.e., no truncation).
-        guided_decoding: If provided, the engine will construct a guided
-            decoding logits processor from these parameters. Defaults to None.
-        logit_bias: If provided, the engine will construct a logits processor
-            that applies these logit biases. Defaults to None.
-        allowed_token_ids: If provided, the engine will construct a logits
-            processor which only retains scores for the given token ids.
-            Defaults to None.
-        extra_args: Arbitrary additional args, that can be used by custom
-            sampling implementations, plugins, etc. Not used by any in-tree
-            sampling implementations.
     """
 
     n: int = 1
+    """Number of output sequences to return for the given prompt."""
     best_of: Optional[int] = None
+    """Number of output sequences that are generated from the prompt. From
+    these `best_of` sequences, the top `n` sequences are returned. `best_of`
+    must be greater than or equal to `n`. By default, `best_of` is set to `n`.
+    Warning, this is only supported in V0."""
     _real_n: Optional[int] = None
     presence_penalty: float = 0.0
+    """Penalizes new tokens based on whether they appear in the generated text
+    so far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     frequency_penalty: float = 0.0
+    """Penalizes new tokens based on their frequency in the generated text so
+    far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     repetition_penalty: float = 1.0
+    """Penalizes new tokens based on whether they appear in the prompt and the
+    generated text so far. Values > 1 encourage the model to use new tokens,
+    while values < 1 encourage the model to repeat tokens."""
     temperature: float = 1.0
+    """Controls the randomness of the sampling. Lower values make the model
+    more deterministic, while higher values make the model more random. Zero
+    means greedy sampling."""
     top_p: float = 1.0
+    """Controls the cumulative probability of the top tokens to consider. Must
+    be in (0, 1]. Set to 1 to consider all tokens."""
     top_k: int = 0
+    """Controls the number of top tokens to consider. Set to 0 (or -1) to
+    consider all tokens."""
     min_p: float = 0.0
+    """Represents the minimum probability for a token to be considered,
+    relative to the probability of the most likely token. Must be in [0, 1].
+    Set to 0 to disable this."""
     seed: Optional[int] = None
+    """Random seed to use for the generation."""
     stop: Optional[Union[str, list[str]]] = None
+    """String(s) that stop the generation when they are generated. The returned
+    output will not contain the stop strings."""
     stop_token_ids: Optional[list[int]] = None
+    """Token IDs that stop the generation when they are generated. The returned
+    output will contain the stop tokens unless the stop tokens are special
+    tokens."""
     ignore_eos: bool = False
+    """Whether to ignore the EOS token and continue generating
+    tokens after the EOS token is generated."""
     max_tokens: Optional[int] = 16
+    """Maximum number of tokens to generate per output sequence."""
     min_tokens: int = 0
+    """Minimum number of tokens to generate per output sequence before EOS or
+    `stop_token_ids` can be generated"""
     logprobs: Optional[int] = None
+    """Number of log probabilities to return per output token. When set to
+    `None`, no probability is returned. If set to a non-`None` value, the
+    result includes the log probabilities of the specified number of most
+    likely tokens, as well as the chosen tokens. Note that the implementation
+    follows the OpenAI API: The API will always return the log probability of
+    the sampled token, so there may be up to `logprobs+1` elements in the
+    response. When set to -1, return all `vocab_size` log probabilities."""
     prompt_logprobs: Optional[int] = None
+    """Number of log probabilities to return per prompt token."""
     # NOTE: This parameter is only exposed at the engine level for now.
     # It is not exposed in the OpenAI API server, as the OpenAI API does
     # not support returning only a list of token IDs.
     detokenize: bool = True
+    """Whether to detokenize the output."""
     skip_special_tokens: bool = True
+    """Whether to skip special tokens in the output."""
     spaces_between_special_tokens: bool = True
+    """Whether to add spaces between special tokens in the output."""
     # Optional[list[LogitsProcessor]] type. We use Any here because
     # Optional[list[LogitsProcessor]] type is not supported by msgspec.
     logits_processors: Optional[Any] = None
+    """Functions that modify logits based on previously generated tokens, and
+    optionally prompt tokens as a first argument."""
     include_stop_str_in_output: bool = False
+    """Whether to include the stop strings in output text."""
     truncate_prompt_tokens: Optional[Annotated[int, msgspec.Meta(ge=1)]] = None
+    """If set to -1, will use the truncation size supported by the model. If
+    set to an integer k, will use only the last k tokens from the prompt
+    (i.e., left truncation). If set to `None`, truncation is disabled."""
     output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
 
     # The below fields are not supposed to be used as an input.
@@ -219,12 +195,24 @@ class SamplingParams(
 
     # Fields used to construct logits processors
     guided_decoding: Optional[GuidedDecodingParams] = None
+    """If provided, the engine will construct a guided decoding logits
+    processor from these parameters."""
     logit_bias: Optional[dict[int, float]] = None
+    """If provided, the engine will construct a logits processor that applies
+    these logit biases."""
     allowed_token_ids: Optional[list[int]] = None
+    """If provided, the engine will construct a logits processor which only
+    retains scores for the given token ids."""
     extra_args: Optional[dict[str, Any]] = None
+    """Arbitrary additional args, that can be used by custom sampling
+    implementations, plugins, etc. Not used by any in-tree sampling
+    implementations."""
 
     # Fields used for bad words
     bad_words: Optional[list[str]] = None
+    """Words that are not allowed to be generated. More precisely, only the
+    last token of a corresponding token sequence is not allowed when the next
+    generated token can complete the sequence."""
     _bad_words_token_ids: Optional[list[list[int]]] = None
 
     @staticmethod
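Since these two hunks move the field documentation onto the fields themselves, here is a small usage sketch exercising several of those fields; the values are arbitrary examples:

```python
from vllm import SamplingParams

# Arbitrary example values for a few of the fields documented above.
params = SamplingParams(
    n=2,                      # return two sequences per prompt
    temperature=0.7,          # lower = more deterministic
    top_p=0.9,                # nucleus sampling cutoff
    top_k=40,                 # consider only the 40 most likely tokens
    max_tokens=128,           # cap on generated tokens per sequence
    stop=["\n\n"],            # stop strings are excluded from the output
    seed=42,                  # reproducible sampling
    logit_bias={1234: -5.0},  # bias a specific token id downwards
)
print(params)
```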