[Docs] Fix warnings in docs build (#22588)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

parent d411df0296
commit 00976db0c3
@@ -1,7 +1,5 @@
 # Summary
 
-[](){ #configuration }
-
 ## Configuration
 
 API documentation for vLLM's configuration classes.
@@ -96,7 +96,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ
 
 ### Tune your workloads
 
-Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
+Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
 
 ### Future Topics We'll Cover
@@ -540,8 +540,10 @@ return a schema of the tensors outputted by the HF processor that are related to
 The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
 `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
 
-In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
-we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+In order to support the use of
+[MultiModalFieldConfig.batched][vllm.multimodal.inputs.MultiModalFieldConfig.batched]
+like in LLaVA, we remove the extra batch dimension by overriding
+[BaseMultiModalProcessor._call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor]:
 
 ??? code
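For orientation (not part of this commit), here is a minimal sketch of the kind of `_call_hf_processor` override the hunk above refers to. The class name is a placeholder and the exact base-class signature can vary between vLLM versions, so treat this as an illustration rather than the in-tree Fuyu implementation.

```python
from vllm.multimodal.processing import BaseMultiModalProcessor


class MyFuyuMultiModalProcessor(BaseMultiModalProcessor):  # placeholder name
    def _call_hf_processor(self, prompt, *args, **kwargs):
        # Delegate to the base class, which invokes the HF processor.
        processed_outputs = super()._call_hf_processor(prompt, *args, **kwargs)

        if "image_patches" in processed_outputs:
            # Drop the leading batch dim:
            # (1, num_images, num_patches, patch_dim) -> (num_images, num_patches, patch_dim)
            # so that MultiModalFieldConfig.batched can treat dim 0 as the image batch.
            processed_outputs["image_patches"] = processed_outputs["image_patches"][0]

        return processed_outputs
```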
@@ -816,7 +818,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
 After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
 [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
 and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
-decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.processing.MultiModalRegistry.register_processor]
+decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
 to register them to the multi-modal registry:
 
 ```diff
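As a schematic reminder of the registration pattern referenced above, a sketch follows. The `My*` names are placeholder stubs, not part of this commit; a real model must implement the abstract methods of each base class and the multi-modal model interface.

```python
from torch import nn

from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.processing import BaseMultiModalProcessor, BaseProcessingInfo
from vllm.multimodal.profiling import BaseDummyInputsBuilder


class MyProcessingInfo(BaseProcessingInfo):            # Step 2 (stub)
    ...


class MyDummyInputsBuilder(BaseDummyInputsBuilder):    # Step 3 (stub)
    ...


class MyMultiModalProcessor(BaseMultiModalProcessor):  # Step 4 (stub)
    ...


# Step 5: tie the three pieces to the model class in the multi-modal registry.
@MULTIMODAL_REGISTRY.register_processor(MyMultiModalProcessor,
                                        info=MyProcessingInfo,
                                        dummy_inputs=MyDummyInputsBuilder)
class MyModelForConditionalGeneration(nn.Module):
    ...
```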
@@ -4,7 +4,7 @@ vLLM provides first-class support for generative models, which covers most of LL
 
 In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
+which are then passed through [Sampler][vllm.model_executor.layers.sampler.Sampler] to obtain the final text.
 
 ## Configuration
@@ -19,7 +19,7 @@ Run a model in generation mode via the option `--runner generate`.
 ## Offline Inference
 
 The [LLM][vllm.LLM] class provides various methods for offline inference.
-See [configuration][configuration] for a list of options when initializing the model.
+See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
 
 ### `LLM.generate`
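For readers following along, a minimal offline-inference sketch of the `LLM.generate` path this hunk documents; the model name and sampling values are illustrative only.

```python
from vllm import LLM, SamplingParams

# Any generative model supported by vLLM works here; this one is just small.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    # Each RequestOutput pairs the prompt with its generated completions.
    print(output.prompt, output.outputs[0].text)
```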
@@ -81,7 +81,7 @@ which takes priority over both the model's and Sentence Transformers's defaults.
 ## Offline Inference
 
 The [LLM][vllm.LLM] class provides various methods for offline inference.
-See [configuration][configuration] for a list of options when initializing the model.
+See [configuration](../api/summary.md#configuration) for a list of options when initializing the model.
 
 ### `LLM.embed`
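A minimal sketch of `LLM.embed`, assuming an embedding checkpoint such as `intfloat/e5-mistral-7b-instruct` (an assumption, not taken from this diff); depending on the vLLM version you may also need to select the pooling runner explicitly.

```python
from vllm import LLM

# Assumed example checkpoint; any pooling/embedding model should behave similarly.
llm = LLM(model="intfloat/e5-mistral-7b-instruct")

outputs = llm.embed(["A sentence to embed", "Another sentence"])
for output in outputs:
    # Each result carries one embedding vector (a list of floats).
    print(len(output.outputs.embedding))
```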
@@ -770,7 +770,7 @@ The following table lists those that are tested in vLLM.
 Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
 These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
 
-| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][parallelism-scaling] | [V1](gh-issue:8779) |
+| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) | [V1](gh-issue:8779) |
 |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
 | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
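As a hedged illustration of the `LLM.score` API these models target (the checkpoint name is an assumption, and the exact output fields may differ slightly across vLLM versions):

```python
from vllm import LLM

# Assumed reranker checkpoint for illustration.
llm = LLM(model="BAAI/bge-reranker-v2-m3")

# Score one query against several candidate passages.
outputs = llm.score(
    "What is vLLM?",
    ["vLLM is a high-throughput inference engine for LLMs.",
     "Bananas are a good source of potassium."],
)
for output in outputs:
    print(output.outputs.score)
```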
vllm/attention/layers/__init__.py (new file, 0 lines)
@@ -1,10 +1,11 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 
-from .data import (DecoderOnlyInputs, EmbedsInputs, EncoderDecoderInputs,
-                   ExplicitEncoderDecoderPrompt, ProcessorInputs, PromptType,
-                   SingletonInputs, SingletonPrompt, TextPrompt, TokenInputs,
-                   TokensPrompt, build_explicit_enc_dec_prompt, embeds_inputs,
+from .data import (DecoderOnlyInputs, EmbedsInputs, EmbedsPrompt,
+                   EncoderDecoderInputs, ExplicitEncoderDecoderPrompt,
+                   ProcessorInputs, PromptType, SingletonInputs,
+                   SingletonPrompt, TextPrompt, TokenInputs, TokensPrompt,
+                   build_explicit_enc_dec_prompt, embeds_inputs,
                    to_enc_dec_tuple_list, token_inputs, zip_enc_dec_prompts)
 from .registry import (DummyData, InputContext, InputProcessingContext,
                        InputRegistry)
@@ -24,6 +25,7 @@ __all__ = [
     "ExplicitEncoderDecoderPrompt",
     "TokenInputs",
     "EmbedsInputs",
+    "EmbedsPrompt",
     "token_inputs",
     "embeds_inputs",
     "DecoderOnlyInputs",
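To show how the re-exported input types above are used, here is a small sketch. The model choice is illustrative; `EmbedsPrompt`, newly exported in this hunk, works analogously but takes a tensor of prompt embeddings (the exact field name is an assumption, so check `vllm.inputs.data` before relying on it).

```python
from vllm import LLM
from vllm.inputs import TextPrompt, TokensPrompt

llm = LLM(model="facebook/opt-125m")  # illustrative model choice
tokenizer = llm.get_tokenizer()

# Two equivalent ways of describing the same request:
prompts = [
    TextPrompt(prompt="Hello, my name is"),
    TokensPrompt(prompt_token_ids=tokenizer.encode("Hello, my name is")),
]

for output in llm.generate(prompts):
    print(output.outputs[0].text)
```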
vllm/model_executor/warmup/__init__.py (new file, 0 lines)
@@ -103,113 +103,89 @@ class SamplingParams(
     Overall, we follow the sampling parameters from the OpenAI text completion
     API (https://platform.openai.com/docs/api-reference/completions/create).
     In addition, we support beam search, which is not supported by OpenAI.
 
-    Args:
-        n: Number of output sequences to return for the given prompt.
-        best_of: Number of output sequences that are generated from the prompt.
-            From these `best_of` sequences, the top `n` sequences are returned.
-            `best_of` must be greater than or equal to `n`. By default,
-            `best_of` is set to `n`. Warning, this is only supported in V0.
-        presence_penalty: Float that penalizes new tokens based on whether they
-            appear in the generated text so far. Values > 0 encourage the model
-            to use new tokens, while values < 0 encourage the model to repeat
-            tokens.
-        frequency_penalty: Float that penalizes new tokens based on their
-            frequency in the generated text so far. Values > 0 encourage the
-            model to use new tokens, while values < 0 encourage the model to
-            repeat tokens.
-        repetition_penalty: Float that penalizes new tokens based on whether
-            they appear in the prompt and the generated text so far. Values > 1
-            encourage the model to use new tokens, while values < 1 encourage
-            the model to repeat tokens.
-        temperature: Float that controls the randomness of the sampling. Lower
-            values make the model more deterministic, while higher values make
-            the model more random. Zero means greedy sampling.
-        top_p: Float that controls the cumulative probability of the top tokens
-            to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
-        top_k: Integer that controls the number of top tokens to consider. Set
-            to 0 (or -1) to consider all tokens.
-        min_p: Float that represents the minimum probability for a token to be
-            considered, relative to the probability of the most likely token.
-            Must be in [0, 1]. Set to 0 to disable this.
-        seed: Random seed to use for the generation.
-        stop: list of strings that stop the generation when they are generated.
-            The returned output will not contain the stop strings.
-        stop_token_ids: list of tokens that stop the generation when they are
-            generated. The returned output will contain the stop tokens unless
-            the stop tokens are special tokens.
-        bad_words: list of words that are not allowed to be generated.
-            More precisely, only the last token of a corresponding
-            token sequence is not allowed when the next generated token
-            can complete the sequence.
-        include_stop_str_in_output: Whether to include the stop strings in
-            output text. Defaults to False.
-        ignore_eos: Whether to ignore the EOS token and continue generating
-            tokens after the EOS token is generated.
-        max_tokens: Maximum number of tokens to generate per output sequence.
-        min_tokens: Minimum number of tokens to generate per output sequence
-            before EOS or stop_token_ids can be generated
-        logprobs: Number of log probabilities to return per output token.
-            When set to None, no probability is returned. If set to a non-None
-            value, the result includes the log probabilities of the specified
-            number of most likely tokens, as well as the chosen tokens.
-            Note that the implementation follows the OpenAI API: The API will
-            always return the log probability of the sampled token, so there
-            may be up to `logprobs+1` elements in the response.
-            When set to -1, return all `vocab_size` log probabilities.
-        prompt_logprobs: Number of log probabilities to return per prompt token.
-        detokenize: Whether to detokenize the output. Defaults to True.
-        skip_special_tokens: Whether to skip special tokens in the output.
-        spaces_between_special_tokens: Whether to add spaces between special
-            tokens in the output. Defaults to True.
-        logits_processors: list of functions that modify logits based on
-            previously generated tokens, and optionally prompt tokens as
-            a first argument.
-        truncate_prompt_tokens: If set to -1, will use the truncation size
-            supported by the model. If set to an integer k, will use only
-            the last k tokens from the prompt (i.e., left truncation).
-            Defaults to None (i.e., no truncation).
-        guided_decoding: If provided, the engine will construct a guided
-            decoding logits processor from these parameters. Defaults to None.
-        logit_bias: If provided, the engine will construct a logits processor
-            that applies these logit biases. Defaults to None.
-        allowed_token_ids: If provided, the engine will construct a logits
-            processor which only retains scores for the given token ids.
-            Defaults to None.
-        extra_args: Arbitrary additional args, that can be used by custom
-            sampling implementations, plugins, etc. Not used by any in-tree
-            sampling implementations.
     """
 
     n: int = 1
+    """Number of output sequences to return for the given prompt."""
     best_of: Optional[int] = None
+    """Number of output sequences that are generated from the prompt. From
+    these `best_of` sequences, the top `n` sequences are returned. `best_of`
+    must be greater than or equal to `n`. By default, `best_of` is set to `n`.
+    Warning, this is only supported in V0."""
     _real_n: Optional[int] = None
     presence_penalty: float = 0.0
+    """Penalizes new tokens based on whether they appear in the generated text
+    so far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     frequency_penalty: float = 0.0
+    """Penalizes new tokens based on their frequency in the generated text so
+    far. Values > 0 encourage the model to use new tokens, while values < 0
+    encourage the model to repeat tokens."""
     repetition_penalty: float = 1.0
+    """Penalizes new tokens based on whether they appear in the prompt and the
+    generated text so far. Values > 1 encourage the model to use new tokens,
+    while values < 1 encourage the model to repeat tokens."""
     temperature: float = 1.0
+    """Controls the randomness of the sampling. Lower values make the model
+    more deterministic, while higher values make the model more random. Zero
+    means greedy sampling."""
     top_p: float = 1.0
+    """Controls the cumulative probability of the top tokens to consider. Must
+    be in (0, 1]. Set to 1 to consider all tokens."""
     top_k: int = 0
+    """Controls the number of top tokens to consider. Set to 0 (or -1) to
+    consider all tokens."""
     min_p: float = 0.0
+    """Represents the minimum probability for a token to be considered,
+    relative to the probability of the most likely token. Must be in [0, 1].
+    Set to 0 to disable this."""
     seed: Optional[int] = None
+    """Random seed to use for the generation."""
     stop: Optional[Union[str, list[str]]] = None
+    """String(s) that stop the generation when they are generated. The returned
+    output will not contain the stop strings."""
     stop_token_ids: Optional[list[int]] = None
+    """Token IDs that stop the generation when they are generated. The returned
+    output will contain the stop tokens unless the stop tokens are special
+    tokens."""
     ignore_eos: bool = False
+    """Whether to ignore the EOS token and continue generating
+    tokens after the EOS token is generated."""
     max_tokens: Optional[int] = 16
+    """Maximum number of tokens to generate per output sequence."""
     min_tokens: int = 0
+    """Minimum number of tokens to generate per output sequence before EOS or
+    `stop_token_ids` can be generated"""
     logprobs: Optional[int] = None
+    """Number of log probabilities to return per output token. When set to
+    `None`, no probability is returned. If set to a non-`None` value, the
+    result includes the log probabilities of the specified number of most
+    likely tokens, as well as the chosen tokens. Note that the implementation
+    follows the OpenAI API: The API will always return the log probability of
+    the sampled token, so there may be up to `logprobs+1` elements in the
+    response. When set to -1, return all `vocab_size` log probabilities."""
     prompt_logprobs: Optional[int] = None
+    """Number of log probabilities to return per prompt token."""
     # NOTE: This parameter is only exposed at the engine level for now.
     # It is not exposed in the OpenAI API server, as the OpenAI API does
     # not support returning only a list of token IDs.
     detokenize: bool = True
+    """Whether to detokenize the output."""
     skip_special_tokens: bool = True
+    """Whether to skip special tokens in the output."""
     spaces_between_special_tokens: bool = True
+    """Whether to add spaces between special tokens in the output."""
     # Optional[list[LogitsProcessor]] type. We use Any here because
     # Optional[list[LogitsProcessor]] type is not supported by msgspec.
     logits_processors: Optional[Any] = None
+    """Functions that modify logits based on previously generated tokens, and
+    optionally prompt tokens as a first argument."""
     include_stop_str_in_output: bool = False
+    """Whether to include the stop strings in output text."""
     truncate_prompt_tokens: Optional[Annotated[int, msgspec.Meta(ge=1)]] = None
+    """If set to -1, will use the truncation size supported by the model. If
+    set to an integer k, will use only the last k tokens from the prompt
+    (i.e., left truncation). If set to `None`, truncation is disabled."""
     output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
 
     # The below fields are not supposed to be used as an input.
@@ -219,12 +195,24 @@ class SamplingParams(
 
     # Fields used to construct logits processors
     guided_decoding: Optional[GuidedDecodingParams] = None
+    """If provided, the engine will construct a guided decoding logits
+    processor from these parameters."""
     logit_bias: Optional[dict[int, float]] = None
+    """If provided, the engine will construct a logits processor that applies
+    these logit biases."""
     allowed_token_ids: Optional[list[int]] = None
+    """If provided, the engine will construct a logits processor which only
+    retains scores for the given token ids."""
     extra_args: Optional[dict[str, Any]] = None
+    """Arbitrary additional args, that can be used by custom sampling
+    implementations, plugins, etc. Not used by any in-tree sampling
+    implementations."""
 
     # Fields used for bad words
     bad_words: Optional[list[str]] = None
+    """Words that are not allowed to be generated. More precisely, only the
+    last token of a corresponding token sequence is not allowed when the next
+    generated token can complete the sequence."""
     _bad_words_token_ids: Optional[list[list[int]]] = None
 
     @staticmethod
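For completeness, a short usage sketch of the fields whose documentation moves in the hunks above; the values are arbitrary examples, not recommendations.

```python
from vllm import SamplingParams

params = SamplingParams(
    n=2,              # return the top 2 sequences
    temperature=0.7,  # 0 would mean greedy sampling
    top_p=0.9,        # nucleus sampling over the top 90% of probability mass
    top_k=50,         # consider only the 50 most likely tokens
    max_tokens=64,    # cap the length of each output sequence
    stop=["\n\n"],    # stop strings are excluded from the output
    seed=42,          # make sampling reproducible
    logprobs=5,       # also return log-probs of the 5 most likely tokens
)
```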