[Deprecation][2/N] Replace --task with --runner and --convert (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Cyrus Leung 2025-07-28 10:42:40 +08:00 committed by GitHub
parent 8f605ee309
commit 86ae693f20
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
94 changed files with 1117 additions and 1083 deletions


@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:
```bash
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
```
@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
First, launch the OpenAI-compatible server:
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
```
Then, you can use the OpenAI client as follows:


@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
First, launch the OpenAI-compatible server:
```bash
vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
--max-model-len 4096 --enable-prompt-embeds
```


@ -2,12 +2,19 @@
vLLM provides first-class support for generative models, which cover most LLMs.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
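To make the log-probabilities-to-text step concrete, here is a minimal, self-contained sketch of greedy sampling from a logit vector over a toy vocabulary. It is purely illustrative and does not reflect vLLM's actual `Sampler` implementation:

```python
import math

def greedy_sample(logits: list[float], vocab: list[str]) -> str:
    # Convert raw logits to log probabilities (log-softmax),
    # then pick the most likely token (greedy sampling).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    best = max(range(len(vocab)), key=lambda i: log_probs[i])
    return vocab[best]

vocab = ["Paris", "London", "cat"]
print(greedy_sample([3.1, 1.2, -0.5], vocab))  # highest logit wins -> "Paris"
```

Real samplers additionally support temperature, top-k/top-p filtering, and stochastic sampling, but the greedy case captures the core data flow from hidden states to text.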
For generative models, the only supported `--task` option is `"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
## Configuration
### Model Runner (`--runner`)
Run a model in generation mode via the option `--runner generate`.
!!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
## Offline Inference


@ -1,9 +1,9 @@
# Pooling Models
vLLM also supports pooling models, including embedding, reranking and reward models.
vLLM also supports pooling models, such as embedding, classification and reward models.
In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.
!!! note
@ -11,18 +11,39 @@ before returning them.
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
If the model doesn't implement this interface, you can set `--task` which tells vLLM
to convert the model into a pooling model.
## Configuration
| `--task` | Model type | Supported pooling tasks |
|------------|----------------------|-------------------------------|
| `embed` | Embedding model | `encode`, `embed` |
| `classify` | Classification model | `encode`, `classify`, `score` |
| `reward` | Reward model | `encode` |
### Model Runner
## Pooling Tasks
Run a model in pooling mode via the option `--runner pooling`.
In vLLM, we define the following pooling tasks and corresponding APIs:
!!! tip
There is no need to set this option in the vast majority of cases as vLLM can automatically
detect the model runner to use via `--runner auto`.
### Model Conversion
vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.
| Architecture | `--convert` | Supported pooling tasks |
|-------------------------------------------------|-------------|-------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
!!! tip
You can explicitly set `--convert <type>` to specify how to convert the model.
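The architecture-name matching in the table above can be sketched with glob patterns. This is an illustrative reimplementation, not vLLM's actual resolution code; note that the ordering matters, since the catch-all `*Model` would otherwise shadow the more specific classification and reward patterns:

```python
from fnmatch import fnmatch

# Pattern -> convert type, taken from the table above. Specific patterns
# are listed before the catch-all "*Model" so that e.g. a
# "*ClassificationModel" architecture is converted to "classify".
_CONVERT_RULES = [
    ("*For*Classification", "classify"),
    ("*ClassificationModel", "classify"),
    ("*ForRewardModeling", "reward"),
    ("*RewardModel", "reward"),
    ("*ForTextEncoding", "embed"),
    ("*EmbeddingModel", "embed"),
    ("*Model", "embed"),
]

def infer_convert(architecture):
    # Return the inferred `--convert` type, or None if no pattern matches.
    for pattern, convert in _CONVERT_RULES:
        if fnmatch(architecture, pattern):
            return convert
    return None

print(infer_convert("Qwen2ForSequenceClassification"))  # -> classify
print(infer_convert("InternLM2ForRewardModel"))         # -> reward
print(infer_convert("LlamaModel"))                      # -> embed
```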
### Pooling Tasks
Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:
| Task | APIs |
|------------|--------------------|
@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
| `classify` | `classify` |
| `score` | `score` |
\*The `score` API falls back to `embed` task if the model does not support `score` task.
\* The `score` API falls back to `embed` task if the model does not support `score` task.
Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
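When the `score` API falls back to the `embed` task, a score for a text pair is derived from the two embeddings rather than from a dedicated scoring head. A common choice for such a similarity is cosine similarity; the sketch below illustrates the idea only, as the actual fallback behavior is defined by vLLM:

```python
import math

def cosine_score(a: list[float], b: list[float]) -> float:
    # Score a text pair from its two embedding vectors using
    # cosine similarity (dot product over the product of norms).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

print(cosine_score([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
print(cosine_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors -> 0.0
```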
### Pooler Configuration
By default, the pooler assigned to each task has the following attributes:
#### Predefined models
If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--override-pooler-config` option.
#### Converted models
If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax |
|------------|----------------|---------------|---------|
@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
| `embed` | `LAST` | ✅︎ | ❌ |
| `classify` | `LAST` | ❌ | ✅︎ |
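The semantics of the `embed` and `classify` rows above can be sketched as follows: take the hidden state of the last token, then either L2-normalize it (for embeddings) or apply softmax (for class probabilities). This is a toy illustration of what the table's attributes mean, not vLLM's pooler code:

```python
import math

def pool_last(hidden_states, normalize, softmax):
    # "LAST" pooling: take the hidden state of the final token.
    pooled = hidden_states[-1]
    if normalize:  # default for `embed`: L2-normalize the vector
        norm = math.sqrt(sum(x * x for x in pooled))
        pooled = [x / norm for x in pooled]
    if softmax:    # default for `classify`: turn logits into probabilities
        m = max(pooled)
        exps = [math.exp(x - m) for x in pooled]
        total = sum(exps)
        pooled = [e / total for e in exps]
    return pooled

states = [[0.5, 1.5], [2.0, 0.0]]  # one hidden state per token
print(pool_last(states, normalize=True, softmax=False))  # embed-style -> [1.0, 0.0]
print(pool_last(states, normalize=False, softmax=True))  # classify-style (sums to 1)
```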
These defaults may be overridden by the model's implementation in vLLM.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
which takes priority over the model's defaults.
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
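For reference, a Sentence Transformers `modules.json` is a JSON list of module entries; the presence of `Pooling` and `Normalize` modules is the kind of information used to override the pooler defaults. The file contents below are illustrative of the typical layout, and the actual override logic lives in vLLM:

```python
import json

# A typical Sentence Transformers `modules.json` (illustrative contents):
modules_json = """
[
  {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
  {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
  {"idx": 2, "name": "2", "path": "2_Normalize", "type": "sentence_transformers.models.Normalize"}
]
"""

modules = json.loads(modules_json)
# Detect which post-processing modules the checkpoint declares.
has_pooling = any(m["type"].endswith(".Pooling") for m in modules)
has_normalize = any(m["type"].endswith(".Normalize") for m in modules)
print(has_pooling, has_normalize)  # True True
```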
!!! note
The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
that is not based on [PoolerConfig][vllm.config.PoolerConfig].
## Offline Inference
The [LLM][vllm.LLM] class provides various methods for offline inference.
@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
```python
from vllm import LLM
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is")
data = output.outputs.data
@ -85,7 +106,7 @@ It is primarily designed for embedding models.
```python
from vllm import LLM
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is")
embeds = output.outputs.embedding
@ -102,7 +123,7 @@ It is primarily designed for classification models.
```python
from vllm import LLM
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")
probs = output.outputs.probs
@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
```python
from vllm import LLM
llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.")
@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
from vllm import LLM, PoolingParams
llm = LLM(model="jinaai/jina-embeddings-v3",
task="embed",
runner="pooling",
trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."],
pooling_params=PoolingParams(dimensions=32))
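Matryoshka embeddings are typically shortened by keeping only the leading components and re-normalizing. The sketch below shows that truncation step in isolation, under the assumption that the model was trained so its leading dimensions carry most of the signal; the actual dimension handling is done by the model and `PoolingParams`:

```python
import math

def shorten(embedding, dimensions):
    # Matryoshka-style truncation: keep the first `dimensions` components,
    # then re-normalize so the shorter vector is still unit length.
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

emb = [0.6, 0.8, 0.0, 0.0]
print(shorten(emb, 2))  # the first two components, re-normalized
```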


@ -1,7 +1,6 @@
# Supported Models
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument.
For each task, we list the model architectures that have been implemented in vLLM.
Alongside each architecture, we include some popular models that use it.
@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:
```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
llm = LLM(model=...) # Name or path of your model
llm.apply_model(lambda model: print(type(model)))
```
@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
```python
from vllm import LLM
# For generative models (task=generate) only
llm = LLM(model=..., task="generate") # Name or path of your model
# For generative models (runner=generate) only
llm = LLM(model=..., runner="generate") # Name or path of your model
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
llm = LLM(model=..., task="embed") # Name or path of your model
# For pooling models (runner=pooling) only
llm = LLM(model=..., runner="pooling") # Name or path of your model
output = llm.encode("Hello, my name is")
print(output)
```
@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.
```python
from vllm import LLM
llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)
# For generative models (task=generate) only
# For generative models (runner=generate) only
output = llm.generate("Hello, my name is")
print(output)
# For pooling models (task={embed,classify,reward,score}) only
# For pooling models (runner=pooling) only
output = llm.encode("Hello, my name is")
print(output)
```
@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
Specified using `--task generate`.
<style>
th {
white-space: nowrap;
@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding
Specified using `--task embed`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
!!! note
`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding
#### Reward Modeling
Specified using `--task reward`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using
[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the
#### Classification
Specified using `--task classify`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
If your model is not in the above list, we will try to automatically convert the model using
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
#### Sentence Pair Scoring
Specified using `--task score`.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
| Architecture | Models | Example HF Models | [V1](gh-issue:8779) |
|--------------|--------|-------------------|---------------------|
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
!!! note
Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat
#### Text Generation
Specified using `--task generate`.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
#### Transcription
Specified using `--task transcription`.
Speech2Text models trained specifically for Automatic Speech Recognition.
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling
!!! important
Since some model architectures support both generative and pooling tasks,
you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
#### Text Embedding
Specified using `--task embed`.
Any text generation model can be converted into an embedding model by passing `--task embed`.
!!! note
To get the best results, you should use pooling models that are specifically trained as such.
@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM.
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
---
#### Scoring
Specified using `--task score`.
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
## Model Support Policy
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here's how we manage third-party model support:


@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
We currently support the following OpenAI APIs:
- [Completions API][completions-api] (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
- Only applicable to [text generation models](../models/generative_models.md).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
- Only applicable to [embedding models](../models/pooling_models.md).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translation API][translations-api] (`/v1/audio/translations`)
- Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
In addition, we have the following custom APIs:
@ -64,14 +64,14 @@ In addition, we have the following custom APIs:
- [Pooling API][pooling-api] (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API][score-api] (`/score`)
- Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
- Only applicable to [cross-encoder models](../models/pooling_models.md).
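As a sketch of what a Re-rank request body looks like, the helper below builds a Jina-style payload. The field names (`model`, `query`, `documents`, `top_n`) follow Jina's v1 re-rank API linked above; verify them against that spec before relying on this shape:

```python
import json

def build_rerank_request(model, query, documents, top_n=None):
    # Minimal request body for POST /rerank (Jina-style). `top_n`, when
    # given, limits how many ranked documents are returned.
    body = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        body["top_n"] = top_n
    return json.dumps(body)

payload = build_rerank_request(
    "BAAI/bge-reranker-v2-m3",
    "What is the capital of France?",
    ["Paris is the capital of France.", "Brasilia is the capital of Brazil."],
    top_n=1,
)
print(payload)
```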
[](){ #chat-template }
@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
To serve the model:
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec.jinja
```
!!! important
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
To serve the model:
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
```
!!! important
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>


@ -12,7 +12,9 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True
model="jason9693/Qwen2.5-1.5B-apeach",
runner="pooling",
enforce_eager=True,
)
return parser.parse_args()
@ -27,7 +29,7 @@ def main(args: Namespace):
]
# Create an LLM.
# You should pass task="classify" for classification models
# You should pass runner="pooling" for classification models
llm = LLM(**vars(args))
# Generate logits. The output is a list of ClassificationRequestOutputs.


@ -13,7 +13,7 @@ def parse_args():
# Set example specific arguments
parser.set_defaults(
model="intfloat/e5-mistral-7b-instruct",
task="embed",
runner="pooling",
enforce_eager=True,
max_model_len=1024,
)
@ -30,7 +30,7 @@ def main(args: Namespace):
]
# Create an LLM.
# You should pass task="embed" for embedding models
# You should pass runner="pooling" for embedding models
llm = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.


@ -12,7 +12,9 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
model="BAAI/bge-reranker-v2-m3",
runner="pooling",
enforce_eager=True,
)
return parser.parse_args()
@ -26,7 +28,7 @@ def main(args: Namespace):
]
# Create an LLM.
# You should pass task="score" for cross-encoder models
# You should pass runner="pooling" for cross-encoder models
llm = LLM(**vars(args))
# Generate scores. The output is a list of ScoringRequestOutputs.


@ -12,7 +12,9 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
model="jinaai/jina-embeddings-v3",
runner="pooling",
trust_remote_code=True,
)
return parser.parse_args()
@ -29,7 +31,7 @@ def main(args: Namespace):
]
# Create an LLM.
# You should pass task="embed" for embedding models
# You should pass runner="pooling" for embedding models
llm = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.


@ -12,7 +12,9 @@ def parse_args():
parser = EngineArgs.add_cli_args(parser)
# Set example specific arguments
parser.set_defaults(
model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
model="jinaai/jina-embeddings-v3",
runner="pooling",
trust_remote_code=True,
)
return parser.parse_args()
@ -29,7 +31,7 @@ def main(args: Namespace):
]
# Create an LLM.
# You should pass task="embed" for embedding models
# You should pass runner="pooling" for embedding models
llm = LLM(**vars(args))
# Generate embedding. The output is a list of EmbeddingRequestOutputs.


@ -17,7 +17,7 @@ model_name = "Qwen/Qwen3-Reranker-0.6B"
# Models converted offline using this method can not only be more efficient
# and support the vLLM score API, but also have more concise init
# parameters. For example:
# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")
# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", runner="pooling")
# If you want to load the official original version, the init parameters are
# as follows.
@ -27,7 +27,7 @@ def get_llm() -> LLM:
"""Initializes and returns the LLM model for Qwen3-Reranker."""
return LLM(
model=model_name,
task="score",
runner="pooling",
hf_overrides={
"architectures": ["Qwen3ForSequenceClassification"],
"classifier_from_token": ["no", "yes"],


@ -70,7 +70,7 @@ def run_e5_v(query: Query) -> ModelRequestData:
engine_args = EngineArgs(
model="royokong/e5-v",
task="embed",
runner="pooling",
max_model_len=4096,
limit_mm_per_prompt={"image": 1},
)
@ -102,7 +102,7 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
engine_args = EngineArgs(
model="TIGER-Lab/VLM2Vec-Full",
task="embed",
runner="pooling",
max_model_len=4096,
trust_remote_code=True,
mm_processor_kwargs={"num_crops": 4},
@ -122,7 +122,7 @@ def run_jinavl_reranker(query: Query) -> ModelRequestData:
engine_args = EngineArgs(
model="jinaai/jina-reranker-m0",
task="score",
runner="pooling",
max_model_len=32768,
trust_remote_code=True,
mm_processor_kwargs={


@ -9,7 +9,7 @@ Launch the vLLM server with the following command:
vllm serve llava-hf/llava-1.5-7b-hf
(multi-image inference with Phi-3.5-vision-instruct)
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
(audio inference with Ultravox)


@ -92,7 +92,7 @@ def dse_qwen2_vl(inp: dict):
def parse_args():
parser = argparse.ArgumentParser(
"Script to call a specified VLM through the API. Make sure to serve "
"the model with --task embed before running this."
"the model with `--runner pooling` before running this."
)
parser.add_argument(
"--model",

View File

@ -3,7 +3,7 @@
"""
Example online usage of Score API.
Run `vllm serve <model> --task score` to start up the server in vLLM.
Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
"""
import argparse

View File

@ -3,7 +3,7 @@
"""
Example online usage of Score API.
Run `vllm serve <model> --task score` to start up the server in vLLM.
Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
"""
import argparse

View File

@ -3,7 +3,7 @@
"""
Example online usage of Pooling API.
Run `vllm serve <model> --task <embed|classify|reward|score>`
Run `vllm serve <model> --runner pooling`
to start up the server in vLLM.
"""

View File
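The doc hunks above apply the same substitution to `vllm serve` command lines. A hypothetical migration helper (not part of vLLM) that performs the textual rewrite this commit makes by hand across the examples:

```python
def rewrite_serve_command(argv: list[str]) -> list[str]:
    """Rewrite a legacy `vllm serve ... --task X` invocation to `--runner`.

    Hypothetical helper for illustration: it applies the same substitution
    this commit makes in the docs (pooling-style tasks collapse to one
    "pooling" runner value).
    """
    runner_for_task = {
        "generate": "generate",
        "embed": "pooling",
        "classify": "pooling",
        "reward": "pooling",
        "score": "pooling",
    }
    out: list[str] = []
    it = iter(argv)
    for arg in it:
        if arg == "--task":
            task = next(it)  # consume the task value that follows the flag
            out.extend(["--runner", runner_for_task.get(task, task)])
        else:
            out.append(arg)
    return out
```

Applied to the Score API example above, `vllm serve <model> --task score` becomes `vllm serve <model> --runner pooling`.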

@ -10,7 +10,7 @@ This script demonstrates how to:
Run the vLLM server first:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--task generate \
--runner generate \
--max-model-len 4096 \
--enable-prompt-embeds

View File

@ -148,9 +148,6 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int,
# in the vllm_config, it's not really used.
model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
vllm_config.model_config = ModelConfig(model=model_name,
task="auto",
tokenizer=model_name,
tokenizer_mode="auto",
trust_remote_code=True,
dtype=dtype,
seed=42)

View File

@ -62,8 +62,8 @@ class TestSetting:
TestSetting(
model="BAAI/bge-multilingual-gemma2",
model_args=[
"--task", "embed", "--dtype", "bfloat16", "--max-model-len",
"2048"
"--runner", "pooling", "--dtype", "bfloat16",
"--max-model-len", "2048"
],
pp_size=1,
tp_size=1,
@ -75,7 +75,7 @@ class TestSetting:
# # encoder-based embedding model (BERT)
# TestSetting(
# model="BAAI/bge-base-en-v1.5",
# model_args=["--task", "embed"],
# model_args=["--runner", "pooling"],
# pp_size=1,
# tp_size=1,
# attn_backend="XFORMERS",

View File

@ -125,9 +125,6 @@ def all_reduce_fusion_pass_on_test_model(local_rank: int, world_size: int,
# in the vllm_config, it's not really used.
model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
vllm_config.model_config = ModelConfig(model=model_name,
task="auto",
tokenizer=model_name,
tokenizer_mode="auto",
trust_remote_code=True,
dtype=dtype,
seed=42)

View File

@ -250,9 +250,6 @@ def sequence_parallelism_pass_on_test_model(
# in the vllm_config, it's not really used.
model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
vllm_config.model_config = ModelConfig(model=model_name,
task="auto",
tokenizer=model_name,
tokenizer_mode="auto",
trust_remote_code=True,
dtype=dtype,
seed=42)

View File

@ -23,7 +23,7 @@ from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.config import TaskOption, _get_and_verify_dtype
from vllm.config import ConvertOption, RunnerOption, _get_and_verify_dtype
from vllm.connections import global_http_connection
from vllm.distributed import (cleanup_dist_env_and_memory,
init_distributed_environment,
@ -769,7 +769,8 @@ class VllmRunner:
def __init__(
self,
model_name: str,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
convert: ConvertOption = "auto",
tokenizer_name: Optional[str] = None,
tokenizer_mode: str = "auto",
trust_remote_code: bool = True,
@ -786,7 +787,8 @@ class VllmRunner:
) -> None:
self.llm = LLM(
model=model_name,
task=task,
runner=runner,
convert=convert,
tokenizer=tokenizer_name,
tokenizer_mode=tokenizer_mode,
trust_remote_code=trust_remote_code,

View File
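The `VllmRunner` hunk above shows the shape of the new constructor: the single `task` parameter becomes a `runner`/`convert` pair, both defaulting to `"auto"` and forwarded to `LLM(...)`. A standalone sketch of that signature change (the class here is hypothetical, standing in for the test helper):

```python
from dataclasses import dataclass


@dataclass
class RunnerConfig:
    """Sketch mirroring VllmRunner's signature after this commit:
    two independent options replace the old task= parameter."""
    model_name: str
    runner: str = "auto"
    convert: str = "auto"

    def engine_kwargs(self) -> dict:
        # Both options are forwarded to the engine, where previously a
        # single task= keyword was passed.
        return {
            "model": self.model_name,
            "runner": self.runner,
            "convert": self.convert,
        }
```

Keeping the two options independent lets a caller pick the runner (generate vs. pooling) without also forcing a conversion, which the old single `task` value conflated.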

@ -6,7 +6,7 @@ from typing import Literal, NamedTuple, Optional
import pytest
from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.logger import init_logger
from ..utils import compare_two_settings, create_new_process_for_each_test
@ -31,14 +31,14 @@ class EPTestOptions(NamedTuple):
class EPTestSettings:
parallel_setups: list[ParallelSetup]
distributed_backends: list[str]
task: TaskOption
runner: RunnerOption
test_options: EPTestOptions
@staticmethod
def detailed(
*,
tp_base: int = 2,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
trust_remote_code: bool = False,
tokenizer_mode: Optional[str] = None,
load_format: Optional[str] = None,
@ -63,7 +63,7 @@ class EPTestSettings:
chunked_prefill=False),
],
distributed_backends=["mp", "ray"],
task=task,
runner=runner,
test_options=EPTestOptions(trust_remote_code=trust_remote_code,
tokenizer_mode=tokenizer_mode,
load_format=load_format,
@ -74,7 +74,7 @@ class EPTestSettings:
def fast(
*,
tp_base: int = 2,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
trust_remote_code: bool = False,
tokenizer_mode: Optional[str] = None,
load_format: Optional[str] = None,
@ -87,7 +87,7 @@ class EPTestSettings:
chunked_prefill=False),
],
distributed_backends=["mp"],
task=task,
runner=runner,
test_options=EPTestOptions(trust_remote_code=trust_remote_code,
tokenizer_mode=tokenizer_mode,
load_format=load_format,
@ -100,7 +100,7 @@ class EPTestSettings:
for parallel_setup in self.parallel_setups:
for distributed_backend in self.distributed_backends:
yield (model_name, parallel_setup, distributed_backend,
self.task, opts)
self.runner, opts)
# NOTE: You can adjust tp_base locally to fit the model in GPU
@ -118,7 +118,7 @@ def _compare_tp(
model_name: str,
parallel_setup: ParallelSetup,
distributed_backend: str,
task: TaskOption,
runner: RunnerOption,
test_options: EPTestOptions,
num_gpus_available: int,
*,
@ -154,8 +154,8 @@ def _compare_tp(
common_args.append("--enable-chunked-prefill")
if eager_mode:
common_args.append("--enforce-eager")
if task != "auto":
common_args.extend(["--task", task])
if runner != "auto":
common_args.extend(["--runner", runner])
if trust_remote_code:
common_args.append("--trust-remote-code")
if tokenizer_mode:
@ -203,7 +203,7 @@ def _compare_tp(
@pytest.mark.parametrize(
("model_name", "parallel_setup", "distributed_backend", "task",
("model_name", "parallel_setup", "distributed_backend", "runner",
"test_options"),
[
params for model_name, settings in TEST_MODELS.items()
@ -215,14 +215,14 @@ def test_ep(
model_name: str,
parallel_setup: ParallelSetup,
distributed_backend: str,
task: TaskOption,
runner: RunnerOption,
test_options: EPTestOptions,
num_gpus_available,
):
_compare_tp(model_name,
parallel_setup,
distributed_backend,
task,
runner,
test_options,
num_gpus_available,
method="generate")

View File

@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional
import pytest
from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, TaskOption
from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, RunnerOption
from vllm.logger import init_logger
from vllm.transformers_utils.config import get_config
@ -60,7 +60,7 @@ class PPTestSettings:
distributed_backends: list[str]
# vllm major version: "0" for V0, "1" for V1
vllm_major_versions: list[str]
task: TaskOption
runner: RunnerOption
test_options: PPTestOptions
def __post_init__(self):
@ -76,7 +76,7 @@ class PPTestSettings:
tp_base: int = 1,
pp_base: int = 2,
multi_node_only: bool = False,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
load_format: Optional[str] = None,
):
return PPTestSettings(
@ -104,7 +104,7 @@ class PPTestSettings:
],
distributed_backends=["mp", "mp", "ray", "ray"],
vllm_major_versions=["0", "1", "0", "1"],
task=task,
runner=runner,
test_options=PPTestOptions(multi_node_only=multi_node_only,
load_format=load_format),
)
@ -114,7 +114,7 @@ class PPTestSettings:
*,
tp_base: int = 1,
pp_base: int = 2,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
multi_node_only: bool = False,
load_format: Optional[str] = None,
):
@ -127,7 +127,7 @@ class PPTestSettings:
],
distributed_backends=["mp"],
vllm_major_versions=["0"],
task=task,
runner=runner,
test_options=PPTestOptions(multi_node_only=multi_node_only,
load_format=load_format),
)
@ -139,7 +139,7 @@ class PPTestSettings:
for backend, vllm_major_version in zip(self.distributed_backends,
self.vllm_major_versions):
yield (model_id, parallel_setup, backend, vllm_major_version,
self.task, opts)
self.runner, opts)
# NOTE: You can adjust tp_base and/or pp_base locally to fit the model in GPU
@ -211,10 +211,10 @@ TEXT_GENERATION_MODELS = {
EMBEDDING_MODELS = { # type: ignore[var-annotated]
# [Text-only]
"intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(task="embed"),
"BAAI/bge-multilingual-gemma2": PPTestSettings.fast(task="embed"),
"intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(runner="pooling"),
"BAAI/bge-multilingual-gemma2": PPTestSettings.fast(runner="pooling"),
"Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(
load_format="dummy", task="embed"
load_format="dummy", runner="pooling"
),
}
@ -269,7 +269,7 @@ def _compare_tp(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: PPTestOptions,
num_gpus_available: int,
*,
@ -335,8 +335,8 @@ def _compare_tp(
common_args.append("--enable-chunked-prefill")
if eager_mode:
common_args.append("--enforce-eager")
if task != "auto":
common_args.extend(["--task", task])
if runner != "auto":
common_args.extend(["--runner", runner])
if trust_remote_code:
common_args.append("--trust-remote-code")
if tokenizer_mode:
@ -415,7 +415,7 @@ def _compare_tp(
@pytest.mark.parametrize(
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
"task", "test_options"),
"runner", "test_options"),
[
params for model_id, settings in TEXT_GENERATION_MODELS.items()
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
@ -427,7 +427,7 @@ def test_tp_language_generation(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: PPTestOptions,
num_gpus_available,
):
@ -435,7 +435,7 @@ def test_tp_language_generation(
parallel_setup,
distributed_backend,
vllm_major_version,
task,
runner,
test_options,
num_gpus_available,
method="generate",
@ -444,7 +444,7 @@ def test_tp_language_generation(
@pytest.mark.parametrize(
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
"task", "test_options"),
"runner", "test_options"),
[
params for model_id, settings in EMBEDDING_MODELS.items()
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
@ -456,7 +456,7 @@ def test_tp_language_embedding(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: PPTestOptions,
num_gpus_available,
):
@ -464,7 +464,7 @@ def test_tp_language_embedding(
parallel_setup,
distributed_backend,
vllm_major_version,
task,
runner,
test_options,
num_gpus_available,
method="encode",
@ -473,7 +473,7 @@ def test_tp_language_embedding(
@pytest.mark.parametrize(
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
"task", "test_options"),
"runner", "test_options"),
[
params for model_id, settings in MULTIMODAL_MODELS.items()
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
@ -485,7 +485,7 @@ def test_tp_multimodal_generation(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: PPTestOptions,
num_gpus_available,
):
@ -493,7 +493,7 @@ def test_tp_multimodal_generation(
parallel_setup,
distributed_backend,
vllm_major_version,
task,
runner,
test_options,
num_gpus_available,
method="generate",

View File
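In the pipeline-parallel test settings above, `iter_params` now yields the runner option in place of the task in each parametrize tuple. A simplified sketch of that generator (test options omitted; the `Settings` tuple here is a stand-in for `PPTestSettings`):

```python
from typing import Iterator, NamedTuple


class Settings(NamedTuple):
    parallel_setups: list
    distributed_backends: list[str]
    vllm_major_versions: list[str]
    runner: str


def iter_params(model_id: str, s: Settings) -> Iterator[tuple]:
    # Mirrors the updated PPTestSettings.iter_params: the fifth element of
    # each yielded tuple is now the runner option rather than a task.
    for parallel_setup in s.parallel_setups:
        for backend, vllm_major_version in zip(s.distributed_backends,
                                               s.vllm_major_versions):
            yield (model_id, parallel_setup, backend, vllm_major_version,
                   s.runner)
```

The embedding entries above then simply construct their settings with `runner="pooling"` instead of `task="embed"`, and every generated test case carries that value through to `_compare_tp`.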

@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional
import pytest
from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.logger import init_logger
from ..models.registry import HF_EXAMPLE_MODELS
@ -48,7 +48,7 @@ class SPTestSettings:
distributed_backends: list[str]
# vllm major version: "0" for V0, "1" for V1
vllm_major_versions: list[str]
task: TaskOption
runner: RunnerOption
test_options: SPTestOptions
def __post_init__(self):
@ -64,7 +64,7 @@ class SPTestSettings:
tp_base: int = 2,
pp_base: int = 1,
multi_node_only: bool = False,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
load_format: Optional[str] = None,
):
parallel_setups = []
@ -81,7 +81,7 @@ class SPTestSettings:
parallel_setups=parallel_setups,
distributed_backends=["mp", "ray"],
vllm_major_versions=["1", "1"],
task=task,
runner=runner,
test_options=SPTestOptions(multi_node_only=multi_node_only,
load_format=load_format),
)
@ -91,7 +91,7 @@ class SPTestSettings:
*,
tp_base: int = 2,
pp_base: int = 1,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
multi_node_only: bool = False,
load_format: Optional[str] = None,
):
@ -109,7 +109,7 @@ class SPTestSettings:
parallel_setups=parallel_setups,
distributed_backends=["mp", "ray"],
vllm_major_versions=["1", "1"],
task=task,
runner=runner,
test_options=SPTestOptions(multi_node_only=multi_node_only,
load_format=load_format),
)
@ -119,7 +119,7 @@ class SPTestSettings:
*,
tp_base: int = 2,
pp_base: int = 1,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
multi_node_only: bool = False,
load_format: Optional[str] = None,
):
@ -135,7 +135,7 @@ class SPTestSettings:
parallel_setups=parallel_setups,
distributed_backends=["mp", "ray"],
vllm_major_versions=["1", "1"],
task=task,
runner=runner,
test_options=SPTestOptions(multi_node_only=multi_node_only,
load_format=load_format),
)
@ -147,7 +147,7 @@ class SPTestSettings:
for backend, vllm_major_version in zip(self.distributed_backends,
self.vllm_major_versions):
yield (model_id, parallel_setup, backend, vllm_major_version,
self.task, opts)
self.runner, opts)
def _compare_sp(
@ -155,7 +155,7 @@ def _compare_sp(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: SPTestOptions,
num_gpus_available: int,
*,
@ -217,8 +217,8 @@ def _compare_sp(
common_args.append("--enable-chunked-prefill")
if eager_mode:
common_args.append("--enforce-eager")
if task != "auto":
common_args.extend(["--task", task])
if runner != "auto":
common_args.extend(["--runner", runner])
if trust_remote_code:
common_args.append("--trust-remote-code")
if tokenizer_mode:
@ -298,7 +298,7 @@ SP_TEST_MODELS = [
@pytest.mark.parametrize(
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
"task", "test_options"),
"runner", "test_options"),
[
params for model_id, settings in SP_TEXT_GENERATION_MODELS.items()
for params in settings.iter_params(model_id)
@ -311,7 +311,7 @@ def test_tp_sp_generation(
parallel_setup: ParallelSetup,
distributed_backend: str,
vllm_major_version: str,
task: TaskOption,
runner: RunnerOption,
test_options: SPTestOptions,
num_gpus_available,
):
@ -319,7 +319,7 @@ def test_tp_sp_generation(
parallel_setup,
distributed_backend,
vllm_major_version,
task,
runner,
test_options,
num_gpus_available,
method="generate",

View File

@ -19,7 +19,8 @@ MAIN_SCORE = 0.7422994752439667
@pytest.fixture(scope="module")
def server():
args = [
"--task", "embed", "--enforce-eager", "--disable-uvicorn-access-log"
"--runner", "pooling", "--enforce-eager",
"--disable-uvicorn-access-log"
]
with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

View File

@ -21,7 +21,8 @@ MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
@pytest.fixture(scope="module")
def server():
args = [
"--task", "score", "--enforce-eager", "--disable-uvicorn-access-log"
"--runner", "pooling", "--enforce-eager",
"--disable-uvicorn-access-log"
]
with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

View File

@ -15,10 +15,6 @@ MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
def get_vocab_size(model_name):
config = ModelConfig(
model=model_name,
task="auto",
tokenizer=model_name,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="bfloat16",
)

View File

@ -102,6 +102,7 @@ def test_get_gen_prompt(model, template, add_generation_prompt,
tokenizer=model_info.tokenizer or model,
tokenizer_mode=model_info.tokenizer_mode,
trust_remote_code=model_info.trust_remote_code,
revision=model_info.revision,
hf_overrides=model_info.hf_overrides,
)

View File

@ -33,8 +33,8 @@ def v1(run_with_both_engines):
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"embed",
"--runner",
"pooling",
# use half precision for speed and memory savings in CI environment
"--dtype",
DTYPE,

View File

@ -42,8 +42,8 @@ def dtype(request):
@pytest.fixture(scope="module")
def server(model_info, dtype: str):
args = [
"--task",
"embed",
"--runner",
"pooling",
# use half precision for speed and memory savings in CI environment
"--dtype",
dtype,

View File

@ -21,7 +21,7 @@ LONG_TIMEOUT_SECONDS: Final[int] = 60
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"--runner",
"generate",
"--max-model-len",
"2048",

View File

@ -27,8 +27,8 @@ def server(request: pytest.FixtureRequest):
passed_params = [passed_params]
args = [
"--task",
"embed",
"--runner",
"pooling",
# use half precision for speed and memory savings in CI environment
"--dtype",
"float16",

View File

@ -20,8 +20,8 @@ DUMMY_CHAT_TEMPLATE = """{% for message in messages %}{{message['role'] + ': ' +
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"reward",
"--runner",
"pooling",
# use half precision for speed and memory savings in CI environment
"--dtype",
"bfloat16",

View File

@ -26,8 +26,8 @@ def v1(run_with_both_engines):
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"embed",
"--runner",
"pooling",
# use half precision for speed and memory savings in CI environment
"--dtype",
DTYPE,

View File

@ -29,8 +29,8 @@ input = """Immerse yourself in the enchanting chronicle of calculus, a
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"embed",
"--runner",
"pooling",
"--dtype",
"bfloat16",
"--enforce-eager",

View File

@ -25,7 +25,7 @@ TEST_VIDEO_URLS = [
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"--runner",
"generate",
"--max-model-len",
"32768",

View File

@ -48,7 +48,7 @@ EXPECTED_MM_BEAM_SEARCH_RES = [
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"--runner",
"generate",
"--max-model-len",
"2048",

View File

@ -31,8 +31,8 @@ TEST_IMAGE_URLS = [
@pytest.fixture(scope="module")
def server():
args = [
"--task",
"embed",
"--runner",
"pooling",
"--max-model-len",
"2048",
"--max-num-seqs",

View File

@ -47,12 +47,8 @@ MISTRAL_MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
@pytest.fixture(scope="function")
def phi3v_model_config():
return ModelConfig(PHI3V_MODEL_ID,
task="generate",
tokenizer=PHI3V_MODEL_ID,
tokenizer_mode="auto",
runner="generate",
trust_remote_code=True,
dtype="auto",
seed=0,
limit_mm_per_prompt={
"image": 2,
})
@ -61,12 +57,8 @@ def phi3v_model_config():
@pytest.fixture(scope="function")
def phi3v_model_config_mm_interleaved():
return ModelConfig(PHI3V_MODEL_ID,
task="generate",
tokenizer=PHI3V_MODEL_ID,
tokenizer_mode="auto",
runner="generate",
trust_remote_code=True,
dtype="auto",
seed=0,
interleave_mm_strings=True,
limit_mm_per_prompt={
"image": 2,
@ -86,11 +78,7 @@ def phi3v_tokenizer():
@pytest.fixture(scope="function")
def qwen25omni_model_config_mm_interleaved():
return ModelConfig(QWEN25OMNI_MODEL_ID,
task="generate",
tokenizer=QWEN25OMNI_MODEL_ID,
tokenizer_mode="auto",
dtype="auto",
seed=0,
runner="generate",
interleave_mm_strings=True,
limit_mm_per_prompt={
"image": 2,
@ -112,12 +100,7 @@ def qwen25omni_tokenizer():
@pytest.fixture(scope="module")
def mllama_model_config():
return ModelConfig(MLLAMA_MODEL_ID,
task="generate",
tokenizer=MLLAMA_MODEL_ID,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="auto",
seed=0,
runner="generate",
limit_mm_per_prompt={
"image": 2,
})
@ -136,12 +119,7 @@ def mllama_tokenizer():
@pytest.fixture(scope="function")
def mistral_model_config():
return ModelConfig(MISTRAL_MODEL_ID,
task="generate",
tokenizer=MISTRAL_MODEL_ID,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="auto",
seed=0,
runner="generate",
limit_mm_per_prompt={
"image": 2,
})
@ -1105,12 +1083,7 @@ def test_multimodal_image_parsing_matches_hf(model, image_url):
# Build a config for the model
model_config = ModelConfig(model,
task="generate",
tokenizer=model,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="auto",
seed=0,
runner="generate",
limit_mm_per_prompt={
"image": 2,
})
@ -1170,6 +1143,7 @@ def test_resolve_hf_chat_template(sample_json_schema, model, use_tools):
model,
tokenizer=model_info.tokenizer or model,
tokenizer_mode=model_info.tokenizer_mode,
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
hf_overrides=model_info.hf_overrides,
)
@ -1225,6 +1199,7 @@ def test_resolve_content_format_hf_defined(model, expected_format):
model,
tokenizer=model_info.tokenizer or model,
tokenizer_mode=model_info.tokenizer_mode,
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
hf_overrides=model_info.hf_overrides,
)
@ -1284,6 +1259,7 @@ def test_resolve_content_format_fallbacks(model, expected_format):
model,
tokenizer=model_info.tokenizer or model,
tokenizer_mode=model_info.tokenizer_mode,
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
hf_overrides=model_info.hf_overrides,
)

View File

@ -38,13 +38,8 @@ def test_worker_apply_lora(sql_lora_files):
vllm_config = VllmConfig(
model_config=ModelConfig(
"meta-llama/Llama-2-7b-hf",
task="auto",
tokenizer="meta-llama/Llama-2-7b-hf",
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
enforce_eager=True,
),
load_config=LoadConfig(

View File

@ -69,10 +69,7 @@ async def test_guided_logits_processor_black_box(backend: str, is_local: bool,
config = ModelConfig(
MODEL_NAME,
task="generate",
tokenizer=MODEL_NAME,
tokenizer_mode="auto",
trust_remote_code=False,
runner="generate",
seed=0,
dtype="bfloat16",
)
@ -113,10 +110,7 @@ async def test_guided_logits_processor_with_reasoning(
config = ModelConfig(
REASONING_MODEL_NAME,
task="generate",
tokenizer=REASONING_MODEL_NAME,
tokenizer_mode="auto",
trust_remote_code=False,
runner="generate",
seed=0,
dtype="bfloat16",
)

View File

@ -57,7 +57,6 @@ def test_model_loading_with_params(vllm_runner, monkeypatch):
vllm_model.apply_model(check_model)
# assert output
assert output
@ -99,7 +98,6 @@ def test_roberta_model_loading_with_params(vllm_runner, monkeypatch):
vllm_model.apply_model(check_model)
# assert output
assert output

View File

@ -52,7 +52,7 @@ def correctness_test_embed_models(hf_runner,
vllm_extra_kwargs["dtype"] = model_info.dtype
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=None,
**vllm_extra_kwargs) as vllm_model:
vllm_outputs = vllm_model.embed(example_prompts)

View File

@ -172,7 +172,7 @@ def mteb_test_embed_models(hf_runner,
vllm_extra_kwargs["dtype"] = model_info.dtype
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=None,
**vllm_extra_kwargs) as vllm_model:
@ -279,15 +279,12 @@ def mteb_test_rerank_models(hf_runner,
vllm_extra_kwargs["dtype"] = model_info.dtype
with vllm_runner(model_info.name,
task="score",
runner="pooling",
max_model_len=None,
max_num_seqs=8,
**vllm_extra_kwargs) as vllm_model:
model_config = vllm_model.llm.llm_engine.model_config
if model_info.architecture:
assert (model_info.architecture in model_config.architectures)
assert model_config.hf_config.num_labels == 1
vllm_main_score = run_mteb_rerank(vllm_mteb_encoder(vllm_model),

View File

@ -85,7 +85,7 @@ def test_models(
hf_outputs = hf_model.encode(example_prompts)
with vllm_runner(model,
task="embed",
runner="pooling",
max_model_len=max_model_len,
**vllm_extra_kwargs) as vllm_model:
vllm_outputs = vllm_model.embed(example_prompts)

View File

@ -28,10 +28,7 @@ def test_find_array():
model_config = ModelConfig(
MODEL_NAME,
task="embed",
tokenizer=MODEL_NAME,
tokenizer_mode="auto",
trust_remote_code=False,
runner="pooling",
dtype="bfloat16",
seed=0,
)
@ -117,7 +114,7 @@ def test_gritlm_offline_embedding(vllm_runner):
with vllm_runner(
MODEL_NAME,
task="embed",
runner="pooling",
max_model_len=MAX_MODEL_LEN,
) as vllm_model:
llm = vllm_model.llm
@ -140,7 +137,7 @@ def test_gritlm_offline_embedding(vllm_runner):
async def test_gritlm_api_server_embedding():
queries, q_instruction, documents, d_instruction = get_test_data()
args = ["--task", "embed", "--max_model_len", str(MAX_MODEL_LEN)]
args = ["--runner", "pooling", "--max_model_len", str(MAX_MODEL_LEN)]
with RemoteOpenAIServer(MODEL_NAME, args) as server:
client_embedding = server.get_async_client()
@ -164,7 +161,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
with vllm_runner(
MODEL_NAME,
task="generate",
runner="generate",
max_model_len=MAX_MODEL_LEN,
) as vllm_model:
llm = vllm_model.llm
@ -179,7 +176,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
async def test_gritlm_api_server_generate():
input = "<|user|>\nWhat is the capital of France?\n<|assistant|>\n"
args = ["--task", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
args = ["--runner", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
with RemoteOpenAIServer(MODEL_NAME, args) as server:
client_generate = server.get_async_client()

View File

@ -4,6 +4,7 @@ from functools import partial
import pytest
import vllm.envs as envs
from vllm import PoolingParams
from ...utils import EmbedModelInfo, RerankModelInfo
@ -62,6 +63,10 @@ def test_embed_models_correctness(hf_runner, vllm_runner,
@pytest.mark.parametrize("model_info", RERANK_MODELS)
def test_rerank_models_mteb(hf_runner, vllm_runner,
model_info: RerankModelInfo) -> None:
if (model_info.architecture == "XLMRobertaForSequenceClassification"
and envs.VLLM_USE_V1):
pytest.skip("Not supported yet")
mteb_test_rerank_models(hf_runner, vllm_runner, model_info)
@ -92,7 +97,7 @@ def test_matryoshka(
hf_outputs = matryoshka_fy(hf_outputs, dimensions)
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
dtype=dtype,
max_model_len=None) as vllm_model:
assert vllm_model.llm.llm_engine.model_config.is_matryoshka

View File

@ -21,7 +21,7 @@ max_model_len = int(original_max_position_embeddings * factor)
@pytest.mark.parametrize("model_info", MODELS)
def test_default(model_info, vllm_runner):
with vllm_runner(model_info.name, task="embed",
with vllm_runner(model_info.name, runner="pooling",
max_model_len=None) as vllm_model:
model_config = vllm_model.llm.llm_engine.model_config
if model_info.name == "nomic-ai/nomic-embed-text-v2-moe":
@ -36,7 +36,7 @@ def test_default(model_info, vllm_runner):
@pytest.mark.parametrize("model_info", MODELS)
def test_set_max_model_len_legal(model_info, vllm_runner):
# set max_model_len <= 512
with vllm_runner(model_info.name, task="embed",
with vllm_runner(model_info.name, runner="pooling",
max_model_len=256) as vllm_model:
model_config = vllm_model.llm.llm_engine.model_config
assert model_config.max_model_len == 256
@ -46,11 +46,12 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
# For nomic-embed-text-v2-moe the length is set to 512
# by sentence_bert_config.json.
with pytest.raises(ValueError):
with vllm_runner(model_info.name, task="embed",
with vllm_runner(model_info.name,
runner="pooling",
max_model_len=1024):
pass
else:
with vllm_runner(model_info.name, task="embed",
with vllm_runner(model_info.name, runner="pooling",
max_model_len=1024) as vllm_model:
model_config = vllm_model.llm.llm_engine.model_config
assert model_config.max_model_len == 1024
@ -60,14 +61,15 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
def test_set_max_model_len_illegal(model_info, vllm_runner):
# set max_model_len > 2048
with pytest.raises(ValueError):
with vllm_runner(model_info.name, task="embed", max_model_len=4096):
with vllm_runner(model_info.name, runner="pooling",
max_model_len=4096):
pass
# set max_model_len > 2048 by hf_overrides
hf_overrides = {"max_model_len": 4096}
with pytest.raises(ValueError):
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=None,
hf_overrides=hf_overrides):
pass
@ -87,7 +89,7 @@ def test_use_rope_scaling_legal(model_info, vllm_runner):
}
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=None,
hf_overrides=hf_overrides):
pass
@ -107,7 +109,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
# illegal max_model_len
with pytest.raises(ValueError):
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=max_model_len + 1,
hf_overrides=hf_overrides):
pass
@ -125,7 +127,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
# illegal max_model_len by hf_overrides
with pytest.raises(ValueError):
with vllm_runner(model_info.name,
task="embed",
runner="pooling",
max_model_len=None,
hf_overrides=hf_overrides):
pass

View File

@ -37,7 +37,9 @@ def test_cross_encoder_1_to_1(vllm_runner, hf_runner, model_name):
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
hf_outputs = hf_model.predict([text_pair]).tolist()
with vllm_runner(model_name, task="score", dtype=DTYPE,
with vllm_runner(model_name,
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])
@ -56,7 +58,9 @@ def test_cross_encoder_1_to_N(vllm_runner, hf_runner, model_name):
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
hf_outputs = hf_model.predict(text_pairs).tolist()
with vllm_runner(model_name, task="score", dtype=DTYPE,
with vllm_runner(model_name,
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)
@ -76,7 +80,9 @@ def test_cross_encoder_N_to_N(vllm_runner, hf_runner, model_name):
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
hf_outputs = hf_model.predict(text_pairs).tolist()
with vllm_runner(model_name, task="score", dtype=DTYPE,
with vllm_runner(model_name,
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)
@ -103,7 +109,7 @@ def test_embedding_1_to_1(vllm_runner, hf_runner, emb_model_name):
]
with vllm_runner(emb_model_name,
task="embed",
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])
@ -131,7 +137,7 @@ def test_embedding_1_to_N(vllm_runner, hf_runner, emb_model_name):
]
with vllm_runner(emb_model_name,
task="embed",
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)
@ -160,7 +166,7 @@ def test_embedding_N_to_N(vllm_runner, hf_runner, emb_model_name):
]
with vllm_runner(emb_model_name,
task="embed",
runner="pooling",
dtype=DTYPE,
max_model_len=None) as vllm_model:
vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)

View File
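The embedding-based score tests above (`test_embedding_1_to_1` and friends) exercise models that, under the pooling runner, score a text pair by comparing the two embeddings — conceptually a cosine similarity. A minimal sketch of that idea (an assumption about the conceptual computation, not vLLM's implementation):

```python
import math


def cosine_score(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two embedding vectors: what an embedding model
    served with the pooling runner conceptually returns for a text pair
    under the score API. Sketch only, not vLLM's code."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Cross-encoder models, by contrast (the `test_cross_encoder_*` cases above), feed both texts through the model jointly and read the score from a classification head, so no such pairwise vector comparison applies to them.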

@ -26,7 +26,7 @@ def test_smaller_truncation_size(vllm_runner,
truncate_prompt_tokens = 10
with vllm_runner(model_name, task="embed",
with vllm_runner(model_name, runner="pooling",
max_model_len=max_model_len) as vllm_model:
vllm_output = vllm_model.llm.encode(
input_str, truncate_prompt_tokens=truncate_prompt_tokens)
@ -41,7 +41,7 @@ def test_max_truncation_size(vllm_runner,
input_str=input_str):
truncate_prompt_tokens = -1
with vllm_runner(model_name, task="embed",
with vllm_runner(model_name, runner="pooling",
max_model_len=max_model_len) as vllm_model:
vllm_output = vllm_model.llm.encode(
input_str, truncate_prompt_tokens=truncate_prompt_tokens)
@ -58,7 +58,7 @@ def test_bigger_truncation_size(vllm_runner,
truncate_prompt_tokens = max_model_len + 1
with pytest.raises(ValueError), vllm_runner(
model_name, task="embed",
model_name, runner="pooling",
max_model_len=max_model_len) as vllm_model:
llm_output = vllm_model.llm.encode(

View File

@ -222,7 +222,6 @@ VLM_TEST_SETTINGS = {
},
marks=[large_gpu_mark(min_gb=32)],
),
# Check "auto" with fallback to transformers
"internvl-transformers": VLMTestInfo(
models=["OpenGVLab/InternVL3-1B-hf"],
test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
@ -232,7 +231,7 @@ VLM_TEST_SETTINGS = {
use_tokenizer_eos=True,
image_size_factors=[(0.25, 0.5, 1.0)],
vllm_runner_kwargs={
"model_impl": "auto",
"model_impl": "transformers",
},
auto_cls=AutoModelForImageTextToText,
marks=[pytest.mark.core_model],
@ -638,7 +637,7 @@ VLM_TEST_SETTINGS = {
img_idx_to_prompt=lambda idx: f"<|image_{idx}|>\n",
max_model_len=4096,
max_num_seqs=2,
task="generate",
runner="generate",
# use sdpa mode for hf runner since phi3v didn't work with flash_attn
hf_model_kwargs={"_attn_implementation": "sdpa"},
use_tokenizer_eos=True,


@ -65,7 +65,7 @@ def run_test(
# max_model_len should be greater than image_feature_size
with vllm_runner(
model,
task="generate",
runner="generate",
max_model_len=max_model_len,
max_num_seqs=1,
dtype=dtype,


@ -48,7 +48,7 @@ def test_models(vllm_runner, model, dtype: str, max_tokens: int) -> None:
]
with vllm_runner(model,
task="generate",
runner="generate",
dtype=dtype,
limit_mm_per_prompt={"image": 2},
max_model_len=32768,


@ -99,7 +99,7 @@ def run_test(
# max_model_len should be greater than image_feature_size
with vllm_runner(
model,
task="generate",
runner="generate",
max_model_len=max_model_len,
max_num_seqs=2,
dtype=dtype,


@ -267,7 +267,7 @@ def run_embedding_input_test(
# max_model_len should be greater than image_feature_size
with vllm_runner(model,
task="generate",
runner="generate",
max_model_len=4000,
max_num_seqs=3,
dtype=dtype,


@ -6,7 +6,7 @@ from typing import Any, Callable, Optional
import torch
from transformers.models.auto.auto_factory import _BaseAutoModelClass
from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.transformers_utils.tokenizer import AnyTokenizer
from .....conftest import HfRunner, VllmRunner
@ -37,7 +37,7 @@ def run_test(
vllm_runner_kwargs: Optional[dict[str, Any]],
hf_model_kwargs: Optional[dict[str, Any]],
patch_hf_runner: Optional[Callable[[HfRunner], HfRunner]],
task: TaskOption = "auto",
runner: RunnerOption = "auto",
distributed_executor_backend: Optional[str] = None,
tensor_parallel_size: int = 1,
vllm_embeddings: Optional[torch.Tensor] = None,
@ -83,7 +83,7 @@ def run_test(
tensor_parallel_size=tensor_parallel_size,
distributed_executor_backend=distributed_executor_backend,
enforce_eager=enforce_eager,
task=task,
runner=runner,
**vllm_runner_kwargs_) as vllm_model:
tokenizer = vllm_model.llm.get_tokenizer()


@ -11,7 +11,7 @@ from pytest import MarkDecorator
from transformers import AutoModelForCausalLM
from transformers.models.auto.auto_factory import _BaseAutoModelClass
from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.sequence import SampleLogprobs
from vllm.transformers_utils.tokenizer import AnyTokenizer
@ -109,7 +109,7 @@ class VLMTestInfo(NamedTuple):
enforce_eager: bool = True
max_model_len: int = 1024
max_num_seqs: int = 256
task: TaskOption = "auto"
runner: RunnerOption = "auto"
tensor_parallel_size: int = 1
vllm_runner_kwargs: Optional[dict[str, Any]] = None
@ -173,7 +173,7 @@ class VLMTestInfo(NamedTuple):
"enforce_eager": self.enforce_eager,
"max_model_len": self.max_model_len,
"max_num_seqs": self.max_num_seqs,
"task": self.task,
"runner": self.runner,
"tensor_parallel_size": self.tensor_parallel_size,
"vllm_runner_kwargs": self.vllm_runner_kwargs,
"hf_output_post_proc": self.hf_output_post_proc,


@ -92,7 +92,7 @@ def _run_test(
# if we run HF first, the cuda initialization will be done and it
# will hurt multiprocessing backend with fork method (the default method).
with vllm_runner(model,
task="embed",
runner="pooling",
dtype=dtype,
enforce_eager=True,
max_model_len=8192) as vllm_model:


@ -49,7 +49,7 @@ def vllm_reranker(
with vllm_runner(
model_name,
task="score",
runner="pooling",
dtype=dtype,
max_num_seqs=2,
max_model_len=2048,


@ -64,7 +64,7 @@ def _run_test(
# if we run HF first, the cuda initialization will be done and it
# will hurt multiprocessing backend with fork method (the default method).
with vllm_runner(model,
task="embed",
runner="pooling",
dtype=dtype,
max_model_len=4096,
enforce_eager=True) as vllm_model:


@ -44,7 +44,7 @@ def _run_test(
# vLLM needs a fresh new process without cuda initialization.
# if we run HF first, the cuda initialization will be done and it
# will hurt multiprocessing backend with fork method (the default method).
with vllm_runner(model, task="embed", dtype=dtype,
with vllm_runner(model, runner="pooling", dtype=dtype,
enforce_eager=True) as vllm_model:
vllm_outputs = vllm_model.embed(input_texts, images=input_images)


@ -34,7 +34,7 @@ def _run_test(
set_default_torch_num_threads(1),
vllm_runner(
model,
task="embed",
runner="pooling",
dtype=torch.float16,
enforce_eager=True,
skip_tokenizer_init=True,


@ -58,13 +58,10 @@ def _test_processing_correctness(
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_info.tokenizer or model_id,
tokenizer_mode=model_info.tokenizer_mode,
trust_remote_code=model_info.trust_remote_code,
seed=0,
dtype="auto",
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
hf_overrides=model_info.hf_overrides,
)


@ -54,13 +54,10 @@ def test_hf_model_weights_mapper(model_arch: str):
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_info.tokenizer or model_id,
tokenizer_mode=model_info.tokenizer_mode,
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
seed=0,
dtype="auto",
revision=None,
hf_overrides=model_info.hf_overrides,
)
model_cls = MULTIMODAL_REGISTRY._get_model_cls(model_config)


@ -172,7 +172,7 @@ def test_4bit_bnb_embedding_model(
# Inflight 4bit quantization
with vllm_runner(model_name,
task="embed",
runner="pooling",
dtype=dtype,
gpu_memory_utilization=0.5,
quantization="bitsandbytes") as vllm_model:


@ -7,13 +7,15 @@ import pytest
from transformers import PretrainedConfig
from vllm import LLM
from vllm.config import ModelImpl
from vllm.engine.llm_engine import LLMEngine as V0LLMEngine
from vllm.utils import GiB_bytes
from vllm.v1.core.kv_cache_utils import get_kv_cache_config
from vllm.v1.engine.core import EngineCore as V1EngineCore
from ..utils import create_new_process_for_each_test
from .registry import AUTO_EXAMPLE_MODELS, HF_EXAMPLE_MODELS, HfExampleModels
from .registry import (_TRANSFORMERS_BACKEND_MODELS, AUTO_EXAMPLE_MODELS,
HF_EXAMPLE_MODELS, HfExampleModels)
@create_new_process_for_each_test()
@ -126,6 +128,8 @@ def can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch,
# these tests seem to produce leftover memory
gpu_memory_utilization=0.80,
load_format="dummy",
model_impl=ModelImpl.TRANSFORMERS
if model_arch in _TRANSFORMERS_BACKEND_MODELS else ModelImpl.VLLM,
hf_overrides=hf_overrides,
)


@ -24,11 +24,9 @@ from .registry import HF_EXAMPLE_MODELS
@pytest.mark.parametrize("model_arch", ModelRegistry.get_supported_archs())
def test_registry_imports(model_arch):
model_info = HF_EXAMPLE_MODELS.get_hf_info(model_arch)
model_info.check_transformers_version(on_fail="skip")
# Ensure all model classes can be imported successfully
model_cls, _ = ModelRegistry.resolve_model_cls(model_arch)
model_cls = ModelRegistry._try_load_model_cls(model_arch)
assert model_cls is not None
if model_arch in _SPECULATIVE_DECODING_MODELS:
return # Ignore these models which do not have a unified format
@ -56,14 +54,16 @@ def test_registry_imports(model_arch):
("XLMRobertaForSequenceClassification", False, False, True),
])
def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
assert ModelRegistry.is_multimodal_model(model_arch) is is_mm
model_info = ModelRegistry._try_inspect_model_cls(model_arch)
assert model_info is not None
assert ModelRegistry.is_cross_encoder_model(model_arch) is is_ce
assert model_info.supports_multimodal is is_mm
assert model_info.supports_cross_encoding is is_ce
if init_cuda and current_platform.is_cuda_alike():
assert not torch.cuda.is_initialized()
ModelRegistry.resolve_model_cls(model_arch)
ModelRegistry._try_load_model_cls(model_arch)
if not torch.cuda.is_initialized():
warnings.warn(
"This model no longer initializes CUDA on import. "
@ -82,12 +82,15 @@ def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
("Qwen2VLForConditionalGeneration", True, True),
])
def test_registry_is_pp(model_arch, is_pp, init_cuda):
assert ModelRegistry.is_pp_supported_model(model_arch) is is_pp
model_info = ModelRegistry._try_inspect_model_cls(model_arch)
assert model_info is not None
assert model_info.supports_pp is is_pp
if init_cuda and current_platform.is_cuda_alike():
assert not torch.cuda.is_initialized()
ModelRegistry.resolve_model_cls(model_arch)
ModelRegistry._try_load_model_cls(model_arch)
if not torch.cuda.is_initialized():
warnings.warn(
"This model no longer initializes CUDA on import. "


@ -33,6 +33,10 @@ def check_implementation(
args = (example_prompts, max_tokens, num_logprobs)
with runner_test(model, **kwargs_test, **kwargs) as model_test:
model_config = model_test.llm.llm_engine.model_config
assert model_config.architecture == (
model_config._get_transformers_backend_cls())
outputs_test = model_test.generate_greedy_logprobs(*args)
with runner_ref(model, **kwargs_ref) as model_ref:
@ -130,8 +134,13 @@ def test_quantization(
model_impl="transformers",
enforce_eager=True,
**quantization_kwargs) as vllm_model: # type: ignore[arg-type]
model_config = vllm_model.llm.llm_engine.model_config
assert model_config.architecture == (
model_config._get_transformers_backend_cls())
transformers_outputs = vllm_model.generate_greedy_logprobs(
example_prompts, max_tokens=max_tokens, num_logprobs=num_logprobs)
check_logprobs_close(
outputs_0_lst=transformers_outputs,
outputs_1_lst=vllm_outputs,
@ -151,7 +160,6 @@ def test_classify(
example_prompts,
model: str,
dtype: str,
monkeypatch,
) -> None:
import torch
from transformers import AutoModelForSequenceClassification
@ -160,6 +168,10 @@ def test_classify(
max_model_len=512,
dtype=dtype,
model_impl="transformers") as vllm_model:
model_config = vllm_model.llm.llm_engine.model_config
assert model_config.architecture == (
model_config._get_transformers_backend_cls())
vllm_outputs = vllm_model.classify(example_prompts)
with hf_runner(model,


@ -8,7 +8,7 @@ from typing import Any, NamedTuple, Optional, Union
import torch
import torch.nn.functional as F
from vllm.config import ModelConfig, TaskOption
from vllm.config import ModelConfig, RunnerOption
from vllm.inputs import InputContext
from vllm.sequence import Logprob, PromptLogprobs, SampleLogprobs
@ -255,7 +255,7 @@ def check_logprobs_close(
def build_model_context(
model_id: str,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
dtype: Union[str, torch.dtype] = "auto",
model_config_kwargs: Optional[dict[str, Any]] = None,
mm_processor_kwargs: Optional[dict[str, Any]] = None,
@ -280,9 +280,10 @@ def build_model_context(
model_config_kwargs = model_config_kwargs or {}
model_config = ModelConfig(
model_id,
task=task,
runner=runner,
tokenizer=model_info.tokenizer or model_id,
tokenizer_mode=model_info.tokenizer_mode,
revision=model_info.revision,
trust_remote_code=model_info.trust_remote_code,
dtype=dtype,
seed=0,


@ -954,13 +954,6 @@ def test_limit_mm_per_prompt_dummy(model_id, limit, num_supported, is_valid):
model_config = ModelConfig(
model=model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="auto",
revision=None,
limit_mm_per_prompt=limit_mm_per_prompt,
)
@ -993,13 +986,6 @@ def test_limit_mm_per_prompt_apply(model_id, num_images, limit, is_valid):
model_config = ModelConfig(
model=model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="auto",
revision=None,
limit_mm_per_prompt=limit_mm_per_prompt,
)
@ -1061,16 +1047,7 @@ class _ProcessorProxy:
)
# yapf: enable
def test_hf_processor_kwargs(model_id, call_kwargs, expected_kwargs):
model_config = ModelConfig(
model=model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="auto",
revision=None,
)
model_config = ModelConfig(model_id)
processor = MULTIMODAL_REGISTRY.create_processor(model_config)
orig_get_hf_processor = processor.info.get_hf_processor


@ -57,15 +57,7 @@ def test_auto_gptq(model_arg_exptype: tuple[str, None, str]) -> None:
model_path, quantization_arg, expected_type = model_arg_exptype
try:
model_config = ModelConfig(model_path,
task="auto",
tokenizer=model_path,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
quantization=quantization_arg)
model_config = ModelConfig(model_path, quantization=quantization_arg)
found_quantization_type = model_config.quantization
except ValueError:
found_quantization_type = "ERROR"


@ -74,115 +74,116 @@ def test_update_config():
new_config3 = update_config(config3, {"a": "new_value"})
# Can remove once --task option is fully deprecated
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_task"),
("model_id", "expected_runner_type", "expected_convert_type",
"expected_task"),
[
("distilbert/distilgpt2", "generate", "generate"),
("intfloat/multilingual-e5-small", "pooling", "embed"),
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "reward"),
("openai/whisper-small", "generate", "transcription"),
("distilbert/distilgpt2", "generate", "none", "generate"),
("intfloat/multilingual-e5-small", "pooling", "none", "embed"),
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none",
"classify"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none", "reward"),
("openai/whisper-small", "generate", "none", "transcription"),
],
)
def test_auto_task(model_id, expected_runner_type, expected_task):
config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
)
def test_auto_task(model_id, expected_runner_type, expected_convert_type,
expected_task):
config = ModelConfig(model_id, task="auto")
assert config.runner_type == expected_runner_type
assert config.convert_type == expected_convert_type
assert expected_task in config.supported_tasks
if config.runner_type == "pooling":
assert config.task == expected_task
else:
assert expected_task in config.supported_tasks
# Can remove once --task option is fully deprecated
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_convert_type",
"expected_task"),
[
("distilbert/distilgpt2", "pooling", "embed", "embed"),
("intfloat/multilingual-e5-small", "pooling", "embed", "embed"),
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify",
"classify"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed", "embed"),
("openai/whisper-small", "pooling", "embed", "embed"),
],
)
def test_score_task(model_id, expected_runner_type, expected_convert_type,
expected_task):
config = ModelConfig(model_id, task="score")
assert config.runner_type == expected_runner_type
assert config.convert_type == expected_convert_type
assert expected_task in config.supported_tasks
# Can remove once --task option is fully deprecated
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_convert_type",
"expected_task"),
[
("openai/whisper-small", "generate", "none", "transcription"),
],
)
def test_transcription_task(model_id, expected_runner_type,
expected_convert_type, expected_task):
config = ModelConfig(model_id, task="transcription")
assert config.runner_type == expected_runner_type
assert config.convert_type == expected_convert_type
assert expected_task in config.supported_tasks
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_task"),
("model_id", "expected_runner_type", "expected_convert_type"),
[
("distilbert/distilgpt2", "generate", "none"),
("intfloat/multilingual-e5-small", "pooling", "none"),
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
("openai/whisper-small", "generate", "none"),
],
)
def test_auto_runner(model_id, expected_runner_type, expected_convert_type):
config = ModelConfig(model_id, runner="auto")
assert config.runner_type == expected_runner_type
assert config.convert_type == expected_convert_type
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_convert_type"),
[
("distilbert/distilgpt2", "pooling", "embed"),
("intfloat/multilingual-e5-small", "pooling", "embed"),
("intfloat/multilingual-e5-small", "pooling", "none"),
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed"),
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
("openai/whisper-small", "pooling", "embed"),
],
)
def test_score_task(model_id, expected_runner_type, expected_task):
config = ModelConfig(
model_id,
task="score",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
)
def test_pooling_runner(model_id, expected_runner_type, expected_convert_type):
config = ModelConfig(model_id, runner="pooling")
assert config.runner_type == expected_runner_type
assert config.task == expected_task
@pytest.mark.parametrize(("model_id", "expected_runner_type", "expected_task"),
[
("Qwen/Qwen2.5-1.5B-Instruct", "draft", "auto"),
])
def test_draft_task(model_id, expected_runner_type, expected_task):
config = ModelConfig(
model_id,
runner="draft",
tokenizer=model_id,
seed=0,
dtype="float16",
)
assert config.runner_type == expected_runner_type
assert config.task == expected_task
assert config.convert_type == expected_convert_type
@pytest.mark.parametrize(
("model_id", "expected_runner_type", "expected_task"),
("model_id", "expected_runner_type", "expected_convert_type"),
[
("openai/whisper-small", "generate", "transcription"),
("Qwen/Qwen2.5-1.5B-Instruct", "draft", "none"),
],
)
def test_transcription_task(model_id, expected_runner_type, expected_task):
config = ModelConfig(
model_id,
task="transcription",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
)
def test_draft_runner(model_id, expected_runner_type, expected_convert_type):
config = ModelConfig(model_id, runner="draft")
assert config.runner_type == expected_runner_type
assert config.task == expected_task
@pytest.mark.parametrize(("model_id", "bad_task"), [
("Qwen/Qwen2.5-Math-RM-72B", "generate"),
("Qwen/Qwen3-0.6B", "transcription"),
])
def test_incorrect_task(model_id, bad_task):
with pytest.raises(ValueError, match=r"does not support task=.*"):
ModelConfig(
model_id,
task=bad_task,
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
)
assert config.convert_type == expected_convert_type
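
For reference, the legacy `--task` values exercised in the tests above select runners roughly as follows. This is a summary sketch, not code from this commit; the convert type is resolved separately from the model architecture.

```python
# Legacy --task values and the runner they select, as replaced throughout
# this commit. Hypothetical summary table for illustration only.
TASK_TO_RUNNER: dict[str, str] = {
    "generate": "generate",
    "transcription": "generate",
    "embed": "pooling",
    "classify": "pooling",
    "score": "pooling",
    "reward": "pooling",
    "draft": "draft",
}
```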
MODEL_IDS_EXPECTED = [
@ -195,17 +196,7 @@ MODEL_IDS_EXPECTED = [
@pytest.mark.parametrize("model_id_expected", MODEL_IDS_EXPECTED)
def test_disable_sliding_window(model_id_expected):
model_id, expected = model_id_expected
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
disable_sliding_window=True,
)
model_config = ModelConfig(model_id, disable_sliding_window=True)
assert model_config.max_model_len == expected
@ -214,16 +205,7 @@ def test_get_sliding_window():
# Test that the sliding window is correctly computed.
# For Qwen1.5/Qwen2, get_sliding_window() should be None
# when use_sliding_window is False.
qwen2_model_config = ModelConfig(
"Qwen/Qwen1.5-7B",
task="auto",
tokenizer="Qwen/Qwen1.5-7B",
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)
qwen2_model_config = ModelConfig("Qwen/Qwen1.5-7B")
qwen2_model_config.hf_config.use_sliding_window = False
qwen2_model_config.hf_config.sliding_window = TEST_SLIDING_WINDOW
@ -232,16 +214,7 @@ def test_get_sliding_window():
qwen2_model_config.hf_config.use_sliding_window = True
assert qwen2_model_config.get_sliding_window() == TEST_SLIDING_WINDOW
mistral_model_config = ModelConfig(
"mistralai/Mistral-7B-v0.1",
task="auto",
tokenizer="mistralai/Mistral-7B-v0.1",
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)
mistral_model_config = ModelConfig("mistralai/Mistral-7B-v0.1")
mistral_model_config.hf_config.sliding_window = None
assert mistral_model_config.get_sliding_window() is None
@ -253,16 +226,7 @@ def test_get_sliding_window():
reason="Xformers backend is not supported on ROCm.")
def test_get_pooling_config():
model_id = "sentence-transformers/all-MiniLM-L12-v2"
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)
model_config = ModelConfig(model_id)
pooling_config = model_config._init_pooler_config()
assert pooling_config is not None
@ -275,14 +239,7 @@ def test_get_pooling_config():
reason="Xformers backend is not supported on ROCm.")
def test_get_pooling_config_from_args():
model_id = "sentence-transformers/all-MiniLM-L12-v2"
model_config = ModelConfig(model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None)
model_config = ModelConfig(model_id)
override_pooler_config = PoolerConfig(pooling_type='CLS', normalize=True)
model_config.override_pooler_config = override_pooler_config
@ -295,16 +252,8 @@ def test_get_pooling_config_from_args():
@pytest.mark.skipif(current_platform.is_rocm(),
reason="Xformers backend is not supported on ROCm.")
def test_get_bert_tokenization_sentence_transformer_config():
bge_model_config = ModelConfig(
model="BAAI/bge-base-en-v1.5",
task="auto",
tokenizer="BAAI/bge-base-en-v1.5",
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)
model_id = "BAAI/bge-base-en-v1.5"
bge_model_config = ModelConfig(model_id)
bert_bge_model_config = bge_model_config._get_encoder_config()
@ -317,27 +266,13 @@ def test_rope_customization():
TEST_ROPE_THETA = 16_000_000.0
LONGCHAT_ROPE_SCALING = {"rope_type": "linear", "factor": 8.0}
llama_model_config = ModelConfig(
"meta-llama/Meta-Llama-3-8B-Instruct",
task="auto",
tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
)
llama_model_config = ModelConfig("meta-llama/Meta-Llama-3-8B-Instruct")
assert getattr(llama_model_config.hf_config, "rope_scaling", None) is None
assert getattr(llama_model_config.hf_config, "rope_theta", None) == 500_000
assert llama_model_config.max_model_len == 8192
llama_model_config = ModelConfig(
"meta-llama/Meta-Llama-3-8B-Instruct",
task="auto",
tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
hf_overrides={
"rope_scaling": TEST_ROPE_SCALING,
"rope_theta": TEST_ROPE_THETA,
@ -349,15 +284,7 @@ def test_rope_customization():
None) == TEST_ROPE_THETA
assert llama_model_config.max_model_len == 16384
longchat_model_config = ModelConfig(
"lmsys/longchat-13b-16k",
task="auto",
tokenizer="lmsys/longchat-13b-16k",
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
)
longchat_model_config = ModelConfig("lmsys/longchat-13b-16k")
# Check if LONGCHAT_ROPE_SCALING entries are in longchat_model_config
assert all(
longchat_model_config.hf_config.rope_scaling.get(key) == value
@ -366,12 +293,6 @@ def test_rope_customization():
longchat_model_config = ModelConfig(
"lmsys/longchat-13b-16k",
task="auto",
tokenizer="lmsys/longchat-13b-16k",
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
hf_overrides={
"rope_scaling": TEST_ROPE_SCALING,
},
@ -390,15 +311,7 @@ def test_rope_customization():
("meta-llama/Llama-3.2-11B-Vision", True),
])
def test_is_encoder_decoder(model_id, is_encoder_decoder):
config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
)
config = ModelConfig(model_id)
assert config.is_encoder_decoder == is_encoder_decoder
@ -408,15 +321,7 @@ def test_is_encoder_decoder(model_id, is_encoder_decoder):
("Qwen/Qwen2-VL-2B-Instruct", True),
])
def test_uses_mrope(model_id, uses_mrope):
config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
dtype="float16",
seed=0,
)
config = ModelConfig(model_id)
assert config.uses_mrope == uses_mrope
@ -426,26 +331,12 @@ def test_generation_config_loading():
# When set generation_config to "vllm", the default generation config
# will not be loaded.
model_config = ModelConfig(model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
generation_config="vllm")
model_config = ModelConfig(model_id, generation_config="vllm")
assert model_config.get_diff_sampling_param() == {}
# When set generation_config to "auto", the default generation config
# should be loaded.
model_config = ModelConfig(model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
generation_config="auto")
model_config = ModelConfig(model_id, generation_config="auto")
correct_generation_config = {
"repetition_penalty": 1.1,
@ -461,12 +352,6 @@ def test_generation_config_loading():
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
generation_config="auto",
override_generation_config=override_generation_config)
@ -479,12 +364,6 @@ def test_generation_config_loading():
# is set, the override_generation_config should be used directly.
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
generation_config="vllm",
override_generation_config=override_generation_config)
@ -515,16 +394,7 @@ def test_load_config_pt_load_map_location(pt_load_map_location):
def test_get_and_verify_max_len(model_id, max_model_len, expected_max_len,
should_raise):
"""Test get_and_verify_max_len with different configurations."""
model_config = ModelConfig(
model_id,
task="auto",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)
model_config = ModelConfig(model_id)
if should_raise:
with pytest.raises(ValueError):


@ -21,13 +21,8 @@ def test_max_tokens_none():
def model_config():
return ModelConfig(
MODEL_NAME,
task="auto",
tokenizer=MODEL_NAME,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
dtype="float16",
revision=None,
)


@ -695,11 +695,7 @@ def test_estimate_max_model_len(model_id, max_model_len,
# Create a VllmConfig
model_config = ModelConfig(
model_id,
task="generate",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
runner="generate",
dtype="float16",
max_model_len=max_model_len,
)
@ -733,11 +729,7 @@ def test_get_max_concurrency_for_kv_cache_config():
max_model_len = 16384
model_config = ModelConfig(
model_id,
task="generate",
tokenizer=model_id,
tokenizer_mode="auto",
trust_remote_code=False,
seed=0,
runner="generate",
dtype="float16",
max_model_len=max_model_len,
)


@ -1248,9 +1248,6 @@ def create_scheduler_with_priority(
)
model_config = ModelConfig(
model=model,
task="auto",
tokenizer=model,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="float16",
seed=42,


@ -59,9 +59,6 @@ def create_scheduler(
)
model_config = ModelConfig(
model=model,
task="auto",
tokenizer=model,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="float16",
seed=42,


@ -68,9 +68,6 @@ def create_vllm_config(
)
model_config = ModelConfig(
model=model,
task="auto",
tokenizer=model,
tokenizer_mode="auto",
trust_remote_code=True,
dtype="float16",
seed=42,


@ -24,13 +24,8 @@ eagle3_dir = "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
def _create_proposer(method: str, k: int) -> EagleProposer:
model_config = ModelConfig(model=model_dir,
task="generate",
max_model_len=100,
tokenizer=model_dir,
tokenizer_mode="auto",
dtype="auto",
seed=None,
trust_remote_code=False)
runner="generate",
max_model_len=100)
# Choose model directory based on method
draft_model_dir = eagle_dir if method == "eagle" else eagle3_dir


@ -44,14 +44,7 @@ def test_ngram_proposer():
def ngram_proposer(min_n: int, max_n: int, k: int) -> NgramProposer:
# Dummy model config. Just to set max_model_len.
model_config = ModelConfig(model="facebook/opt-125m",
task="generate",
max_model_len=100,
tokenizer="facebook/opt-125m",
tokenizer_mode="auto",
dtype="auto",
seed=None,
trust_remote_code=False)
model_config = ModelConfig(model="facebook/opt-125m")
return NgramProposer(
vllm_config=VllmConfig(model_config=model_config,
speculative_config=SpeculativeConfig.


@ -26,10 +26,6 @@ def get_vllm_config():
)
model_config = ModelConfig(
model="facebook/opt-125m",
task="generate",
tokenizer="facebook/opt-125m",
tokenizer_mode="auto",
trust_remote_code=True,
dtype="bfloat16", # TPUs typically use bfloat16
seed=42,
)


@ -76,10 +76,6 @@ def get_vllm_config():
)
model_config = ModelConfig(
model="facebook/opt-125m",
task="generate",
tokenizer="facebook/opt-125m",
tokenizer_mode="auto",
trust_remote_code=True,
dtype="float16",
seed=42,
)


@ -26,7 +26,7 @@ from pydantic import (ConfigDict, SkipValidation, TypeAdapter, field_validator,
from pydantic.dataclasses import dataclass
from safetensors.torch import _TYPES as _SAFETENSORS_TO_TORCH_DTYPE
from torch.distributed import ProcessGroup, ReduceOp
from typing_extensions import Self, runtime_checkable
from typing_extensions import Self, assert_never, runtime_checkable
import vllm.envs as envs
from vllm import version
@ -102,12 +102,63 @@ RunnerOption = Literal["auto", "generate", "pooling", "draft"]
RunnerType = Literal["generate", "pooling", "draft"]
_RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = {
ConvertOption = Literal["auto", "none", "embed", "classify", "reward"]
ConvertType = Literal["none", "embed", "classify", "reward"]
_RUNNER_TASKS: dict[RunnerType, list[TaskOption]] = {
"generate": ["generate", "transcription"],
"pooling": ["encode", "embed", "classify", "reward"],
"pooling": ["embedding", "embed", "classify", "score", "reward"],
"draft": ["draft"],
}
_RUNNER_CONVERTS: dict[RunnerType, list[ConvertType]] = {
"generate": [],
"pooling": ["embed", "classify", "reward"],
"draft": [],
}
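
The `_RUNNER_CONVERTS` table above implies a simple compatibility rule: a convert option other than `"none"` is only valid for the `"pooling"` runner. A hypothetical standalone check, not a vLLM API, might look like:

```python
# Reproduction of the table above for a self-contained sketch.
_RUNNER_CONVERTS: dict[str, list[str]] = {
    "generate": [],
    "pooling": ["embed", "classify", "reward"],
    "draft": [],
}


def validate_convert(runner_type: str, convert_type: str) -> None:
    # "none" is always allowed; anything else must be listed for the runner.
    if convert_type == "none":
        return
    if convert_type not in _RUNNER_CONVERTS[runner_type]:
        raise ValueError(
            f"runner={runner_type!r} does not support "
            f"convert={convert_type!r}")


validate_convert("pooling", "classify")  # OK
```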
# Some model suffixes are based on auto classes from Transformers:
# https://huggingface.co/docs/transformers/en/model_doc/auto
# NOTE: Items higher on this list take priority over lower ones
_SUFFIX_TO_DEFAULTS: list[tuple[str, tuple[RunnerType, ConvertType]]] = [
("ForCausalLM", ("generate", "none")),
("ForConditionalGeneration", ("generate", "none")),
("ChatModel", ("generate", "none")),
("LMHeadModel", ("generate", "none")),
("ForTextEncoding", ("pooling", "embed")),
("EmbeddingModel", ("pooling", "embed")),
("ForSequenceClassification", ("pooling", "classify")),
("ForAudioClassification", ("pooling", "classify")),
("ForImageClassification", ("pooling", "classify")),
("ForVideoClassification", ("pooling", "classify")),
("ClassificationModel", ("pooling", "classify")),
("ForRewardModeling", ("pooling", "reward")),
("RewardModel", ("pooling", "reward")),
# Let other `*Model`s take priority
("Model", ("pooling", "embed")),
]
def iter_architecture_defaults():
yield from _SUFFIX_TO_DEFAULTS
def try_match_architecture_defaults(
architecture: str,
*,
runner_type: Optional[RunnerType] = None,
convert_type: Optional[ConvertType] = None,
) -> Optional[tuple[str, tuple[RunnerType, ConvertType]]]:
for suffix, (default_runner_type,
default_convert_type) in iter_architecture_defaults():
if ((runner_type is None or runner_type == default_runner_type) and
(convert_type is None or convert_type == default_convert_type)
and architecture.endswith(suffix)):
return suffix, (default_runner_type, default_convert_type)
return None
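The suffix table above can be exercised in isolation. The following sketch mirrors an abbreviated subset of `_SUFFIX_TO_DEFAULTS`; the helper name `match_defaults` is made up here and the table is not complete:

```python
# Abbreviated copy of the suffix table; order encodes priority, and the
# generic "Model" suffix sits last so more specific suffixes win.
SUFFIX_DEFAULTS = [
    ("ForCausalLM", ("generate", "none")),
    ("ForSequenceClassification", ("pooling", "classify")),
    ("RewardModel", ("pooling", "reward")),
    ("Model", ("pooling", "embed")),
]

def match_defaults(architecture: str):
    # First matching suffix wins, mirroring try_match_architecture_defaults
    for suffix, defaults in SUFFIX_DEFAULTS:
        if architecture.endswith(suffix):
            return suffix, defaults
    return None
```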
@runtime_checkable
class SupportsHash(Protocol):
@ -236,11 +287,16 @@ class ModelConfig:
runner: RunnerOption = "auto"
"""The type of model runner to use. Each vLLM instance only supports one
model runner, even if the same model can be used for multiple types."""
task: TaskOption = "auto"
"""The task to use the model for. If the model supports more than one
model runner, this is used to select which model runner to run.
convert: ConvertOption = "auto"
"""Convert the model using adapters defined in
[vllm.model_executor.models.adapters][]. The most common use case is to
adapt a text generation model to be used for pooling tasks."""
task: Optional[TaskOption] = None
"""[DEPRECATED] The task to use the model for. If the model supports more
than one model runner, this is used to select which model runner to run.
Note that the model may support other tasks using the same model runner."""
Note that the model may support other tasks using the same model runner.
"""
tokenizer: SkipValidation[str] = None # type: ignore
"""Name or path of the Hugging Face tokenizer to use. If unspecified, model
name or path will be used."""
@ -558,48 +614,103 @@ class ModelConfig:
self.hf_image_processor_config = get_hf_image_processor_config(
self.model, hf_token=self.hf_token, revision=self.revision)
# For pooling models, self.task is used to indicate the
# user-selected task
if self.task == "score":
if self._is_classify_task(self.architectures):
self.task = "classify"
architectures = self.architectures
registry = self.registry
is_generative_model = registry.is_text_generation_model(
architectures, self)
is_pooling_model = registry.is_pooling_model(architectures, self)
def _task_to_convert(task: TaskOption) -> ConvertType:
if task == "embedding" or task == "embed":
return "embed"
if task == "classify":
return "classify"
if task == "reward":
return "reward"
if task == "score":
new_task = self._get_default_pooling_task(architectures)
return "classify" if new_task == "classify" else "embed"
return "none"
if self.task is not None:
runner: RunnerOption = "auto"
convert: ConvertOption = "auto"
msg_prefix = ("The 'task' option has been deprecated and will be "
"removed in v0.13.0 or v1.0, whichever comes first.")
msg_hint = "Please remove this option."
is_generative_task = self.task in _RUNNER_TASKS["generate"]
is_pooling_task = self.task in _RUNNER_TASKS["pooling"]
if is_generative_model and is_pooling_model:
if is_generative_task:
runner = "generate"
convert = "auto"
msg_hint = ("Please replace this option with `--runner "
"generate` to continue using this model "
"as a generative model.")
elif is_pooling_task:
runner = "pooling"
convert = "auto"
msg_hint = ("Please replace this option with `--runner "
"pooling` to continue using this model "
"as a pooling model.")
else: # task == "auto"
pass
elif is_generative_model or is_pooling_model:
if is_generative_task:
runner = "generate"
convert = "auto"
msg_hint = "Please remove this option"
elif is_pooling_task:
runner = "pooling"
convert = _task_to_convert(self.task)
msg_hint = ("Please replace this option with `--convert "
f"{convert}` to continue using this model "
"as a pooling model.")
else: # task == "auto"
pass
else:
self.task = "embed"
elif self.task == "embedding":
msg = ("The 'embedding' task has been renamed to 'embed', please "
"use the new name. The old name will be removed in v1.0.")
raise AssertionError("The model should be a generative or "
"pooling model when task is set to "
f"{self.task!r}.")
self.runner = runner
self.convert = convert
msg = f"{msg_prefix} {msg_hint}"
warnings.warn(msg, DeprecationWarning, stacklevel=2)
self.task = "embed"
self.runner_type = self._get_runner_type(architectures, self.runner)
self.convert_type = self._get_convert_type(architectures,
self.runner_type,
self.convert)
model_info, arch = self.registry.inspect_model_cls(self.architectures)
if self.runner_type == "generate" and not is_generative_model:
generate_converts = _RUNNER_CONVERTS["generate"]
if self.convert_type not in generate_converts:
# Currently we don't have any converters for generative models
raise ValueError(
"This model does not support `--runner generate`.")
if self.runner_type == "pooling" and not is_pooling_model:
pooling_converts = _RUNNER_CONVERTS["pooling"]
if self.convert_type not in pooling_converts:
convert_option = "<" + "|".join(pooling_converts) + ">"
raise ValueError(
"This model does not support `--runner pooling`. "
f"You can pass `--convert {convert_option} to adapt "
"it into a pooling model.")
self.supported_tasks = self._get_supported_tasks(
architectures, self.runner_type, self.convert_type)
# Note: Initialize these attributes early because transformers fallback
# may fail to load dynamic modules in child processes
model_info, arch = registry.inspect_model_cls(architectures, self)
self._model_info = model_info
self._architecture = arch
all_supported_tasks = self._get_supported_tasks(self.task)
logger.debug("Tasks supported by runner type: %s", all_supported_tasks)
supported_runner_types = self._get_supported_runner_types(
all_supported_tasks)
runner_type = self._resolve_runner(self.runner, self.task,
supported_runner_types,
all_supported_tasks)
logger.debug("Selected runner type: %s", runner_type)
# For pooling models, self.task is used to indicate the
# user-selected task
if runner_type == "pooling" and self.task == "auto":
selected_task = all_supported_tasks[runner_type][-1]
assert selected_task != "encode"
self.task = selected_task
self.supported_runner_types = supported_runner_types
self.runner_type = runner_type
self.supported_tasks = all_supported_tasks[runner_type]
if self.runner_type in ("draft",
"generate") and self.task != "transcription":
self.truncation_side = "left"
else:
self.truncation_side = "right"
logger.info("Resolved architecture: %s", arch)
self.pooler_config = self._init_pooler_config()
@ -652,16 +763,10 @@ class ModelConfig:
self.original_max_model_len = self.max_model_len
self.max_model_len = self.get_and_verify_max_len(self.max_model_len)
self.multimodal_config = self._init_multimodal_config()
self.model_supports_multimodal_raw_input = (
self.registry.supports_multimodal_raw_input(self.architectures))
if not self.skip_tokenizer_init:
self._verify_tokenizer_mode()
self.is_attention_free = self._init_attention_free()
self.is_hybrid = self._init_is_hybrid()
self.has_noops = self._init_has_noops()
self.has_inner_state = self._init_has_inner_state()
if (not current_platform.is_neuron() and self.override_neuron_config):
raise ValueError(
"`override_neuron_config` is only supported on Neuron.")
@ -702,30 +807,13 @@ class ModelConfig:
@property
def architectures(self) -> list[str]:
# architectures in the model config.
architectures = getattr(self.hf_config, "architectures", [])
# The registry assumes that it can always inspect the vLLM model class
# for a given architecture. This assumption breaks down for the
# Transformers backend, which may use a different class depending on
# the model type. To work around this, we add the correct Transformers
# backend class to the architectures list. We must do this here because
# we need access to the `hf_config` to determine the backend class.
transformers_backend_cls = self._get_transformers_backend_cls()
if (self.model_impl != ModelImpl.VLLM.value
and all(arch != transformers_backend_cls
for arch in architectures)):
architectures.append(transformers_backend_cls)
return architectures
return getattr(self.hf_config, "architectures", [])
@property
def architecture(self) -> str:
# The architecture vllm actually used.
"""The architecture vllm actually used."""
return self._architecture
@property
def model_info(self):
return self._model_info
def maybe_pull_model_tokenizer_for_s3(self, model: str,
tokenizer: str) -> None:
"""Pull model/tokenizer from S3 to temporary directory when needed.
@ -763,7 +851,7 @@ class ModelConfig:
self.tokenizer = s3_tokenizer.dir
def _init_multimodal_config(self) -> Optional["MultiModalConfig"]:
if self.registry.is_multimodal_model(self.architectures):
if self.registry.is_multimodal_model(self.architectures, self):
return MultiModalConfig(
limit_per_prompt=self.limit_mm_per_prompt,
media_io_kwargs=self.media_io_kwargs,
@ -819,19 +907,6 @@ class ModelConfig:
return None
def _init_attention_free(self) -> bool:
return self.registry.is_attention_free_model(self.architectures)
def _init_is_hybrid(self) -> bool:
return self.registry.is_hybrid_model(self.architectures)
def _init_has_noops(self) -> bool:
architectures = getattr(self.hf_config, "architectures", [])
return self.registry.is_noops_model(architectures)
def _init_has_inner_state(self) -> bool:
return self.registry.model_has_inner_state(self.architectures)
def _verify_tokenizer_mode(self) -> None:
tokenizer_mode = cast(TokenizerMode, self.tokenizer_mode.lower())
if tokenizer_mode not in get_args(TokenizerMode):
@ -840,155 +915,168 @@ class ModelConfig:
f"one of {get_args(TokenizerMode)}.")
self.tokenizer_mode = tokenizer_mode
def _is_classify_task(self, architectures: list[str]):
for arch in architectures:
if arch.endswith("ForSequenceClassification"):
return True
return self.registry.is_cross_encoder_model(architectures)
def _get_preferred_pooling_task(
def _get_default_runner_type(
self,
architectures: list[str],
) -> _ResolvedTask:
model_id = self.model
if get_pooling_config(model_id, self.revision):
) -> RunnerType:
registry = self.registry
# Some Sentence Transformers models use *ForCausalLM archs
if get_pooling_config(self.model, self.revision):
return "pooling"
for arch in architectures:
if arch in registry.get_supported_archs():
if registry.is_pooling_model(architectures, self):
return "pooling"
if registry.is_text_generation_model(architectures, self):
return "generate"
match = try_match_architecture_defaults(arch)
if match:
_, (runner_type, _) = match
return runner_type
return "generate"
def _get_runner_type(
self,
architectures: list[str],
runner: RunnerOption,
) -> RunnerType:
if runner != "auto":
return runner
runner_type = self._get_default_runner_type(architectures)
logger.info(
"Resolved `--runner auto` to `--runner %s`. "
"Pass the value explicitly to silence this message.", runner_type)
return runner_type
def _get_default_convert_type(
self,
architectures: list[str],
runner_type: RunnerType,
) -> ConvertType:
registry = self.registry
for arch in architectures:
if arch in registry.get_supported_archs():
if (runner_type == "generate"
and registry.is_text_generation_model(
architectures, self)):
return "none"
if (runner_type == "pooling"
and registry.is_pooling_model(architectures, self)):
return "none"
match = try_match_architecture_defaults(arch,
runner_type=runner_type)
if match:
_, (_, convert_type) = match
return convert_type
        # This handles Sentence Transformers models that use *ForCausalLM
        # archs, as well as multi-modal pooling models that are not defined
        # as Sentence Transformers models
if runner_type == "pooling":
return "embed"
if self.registry.is_transcription_model(architectures):
return "transcription"
suffix_to_preferred_task: list[tuple[str, _ResolvedTask]] = [
# Other models follow this pattern
("EmbeddingModel", "embed"),
("RewardModel", "reward"),
]
return "none"
for suffix, pref_task in suffix_to_preferred_task:
if self.architecture.endswith(suffix):
return pref_task
def _get_convert_type(
self,
architectures: list[str],
runner_type: RunnerType,
convert: ConvertOption,
) -> ConvertType:
if convert != "auto":
return convert
return "embed"
convert_type = self._get_default_convert_type(architectures,
runner_type)
logger.info(
"Resolved `--convert auto` to `--convert %s`. "
"Pass the value explicitly to silence this message.", convert_type)
return convert_type
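The `--convert auto` resolution above can be summarized for the pooling runner. A minimal sketch, with the registry checks reduced to a single boolean stand-in: a natively registered pooling model needs no conversion, a suffix match supplies its default, and `"embed"` is the final fallback:

```python
# Suffix -> convert defaults for the pooling runner (abbreviated)
SUFFIX_DEFAULTS = {
    "ForSequenceClassification": "classify",
    "RewardModel": "reward",
}

def default_convert(arch: str, is_native_pooling: bool) -> str:
    if is_native_pooling:
        return "none"  # already a pooling model, nothing to adapt
    for suffix, convert in SUFFIX_DEFAULTS.items():
        if arch.endswith(suffix):
            return convert
    return "embed"  # Sentence Transformers / multi-modal fallback
```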
def _get_supported_generation_tasks(
self,
task_option: TaskOption,
architectures: list[str],
convert_type: ConvertType,
) -> list[_ResolvedTask]:
registry = self.registry
architectures = self.architectures
if registry.is_transcription_only_model(architectures):
if registry.is_transcription_only_model(architectures, self):
return ["transcription"]
# TODO: Use get_supported_generation_tasks once V0 is removed
supported_tasks = list[_ResolvedTask]()
if registry.is_text_generation_model(architectures):
if (registry.is_text_generation_model(architectures, self)
or convert_type in _RUNNER_CONVERTS["generate"]):
supported_tasks.append("generate")
if registry.is_transcription_model(architectures):
supported_tasks.append("transcription")
if registry.is_transcription_model(architectures, self):
supported_tasks.append("transcription")
return supported_tasks
def _get_default_pooling_task(
self,
architectures: list[str],
) -> Literal["embed", "classify", "reward"]:
if self.registry.is_cross_encoder_model(architectures, self):
return "classify"
for arch in architectures:
match = try_match_architecture_defaults(arch,
runner_type="pooling")
if match:
_, (_, convert_type) = match
assert convert_type != "none"
return convert_type
return "embed"
def _get_supported_pooling_tasks(
self,
task_option: TaskOption,
architectures: list[str],
convert_type: ConvertType,
) -> list[_ResolvedTask]:
registry = self.registry
architectures = self.architectures
# TODO: Use get_supported_pooling_tasks once V0 is removed
supported_tasks = list[_ResolvedTask]()
if registry.is_pooling_model(architectures):
if (registry.is_pooling_model(architectures, self)
or convert_type in _RUNNER_CONVERTS["pooling"]):
supported_tasks.append("encode")
# For now, users must specify the task (other than "pooling")
# to use for pooling models
if task_option == "auto":
preferred_task = self._get_preferred_pooling_task(
architectures)
supported_tasks.append(preferred_task)
elif task_option in _RUNNER_TASKS["pooling"]:
supported_tasks.append(cast(_ResolvedTask, task_option))
extra_task = (self._get_default_pooling_task(architectures)
if convert_type == "none" else convert_type)
supported_tasks.append(extra_task)
return supported_tasks
def _get_supported_tasks(
self,
task_option: TaskOption,
) -> dict[RunnerType, list[_ResolvedTask]]:
if self._is_classify_task(self.architectures):
return {"generate": [], "pooling": ["classify"], "draft": []}
else:
return {
"generate": self._get_supported_generation_tasks(task_option),
"pooling": self._get_supported_pooling_tasks(task_option),
"draft": ["draft"]
}
architectures: list[str],
runner_type: RunnerType,
convert_type: ConvertType,
) -> list[_ResolvedTask]:
if runner_type == "generate":
return self._get_supported_generation_tasks(
architectures, convert_type)
if runner_type == "pooling":
return self._get_supported_pooling_tasks(architectures,
convert_type)
if runner_type == "draft":
return ["draft"]
def _get_supported_runner_types(
self,
supported_tasks: dict[RunnerType, list[_ResolvedTask]],
) -> set[RunnerType]:
return {
runner
for runner, runner_tasks in supported_tasks.items()
if len(runner_tasks) > 0
}
def _resolve_runner(
self,
runner_option: RunnerOption,
task_option: TaskOption,
supported_runner_types: set[RunnerType],
supported_tasks: dict[RunnerType, list[_ResolvedTask]],
) -> RunnerType:
if not supported_runner_types:
raise ValueError("This model does not support any model runners!")
if runner_option != "auto":
if runner_option not in supported_runner_types:
raise ValueError(
f"This model does not support runner={runner_option!r}. "
f"Available runners: {supported_runner_types}")
return runner_option
if task_option != "auto":
for runner, runner_tasks in supported_tasks.items():
if task_option in runner_tasks:
return runner
else:
task_runner: RunnerType = next(
runner for runner, tasks in _RUNNER_TASKS.items()
if task_option in tasks)
raise ValueError(
f"This model does not support task={task_option!r}. "
f"Available tasks for runner={task_runner!r}: "
f"{supported_tasks[task_runner]}")
if "classify" in supported_tasks.get("pooling", []):
# When multiple pooling tasks are present, default to
# pooling (eg cross-encoder) for non-standard architectures.
return "pooling"
suffix_to_preferred_runner: list[tuple[str, RunnerType]] = [
("ForCausalLM", "generate"),
("ForConditionalGeneration", "generate"),
("ChatModel", "generate"),
("LMHeadModel", "generate"),
("EmbeddingModel", "pooling"),
("RewardModel", "pooling"),
]
for suffix, pref_runner in suffix_to_preferred_runner:
if self.architecture.endswith(
suffix) and pref_runner in supported_runner_types:
return pref_runner
if "generate" in supported_runner_types:
return "generate"
if "pooling" in supported_runner_types:
return "pooling"
raise AssertionError("This line should not be reached")
assert_never(runner_type)
def _parse_quant_hf_config(self):
quant_cfg = getattr(self.hf_config, "quantization_config", None)
@ -1216,7 +1304,8 @@ class ModelConfig:
pipeline_parallel_size = parallel_config.pipeline_parallel_size
if pipeline_parallel_size > 1:
if not self.registry.is_pp_supported_model(self.architectures):
if not self.registry.is_pp_supported_model(self.architectures,
self):
raise NotImplementedError(
"Pipeline parallelism is not supported for this model. "
"Supported models implement the `SupportsPP` interface.")
@ -1558,17 +1647,41 @@ class ModelConfig:
@property
def is_cross_encoder(self) -> bool:
return self.task == "classify"
return (self._model_info.supports_cross_encoding
or self.convert_type == "classify")
@property
def is_pp_supported(self) -> bool:
return self._model_info.supports_pp
@property
def is_multimodal_raw_input_supported(self) -> bool:
return self._model_info.supports_multimodal_raw_input
@property
def is_attention_free(self) -> bool:
return self._model_info.is_attention_free
@property
def is_hybrid(self) -> bool:
return self._model_info.is_hybrid
@property
def has_noops(self) -> bool:
return self._model_info.has_noops
@property
def has_inner_state(self):
return self._model_info.has_inner_state
@property
def is_v1_compatible(self) -> bool:
return not self._model_info.supports_v0_only
@property
def use_mla(self) -> bool:
return self.is_deepseek_mla and not envs.VLLM_MLA_DISABLE
@property
def is_v1_compatible(self) -> bool:
architectures = getattr(self.hf_config, "architectures", [])
return me_models.ModelRegistry.is_v1_compatible(architectures)
@property
def is_matryoshka(self) -> bool:
return (bool(getattr(self.hf_config, "matryoshka_dimensions", None))
@ -4769,7 +4882,10 @@ class VllmConfig:
self.scheduler_config.max_model_len = max_model_len
def try_verify_and_update_config(self):
architecture = getattr(self.model_config, "architecture", None)
if self.model_config is None:
return
architecture = self.model_config.architecture
if architecture is None:
return
@ -4782,7 +4898,7 @@ class VllmConfig:
if self.model_config.is_hybrid:
HybridAttentionMambaModelConfig.verify_and_update_config(self)
if self.model_config.task == "classify":
if self.model_config.convert_type == "classify":
# Maybe convert ForCausalLM into ForSequenceClassification model.
from vllm.model_executor.models.adapters import (
SequenceClassificationConfig)


@ -22,14 +22,15 @@ from typing_extensions import TypeIs
import vllm.envs as envs
from vllm.config import (BlockSize, CacheConfig, CacheDType, CompilationConfig,
ConfigFormat, ConfigType, DecodingConfig,
DetailedTraceModules, Device, DeviceConfig,
DistributedExecutorBackend, GuidedDecodingBackend,
GuidedDecodingBackendV1, HfOverrides, KVEventsConfig,
KVTransferConfig, LoadConfig, LogprobsMode,
LoRAConfig, ModelConfig, ModelDType, ModelImpl,
MultiModalConfig, ObservabilityConfig, ParallelConfig,
PoolerConfig, PrefixCachingHashAlgo, SchedulerConfig,
ConfigFormat, ConfigType, ConvertOption,
DecodingConfig, DetailedTraceModules, Device,
DeviceConfig, DistributedExecutorBackend,
GuidedDecodingBackend, GuidedDecodingBackendV1,
HfOverrides, KVEventsConfig, KVTransferConfig,
LoadConfig, LogprobsMode, LoRAConfig, ModelConfig,
ModelDType, ModelImpl, MultiModalConfig,
ObservabilityConfig, ParallelConfig, PoolerConfig,
PrefixCachingHashAlgo, RunnerOption, SchedulerConfig,
SchedulerPolicy, SpeculativeConfig, TaskOption,
TokenizerMode, VllmConfig, get_attr_docs, get_field)
from vllm.logger import init_logger
@ -270,7 +271,9 @@ class EngineArgs:
str, List[str]]] = ModelConfig.served_model_name
tokenizer: Optional[str] = ModelConfig.tokenizer
hf_config_path: Optional[str] = ModelConfig.hf_config_path
task: TaskOption = ModelConfig.task
runner: RunnerOption = ModelConfig.runner
convert: ConvertOption = ModelConfig.convert
task: Optional[TaskOption] = ModelConfig.task
skip_tokenizer_init: bool = ModelConfig.skip_tokenizer_init
enable_prompt_embeds: bool = ModelConfig.enable_prompt_embeds
tokenizer_mode: TokenizerMode = ModelConfig.tokenizer_mode
@ -461,7 +464,11 @@ class EngineArgs:
)
if not ('serve' in sys.argv[1:] and '--help' in sys.argv[1:]):
model_group.add_argument("--model", **model_kwargs["model"])
model_group.add_argument("--task", **model_kwargs["task"])
model_group.add_argument("--runner", **model_kwargs["runner"])
model_group.add_argument("--convert", **model_kwargs["convert"])
model_group.add_argument("--task",
**model_kwargs["task"],
deprecated=True)
model_group.add_argument("--tokenizer", **model_kwargs["tokenizer"])
model_group.add_argument("--tokenizer-mode",
**model_kwargs["tokenizer_mode"])
@ -870,6 +877,8 @@ class EngineArgs:
return ModelConfig(
model=self.model,
hf_config_path=self.hf_config_path,
runner=self.runner,
convert=self.convert,
task=self.task,
tokenizer=self.tokenizer,
tokenizer_mode=self.tokenizer_mode,


@ -20,8 +20,8 @@ from vllm.beam_search import (BeamSearchInstance, BeamSearchOutput,
create_sort_beams_key_function)
from vllm.config import (CompilationConfig, ModelDType, TokenizerMode,
is_init_field)
from vllm.engine.arg_utils import (EngineArgs, HfOverrides, PoolerConfig,
TaskOption)
from vllm.engine.arg_utils import (ConvertOption, EngineArgs, HfOverrides,
PoolerConfig, RunnerOption)
from vllm.engine.llm_engine import LLMEngine
from vllm.entrypoints.chat_utils import (ChatCompletionMessageParam,
ChatTemplateContentFormatOption,
@ -170,7 +170,8 @@ class LLM:
self,
model: str,
*,
task: TaskOption = "auto",
runner: RunnerOption = "auto",
convert: ConvertOption = "auto",
tokenizer: Optional[str] = None,
tokenizer_mode: TokenizerMode = "auto",
skip_tokenizer_init: bool = False,
@ -244,7 +245,8 @@ class LLM:
engine_args = EngineArgs(
model=model,
task=task,
runner=runner,
convert=convert,
tokenizer=tokenizer,
tokenizer_mode=tokenizer_mode,
skip_tokenizer_init=skip_tokenizer_init,
@ -459,18 +461,10 @@ class LLM:
model_config = self.llm_engine.model_config
runner_type = model_config.runner_type
if runner_type != "generate":
messages = [
"LLM.generate() is only supported for generative models."
]
if "generate" in model_config.supported_runner_types:
messages.append(
"Your model supports the 'generate' runner, but is "
f"currently initialized for the '{runner_type}' runner. "
"Please initialize vLLM using `--task generate` or "
"`--task transcription`.")
raise ValueError(" ".join(messages))
raise ValueError(
"LLM.generate() is only supported for generative models. "
"Try passing `--runner generate` to use the model as a "
"generative model.")
if prompt_token_ids is not None:
parsed_prompts = self._convert_v1_inputs(
@ -497,7 +491,8 @@ class LLM:
truncate_prompt_tokens = None
if isinstance(sampling_params, SamplingParams):
truncate_prompt_tokens = sampling_params.truncate_prompt_tokens
_validate_truncation_size(self.llm_engine.model_config.max_model_len,
_validate_truncation_size(model_config.max_model_len,
truncate_prompt_tokens, tokenization_kwargs)
# Add any modality specific loras to the corresponding prompts
@ -1100,16 +1095,10 @@ class LLM:
model_config = self.llm_engine.model_config
runner_type = model_config.runner_type
if runner_type != "pooling":
messages = ["LLM.encode() is only supported for pooling models."]
if "pooling" in model_config.supported_runner_types:
messages.append(
"Your model supports the 'pooling' runner, but is "
f"currently initialized for the '{runner_type}' runner. "
"Please initialize vLLM using `--task embed`, "
"`--task classify`, `--task score` etc.")
raise ValueError(" ".join(messages))
raise ValueError(
"LLM.encode() is only supported for pooling models. "
"Try passing `--runner pooling` to use the model as a "
"pooling model.")
if prompt_token_ids is not None:
parsed_prompts = self._convert_v1_inputs(
@ -1183,8 +1172,9 @@ class LLM:
embedding vectors in the same order as the input prompts.
"""
if "embed" not in self.supported_tasks:
raise ValueError("Embedding API is not supported by this model. "
"Please set `--task embed`.")
raise ValueError(
"Embedding API is not supported by this model. "
"Try converting the model using `--convert embed`.")
items = self.encode(
prompts,
@ -1229,7 +1219,7 @@ class LLM:
if "classify" not in self.supported_tasks:
raise ValueError(
"Classification API is not supported by this model. "
"Please set `--task classify`.")
"Try converting the model using `--convert classify`.")
items = self.encode(
prompts,
@ -1283,27 +1273,26 @@ class LLM:
use_tqdm: Union[bool, Callable[..., tqdm]] = True,
lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
) -> list[ScoringRequestOutput]:
model_config = self.llm_engine.model_config
if isinstance(tokenizer, MistralTokenizer):
raise ValueError(
"Score API is only enabled for `--task embed or score`")
"Score API is not supported for Mistral tokenizer")
if len(data_1) == 1:
data_1 = data_1 * len(data_2)
pooling_params = PoolingParams(task="score")
tokenization_kwargs: dict[str, Any] = {}
_validate_truncation_size(self.llm_engine.model_config.max_model_len,
_validate_truncation_size(model_config.max_model_len,
truncate_prompt_tokens, tokenization_kwargs)
parsed_prompts = []
input_pairs = [(t1, t2) for t1, t2 in zip(data_1, data_2)]
if self.llm_engine.model_config.is_multimodal_model:
model_config = self.llm_engine.model_config
if model_config.is_multimodal_model:
for q, d in input_pairs:
_, engine_prompt = get_score_prompt(
model_config=model_config,
@ -1314,11 +1303,9 @@ class LLM:
)
parsed_prompts.append(engine_prompt)
else:
for q, t in input_pairs:
if self.llm_engine.model_config.use_pad_token:
if model_config.use_pad_token:
# cross_encoder models defaults to using pad_token.
prompt_inputs = tokenizer(
text=q, # type: ignore[arg-type]
@ -1396,23 +1383,18 @@ class LLM:
model_config = self.llm_engine.model_config
runner_type = model_config.runner_type
if runner_type != "pooling":
messages = ["LLM.score() is only supported for pooling models."]
if "pooling" in model_config.supported_runner_types:
messages.append(
"Your model supports the 'pooling' runner, but is "
f"currently initialized for the '{runner_type}' runner. "
"Please initialize vLLM using `--task embed`, "
"`--task classify`, `--task score` etc.")
raise ValueError(" ".join(messages))
raise ValueError(
"LLM.score() is only supported for pooling models. "
"Try passing `--runner pooling` to use the model as a "
"pooling model.")
supported_tasks = self.supported_tasks
if all(t not in supported_tasks for t in ("embed", "classify")):
raise ValueError("Score API is not supported by this model. "
"Please set `--task embed` or `--task classify`.")
"Try converting the model using "
"`--convert embed` or `--convert classify`.")
if (model_config.task == "classify"
if (model_config.is_cross_encoder
and getattr(model_config.hf_config, "num_labels", 0) != 1):
raise ValueError("Score API is only enabled for num_labels == 1.")
@ -1421,15 +1403,14 @@ class LLM:
# lists of tokens to the `text` and `text_pair` kwargs
tokenizer = self.get_tokenizer()
if not self.llm_engine.model_config.is_multimodal_model:
if not model_config.is_multimodal_model:
def check_data_type(data: Union[SingletonPrompt,
Sequence[SingletonPrompt],
ScoreMultiModalParam]):
if isinstance(data, dict) and "content" in data:
raise ValueError(
f"ScoreMultiModalParam is not supported for {self.llm_engine.model_config.architecture}", # noqa: E501
)
raise ValueError("ScoreMultiModalParam is not supported "
f"for {model_config.architecture}")
check_data_type(data_1)
check_data_type(data_2)
@ -1471,7 +1452,7 @@ class LLM:
_validate_score_input_lens(data_1, data_2) # type: ignore[arg-type]
if self.llm_engine.model_config.is_cross_encoder:
if model_config.is_cross_encoder:
return self._cross_encoding_score(
tokenizer,
data_1, # type: ignore[arg-type]


@ -1734,7 +1734,6 @@ async def init_app_state(
state.openai_serving_models,
request_logger=request_logger,
) if "transcription" in supported_tasks else None
state.task = model_config.task
state.enable_server_load_tracking = args.enable_server_load_tracking
state.server_load_metrics = 0


@ -9,9 +9,8 @@ from dataclasses import dataclass, field
from typing import Optional
import torch
import transformers
from torch import nn
from transformers.dynamic_module_utils import get_class_from_dynamic_module
from typing_extensions import assert_never
from vllm.attention import Attention
from vllm.config import (ModelConfig, ModelImpl, VllmConfig,
@ -20,13 +19,10 @@ from vllm.logger import init_logger
from vllm.model_executor.layers.linear import QKVCrossParallelLinear
from vllm.model_executor.layers.quantization.base_config import (
QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.models import ModelRegistry
from vllm.model_executor.models.adapters import (as_embedding_model,
as_reward_model,
as_seq_cls_model)
from vllm.model_executor.models.interfaces import SupportsQuant
from vllm.model_executor.models.registry import (_PREVIOUSLY_SUPPORTED_MODELS,
_TRANSFORMERS_BACKEND_MODELS)
from vllm.utils import is_pin_memory_available
logger = init_logger(__name__)
@ -169,61 +165,6 @@ def device_loading_context(module: torch.nn.Module,
# New parameters or parameters already on target device are untouched
def resolve_transformers_arch(model_config: ModelConfig,
architectures: list[str]):
if model_config.model_impl == ModelImpl.VLLM:
raise ValueError(
"Attempting to resolve architecture from the Transformers library "
"but the model implementation is set to vLLM. This should never "
"happen.")
for i, arch in enumerate(architectures):
if arch in _TRANSFORMERS_BACKEND_MODELS:
continue
if model_config.model_impl == ModelImpl.AUTO:
logger.warning(
"%s has no vLLM implementation, falling back to Transformers "
"implementation. Some features may not be supported and "
"performance may not be optimal.", arch)
auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
None) or dict()
# Make sure that config class is always initialized before model class,
# otherwise the model class won't be able to access the config class,
# the expected auto_map should have correct order like:
# "auto_map": {
# "AutoConfig": "<your-repo-name>--<config-name>",
# "AutoModel": "<your-repo-name>--<config-name>",
# "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
# },
auto_modules = {
name:
get_class_from_dynamic_module(module,
model_config.model,
revision=model_config.revision)
for name, module in sorted(auto_map.items(), key=lambda x: x[0])
}
model_module = getattr(transformers, arch, None)
if model_module is None:
if "AutoModel" not in auto_map:
raise ValueError(
f"Cannot find model module. '{arch}' is not a registered "
"model in the Transformers library (only relevant if the "
"model is meant to be in Transformers) and 'AutoModel' is "
"not present in the model config's 'auto_map' (relevant "
"if the model is custom).")
model_module = auto_modules["AutoModel"]
if not model_module.is_backend_compatible():
raise ValueError(
f"The Transformers implementation of '{arch}' is not "
"compatible with vLLM.")
architectures[i] = model_config._get_transformers_backend_cls()
return architectures
def get_model_architecture(
model_config: ModelConfig) -> tuple[type[nn.Module], str]:
architectures = getattr(model_config.hf_config, "architectures", [])
@@ -239,56 +180,38 @@ def get_model_architecture(
"bitsandbytes",
]
vllm_supported_archs = ModelRegistry.get_supported_archs()
is_supported = lambda arch: (arch in vllm_supported_archs and arch not in
_TRANSFORMERS_BACKEND_MODELS)
vllm_not_supported = not any(is_supported(arch) for arch in architectures)
if vllm_not_supported:
# try automatic conversion in adapters.py
for arch in architectures:
if not arch.endswith("ForSequenceClassification"):
continue
assert model_config.task == "classify"
causal_lm_arch = arch.replace("ForSequenceClassification",
"ForCausalLM")
causal_lm_arch_vllm_supported = (causal_lm_arch
in vllm_supported_archs)
if not causal_lm_arch_vllm_supported:
continue
architectures = [causal_lm_arch]
vllm_not_supported = False
break
if any(arch in _PREVIOUSLY_SUPPORTED_MODELS for arch in architectures):
previous_version = _PREVIOUSLY_SUPPORTED_MODELS[architectures[0]]
raise ValueError(
f"Model architecture {architectures[0]} was supported"
f" in vLLM until version {previous_version}, and is "
"not supported anymore. Please use an older version"
" of vLLM if you want to use this model architecture.")
if (model_config.model_impl == ModelImpl.TRANSFORMERS or
model_config.model_impl == ModelImpl.AUTO and vllm_not_supported):
architectures = resolve_transformers_arch(model_config, architectures)
logger.debug_once("Resolve transformers arch %s", str(architectures))
elif (model_config.quantization is not None
and model_config.quantization not in mixtral_supported
and "MixtralForCausalLM" in architectures):
if (model_config.quantization is not None
and model_config.quantization not in mixtral_supported
and "MixtralForCausalLM" in architectures):
architectures = ["QuantMixtralForCausalLM"]
model_cls, arch = ModelRegistry.resolve_model_cls(architectures)
if model_config.task == "embed":
logger.debug_once("Automatic conversion using `as_embedding_model`.")
model_cls, arch = model_config.registry.resolve_model_cls(
architectures,
model_config=model_config,
)
if arch == model_config._get_transformers_backend_cls():
assert model_config.model_impl != ModelImpl.VLLM
if model_config.model_impl == ModelImpl.AUTO:
logger.warning_once(
"%s has no vLLM implementation, falling back to Transformers "
"implementation. Some features may not be supported and "
"performance may not be optimal.", arch)
convert_type = model_config.convert_type
if convert_type == "none":
pass
elif convert_type == "embed":
logger.debug_once("Converting to embedding model.")
model_cls = as_embedding_model(model_cls)
elif model_config.task == "classify":
logger.debug_once("Automatic conversion using `as_seq_cls_model`.")
elif convert_type == "classify":
logger.debug_once("Converting to sequence classification model.")
model_cls = as_seq_cls_model(model_cls)
elif model_config.task == "reward":
logger.debug_once("Automatic conversion using `as_reward_model`.")
elif convert_type == "reward":
logger.debug_once("Converting to reward model.")
model_cls = as_reward_model(model_cls)
else:
assert_never(convert_type)
return model_cls, arch
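The hunk above replaces the old `model_config.task` branching with an explicit `convert_type` dispatch that ends in `assert_never` for exhaustiveness. A minimal runnable sketch of that control flow, using hypothetical stand-in adapter functions (not vLLM's real `as_embedding_model` / `as_seq_cls_model` / `as_reward_model`):

```python
from typing import Callable

# Hypothetical stand-ins for vLLM's adapter functions; the real ones wrap
# a causal-LM class with pooling/classification/reward heads.
def as_embedding_model(cls: type) -> type:
    return type(f"Embed{cls.__name__}", (cls,), {})

def as_seq_cls_model(cls: type) -> type:
    return type(f"SeqCls{cls.__name__}", (cls,), {})

def as_reward_model(cls: type) -> type:
    return type(f"Reward{cls.__name__}", (cls,), {})

_CONVERTERS: dict[str, Callable[[type], type]] = {
    "embed": as_embedding_model,
    "classify": as_seq_cls_model,
    "reward": as_reward_model,
}

def apply_convert(model_cls: type, convert_type: str) -> type:
    """Mirror the convert_type dispatch: 'none' is a no-op, known values
    wrap the class, anything else is rejected (assert_never in the diff)."""
    if convert_type == "none":
        return model_cls
    try:
        return _CONVERTERS[convert_type](model_cls)
    except KeyError:
        raise ValueError(f"Unhandled convert_type: {convert_type!r}") from None

class DummyForCausalLM:
    pass
```

The table-plus-fallthrough shape keeps each conversion a one-line entry while preserving the exhaustiveness guarantee of the `if/elif/assert_never` chain in the diff.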


@@ -253,8 +253,10 @@ class HybridAttentionMambaModelConfig(VerifyAndUpdateConfig):
dtype=kv_cache_dtype,
use_mla=model_config.use_mla).page_size_bytes
model_cls = ModelRegistry.resolve_model_cls(
model_config._model_info.architecture)[0]
model_cls, _ = ModelRegistry.resolve_model_cls(
model_config.architecture,
model_config=model_config,
)
# get mamba page size
mamba_page_size = MambaSpec(


@@ -12,19 +12,24 @@ import sys
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Set
from dataclasses import asdict, dataclass, field
from dataclasses import dataclass, field
from functools import lru_cache
from typing import Callable, Optional, TypeVar, Union
import torch.nn as nn
import transformers
from vllm.config import (ModelConfig, ModelImpl, iter_architecture_defaults,
try_match_architecture_defaults)
from vllm.logger import init_logger
from vllm.transformers_utils.dynamic_module import (
try_get_class_from_dynamic_module)
from .interfaces import (has_inner_state, has_noops, is_attention_free,
is_hybrid, supports_cross_encoding,
supports_multimodal, supports_multimodal_raw_input,
supports_pp, supports_transcription, supports_v0_only)
from .interfaces_base import is_text_generation_model
from .interfaces_base import is_pooling_model, is_text_generation_model
logger = init_logger(__name__)
@@ -311,7 +316,7 @@ class _ModelInfo:
return _ModelInfo(
architecture=model.__name__,
is_text_generation_model=is_text_generation_model(model),
is_pooling_model=True, # Can convert any model into a pooling model
is_pooling_model=is_pooling_model(model),
supports_cross_encoding=supports_cross_encoding(model),
supports_multimodal=supports_multimodal(model),
supports_multimodal_raw_input=supports_multimodal_raw_input(model),
@@ -465,6 +470,16 @@ class _ModelRegistry:
f"Model architectures {architectures} failed "
"to be inspected. Please check the logs for more details.")
for arch in architectures:
if arch in _PREVIOUSLY_SUPPORTED_MODELS:
previous_version = _PREVIOUSLY_SUPPORTED_MODELS[arch]
raise ValueError(
f"Model architecture {arch} was supported in vLLM until "
f"v{previous_version}, and is not supported anymore. "
"Please use an older version of vLLM if you want to "
"use this model architecture.")
raise ValueError(
f"Model architectures {architectures} are not supported for now. "
f"Supported architectures: {all_supported_archs}")
@@ -477,174 +492,284 @@ class _ModelRegistry:
return _try_load_model_cls(model_arch, self.models[model_arch])
def _try_inspect_model_cls(self, model_arch: str) -> Optional[_ModelInfo]:
if model_arch in self.models:
return _try_inspect_model_cls(model_arch, self.models[model_arch])
if model_arch not in self.models:
return None
if model_arch.endswith("ForSequenceClassification"):
causal_lm_arch = model_arch.replace("ForSequenceClassification",
"ForCausalLM")
if causal_lm_arch not in self.models:
return _try_inspect_model_cls(model_arch, self.models[model_arch])
def _try_resolve_transformers(
self,
architecture: str,
model_config: ModelConfig,
) -> Optional[str]:
if architecture in _TRANSFORMERS_BACKEND_MODELS:
return architecture
auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
None) or dict()
# Make sure that config class is always initialized before model class,
# otherwise the model class won't be able to access the config class,
# the expected auto_map should have correct order like:
# "auto_map": {
# "AutoConfig": "<your-repo-name>--<config-name>",
# "AutoModel": "<your-repo-name>--<config-name>",
# "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
# },
for prefix in ("AutoConfig", "AutoModel"):
for name, module in auto_map.items():
if name.startswith(prefix):
try_get_class_from_dynamic_module(
module,
model_config.model,
revision=model_config.revision,
warn_on_fail=False,
)
model_module = getattr(transformers, architecture, None)
if model_module is None:
for name, module in auto_map.items():
if name.startswith("AutoModel"):
model_module = try_get_class_from_dynamic_module(
module,
model_config.model,
revision=model_config.revision,
warn_on_fail=True,
)
if model_module is not None:
break
else:
if model_config.model_impl != ModelImpl.TRANSFORMERS:
return None
raise ValueError(
f"Cannot find model module. {architecture!r} is not a "
"registered model in the Transformers library (only "
"relevant if the model is meant to be in Transformers) "
"and 'AutoModel' is not present in the model config's "
"'auto_map' (relevant if the model is custom).")
if not model_module.is_backend_compatible():
if model_config.model_impl != ModelImpl.TRANSFORMERS:
return None
info = _try_inspect_model_cls(causal_lm_arch,
self.models[causal_lm_arch])
raise ValueError(
f"The Transformers implementation of {architecture!r} "
"is not compatible with vLLM.")
info = _ModelInfo(**dict(
asdict(info), **{
"architecture": model_arch,
"supports_cross_encoding": True
}))
return info
return model_config._get_transformers_backend_cls()
return None
def _normalize_arch(
self,
architecture: str,
model_config: ModelConfig,
) -> str:
if architecture in self.models:
return architecture
# This may be called in order to resolve runner_type and convert_type
# in the first place, in which case we consider the default match
match = try_match_architecture_defaults(
architecture,
runner_type=getattr(model_config, "runner_type", None),
convert_type=getattr(model_config, "convert_type", None),
)
if match:
suffix, _ = match
# Get the name of the base model to convert
for repl_suffix, _ in iter_architecture_defaults():
base_arch = architecture.replace(suffix, repl_suffix)
if base_arch in self.models:
return base_arch
return architecture
def _normalize_archs(
self,
architectures: Union[str, list[str]],
architectures: list[str],
model_config: ModelConfig,
) -> list[str]:
if isinstance(architectures, str):
architectures = [architectures]
if not architectures:
logger.warning("No model architectures are specified")
# filter out support architectures
normalized_arch = list(
filter(lambda model: model in self.models, architectures))
# try automatic conversion in adapters.py
for arch in architectures:
if not arch.endswith("ForSequenceClassification"):
continue
causal_lm_arch = arch.replace("ForSequenceClassification",
"ForCausalLM")
if causal_lm_arch in self.models:
normalized_arch.append(arch)
# NOTE(Isotr0py): Be careful of architectures' order!
# Make sure Transformers backend architecture is at the end of the
# list, otherwise pooling models automatic conversion will fail!
for arch in normalized_arch:
if arch.startswith("TransformersFor"):
normalized_arch.remove(arch)
normalized_arch.append(arch)
return normalized_arch
return [
self._normalize_arch(arch, model_config) for arch in architectures
]
def inspect_model_cls(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> tuple[_ModelInfo, str]:
architectures = self._normalize_archs(architectures)
if isinstance(architectures, str):
architectures = [architectures]
for arch in architectures:
model_info = self._try_inspect_model_cls(arch)
normalized_archs = self._normalize_archs(architectures, model_config)
# Require transformers impl
if model_config.model_impl == ModelImpl.TRANSFORMERS:
arch = self._try_resolve_transformers(architectures[0],
model_config)
if arch is not None:
model_info = self._try_inspect_model_cls(arch)
if model_info is not None:
return (model_info, arch)
for arch, normalized_arch in zip(architectures, normalized_archs):
model_info = self._try_inspect_model_cls(normalized_arch)
if model_info is not None:
return (model_info, arch)
# Fallback to transformers impl
if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
arch = self._try_resolve_transformers(architectures[0],
model_config)
if arch is not None:
model_info = self._try_inspect_model_cls(arch)
if model_info is not None:
return (model_info, arch)
return self._raise_for_unsupported(architectures)
def resolve_model_cls(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> tuple[type[nn.Module], str]:
architectures = self._normalize_archs(architectures)
if isinstance(architectures, str):
architectures = [architectures]
for arch in architectures:
model_cls = self._try_load_model_cls(arch)
normalized_archs = self._normalize_archs(architectures, model_config)
# Require transformers impl
if model_config.model_impl == ModelImpl.TRANSFORMERS:
arch = self._try_resolve_transformers(architectures[0],
model_config)
if arch is not None:
model_cls = self._try_load_model_cls(arch)
if model_cls is not None:
return (model_cls, arch)
for arch, normalized_arch in zip(architectures, normalized_archs):
model_cls = self._try_load_model_cls(normalized_arch)
if model_cls is not None:
return (model_cls, arch)
# Fallback to transformers impl
if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
arch = self._try_resolve_transformers(architectures[0],
model_config)
if arch is not None:
model_cls = self._try_load_model_cls(arch)
if model_cls is not None:
return (model_cls, arch)
return self._raise_for_unsupported(architectures)
def is_text_generation_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.is_text_generation_model
def is_pooling_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.is_pooling_model
def is_cross_encoder_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_cross_encoding
def is_multimodal_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_multimodal
def supports_multimodal_raw_input(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_multimodal_raw_input
def is_pp_supported_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_pp
def model_has_inner_state(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.has_inner_state
def is_attention_free_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.is_attention_free
def is_hybrid_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.is_hybrid
def is_noops_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.has_noops
def is_transcription_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_transcription
def is_transcription_only_model(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return model_cls.supports_transcription_only
def is_v1_compatible(
self,
architectures: Union[str, list[str]],
model_config: ModelConfig,
) -> bool:
model_cls, _ = self.inspect_model_cls(architectures)
model_cls, _ = self.inspect_model_cls(architectures, model_config)
return not model_cls.supports_v0_only
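The new `_normalize_arch` in this registry diff resolves an unregistered architecture name (e.g. `*ForSequenceClassification`) to a registered base architecture by swapping recognized suffixes. A self-contained sketch of that suffix rewriting, with an illustrative suffix table rather than vLLM's actual `iter_architecture_defaults`:

```python
# Illustrative suffix table; vLLM derives the real one from its
# architecture defaults (runner_type / convert_type matching).
_SUFFIXES = ["ForCausalLM", "ForSequenceClassification", "ForRewardModeling"]

def normalize_arch(architecture: str, registered: set[str]) -> str:
    """Return a registered base architecture reachable by swapping the
    recognized suffix; otherwise return the input unchanged."""
    if architecture in registered:
        return architecture
    for suffix in _SUFFIXES:
        if architecture.endswith(suffix):
            stem = architecture[: -len(suffix)]
            # Try every replacement suffix against the registry.
            for repl in _SUFFIXES:
                candidate = stem + repl
                if candidate in registered:
                    return candidate
    return architecture
```

Returning the input unchanged on a miss matters: `inspect_model_cls` and `resolve_model_cls` zip the original and normalized lists together, so an unresolvable name still flows through to `_raise_for_unsupported` with its original spelling.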


@@ -0,0 +1,60 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import os
from typing import Optional, Union
from transformers.dynamic_module_utils import get_class_from_dynamic_module
import vllm.envs as envs
from vllm.logger import init_logger
logger = init_logger(__name__)
def try_get_class_from_dynamic_module(
class_reference: str,
pretrained_model_name_or_path: str,
cache_dir: Optional[Union[str, os.PathLike]] = None,
force_download: bool = False,
resume_download: Optional[bool] = None,
proxies: Optional[dict[str, str]] = None,
token: Optional[Union[bool, str]] = None,
revision: Optional[str] = None,
local_files_only: bool = False,
repo_type: Optional[str] = None,
code_revision: Optional[str] = None,
warn_on_fail: bool = True,
**kwargs,
) -> Optional[type]:
"""
As [transformers.dynamic_module_utils.get_class_from_dynamic_module][],
but ignoring any errors.
"""
try:
return get_class_from_dynamic_module(
class_reference,
pretrained_model_name_or_path,
cache_dir=cache_dir,
force_download=force_download,
resume_download=resume_download,
proxies=proxies,
token=token,
revision=revision,
local_files_only=local_files_only,
repo_type=repo_type,
code_revision=code_revision,
**kwargs,
)
except Exception:
location = "ModelScope" if envs.VLLM_USE_MODELSCOPE else "HF Hub"
if warn_on_fail:
logger.warning(
"Unable to load %s from %s on %s.",
class_reference,
pretrained_model_name_or_path,
location,
exc_info=True,
)
return None
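The new helper above is a thin "return `None` instead of raising" wrapper around `get_class_from_dynamic_module`, with an opt-out warning. The same pattern, shown generically with stdlib `importlib` so it is runnable offline (the module/class names below are just examples, not vLLM identifiers):

```python
import importlib
import logging
from typing import Optional

logger = logging.getLogger(__name__)

def try_get_class(module_name: str, class_name: str,
                  warn_on_fail: bool = True) -> Optional[type]:
    """Load a class by module path, returning None instead of
    propagating ImportError/AttributeError (mirrors the wrapper's
    broad `except Exception` plus optional warning)."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except Exception:
        if warn_on_fail:
            logger.warning("Unable to load %s.%s", module_name,
                           class_name, exc_info=True)
        return None
```

The caller in `_try_resolve_transformers` relies on exactly this contract: it probes several `auto_map` entries and treats `None` as "keep looking" rather than a hard failure.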


@@ -3,6 +3,8 @@
from typing import Optional
from typing_extensions import assert_never
from vllm.config import LoRAConfig, ModelConfig, SchedulerConfig
from vllm.lora.request import LoRARequest
from vllm.transformers_utils.tokenizer import (AnyTokenizer, encode_tokens,
@@ -108,6 +110,14 @@ class TokenizerGroup:
def init_tokenizer_from_configs(model_config: ModelConfig,
scheduler_config: SchedulerConfig,
lora_config: Optional[LoRAConfig]):
runner_type = model_config.runner_type
if runner_type == "generate" or runner_type == "draft":
truncation_side = "left"
elif runner_type == "pooling":
truncation_side = "right"
else:
assert_never(runner_type)
return TokenizerGroup(
tokenizer_id=model_config.tokenizer,
enable_lora=bool(lora_config),
@@ -117,4 +127,4 @@ def init_tokenizer_from_configs(model_config: ModelConfig,
tokenizer_mode=model_config.tokenizer_mode,
trust_remote_code=model_config.trust_remote_code,
revision=model_config.tokenizer_revision,
truncation_side=model_config.truncation_side)
truncation_side=truncation_side)
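This hunk stops reading `truncation_side` off `ModelConfig` and derives it from `runner_type` instead: generation-style runners keep the rightmost (most recent) context, pooling runners keep the left. A minimal sketch of that mapping (using an explicit `ValueError` in place of `assert_never` so it runs on any Python version):

```python
def truncation_side_for(runner_type: str) -> str:
    """Generate/draft runners truncate from the left so the most recent
    tokens survive; pooling runners truncate from the right to keep the
    start of the input. Unknown values fail loudly, like assert_never."""
    if runner_type in ("generate", "draft"):
        return "left"
    if runner_type == "pooling":
        return "right"
    raise ValueError(f"Unhandled runner_type: {runner_type!r}")
```

Centralizing this in `init_tokenizer_from_configs` means the deprecated per-model `truncation_side` setting no longer needs to exist at all.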


@@ -127,8 +127,8 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
self.is_multimodal_model = model_config.is_multimodal_model
self.is_pooling_model = model_config.pooler_config is not None
self.is_encoder_only_model = False
self.model_supports_multimodal_raw_input = (
model_config.model_supports_multimodal_raw_input)
self.is_multimodal_raw_input_supported = (
model_config.is_multimodal_raw_input_supported)
self.max_model_len = model_config.max_model_len
self.max_num_tokens = scheduler_config.max_num_batched_tokens
self.max_num_reqs = scheduler_config.max_num_seqs
@@ -583,7 +583,7 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
) -> dict[str, Any]:
model_kwargs: dict[str, Any] = {}
if self.model_supports_multimodal_raw_input:
if self.is_multimodal_raw_input_supported:
# This model requires the raw multimodal data in input.
if scheduler_output:
multi_modal_kwargs_list = []