mirror of https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-03-16 13:07:16 +08:00

[Deprecation][2/N] Replace --task with --runner and --convert (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

This commit is contained in:
parent 8f605ee309
commit 86ae693f20
@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
   --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 ```
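Once the server above is up, a chat-completions request with images can be built like this. This is an illustrative sketch only: the image URLs are placeholders, and the request body follows the OpenAI chat format the server accepts.

```python
# Illustrative chat-completions request body for the server started above.
# The image URLs are placeholders, not real assets.
payload = {
    "model": "microsoft/Phi-3.5-vision-instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe these images."},
                {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/b.png"}},
            ],
        }
    ],
}

# The server was launched with --limit-mm-per-prompt '{"image":2}',
# so a prompt may carry at most two images.
n_images = sum(
    1 for part in payload["messages"][0]["content"] if part["type"] == "image_url"
)
assert n_images <= 2
print(n_images)  # → 2
```

Sending this payload to `http://localhost:8000/v1/chat/completions` (the default port) with any OpenAI-compatible client would exercise the multimodal path.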
@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
 ```

 Then, you can use the OpenAI client as follows:
@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
   --max-model-len 4096 --enable-prompt-embeds
 ```
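The hunk above concerns prompt embeddings, which the docs describe as base64-encoded torch tensors. As a rough sketch of the encode/decode idea only (vLLM's actual wire format is a base64-encoded serialized torch tensor, not this raw-struct layout), a float vector survives a base64 round-trip like so:

```python
import base64
import struct

# Generic base64 round-trip for a float vector. NOTE: this raw-struct layout
# is illustrative only; vLLM's prompt-embeds API expects a base64-encoded
# torch tensor, not this format.
vec = [0.25, -1.5, 3.0]  # values chosen to be exactly representable in float32
raw = struct.pack(f"{len(vec)}f", *vec)
encoded = base64.b64encode(raw).decode("ascii")

decoded = list(struct.unpack(f"{len(vec)}f", base64.b64decode(encoded)))
print(decoded)  # → [0.25, -1.5, 3.0]
```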
@@ -2,12 +2,19 @@

 vLLM provides first-class support for generative models, which covers most of LLMs.

 In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.

-For generative models, the only supported `--task` option is `"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
+## Configuration
+
+### Model Runner (`--runner`)
+
+Run a model in generation mode via the option `--runner generate`.
+
+!!! tip
+    There is no need to set this option in the vast majority of cases, as vLLM can automatically
+    detect the model runner to use via `--runner auto`.

 ## Offline Inference
@@ -1,9 +1,9 @@
 # Pooling Models

-vLLM also supports pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, such as embedding, classification and reward models.

 In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
 before returning them.

 !!! note
@@ -11,18 +11,39 @@ before returning them.

 As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
 pooling models, as they only work on the generation or decode stage, so performance may not improve as much.

-If the model doesn't implement this interface, you can set `--task` which tells vLLM
-to convert the model into a pooling model.
+## Configuration

-| `--task`   | Model type           | Supported pooling tasks       |
-|------------|----------------------|-------------------------------|
-| `embed`    | Embedding model      | `encode`, `embed`             |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward`   | Reward model         | `encode`                      |
+### Model Runner

-## Pooling Tasks
+Run a model in pooling mode via the option `--runner pooling`.

-In vLLM, we define the following pooling tasks and corresponding APIs:
+!!! tip
+    There is no need to set this option in the vast majority of cases, as vLLM can automatically
+    detect the model runner to use via `--runner auto`.
+
+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture                                     | `--convert` | Supported pooling tasks       |
+|--------------------------------------------------|-------------|-------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`  | `embed`     | `encode`, `embed`             |
+| `*For*Classification`, `*ClassificationModel`    | `classify`  | `encode`, `classify`, `score` |
+| `*ForRewardModeling`, `*RewardModel`             | `reward`    | `encode`                      |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.
+
+### Pooling Tasks
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs:

 | Task       | APIs               |
 |------------|--------------------|
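The architecture-name matching documented in the conversion table above can be sketched with glob patterns. This mirrors the documented patterns only, not vLLM's actual implementation; note that the more specific `classify`/`reward` patterns must be tried before the catch-all `*Model`:

```python
from fnmatch import fnmatch

# Glob patterns from the --convert table above. Order matters: `*Model`
# would also match `*RewardModel` and `*ClassificationModel`, so the
# more specific pattern groups are checked first.
CONVERT_PATTERNS = [
    ("classify", ["*For*Classification", "*ClassificationModel"]),
    ("reward", ["*ForRewardModeling", "*RewardModel"]),
    ("embed", ["*ForTextEncoding", "*EmbeddingModel", "*Model"]),
]

def infer_convert_type(architecture):
    """Return the --convert type implied by an architecture name, or None."""
    for convert_type, patterns in CONVERT_PATTERNS:
        if any(fnmatch(architecture, p) for p in patterns):
            return convert_type
    return None

print(infer_convert_type("BertForSequenceClassification"))  # → classify
print(infer_convert_type("Qwen2ForRewardModel"))            # → reward
print(infer_convert_type("MistralModel"))                   # → embed
```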
@@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
 | `classify` | `classify`         |
 | `score`    | `score`            |

-\*The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to the `embed` task if the model does not support the `score` task.

-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+### Pooler Configuration

-By default, the pooler assigned to each task has the following attributes:
+#### Predefined models
+
+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--override-pooler-config` option.
+
+#### Converted models
+
+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:

 | Task       | Pooling Type | Normalization | Softmax |
 |------------|--------------|---------------|---------|
@@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
 | `embed`    | `LAST`       | ✅︎            | ❌      |
 | `classify` | `LAST`       | ❌            | ✅︎      |

-These defaults may be overridden by the model's implementation in vLLM.
-
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
-which takes priority over the model's defaults.
+its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

 You can further customize this via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.

-!!! note
-    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
-    that is not based on [PoolerConfig][vllm.config.PoolerConfig].
-
 ## Offline Inference

 The [LLM][vllm.LLM] class provides various methods for offline inference.
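The defaults table above says that for converted models the `embed` pooler normalizes the pooled hidden state while the `classify` pooler applies softmax. A minimal sketch of those two post-processing steps (pure Python, placeholder values, not vLLM's implementation):

```python
import math

# Post-processing defaults from the table above for converted models:
# `embed` → L2-normalize the pooled hidden state; `classify` → softmax it.
def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(v):
    m = max(v)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

last_hidden = [1.0, 2.0, 2.0]  # placeholder pooled hidden state
embedding = l2_normalize(last_hidden)  # default for `embed`
probs = softmax(last_hidden)           # default for `classify`
print(embedding)  # → [0.333..., 0.666..., 0.666...], unit length
print(sum(probs)) # sums to 1.0 (within float error)
```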
@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode

 ```python
 from vllm import LLM

-llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
 (output,) = llm.encode("Hello, my name is")

 data = output.outputs.data
@@ -85,7 +106,7 @@ It is primarily designed for embedding models.

 ```python
 from vllm import LLM

-llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
+llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
 (output,) = llm.embed("Hello, my name is")

 embeds = output.outputs.embedding
@@ -102,7 +123,7 @@ It is primarily designed for classification models.

 ```python
 from vllm import LLM

-llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
 (output,) = llm.classify("Hello, my name is")

 probs = output.outputs.probs
@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u

 ```python
 from vllm import LLM

-llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
 (output,) = llm.score("What is the capital of France?",
                       "The capital of Brazil is Brasilia.")
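As noted earlier, the `score` API falls back to the `embed` task for models that only produce embeddings; in that case the score is computed from the two texts' embeddings rather than by a cross-encoder. A sketch of that fallback using cosine similarity with placeholder embedding values:

```python
import math

# Illustrative fallback: when a model supports only `embed`, a score for a
# text pair can be computed as cosine similarity of the two embeddings.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query_emb = [0.1, 0.9, 0.4]  # placeholder embedding of the query
doc_emb = [0.2, 0.8, 0.5]    # placeholder embedding of the document
score = cosine(query_emb, doc_emb)
print(round(score, 3))
```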
@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka

 from vllm import LLM, PoolingParams

 llm = LLM(model="jinaai/jina-embeddings-v3",
-          task="embed",
+          runner="pooling",
           trust_remote_code=True)
 outputs = llm.embed(["Follow the white rabbit."],
                     pooling_params=PoolingParams(dimensions=32))
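Conceptually, Matryoshka embeddings allow the output to be shortened by keeping only the first N dimensions and re-normalizing, which is what `PoolingParams(dimensions=32)` above requests from the model. A sketch of that truncate-and-renormalize step with placeholder values:

```python
import math

# Sketch of Matryoshka-style shortening: keep the first `dims` components,
# then re-normalize to unit length. Placeholder values, not model output.
def shorten(embedding, dims):
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.1, 0.1]  # placeholder full-size embedding
short = shorten(full, 4)
print(len(short))  # → 4
```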
@@ -1,7 +1,6 @@
 # Supported Models

 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
-If a model supports more than one task, you can set the task via the `--task` argument.

 For each task, we list the model architectures that have been implemented in vLLM.
 Alongside each architecture, we include some popular models that use it.
@@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:

 ```python
 from vllm import LLM
-llm = LLM(model=..., task="generate")  # Name or path of your model
+llm = LLM(model=...)  # Name or path of your model
 llm.apply_model(lambda model: print(type(model)))
 ```
@@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
 ```python
 from vllm import LLM

-# For generative models (task=generate) only
-llm = LLM(model=..., task="generate")  # Name or path of your model
+# For generative models (runner=generate) only
+llm = LLM(model=..., runner="generate")  # Name or path of your model
 output = llm.generate("Hello, my name is")
 print(output)

-# For pooling models (task={embed,classify,reward,score}) only
-llm = LLM(model=..., task="embed")  # Name or path of your model
+# For pooling models (runner=pooling) only
+llm = LLM(model=..., runner="pooling")  # Name or path of your model
 output = llm.encode("Hello, my name is")
 print(output)
 ```
@@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.

 ```python
 from vllm import LLM

-llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
+llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)

-# For generative models (task=generate) only
+# For generative models (runner=generate) only
 output = llm.generate("Hello, my name is")
 print(output)

-# For pooling models (task={embed,classify,reward,score}) only
+# For pooling models (runner=pooling) only
 output = llm.encode("Hello, my name is")
 print(output)
 ```
@@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat

 #### Text Generation

-Specified using `--task generate`.
-
 <style>
 th {
   white-space: nowrap;
@@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling

 !!! important
     Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+    you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.

 #### Text Embedding

-Specified using `--task embed`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
-| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
+| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
+| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
 | `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
-| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
-| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
-| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
-| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
-| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
+| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
+| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
+| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
+| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
+
+<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 !!! note
     `ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
@@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding

 #### Reward Modeling

-Specified using `--task reward`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
+
+<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 If your model is not in the above list, we will try to automatically convert the model using
 [as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
@@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the

 #### Classification

-Specified using `--task classify`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
 | `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 If your model is not in the above list, we will try to automatically convert the model using
 [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

 #### Sentence Pair Scoring

-Specified using `--task score`.
-
-| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
-|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
-| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
-| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
-| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
+| Architecture | Models | Example HF Models | [V1](gh-issue:8779) |
+|--------------|--------|-------------------|---------------------|
+| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
+| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
+| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
+| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
+| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
+| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 !!! note
     Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
@@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat

 #### Text Generation

-Specified using `--task generate`.
-
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
@@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th

 #### Transcription

-Specified using `--task transcription`.
-
 Speech2Text models trained specifically for Automatic Speech Recognition.

 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
@@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling

 !!! important
     Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+    you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.

 #### Text Embedding

-Specified using `--task embed`.
-
-Any text generation model can be converted into an embedding model by passing `--task embed`.
-
 !!! note
     To get the best results, you should use pooling models that are specifically trained as such.
@@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM.

 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
-| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
+| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
+
+<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 ---

 #### Scoring

-Specified using `--task score`.
-
 | Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
 |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
 | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
+
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.

 ## Model Support Policy

 At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
@@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
 We currently support the following OpenAI APIs:

 - [Completions API][completions-api] (`/v1/completions`)
-  - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
+  - Only applicable to [text generation models](../models/generative_models.md).
   - *Note: `suffix` parameter is not supported.*
 - [Chat Completions API][chat-api] (`/v1/chat/completions`)
-  - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
+  - Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
   - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
 - [Embeddings API][embeddings-api] (`/v1/embeddings`)
-  - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
+  - Only applicable to [embedding models](../models/pooling_models.md).
 - [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
-  - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+  - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
 - [Translation API][translations-api] (`/v1/audio/translations`)
-  - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+  - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).

 In addition, we have the following custom APIs:
@@ -64,14 +64,14 @@ In addition, we have the following custom APIs:

 - [Pooling API][pooling-api] (`/pooling`)
   - Applicable to all [pooling models](../models/pooling_models.md).
 - [Classification API][classification-api] (`/classify`)
-  - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
+  - Only applicable to [classification models](../models/pooling_models.md).
 - [Score API][score-api] (`/score`)
-  - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
+  - Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
 - [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
   - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
   - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
   - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
-  - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+  - Only applicable to [cross-encoder models](../models/pooling_models.md).

 [](){ #chat-template }
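The re-rank endpoint above implements Jina AI's v1 re-rank API. As a rough sketch of the request shape (field names follow Jina's public API; the model name and documents are placeholders), a client would POST something like:

```python
import json

# Illustrative Jina-style re-rank request body for the /rerank endpoint.
# Model name and documents are placeholders, not endorsed values.
request_body = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is the capital of France?",
    "documents": [
        "The capital of Brazil is Brasilia.",
        "Paris is the capital of France.",
    ],
}

# The body must serialize cleanly to JSON for the HTTP request.
encoded = json.dumps(request_body)
print(len(request_body["documents"]))  # → 2
```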
@@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:

 ```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
   --trust-remote-code \
   --max-model-len 4096 \
   --chat-template examples/template_vlm2vec.jinja
 ```

 !!! important
-    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
     to run this model in embedding mode instead of text generation mode.

     The custom chat template is completely different from the original one for this model,
@@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:

 ```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
   --trust-remote-code \
   --max-model-len 8192 \
   --chat-template examples/template_dse_qwen2_vl.jinja
 ```

 !!! important
-    Like with VLM2Vec, we have to explicitly pass `--task embed`.
+    Like with VLM2Vec, we have to explicitly pass `--runner pooling`.

     Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
     by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True
+        model="jason9693/Qwen2.5-1.5B-apeach",
+        runner="pooling",
+        enforce_eager=True,
     )
     return parser.parse_args()
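The `parser.set_defaults(...)` pattern used in the example scripts above lets the example pin model-specific defaults while still allowing command-line flags to override them. A self-contained illustration with stdlib `argparse` (the argument names mirror the script, but this is a sketch, not the script itself):

```python
from argparse import ArgumentParser

# Minimal illustration of the set_defaults pattern used by the example script:
# example-specific defaults apply when no flag is given, and explicit CLI
# flags still take precedence over them.
parser = ArgumentParser()
parser.add_argument("--model")
parser.add_argument("--runner")
parser.set_defaults(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")

args = parser.parse_args([])  # no flags: the defaults apply
print(args.runner)  # → pooling

args = parser.parse_args(["--runner", "generate"])  # flag overrides the default
print(args.runner)  # → generate
```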
@@ -27,7 +29,7 @@ def main(args: Namespace):
     ]

     # Create an LLM.
-    # You should pass task="classify" for classification models
+    # You should pass runner="pooling" for classification models
     llm = LLM(**vars(args))

     # Generate logits. The output is a list of ClassificationRequestOutputs.
@@ -13,7 +13,7 @@ def parse_args():
     # Set example specific arguments
     parser.set_defaults(
         model="intfloat/e5-mistral-7b-instruct",
-        task="embed",
+        runner="pooling",
         enforce_eager=True,
         max_model_len=1024,
     )
@@ -30,7 +30,7 @@ def main(args: Namespace):
     ]

     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))

     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
@ -12,7 +12,9 @@ def parse_args():
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
        model="BAAI/bge-reranker-v2-m3",
        runner="pooling",
        enforce_eager=True,
    )
    return parser.parse_args()

@ -26,7 +28,7 @@ def main(args: Namespace):
    ]

    # Create an LLM.
    # You should pass task="score" for cross-encoder models
    # You should pass runner="pooling" for cross-encoder models
    llm = LLM(**vars(args))

    # Generate scores. The output is a list of ScoringRequestOutputs.
@ -12,7 +12,9 @@ def parse_args():
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
        model="jinaai/jina-embeddings-v3",
        runner="pooling",
        trust_remote_code=True,
    )
    return parser.parse_args()

@ -29,7 +31,7 @@ def main(args: Namespace):
    ]

    # Create an LLM.
    # You should pass task="embed" for embedding models
    # You should pass runner="pooling" for embedding models
    llm = LLM(**vars(args))

    # Generate embedding. The output is a list of EmbeddingRequestOutputs.

@ -12,7 +12,9 @@ def parse_args():
    parser = EngineArgs.add_cli_args(parser)
    # Set example specific arguments
    parser.set_defaults(
        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
        model="jinaai/jina-embeddings-v3",
        runner="pooling",
        trust_remote_code=True,
    )
    return parser.parse_args()

@ -29,7 +31,7 @@ def main(args: Namespace):
    ]

    # Create an LLM.
    # You should pass task="embed" for embedding models
    # You should pass runner="pooling" for embedding models
    llm = LLM(**vars(args))

    # Generate embedding. The output is a list of EmbeddingRequestOutputs.
@ -17,7 +17,7 @@ model_name = "Qwen/Qwen3-Reranker-0.6B"
# Models converted offline using this method can not only be more efficient
# and support the vllm score API, but also make the init parameters more
# concise, for example.
# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")
# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", runner="pooling")

# If you want to load the official original version, the init parameters are
# as follows.

@ -27,7 +27,7 @@ def get_llm() -> LLM:
    """Initializes and returns the LLM model for Qwen3-Reranker."""
    return LLM(
        model=model_name,
        task="score",
        runner="pooling",
        hf_overrides={
            "architectures": ["Qwen3ForSequenceClassification"],
            "classifier_from_token": ["no", "yes"],
@ -70,7 +70,7 @@ def run_e5_v(query: Query) -> ModelRequestData:

    engine_args = EngineArgs(
        model="royokong/e5-v",
        task="embed",
        runner="pooling",
        max_model_len=4096,
        limit_mm_per_prompt={"image": 1},
    )

@ -102,7 +102,7 @@ def run_vlm2vec(query: Query) -> ModelRequestData:

    engine_args = EngineArgs(
        model="TIGER-Lab/VLM2Vec-Full",
        task="embed",
        runner="pooling",
        max_model_len=4096,
        trust_remote_code=True,
        mm_processor_kwargs={"num_crops": 4},

@ -122,7 +122,7 @@ def run_jinavl_reranker(query: Query) -> ModelRequestData:

    engine_args = EngineArgs(
        model="jinaai/jina-reranker-m0",
        task="score",
        runner="pooling",
        max_model_len=32768,
        trust_remote_code=True,
        mm_processor_kwargs={
@ -9,7 +9,7 @@ Launch the vLLM server with the following command:
vllm serve llava-hf/llava-1.5-7b-hf

(multi-image inference with Phi-3.5-vision-instruct)
vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
    --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'

(audio inference with Ultravox)

@ -92,7 +92,7 @@ def dse_qwen2_vl(inp: dict):
def parse_args():
    parser = argparse.ArgumentParser(
        "Script to call a specified VLM through the API. Make sure to serve "
        "the model with --task embed before running this."
        "the model with `--runner pooling` before running this."
    )
    parser.add_argument(
        "--model",
@ -3,7 +3,7 @@
"""
Example online usage of Score API.

Run `vllm serve <model> --task score` to start up the server in vLLM.
Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
"""

import argparse

@ -3,7 +3,7 @@
"""
Example online usage of Score API.

Run `vllm serve <model> --task score` to start up the server in vLLM.
Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
"""

import argparse

@ -3,7 +3,7 @@
"""
Example online usage of Pooling API.

Run `vllm serve <model> --task <embed|classify|reward|score>`
Run `vllm serve <model> --runner pooling`
to start up the server in vLLM.
"""
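The docstring updates above all follow the same substitution pattern. As an illustrative aid only (this is not vLLM's actual deprecation shim, and the `migrate_task_flag` helper is hypothetical), the legacy `--task` values seen in this commit map onto the new flags roughly like this:

```python
# Illustrative mapping from legacy `--task` values to the replacement flags
# used throughout this commit. Hypothetical helper, not part of vLLM.
LEGACY_TASK_FLAGS = {
    "generate": ["--runner", "generate"],
    "embed": ["--runner", "pooling"],
    "classify": ["--runner", "pooling"],
    "reward": ["--runner", "pooling"],
    "score": ["--runner", "pooling"],
}


def migrate_task_flag(argv: list[str]) -> list[str]:
    """Rewrite `--task <value>` into the replacement flags; leave the rest."""
    out, i = [], 0
    while i < len(argv):
        if argv[i] == "--task" and i + 1 < len(argv):
            out.extend(LEGACY_TASK_FLAGS.get(argv[i + 1], ["--runner", "auto"]))
            i += 2
        else:
            out.append(argv[i])
            i += 1
    return out
```

For example, `migrate_task_flag(["serve", "m", "--task", "score"])` yields `["serve", "m", "--runner", "pooling"]`, matching the edits to the Score API docstrings above.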
@ -10,7 +10,7 @@ This script demonstrates how to:

Run the vLLM server first:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
    --task generate \
    --runner generate \
    --max-model-len 4096 \
    --enable-prompt-embeds
@ -148,9 +148,6 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int,
    # in the vllm_config, it's not really used.
    model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
    vllm_config.model_config = ModelConfig(model=model_name,
                                           task="auto",
                                           tokenizer=model_name,
                                           tokenizer_mode="auto",
                                           trust_remote_code=True,
                                           dtype=dtype,
                                           seed=42)
@ -62,8 +62,8 @@ class TestSetting:
    TestSetting(
        model="BAAI/bge-multilingual-gemma2",
        model_args=[
            "--task", "embed", "--dtype", "bfloat16", "--max-model-len",
            "2048"
            "--runner", "pooling", "--dtype", "bfloat16",
            "--max-model-len", "2048"
        ],
        pp_size=1,
        tp_size=1,

@ -75,7 +75,7 @@ class TestSetting:
    # # encoder-based embedding model (BERT)
    # TestSetting(
    #     model="BAAI/bge-base-en-v1.5",
    #     model_args=["--task", "embed"],
    #     model_args=["--runner", "pooling"],
    #     pp_size=1,
    #     tp_size=1,
    #     attn_backend="XFORMERS",
@ -125,9 +125,6 @@ def all_reduce_fusion_pass_on_test_model(local_rank: int, world_size: int,
    # in the vllm_config, it's not really used.
    model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
    vllm_config.model_config = ModelConfig(model=model_name,
                                           task="auto",
                                           tokenizer=model_name,
                                           tokenizer_mode="auto",
                                           trust_remote_code=True,
                                           dtype=dtype,
                                           seed=42)

@ -250,9 +250,6 @@ def sequence_parallelism_pass_on_test_model(
    # in the vllm_config, it's not really used.
    model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
    vllm_config.model_config = ModelConfig(model=model_name,
                                           task="auto",
                                           tokenizer=model_name,
                                           tokenizer_mode="auto",
                                           trust_remote_code=True,
                                           dtype=dtype,
                                           seed=42)
@ -23,7 +23,7 @@ from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.config import TaskOption, _get_and_verify_dtype
from vllm.config import ConvertOption, RunnerOption, _get_and_verify_dtype
from vllm.connections import global_http_connection
from vllm.distributed import (cleanup_dist_env_and_memory,
                              init_distributed_environment,

@ -769,7 +769,8 @@ class VllmRunner:
    def __init__(
        self,
        model_name: str,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer_name: Optional[str] = None,
        tokenizer_mode: str = "auto",
        trust_remote_code: bool = True,

@ -786,7 +787,8 @@ class VllmRunner:
    ) -> None:
        self.llm = LLM(
            model=model_name,
            task=task,
            runner=runner,
            convert=convert,
            tokenizer=tokenizer_name,
            tokenizer_mode=tokenizer_mode,
            trust_remote_code=trust_remote_code,
@ -6,7 +6,7 @@ from typing import Literal, NamedTuple, Optional

import pytest

from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.logger import init_logger

from ..utils import compare_two_settings, create_new_process_for_each_test

@ -31,14 +31,14 @@ class EPTestOptions(NamedTuple):
class EPTestSettings:
    parallel_setups: list[ParallelSetup]
    distributed_backends: list[str]
    task: TaskOption
    runner: RunnerOption
    test_options: EPTestOptions

    @staticmethod
    def detailed(
        *,
        tp_base: int = 2,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        trust_remote_code: bool = False,
        tokenizer_mode: Optional[str] = None,
        load_format: Optional[str] = None,

@ -63,7 +63,7 @@ class EPTestSettings:
                          chunked_prefill=False),
            ],
            distributed_backends=["mp", "ray"],
            task=task,
            runner=runner,
            test_options=EPTestOptions(trust_remote_code=trust_remote_code,
                                       tokenizer_mode=tokenizer_mode,
                                       load_format=load_format,

@ -74,7 +74,7 @@ class EPTestSettings:
    def fast(
        *,
        tp_base: int = 2,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        trust_remote_code: bool = False,
        tokenizer_mode: Optional[str] = None,
        load_format: Optional[str] = None,

@ -87,7 +87,7 @@ class EPTestSettings:
                          chunked_prefill=False),
            ],
            distributed_backends=["mp"],
            task=task,
            runner=runner,
            test_options=EPTestOptions(trust_remote_code=trust_remote_code,
                                       tokenizer_mode=tokenizer_mode,
                                       load_format=load_format,

@ -100,7 +100,7 @@ class EPTestSettings:
        for parallel_setup in self.parallel_setups:
            for distributed_backend in self.distributed_backends:
                yield (model_name, parallel_setup, distributed_backend,
                       self.task, opts)
                       self.runner, opts)


# NOTE: You can adjust tp_base locally to fit the model in GPU

@ -118,7 +118,7 @@ def _compare_tp(
    model_name: str,
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: EPTestOptions,
    num_gpus_available: int,
    *,

@ -154,8 +154,8 @@ def _compare_tp(
        common_args.append("--enable-chunked-prefill")
    if eager_mode:
        common_args.append("--enforce-eager")
    if task != "auto":
        common_args.extend(["--task", task])
    if runner != "auto":
        common_args.extend(["--runner", runner])
    if trust_remote_code:
        common_args.append("--trust-remote-code")
    if tokenizer_mode:

@ -203,7 +203,7 @@ def _compare_tp(


@pytest.mark.parametrize(
    ("model_name", "parallel_setup", "distributed_backend", "task",
    ("model_name", "parallel_setup", "distributed_backend", "runner",
     "test_options"),
    [
        params for model_name, settings in TEST_MODELS.items()

@ -215,14 +215,14 @@ def test_ep(
    model_name: str,
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: EPTestOptions,
    num_gpus_available,
):
    _compare_tp(model_name,
                parallel_setup,
                distributed_backend,
                task,
                runner,
                test_options,
                num_gpus_available,
                method="generate")
@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional

import pytest

from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, TaskOption
from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, RunnerOption
from vllm.logger import init_logger
from vllm.transformers_utils.config import get_config

@ -60,7 +60,7 @@ class PPTestSettings:
    distributed_backends: list[str]
    # vllm major version: "0" for V0, "1" for V1
    vllm_major_versions: list[str]
    task: TaskOption
    runner: RunnerOption
    test_options: PPTestOptions

    def __post_init__(self):

@ -76,7 +76,7 @@ class PPTestSettings:
        tp_base: int = 1,
        pp_base: int = 2,
        multi_node_only: bool = False,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        load_format: Optional[str] = None,
    ):
        return PPTestSettings(

@ -104,7 +104,7 @@ class PPTestSettings:
            ],
            distributed_backends=["mp", "mp", "ray", "ray"],
            vllm_major_versions=["0", "1", "0", "1"],
            task=task,
            runner=runner,
            test_options=PPTestOptions(multi_node_only=multi_node_only,
                                       load_format=load_format),
        )

@ -114,7 +114,7 @@ class PPTestSettings:
        *,
        tp_base: int = 1,
        pp_base: int = 2,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        multi_node_only: bool = False,
        load_format: Optional[str] = None,
    ):

@ -127,7 +127,7 @@ class PPTestSettings:
            ],
            distributed_backends=["mp"],
            vllm_major_versions=["0"],
            task=task,
            runner=runner,
            test_options=PPTestOptions(multi_node_only=multi_node_only,
                                       load_format=load_format),
        )

@ -139,7 +139,7 @@ class PPTestSettings:
            for backend, vllm_major_version in zip(self.distributed_backends,
                                                   self.vllm_major_versions):
                yield (model_id, parallel_setup, backend, vllm_major_version,
                       self.task, opts)
                       self.runner, opts)


# NOTE: You can adjust tp_base and/or pp_base locally to fit the model in GPU

@ -211,10 +211,10 @@ TEXT_GENERATION_MODELS = {

EMBEDDING_MODELS = {  # type: ignore[var-annotated]
    # [Text-only]
    "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(task="embed"),
    "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(task="embed"),
    "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(runner="pooling"),
    "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(runner="pooling"),
    "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(
        load_format="dummy", task="embed"
        load_format="dummy", runner="pooling"
    ),
}

@ -269,7 +269,7 @@ def _compare_tp(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: PPTestOptions,
    num_gpus_available: int,
    *,

@ -335,8 +335,8 @@ def _compare_tp(
        common_args.append("--enable-chunked-prefill")
    if eager_mode:
        common_args.append("--enforce-eager")
    if task != "auto":
        common_args.extend(["--task", task])
    if runner != "auto":
        common_args.extend(["--runner", runner])
    if trust_remote_code:
        common_args.append("--trust-remote-code")
    if tokenizer_mode:

@ -415,7 +415,7 @@ def _compare_tp(

@pytest.mark.parametrize(
    ("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
     "task", "test_options"),
     "runner", "test_options"),
    [
        params for model_id, settings in TEXT_GENERATION_MODELS.items()
        for params in settings.iter_params(model_id) if model_id in TEST_MODELS

@ -427,7 +427,7 @@ def test_tp_language_generation(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: PPTestOptions,
    num_gpus_available,
):

@ -435,7 +435,7 @@ def test_tp_language_generation(
        parallel_setup,
        distributed_backend,
        vllm_major_version,
        task,
        runner,
        test_options,
        num_gpus_available,
        method="generate",

@ -444,7 +444,7 @@ def test_tp_language_generation(

@pytest.mark.parametrize(
    ("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
     "task", "test_options"),
     "runner", "test_options"),
    [
        params for model_id, settings in EMBEDDING_MODELS.items()
        for params in settings.iter_params(model_id) if model_id in TEST_MODELS

@ -456,7 +456,7 @@ def test_tp_language_embedding(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: PPTestOptions,
    num_gpus_available,
):

@ -464,7 +464,7 @@ def test_tp_language_embedding(
        parallel_setup,
        distributed_backend,
        vllm_major_version,
        task,
        runner,
        test_options,
        num_gpus_available,
        method="encode",

@ -473,7 +473,7 @@ def test_tp_language_embedding(

@pytest.mark.parametrize(
    ("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
     "task", "test_options"),
     "runner", "test_options"),
    [
        params for model_id, settings in MULTIMODAL_MODELS.items()
        for params in settings.iter_params(model_id) if model_id in TEST_MODELS

@ -485,7 +485,7 @@ def test_tp_multimodal_generation(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: PPTestOptions,
    num_gpus_available,
):

@ -493,7 +493,7 @@ def test_tp_multimodal_generation(
        parallel_setup,
        distributed_backend,
        vllm_major_version,
        task,
        runner,
        test_options,
        num_gpus_available,
        method="generate",
@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional

import pytest

from vllm.config import TaskOption
from vllm.config import RunnerOption
from vllm.logger import init_logger

from ..models.registry import HF_EXAMPLE_MODELS

@ -48,7 +48,7 @@ class SPTestSettings:
    distributed_backends: list[str]
    # vllm major version: "0" for V0, "1" for V1
    vllm_major_versions: list[str]
    task: TaskOption
    runner: RunnerOption
    test_options: SPTestOptions

    def __post_init__(self):

@ -64,7 +64,7 @@ class SPTestSettings:
        tp_base: int = 2,
        pp_base: int = 1,
        multi_node_only: bool = False,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        load_format: Optional[str] = None,
    ):
        parallel_setups = []

@ -81,7 +81,7 @@ class SPTestSettings:
            parallel_setups=parallel_setups,
            distributed_backends=["mp", "ray"],
            vllm_major_versions=["1", "1"],
            task=task,
            runner=runner,
            test_options=SPTestOptions(multi_node_only=multi_node_only,
                                       load_format=load_format),
        )

@ -91,7 +91,7 @@ class SPTestSettings:
        *,
        tp_base: int = 2,
        pp_base: int = 1,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        multi_node_only: bool = False,
        load_format: Optional[str] = None,
    ):

@ -109,7 +109,7 @@ class SPTestSettings:
            parallel_setups=parallel_setups,
            distributed_backends=["mp", "ray"],
            vllm_major_versions=["1", "1"],
            task=task,
            runner=runner,
            test_options=SPTestOptions(multi_node_only=multi_node_only,
                                       load_format=load_format),
        )

@ -119,7 +119,7 @@ class SPTestSettings:
        *,
        tp_base: int = 2,
        pp_base: int = 1,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        multi_node_only: bool = False,
        load_format: Optional[str] = None,
    ):

@ -135,7 +135,7 @@ class SPTestSettings:
            parallel_setups=parallel_setups,
            distributed_backends=["mp", "ray"],
            vllm_major_versions=["1", "1"],
            task=task,
            runner=runner,
            test_options=SPTestOptions(multi_node_only=multi_node_only,
                                       load_format=load_format),
        )

@ -147,7 +147,7 @@ class SPTestSettings:
            for backend, vllm_major_version in zip(self.distributed_backends,
                                                   self.vllm_major_versions):
                yield (model_id, parallel_setup, backend, vllm_major_version,
                       self.task, opts)
                       self.runner, opts)


def _compare_sp(

@ -155,7 +155,7 @@ def _compare_sp(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: SPTestOptions,
    num_gpus_available: int,
    *,

@ -217,8 +217,8 @@ def _compare_sp(
        common_args.append("--enable-chunked-prefill")
    if eager_mode:
        common_args.append("--enforce-eager")
    if task != "auto":
        common_args.extend(["--task", task])
    if runner != "auto":
        common_args.extend(["--runner", runner])
    if trust_remote_code:
        common_args.append("--trust-remote-code")
    if tokenizer_mode:

@ -298,7 +298,7 @@ SP_TEST_MODELS = [

@pytest.mark.parametrize(
    ("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
     "task", "test_options"),
     "runner", "test_options"),
    [
        params for model_id, settings in SP_TEXT_GENERATION_MODELS.items()
        for params in settings.iter_params(model_id)

@ -311,7 +311,7 @@ def test_tp_sp_generation(
    parallel_setup: ParallelSetup,
    distributed_backend: str,
    vllm_major_version: str,
    task: TaskOption,
    runner: RunnerOption,
    test_options: SPTestOptions,
    num_gpus_available,
):

@ -319,7 +319,7 @@ def test_tp_sp_generation(
        parallel_setup,
        distributed_backend,
        vllm_major_version,
        task,
        runner,
        test_options,
        num_gpus_available,
        method="generate",
@ -19,7 +19,8 @@ MAIN_SCORE = 0.7422994752439667
@pytest.fixture(scope="module")
def server():
    args = [
        "--task", "embed", "--enforce-eager", "--disable-uvicorn-access-log"
        "--runner", "pooling", "--enforce-eager",
        "--disable-uvicorn-access-log"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:

@ -21,7 +21,8 @@ MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
@pytest.fixture(scope="module")
def server():
    args = [
        "--task", "score", "--enforce-eager", "--disable-uvicorn-access-log"
        "--runner", "pooling", "--enforce-eager",
        "--disable-uvicorn-access-log"
    ]

    with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
@ -15,10 +15,6 @@ MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
def get_vocab_size(model_name):
    config = ModelConfig(
        model=model_name,
        task="auto",
        tokenizer=model_name,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="bfloat16",
    )

@ -102,6 +102,7 @@ def test_get_gen_prompt(model, template, add_generation_prompt,
        tokenizer=model_info.tokenizer or model,
        tokenizer_mode=model_info.tokenizer_mode,
        trust_remote_code=model_info.trust_remote_code,
        revision=model_info.revision,
        hf_overrides=model_info.hf_overrides,
    )
@ -33,8 +33,8 @@ def v1(run_with_both_engines):
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        DTYPE,

@ -42,8 +42,8 @@ def dtype(request):
@pytest.fixture(scope="module")
def server(model_info, dtype: str):
    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        dtype,

@ -21,7 +21,7 @@ LONG_TIMEOUT_SECONDS: Final[int] = 60
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "--runner",
        "generate",
        "--max-model-len",
        "2048",

@ -27,8 +27,8 @@ def server(request: pytest.FixtureRequest):
        passed_params = [passed_params]

    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        "float16",

@ -20,8 +20,8 @@ DUMMY_CHAT_TEMPLATE = """{% for message in messages %}{{message['role'] + ': ' +
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "reward",
        "--runner",
        "pooling",
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        "bfloat16",

@ -26,8 +26,8 @@ def v1(run_with_both_engines):
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        # use half precision for speed and memory savings in CI environment
        "--dtype",
        DTYPE,

@ -29,8 +29,8 @@ input = """Immerse yourself in the enchanting chronicle of calculus, a
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        "--dtype",
        "bfloat16",
        "--enforce-eager",

@ -25,7 +25,7 @@ TEST_VIDEO_URLS = [
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "--runner",
        "generate",
        "--max-model-len",
        "32768",

@ -48,7 +48,7 @@ EXPECTED_MM_BEAM_SEARCH_RES = [
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "--runner",
        "generate",
        "--max-model-len",
        "2048",

@ -31,8 +31,8 @@ TEST_IMAGE_URLS = [
@pytest.fixture(scope="module")
def server():
    args = [
        "--task",
        "embed",
        "--runner",
        "pooling",
        "--max-model-len",
        "2048",
        "--max-num-seqs",
@ -47,12 +47,8 @@ MISTRAL_MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
@pytest.fixture(scope="function")
def phi3v_model_config():
    return ModelConfig(PHI3V_MODEL_ID,
                       task="generate",
                       tokenizer=PHI3V_MODEL_ID,
                       tokenizer_mode="auto",
                       runner="generate",
                       trust_remote_code=True,
                       dtype="auto",
                       seed=0,
                       limit_mm_per_prompt={
                           "image": 2,
                       })

@ -61,12 +57,8 @@ def phi3v_model_config():
@pytest.fixture(scope="function")
def phi3v_model_config_mm_interleaved():
    return ModelConfig(PHI3V_MODEL_ID,
                       task="generate",
                       tokenizer=PHI3V_MODEL_ID,
                       tokenizer_mode="auto",
                       runner="generate",
                       trust_remote_code=True,
                       dtype="auto",
                       seed=0,
                       interleave_mm_strings=True,
                       limit_mm_per_prompt={
                           "image": 2,

@ -86,11 +78,7 @@ def phi3v_tokenizer():
@pytest.fixture(scope="function")
def qwen25omni_model_config_mm_interleaved():
    return ModelConfig(QWEN25OMNI_MODEL_ID,
                       task="generate",
                       tokenizer=QWEN25OMNI_MODEL_ID,
                       tokenizer_mode="auto",
                       dtype="auto",
                       seed=0,
                       runner="generate",
                       interleave_mm_strings=True,
                       limit_mm_per_prompt={
                           "image": 2,

@ -112,12 +100,7 @@ def qwen25omni_tokenizer():
@pytest.fixture(scope="module")
def mllama_model_config():
    return ModelConfig(MLLAMA_MODEL_ID,
                       task="generate",
                       tokenizer=MLLAMA_MODEL_ID,
                       tokenizer_mode="auto",
                       trust_remote_code=True,
                       dtype="auto",
                       seed=0,
                       runner="generate",
                       limit_mm_per_prompt={
                           "image": 2,
                       })

@ -136,12 +119,7 @@ def mllama_tokenizer():
@pytest.fixture(scope="function")
def mistral_model_config():
    return ModelConfig(MISTRAL_MODEL_ID,
                       task="generate",
                       tokenizer=MISTRAL_MODEL_ID,
                       tokenizer_mode="auto",
                       trust_remote_code=True,
                       dtype="auto",
                       seed=0,
                       runner="generate",
                       limit_mm_per_prompt={
                           "image": 2,
                       })

@ -1105,12 +1083,7 @@ def test_multimodal_image_parsing_matches_hf(model, image_url):

    # Build a config for the model
    model_config = ModelConfig(model,
                               task="generate",
                               tokenizer=model,
                               tokenizer_mode="auto",
                               trust_remote_code=True,
                               dtype="auto",
                               seed=0,
                               runner="generate",
                               limit_mm_per_prompt={
                                   "image": 2,
                               })

@ -1170,6 +1143,7 @@ def test_resolve_hf_chat_template(sample_json_schema, model, use_tools):
        model,
        tokenizer=model_info.tokenizer or model,
        tokenizer_mode=model_info.tokenizer_mode,
        revision=model_info.revision,
        trust_remote_code=model_info.trust_remote_code,
        hf_overrides=model_info.hf_overrides,
    )

@ -1225,6 +1199,7 @@ def test_resolve_content_format_hf_defined(model, expected_format):
        model,
        tokenizer=model_info.tokenizer or model,
        tokenizer_mode=model_info.tokenizer_mode,
        revision=model_info.revision,
        trust_remote_code=model_info.trust_remote_code,
        hf_overrides=model_info.hf_overrides,
    )

@ -1284,6 +1259,7 @@ def test_resolve_content_format_fallbacks(model, expected_format):
        model,
        tokenizer=model_info.tokenizer or model,
        tokenizer_mode=model_info.tokenizer_mode,
        revision=model_info.revision,
        trust_remote_code=model_info.trust_remote_code,
        hf_overrides=model_info.hf_overrides,
    )
@ -38,13 +38,8 @@ def test_worker_apply_lora(sql_lora_files):
    vllm_config = VllmConfig(
        model_config=ModelConfig(
            "meta-llama/Llama-2-7b-hf",
            task="auto",
            tokenizer="meta-llama/Llama-2-7b-hf",
            tokenizer_mode="auto",
            trust_remote_code=False,
            seed=0,
            dtype="float16",
            revision=None,
            enforce_eager=True,
        ),
        load_config=LoadConfig(
@ -69,10 +69,7 @@ async def test_guided_logits_processor_black_box(backend: str, is_local: bool,

    config = ModelConfig(
        MODEL_NAME,
        task="generate",
        tokenizer=MODEL_NAME,
        tokenizer_mode="auto",
        trust_remote_code=False,
        runner="generate",
        seed=0,
        dtype="bfloat16",
    )

@ -113,10 +110,7 @@ async def test_guided_logits_processor_with_reasoning(

    config = ModelConfig(
        REASONING_MODEL_NAME,
        task="generate",
        tokenizer=REASONING_MODEL_NAME,
        tokenizer_mode="auto",
        trust_remote_code=False,
        runner="generate",
        seed=0,
        dtype="bfloat16",
    )
@ -57,7 +57,6 @@ def test_model_loading_with_params(vllm_runner, monkeypatch):
|
||||
|
||||
vllm_model.apply_model(check_model)
|
||||
|
||||
# assert output
|
||||
assert output
|
||||
|
||||
|
||||
@ -99,7 +98,6 @@ def test_roberta_model_loading_with_params(vllm_runner, monkeypatch):
|
||||
|
||||
vllm_model.apply_model(check_model)
|
||||
|
||||
# assert output
|
||||
assert output
|
||||
|
||||
|
||||
|
||||
@ -52,7 +52,7 @@ def correctness_test_embed_models(hf_runner,
|
||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
**vllm_extra_kwargs) as vllm_model:
|
||||
vllm_outputs = vllm_model.embed(example_prompts)
|
||||
|
||||
@ -172,7 +172,7 @@ def mteb_test_embed_models(hf_runner,
|
||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
**vllm_extra_kwargs) as vllm_model:
|
||||
|
||||
@ -279,15 +279,12 @@ def mteb_test_rerank_models(hf_runner,
|
||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||
|
||||
with vllm_runner(model_info.name,
|
||||
task="score",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
max_num_seqs=8,
|
||||
**vllm_extra_kwargs) as vllm_model:
|
||||
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
|
||||
if model_info.architecture:
|
||||
assert (model_info.architecture in model_config.architectures)
|
||||
assert model_config.hf_config.num_labels == 1
|
||||
|
||||
vllm_main_score = run_mteb_rerank(vllm_mteb_encoder(vllm_model),
|
||||
|
||||
@ -85,7 +85,7 @@ def test_models(
|
||||
hf_outputs = hf_model.encode(example_prompts)
|
||||
|
||||
with vllm_runner(model,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=max_model_len,
|
||||
**vllm_extra_kwargs) as vllm_model:
|
||||
vllm_outputs = vllm_model.embed(example_prompts)
|
||||
|
||||
@ -28,10 +28,7 @@ def test_find_array():
|
||||
|
||||
model_config = ModelConfig(
|
||||
MODEL_NAME,
|
||||
task="embed",
|
||||
tokenizer=MODEL_NAME,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
runner="pooling",
|
||||
dtype="bfloat16",
|
||||
seed=0,
|
||||
)
|
||||
@ -117,7 +114,7 @@ def test_gritlm_offline_embedding(vllm_runner):
|
||||
|
||||
with vllm_runner(
|
||||
MODEL_NAME,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=MAX_MODEL_LEN,
|
||||
) as vllm_model:
|
||||
llm = vllm_model.llm
|
||||
@ -140,7 +137,7 @@ def test_gritlm_offline_embedding(vllm_runner):
|
||||
async def test_gritlm_api_server_embedding():
|
||||
queries, q_instruction, documents, d_instruction = get_test_data()
|
||||
|
||||
args = ["--task", "embed", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||
args = ["--runner", "pooling", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||
|
||||
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
||||
client_embedding = server.get_async_client()
|
||||
@ -164,7 +161,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
|
||||
|
||||
with vllm_runner(
|
||||
MODEL_NAME,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
max_model_len=MAX_MODEL_LEN,
|
||||
) as vllm_model:
|
||||
llm = vllm_model.llm
|
||||
@ -179,7 +176,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
|
||||
async def test_gritlm_api_server_generate():
|
||||
input = "<|user|>\nWhat is the capital of France?\n<|assistant|>\n"
|
||||
|
||||
args = ["--task", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||
args = ["--runner", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||
|
||||
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
||||
client_generate = server.get_async_client()
|
||||
|
||||
@ -4,6 +4,7 @@ from functools import partial
|
||||
|
||||
import pytest
|
||||
|
||||
import vllm.envs as envs
|
||||
from vllm import PoolingParams
|
||||
|
||||
from ...utils import EmbedModelInfo, RerankModelInfo
|
||||
@ -62,6 +63,10 @@ def test_embed_models_correctness(hf_runner, vllm_runner,
|
||||
@pytest.mark.parametrize("model_info", RERANK_MODELS)
|
||||
def test_rerank_models_mteb(hf_runner, vllm_runner,
|
||||
model_info: RerankModelInfo) -> None:
|
||||
if (model_info.architecture == "XLMRobertaForSequenceClassification"
|
||||
and envs.VLLM_USE_V1):
|
||||
pytest.skip("Not supported yet")
|
||||
|
||||
mteb_test_rerank_models(hf_runner, vllm_runner, model_info)
|
||||
|
||||
|
||||
@ -92,7 +97,7 @@ def test_matryoshka(
|
||||
hf_outputs = matryoshka_fy(hf_outputs, dimensions)
|
||||
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=dtype,
|
||||
max_model_len=None) as vllm_model:
|
||||
assert vllm_model.llm.llm_engine.model_config.is_matryoshka
|
||||
|
||||
@ -21,7 +21,7 @@ max_model_len = int(original_max_position_embeddings * factor)
|
||||
|
||||
@pytest.mark.parametrize("model_info", MODELS)
|
||||
def test_default(model_info, vllm_runner):
|
||||
with vllm_runner(model_info.name, task="embed",
|
||||
with vllm_runner(model_info.name, runner="pooling",
|
||||
max_model_len=None) as vllm_model:
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
if model_info.name == "nomic-ai/nomic-embed-text-v2-moe":
|
||||
@ -36,7 +36,7 @@ def test_default(model_info, vllm_runner):
|
||||
@pytest.mark.parametrize("model_info", MODELS)
|
||||
def test_set_max_model_len_legal(model_info, vllm_runner):
|
||||
# set max_model_len <= 512
|
||||
with vllm_runner(model_info.name, task="embed",
|
||||
with vllm_runner(model_info.name, runner="pooling",
|
||||
max_model_len=256) as vllm_model:
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
assert model_config.max_model_len == 256
|
||||
@ -46,11 +46,12 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
|
||||
# For nomic-embed-text-v2-moe the length is set to 512
|
||||
# by sentence_bert_config.json.
|
||||
with pytest.raises(ValueError):
|
||||
with vllm_runner(model_info.name, task="embed",
|
||||
with vllm_runner(model_info.name,
|
||||
runner="pooling",
|
||||
max_model_len=1024):
|
||||
pass
|
||||
else:
|
||||
with vllm_runner(model_info.name, task="embed",
|
||||
with vllm_runner(model_info.name, runner="pooling",
|
||||
max_model_len=1024) as vllm_model:
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
assert model_config.max_model_len == 1024
|
||||
@ -60,14 +61,15 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
|
||||
def test_set_max_model_len_illegal(model_info, vllm_runner):
|
||||
# set max_model_len > 2048
|
||||
with pytest.raises(ValueError):
|
||||
with vllm_runner(model_info.name, task="embed", max_model_len=4096):
|
||||
with vllm_runner(model_info.name, runner="pooling",
|
||||
max_model_len=4096):
|
||||
pass
|
||||
|
||||
# set max_model_len > 2048 by hf_overrides
|
||||
hf_overrides = {"max_model_len": 4096}
|
||||
with pytest.raises(ValueError):
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
hf_overrides=hf_overrides):
|
||||
pass
|
||||
@ -87,7 +89,7 @@ def test_use_rope_scaling_legal(model_info, vllm_runner):
|
||||
}
|
||||
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
hf_overrides=hf_overrides):
|
||||
pass
|
||||
@ -107,7 +109,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
|
||||
# illegal max_model_len
|
||||
with pytest.raises(ValueError):
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=max_model_len + 1,
|
||||
hf_overrides=hf_overrides):
|
||||
pass
|
||||
@ -125,7 +127,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
|
||||
# illegal max_model_len by hf_overrides
|
||||
with pytest.raises(ValueError):
|
||||
with vllm_runner(model_info.name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
max_model_len=None,
|
||||
hf_overrides=hf_overrides):
|
||||
pass
|
||||
|
||||
@ -37,7 +37,9 @@ def test_cross_encoder_1_to_1(vllm_runner, hf_runner, model_name):
|
||||
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
|
||||
hf_outputs = hf_model.predict([text_pair]).tolist()
|
||||
|
||||
with vllm_runner(model_name, task="score", dtype=DTYPE,
|
||||
with vllm_runner(model_name,
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])
|
||||
|
||||
@ -56,7 +58,9 @@ def test_cross_encoder_1_to_N(vllm_runner, hf_runner, model_name):
|
||||
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
|
||||
hf_outputs = hf_model.predict(text_pairs).tolist()
|
||||
|
||||
with vllm_runner(model_name, task="score", dtype=DTYPE,
|
||||
with vllm_runner(model_name,
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)
|
||||
|
||||
@ -76,7 +80,9 @@ def test_cross_encoder_N_to_N(vllm_runner, hf_runner, model_name):
|
||||
with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
|
||||
hf_outputs = hf_model.predict(text_pairs).tolist()
|
||||
|
||||
with vllm_runner(model_name, task="score", dtype=DTYPE,
|
||||
with vllm_runner(model_name,
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)
|
||||
|
||||
@ -103,7 +109,7 @@ def test_embedding_1_to_1(vllm_runner, hf_runner, emb_model_name):
|
||||
]
|
||||
|
||||
with vllm_runner(emb_model_name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])
|
||||
@ -131,7 +137,7 @@ def test_embedding_1_to_N(vllm_runner, hf_runner, emb_model_name):
|
||||
]
|
||||
|
||||
with vllm_runner(emb_model_name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)
|
||||
@ -160,7 +166,7 @@ def test_embedding_N_to_N(vllm_runner, hf_runner, emb_model_name):
|
||||
]
|
||||
|
||||
with vllm_runner(emb_model_name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=DTYPE,
|
||||
max_model_len=None) as vllm_model:
|
||||
vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)
|
||||
|
||||
@ -26,7 +26,7 @@ def test_smaller_truncation_size(vllm_runner,
|
||||
|
||||
truncate_prompt_tokens = 10
|
||||
|
||||
with vllm_runner(model_name, task="embed",
|
||||
with vllm_runner(model_name, runner="pooling",
|
||||
max_model_len=max_model_len) as vllm_model:
|
||||
vllm_output = vllm_model.llm.encode(
|
||||
input_str, truncate_prompt_tokens=truncate_prompt_tokens)
|
||||
@ -41,7 +41,7 @@ def test_max_truncation_size(vllm_runner,
|
||||
input_str=input_str):
|
||||
truncate_prompt_tokens = -1
|
||||
|
||||
with vllm_runner(model_name, task="embed",
|
||||
with vllm_runner(model_name, runner="pooling",
|
||||
max_model_len=max_model_len) as vllm_model:
|
||||
vllm_output = vllm_model.llm.encode(
|
||||
input_str, truncate_prompt_tokens=truncate_prompt_tokens)
|
||||
@ -58,7 +58,7 @@ def test_bigger_truncation_size(vllm_runner,
|
||||
truncate_prompt_tokens = max_model_len + 1
|
||||
|
||||
with pytest.raises(ValueError), vllm_runner(
|
||||
model_name, task="embed",
|
||||
model_name, runner="pooling",
|
||||
max_model_len=max_model_len) as vllm_model:
|
||||
|
||||
llm_output = vllm_model.llm.encode(
|
||||
|
||||
@ -222,7 +222,6 @@ VLM_TEST_SETTINGS = {
|
||||
},
|
||||
marks=[large_gpu_mark(min_gb=32)],
|
||||
),
|
||||
# Check "auto" with fallback to transformers
|
||||
"internvl-transformers": VLMTestInfo(
|
||||
models=["OpenGVLab/InternVL3-1B-hf"],
|
||||
test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
|
||||
@ -232,7 +231,7 @@ VLM_TEST_SETTINGS = {
|
||||
use_tokenizer_eos=True,
|
||||
image_size_factors=[(0.25, 0.5, 1.0)],
|
||||
vllm_runner_kwargs={
|
||||
"model_impl": "auto",
|
||||
"model_impl": "transformers",
|
||||
},
|
||||
auto_cls=AutoModelForImageTextToText,
|
||||
marks=[pytest.mark.core_model],
|
||||
@ -638,7 +637,7 @@ VLM_TEST_SETTINGS = {
|
||||
img_idx_to_prompt=lambda idx: f"<|image_{idx}|>\n",
|
||||
max_model_len=4096,
|
||||
max_num_seqs=2,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
# use sdpa mode for hf runner since phi3v didn't work with flash_attn
|
||||
hf_model_kwargs={"_attn_implementation": "sdpa"},
|
||||
use_tokenizer_eos=True,
|
||||
|
||||
@ -65,7 +65,7 @@ def run_test(
|
||||
# max_model_len should be greater than image_feature_size
|
||||
with vllm_runner(
|
||||
model,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
max_model_len=max_model_len,
|
||||
max_num_seqs=1,
|
||||
dtype=dtype,
|
||||
|
||||
@ -48,7 +48,7 @@ def test_models(vllm_runner, model, dtype: str, max_tokens: int) -> None:
|
||||
]
|
||||
|
||||
with vllm_runner(model,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
dtype=dtype,
|
||||
limit_mm_per_prompt={"image": 2},
|
||||
max_model_len=32768,
|
||||
|
||||
@ -99,7 +99,7 @@ def run_test(
|
||||
# max_model_len should be greater than image_feature_size
|
||||
with vllm_runner(
|
||||
model,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
max_model_len=max_model_len,
|
||||
max_num_seqs=2,
|
||||
dtype=dtype,
|
||||
|
||||
@ -267,7 +267,7 @@ def run_embedding_input_test(
|
||||
|
||||
# max_model_len should be greater than image_feature_size
|
||||
with vllm_runner(model,
|
||||
task="generate",
|
||||
runner="generate",
|
||||
max_model_len=4000,
|
||||
max_num_seqs=3,
|
||||
dtype=dtype,
|
||||
|
||||
@ -6,7 +6,7 @@ from typing import Any, Callable, Optional
|
||||
import torch
|
||||
from transformers.models.auto.auto_factory import _BaseAutoModelClass
|
||||
|
||||
from vllm.config import TaskOption
|
||||
from vllm.config import RunnerOption
|
||||
from vllm.transformers_utils.tokenizer import AnyTokenizer
|
||||
|
||||
from .....conftest import HfRunner, VllmRunner
|
||||
@ -37,7 +37,7 @@ def run_test(
|
||||
vllm_runner_kwargs: Optional[dict[str, Any]],
|
||||
hf_model_kwargs: Optional[dict[str, Any]],
|
||||
patch_hf_runner: Optional[Callable[[HfRunner], HfRunner]],
|
||||
task: TaskOption = "auto",
|
||||
runner: RunnerOption = "auto",
|
||||
distributed_executor_backend: Optional[str] = None,
|
||||
tensor_parallel_size: int = 1,
|
||||
vllm_embeddings: Optional[torch.Tensor] = None,
|
||||
@ -83,7 +83,7 @@ def run_test(
|
||||
tensor_parallel_size=tensor_parallel_size,
|
||||
distributed_executor_backend=distributed_executor_backend,
|
||||
enforce_eager=enforce_eager,
|
||||
task=task,
|
||||
runner=runner,
|
||||
**vllm_runner_kwargs_) as vllm_model:
|
||||
tokenizer = vllm_model.llm.get_tokenizer()
|
||||
|
||||
|
||||
@ -11,7 +11,7 @@ from pytest import MarkDecorator
|
||||
from transformers import AutoModelForCausalLM
|
||||
from transformers.models.auto.auto_factory import _BaseAutoModelClass
|
||||
|
||||
from vllm.config import TaskOption
|
||||
from vllm.config import RunnerOption
|
||||
from vllm.sequence import SampleLogprobs
|
||||
from vllm.transformers_utils.tokenizer import AnyTokenizer
|
||||
|
||||
@ -109,7 +109,7 @@ class VLMTestInfo(NamedTuple):
|
||||
enforce_eager: bool = True
|
||||
max_model_len: int = 1024
|
||||
max_num_seqs: int = 256
|
||||
task: TaskOption = "auto"
|
||||
runner: RunnerOption = "auto"
|
||||
tensor_parallel_size: int = 1
|
||||
vllm_runner_kwargs: Optional[dict[str, Any]] = None
|
||||
|
||||
@ -173,7 +173,7 @@ class VLMTestInfo(NamedTuple):
|
||||
"enforce_eager": self.enforce_eager,
|
||||
"max_model_len": self.max_model_len,
|
||||
"max_num_seqs": self.max_num_seqs,
|
||||
"task": self.task,
|
||||
"runner": self.runner,
|
||||
"tensor_parallel_size": self.tensor_parallel_size,
|
||||
"vllm_runner_kwargs": self.vllm_runner_kwargs,
|
||||
"hf_output_post_proc": self.hf_output_post_proc,
|
||||
|
||||
@ -92,7 +92,7 @@ def _run_test(
|
||||
# if we run HF first, the cuda initialization will be done and it
|
||||
# will hurt multiprocessing backend with fork method (the default method).
|
||||
with vllm_runner(model,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=dtype,
|
||||
enforce_eager=True,
|
||||
max_model_len=8192) as vllm_model:
|
||||
|
||||
@ -49,7 +49,7 @@ def vllm_reranker(
|
||||
|
||||
with vllm_runner(
|
||||
model_name,
|
||||
task="score",
|
||||
runner="pooling",
|
||||
dtype=dtype,
|
||||
max_num_seqs=2,
|
||||
max_model_len=2048,
|
||||
|
||||
@ -64,7 +64,7 @@ def _run_test(
|
||||
# if we run HF first, the cuda initialization will be done and it
|
||||
# will hurt multiprocessing backend with fork method (the default method).
|
||||
with vllm_runner(model,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=dtype,
|
||||
max_model_len=4096,
|
||||
enforce_eager=True) as vllm_model:
|
||||
|
||||
@ -44,7 +44,7 @@ def _run_test(
|
||||
# vLLM needs a fresh new process without cuda initialization.
|
||||
# if we run HF first, the cuda initialization will be done and it
|
||||
# will hurt multiprocessing backend with fork method (the default method).
|
||||
with vllm_runner(model, task="embed", dtype=dtype,
|
||||
with vllm_runner(model, runner="pooling", dtype=dtype,
|
||||
enforce_eager=True) as vllm_model:
|
||||
vllm_outputs = vllm_model.embed(input_texts, images=input_images)
|
||||
|
||||
|
||||
@ -34,7 +34,7 @@ def _run_test(
|
||||
set_default_torch_num_threads(1),
|
||||
vllm_runner(
|
||||
model,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=torch.float16,
|
||||
enforce_eager=True,
|
||||
skip_tokenizer_init=True,
|
||||
|
||||
@ -58,13 +58,10 @@ def _test_processing_correctness(
|
||||
|
||||
model_config = ModelConfig(
|
||||
model_id,
|
||||
task="auto",
|
||||
tokenizer=model_info.tokenizer or model_id,
|
||||
tokenizer_mode=model_info.tokenizer_mode,
|
||||
trust_remote_code=model_info.trust_remote_code,
|
||||
seed=0,
|
||||
dtype="auto",
|
||||
revision=model_info.revision,
|
||||
trust_remote_code=model_info.trust_remote_code,
|
||||
hf_overrides=model_info.hf_overrides,
|
||||
)
|
||||
|
||||
|
||||
@ -54,13 +54,10 @@ def test_hf_model_weights_mapper(model_arch: str):
|
||||
|
||||
model_config = ModelConfig(
|
||||
model_id,
|
||||
task="auto",
|
||||
tokenizer=model_info.tokenizer or model_id,
|
||||
tokenizer_mode=model_info.tokenizer_mode,
|
||||
revision=model_info.revision,
|
||||
trust_remote_code=model_info.trust_remote_code,
|
||||
seed=0,
|
||||
dtype="auto",
|
||||
revision=None,
|
||||
hf_overrides=model_info.hf_overrides,
|
||||
)
|
||||
model_cls = MULTIMODAL_REGISTRY._get_model_cls(model_config)
|
||||
|
||||
@ -172,7 +172,7 @@ def test_4bit_bnb_embedding_model(
|
||||
|
||||
# Inflight 4bit quantization
|
||||
with vllm_runner(model_name,
|
||||
task="embed",
|
||||
runner="pooling",
|
||||
dtype=dtype,
|
||||
gpu_memory_utilization=0.5,
|
||||
quantization="bitsandbytes") as vllm_model:
|
||||
|
||||
@ -7,13 +7,15 @@ import pytest
|
||||
from transformers import PretrainedConfig
|
||||
|
||||
from vllm import LLM
|
||||
from vllm.config import ModelImpl
|
||||
from vllm.engine.llm_engine import LLMEngine as V0LLMEngine
|
||||
from vllm.utils import GiB_bytes
|
||||
from vllm.v1.core.kv_cache_utils import get_kv_cache_config
|
||||
from vllm.v1.engine.core import EngineCore as V1EngineCore
|
||||
|
||||
from ..utils import create_new_process_for_each_test
|
||||
from .registry import AUTO_EXAMPLE_MODELS, HF_EXAMPLE_MODELS, HfExampleModels
|
||||
from .registry import (_TRANSFORMERS_BACKEND_MODELS, AUTO_EXAMPLE_MODELS,
|
||||
HF_EXAMPLE_MODELS, HfExampleModels)
|
||||
|
||||
|
||||
@create_new_process_for_each_test()
|
||||
@ -126,6 +128,8 @@ def can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch,
|
||||
# these tests seem to produce leftover memory
|
||||
gpu_memory_utilization=0.80,
|
||||
load_format="dummy",
|
||||
model_impl=ModelImpl.TRANSFORMERS
|
||||
if model_arch in _TRANSFORMERS_BACKEND_MODELS else ModelImpl.VLLM,
|
||||
hf_overrides=hf_overrides,
|
||||
)
|
||||
|
||||
|
||||
@ -24,11 +24,9 @@ from .registry import HF_EXAMPLE_MODELS
|
||||
|
||||
@pytest.mark.parametrize("model_arch", ModelRegistry.get_supported_archs())
|
||||
def test_registry_imports(model_arch):
|
||||
model_info = HF_EXAMPLE_MODELS.get_hf_info(model_arch)
|
||||
model_info.check_transformers_version(on_fail="skip")
|
||||
|
||||
# Ensure all model classes can be imported successfully
|
||||
model_cls, _ = ModelRegistry.resolve_model_cls(model_arch)
|
||||
model_cls = ModelRegistry._try_load_model_cls(model_arch)
|
||||
assert model_cls is not None
|
||||
|
||||
if model_arch in _SPECULATIVE_DECODING_MODELS:
|
||||
return # Ignore these models which do not have a unified format
|
||||
@ -56,14 +54,16 @@ def test_registry_imports(model_arch):
|
||||
("XLMRobertaForSequenceClassification", False, False, True),
|
||||
])
|
||||
def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
|
||||
assert ModelRegistry.is_multimodal_model(model_arch) is is_mm
|
||||
model_info = ModelRegistry._try_inspect_model_cls(model_arch)
|
||||
assert model_info is not None
|
||||
|
||||
assert ModelRegistry.is_cross_encoder_model(model_arch) is is_ce
|
||||
assert model_info.supports_multimodal is is_mm
|
||||
assert model_info.supports_cross_encoding is is_ce
|
||||
|
||||
if init_cuda and current_platform.is_cuda_alike():
|
||||
assert not torch.cuda.is_initialized()
|
||||
|
||||
ModelRegistry.resolve_model_cls(model_arch)
|
||||
ModelRegistry._try_load_model_cls(model_arch)
|
||||
if not torch.cuda.is_initialized():
|
||||
warnings.warn(
|
||||
"This model no longer initializes CUDA on import. "
|
||||
@ -82,12 +82,15 @@ def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
|
||||
("Qwen2VLForConditionalGeneration", True, True),
|
||||
])
|
||||
def test_registry_is_pp(model_arch, is_pp, init_cuda):
|
||||
assert ModelRegistry.is_pp_supported_model(model_arch) is is_pp
|
||||
model_info = ModelRegistry._try_inspect_model_cls(model_arch)
|
||||
assert model_info is not None
|
||||
|
||||
assert model_info.supports_pp is is_pp
|
||||
|
||||
if init_cuda and current_platform.is_cuda_alike():
|
||||
assert not torch.cuda.is_initialized()
|
||||
|
||||
ModelRegistry.resolve_model_cls(model_arch)
|
||||
ModelRegistry._try_load_model_cls(model_arch)
|
||||
if not torch.cuda.is_initialized():
|
||||
warnings.warn(
|
||||
"This model no longer initializes CUDA on import. "
|
||||
|
||||
@ -33,6 +33,10 @@ def check_implementation(
|
||||
args = (example_prompts, max_tokens, num_logprobs)
|
||||
|
||||
with runner_test(model, **kwargs_test, **kwargs) as model_test:
|
||||
model_config = model_test.llm.llm_engine.model_config
|
||||
assert model_config.architecture == (
|
||||
model_config._get_transformers_backend_cls())
|
||||
|
||||
outputs_test = model_test.generate_greedy_logprobs(*args)
|
||||
|
||||
with runner_ref(model, **kwargs_ref) as model_ref:
|
||||
@ -130,8 +134,13 @@ def test_quantization(
|
||||
model_impl="transformers",
|
||||
enforce_eager=True,
|
||||
**quantization_kwargs) as vllm_model: # type: ignore[arg-type]
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
assert model_config.architecture == (
|
||||
model_config._get_transformers_backend_cls())
|
||||
|
||||
transformers_outputs = vllm_model.generate_greedy_logprobs(
|
||||
example_prompts, max_tokens=max_tokens, num_logprobs=num_logprobs)
|
||||
|
||||
check_logprobs_close(
|
||||
outputs_0_lst=transformers_outputs,
|
||||
outputs_1_lst=vllm_outputs,
|
||||
@ -151,7 +160,6 @@ def test_classify(
|
||||
example_prompts,
|
||||
model: str,
|
||||
dtype: str,
|
||||
monkeypatch,
|
||||
) -> None:
|
||||
import torch
|
||||
from transformers import AutoModelForSequenceClassification
|
||||
@ -160,6 +168,10 @@ def test_classify(
|
||||
max_model_len=512,
|
||||
dtype=dtype,
|
||||
model_impl="transformers") as vllm_model:
|
||||
model_config = vllm_model.llm.llm_engine.model_config
|
||||
assert model_config.architecture == (
|
||||
model_config._get_transformers_backend_cls())
|
||||
|
||||
vllm_outputs = vllm_model.classify(example_prompts)
|
||||
|
||||
with hf_runner(model,
|
||||
|
||||
@ -8,7 +8,7 @@ from typing import Any, NamedTuple, Optional, Union
|
||||
import torch
|
||||
import torch.nn.functional as F
|
||||
|
||||
from vllm.config import ModelConfig, TaskOption
|
||||
from vllm.config import ModelConfig, RunnerOption
|
||||
from vllm.inputs import InputContext
|
||||
from vllm.sequence import Logprob, PromptLogprobs, SampleLogprobs
|
||||
|
||||
@ -255,7 +255,7 @@ def check_logprobs_close(
|
||||
|
||||
def build_model_context(
|
||||
model_id: str,
|
||||
task: TaskOption = "auto",
|
||||
runner: RunnerOption = "auto",
|
||||
dtype: Union[str, torch.dtype] = "auto",
|
||||
model_config_kwargs: Optional[dict[str, Any]] = None,
|
||||
mm_processor_kwargs: Optional[dict[str, Any]] = None,
|
||||
@ -280,9 +280,10 @@ def build_model_context(
|
||||
model_config_kwargs = model_config_kwargs or {}
|
||||
model_config = ModelConfig(
|
||||
model_id,
|
||||
task=task,
|
||||
runner=runner,
|
||||
tokenizer=model_info.tokenizer or model_id,
|
||||
tokenizer_mode=model_info.tokenizer_mode,
|
||||
revision=model_info.revision,
|
||||
trust_remote_code=model_info.trust_remote_code,
|
||||
dtype=dtype,
|
||||
seed=0,
|
||||
|
||||
@ -954,13 +954,6 @@ def test_limit_mm_per_prompt_dummy(model_id, limit, num_supported, is_valid):
|
||||
|
||||
model_config = ModelConfig(
|
||||
model=model_id,
|
||||
task="auto",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="auto",
|
||||
revision=None,
|
||||
limit_mm_per_prompt=limit_mm_per_prompt,
|
||||
)
|
||||
|
||||
@ -993,13 +986,6 @@ def test_limit_mm_per_prompt_apply(model_id, num_images, limit, is_valid):
|
||||
|
||||
model_config = ModelConfig(
|
||||
model=model_id,
|
||||
task="auto",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="auto",
|
||||
revision=None,
|
||||
limit_mm_per_prompt=limit_mm_per_prompt,
|
||||
)
|
||||
|
||||
@ -1061,16 +1047,7 @@ class _ProcessorProxy:
|
||||
)
|
||||
# yapf: enable
|
||||
def test_hf_processor_kwargs(model_id, call_kwargs, expected_kwargs):
|
||||
model_config = ModelConfig(
|
||||
model=model_id,
|
||||
task="auto",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="auto",
|
||||
revision=None,
|
||||
)
|
||||
model_config = ModelConfig(model_id)
|
||||
|
||||
processor = MULTIMODAL_REGISTRY.create_processor(model_config)
|
||||
orig_get_hf_processor = processor.info.get_hf_processor
|
||||
|
||||
@ -57,15 +57,7 @@ def test_auto_gptq(model_arg_exptype: tuple[str, None, str]) -> None:
|
||||
model_path, quantization_arg, expected_type = model_arg_exptype
|
||||
|
||||
try:
|
||||
model_config = ModelConfig(model_path,
|
||||
task="auto",
|
||||
tokenizer=model_path,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
revision=None,
|
||||
quantization=quantization_arg)
|
||||
model_config = ModelConfig(model_path, quantization=quantization_arg)
|
||||
found_quantization_type = model_config.quantization
|
||||
except ValueError:
|
||||
found_quantization_type = "ERROR"
|
||||
|
||||
@ -74,115 +74,116 @@ def test_update_config():
|
||||
new_config3 = update_config(config3, {"a": "new_value"})
|
||||
|
||||
|
||||
# Can remove once --task option is fully deprecated
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_task"),
|
||||
("model_id", "expected_runner_type", "expected_convert_type",
|
||||
"expected_task"),
|
||||
[
|
||||
("distilbert/distilgpt2", "generate", "generate"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "embed"),
|
||||
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "reward"),
|
||||
("openai/whisper-small", "generate", "transcription"),
|
||||
("distilbert/distilgpt2", "generate", "none", "generate"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "none", "embed"),
|
||||
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none",
|
||||
"classify"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none", "reward"),
|
||||
("openai/whisper-small", "generate", "none", "transcription"),
|
||||
],
|
||||
)
|
||||
def test_auto_task(model_id, expected_runner_type, expected_task):
|
||||
config = ModelConfig(
|
||||
model_id,
|
||||
task="auto",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
)
|
||||
def test_auto_task(model_id, expected_runner_type, expected_convert_type,
|
||||
expected_task):
|
||||
config = ModelConfig(model_id, task="auto")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.convert_type == expected_convert_type
|
||||
assert expected_task in config.supported_tasks
|
||||
|
||||
if config.runner_type == "pooling":
|
||||
assert config.task == expected_task
|
||||
else:
|
||||
assert expected_task in config.supported_tasks
|
||||
|
||||
# Can remove once --task option is fully deprecated
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_convert_type",
|
||||
"expected_task"),
|
||||
[
|
||||
("distilbert/distilgpt2", "pooling", "embed", "embed"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "embed", "embed"),
|
||||
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify",
|
||||
"classify"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed", "embed"),
|
||||
("openai/whisper-small", "pooling", "embed", "embed"),
|
||||
],
|
||||
)
|
||||
def test_score_task(model_id, expected_runner_type, expected_convert_type,
|
||||
expected_task):
|
||||
config = ModelConfig(model_id, task="score")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.convert_type == expected_convert_type
|
||||
assert expected_task in config.supported_tasks
|
||||
|
||||
|
||||
# Can remove once --task option is fully deprecated
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_convert_type",
|
||||
"expected_task"),
|
||||
[
|
||||
("openai/whisper-small", "generate", "none", "transcription"),
|
||||
],
|
||||
)
|
||||
def test_transcription_task(model_id, expected_runner_type,
|
||||
expected_convert_type, expected_task):
|
||||
config = ModelConfig(model_id, task="transcription")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.convert_type == expected_convert_type
|
||||
assert expected_task in config.supported_tasks
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_task"),
|
||||
("model_id", "expected_runner_type", "expected_convert_type"),
|
||||
[
|
||||
("distilbert/distilgpt2", "generate", "none"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "none"),
|
||||
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
|
||||
("openai/whisper-small", "generate", "none"),
|
||||
],
|
||||
)
|
||||
def test_auto_runner(model_id, expected_runner_type, expected_convert_type):
|
||||
config = ModelConfig(model_id, runner="auto")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.convert_type == expected_convert_type
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_convert_type"),
|
||||
[
|
||||
("distilbert/distilgpt2", "pooling", "embed"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "embed"),
|
||||
("intfloat/multilingual-e5-small", "pooling", "none"),
|
||||
("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed"),
|
||||
("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
|
||||
("openai/whisper-small", "pooling", "embed"),
|
||||
],
|
||||
)
|
||||
def test_score_task(model_id, expected_runner_type, expected_task):
|
||||
config = ModelConfig(
|
||||
model_id,
|
||||
task="score",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
)
|
||||
def test_pooling_runner(model_id, expected_runner_type, expected_convert_type):
|
||||
config = ModelConfig(model_id, runner="pooling")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.task == expected_task
|
||||
|
||||
|
||||
@pytest.mark.parametrize(("model_id", "expected_runner_type", "expected_task"),
|
||||
[
|
||||
("Qwen/Qwen2.5-1.5B-Instruct", "draft", "auto"),
|
||||
])
|
||||
def test_draft_task(model_id, expected_runner_type, expected_task):
|
||||
config = ModelConfig(
|
||||
model_id,
|
||||
runner="draft",
|
||||
tokenizer=model_id,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
)
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.task == expected_task
|
||||
assert config.convert_type == expected_convert_type
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
("model_id", "expected_runner_type", "expected_task"),
|
||||
("model_id", "expected_runner_type", "expected_convert_type"),
|
||||
[
|
||||
("openai/whisper-small", "generate", "transcription"),
|
||||
("Qwen/Qwen2.5-1.5B-Instruct", "draft", "none"),
|
||||
],
|
||||
)
|
||||
def test_transcription_task(model_id, expected_runner_type, expected_task):
|
||||
config = ModelConfig(
|
||||
model_id,
|
||||
task="transcription",
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
)
|
||||
def test_draft_runner(model_id, expected_runner_type, expected_convert_type):
|
||||
config = ModelConfig(model_id, runner="draft")
|
||||
|
||||
assert config.runner_type == expected_runner_type
|
||||
assert config.task == expected_task
|
||||
|
||||
|
||||
@pytest.mark.parametrize(("model_id", "bad_task"), [
|
||||
("Qwen/Qwen2.5-Math-RM-72B", "generate"),
|
||||
("Qwen/Qwen3-0.6B", "transcription"),
|
||||
])
|
||||
def test_incorrect_task(model_id, bad_task):
|
||||
with pytest.raises(ValueError, match=r"does not support task=.*"):
|
||||
ModelConfig(
|
||||
model_id,
|
||||
task=bad_task,
|
||||
tokenizer=model_id,
|
||||
tokenizer_mode="auto",
|
||||
trust_remote_code=False,
|
||||
seed=0,
|
||||
dtype="float16",
|
||||
)
|
||||
assert config.convert_type == expected_convert_type
|
||||
|
||||
|
||||
MODEL_IDS_EXPECTED = [
@@ -195,17 +196,7 @@ MODEL_IDS_EXPECTED = [
@pytest.mark.parametrize("model_id_expected", MODEL_IDS_EXPECTED)
def test_disable_sliding_window(model_id_expected):
    model_id, expected = model_id_expected
    model_config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
        disable_sliding_window=True,
    )
    model_config = ModelConfig(model_id, disable_sliding_window=True)
    assert model_config.max_model_len == expected


@@ -214,16 +205,7 @@ def test_get_sliding_window():
    # Test that the sliding window is correctly computed.
    # For Qwen1.5/Qwen2, get_sliding_window() should be None
    # when use_sliding_window is False.
    qwen2_model_config = ModelConfig(
        "Qwen/Qwen1.5-7B",
        task="auto",
        tokenizer="Qwen/Qwen1.5-7B",
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )
    qwen2_model_config = ModelConfig("Qwen/Qwen1.5-7B")

    qwen2_model_config.hf_config.use_sliding_window = False
    qwen2_model_config.hf_config.sliding_window = TEST_SLIDING_WINDOW
@@ -232,16 +214,7 @@ def test_get_sliding_window():
    qwen2_model_config.hf_config.use_sliding_window = True
    assert qwen2_model_config.get_sliding_window() == TEST_SLIDING_WINDOW

    mistral_model_config = ModelConfig(
        "mistralai/Mistral-7B-v0.1",
        task="auto",
        tokenizer="mistralai/Mistral-7B-v0.1",
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )
    mistral_model_config = ModelConfig("mistralai/Mistral-7B-v0.1")
    mistral_model_config.hf_config.sliding_window = None
    assert mistral_model_config.get_sliding_window() is None

@@ -253,16 +226,7 @@ def test_get_sliding_window():
                    reason="Xformers backend is not supported on ROCm.")
def test_get_pooling_config():
    model_id = "sentence-transformers/all-MiniLM-L12-v2"
    model_config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )
    model_config = ModelConfig(model_id)

    pooling_config = model_config._init_pooler_config()
    assert pooling_config is not None
@@ -275,14 +239,7 @@ def test_get_pooling_config():
                    reason="Xformers backend is not supported on ROCm.")
def test_get_pooling_config_from_args():
    model_id = "sentence-transformers/all-MiniLM-L12-v2"
    model_config = ModelConfig(model_id,
                               task="auto",
                               tokenizer=model_id,
                               tokenizer_mode="auto",
                               trust_remote_code=False,
                               seed=0,
                               dtype="float16",
                               revision=None)
    model_config = ModelConfig(model_id)

    override_pooler_config = PoolerConfig(pooling_type='CLS', normalize=True)
    model_config.override_pooler_config = override_pooler_config
@@ -295,16 +252,8 @@ def test_get_pooling_config_from_args():
@pytest.mark.skipif(current_platform.is_rocm(),
                    reason="Xformers backend is not supported on ROCm.")
def test_get_bert_tokenization_sentence_transformer_config():
    bge_model_config = ModelConfig(
        model="BAAI/bge-base-en-v1.5",
        task="auto",
        tokenizer="BAAI/bge-base-en-v1.5",
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )
    model_id = "BAAI/bge-base-en-v1.5"
    bge_model_config = ModelConfig(model_id)

    bert_bge_model_config = bge_model_config._get_encoder_config()

@@ -317,27 +266,13 @@ def test_rope_customization():
    TEST_ROPE_THETA = 16_000_000.0
    LONGCHAT_ROPE_SCALING = {"rope_type": "linear", "factor": 8.0}

    llama_model_config = ModelConfig(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        task="auto",
        tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
    )
    llama_model_config = ModelConfig("meta-llama/Meta-Llama-3-8B-Instruct")
    assert getattr(llama_model_config.hf_config, "rope_scaling", None) is None
    assert getattr(llama_model_config.hf_config, "rope_theta", None) == 500_000
    assert llama_model_config.max_model_len == 8192

    llama_model_config = ModelConfig(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        task="auto",
        tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
        hf_overrides={
            "rope_scaling": TEST_ROPE_SCALING,
            "rope_theta": TEST_ROPE_THETA,
@@ -349,15 +284,7 @@ def test_rope_customization():
                   None) == TEST_ROPE_THETA
    assert llama_model_config.max_model_len == 16384

    longchat_model_config = ModelConfig(
        "lmsys/longchat-13b-16k",
        task="auto",
        tokenizer="lmsys/longchat-13b-16k",
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
    )
    longchat_model_config = ModelConfig("lmsys/longchat-13b-16k")
    # Check if LONGCHAT_ROPE_SCALING entries are in longchat_model_config
    assert all(
        longchat_model_config.hf_config.rope_scaling.get(key) == value
@@ -366,12 +293,6 @@ def test_rope_customization():

    longchat_model_config = ModelConfig(
        "lmsys/longchat-13b-16k",
        task="auto",
        tokenizer="lmsys/longchat-13b-16k",
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
        hf_overrides={
            "rope_scaling": TEST_ROPE_SCALING,
        },
@@ -390,15 +311,7 @@ def test_rope_customization():
    ("meta-llama/Llama-3.2-11B-Vision", True),
])
def test_is_encoder_decoder(model_id, is_encoder_decoder):
    config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
    )
    config = ModelConfig(model_id)

    assert config.is_encoder_decoder == is_encoder_decoder

@@ -408,15 +321,7 @@ def test_is_encoder_decoder(model_id, is_encoder_decoder):
    ("Qwen/Qwen2-VL-2B-Instruct", True),
])
def test_uses_mrope(model_id, uses_mrope):
    config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        dtype="float16",
        seed=0,
    )
    config = ModelConfig(model_id)

    assert config.uses_mrope == uses_mrope

@@ -426,26 +331,12 @@ def test_generation_config_loading():

    # When set generation_config to "vllm", the default generation config
    # will not be loaded.
    model_config = ModelConfig(model_id,
                               task="auto",
                               tokenizer=model_id,
                               tokenizer_mode="auto",
                               trust_remote_code=False,
                               seed=0,
                               dtype="float16",
                               generation_config="vllm")
    model_config = ModelConfig(model_id, generation_config="vllm")
    assert model_config.get_diff_sampling_param() == {}

    # When set generation_config to "auto", the default generation config
    # should be loaded.
    model_config = ModelConfig(model_id,
                               task="auto",
                               tokenizer=model_id,
                               tokenizer_mode="auto",
                               trust_remote_code=False,
                               seed=0,
                               dtype="float16",
                               generation_config="auto")
    model_config = ModelConfig(model_id, generation_config="auto")

    correct_generation_config = {
        "repetition_penalty": 1.1,
@@ -461,12 +352,6 @@ def test_generation_config_loading():

    model_config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        generation_config="auto",
        override_generation_config=override_generation_config)

@@ -479,12 +364,6 @@ def test_generation_config_loading():
    # is set, the override_generation_config should be used directly.
    model_config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        generation_config="vllm",
        override_generation_config=override_generation_config)

@@ -515,16 +394,7 @@ def test_load_config_pt_load_map_location(pt_load_map_location):
def test_get_and_verify_max_len(model_id, max_model_len, expected_max_len,
                                should_raise):
    """Test get_and_verify_max_len with different configurations."""
    model_config = ModelConfig(
        model_id,
        task="auto",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )
    model_config = ModelConfig(model_id)

    if should_raise:
        with pytest.raises(ValueError):

@@ -21,13 +21,8 @@ def test_max_tokens_none():
def model_config():
    return ModelConfig(
        MODEL_NAME,
        task="auto",
        tokenizer=MODEL_NAME,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        dtype="float16",
        revision=None,
    )


@@ -695,11 +695,7 @@ def test_estimate_max_model_len(model_id, max_model_len,
    # Create a VllmConfig
    model_config = ModelConfig(
        model_id,
        task="generate",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        runner="generate",
        dtype="float16",
        max_model_len=max_model_len,
    )
@@ -733,11 +729,7 @@ def test_get_max_concurrency_for_kv_cache_config():
    max_model_len = 16384
    model_config = ModelConfig(
        model_id,
        task="generate",
        tokenizer=model_id,
        tokenizer_mode="auto",
        trust_remote_code=False,
        seed=0,
        runner="generate",
        dtype="float16",
        max_model_len=max_model_len,
    )

@@ -1248,9 +1248,6 @@ def create_scheduler_with_priority(
    )
    model_config = ModelConfig(
        model=model,
        task="auto",
        tokenizer=model,
        tokenizer_mode="auto",
        trust_remote_code=True,
        dtype="float16",
        seed=42,

@@ -59,9 +59,6 @@ def create_scheduler(
    )
    model_config = ModelConfig(
        model=model,
        task="auto",
        tokenizer=model,
        tokenizer_mode="auto",
        trust_remote_code=True,
        dtype="float16",
        seed=42,

@@ -68,9 +68,6 @@ def create_vllm_config(
    )
    model_config = ModelConfig(
        model=model,
        task="auto",
        tokenizer=model,
        tokenizer_mode="auto",
        trust_remote_code=True,
        dtype="float16",
        seed=42,

@@ -24,13 +24,8 @@ eagle3_dir = "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"

def _create_proposer(method: str, k: int) -> EagleProposer:
    model_config = ModelConfig(model=model_dir,
                               task="generate",
                               max_model_len=100,
                               tokenizer=model_dir,
                               tokenizer_mode="auto",
                               dtype="auto",
                               seed=None,
                               trust_remote_code=False)
                               runner="generate",
                               max_model_len=100)

    # Choose model directory based on method
    draft_model_dir = eagle_dir if method == "eagle" else eagle3_dir

@@ -44,14 +44,7 @@ def test_ngram_proposer():

    def ngram_proposer(min_n: int, max_n: int, k: int) -> NgramProposer:
        # Dummy model config. Just to set max_model_len.
        model_config = ModelConfig(model="facebook/opt-125m",
                                   task="generate",
                                   max_model_len=100,
                                   tokenizer="facebook/opt-125m",
                                   tokenizer_mode="auto",
                                   dtype="auto",
                                   seed=None,
                                   trust_remote_code=False)
        model_config = ModelConfig(model="facebook/opt-125m")
        return NgramProposer(
            vllm_config=VllmConfig(model_config=model_config,
                                   speculative_config=SpeculativeConfig.

@@ -26,10 +26,6 @@ def get_vllm_config():
    )
    model_config = ModelConfig(
        model="facebook/opt-125m",
        task="generate",
        tokenizer="facebook/opt-125m",
        tokenizer_mode="auto",
        trust_remote_code=True,
        dtype="bfloat16",  # TPUs typically use bfloat16
        seed=42,
    )

@@ -76,10 +76,6 @@ def get_vllm_config():
    )
    model_config = ModelConfig(
        model="facebook/opt-125m",
        task="generate",
        tokenizer="facebook/opt-125m",
        tokenizer_mode="auto",
        trust_remote_code=True,
        dtype="float16",
        seed=42,
    )

530 vllm/config.py
@@ -26,7 +26,7 @@ from pydantic import (ConfigDict, SkipValidation, TypeAdapter, field_validator,
from pydantic.dataclasses import dataclass
from safetensors.torch import _TYPES as _SAFETENSORS_TO_TORCH_DTYPE
from torch.distributed import ProcessGroup, ReduceOp
from typing_extensions import Self, runtime_checkable
from typing_extensions import Self, assert_never, runtime_checkable

import vllm.envs as envs
from vllm import version
@@ -102,12 +102,63 @@ RunnerOption = Literal["auto", "generate", "pooling", "draft"]

RunnerType = Literal["generate", "pooling", "draft"]

_RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = {
ConvertOption = Literal["auto", "none", "embed", "classify", "reward"]

ConvertType = Literal["none", "embed", "classify", "reward"]

_RUNNER_TASKS: dict[RunnerType, list[TaskOption]] = {
    "generate": ["generate", "transcription"],
    "pooling": ["encode", "embed", "classify", "reward"],
    "pooling": ["embedding", "embed", "classify", "score", "reward"],
    "draft": ["draft"],
}

_RUNNER_CONVERTS: dict[RunnerType, list[ConvertType]] = {
    "generate": [],
    "pooling": ["embed", "classify", "reward"],
    "draft": [],
}

# Some model suffixes are based on auto classes from Transformers:
# https://huggingface.co/docs/transformers/en/model_doc/auto
# NOTE: Items higher on this list take priority over lower ones
_SUFFIX_TO_DEFAULTS: list[tuple[str, tuple[RunnerType, ConvertType]]] = [
    ("ForCausalLM", ("generate", "none")),
    ("ForConditionalGeneration", ("generate", "none")),
    ("ChatModel", ("generate", "none")),
    ("LMHeadModel", ("generate", "none")),
    ("ForTextEncoding", ("pooling", "embed")),
    ("EmbeddingModel", ("pooling", "embed")),
    ("ForSequenceClassification", ("pooling", "classify")),
    ("ForAudioClassification", ("pooling", "classify")),
    ("ForImageClassification", ("pooling", "classify")),
    ("ForVideoClassification", ("pooling", "classify")),
    ("ClassificationModel", ("pooling", "classify")),
    ("ForRewardModeling", ("pooling", "reward")),
    ("RewardModel", ("pooling", "reward")),
    # Let other `*Model`s take priority
    ("Model", ("pooling", "embed")),
]


def iter_architecture_defaults():
    yield from _SUFFIX_TO_DEFAULTS


def try_match_architecture_defaults(
    architecture: str,
    *,
    runner_type: Optional[RunnerType] = None,
    convert_type: Optional[ConvertType] = None,
) -> Optional[tuple[str, tuple[RunnerType, ConvertType]]]:
    for suffix, (default_runner_type,
                 default_convert_type) in iter_architecture_defaults():
        if ((runner_type is None or runner_type == default_runner_type) and
            (convert_type is None or convert_type == default_convert_type)
                and architecture.endswith(suffix)):
            return suffix, (default_runner_type, default_convert_type)

    return None

@runtime_checkable
class SupportsHash(Protocol):
@@ -236,11 +287,16 @@ class ModelConfig:
    runner: RunnerOption = "auto"
    """The type of model runner to use. Each vLLM instance only supports one
    model runner, even if the same model can be used for multiple types."""
    task: TaskOption = "auto"
    """The task to use the model for. If the model supports more than one
    model runner, this is used to select which model runner to run.
    convert: ConvertOption = "auto"
    """Convert the model using adapters defined in
    [vllm.model_executor.models.adapters][]. The most common use case is to
    adapt a text generation model to be used for pooling tasks."""
    task: Optional[TaskOption] = None
    """[DEPRECATED] The task to use the model for. If the model supports more
    than one model runner, this is used to select which model runner to run.

    Note that the model may support other tasks using the same model runner."""
    Note that the model may support other tasks using the same model runner.
    """
    tokenizer: SkipValidation[str] = None  # type: ignore
    """Name or path of the Hugging Face tokenizer to use. If unspecified, model
    name or path will be used."""
@@ -558,48 +614,103 @@ class ModelConfig:
        self.hf_image_processor_config = get_hf_image_processor_config(
            self.model, hf_token=self.hf_token, revision=self.revision)

        # For pooling models, self.task is used to indicate the
        # user-selected task
        if self.task == "score":
            if self._is_classify_task(self.architectures):
                self.task = "classify"
        architectures = self.architectures
        registry = self.registry
        is_generative_model = registry.is_text_generation_model(
            architectures, self)
        is_pooling_model = registry.is_pooling_model(architectures, self)

        def _task_to_convert(task: TaskOption) -> ConvertType:
            if task == "embedding" or task == "embed":
                return "embed"
            if task == "classify":
                return "classify"
            if task == "reward":
                return "reward"
            if task == "score":
                new_task = self._get_default_pooling_task(architectures)
                return "classify" if new_task == "classify" else "embed"

            return "none"

        if self.task is not None:
            runner: RunnerOption = "auto"
            convert: ConvertOption = "auto"
            msg_prefix = ("The 'task' option has been deprecated and will be "
                          "removed in v0.13.0 or v1.0, whichever comes first.")
            msg_hint = "Please remove this option."

            is_generative_task = self.task in _RUNNER_TASKS["generate"]
            is_pooling_task = self.task in _RUNNER_TASKS["pooling"]

            if is_generative_model and is_pooling_model:
                if is_generative_task:
                    runner = "generate"
                    convert = "auto"
                    msg_hint = ("Please replace this option with `--runner "
                                "generate` to continue using this model "
                                "as a generative model.")
                elif is_pooling_task:
                    runner = "pooling"
                    convert = "auto"
                    msg_hint = ("Please replace this option with `--runner "
                                "pooling` to continue using this model "
                                "as a pooling model.")
                else:  # task == "auto"
                    pass
            elif is_generative_model or is_pooling_model:
                if is_generative_task:
                    runner = "generate"
                    convert = "auto"
                    msg_hint = "Please remove this option"
                elif is_pooling_task:
                    runner = "pooling"
                    convert = _task_to_convert(self.task)
                    msg_hint = ("Please replace this option with `--convert "
                                f"{convert}` to continue using this model "
                                "as a pooling model.")
                else:  # task == "auto"
                    pass
            else:
                self.task = "embed"
        elif self.task == "embedding":
            msg = ("The 'embedding' task has been renamed to 'embed', please "
                   "use the new name. The old name will be removed in v1.0.")
                raise AssertionError("The model should be a generative or "
                                     "pooling model when task is set to "
                                     f"{self.task!r}.")

            self.runner = runner
            self.convert = convert

            msg = f"{msg_prefix} {msg_hint}"
            warnings.warn(msg, DeprecationWarning, stacklevel=2)

            self.task = "embed"
        self.runner_type = self._get_runner_type(architectures, self.runner)
        self.convert_type = self._get_convert_type(architectures,
                                                   self.runner_type,
                                                   self.convert)

        model_info, arch = self.registry.inspect_model_cls(self.architectures)
        if self.runner_type == "generate" and not is_generative_model:
            generate_converts = _RUNNER_CONVERTS["generate"]
            if self.convert_type not in generate_converts:
                # Currently we don't have any converters for generative models
                raise ValueError(
                    "This model does not support `--runner generate`.")
        if self.runner_type == "pooling" and not is_pooling_model:
            pooling_converts = _RUNNER_CONVERTS["pooling"]
            if self.convert_type not in pooling_converts:
                convert_option = "<" + "|".join(pooling_converts) + ">"
                raise ValueError(
                    "This model does not support `--runner pooling`. "
                    f"You can pass `--convert {convert_option}` to adapt "
                    "it into a pooling model.")
|
||||
|
||||
self.supported_tasks = self._get_supported_tasks(
|
||||
architectures, self.runner_type, self.convert_type)
|
||||
|
||||
# Note: Initialize these attributes early because transformers fallback
|
||||
# may fail to load dynamic modules in child processes
|
||||
model_info, arch = registry.inspect_model_cls(architectures, self)
|
||||
self._model_info = model_info
|
||||
self._architecture = arch
|
||||
|
||||
all_supported_tasks = self._get_supported_tasks(self.task)
|
||||
logger.debug("Tasks supported by runner type: %s", all_supported_tasks)
|
||||
supported_runner_types = self._get_supported_runner_types(
|
||||
all_supported_tasks)
|
||||
runner_type = self._resolve_runner(self.runner, self.task,
|
||||
supported_runner_types,
|
||||
all_supported_tasks)
|
||||
|
||||
logger.debug("Selected runner type: %s", runner_type)
|
||||
# For pooling models, self.task is used to indicate the
|
||||
# user-selected task
|
||||
if runner_type == "pooling" and self.task == "auto":
|
||||
selected_task = all_supported_tasks[runner_type][-1]
|
||||
assert selected_task != "encode"
|
||||
self.task = selected_task
|
||||
self.supported_runner_types = supported_runner_types
|
||||
self.runner_type = runner_type
|
||||
self.supported_tasks = all_supported_tasks[runner_type]
|
||||
|
||||
if self.runner_type in ("draft",
|
||||
"generate") and self.task != "transcription":
|
||||
self.truncation_side = "left"
|
||||
else:
|
||||
self.truncation_side = "right"
|
||||
logger.info("Resolved architecture: %s", arch)
|
||||
|
||||
self.pooler_config = self._init_pooler_config()
|
||||
|
||||
@ -652,16 +763,10 @@ class ModelConfig:
|
||||
self.original_max_model_len = self.max_model_len
|
||||
self.max_model_len = self.get_and_verify_max_len(self.max_model_len)
|
||||
self.multimodal_config = self._init_multimodal_config()
|
||||
self.model_supports_multimodal_raw_input = (
|
||||
self.registry.supports_multimodal_raw_input(self.architectures))
|
||||
|
||||
if not self.skip_tokenizer_init:
|
||||
self._verify_tokenizer_mode()
|
||||
|
||||
self.is_attention_free = self._init_attention_free()
|
||||
self.is_hybrid = self._init_is_hybrid()
|
||||
self.has_noops = self._init_has_noops()
|
||||
self.has_inner_state = self._init_has_inner_state()
|
||||
|
||||
if (not current_platform.is_neuron() and self.override_neuron_config):
|
||||
raise ValueError(
|
||||
"`override_neuron_config` is only supported on Neuron.")
|
||||
@ -702,30 +807,13 @@ class ModelConfig:
|
||||
|
||||
@property
|
||||
def architectures(self) -> list[str]:
|
||||
# architectures in the model config.
|
||||
architectures = getattr(self.hf_config, "architectures", [])
|
||||
# The registry assumes that it can always inspect the vLLM model class
|
||||
# for a given architecture. This assumption breaks down for the
|
||||
# Transformers backend, which may use a different class depending on
|
||||
# the model type. To work around this, we add the correct Transformers
|
||||
# backend class to the architectures list. We must do this here because
|
||||
# we need access to the `hf_config` to determine the backend class.
|
||||
transformers_backend_cls = self._get_transformers_backend_cls()
|
||||
if (self.model_impl != ModelImpl.VLLM.value
|
||||
and all(arch != transformers_backend_cls
|
||||
for arch in architectures)):
|
||||
architectures.append(transformers_backend_cls)
|
||||
return architectures
|
||||
return getattr(self.hf_config, "architectures", [])
|
||||
|
||||
@property
|
||||
def architecture(self) -> str:
|
||||
# The architecture vllm actually used.
|
||||
"""The architecture vllm actually used."""
|
||||
return self._architecture
|
||||
|
||||
@property
|
||||
def model_info(self):
|
||||
return self._model_info
|
||||
|
||||
def maybe_pull_model_tokenizer_for_s3(self, model: str,
|
||||
tokenizer: str) -> None:
|
||||
"""Pull model/tokenizer from S3 to temporary directory when needed.
|
||||
@ -763,7 +851,7 @@ class ModelConfig:
|
||||
self.tokenizer = s3_tokenizer.dir
|
||||
|
||||
def _init_multimodal_config(self) -> Optional["MultiModalConfig"]:
|
||||
if self.registry.is_multimodal_model(self.architectures):
|
||||
if self.registry.is_multimodal_model(self.architectures, self):
|
||||
return MultiModalConfig(
|
||||
limit_per_prompt=self.limit_mm_per_prompt,
|
||||
media_io_kwargs=self.media_io_kwargs,
|
||||
@ -819,19 +907,6 @@ class ModelConfig:
|
||||
|
||||
return None
|
||||
|
||||
def _init_attention_free(self) -> bool:
|
||||
return self.registry.is_attention_free_model(self.architectures)
|
||||
|
||||
def _init_is_hybrid(self) -> bool:
|
||||
return self.registry.is_hybrid_model(self.architectures)
|
||||
|
||||
def _init_has_noops(self) -> bool:
|
||||
architectures = getattr(self.hf_config, "architectures", [])
|
||||
return self.registry.is_noops_model(architectures)
|
||||
|
||||
def _init_has_inner_state(self) -> bool:
|
||||
return self.registry.model_has_inner_state(self.architectures)
|
||||
|
||||
def _verify_tokenizer_mode(self) -> None:
|
||||
tokenizer_mode = cast(TokenizerMode, self.tokenizer_mode.lower())
|
||||
if tokenizer_mode not in get_args(TokenizerMode):
|
||||
@ -840,155 +915,168 @@ class ModelConfig:
                f"one of {get_args(TokenizerMode)}.")
        self.tokenizer_mode = tokenizer_mode

    def _is_classify_task(self, architectures: list[str]):
        for arch in architectures:
            if arch.endswith("ForSequenceClassification"):
                return True
        return self.registry.is_cross_encoder_model(architectures)

    def _get_preferred_pooling_task(
    def _get_default_runner_type(
        self,
        architectures: list[str],
    ) -> _ResolvedTask:
        model_id = self.model
        if get_pooling_config(model_id, self.revision):
    ) -> RunnerType:
        registry = self.registry

        # Some Sentence Transformers models use *ForCausalLM archs
        if get_pooling_config(self.model, self.revision):
            return "pooling"

        for arch in architectures:
            if arch in registry.get_supported_archs():
        if registry.is_pooling_model(architectures, self):
            return "pooling"
        if registry.is_text_generation_model(architectures, self):
            return "generate"

            match = try_match_architecture_defaults(arch)
            if match:
                _, (runner_type, _) = match
                return runner_type

        return "generate"

    def _get_runner_type(
        self,
        architectures: list[str],
        runner: RunnerOption,
    ) -> RunnerType:
        if runner != "auto":
            return runner

        runner_type = self._get_default_runner_type(architectures)

        logger.info(
            "Resolved `--runner auto` to `--runner %s`. "
            "Pass the value explicitly to silence this message.", runner_type)

        return runner_type

    def _get_default_convert_type(
        self,
        architectures: list[str],
        runner_type: RunnerType,
    ) -> ConvertType:
        registry = self.registry

        for arch in architectures:
            if arch in registry.get_supported_archs():
                if (runner_type == "generate"
                        and registry.is_text_generation_model(
                            architectures, self)):
                    return "none"
                if (runner_type == "pooling"
                        and registry.is_pooling_model(architectures, self)):
                    return "none"

            match = try_match_architecture_defaults(arch,
                                                    runner_type=runner_type)
            if match:
                _, (_, convert_type) = match
                return convert_type

        # This is to handle Sentence Transformers models that use *ForCausalLM
        # and also multi-modal pooling models which are not defined as
        # Sentence Transformers models
        if runner_type == "pooling":
            return "embed"
        if self.registry.is_transcription_model(architectures):
            return "transcription"

        suffix_to_preferred_task: list[tuple[str, _ResolvedTask]] = [
            # Other models follow this pattern
            ("EmbeddingModel", "embed"),
            ("RewardModel", "reward"),
        ]
        return "none"

        for suffix, pref_task in suffix_to_preferred_task:
            if self.architecture.endswith(suffix):
                return pref_task
    def _get_convert_type(
        self,
        architectures: list[str],
        runner_type: RunnerType,
        convert: ConvertOption,
    ) -> ConvertType:
        if convert != "auto":
            return convert

        return "embed"
        convert_type = self._get_default_convert_type(architectures,
                                                      runner_type)

        logger.info(
            "Resolved `--convert auto` to `--convert %s`. "
            "Pass the value explicitly to silence this message.", convert_type)

        return convert_type

    def _get_supported_generation_tasks(
        self,
        task_option: TaskOption,
        architectures: list[str],
        convert_type: ConvertType,
    ) -> list[_ResolvedTask]:
        registry = self.registry
        architectures = self.architectures

        if registry.is_transcription_only_model(architectures):
        if registry.is_transcription_only_model(architectures, self):
            return ["transcription"]

        # TODO: Use get_supported_generation_tasks once V0 is removed
        supported_tasks = list[_ResolvedTask]()
        if registry.is_text_generation_model(architectures):
        if (registry.is_text_generation_model(architectures, self)
                or convert_type in _RUNNER_CONVERTS["generate"]):
            supported_tasks.append("generate")

        if registry.is_transcription_model(architectures):
            supported_tasks.append("transcription")
        if registry.is_transcription_model(architectures, self):
            supported_tasks.append("transcription")

        return supported_tasks

    def _get_default_pooling_task(
        self,
        architectures: list[str],
    ) -> Literal["embed", "classify", "reward"]:
        if self.registry.is_cross_encoder_model(architectures, self):
            return "classify"

        for arch in architectures:
            match = try_match_architecture_defaults(arch,
                                                    runner_type="pooling")
            if match:
                _, (_, convert_type) = match
                assert convert_type != "none"
                return convert_type

        return "embed"

    def _get_supported_pooling_tasks(
        self,
        task_option: TaskOption,
        architectures: list[str],
        convert_type: ConvertType,
    ) -> list[_ResolvedTask]:
        registry = self.registry
        architectures = self.architectures

        # TODO: Use get_supported_pooling_tasks once V0 is removed
        supported_tasks = list[_ResolvedTask]()
        if registry.is_pooling_model(architectures):
        if (registry.is_pooling_model(architectures, self)
                or convert_type in _RUNNER_CONVERTS["pooling"]):
            supported_tasks.append("encode")

            # For now, users must specify the task (other than "pooling")
            # to use for pooling models
            if task_option == "auto":
                preferred_task = self._get_preferred_pooling_task(
                    architectures)

                supported_tasks.append(preferred_task)
            elif task_option in _RUNNER_TASKS["pooling"]:
                supported_tasks.append(cast(_ResolvedTask, task_option))
            extra_task = (self._get_default_pooling_task(architectures)
                          if convert_type == "none" else convert_type)
            supported_tasks.append(extra_task)

        return supported_tasks

    def _get_supported_tasks(
        self,
        task_option: TaskOption,
    ) -> dict[RunnerType, list[_ResolvedTask]]:
        if self._is_classify_task(self.architectures):
            return {"generate": [], "pooling": ["classify"], "draft": []}
        else:
            return {
                "generate": self._get_supported_generation_tasks(task_option),
                "pooling": self._get_supported_pooling_tasks(task_option),
                "draft": ["draft"]
            }
        architectures: list[str],
        runner_type: RunnerType,
        convert_type: ConvertType,
    ) -> list[_ResolvedTask]:
        if runner_type == "generate":
            return self._get_supported_generation_tasks(
                architectures, convert_type)
        if runner_type == "pooling":
            return self._get_supported_pooling_tasks(architectures,
                                                     convert_type)
        if runner_type == "draft":
            return ["draft"]

    def _get_supported_runner_types(
        self,
        supported_tasks: dict[RunnerType, list[_ResolvedTask]],
    ) -> set[RunnerType]:
        return {
            runner
            for runner, runner_tasks in supported_tasks.items()
            if len(runner_tasks) > 0
        }

    def _resolve_runner(
        self,
        runner_option: RunnerOption,
        task_option: TaskOption,
        supported_runner_types: set[RunnerType],
        supported_tasks: dict[RunnerType, list[_ResolvedTask]],
    ) -> RunnerType:
        if not supported_runner_types:
            raise ValueError("This model does not support any model runners!")

        if runner_option != "auto":
            if runner_option not in supported_runner_types:
                raise ValueError(
                    f"This model does not support runner={runner_option!r}. "
                    f"Available runners: {supported_runner_types}")

            return runner_option

        if task_option != "auto":
            for runner, runner_tasks in supported_tasks.items():
                if task_option in runner_tasks:
                    return runner
            else:
                task_runner: RunnerType = next(
                    runner for runner, tasks in _RUNNER_TASKS.items()
                    if task_option in tasks)
                raise ValueError(
                    f"This model does not support task={task_option!r}. "
                    f"Available tasks for runner={task_runner!r}: "
                    f"{supported_tasks[task_runner]}")

        if "classify" in supported_tasks.get("pooling", []):
            # When multiple pooling tasks are present, default to
            # pooling (eg cross-encoder) for non-standard architectures.
            return "pooling"

        suffix_to_preferred_runner: list[tuple[str, RunnerType]] = [
            ("ForCausalLM", "generate"),
            ("ForConditionalGeneration", "generate"),
            ("ChatModel", "generate"),
            ("LMHeadModel", "generate"),
            ("EmbeddingModel", "pooling"),
            ("RewardModel", "pooling"),
        ]

        for suffix, pref_runner in suffix_to_preferred_runner:
            if self.architecture.endswith(
                    suffix) and pref_runner in supported_runner_types:
                return pref_runner

        if "generate" in supported_runner_types:
            return "generate"
        if "pooling" in supported_runner_types:
            return "pooling"

        raise AssertionError("This line should not be reached")
        assert_never(runner_type)

    def _parse_quant_hf_config(self):
        quant_cfg = getattr(self.hf_config, "quantization_config", None)
@@ -1216,7 +1304,8 @@ class ModelConfig:

        pipeline_parallel_size = parallel_config.pipeline_parallel_size
        if pipeline_parallel_size > 1:
            if not self.registry.is_pp_supported_model(self.architectures):
            if not self.registry.is_pp_supported_model(self.architectures,
                                                       self):
                raise NotImplementedError(
                    "Pipeline parallelism is not supported for this model. "
                    "Supported models implement the `SupportsPP` interface.")
@@ -1558,17 +1647,41 @@ class ModelConfig:

    @property
    def is_cross_encoder(self) -> bool:
        return self.task == "classify"
        return (self._model_info.supports_cross_encoding
                or self.convert_type == "classify")

    @property
    def is_pp_supported(self) -> bool:
        return self._model_info.supports_pp

    @property
    def is_multimodal_raw_input_supported(self) -> bool:
        return self._model_info.supports_multimodal_raw_input

    @property
    def is_attention_free(self) -> bool:
        return self._model_info.is_attention_free

    @property
    def is_hybrid(self) -> bool:
        return self._model_info.is_hybrid

    @property
    def has_noops(self) -> bool:
        return self._model_info.has_noops

    @property
    def has_inner_state(self):
        return self._model_info.has_inner_state

    @property
    def is_v1_compatible(self) -> bool:
        return not self._model_info.supports_v0_only

    @property
    def use_mla(self) -> bool:
        return self.is_deepseek_mla and not envs.VLLM_MLA_DISABLE

    @property
    def is_v1_compatible(self) -> bool:
        architectures = getattr(self.hf_config, "architectures", [])
        return me_models.ModelRegistry.is_v1_compatible(architectures)

    @property
    def is_matryoshka(self) -> bool:
        return (bool(getattr(self.hf_config, "matryoshka_dimensions", None))
@@ -4769,7 +4882,10 @@ class VllmConfig:
            self.scheduler_config.max_model_len = max_model_len

    def try_verify_and_update_config(self):
        architecture = getattr(self.model_config, "architecture", None)
        if self.model_config is None:
            return

        architecture = self.model_config.architecture
        if architecture is None:
            return
@@ -4782,7 +4898,7 @@ class VllmConfig:
        if self.model_config.is_hybrid:
            HybridAttentionMambaModelConfig.verify_and_update_config(self)

        if self.model_config.task == "classify":
        if self.model_config.convert_type == "classify":
            # Maybe convert ForCausalLM into ForSequenceClassification model.
            from vllm.model_executor.models.adapters import (
                SequenceClassificationConfig)
@@ -22,14 +22,15 @@ from typing_extensions import TypeIs

import vllm.envs as envs
from vllm.config import (BlockSize, CacheConfig, CacheDType, CompilationConfig,
                         ConfigFormat, ConfigType, DecodingConfig,
                         DetailedTraceModules, Device, DeviceConfig,
                         DistributedExecutorBackend, GuidedDecodingBackend,
                         GuidedDecodingBackendV1, HfOverrides, KVEventsConfig,
                         KVTransferConfig, LoadConfig, LogprobsMode,
                         LoRAConfig, ModelConfig, ModelDType, ModelImpl,
                         MultiModalConfig, ObservabilityConfig, ParallelConfig,
                         PoolerConfig, PrefixCachingHashAlgo, SchedulerConfig,
                         ConfigFormat, ConfigType, ConvertOption,
                         DecodingConfig, DetailedTraceModules, Device,
                         DeviceConfig, DistributedExecutorBackend,
                         GuidedDecodingBackend, GuidedDecodingBackendV1,
                         HfOverrides, KVEventsConfig, KVTransferConfig,
                         LoadConfig, LogprobsMode, LoRAConfig, ModelConfig,
                         ModelDType, ModelImpl, MultiModalConfig,
                         ObservabilityConfig, ParallelConfig, PoolerConfig,
                         PrefixCachingHashAlgo, RunnerOption, SchedulerConfig,
                         SchedulerPolicy, SpeculativeConfig, TaskOption,
                         TokenizerMode, VllmConfig, get_attr_docs, get_field)
from vllm.logger import init_logger
@@ -270,7 +271,9 @@ class EngineArgs:
                                       str, List[str]]] = ModelConfig.served_model_name
    tokenizer: Optional[str] = ModelConfig.tokenizer
    hf_config_path: Optional[str] = ModelConfig.hf_config_path
    task: TaskOption = ModelConfig.task
    runner: RunnerOption = ModelConfig.runner
    convert: ConvertOption = ModelConfig.convert
    task: Optional[TaskOption] = ModelConfig.task
    skip_tokenizer_init: bool = ModelConfig.skip_tokenizer_init
    enable_prompt_embeds: bool = ModelConfig.enable_prompt_embeds
    tokenizer_mode: TokenizerMode = ModelConfig.tokenizer_mode
@@ -461,7 +464,11 @@ class EngineArgs:
        )
        if not ('serve' in sys.argv[1:] and '--help' in sys.argv[1:]):
            model_group.add_argument("--model", **model_kwargs["model"])
        model_group.add_argument("--task", **model_kwargs["task"])
        model_group.add_argument("--runner", **model_kwargs["runner"])
        model_group.add_argument("--convert", **model_kwargs["convert"])
        model_group.add_argument("--task",
                                 **model_kwargs["task"],
                                 deprecated=True)
        model_group.add_argument("--tokenizer", **model_kwargs["tokenizer"])
        model_group.add_argument("--tokenizer-mode",
                                 **model_kwargs["tokenizer_mode"])
@@ -870,6 +877,8 @@ class EngineArgs:
        return ModelConfig(
            model=self.model,
            hf_config_path=self.hf_config_path,
            runner=self.runner,
            convert=self.convert,
            task=self.task,
            tokenizer=self.tokenizer,
            tokenizer_mode=self.tokenizer_mode,
@@ -20,8 +20,8 @@ from vllm.beam_search import (BeamSearchInstance, BeamSearchOutput,
                              create_sort_beams_key_function)
from vllm.config import (CompilationConfig, ModelDType, TokenizerMode,
                         is_init_field)
from vllm.engine.arg_utils import (EngineArgs, HfOverrides, PoolerConfig,
                                   TaskOption)
from vllm.engine.arg_utils import (ConvertOption, EngineArgs, HfOverrides,
                                   PoolerConfig, RunnerOption)
from vllm.engine.llm_engine import LLMEngine
from vllm.entrypoints.chat_utils import (ChatCompletionMessageParam,
                                         ChatTemplateContentFormatOption,
@@ -170,7 +170,8 @@ class LLM:
        self,
        model: str,
        *,
        task: TaskOption = "auto",
        runner: RunnerOption = "auto",
        convert: ConvertOption = "auto",
        tokenizer: Optional[str] = None,
        tokenizer_mode: TokenizerMode = "auto",
        skip_tokenizer_init: bool = False,
@@ -244,7 +245,8 @@ class LLM:

        engine_args = EngineArgs(
            model=model,
            task=task,
            runner=runner,
            convert=convert,
            tokenizer=tokenizer,
            tokenizer_mode=tokenizer_mode,
            skip_tokenizer_init=skip_tokenizer_init,
@@ -459,18 +461,10 @@ class LLM:
        model_config = self.llm_engine.model_config
        runner_type = model_config.runner_type
        if runner_type != "generate":
            messages = [
                "LLM.generate() is only supported for generative models."
            ]

            if "generate" in model_config.supported_runner_types:
                messages.append(
                    "Your model supports the 'generate' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task generate` or "
                    "`--task transcription`.")

            raise ValueError(" ".join(messages))
            raise ValueError(
                "LLM.generate() is only supported for generative models. "
                "Try passing `--runner generate` to use the model as a "
                "generative model.")

        if prompt_token_ids is not None:
            parsed_prompts = self._convert_v1_inputs(
@@ -497,7 +491,8 @@ class LLM:
        truncate_prompt_tokens = None
        if isinstance(sampling_params, SamplingParams):
            truncate_prompt_tokens = sampling_params.truncate_prompt_tokens
        _validate_truncation_size(self.llm_engine.model_config.max_model_len,

        _validate_truncation_size(model_config.max_model_len,
                                  truncate_prompt_tokens, tokenization_kwargs)

        # Add any modality specific loras to the corresponding prompts
@@ -1100,16 +1095,10 @@ class LLM:
        model_config = self.llm_engine.model_config
        runner_type = model_config.runner_type
        if runner_type != "pooling":
            messages = ["LLM.encode() is only supported for pooling models."]

            if "pooling" in model_config.supported_runner_types:
                messages.append(
                    "Your model supports the 'pooling' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task embed`, "
                    "`--task classify`, `--task score` etc.")

            raise ValueError(" ".join(messages))
            raise ValueError(
                "LLM.encode() is only supported for pooling models. "
                "Try passing `--runner pooling` to use the model as a "
                "pooling model.")

        if prompt_token_ids is not None:
            parsed_prompts = self._convert_v1_inputs(
@@ -1183,8 +1172,9 @@ class LLM:
            embedding vectors in the same order as the input prompts.
        """
        if "embed" not in self.supported_tasks:
            raise ValueError("Embedding API is not supported by this model. "
                             "Please set `--task embed`.")
            raise ValueError(
                "Embedding API is not supported by this model. "
                "Try converting the model using `--convert embed`.")

        items = self.encode(
            prompts,
@@ -1229,7 +1219,7 @@ class LLM:
        if "classify" not in self.supported_tasks:
            raise ValueError(
                "Classification API is not supported by this model. "
                "Please set `--task classify`.")
                "Try converting the model using `--convert classify`.")

        items = self.encode(
            prompts,
@@ -1283,27 +1273,26 @@ class LLM:
        use_tqdm: Union[bool, Callable[..., tqdm]] = True,
        lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
    ) -> list[ScoringRequestOutput]:
        model_config = self.llm_engine.model_config

        if isinstance(tokenizer, MistralTokenizer):
            raise ValueError(
                "Score API is only enabled for `--task embed or score`")
                "Score API is not supported for Mistral tokenizer")

        if len(data_1) == 1:
            data_1 = data_1 * len(data_2)

        pooling_params = PoolingParams(task="score")
        tokenization_kwargs: dict[str, Any] = {}
        _validate_truncation_size(self.llm_engine.model_config.max_model_len,

        _validate_truncation_size(model_config.max_model_len,
                                  truncate_prompt_tokens, tokenization_kwargs)

        parsed_prompts = []

        input_pairs = [(t1, t2) for t1, t2 in zip(data_1, data_2)]

        if self.llm_engine.model_config.is_multimodal_model:

        model_config = self.llm_engine.model_config

        if model_config.is_multimodal_model:
            for q, d in input_pairs:
                _, engine_prompt = get_score_prompt(
                    model_config=model_config,
@@ -1314,11 +1303,9 @@ class LLM:
                )

                parsed_prompts.append(engine_prompt)

        else:

            for q, t in input_pairs:
                if self.llm_engine.model_config.use_pad_token:
                if model_config.use_pad_token:
                    # cross_encoder models defaults to using pad_token.
                    prompt_inputs = tokenizer(
                        text=q,  # type: ignore[arg-type]
@@ -1396,23 +1383,18 @@ class LLM:
        model_config = self.llm_engine.model_config
        runner_type = model_config.runner_type
        if runner_type != "pooling":
            messages = ["LLM.score() is only supported for pooling models."]

            if "pooling" in model_config.supported_runner_types:
                messages.append(
                    "Your model supports the 'pooling' runner, but is "
                    f"currently initialized for the '{runner_type}' runner. "
                    "Please initialize vLLM using `--task embed`, "
                    "`--task classify`, `--task score` etc.")

            raise ValueError(" ".join(messages))
            raise ValueError(
                "LLM.score() is only supported for pooling models. "
                "Try passing `--runner pooling` to use the model as a "
                "pooling model.")

        supported_tasks = self.supported_tasks
        if all(t not in supported_tasks for t in ("embed", "classify")):
            raise ValueError("Score API is not supported by this model. "
                             "Please set `--task embed` or `--task classify`.")
                             "Try converting the model using "
                             "`--convert embed` or `--convert classify`.")

        if (model_config.task == "classify"
        if (model_config.is_cross_encoder
                and getattr(model_config.hf_config, "num_labels", 0) != 1):
            raise ValueError("Score API is only enabled for num_labels == 1.")
@@ -1421,15 +1403,14 @@ class LLM:
        # lists of tokens to the `text` and `text_pair` kwargs
        tokenizer = self.get_tokenizer()

        if not self.llm_engine.model_config.is_multimodal_model:
        if not model_config.is_multimodal_model:

            def check_data_type(data: Union[SingletonPrompt,
                                            Sequence[SingletonPrompt],
                                            ScoreMultiModalParam]):
                if isinstance(data, dict) and "content" in data:
                    raise ValueError(
                        f"ScoreMultiModalParam is not supported for {self.llm_engine.model_config.architecture}",  # noqa: E501
                    )
                    raise ValueError("ScoreMultiModalParam is not supported "
                                     f"for {model_config.architecture}")

            check_data_type(data_1)
            check_data_type(data_2)
@@ -1471,7 +1452,7 @@ class LLM:

        _validate_score_input_lens(data_1, data_2)  # type: ignore[arg-type]

        if self.llm_engine.model_config.is_cross_encoder:
        if model_config.is_cross_encoder:
            return self._cross_encoding_score(
                tokenizer,
                data_1,  # type: ignore[arg-type]
@@ -1734,7 +1734,6 @@ async def init_app_state(
        state.openai_serving_models,
        request_logger=request_logger,
    ) if "transcription" in supported_tasks else None
    state.task = model_config.task

    state.enable_server_load_tracking = args.enable_server_load_tracking
    state.server_load_metrics = 0

@@ -9,9 +9,8 @@ from dataclasses import dataclass, field
from typing import Optional

import torch
import transformers
from torch import nn
from transformers.dynamic_module_utils import get_class_from_dynamic_module
from typing_extensions import assert_never

from vllm.attention import Attention
from vllm.config import (ModelConfig, ModelImpl, VllmConfig,
@@ -20,13 +19,10 @@ from vllm.logger import init_logger
from vllm.model_executor.layers.linear import QKVCrossParallelLinear
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)
from vllm.model_executor.models import ModelRegistry
from vllm.model_executor.models.adapters import (as_embedding_model,
                                                 as_reward_model,
                                                 as_seq_cls_model)
from vllm.model_executor.models.interfaces import SupportsQuant
from vllm.model_executor.models.registry import (_PREVIOUSLY_SUPPORTED_MODELS,
                                                 _TRANSFORMERS_BACKEND_MODELS)
from vllm.utils import is_pin_memory_available

logger = init_logger(__name__)
@@ -169,61 +165,6 @@ def device_loading_context(module: torch.nn.Module,
        # New parameters or parameters already on target device are untouched


def resolve_transformers_arch(model_config: ModelConfig,
                              architectures: list[str]):
    if model_config.model_impl == ModelImpl.VLLM:
        raise ValueError(
            "Attempting to resolve architecture from the Transformers library "
            "but the model implementation is set to vLLM. This should never "
            "happen.")

    for i, arch in enumerate(architectures):
        if arch in _TRANSFORMERS_BACKEND_MODELS:
            continue

        if model_config.model_impl == ModelImpl.AUTO:
            logger.warning(
                "%s has no vLLM implementation, falling back to Transformers "
                "implementation. Some features may not be supported and "
                "performance may not be optimal.", arch)

        auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
                                           None) or dict()
        # Make sure that config class is always initialized before model class,
        # otherwise the model class won't be able to access the config class,
        # the expected auto_map should have correct order like:
        # "auto_map": {
        #     "AutoConfig": "<your-repo-name>--<config-name>",
        #     "AutoModel": "<your-repo-name>--<config-name>",
        #     "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
        # },
        auto_modules = {
            name:
            get_class_from_dynamic_module(module,
                                          model_config.model,
                                          revision=model_config.revision)
            for name, module in sorted(auto_map.items(), key=lambda x: x[0])
        }
        model_module = getattr(transformers, arch, None)
        if model_module is None:
            if "AutoModel" not in auto_map:
                raise ValueError(
                    f"Cannot find model module. '{arch}' is not a registered "
                    "model in the Transformers library (only relevant if the "
                    "model is meant to be in Transformers) and 'AutoModel' is "
                    "not present in the model config's 'auto_map' (relevant "
                    "if the model is custom).")
            model_module = auto_modules["AutoModel"]

        if not model_module.is_backend_compatible():
            raise ValueError(
                f"The Transformers implementation of '{arch}' is not "
                "compatible with vLLM.")

        architectures[i] = model_config._get_transformers_backend_cls()
    return architectures


def get_model_architecture(
        model_config: ModelConfig) -> tuple[type[nn.Module], str]:
    architectures = getattr(model_config.hf_config, "architectures", [])
@@ -239,56 +180,38 @@ def get_model_architecture(
        "bitsandbytes",
    ]

    vllm_supported_archs = ModelRegistry.get_supported_archs()
    is_supported = lambda arch: (arch in vllm_supported_archs and arch not in
                                 _TRANSFORMERS_BACKEND_MODELS)
    vllm_not_supported = not any(is_supported(arch) for arch in architectures)

    if vllm_not_supported:
        # try automatic conversion in adapters.py
        for arch in architectures:
            if not arch.endswith("ForSequenceClassification"):
                continue

            assert model_config.task == "classify"
            causal_lm_arch = arch.replace("ForSequenceClassification",
                                          "ForCausalLM")
            causal_lm_arch_vllm_supported = (causal_lm_arch
                                             in vllm_supported_archs)
            if not causal_lm_arch_vllm_supported:
                continue

            architectures = [causal_lm_arch]
            vllm_not_supported = False
            break

    if any(arch in _PREVIOUSLY_SUPPORTED_MODELS for arch in architectures):
        previous_version = _PREVIOUSLY_SUPPORTED_MODELS[architectures[0]]
        raise ValueError(
            f"Model architecture {architectures[0]} was supported"
            f" in vLLM until version {previous_version}, and is "
            "not supported anymore. Please use an older version"
            " of vLLM if you want to use this model architecture.")

    if (model_config.model_impl == ModelImpl.TRANSFORMERS or
            model_config.model_impl == ModelImpl.AUTO and vllm_not_supported):
        architectures = resolve_transformers_arch(model_config, architectures)
        logger.debug_once("Resolve transformers arch %s", str(architectures))
    elif (model_config.quantization is not None
          and model_config.quantization not in mixtral_supported
          and "MixtralForCausalLM" in architectures):
    if (model_config.quantization is not None
            and model_config.quantization not in mixtral_supported
            and "MixtralForCausalLM" in architectures):
        architectures = ["QuantMixtralForCausalLM"]

    model_cls, arch = ModelRegistry.resolve_model_cls(architectures)
    if model_config.task == "embed":
        logger.debug_once("Automatic conversion using `as_embedding_model`.")
    model_cls, arch = model_config.registry.resolve_model_cls(
        architectures,
        model_config=model_config,
    )

    if arch == model_config._get_transformers_backend_cls():
        assert model_config.model_impl != ModelImpl.VLLM
        if model_config.model_impl == ModelImpl.AUTO:
            logger.warning_once(
                "%s has no vLLM implementation, falling back to Transformers "
                "implementation. Some features may not be supported and "
                "performance may not be optimal.", arch)

    convert_type = model_config.convert_type
    if convert_type == "none":
        pass
    elif convert_type == "embed":
        logger.debug_once("Converting to embedding model.")
        model_cls = as_embedding_model(model_cls)
    elif model_config.task == "classify":
        logger.debug_once("Automatic conversion using `as_seq_cls_model`.")
    elif convert_type == "classify":
        logger.debug_once("Converting to sequence classification model.")
        model_cls = as_seq_cls_model(model_cls)
    elif model_config.task == "reward":
        logger.debug_once("Automatic conversion using `as_reward_model`.")
    elif convert_type == "reward":
        logger.debug_once("Converting to reward model.")
        model_cls = as_reward_model(model_cls)
    else:
        assert_never(convert_type)

    return model_cls, arch
@@ -253,8 +253,10 @@ class HybridAttentionMambaModelConfig(VerifyAndUpdateConfig):
            dtype=kv_cache_dtype,
            use_mla=model_config.use_mla).page_size_bytes

        model_cls = ModelRegistry.resolve_model_cls(
            model_config._model_info.architecture)[0]
        model_cls, _ = ModelRegistry.resolve_model_cls(
            model_config.architecture,
            model_config=model_config,
        )

        # get mamba page size
        mamba_page_size = MambaSpec(
@@ -12,19 +12,24 @@ import sys
import tempfile
from abc import ABC, abstractmethod
from collections.abc import Set
from dataclasses import asdict, dataclass, field
from dataclasses import dataclass, field
from functools import lru_cache
from typing import Callable, Optional, TypeVar, Union

import torch.nn as nn
import transformers

from vllm.config import (ModelConfig, ModelImpl, iter_architecture_defaults,
                         try_match_architecture_defaults)
from vllm.logger import init_logger
from vllm.transformers_utils.dynamic_module import (
    try_get_class_from_dynamic_module)

from .interfaces import (has_inner_state, has_noops, is_attention_free,
                         is_hybrid, supports_cross_encoding,
                         supports_multimodal, supports_multimodal_raw_input,
                         supports_pp, supports_transcription, supports_v0_only)
from .interfaces_base import is_text_generation_model
from .interfaces_base import is_pooling_model, is_text_generation_model

logger = init_logger(__name__)
@@ -311,7 +316,7 @@ class _ModelInfo:
        return _ModelInfo(
            architecture=model.__name__,
            is_text_generation_model=is_text_generation_model(model),
            is_pooling_model=True,  # Can convert any model into a pooling model
            is_pooling_model=is_pooling_model(model),
            supports_cross_encoding=supports_cross_encoding(model),
            supports_multimodal=supports_multimodal(model),
            supports_multimodal_raw_input=supports_multimodal_raw_input(model),
@@ -465,6 +470,16 @@ class _ModelRegistry:
                 f"Model architectures {architectures} failed "
                 "to be inspected. Please check the logs for more details.")
 
+        for arch in architectures:
+            if arch in _PREVIOUSLY_SUPPORTED_MODELS:
+                previous_version = _PREVIOUSLY_SUPPORTED_MODELS[arch]
+
+                raise ValueError(
+                    f"Model architecture {arch} was supported in vLLM until "
+                    f"v{previous_version}, and is not supported anymore. "
+                    "Please use an older version of vLLM if you want to "
+                    "use this model architecture.")
+
         raise ValueError(
             f"Model architectures {architectures} are not supported for now. "
             f"Supported architectures: {all_supported_archs}")
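The added lookup distinguishes architectures that vLLM once supported from architectures it never knew about, so the error can point users at the last working version instead of a generic "not supported". A small self-contained sketch of that behavior — the deprecation table and architecture names below are made up for illustration:

```python
# Hypothetical table mapping dropped architectures to the last vLLM version
# that supported them, mirroring the role of _PREVIOUSLY_SUPPORTED_MODELS.
PREVIOUSLY_SUPPORTED = {"SomeOldForCausalLM": "0.9.2"}
REGISTERED = {"LlamaForCausalLM", "Qwen2ForCausalLM"}

def raise_for_unsupported(architectures: list[str]) -> None:
    # Deprecated architectures get a targeted message first...
    for arch in architectures:
        if arch in PREVIOUSLY_SUPPORTED:
            last = PREVIOUSLY_SUPPORTED[arch]
            raise ValueError(
                f"Model architecture {arch} was supported until v{last}; "
                "please use an older version of vLLM.")
    # ...everything else falls through to the generic error.
    raise ValueError(f"Model architectures {architectures} are not supported.")

try:
    raise_for_unsupported(["SomeOldForCausalLM"])
except ValueError as e:
    print(e)  # names v0.9.2 as the last supporting version
```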
@@ -477,174 +492,284 @@ class _ModelRegistry:
         return _try_load_model_cls(model_arch, self.models[model_arch])
 
     def _try_inspect_model_cls(self, model_arch: str) -> Optional[_ModelInfo]:
-        if model_arch in self.models:
-            return _try_inspect_model_cls(model_arch, self.models[model_arch])
+        if model_arch not in self.models:
+            return None
 
-        if model_arch.endswith("ForSequenceClassification"):
-            causal_lm_arch = model_arch.replace("ForSequenceClassification",
-                                                "ForCausalLM")
-            if causal_lm_arch not in self.models:
-                return None
-
-            info = _try_inspect_model_cls(causal_lm_arch,
-                                          self.models[causal_lm_arch])
-            info = _ModelInfo(**dict(
-                asdict(info), **{
-                    "architecture": model_arch,
-                    "supports_cross_encoding": True
-                }))
-            return info
-
-        return None
+        return _try_inspect_model_cls(model_arch, self.models[model_arch])
+
+    def _try_resolve_transformers(
+        self,
+        architecture: str,
+        model_config: ModelConfig,
+    ) -> Optional[str]:
+        if architecture in _TRANSFORMERS_BACKEND_MODELS:
+            return architecture
+
+        auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
+                                           None) or dict()
+
+        # Make sure that config class is always initialized before model class,
+        # otherwise the model class won't be able to access the config class,
+        # the expected auto_map should have correct order like:
+        # "auto_map": {
+        #     "AutoConfig": "<your-repo-name>--<config-name>",
+        #     "AutoModel": "<your-repo-name>--<config-name>",
+        #     "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
+        # },
+        for prefix in ("AutoConfig", "AutoModel"):
+            for name, module in auto_map.items():
+                if name.startswith(prefix):
+                    try_get_class_from_dynamic_module(
+                        module,
+                        model_config.model,
+                        revision=model_config.revision,
+                        warn_on_fail=False,
+                    )
+
+        model_module = getattr(transformers, architecture, None)
+
+        if model_module is None:
+            for name, module in auto_map.items():
+                if name.startswith("AutoModel"):
+                    model_module = try_get_class_from_dynamic_module(
+                        module,
+                        model_config.model,
+                        revision=model_config.revision,
+                        warn_on_fail=True,
+                    )
+                    if model_module is not None:
+                        break
+            else:
+                if model_config.model_impl != ModelImpl.TRANSFORMERS:
+                    return None
+
+                raise ValueError(
+                    f"Cannot find model module. {architecture!r} is not a "
+                    "registered model in the Transformers library (only "
+                    "relevant if the model is meant to be in Transformers) "
+                    "and 'AutoModel' is not present in the model config's "
+                    "'auto_map' (relevant if the model is custom).")
+
+        if not model_module.is_backend_compatible():
+            if model_config.model_impl != ModelImpl.TRANSFORMERS:
+                return None
+
+            raise ValueError(
+                f"The Transformers implementation of {architecture!r} "
+                "is not compatible with vLLM.")
+
+        return model_config._get_transformers_backend_cls()
+
+    def _normalize_arch(
+        self,
+        architecture: str,
+        model_config: ModelConfig,
+    ) -> str:
+        if architecture in self.models:
+            return architecture
+
+        # This may be called in order to resolve runner_type and convert_type
+        # in the first place, in which case we consider the default match
+        match = try_match_architecture_defaults(
+            architecture,
+            runner_type=getattr(model_config, "runner_type", None),
+            convert_type=getattr(model_config, "convert_type", None),
+        )
+        if match:
+            suffix, _ = match
+
+            # Get the name of the base model to convert
+            for repl_suffix, _ in iter_architecture_defaults():
+                base_arch = architecture.replace(suffix, repl_suffix)
+                if base_arch in self.models:
+                    return base_arch
+
+        return architecture
 
     def _normalize_archs(
         self,
-        architectures: Union[str, list[str]],
+        architectures: list[str],
+        model_config: ModelConfig,
     ) -> list[str]:
-        if isinstance(architectures, str):
-            architectures = [architectures]
         if not architectures:
            logger.warning("No model architectures are specified")
 
-        # filter out support architectures
-        normalized_arch = list(
-            filter(lambda model: model in self.models, architectures))
-
-        # try automatic conversion in adapters.py
-        for arch in architectures:
-            if not arch.endswith("ForSequenceClassification"):
-                continue
-            causal_lm_arch = arch.replace("ForSequenceClassification",
-                                          "ForCausalLM")
-            if causal_lm_arch in self.models:
-                normalized_arch.append(arch)
-
-        # NOTE(Isotr0py): Be careful of architectures' order!
-        # Make sure Transformers backend architecture is at the end of the
-        # list, otherwise pooling models automatic conversion will fail!
-        for arch in normalized_arch:
-            if arch.startswith("TransformersFor"):
-                normalized_arch.remove(arch)
-                normalized_arch.append(arch)
-
-        return normalized_arch
+        return [
+            self._normalize_arch(arch, model_config) for arch in architectures
+        ]
 
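`_normalize_arch` maps an unregistered architecture name onto a registered base architecture by swapping the task suffix (e.g. `...ForSequenceClassification` onto `...ForCausalLM`), generalizing the old hard-coded `ForSequenceClassification` special case. A standalone sketch of that suffix rewriting, with a toy registry and suffix table standing in for vLLM's registry and `iter_architecture_defaults` (names are illustrative):

```python
# Toy registry and suffix defaults; vLLM derives these from its model
# registry and per-architecture config defaults.
REGISTERED = {"LlamaForCausalLM", "BertModel"}
BASE_SUFFIXES = ["ForCausalLM", "Model"]  # candidate base-architecture suffixes
TASK_SUFFIXES = ["ForSequenceClassification", "ForRewardModeling"]

def normalize_arch(architecture: str) -> str:
    # Already-registered architectures pass through untouched.
    if architecture in REGISTERED:
        return architecture
    # Otherwise, strip a recognized task suffix and try each base suffix
    # until we hit a registered architecture.
    for suffix in TASK_SUFFIXES:
        if architecture.endswith(suffix):
            for repl in BASE_SUFFIXES:
                base = architecture[:-len(suffix)] + repl
                if base in REGISTERED:
                    return base
    return architecture

print(normalize_arch("LlamaForSequenceClassification"))  # LlamaForCausalLM
print(normalize_arch("BertForSequenceClassification"))   # BertModel
```

The real implementation additionally respects the already-resolved `runner_type`/`convert_type` via `try_match_architecture_defaults`, since normalization may run before those fields are finalized.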
     def inspect_model_cls(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> tuple[_ModelInfo, str]:
-        architectures = self._normalize_archs(architectures)
+        if isinstance(architectures, str):
+            architectures = [architectures]
 
-        for arch in architectures:
-            model_info = self._try_inspect_model_cls(arch)
+        normalized_archs = self._normalize_archs(architectures, model_config)
+
+        # Require transformers impl
+        if model_config.model_impl == ModelImpl.TRANSFORMERS:
+            arch = self._try_resolve_transformers(architectures[0],
+                                                  model_config)
+            if arch is not None:
+                model_info = self._try_inspect_model_cls(arch)
+                if model_info is not None:
+                    return (model_info, arch)
+
+        for arch, normalized_arch in zip(architectures, normalized_archs):
+            model_info = self._try_inspect_model_cls(normalized_arch)
             if model_info is not None:
                 return (model_info, arch)
 
+        # Fallback to transformers impl
+        if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
+            arch = self._try_resolve_transformers(architectures[0],
+                                                  model_config)
+            if arch is not None:
+                model_info = self._try_inspect_model_cls(arch)
+                if model_info is not None:
+                    return (model_info, arch)
+
         return self._raise_for_unsupported(architectures)
 
     def resolve_model_cls(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> tuple[type[nn.Module], str]:
-        architectures = self._normalize_archs(architectures)
+        if isinstance(architectures, str):
+            architectures = [architectures]
 
-        for arch in architectures:
-            model_cls = self._try_load_model_cls(arch)
+        normalized_archs = self._normalize_archs(architectures, model_config)
+
+        # Require transformers impl
+        if model_config.model_impl == ModelImpl.TRANSFORMERS:
+            arch = self._try_resolve_transformers(architectures[0],
+                                                  model_config)
+            if arch is not None:
+                model_cls = self._try_load_model_cls(arch)
+                if model_cls is not None:
+                    return (model_cls, arch)
+
+        for arch, normalized_arch in zip(architectures, normalized_archs):
+            model_cls = self._try_load_model_cls(normalized_arch)
             if model_cls is not None:
                 return (model_cls, arch)
 
+        # Fallback to transformers impl
+        if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
+            arch = self._try_resolve_transformers(architectures[0],
+                                                  model_config)
+            if arch is not None:
+                model_cls = self._try_load_model_cls(arch)
+                if model_cls is not None:
+                    return (model_cls, arch)
+
         return self._raise_for_unsupported(architectures)
 
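Both resolvers now follow the same three-step order: if `--model-impl transformers` is forced, try the Transformers backend first; otherwise consult the native registry; and under `auto` fall back to the Transformers backend only when the registry misses. A condensed sketch of that control flow, with plain callables standing in for the registry and backend lookups (all names here are illustrative):

```python
from typing import Callable, Optional

def resolve(arch: str,
            model_impl: str,  # "auto" | "vllm" | "transformers"
            registry_lookup: Callable[[str], Optional[str]],
            transformers_lookup: Callable[[str], Optional[str]]) -> str:
    # 1. A forced Transformers backend takes priority.
    if model_impl == "transformers":
        resolved = transformers_lookup(arch)
        if resolved is not None:
            return resolved
    # 2. Native vLLM registry.
    resolved = registry_lookup(arch)
    if resolved is not None:
        return resolved
    # 3. Fallback to the Transformers backend under "auto".
    if model_impl in ("auto", "transformers"):
        resolved = transformers_lookup(arch)
        if resolved is not None:
            return resolved
    raise ValueError(f"{arch} is not supported")

native = {"LlamaForCausalLM": "vllm:LlamaForCausalLM"}.get
fallback = lambda arch: "TransformersForCausalLM"

print(resolve("LlamaForCausalLM", "auto", native, fallback))
print(resolve("ExoticForCausalLM", "auto", native, fallback))
```

Note the "require" and "fallback" branches call the same lookup; the difference is purely ordering relative to the registry, which is why the real code duplicates the `_try_resolve_transformers` call before and after the registry loop.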
     def is_text_generation_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.is_text_generation_model
 
     def is_pooling_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.is_pooling_model
 
     def is_cross_encoder_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_cross_encoding
 
     def is_multimodal_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_multimodal
 
     def supports_multimodal_raw_input(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_multimodal_raw_input
 
     def is_pp_supported_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_pp
 
     def model_has_inner_state(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.has_inner_state
 
     def is_attention_free_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.is_attention_free
 
     def is_hybrid_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.is_hybrid
 
     def is_noops_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.has_noops
 
     def is_transcription_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_transcription
 
     def is_transcription_only_model(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return model_cls.supports_transcription_only
 
     def is_v1_compatible(
         self,
         architectures: Union[str, list[str]],
+        model_config: ModelConfig,
     ) -> bool:
-        model_cls, _ = self.inspect_model_cls(architectures)
+        model_cls, _ = self.inspect_model_cls(architectures, model_config)
         return not model_cls.supports_v0_only
 
vllm/transformers_utils/dynamic_module.py (new file, 60 lines)
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import os
+from typing import Optional, Union
+
+from transformers.dynamic_module_utils import get_class_from_dynamic_module
+
+import vllm.envs as envs
+from vllm.logger import init_logger
+
+logger = init_logger(__name__)
+
+
+def try_get_class_from_dynamic_module(
+    class_reference: str,
+    pretrained_model_name_or_path: str,
+    cache_dir: Optional[Union[str, os.PathLike]] = None,
+    force_download: bool = False,
+    resume_download: Optional[bool] = None,
+    proxies: Optional[dict[str, str]] = None,
+    token: Optional[Union[bool, str]] = None,
+    revision: Optional[str] = None,
+    local_files_only: bool = False,
+    repo_type: Optional[str] = None,
+    code_revision: Optional[str] = None,
+    warn_on_fail: bool = True,
+    **kwargs,
+) -> Optional[type]:
+    """
+    As [transformers.dynamic_module_utils.get_class_from_dynamic_module][],
+    but ignoring any errors.
+    """
+    try:
+        return get_class_from_dynamic_module(
+            class_reference,
+            pretrained_model_name_or_path,
+            cache_dir=cache_dir,
+            force_download=force_download,
+            resume_download=resume_download,
+            proxies=proxies,
+            token=token,
+            revision=revision,
+            local_files_only=local_files_only,
+            repo_type=repo_type,
+            code_revision=code_revision,
+            **kwargs,
+        )
+    except Exception:
+        location = "ModelScope" if envs.VLLM_USE_MODELSCOPE else "HF Hub"
+
+        if warn_on_fail:
+            logger.warning(
+                "Unable to load %s from %s on %s.",
+                class_reference,
+                pretrained_model_name_or_path,
+                location,
+                exc_info=True,
+            )
+
+        return None
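The new helper is a thin "return `None` instead of raising" wrapper around the Transformers loader, with an optional warning controlled by `warn_on_fail`. The pattern is easy to exercise in isolation; a generic sketch that avoids the Transformers dependency so it stays self-contained (the `load` function below is a stand-in for the remote class lookup):

```python
import logging
from typing import Callable, Optional, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")

def try_call(fn: Callable[..., T], *args,
             warn_on_fail: bool = True, **kwargs) -> Optional[T]:
    """Call fn, returning None (and optionally logging) on any exception."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        if warn_on_fail:
            # exc_info=True preserves the traceback in the log, as the
            # vLLM wrapper does, without propagating the error.
            logger.warning("Call to %s failed", fn.__name__, exc_info=True)
        return None

def load(name: str) -> str:
    if name != "known":
        raise ImportError(name)
    return "loaded"

print(try_call(load, "known"))                        # loaded
print(try_call(load, "missing", warn_on_fail=False))  # None
```

This lets the registry probe `auto_map` entries speculatively (`warn_on_fail=False` for the config pre-load, `warn_on_fail=True` for the model lookup) without a failed probe aborting resolution.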
@@ -3,6 +3,8 @@
 
 from typing import Optional
 
+from typing_extensions import assert_never
+
 from vllm.config import LoRAConfig, ModelConfig, SchedulerConfig
 from vllm.lora.request import LoRARequest
 from vllm.transformers_utils.tokenizer import (AnyTokenizer, encode_tokens,
@@ -108,6 +110,14 @@ class TokenizerGroup:
 def init_tokenizer_from_configs(model_config: ModelConfig,
                                 scheduler_config: SchedulerConfig,
                                 lora_config: Optional[LoRAConfig]):
+    runner_type = model_config.runner_type
+    if runner_type == "generate" or runner_type == "draft":
+        truncation_side = "left"
+    elif runner_type == "pooling":
+        truncation_side = "right"
+    else:
+        assert_never(runner_type)
+
     return TokenizerGroup(
         tokenizer_id=model_config.tokenizer,
         enable_lora=bool(lora_config),
@@ -117,4 +127,4 @@ def init_tokenizer_from_configs(model_config: ModelConfig,
         tokenizer_mode=model_config.tokenizer_mode,
         trust_remote_code=model_config.trust_remote_code,
         revision=model_config.tokenizer_revision,
-        truncation_side=model_config.truncation_side)
+        truncation_side=truncation_side)
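With this change, the runner type determines the truncation side directly: generate/draft runners keep the end of the prompt (truncate on the left), while pooling runners keep the beginning (truncate on the right). A small sketch of the mapping and its effect on an over-long token sequence:

```python
def truncation_side_for(runner_type: str) -> str:
    # generate/draft keep the most recent context; pooling keeps the start.
    if runner_type in ("generate", "draft"):
        return "left"
    if runner_type == "pooling":
        return "right"
    # Stands in for assert_never(runner_type).
    raise AssertionError(f"Unhandled runner type: {runner_type}")

def truncate(tokens: list[int], max_len: int, side: str) -> list[int]:
    if len(tokens) <= max_len:
        return tokens
    # "left" drops the oldest tokens, "right" drops the newest.
    return tokens[-max_len:] if side == "left" else tokens[:max_len]

tokens = [1, 2, 3, 4, 5]
print(truncate(tokens, 3, truncation_side_for("generate")))  # [3, 4, 5]
print(truncate(tokens, 3, truncation_side_for("pooling")))   # [1, 2, 3]
```

This matches the tokenizer convention where `truncation_side="left"` discards the start of the sequence, which is what a decoder continuing a conversation wants, whereas an embedding/pooling model usually cares most about the document's opening tokens.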
@@ -127,8 +127,8 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
         self.is_multimodal_model = model_config.is_multimodal_model
         self.is_pooling_model = model_config.pooler_config is not None
         self.is_encoder_only_model = False
-        self.model_supports_multimodal_raw_input = (
-            model_config.model_supports_multimodal_raw_input)
+        self.is_multimodal_raw_input_supported = (
+            model_config.is_multimodal_raw_input_supported)
         self.max_model_len = model_config.max_model_len
         self.max_num_tokens = scheduler_config.max_num_batched_tokens
         self.max_num_reqs = scheduler_config.max_num_seqs
@@ -583,7 +583,7 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
     ) -> dict[str, Any]:
 
         model_kwargs: dict[str, Any] = {}
-        if self.model_supports_multimodal_raw_input:
+        if self.is_multimodal_raw_input_supported:
             # This model requires the raw multimodal data in input.
             if scheduler_output:
                 multi_modal_kwargs_list = []