mirror of https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-04-28 06:57:03 +08:00

[Deprecation][2/N] Replace --task with --runner and --convert (#21470)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

This commit is contained in:
parent 8f605ee309
commit 86ae693f20
@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
 First, launch the OpenAI-compatible server:
 
 ```bash
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
   --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 ```
 
@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
 First, launch the OpenAI-compatible server:
 
 ```bash
-vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
 ```
 
 Then, you can use the OpenAI client as follows:
@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
 First, launch the OpenAI-compatible server:
 
 ```bash
-vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
   --max-model-len 4096 --enable-prompt-embeds
 ```
 
@@ -2,12 +2,19 @@
 
 vLLM provides first-class support for generative models, which covers most of LLMs.
 
 In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
 
-For generative models, the only supported `--task` option is `"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
+## Configuration
+
+### Model Runner (`--runner`)
+
+Run a model in generation mode via the option `--runner generate`.
+
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.
 
 ## Offline Inference
 
@@ -1,9 +1,9 @@
 # Pooling Models
 
-vLLM also supports pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, such as embedding, classification and reward models.
 
 In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
 before returning them.
 
 !!! note
@@ -11,18 +11,39 @@ before returning them.
 As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
 pooling models as they only work on the generation or decode stage, so performance may not improve as much.
 
-If the model doesn't implement this interface, you can set `--task` which tells vLLM
-to convert the model into a pooling model.
-
-| `--task` | Model type | Supported pooling tasks |
-|------------|----------------------|-------------------------------|
-| `embed` | Embedding model | `encode`, `embed` |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward` | Reward model | `encode` |
-
-## Pooling Tasks
-
-In vLLM, we define the following pooling tasks and corresponding APIs:
+## Configuration
+
+### Model Runner
+
+Run a model in pooling mode via the option `--runner pooling`.
+
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.
+
+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture | `--convert` | Supported pooling tasks |
+|-------------------------------------------------|-------------|-------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
+| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
+| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.
+
+### Pooling Tasks
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs:
 
 | Task | APIs |
 |------------|--------------------|
@@ -31,11 +52,19 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
 | `classify` | `classify` |
 | `score` | `score` |
 
-\*The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to `embed` task if the model does not support `score` task.
 
-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+### Pooler Configuration
 
-By default, the pooler assigned to each task has the following attributes:
+#### Predefined models
+
+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--override-pooler-config` option.
+
+#### Converted models
+
+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:
 
 | Task | Pooling Type | Normalization | Softmax |
 |------------|----------------|---------------|---------|
@@ -43,20 +72,12 @@ By default, the pooler assigned to each task has the following attributes:
 | `embed` | `LAST` | ✅︎ | ❌ |
 | `classify` | `LAST` | ❌ | ✅︎ |
 
-These defaults may be overridden by the model's implementation in vLLM.
-
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
-which takes priority over the model's defaults.
+its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.
 
 You can further customize this via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.
 
-!!! note
-    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
-    that is not based on [PoolerConfig][vllm.config.PoolerConfig].
-
 ## Offline Inference
 
 The [LLM][vllm.LLM] class provides various methods for offline inference.
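To make the pooler-defaults table concrete, here is a minimal stdlib-only sketch of what `LAST` pooling with normalization (the `embed` default) versus softmax (the `classify` default) computes. This is not vLLM's implementation; the helper names and the toy hidden states are invented for illustration.

```python
import math

def pool_last(hidden_states: list[list[float]]) -> list[float]:
    """LAST pooling: keep only the final token's hidden state."""
    return hidden_states[-1]

def normalize(v: list[float]) -> list[float]:
    """L2-normalize, as applied to `embed` outputs by default."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(v: list[float]) -> list[float]:
    """Softmax, as applied to `classify` outputs by default."""
    exps = [math.exp(x) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [[0.1, 0.2], [3.0, 4.0]]  # one toy hidden state per token
embed_out = normalize(pool_last(hidden))   # embed: normalized last state
classify_out = softmax(pool_last(hidden))  # classify: probabilities
print(embed_out)  # [0.6, 0.8]
```

The `--override-pooler-config` option mentioned above effectively swaps which of these post-processing steps (and which pooling position) is applied.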
@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
 ```python
 from vllm import LLM
 
-llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
 (output,) = llm.encode("Hello, my name is")
 
 data = output.outputs.data
@@ -85,7 +106,7 @@ It is primarily designed for embedding models.
 ```python
 from vllm import LLM
 
-llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
+llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
 (output,) = llm.embed("Hello, my name is")
 
 embeds = output.outputs.embedding
@@ -102,7 +123,7 @@ It is primarily designed for classification models.
 ```python
 from vllm import LLM
 
-llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
 (output,) = llm.classify("Hello, my name is")
 
 probs = output.outputs.probs
@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
 ```python
 from vllm import LLM
 
-llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
 (output,) = llm.score("What is the capital of France?",
                       "The capital of Brazil is Brasilia.")
 
@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
 from vllm import LLM, PoolingParams
 
 llm = LLM(model="jinaai/jina-embeddings-v3",
-          task="embed",
+          runner="pooling",
           trust_remote_code=True)
 outputs = llm.embed(["Follow the white rabbit."],
                     pooling_params=PoolingParams(dimensions=32))
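The `PoolingParams(dimensions=32)` request in the hunk above can be understood with a small stdlib sketch: Matryoshka-style embeddings are commonly shortened by keeping the leading components and re-normalizing. This is a conceptual illustration under that assumption, not vLLM's server-side code, and the `shorten` helper is a made-up name.

```python
import math

def shorten(embedding: list[float], dimensions: int) -> list[float]:
    """Keep the first `dimensions` components, then L2-normalize."""
    head = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]  # toy full-size embedding
short = shorten(full, 2)               # analogous to dimensions=2
print(len(short))  # 2
```

Matryoshka-trained models are organized so that these truncated prefixes remain useful embeddings, which is why the output dimension can be chosen per request.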
@@ -1,7 +1,6 @@
 # Supported Models
 
 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
-If a model supports more than one task, you can set the task via the `--task` argument.
 
 For each task, we list the model architectures that have been implemented in vLLM.
 Alongside each architecture, we include some popular models that use it.
@@ -24,7 +23,7 @@ To check if the modeling backend is Transformers, you can simply do this:
 
 ```python
 from vllm import LLM
-llm = LLM(model=..., task="generate") # Name or path of your model
+llm = LLM(model=...) # Name or path of your model
 llm.apply_model(lambda model: print(type(model)))
 ```
 
@@ -158,13 +157,13 @@ The [Transformers backend][transformers-backend] enables you to run models direc
 ```python
 from vllm import LLM
 
-# For generative models (task=generate) only
-llm = LLM(model=..., task="generate") # Name or path of your model
+# For generative models (runner=generate) only
+llm = LLM(model=..., runner="generate") # Name or path of your model
 output = llm.generate("Hello, my name is")
 print(output)
 
-# For pooling models (task={embed,classify,reward,score}) only
-llm = LLM(model=..., task="embed") # Name or path of your model
+# For pooling models (runner=pooling) only
+llm = LLM(model=..., runner="pooling") # Name or path of your model
 output = llm.encode("Hello, my name is")
 print(output)
 ```
@@ -281,13 +280,13 @@ And use with `trust_remote_code=True`.
 ```python
 from vllm import LLM
 
-llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
+llm = LLM(model=..., revision=..., runner=..., trust_remote_code=True)
 
-# For generative models (task=generate) only
+# For generative models (runner=generate) only
 output = llm.generate("Hello, my name is")
 print(output)
 
-# For pooling models (task={embed,classify,reward,score}) only
+# For pooling models (runner=pooling) only
 output = llm.encode("Hello, my name is")
 print(output)
 ```
@@ -312,8 +311,6 @@ See [this page](generative_models.md) for more information on how to use generat
 
 #### Text Generation
 
-Specified using `--task generate`.
-
 <style>
 th {
   white-space: nowrap;
@@ -420,25 +417,27 @@ See [this page](./pooling_models.md) for more information on how to use pooling
 
 !!! important
     Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+    you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
 
 #### Text Embedding
 
-Specified using `--task embed`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
-| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
+| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | | |
+| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | | ✅︎ |
 | `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ | |
-| `GteModel` | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
-| `GteNewModel` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
-| `ModernBertModel` | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
-| `NomicBertModel` | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
-| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen3Model`, `Qwen3ForCausalLM` | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0`. | | | |
+| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | | |
+| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | | |
+| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | | |
+| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
 
+<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 !!! note
     `ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
@@ -460,14 +459,16 @@ of the whole prompt are extracted from the normalized hidden state corresponding
 
 #### Reward Modeling
 
-Specified using `--task reward`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
 | `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
 
+<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 If your model is not in the above list, we will try to automatically convert the model using
 [as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
@@ -478,28 +479,31 @@ If your model is not in the above list, we will try to automatically convert the
 
 #### Classification
 
-Specified using `--task classify`.
-
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
 | `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
 
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 If your model is not in the above list, we will try to automatically convert the model using
 [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
 
 #### Sentence Pair Scoring
 
-Specified using `--task score`.
-
-| Architecture | Models | Example HF Models | [V1](gh-issue:8779) |
-|--------------|--------|-------------------|---------------------|
-| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | |
-| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | |
-| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ |
-| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ |
-| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | |
-| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | |
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
+|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
+| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | | |
+| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
+| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
+| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
 
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 !!! note
     Load the official original `BAAI/bge-reranker-v2-gemma` by using the following command.
@@ -575,8 +579,6 @@ See [this page](generative_models.md) for more information on how to use generat
 
 #### Text Generation
 
-Specified using `--task generate`.
-
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
 | `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
@@ -705,8 +707,6 @@ Some models are supported only via the [Transformers backend](#transformers). Th
 
 #### Transcription
 
-Specified using `--task transcription`.
-
 Speech2Text models trained specifically for Automatic Speech Recognition.
 
 | Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
@@ -719,14 +719,10 @@ See [this page](./pooling_models.md) for more information on how to use pooling
 
 !!! important
     Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+    you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
 
 #### Text Embedding
 
-Specified using `--task embed`.
-
-Any text generation model can be converted into an embedding model by passing `--task embed`.
-
 !!! note
     To get the best results, you should use pooling models that are specifically trained as such.
@@ -734,19 +730,24 @@ The following table lists those that are tested in vLLM.
 
 | Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
 |--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
-| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `LlavaNextForConditionalGeneration`<sup>C</sup> | LLaVA-NeXT-based | T / I | `royokong/e5-v` | | | |
+| `Phi3VForCausalLM`<sup>C</sup> | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ | |
+| `*ForConditionalGeneration`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | \* | N/A | \* | \* | \* |
 
+<sup>C</sup> Automatically converted into an embedding model via `--convert embed`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 ---
 
 #### Scoring
 
-Specified using `--task score`.
-
 | Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
 |-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|
 | `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | | | ✅︎ |
 
+<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
 ## Model Support Policy
 
 At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
|
At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
|
||||||
|
|||||||
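For scripts being migrated off the deprecated flag, the correspondence this commit documents can be sketched in isolation. The helper below is hypothetical (it is not part of vLLM); the mapping is inferred from the documentation edits above: `--task generate` becomes `--runner generate`, the pooling-style tasks become `--runner pooling`, and `--convert embed`/`--convert classify` are only needed when auto-converting a generative architecture.

```python
# Hypothetical migration helper, not a vLLM API: translate a deprecated
# --task value into the new --runner (and, for converted generative
# architectures, --convert) CLI arguments. Mapping inferred from the docs
# changes in this commit; dedicated pooling models need only --runner.
LEGACY_TASK_MAP = {
    "generate": ("generate", None),
    "embed": ("pooling", "embed"),
    "classify": ("pooling", "classify"),
    "score": ("pooling", None),
}


def migrate_task(task: str) -> list[str]:
    """Return the new CLI arguments replacing ``--task <task>``."""
    runner, convert = LEGACY_TASK_MAP[task]
    args = ["--runner", runner]
    if convert is not None:
        args += ["--convert", convert]
    return args


print(migrate_task("embed"))     # ['--runner', 'pooling', '--convert', 'embed']
print(migrate_task("generate"))  # ['--runner', 'generate']
```

Whether `--convert` is actually required depends on the model architecture, so treat the table as a starting point rather than a rule.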
@@ -45,17 +45,17 @@ To call the server, in your preferred text editor, create a script that uses an
 We currently support the following OpenAI APIs:
 
 - [Completions API][completions-api] (`/v1/completions`)
-    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
+    - Only applicable to [text generation models](../models/generative_models.md).
     - *Note: `suffix` parameter is not supported.*
 - [Chat Completions API][chat-api] (`/v1/chat/completions`)
-    - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
+    - Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
     - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
 - [Embeddings API][embeddings-api] (`/v1/embeddings`)
-    - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
+    - Only applicable to [embedding models](../models/pooling_models.md).
 - [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
-    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
 - [Translation API][translations-api] (`/v1/audio/translations`)
-    - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+    - Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
 
 In addition, we have the following custom APIs:
 

@@ -64,14 +64,14 @@ In addition, we have the following custom APIs:
 - [Pooling API][pooling-api] (`/pooling`)
     - Applicable to all [pooling models](../models/pooling_models.md).
 - [Classification API][classification-api] (`/classify`)
-    - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
+    - Only applicable to [classification models](../models/pooling_models.md).
 - [Score API][score-api] (`/score`)
-    - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
+    - Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
 - [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
     - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
     - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
     - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
-    - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+    - Only applicable to [cross-encoder models](../models/pooling_models.md).
 
 [](){ #chat-template }
 
@@ -250,14 +250,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:
 
 ```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling \
     --trust-remote-code \
     --max-model-len 4096 \
     --chat-template examples/template_vlm2vec.jinja
 ```
 
 !!! important
-    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+    Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--runner pooling`
     to run this model in embedding mode instead of text generation mode.
 
     The custom chat template is completely different from the original one for this model,

@@ -296,14 +296,14 @@ and passing a list of `messages` in the request. Refer to the examples below for
 To serve the model:
 
 ```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+vllm serve MrLight/dse-qwen2-2b-mrl-v1 --runner pooling \
     --trust-remote-code \
     --max-model-len 8192 \
     --chat-template examples/template_dse_qwen2_vl.jinja
 ```
 
 !!! important
-    Like with VLM2Vec, we have to explicitly pass `--task embed`.
+    Like with VLM2Vec, we have to explicitly pass `--runner pooling`.
 
     Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
     by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
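Once a server like the VLM2Vec one above is running, the chat-style embedding request is just a JSON body with a `messages` list. A minimal sketch of assembling that body as a plain dict (field names follow the OpenAI-compatible schema used throughout these docs; the image URL is a placeholder, not a real asset):

```python
# Sketch of the request body for chat-style embeddings against a server
# started with `vllm serve TIGER-Lab/VLM2Vec-Full --runner pooling ...`.
# The URL below is a placeholder; in real use this dict would be POSTed
# to the server's /v1/embeddings-style chat endpoint.
def build_embedding_request(model: str, text: str, image_url: str) -> dict:
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": text},
            ],
        }],
        "encoding_format": "float",
    }


body = build_embedding_request(
    "TIGER-Lab/VLM2Vec-Full",
    "Represent the given image.",
    "https://example.com/image.jpg",
)
assert body["messages"][0]["content"][1]["text"] == "Represent the given image."
```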
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jason9693/Qwen2.5-1.5B-apeach", task="classify", enforce_eager=True
+        model="jason9693/Qwen2.5-1.5B-apeach",
+        runner="pooling",
+        enforce_eager=True,
     )
     return parser.parse_args()
 

@@ -27,7 +29,7 @@ def main(args: Namespace):
     ]
 
     # Create an LLM.
-    # You should pass task="classify" for classification models
+    # You should pass runner="pooling" for classification models
     llm = LLM(**vars(args))
 
     # Generate logits. The output is a list of ClassificationRequestOutputs.
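The classify example above ends by printing per-class probabilities from each `ClassificationRequestOutput`; turning those into a label is a plain argmax. A stand-in sketch that runs without vLLM (the `probs` list is dummy data shaped like the real output):

```python
# Stand-in for post-processing a ClassificationRequestOutput: the model
# emits one probability per class, and the predicted label is the argmax.
# `probs` is dummy data mimicking the shape of the real output.
def predict_label(probs: list[float], labels: list[str]) -> str:
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


probs = [0.1, 0.7, 0.2]
labels = ["negative", "positive", "neutral"]
print(predict_label(probs, labels))  # positive
```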
@@ -13,7 +13,7 @@ def parse_args():
     # Set example specific arguments
     parser.set_defaults(
         model="intfloat/e5-mistral-7b-instruct",
-        task="embed",
+        runner="pooling",
         enforce_eager=True,
         max_model_len=1024,
     )

@@ -30,7 +30,7 @@ def main(args: Namespace):
     ]
 
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
 
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="BAAI/bge-reranker-v2-m3", task="score", enforce_eager=True
+        model="BAAI/bge-reranker-v2-m3",
+        runner="pooling",
+        enforce_eager=True,
     )
     return parser.parse_args()
 

@@ -26,7 +28,7 @@ def main(args: Namespace):
     ]
 
     # Create an LLM.
-    # You should pass task="score" for cross-encoder models
+    # You should pass runner="pooling" for cross-encoder models
     llm = LLM(**vars(args))
 
     # Generate scores. The output is a list of ScoringRequestOutputs.
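Cross-encoder rerankers such as the `BAAI/bge-reranker-v2-m3` example above emit one relevance logit per query-document pair, and the reported score is typically that logit squashed through a sigmoid. A small sketch with dummy logits standing in for `ScoringRequestOutput` values:

```python
import math


# Cross-encoders emit one relevance logit per (query, document) pair;
# scores are typically sigmoid(logit). The logits below are dummy data
# standing in for real ScoringRequestOutput values.
def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rank(docs: list[str], logits: list[float]) -> list[tuple[str, float]]:
    """Pair each document with its score and sort best-first."""
    scored = [(doc, sigmoid(logit)) for doc, logit in zip(docs, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = ["doc A", "doc B"]
print(rank(docs, [2.0, -1.0])[0][0])  # doc A
```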
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
+        model="jinaai/jina-embeddings-v3",
+        runner="pooling",
+        trust_remote_code=True,
     )
     return parser.parse_args()
 

@@ -29,7 +31,7 @@ def main(args: Namespace):
     ]
 
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
 
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
@@ -12,7 +12,9 @@ def parse_args():
     parser = EngineArgs.add_cli_args(parser)
     # Set example specific arguments
     parser.set_defaults(
-        model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True
+        model="jinaai/jina-embeddings-v3",
+        runner="pooling",
+        trust_remote_code=True,
     )
     return parser.parse_args()
 

@@ -29,7 +31,7 @@ def main(args: Namespace):
     ]
 
     # Create an LLM.
-    # You should pass task="embed" for embedding models
+    # You should pass runner="pooling" for embedding models
     llm = LLM(**vars(args))
 
     # Generate embedding. The output is a list of EmbeddingRequestOutputs.
@@ -17,7 +17,7 @@ model_name = "Qwen/Qwen3-Reranker-0.6B"
 # Models converted offline using this method can not only be more efficient
 # and support the vllm score API, but also make the init parameters more
 # concise, for example.
-# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", task="score")
+# llm = LLM(model="tomaarsen/Qwen3-Reranker-0.6B-seq-cls", runner="pooling")
 
 # If you want to load the official original version, the init parameters are
 # as follows.

@@ -27,7 +27,7 @@ def get_llm() -> LLM:
     """Initializes and returns the LLM model for Qwen3-Reranker."""
     return LLM(
         model=model_name,
-        task="score",
+        runner="pooling",
         hf_overrides={
             "architectures": ["Qwen3ForSequenceClassification"],
             "classifier_from_token": ["no", "yes"],
@@ -70,7 +70,7 @@ def run_e5_v(query: Query) -> ModelRequestData:
 
     engine_args = EngineArgs(
         model="royokong/e5-v",
-        task="embed",
+        runner="pooling",
         max_model_len=4096,
         limit_mm_per_prompt={"image": 1},
     )

@@ -102,7 +102,7 @@ def run_vlm2vec(query: Query) -> ModelRequestData:
 
     engine_args = EngineArgs(
         model="TIGER-Lab/VLM2Vec-Full",
-        task="embed",
+        runner="pooling",
         max_model_len=4096,
         trust_remote_code=True,
         mm_processor_kwargs={"num_crops": 4},

@@ -122,7 +122,7 @@ def run_jinavl_reranker(query: Query) -> ModelRequestData:
 
     engine_args = EngineArgs(
         model="jinaai/jina-reranker-m0",
-        task="score",
+        runner="pooling",
         max_model_len=32768,
         trust_remote_code=True,
         mm_processor_kwargs={
@@ -9,7 +9,7 @@ Launch the vLLM server with the following command:
 vllm serve llava-hf/llava-1.5-7b-hf
 
 (multi-image inference with Phi-3.5-vision-instruct)
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
     --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 
 (audio inference with Ultravox)
@@ -92,7 +92,7 @@ def dse_qwen2_vl(inp: dict):
 def parse_args():
     parser = argparse.ArgumentParser(
         "Script to call a specified VLM through the API. Make sure to serve "
-        "the model with --task embed before running this."
+        "the model with `--runner pooling` before running this."
     )
     parser.add_argument(
         "--model",
@@ -3,7 +3,7 @@
 """
 Example online usage of Score API.
 
-Run `vllm serve <model> --task score` to start up the server in vLLM.
+Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
 """
 
 import argparse
@@ -3,7 +3,7 @@
 """
 Example online usage of Score API.
 
-Run `vllm serve <model> --task score` to start up the server in vLLM.
+Run `vllm serve <model> --runner pooling` to start up the server in vLLM.
 """
 
 import argparse
@@ -3,7 +3,7 @@
 """
 Example online usage of Pooling API.
 
-Run `vllm serve <model> --task <embed|classify|reward|score>`
+Run `vllm serve <model> --runner pooling`
 to start up the server in vLLM.
 """
 
@@ -10,7 +10,7 @@ This script demonstrates how to:
 
 Run the vLLM server first:
     vllm serve meta-llama/Llama-3.2-1B-Instruct \
-        --task generate \
+        --runner generate \
         --max-model-len 4096 \
         --enable-prompt-embeds
 
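As the file header notes, prompt embeddings travel to this server as base64-encoded torch tensors. The transport step itself is just serialize-then-base64; the sketch below uses `pickle` of a nested list as a stand-in for `torch.save` on a `torch.Tensor`, so it runs without torch installed — in real use you would serialize the actual tensor.

```python
import base64
import io
import pickle


# The prompt-embeds API transports tensors as base64 text. In real use the
# buffer would hold torch.save() output for a torch.Tensor; pickle of a
# nested float list stands in here so this sketch runs without torch.
def encode_embeds(embeds: list[list[float]]) -> str:
    buf = io.BytesIO()
    pickle.dump(embeds, buf)
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def decode_embeds(payload: str) -> list[list[float]]:
    return pickle.loads(base64.b64decode(payload))


embeds = [[0.1, 0.2], [0.3, 0.4]]
assert decode_embeds(encode_embeds(embeds)) == embeds
```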
@@ -148,9 +148,6 @@ def async_tp_pass_on_test_model(local_rank: int, world_size: int,
     # in the vllm_config, it's not really used.
     model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
     vllm_config.model_config = ModelConfig(model=model_name,
-                                           task="auto",
-                                           tokenizer=model_name,
-                                           tokenizer_mode="auto",
                                            trust_remote_code=True,
                                            dtype=dtype,
                                            seed=42)
@@ -62,8 +62,8 @@ class TestSetting:
     TestSetting(
         model="BAAI/bge-multilingual-gemma2",
         model_args=[
-            "--task", "embed", "--dtype", "bfloat16", "--max-model-len",
-            "2048"
+            "--runner", "pooling", "--dtype", "bfloat16",
+            "--max-model-len", "2048"
         ],
         pp_size=1,
         tp_size=1,

@@ -75,7 +75,7 @@ class TestSetting:
     # # encoder-based embedding model (BERT)
     # TestSetting(
     #     model="BAAI/bge-base-en-v1.5",
-    #     model_args=["--task", "embed"],
+    #     model_args=["--runner", "pooling"],
     #     pp_size=1,
     #     tp_size=1,
     #     attn_backend="XFORMERS",
@@ -125,9 +125,6 @@ def all_reduce_fusion_pass_on_test_model(local_rank: int, world_size: int,
     # in the vllm_config, it's not really used.
     model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
     vllm_config.model_config = ModelConfig(model=model_name,
-                                           task="auto",
-                                           tokenizer=model_name,
-                                           tokenizer_mode="auto",
                                            trust_remote_code=True,
                                            dtype=dtype,
                                            seed=42)
@@ -250,9 +250,6 @@ def sequence_parallelism_pass_on_test_model(
     # in the vllm_config, it's not really used.
     model_name = "nm-testing/TinyLlama-1.1B-Chat-v1.0-FP8-e2e"
     vllm_config.model_config = ModelConfig(model=model_name,
-                                           task="auto",
-                                           tokenizer=model_name,
-                                           tokenizer_mode="auto",
                                            trust_remote_code=True,
                                            dtype=dtype,
                                            seed=42)
@@ -23,7 +23,7 @@ from vllm import LLM, SamplingParams
 from vllm.assets.audio import AudioAsset
 from vllm.assets.image import ImageAsset
 from vllm.assets.video import VideoAsset
-from vllm.config import TaskOption, _get_and_verify_dtype
+from vllm.config import ConvertOption, RunnerOption, _get_and_verify_dtype
 from vllm.connections import global_http_connection
 from vllm.distributed import (cleanup_dist_env_and_memory,
                               init_distributed_environment,

@@ -769,7 +769,8 @@ class VllmRunner:
     def __init__(
         self,
         model_name: str,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
+        convert: ConvertOption = "auto",
         tokenizer_name: Optional[str] = None,
         tokenizer_mode: str = "auto",
         trust_remote_code: bool = True,

@@ -786,7 +787,8 @@ class VllmRunner:
     ) -> None:
         self.llm = LLM(
             model=model_name,
-            task=task,
+            runner=runner,
+            convert=convert,
             tokenizer=tokenizer_name,
             tokenizer_mode=tokenizer_mode,
             trust_remote_code=trust_remote_code,
@@ -6,7 +6,7 @@ from typing import Literal, NamedTuple, Optional
 
 import pytest
 
-from vllm.config import TaskOption
+from vllm.config import RunnerOption
 from vllm.logger import init_logger
 
 from ..utils import compare_two_settings, create_new_process_for_each_test

@@ -31,14 +31,14 @@ class EPTestOptions(NamedTuple):
 class EPTestSettings:
     parallel_setups: list[ParallelSetup]
     distributed_backends: list[str]
-    task: TaskOption
+    runner: RunnerOption
     test_options: EPTestOptions
 
     @staticmethod
     def detailed(
         *,
         tp_base: int = 2,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         trust_remote_code: bool = False,
         tokenizer_mode: Optional[str] = None,
         load_format: Optional[str] = None,

@@ -63,7 +63,7 @@ class EPTestSettings:
                               chunked_prefill=False),
             ],
             distributed_backends=["mp", "ray"],
-            task=task,
+            runner=runner,
             test_options=EPTestOptions(trust_remote_code=trust_remote_code,
                                        tokenizer_mode=tokenizer_mode,
                                        load_format=load_format,

@@ -74,7 +74,7 @@ class EPTestSettings:
     def fast(
         *,
         tp_base: int = 2,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         trust_remote_code: bool = False,
         tokenizer_mode: Optional[str] = None,
         load_format: Optional[str] = None,

@@ -87,7 +87,7 @@ class EPTestSettings:
                               chunked_prefill=False),
             ],
             distributed_backends=["mp"],
-            task=task,
+            runner=runner,
             test_options=EPTestOptions(trust_remote_code=trust_remote_code,
                                        tokenizer_mode=tokenizer_mode,
                                        load_format=load_format,

@@ -100,7 +100,7 @@ class EPTestSettings:
         for parallel_setup in self.parallel_setups:
             for distributed_backend in self.distributed_backends:
                 yield (model_name, parallel_setup, distributed_backend,
-                       self.task, opts)
+                       self.runner, opts)
 
 
 # NOTE: You can adjust tp_base locally to fit the model in GPU

@@ -118,7 +118,7 @@ def _compare_tp(
     model_name: str,
     parallel_setup: ParallelSetup,
     distributed_backend: str,
-    task: TaskOption,
+    runner: RunnerOption,
     test_options: EPTestOptions,
     num_gpus_available: int,
     *,

@@ -154,8 +154,8 @@ def _compare_tp(
         common_args.append("--enable-chunked-prefill")
     if eager_mode:
         common_args.append("--enforce-eager")
-    if task != "auto":
-        common_args.extend(["--task", task])
+    if runner != "auto":
+        common_args.extend(["--runner", runner])
     if trust_remote_code:
         common_args.append("--trust-remote-code")
     if tokenizer_mode:

@@ -203,7 +203,7 @@ def _compare_tp(
 
 
 @pytest.mark.parametrize(
-    ("model_name", "parallel_setup", "distributed_backend", "task",
+    ("model_name", "parallel_setup", "distributed_backend", "runner",
      "test_options"),
     [
         params for model_name, settings in TEST_MODELS.items()

@@ -215,14 +215,14 @@ def test_ep(
     model_name: str,
     parallel_setup: ParallelSetup,
     distributed_backend: str,
-    task: TaskOption,
+    runner: RunnerOption,
     test_options: EPTestOptions,
     num_gpus_available,
 ):
     _compare_tp(model_name,
                 parallel_setup,
                 distributed_backend,
-                task,
+                runner,
                 test_options,
                 num_gpus_available,
                 method="generate")
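The `_compare_tp` change above only forwards `--runner` when it differs from the `"auto"` default. That guard can be sketched in isolation (a standalone toy, not the actual test helper):

```python
# Isolated sketch of the CLI-arg assembly in _compare_tp above: the
# --runner flag is only forwarded when it is not the "auto" default,
# mirroring how the old --task flag was handled.
def build_common_args(runner: str, eager_mode: bool = False) -> list[str]:
    args: list[str] = []
    if eager_mode:
        args.append("--enforce-eager")
    if runner != "auto":
        args.extend(["--runner", runner])
    return args


print(build_common_args("pooling"))  # ['--runner', 'pooling']
print(build_common_args("auto"))     # []
```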
@@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional
 
 import pytest
 
-from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, TaskOption
+from vllm.config import _FLOAT16_NOT_SUPPORTED_MODELS, RunnerOption
 from vllm.logger import init_logger
 from vllm.transformers_utils.config import get_config
 

@@ -60,7 +60,7 @@ class PPTestSettings:
     distributed_backends: list[str]
     # vllm major version: "0" for V0, "1" for V1
     vllm_major_versions: list[str]
-    task: TaskOption
+    runner: RunnerOption
     test_options: PPTestOptions
 
     def __post_init__(self):

@@ -76,7 +76,7 @@ class PPTestSettings:
         tp_base: int = 1,
         pp_base: int = 2,
         multi_node_only: bool = False,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         load_format: Optional[str] = None,
     ):
         return PPTestSettings(

@@ -104,7 +104,7 @@ class PPTestSettings:
             ],
             distributed_backends=["mp", "mp", "ray", "ray"],
             vllm_major_versions=["0", "1", "0", "1"],
-            task=task,
+            runner=runner,
             test_options=PPTestOptions(multi_node_only=multi_node_only,
                                        load_format=load_format),
         )

@@ -114,7 +114,7 @@ class PPTestSettings:
         *,
         tp_base: int = 1,
         pp_base: int = 2,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         multi_node_only: bool = False,
         load_format: Optional[str] = None,
     ):

@@ -127,7 +127,7 @@ class PPTestSettings:
             ],
             distributed_backends=["mp"],
             vllm_major_versions=["0"],
-            task=task,
+            runner=runner,
             test_options=PPTestOptions(multi_node_only=multi_node_only,
                                        load_format=load_format),
         )

@@ -139,7 +139,7 @@ class PPTestSettings:
         for backend, vllm_major_version in zip(self.distributed_backends,
                                                self.vllm_major_versions):
             yield (model_id, parallel_setup, backend, vllm_major_version,
-                   self.task, opts)
+                   self.runner, opts)
 
 
 # NOTE: You can adjust tp_base and/or pp_base locally to fit the model in GPU

@@ -211,10 +211,10 @@ TEXT_GENERATION_MODELS = {
 
 EMBEDDING_MODELS = {  # type: ignore[var-annotated]
     # [Text-only]
-    "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(task="embed"),
-    "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(task="embed"),
+    "intfloat/e5-mistral-7b-instruct": PPTestSettings.fast(runner="pooling"),
+    "BAAI/bge-multilingual-gemma2": PPTestSettings.fast(runner="pooling"),
     "Qwen/Qwen2.5-Math-RM-72B": PPTestSettings.fast(
-        load_format="dummy", task="embed"
+        load_format="dummy", runner="pooling"
     ),
 }
||||||
|
|
||||||
@ -269,7 +269,7 @@ def _compare_tp(
|
|||||||
parallel_setup: ParallelSetup,
|
parallel_setup: ParallelSetup,
|
||||||
distributed_backend: str,
|
distributed_backend: str,
|
||||||
vllm_major_version: str,
|
vllm_major_version: str,
|
||||||
task: TaskOption,
|
runner: RunnerOption,
|
||||||
test_options: PPTestOptions,
|
test_options: PPTestOptions,
|
||||||
num_gpus_available: int,
|
num_gpus_available: int,
|
||||||
*,
|
*,
|
||||||
@ -335,8 +335,8 @@ def _compare_tp(
|
|||||||
common_args.append("--enable-chunked-prefill")
|
common_args.append("--enable-chunked-prefill")
|
||||||
if eager_mode:
|
if eager_mode:
|
||||||
common_args.append("--enforce-eager")
|
common_args.append("--enforce-eager")
|
||||||
if task != "auto":
|
if runner != "auto":
|
||||||
common_args.extend(["--task", task])
|
common_args.extend(["--runner", runner])
|
||||||
if trust_remote_code:
|
if trust_remote_code:
|
||||||
common_args.append("--trust-remote-code")
|
common_args.append("--trust-remote-code")
|
||||||
if tokenizer_mode:
|
if tokenizer_mode:
|
||||||
@ -415,7 +415,7 @@ def _compare_tp(
|
|||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
||||||
"task", "test_options"),
|
"runner", "test_options"),
|
||||||
[
|
[
|
||||||
params for model_id, settings in TEXT_GENERATION_MODELS.items()
|
params for model_id, settings in TEXT_GENERATION_MODELS.items()
|
||||||
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
||||||
@ -427,7 +427,7 @@ def test_tp_language_generation(
|
|||||||
parallel_setup: ParallelSetup,
|
parallel_setup: ParallelSetup,
|
||||||
distributed_backend: str,
|
distributed_backend: str,
|
||||||
vllm_major_version: str,
|
vllm_major_version: str,
|
||||||
task: TaskOption,
|
runner: RunnerOption,
|
||||||
test_options: PPTestOptions,
|
test_options: PPTestOptions,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
):
|
):
|
||||||
@ -435,7 +435,7 @@ def test_tp_language_generation(
|
|||||||
parallel_setup,
|
parallel_setup,
|
||||||
distributed_backend,
|
distributed_backend,
|
||||||
vllm_major_version,
|
vllm_major_version,
|
||||||
task,
|
runner,
|
||||||
test_options,
|
test_options,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
method="generate",
|
method="generate",
|
||||||
@ -444,7 +444,7 @@ def test_tp_language_generation(
|
|||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
||||||
"task", "test_options"),
|
"runner", "test_options"),
|
||||||
[
|
[
|
||||||
params for model_id, settings in EMBEDDING_MODELS.items()
|
params for model_id, settings in EMBEDDING_MODELS.items()
|
||||||
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
||||||
@ -456,7 +456,7 @@ def test_tp_language_embedding(
|
|||||||
parallel_setup: ParallelSetup,
|
parallel_setup: ParallelSetup,
|
||||||
distributed_backend: str,
|
distributed_backend: str,
|
||||||
vllm_major_version: str,
|
vllm_major_version: str,
|
||||||
task: TaskOption,
|
runner: RunnerOption,
|
||||||
test_options: PPTestOptions,
|
test_options: PPTestOptions,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
):
|
):
|
||||||
@ -464,7 +464,7 @@ def test_tp_language_embedding(
|
|||||||
parallel_setup,
|
parallel_setup,
|
||||||
distributed_backend,
|
distributed_backend,
|
||||||
vllm_major_version,
|
vllm_major_version,
|
||||||
task,
|
runner,
|
||||||
test_options,
|
test_options,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
method="encode",
|
method="encode",
|
||||||
@ -473,7 +473,7 @@ def test_tp_language_embedding(
|
|||||||
|
|
||||||
@pytest.mark.parametrize(
|
@pytest.mark.parametrize(
|
||||||
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
|
||||||
"task", "test_options"),
|
"runner", "test_options"),
|
||||||
[
|
[
|
||||||
params for model_id, settings in MULTIMODAL_MODELS.items()
|
params for model_id, settings in MULTIMODAL_MODELS.items()
|
||||||
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
for params in settings.iter_params(model_id) if model_id in TEST_MODELS
|
||||||
@ -485,7 +485,7 @@ def test_tp_multimodal_generation(
|
|||||||
parallel_setup: ParallelSetup,
|
parallel_setup: ParallelSetup,
|
||||||
distributed_backend: str,
|
distributed_backend: str,
|
||||||
vllm_major_version: str,
|
vllm_major_version: str,
|
||||||
task: TaskOption,
|
runner: RunnerOption,
|
||||||
test_options: PPTestOptions,
|
test_options: PPTestOptions,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
):
|
):
|
||||||
@ -493,7 +493,7 @@ def test_tp_multimodal_generation(
|
|||||||
parallel_setup,
|
parallel_setup,
|
||||||
distributed_backend,
|
distributed_backend,
|
||||||
vllm_major_version,
|
vllm_major_version,
|
||||||
task,
|
runner,
|
||||||
test_options,
|
test_options,
|
||||||
num_gpus_available,
|
num_gpus_available,
|
||||||
method="generate",
|
method="generate",
|
||||||
|
|||||||
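The hunks above thread the renamed `runner` option through the test settings and pass it to the server as `--runner` instead of the deprecated `--task`. The value translation this commit applies in its tests can be sketched as a small helper; the function name `migrate_task_args` and the mapping table are illustrative only (inferred from the hunks in this commit, not a vLLM API):

```python
# Translation of deprecated ``--task`` values to the new ``--runner`` flag,
# as suggested by the test changes in this commit. Illustrative only.
DEPRECATED_TASK_TO_RUNNER = {
    "generate": "generate",  # generation tasks keep the same runner name
    "embed": "pooling",      # pooling-style tasks all collapse to "pooling"
    "score": "pooling",
    "reward": "pooling",
}


def migrate_task_args(args: list[str]) -> list[str]:
    """Rewrite ``["--task", <task>]`` pairs into ``["--runner", <runner>]``."""
    out: list[str] = []
    i = 0
    while i < len(args):
        if args[i] == "--task" and i + 1 < len(args):
            out.extend(["--runner", DEPRECATED_TASK_TO_RUNNER[args[i + 1]]])
            i += 2
        else:
            out.append(args[i])
            i += 1
    return out
```

For example, the old server fixture arguments `["--task", "embed", "--enforce-eager"]` become `["--runner", "pooling", "--enforce-eager"]`, matching the edits above.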
@@ -14,7 +14,7 @@ from typing import Literal, NamedTuple, Optional
 
 import pytest
 
-from vllm.config import TaskOption
+from vllm.config import RunnerOption
 from vllm.logger import init_logger
 
 from ..models.registry import HF_EXAMPLE_MODELS
@@ -48,7 +48,7 @@ class SPTestSettings:
     distributed_backends: list[str]
     # vllm major version: "0" for V0, "1" for V1
     vllm_major_versions: list[str]
-    task: TaskOption
+    runner: RunnerOption
     test_options: SPTestOptions
 
     def __post_init__(self):
@@ -64,7 +64,7 @@ class SPTestSettings:
         tp_base: int = 2,
         pp_base: int = 1,
         multi_node_only: bool = False,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         load_format: Optional[str] = None,
     ):
         parallel_setups = []
@@ -81,7 +81,7 @@ class SPTestSettings:
             parallel_setups=parallel_setups,
             distributed_backends=["mp", "ray"],
             vllm_major_versions=["1", "1"],
-            task=task,
+            runner=runner,
             test_options=SPTestOptions(multi_node_only=multi_node_only,
                                        load_format=load_format),
         )
@@ -91,7 +91,7 @@ class SPTestSettings:
         *,
         tp_base: int = 2,
         pp_base: int = 1,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         multi_node_only: bool = False,
         load_format: Optional[str] = None,
     ):
@@ -109,7 +109,7 @@ class SPTestSettings:
             parallel_setups=parallel_setups,
             distributed_backends=["mp", "ray"],
             vllm_major_versions=["1", "1"],
-            task=task,
+            runner=runner,
             test_options=SPTestOptions(multi_node_only=multi_node_only,
                                        load_format=load_format),
         )
@@ -119,7 +119,7 @@ class SPTestSettings:
         *,
         tp_base: int = 2,
         pp_base: int = 1,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
         multi_node_only: bool = False,
         load_format: Optional[str] = None,
     ):
@@ -135,7 +135,7 @@ class SPTestSettings:
             parallel_setups=parallel_setups,
             distributed_backends=["mp", "ray"],
             vllm_major_versions=["1", "1"],
-            task=task,
+            runner=runner,
             test_options=SPTestOptions(multi_node_only=multi_node_only,
                                        load_format=load_format),
         )
@@ -147,7 +147,7 @@ class SPTestSettings:
         for backend, vllm_major_version in zip(self.distributed_backends,
                                                self.vllm_major_versions):
             yield (model_id, parallel_setup, backend, vllm_major_version,
-                   self.task, opts)
+                   self.runner, opts)
 
 
 def _compare_sp(
@@ -155,7 +155,7 @@ def _compare_sp(
     parallel_setup: ParallelSetup,
     distributed_backend: str,
     vllm_major_version: str,
-    task: TaskOption,
+    runner: RunnerOption,
     test_options: SPTestOptions,
     num_gpus_available: int,
     *,
@@ -217,8 +217,8 @@ def _compare_sp(
         common_args.append("--enable-chunked-prefill")
     if eager_mode:
         common_args.append("--enforce-eager")
-    if task != "auto":
-        common_args.extend(["--task", task])
+    if runner != "auto":
+        common_args.extend(["--runner", runner])
     if trust_remote_code:
         common_args.append("--trust-remote-code")
     if tokenizer_mode:
@@ -298,7 +298,7 @@ SP_TEST_MODELS = [
 
 @pytest.mark.parametrize(
     ("model_id", "parallel_setup", "distributed_backend", "vllm_major_version",
-     "task", "test_options"),
+     "runner", "test_options"),
     [
         params for model_id, settings in SP_TEXT_GENERATION_MODELS.items()
         for params in settings.iter_params(model_id)
@@ -311,7 +311,7 @@ def test_tp_sp_generation(
     parallel_setup: ParallelSetup,
     distributed_backend: str,
     vllm_major_version: str,
-    task: TaskOption,
+    runner: RunnerOption,
     test_options: SPTestOptions,
     num_gpus_available,
 ):
@@ -319,7 +319,7 @@ def test_tp_sp_generation(
         parallel_setup,
         distributed_backend,
         vllm_major_version,
-        task,
+        runner,
         test_options,
         num_gpus_available,
         method="generate",
@@ -19,7 +19,8 @@ MAIN_SCORE = 0.7422994752439667
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task", "embed", "--enforce-eager", "--disable-uvicorn-access-log"
+        "--runner", "pooling", "--enforce-eager",
+        "--disable-uvicorn-access-log"
     ]
 
     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
@@ -21,7 +21,8 @@ MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task", "score", "--enforce-eager", "--disable-uvicorn-access-log"
+        "--runner", "pooling", "--enforce-eager",
+        "--disable-uvicorn-access-log"
     ]
 
     with RemoteOpenAIServer(MODEL_NAME, args) as remote_server:
@@ -15,10 +15,6 @@ MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
 def get_vocab_size(model_name):
     config = ModelConfig(
         model=model_name,
-        task="auto",
-        tokenizer=model_name,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
         seed=0,
         dtype="bfloat16",
     )
@@ -102,6 +102,7 @@ def test_get_gen_prompt(model, template, add_generation_prompt,
         tokenizer=model_info.tokenizer or model,
         tokenizer_mode=model_info.tokenizer_mode,
         trust_remote_code=model_info.trust_remote_code,
+        revision=model_info.revision,
         hf_overrides=model_info.hf_overrides,
     )
 
@@ -33,8 +33,8 @@ def v1(run_with_both_engines):
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         DTYPE,
@@ -42,8 +42,8 @@ def dtype(request):
 @pytest.fixture(scope="module")
 def server(model_info, dtype: str):
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         dtype,
@@ -21,7 +21,7 @@ LONG_TIMEOUT_SECONDS: Final[int] = 60
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
+        "--runner",
         "generate",
         "--max-model-len",
         "2048",
@@ -27,8 +27,8 @@ def server(request: pytest.FixtureRequest):
         passed_params = [passed_params]
 
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         "float16",
@@ -20,8 +20,8 @@ DUMMY_CHAT_TEMPLATE = """{% for message in messages %}{{message['role'] + ': ' +
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
-        "reward",
+        "--runner",
+        "pooling",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         "bfloat16",
@@ -26,8 +26,8 @@ def v1(run_with_both_engines):
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         # use half precision for speed and memory savings in CI environment
         "--dtype",
         DTYPE,
@@ -29,8 +29,8 @@ input = """Immerse yourself in the enchanting chronicle of calculus, a
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         "--dtype",
         "bfloat16",
         "--enforce-eager",
@@ -25,7 +25,7 @@ TEST_VIDEO_URLS = [
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
+        "--runner",
         "generate",
         "--max-model-len",
         "32768",
@@ -48,7 +48,7 @@ EXPECTED_MM_BEAM_SEARCH_RES = [
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
+        "--runner",
         "generate",
         "--max-model-len",
         "2048",
@@ -31,8 +31,8 @@ TEST_IMAGE_URLS = [
 @pytest.fixture(scope="module")
 def server():
     args = [
-        "--task",
-        "embed",
+        "--runner",
+        "pooling",
         "--max-model-len",
         "2048",
         "--max-num-seqs",
@@ -47,12 +47,8 @@ MISTRAL_MODEL_ID = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
 @pytest.fixture(scope="function")
 def phi3v_model_config():
     return ModelConfig(PHI3V_MODEL_ID,
-                       task="generate",
-                       tokenizer=PHI3V_MODEL_ID,
-                       tokenizer_mode="auto",
+                       runner="generate",
                        trust_remote_code=True,
-                       dtype="auto",
-                       seed=0,
                        limit_mm_per_prompt={
                            "image": 2,
                        })
@@ -61,12 +57,8 @@ def phi3v_model_config():
 @pytest.fixture(scope="function")
 def phi3v_model_config_mm_interleaved():
     return ModelConfig(PHI3V_MODEL_ID,
-                       task="generate",
-                       tokenizer=PHI3V_MODEL_ID,
-                       tokenizer_mode="auto",
+                       runner="generate",
                        trust_remote_code=True,
-                       dtype="auto",
-                       seed=0,
                        interleave_mm_strings=True,
                        limit_mm_per_prompt={
                            "image": 2,
@@ -86,11 +78,7 @@ def phi3v_tokenizer():
 @pytest.fixture(scope="function")
 def qwen25omni_model_config_mm_interleaved():
     return ModelConfig(QWEN25OMNI_MODEL_ID,
-                       task="generate",
-                       tokenizer=QWEN25OMNI_MODEL_ID,
-                       tokenizer_mode="auto",
-                       dtype="auto",
-                       seed=0,
+                       runner="generate",
                        interleave_mm_strings=True,
                        limit_mm_per_prompt={
                            "image": 2,
@@ -112,12 +100,7 @@ def qwen25omni_tokenizer():
 @pytest.fixture(scope="module")
 def mllama_model_config():
     return ModelConfig(MLLAMA_MODEL_ID,
-                       task="generate",
-                       tokenizer=MLLAMA_MODEL_ID,
-                       tokenizer_mode="auto",
-                       trust_remote_code=True,
-                       dtype="auto",
-                       seed=0,
+                       runner="generate",
                        limit_mm_per_prompt={
                            "image": 2,
                        })
@@ -136,12 +119,7 @@ def mllama_tokenizer():
 @pytest.fixture(scope="function")
 def mistral_model_config():
     return ModelConfig(MISTRAL_MODEL_ID,
-                       task="generate",
-                       tokenizer=MISTRAL_MODEL_ID,
-                       tokenizer_mode="auto",
-                       trust_remote_code=True,
-                       dtype="auto",
-                       seed=0,
+                       runner="generate",
                        limit_mm_per_prompt={
                            "image": 2,
                        })
@@ -1105,12 +1083,7 @@ def test_multimodal_image_parsing_matches_hf(model, image_url):
 
     # Build a config for the model
     model_config = ModelConfig(model,
-                               task="generate",
-                               tokenizer=model,
-                               tokenizer_mode="auto",
-                               trust_remote_code=True,
-                               dtype="auto",
-                               seed=0,
+                               runner="generate",
                                limit_mm_per_prompt={
                                    "image": 2,
                                })
@@ -1170,6 +1143,7 @@ def test_resolve_hf_chat_template(sample_json_schema, model, use_tools):
         model,
         tokenizer=model_info.tokenizer or model,
         tokenizer_mode=model_info.tokenizer_mode,
+        revision=model_info.revision,
         trust_remote_code=model_info.trust_remote_code,
         hf_overrides=model_info.hf_overrides,
     )
@@ -1225,6 +1199,7 @@ def test_resolve_content_format_hf_defined(model, expected_format):
         model,
         tokenizer=model_info.tokenizer or model,
         tokenizer_mode=model_info.tokenizer_mode,
+        revision=model_info.revision,
        trust_remote_code=model_info.trust_remote_code,
        hf_overrides=model_info.hf_overrides,
    )
@@ -1284,6 +1259,7 @@ def test_resolve_content_format_fallbacks(model, expected_format):
         model,
         tokenizer=model_info.tokenizer or model,
         tokenizer_mode=model_info.tokenizer_mode,
+        revision=model_info.revision,
         trust_remote_code=model_info.trust_remote_code,
         hf_overrides=model_info.hf_overrides,
     )
|||||||
@ -38,13 +38,8 @@ def test_worker_apply_lora(sql_lora_files):
|
|||||||
vllm_config = VllmConfig(
|
vllm_config = VllmConfig(
|
||||||
model_config=ModelConfig(
|
model_config=ModelConfig(
|
||||||
"meta-llama/Llama-2-7b-hf",
|
"meta-llama/Llama-2-7b-hf",
|
||||||
task="auto",
|
|
||||||
tokenizer="meta-llama/Llama-2-7b-hf",
|
|
||||||
tokenizer_mode="auto",
|
|
||||||
trust_remote_code=False,
|
|
||||||
seed=0,
|
seed=0,
|
||||||
dtype="float16",
|
dtype="float16",
|
||||||
revision=None,
|
|
||||||
enforce_eager=True,
|
enforce_eager=True,
|
||||||
),
|
),
|
||||||
load_config=LoadConfig(
|
load_config=LoadConfig(
|
||||||
|
|||||||
@@ -69,10 +69,7 @@ async def test_guided_logits_processor_black_box(backend: str, is_local: bool,
 
     config = ModelConfig(
         MODEL_NAME,
-        task="generate",
-        tokenizer=MODEL_NAME,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
+        runner="generate",
         seed=0,
         dtype="bfloat16",
     )
@@ -113,10 +110,7 @@ async def test_guided_logits_processor_with_reasoning(
 
     config = ModelConfig(
         REASONING_MODEL_NAME,
-        task="generate",
-        tokenizer=REASONING_MODEL_NAME,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
+        runner="generate",
         seed=0,
         dtype="bfloat16",
     )
|||||||
@ -57,7 +57,6 @@ def test_model_loading_with_params(vllm_runner, monkeypatch):
|
|||||||
|
|
||||||
vllm_model.apply_model(check_model)
|
vllm_model.apply_model(check_model)
|
||||||
|
|
||||||
# assert output
|
|
||||||
assert output
|
assert output
|
||||||
|
|
||||||
|
|
||||||
@ -99,7 +98,6 @@ def test_roberta_model_loading_with_params(vllm_runner, monkeypatch):
|
|||||||
|
|
||||||
vllm_model.apply_model(check_model)
|
vllm_model.apply_model(check_model)
|
||||||
|
|
||||||
# assert output
|
|
||||||
assert output
|
assert output
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@ -52,7 +52,7 @@ def correctness_test_embed_models(hf_runner,
|
|||||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||||
|
|
||||||
with vllm_runner(model_info.name,
|
with vllm_runner(model_info.name,
|
||||||
task="embed",
|
runner="pooling",
|
||||||
max_model_len=None,
|
max_model_len=None,
|
||||||
**vllm_extra_kwargs) as vllm_model:
|
**vllm_extra_kwargs) as vllm_model:
|
||||||
vllm_outputs = vllm_model.embed(example_prompts)
|
vllm_outputs = vllm_model.embed(example_prompts)
|
||||||
|
|||||||
@ -172,7 +172,7 @@ def mteb_test_embed_models(hf_runner,
|
|||||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||||
|
|
||||||
with vllm_runner(model_info.name,
|
with vllm_runner(model_info.name,
|
||||||
task="embed",
|
runner="pooling",
|
||||||
max_model_len=None,
|
max_model_len=None,
|
||||||
**vllm_extra_kwargs) as vllm_model:
|
**vllm_extra_kwargs) as vllm_model:
|
||||||
|
|
||||||
@ -279,15 +279,12 @@ def mteb_test_rerank_models(hf_runner,
|
|||||||
vllm_extra_kwargs["dtype"] = model_info.dtype
|
vllm_extra_kwargs["dtype"] = model_info.dtype
|
||||||
|
|
||||||
with vllm_runner(model_info.name,
|
with vllm_runner(model_info.name,
|
||||||
task="score",
|
runner="pooling",
|
||||||
max_model_len=None,
|
max_model_len=None,
|
||||||
max_num_seqs=8,
|
max_num_seqs=8,
|
||||||
**vllm_extra_kwargs) as vllm_model:
|
**vllm_extra_kwargs) as vllm_model:
|
||||||
|
|
||||||
model_config = vllm_model.llm.llm_engine.model_config
|
model_config = vllm_model.llm.llm_engine.model_config
|
||||||
|
|
||||||
if model_info.architecture:
|
|
||||||
assert (model_info.architecture in model_config.architectures)
|
|
||||||
assert model_config.hf_config.num_labels == 1
|
assert model_config.hf_config.num_labels == 1
|
||||||
|
|
||||||
vllm_main_score = run_mteb_rerank(vllm_mteb_encoder(vllm_model),
|
vllm_main_score = run_mteb_rerank(vllm_mteb_encoder(vllm_model),
|
||||||
|
|||||||
@ -85,7 +85,7 @@ def test_models(
|
|||||||
hf_outputs = hf_model.encode(example_prompts)
|
hf_outputs = hf_model.encode(example_prompts)
|
||||||
|
|
||||||
with vllm_runner(model,
|
with vllm_runner(model,
|
||||||
task="embed",
|
runner="pooling",
|
||||||
max_model_len=max_model_len,
|
max_model_len=max_model_len,
|
||||||
**vllm_extra_kwargs) as vllm_model:
|
**vllm_extra_kwargs) as vllm_model:
|
||||||
vllm_outputs = vllm_model.embed(example_prompts)
|
vllm_outputs = vllm_model.embed(example_prompts)
|
||||||
|
|||||||
@ -28,10 +28,7 @@ def test_find_array():
|
|||||||
|
|
||||||
model_config = ModelConfig(
|
model_config = ModelConfig(
|
||||||
MODEL_NAME,
|
MODEL_NAME,
|
||||||
task="embed",
|
runner="pooling",
|
||||||
tokenizer=MODEL_NAME,
|
|
||||||
tokenizer_mode="auto",
|
|
||||||
trust_remote_code=False,
|
|
||||||
dtype="bfloat16",
|
dtype="bfloat16",
|
||||||
seed=0,
|
seed=0,
|
||||||
)
|
)
|
||||||
@ -117,7 +114,7 @@ def test_gritlm_offline_embedding(vllm_runner):
|
|||||||
|
|
||||||
with vllm_runner(
|
with vllm_runner(
|
||||||
MODEL_NAME,
|
MODEL_NAME,
|
||||||
task="embed",
|
runner="pooling",
|
||||||
max_model_len=MAX_MODEL_LEN,
|
max_model_len=MAX_MODEL_LEN,
|
||||||
) as vllm_model:
|
) as vllm_model:
|
||||||
llm = vllm_model.llm
|
llm = vllm_model.llm
|
||||||
@ -140,7 +137,7 @@ def test_gritlm_offline_embedding(vllm_runner):
|
|||||||
async def test_gritlm_api_server_embedding():
|
async def test_gritlm_api_server_embedding():
|
||||||
queries, q_instruction, documents, d_instruction = get_test_data()
|
queries, q_instruction, documents, d_instruction = get_test_data()
|
||||||
|
|
||||||
args = ["--task", "embed", "--max_model_len", str(MAX_MODEL_LEN)]
|
args = ["--runner", "pooling", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||||
|
|
||||||
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
||||||
client_embedding = server.get_async_client()
|
client_embedding = server.get_async_client()
|
||||||
@ -164,7 +161,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
|
|||||||
|
|
||||||
with vllm_runner(
|
with vllm_runner(
|
||||||
MODEL_NAME,
|
MODEL_NAME,
|
||||||
task="generate",
|
runner="generate",
|
||||||
max_model_len=MAX_MODEL_LEN,
|
max_model_len=MAX_MODEL_LEN,
|
||||||
) as vllm_model:
|
) as vllm_model:
|
||||||
llm = vllm_model.llm
|
llm = vllm_model.llm
|
||||||
@ -179,7 +176,7 @@ def test_gritlm_offline_generate(monkeypatch: pytest.MonkeyPatch, vllm_runner):
|
|||||||
async def test_gritlm_api_server_generate():
|
async def test_gritlm_api_server_generate():
|
||||||
input = "<|user|>\nWhat is the capital of France?\n<|assistant|>\n"
|
input = "<|user|>\nWhat is the capital of France?\n<|assistant|>\n"
|
||||||
|
|
||||||
args = ["--task", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
|
args = ["--runner", "generate", "--max_model_len", str(MAX_MODEL_LEN)]
|
||||||
|
|
||||||
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
with RemoteOpenAIServer(MODEL_NAME, args) as server:
|
||||||
client_generate = server.get_async_client()
|
client_generate = server.get_async_client()
|
||||||
|
|||||||
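The hunks above replace the deprecated `--task` CLI flag with `--runner`. As an illustration only, here is a hypothetical helper (`migrate_cli_args` is not part of vLLM) that rewrites an old argument list using just the value mappings visible in this diff (`embed`/`score` become `pooling`, `generate` stays `generate`):

```python
# Hypothetical helper (not vLLM code): rewrite a deprecated
# ["--task", <value>] pair into the new ["--runner", <value>] form,
# using only the substitutions shown in this commit.
TASK_TO_RUNNER = {
    "generate": "generate",  # --task generate -> --runner generate
    "embed": "pooling",      # --task embed    -> --runner pooling
    "score": "pooling",      # --task score    -> --runner pooling
}

def migrate_cli_args(args: list[str]) -> list[str]:
    """Return a copy of args with --task replaced by --runner."""
    out = list(args)
    for i, arg in enumerate(out):
        if arg == "--task" and i + 1 < len(out):
            out[i] = "--runner"
            out[i + 1] = TASK_TO_RUNNER.get(out[i + 1], out[i + 1])
    return out
```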
@@ -4,6 +4,7 @@ from functools import partial

import pytest

+import vllm.envs as envs
from vllm import PoolingParams

from ...utils import EmbedModelInfo, RerankModelInfo
@@ -62,6 +63,10 @@ def test_embed_models_correctness(hf_runner, vllm_runner,
@pytest.mark.parametrize("model_info", RERANK_MODELS)
def test_rerank_models_mteb(hf_runner, vllm_runner,
                            model_info: RerankModelInfo) -> None:
+    if (model_info.architecture == "XLMRobertaForSequenceClassification"
+            and envs.VLLM_USE_V1):
+        pytest.skip("Not supported yet")
+
    mteb_test_rerank_models(hf_runner, vllm_runner, model_info)


@@ -92,7 +97,7 @@ def test_matryoshka(
        hf_outputs = matryoshka_fy(hf_outputs, dimensions)

    with vllm_runner(model_info.name,
-                     task="embed",
+                     runner="pooling",
                     dtype=dtype,
                     max_model_len=None) as vllm_model:
        assert vllm_model.llm.llm_engine.model_config.is_matryoshka
@@ -21,7 +21,7 @@ max_model_len = int(original_max_position_embeddings * factor)

@pytest.mark.parametrize("model_info", MODELS)
def test_default(model_info, vllm_runner):
-    with vllm_runner(model_info.name, task="embed",
+    with vllm_runner(model_info.name, runner="pooling",
                     max_model_len=None) as vllm_model:
        model_config = vllm_model.llm.llm_engine.model_config
        if model_info.name == "nomic-ai/nomic-embed-text-v2-moe":
@@ -36,7 +36,7 @@ def test_default(model_info, vllm_runner):
@pytest.mark.parametrize("model_info", MODELS)
def test_set_max_model_len_legal(model_info, vllm_runner):
    # set max_model_len <= 512
-    with vllm_runner(model_info.name, task="embed",
+    with vllm_runner(model_info.name, runner="pooling",
                     max_model_len=256) as vllm_model:
        model_config = vllm_model.llm.llm_engine.model_config
        assert model_config.max_model_len == 256
@@ -46,11 +46,12 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
        # For nomic-embed-text-v2-moe the length is set to 512
        # by sentence_bert_config.json.
        with pytest.raises(ValueError):
-            with vllm_runner(model_info.name, task="embed",
+            with vllm_runner(model_info.name,
+                             runner="pooling",
                             max_model_len=1024):
                pass
    else:
-        with vllm_runner(model_info.name, task="embed",
+        with vllm_runner(model_info.name, runner="pooling",
                         max_model_len=1024) as vllm_model:
            model_config = vllm_model.llm.llm_engine.model_config
            assert model_config.max_model_len == 1024
@@ -60,14 +61,15 @@ def test_set_max_model_len_legal(model_info, vllm_runner):
def test_set_max_model_len_illegal(model_info, vllm_runner):
    # set max_model_len > 2048
    with pytest.raises(ValueError):
-        with vllm_runner(model_info.name, task="embed", max_model_len=4096):
+        with vllm_runner(model_info.name, runner="pooling",
+                         max_model_len=4096):
            pass

    # set max_model_len > 2048 by hf_overrides
    hf_overrides = {"max_model_len": 4096}
    with pytest.raises(ValueError):
        with vllm_runner(model_info.name,
-                         task="embed",
+                         runner="pooling",
                         max_model_len=None,
                         hf_overrides=hf_overrides):
            pass
@@ -87,7 +89,7 @@ def test_use_rope_scaling_legal(model_info, vllm_runner):
    }

    with vllm_runner(model_info.name,
-                     task="embed",
+                     runner="pooling",
                     max_model_len=None,
                     hf_overrides=hf_overrides):
        pass
@@ -107,7 +109,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
    # illegal max_model_len
    with pytest.raises(ValueError):
        with vllm_runner(model_info.name,
-                         task="embed",
+                         runner="pooling",
                         max_model_len=max_model_len + 1,
                         hf_overrides=hf_overrides):
            pass
@@ -125,7 +127,7 @@ def test_use_rope_scaling_illegal(model_info, vllm_runner):
    # illegal max_model_len by hf_overrides
    with pytest.raises(ValueError):
        with vllm_runner(model_info.name,
-                         task="embed",
+                         runner="pooling",
                         max_model_len=None,
                         hf_overrides=hf_overrides):
            pass
@@ -37,7 +37,9 @@ def test_cross_encoder_1_to_1(vllm_runner, hf_runner, model_name):
    with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
        hf_outputs = hf_model.predict([text_pair]).tolist()

-    with vllm_runner(model_name, task="score", dtype=DTYPE,
+    with vllm_runner(model_name,
+                     runner="pooling",
+                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])

@@ -56,7 +58,9 @@ def test_cross_encoder_1_to_N(vllm_runner, hf_runner, model_name):
    with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
        hf_outputs = hf_model.predict(text_pairs).tolist()

-    with vllm_runner(model_name, task="score", dtype=DTYPE,
+    with vllm_runner(model_name,
+                     runner="pooling",
+                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)

@@ -76,7 +80,9 @@ def test_cross_encoder_N_to_N(vllm_runner, hf_runner, model_name):
    with hf_runner(model_name, dtype=DTYPE, is_cross_encoder=True) as hf_model:
        hf_outputs = hf_model.predict(text_pairs).tolist()

-    with vllm_runner(model_name, task="score", dtype=DTYPE,
+    with vllm_runner(model_name,
+                     runner="pooling",
+                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)

@@ -103,7 +109,7 @@ def test_embedding_1_to_1(vllm_runner, hf_runner, emb_model_name):
    ]

    with vllm_runner(emb_model_name,
-                     task="embed",
+                     runner="pooling",
                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(text_pair[0], text_pair[1])
@@ -131,7 +137,7 @@ def test_embedding_1_to_N(vllm_runner, hf_runner, emb_model_name):
    ]

    with vllm_runner(emb_model_name,
-                     task="embed",
+                     runner="pooling",
                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(TEXTS_1[0], TEXTS_2)
@@ -160,7 +166,7 @@ def test_embedding_N_to_N(vllm_runner, hf_runner, emb_model_name):
    ]

    with vllm_runner(emb_model_name,
-                     task="embed",
+                     runner="pooling",
                     dtype=DTYPE,
                     max_model_len=None) as vllm_model:
        vllm_outputs = vllm_model.score(TEXTS_1, TEXTS_2)
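Across the Python call sites in these hunks the pattern is uniform: the `task=` keyword becomes `runner=`, with the pooling-style tasks (`embed`, `score`) collapsing onto `runner="pooling"`. A hypothetical kwargs rewriter (not part of vLLM) sketching that rule:

```python
# Hypothetical sketch (not vLLM code): rewrite a deprecated task= kwarg
# into the runner= kwarg, following the substitutions in this diff.
POOLING_TASKS = {"embed", "score"}  # both become runner="pooling" here

def migrate_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with task= translated to runner=."""
    out = dict(kwargs)
    task = out.pop("task", None)
    if task is not None:
        out["runner"] = "pooling" if task in POOLING_TASKS else task
    return out
```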
@@ -26,7 +26,7 @@ def test_smaller_truncation_size(vllm_runner,

    truncate_prompt_tokens = 10

-    with vllm_runner(model_name, task="embed",
+    with vllm_runner(model_name, runner="pooling",
                     max_model_len=max_model_len) as vllm_model:
        vllm_output = vllm_model.llm.encode(
            input_str, truncate_prompt_tokens=truncate_prompt_tokens)
@@ -41,7 +41,7 @@ def test_max_truncation_size(vllm_runner,
                             input_str=input_str):
    truncate_prompt_tokens = -1

-    with vllm_runner(model_name, task="embed",
+    with vllm_runner(model_name, runner="pooling",
                     max_model_len=max_model_len) as vllm_model:
        vllm_output = vllm_model.llm.encode(
            input_str, truncate_prompt_tokens=truncate_prompt_tokens)
@@ -58,7 +58,7 @@ def test_bigger_truncation_size(vllm_runner,
    truncate_prompt_tokens = max_model_len + 1

    with pytest.raises(ValueError), vllm_runner(
-            model_name, task="embed",
+            model_name, runner="pooling",
            max_model_len=max_model_len) as vllm_model:

        llm_output = vllm_model.llm.encode(
@@ -222,7 +222,6 @@ VLM_TEST_SETTINGS = {
        },
        marks=[large_gpu_mark(min_gb=32)],
    ),
-    # Check "auto" with fallback to transformers
    "internvl-transformers": VLMTestInfo(
        models=["OpenGVLab/InternVL3-1B-hf"],
        test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
@@ -232,7 +231,7 @@ VLM_TEST_SETTINGS = {
        use_tokenizer_eos=True,
        image_size_factors=[(0.25, 0.5, 1.0)],
        vllm_runner_kwargs={
-            "model_impl": "auto",
+            "model_impl": "transformers",
        },
        auto_cls=AutoModelForImageTextToText,
        marks=[pytest.mark.core_model],
@@ -638,7 +637,7 @@ VLM_TEST_SETTINGS = {
        img_idx_to_prompt=lambda idx: f"<|image_{idx}|>\n",
        max_model_len=4096,
        max_num_seqs=2,
-        task="generate",
+        runner="generate",
        # use sdpa mode for hf runner since phi3v didn't work with flash_attn
        hf_model_kwargs={"_attn_implementation": "sdpa"},
        use_tokenizer_eos=True,
@@ -65,7 +65,7 @@ def run_test(
    # max_model_len should be greater than image_feature_size
    with vllm_runner(
        model,
-        task="generate",
+        runner="generate",
        max_model_len=max_model_len,
        max_num_seqs=1,
        dtype=dtype,
@@ -48,7 +48,7 @@ def test_models(vllm_runner, model, dtype: str, max_tokens: int) -> None:
    ]

    with vllm_runner(model,
-                     task="generate",
+                     runner="generate",
                     dtype=dtype,
                     limit_mm_per_prompt={"image": 2},
                     max_model_len=32768,
@@ -99,7 +99,7 @@ def run_test(
    # max_model_len should be greater than image_feature_size
    with vllm_runner(
        model,
-        task="generate",
+        runner="generate",
        max_model_len=max_model_len,
        max_num_seqs=2,
        dtype=dtype,
@@ -267,7 +267,7 @@ def run_embedding_input_test(

    # max_model_len should be greater than image_feature_size
    with vllm_runner(model,
-                     task="generate",
+                     runner="generate",
                     max_model_len=4000,
                     max_num_seqs=3,
                     dtype=dtype,
@@ -6,7 +6,7 @@ from typing import Any, Callable, Optional
import torch
from transformers.models.auto.auto_factory import _BaseAutoModelClass

-from vllm.config import TaskOption
+from vllm.config import RunnerOption
from vllm.transformers_utils.tokenizer import AnyTokenizer

from .....conftest import HfRunner, VllmRunner
@@ -37,7 +37,7 @@ def run_test(
    vllm_runner_kwargs: Optional[dict[str, Any]],
    hf_model_kwargs: Optional[dict[str, Any]],
    patch_hf_runner: Optional[Callable[[HfRunner], HfRunner]],
-    task: TaskOption = "auto",
+    runner: RunnerOption = "auto",
    distributed_executor_backend: Optional[str] = None,
    tensor_parallel_size: int = 1,
    vllm_embeddings: Optional[torch.Tensor] = None,
@@ -83,7 +83,7 @@ def run_test(
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
                     enforce_eager=enforce_eager,
-                     task=task,
+                     runner=runner,
                     **vllm_runner_kwargs_) as vllm_model:
        tokenizer = vllm_model.llm.get_tokenizer()

@@ -11,7 +11,7 @@ from pytest import MarkDecorator
from transformers import AutoModelForCausalLM
from transformers.models.auto.auto_factory import _BaseAutoModelClass

-from vllm.config import TaskOption
+from vllm.config import RunnerOption
from vllm.sequence import SampleLogprobs
from vllm.transformers_utils.tokenizer import AnyTokenizer

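The import changes above swap `TaskOption` for `RunnerOption` as the annotation of the renamed `runner` parameter. As a rough sketch only (the actual definition in `vllm.config` may list different members), such an option type could be expressed as a `Literal` covering the values exercised in this diff:

```python
from typing import Literal, get_args

# Illustrative stand-in for vllm.config.RunnerOption (the real definition
# may differ); "auto", "generate" and "pooling" are the values that appear
# in the hunks above.
RunnerOption = Literal["auto", "generate", "pooling"]

def is_valid_runner(value: str) -> bool:
    """Check a string against the sketched option values."""
    return value in get_args(RunnerOption)
```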
@@ -109,7 +109,7 @@ class VLMTestInfo(NamedTuple):
    enforce_eager: bool = True
    max_model_len: int = 1024
    max_num_seqs: int = 256
-    task: TaskOption = "auto"
+    runner: RunnerOption = "auto"
    tensor_parallel_size: int = 1
    vllm_runner_kwargs: Optional[dict[str, Any]] = None

@@ -173,7 +173,7 @@ class VLMTestInfo(NamedTuple):
            "enforce_eager": self.enforce_eager,
            "max_model_len": self.max_model_len,
            "max_num_seqs": self.max_num_seqs,
-            "task": self.task,
+            "runner": self.runner,
            "tensor_parallel_size": self.tensor_parallel_size,
            "vllm_runner_kwargs": self.vllm_runner_kwargs,
            "hf_output_post_proc": self.hf_output_post_proc,
@@ -92,7 +92,7 @@ def _run_test(
    # if we run HF first, the cuda initialization will be done and it
    # will hurt multiprocessing backend with fork method (the default method).
    with vllm_runner(model,
-                     task="embed",
+                     runner="pooling",
                     dtype=dtype,
                     enforce_eager=True,
                     max_model_len=8192) as vllm_model:
@@ -49,7 +49,7 @@ def vllm_reranker(

    with vllm_runner(
        model_name,
-        task="score",
+        runner="pooling",
        dtype=dtype,
        max_num_seqs=2,
        max_model_len=2048,
@@ -64,7 +64,7 @@ def _run_test(
    # if we run HF first, the cuda initialization will be done and it
    # will hurt multiprocessing backend with fork method (the default method).
    with vllm_runner(model,
-                     task="embed",
+                     runner="pooling",
                     dtype=dtype,
                     max_model_len=4096,
                     enforce_eager=True) as vllm_model:
@@ -44,7 +44,7 @@ def _run_test(
    # vLLM needs a fresh new process without cuda initialization.
    # if we run HF first, the cuda initialization will be done and it
    # will hurt multiprocessing backend with fork method (the default method).
-    with vllm_runner(model, task="embed", dtype=dtype,
+    with vllm_runner(model, runner="pooling", dtype=dtype,
                     enforce_eager=True) as vllm_model:
        vllm_outputs = vllm_model.embed(input_texts, images=input_images)

@@ -34,7 +34,7 @@ def _run_test(
        set_default_torch_num_threads(1),
        vllm_runner(
            model,
-            task="embed",
+            runner="pooling",
            dtype=torch.float16,
            enforce_eager=True,
            skip_tokenizer_init=True,
@@ -58,13 +58,10 @@ def _test_processing_correctness(

    model_config = ModelConfig(
        model_id,
-        task="auto",
        tokenizer=model_info.tokenizer or model_id,
        tokenizer_mode=model_info.tokenizer_mode,
-        trust_remote_code=model_info.trust_remote_code,
-        seed=0,
-        dtype="auto",
        revision=model_info.revision,
+        trust_remote_code=model_info.trust_remote_code,
        hf_overrides=model_info.hf_overrides,
    )

@@ -54,13 +54,10 @@ def test_hf_model_weights_mapper(model_arch: str):

    model_config = ModelConfig(
        model_id,
-        task="auto",
        tokenizer=model_info.tokenizer or model_id,
        tokenizer_mode=model_info.tokenizer_mode,
+        revision=model_info.revision,
        trust_remote_code=model_info.trust_remote_code,
-        seed=0,
-        dtype="auto",
-        revision=None,
        hf_overrides=model_info.hf_overrides,
    )
    model_cls = MULTIMODAL_REGISTRY._get_model_cls(model_config)
@@ -172,7 +172,7 @@ def test_4bit_bnb_embedding_model(

    # Inflight 4bit quantization
    with vllm_runner(model_name,
-                     task="embed",
+                     runner="pooling",
                     dtype=dtype,
                     gpu_memory_utilization=0.5,
                     quantization="bitsandbytes") as vllm_model:
@@ -7,13 +7,15 @@ import pytest
from transformers import PretrainedConfig

from vllm import LLM
+from vllm.config import ModelImpl
from vllm.engine.llm_engine import LLMEngine as V0LLMEngine
from vllm.utils import GiB_bytes
from vllm.v1.core.kv_cache_utils import get_kv_cache_config
from vllm.v1.engine.core import EngineCore as V1EngineCore

from ..utils import create_new_process_for_each_test
-from .registry import AUTO_EXAMPLE_MODELS, HF_EXAMPLE_MODELS, HfExampleModels
+from .registry import (_TRANSFORMERS_BACKEND_MODELS, AUTO_EXAMPLE_MODELS,
+                       HF_EXAMPLE_MODELS, HfExampleModels)


@create_new_process_for_each_test()
@@ -126,6 +128,8 @@ def can_initialize(model_arch: str, monkeypatch: pytest.MonkeyPatch,
        # these tests seem to produce leftover memory
        gpu_memory_utilization=0.80,
        load_format="dummy",
+        model_impl=ModelImpl.TRANSFORMERS
+        if model_arch in _TRANSFORMERS_BACKEND_MODELS else ModelImpl.VLLM,
        hf_overrides=hf_overrides,
    )

@@ -24,11 +24,9 @@ from .registry import HF_EXAMPLE_MODELS

@pytest.mark.parametrize("model_arch", ModelRegistry.get_supported_archs())
def test_registry_imports(model_arch):
-    model_info = HF_EXAMPLE_MODELS.get_hf_info(model_arch)
-    model_info.check_transformers_version(on_fail="skip")

    # Ensure all model classes can be imported successfully
-    model_cls, _ = ModelRegistry.resolve_model_cls(model_arch)
+    model_cls = ModelRegistry._try_load_model_cls(model_arch)
+    assert model_cls is not None

    if model_arch in _SPECULATIVE_DECODING_MODELS:
        return  # Ignore these models which do not have a unified format
@@ -56,14 +54,16 @@ def test_registry_imports(model_arch):
    ("XLMRobertaForSequenceClassification", False, False, True),
])
def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
-    assert ModelRegistry.is_multimodal_model(model_arch) is is_mm
+    model_info = ModelRegistry._try_inspect_model_cls(model_arch)
+    assert model_info is not None

-    assert ModelRegistry.is_cross_encoder_model(model_arch) is is_ce
+    assert model_info.supports_multimodal is is_mm
+    assert model_info.supports_cross_encoding is is_ce

    if init_cuda and current_platform.is_cuda_alike():
        assert not torch.cuda.is_initialized()

-        ModelRegistry.resolve_model_cls(model_arch)
+        ModelRegistry._try_load_model_cls(model_arch)
        if not torch.cuda.is_initialized():
            warnings.warn(
                "This model no longer initializes CUDA on import. "
@@ -82,12 +82,15 @@ def test_registry_model_property(model_arch, is_mm, init_cuda, is_ce):
    ("Qwen2VLForConditionalGeneration", True, True),
])
def test_registry_is_pp(model_arch, is_pp, init_cuda):
-    assert ModelRegistry.is_pp_supported_model(model_arch) is is_pp
+    model_info = ModelRegistry._try_inspect_model_cls(model_arch)
+    assert model_info is not None
+
+    assert model_info.supports_pp is is_pp

    if init_cuda and current_platform.is_cuda_alike():
        assert not torch.cuda.is_initialized()

-        ModelRegistry.resolve_model_cls(model_arch)
+        ModelRegistry._try_load_model_cls(model_arch)
        if not torch.cuda.is_initialized():
            warnings.warn(
                "This model no longer initializes CUDA on import. "
@@ -33,6 +33,10 @@ def check_implementation(
    args = (example_prompts, max_tokens, num_logprobs)

    with runner_test(model, **kwargs_test, **kwargs) as model_test:
+        model_config = model_test.llm.llm_engine.model_config
+        assert model_config.architecture == (
+            model_config._get_transformers_backend_cls())
+
        outputs_test = model_test.generate_greedy_logprobs(*args)

    with runner_ref(model, **kwargs_ref) as model_ref:
@@ -130,8 +134,13 @@ def test_quantization(
            model_impl="transformers",
            enforce_eager=True,
            **quantization_kwargs) as vllm_model:  # type: ignore[arg-type]
+        model_config = vllm_model.llm.llm_engine.model_config
+        assert model_config.architecture == (
+            model_config._get_transformers_backend_cls())
+
        transformers_outputs = vllm_model.generate_greedy_logprobs(
            example_prompts, max_tokens=max_tokens, num_logprobs=num_logprobs)

    check_logprobs_close(
        outputs_0_lst=transformers_outputs,
        outputs_1_lst=vllm_outputs,
@@ -151,7 +160,6 @@ def test_classify(
    example_prompts,
    model: str,
    dtype: str,
-    monkeypatch,
) -> None:
    import torch
    from transformers import AutoModelForSequenceClassification
@@ -160,6 +168,10 @@ def test_classify(
                    max_model_len=512,
                    dtype=dtype,
                    model_impl="transformers") as vllm_model:
+        model_config = vllm_model.llm.llm_engine.model_config
+        assert model_config.architecture == (
+            model_config._get_transformers_backend_cls())
+
        vllm_outputs = vllm_model.classify(example_prompts)

    with hf_runner(model,
@@ -8,7 +8,7 @@ from typing import Any, NamedTuple, Optional, Union
 import torch
 import torch.nn.functional as F
 
-from vllm.config import ModelConfig, TaskOption
+from vllm.config import ModelConfig, RunnerOption
 from vllm.inputs import InputContext
 from vllm.sequence import Logprob, PromptLogprobs, SampleLogprobs
 
@@ -255,7 +255,7 @@ def check_logprobs_close(
 
 def build_model_context(
     model_id: str,
-    task: TaskOption = "auto",
+    runner: RunnerOption = "auto",
     dtype: Union[str, torch.dtype] = "auto",
     model_config_kwargs: Optional[dict[str, Any]] = None,
    mm_processor_kwargs: Optional[dict[str, Any]] = None,
@@ -280,9 +280,10 @@ def build_model_context(
     model_config_kwargs = model_config_kwargs or {}
     model_config = ModelConfig(
         model_id,
-        task=task,
+        runner=runner,
         tokenizer=model_info.tokenizer or model_id,
         tokenizer_mode=model_info.tokenizer_mode,
+        revision=model_info.revision,
         trust_remote_code=model_info.trust_remote_code,
         dtype=dtype,
         seed=0,
@@ -954,13 +954,6 @@ def test_limit_mm_per_prompt_dummy(model_id, limit, num_supported, is_valid):
 
     model_config = ModelConfig(
         model=model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="auto",
-        revision=None,
         limit_mm_per_prompt=limit_mm_per_prompt,
     )
 
@@ -993,13 +986,6 @@ def test_limit_mm_per_prompt_apply(model_id, num_images, limit, is_valid):
 
     model_config = ModelConfig(
         model=model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="auto",
-        revision=None,
         limit_mm_per_prompt=limit_mm_per_prompt,
     )
 
@@ -1061,16 +1047,7 @@ class _ProcessorProxy:
 )
 # yapf: enable
 def test_hf_processor_kwargs(model_id, call_kwargs, expected_kwargs):
-    model_config = ModelConfig(
-        model=model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="auto",
-        revision=None,
-    )
+    model_config = ModelConfig(model_id)
 
     processor = MULTIMODAL_REGISTRY.create_processor(model_config)
     orig_get_hf_processor = processor.info.get_hf_processor
@@ -57,15 +57,7 @@ def test_auto_gptq(model_arg_exptype: tuple[str, None, str]) -> None:
     model_path, quantization_arg, expected_type = model_arg_exptype
 
     try:
-        model_config = ModelConfig(model_path,
-                                   task="auto",
-                                   tokenizer=model_path,
-                                   tokenizer_mode="auto",
-                                   trust_remote_code=False,
-                                   seed=0,
-                                   dtype="float16",
-                                   revision=None,
-                                   quantization=quantization_arg)
+        model_config = ModelConfig(model_path, quantization=quantization_arg)
         found_quantization_type = model_config.quantization
     except ValueError:
         found_quantization_type = "ERROR"
@@ -74,115 +74,116 @@ def test_update_config():
     new_config3 = update_config(config3, {"a": "new_value"})
 
 
+# Can remove once --task option is fully deprecated
 @pytest.mark.parametrize(
-    ("model_id", "expected_runner_type", "expected_task"),
+    ("model_id", "expected_runner_type", "expected_convert_type",
+     "expected_task"),
     [
-        ("distilbert/distilgpt2", "generate", "generate"),
-        ("intfloat/multilingual-e5-small", "pooling", "embed"),
-        ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
-        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
-        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "reward"),
-        ("openai/whisper-small", "generate", "transcription"),
+        ("distilbert/distilgpt2", "generate", "none", "generate"),
+        ("intfloat/multilingual-e5-small", "pooling", "none", "embed"),
+        ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
+        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none",
+         "classify"),
+        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none", "reward"),
+        ("openai/whisper-small", "generate", "none", "transcription"),
     ],
 )
-def test_auto_task(model_id, expected_runner_type, expected_task):
-    config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-    )
+def test_auto_task(model_id, expected_runner_type, expected_convert_type,
+                   expected_task):
+    config = ModelConfig(model_id, task="auto")
 
     assert config.runner_type == expected_runner_type
+    assert config.convert_type == expected_convert_type
+    assert expected_task in config.supported_tasks
 
-    if config.runner_type == "pooling":
-        assert config.task == expected_task
-    else:
-        assert expected_task in config.supported_tasks
+
+# Can remove once --task option is fully deprecated
+@pytest.mark.parametrize(
+    ("model_id", "expected_runner_type", "expected_convert_type",
+     "expected_task"),
+    [
+        ("distilbert/distilgpt2", "pooling", "embed", "embed"),
+        ("intfloat/multilingual-e5-small", "pooling", "embed", "embed"),
+        ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify", "classify"),
+        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify",
+         "classify"),
+        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed", "embed"),
+        ("openai/whisper-small", "pooling", "embed", "embed"),
+    ],
+)
+def test_score_task(model_id, expected_runner_type, expected_convert_type,
+                    expected_task):
+    config = ModelConfig(model_id, task="score")
+
+    assert config.runner_type == expected_runner_type
+    assert config.convert_type == expected_convert_type
+    assert expected_task in config.supported_tasks
+
+
+# Can remove once --task option is fully deprecated
+@pytest.mark.parametrize(
+    ("model_id", "expected_runner_type", "expected_convert_type",
+     "expected_task"),
+    [
+        ("openai/whisper-small", "generate", "none", "transcription"),
+    ],
+)
+def test_transcription_task(model_id, expected_runner_type,
+                            expected_convert_type, expected_task):
+    config = ModelConfig(model_id, task="transcription")
+
+    assert config.runner_type == expected_runner_type
+    assert config.convert_type == expected_convert_type
+    assert expected_task in config.supported_tasks
 
 
 @pytest.mark.parametrize(
-    ("model_id", "expected_runner_type", "expected_task"),
+    ("model_id", "expected_runner_type", "expected_convert_type"),
+    [
+        ("distilbert/distilgpt2", "generate", "none"),
+        ("intfloat/multilingual-e5-small", "pooling", "none"),
+        ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
+        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
+        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
+        ("openai/whisper-small", "generate", "none"),
+    ],
+)
+def test_auto_runner(model_id, expected_runner_type, expected_convert_type):
+    config = ModelConfig(model_id, runner="auto")
+
+    assert config.runner_type == expected_runner_type
+    assert config.convert_type == expected_convert_type
+
+
+@pytest.mark.parametrize(
+    ("model_id", "expected_runner_type", "expected_convert_type"),
     [
         ("distilbert/distilgpt2", "pooling", "embed"),
-        ("intfloat/multilingual-e5-small", "pooling", "embed"),
+        ("intfloat/multilingual-e5-small", "pooling", "none"),
         ("jason9693/Qwen2.5-1.5B-apeach", "pooling", "classify"),
-        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "classify"),
-        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "embed"),
+        ("cross-encoder/ms-marco-MiniLM-L-6-v2", "pooling", "none"),
+        ("Qwen/Qwen2.5-Math-RM-72B", "pooling", "none"),
         ("openai/whisper-small", "pooling", "embed"),
     ],
 )
-def test_score_task(model_id, expected_runner_type, expected_task):
-    config = ModelConfig(
-        model_id,
-        task="score",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-    )
+def test_pooling_runner(model_id, expected_runner_type, expected_convert_type):
+    config = ModelConfig(model_id, runner="pooling")
 
     assert config.runner_type == expected_runner_type
-    assert config.task == expected_task
-
-
-@pytest.mark.parametrize(("model_id", "expected_runner_type", "expected_task"),
-                         [
-                             ("Qwen/Qwen2.5-1.5B-Instruct", "draft", "auto"),
-                         ])
-def test_draft_task(model_id, expected_runner_type, expected_task):
-    config = ModelConfig(
-        model_id,
-        runner="draft",
-        tokenizer=model_id,
-        seed=0,
-        dtype="float16",
-    )
-
-    assert config.runner_type == expected_runner_type
-    assert config.task == expected_task
+    assert config.convert_type == expected_convert_type
 
 
 @pytest.mark.parametrize(
-    ("model_id", "expected_runner_type", "expected_task"),
+    ("model_id", "expected_runner_type", "expected_convert_type"),
     [
-        ("openai/whisper-small", "generate", "transcription"),
+        ("Qwen/Qwen2.5-1.5B-Instruct", "draft", "none"),
     ],
 )
-def test_transcription_task(model_id, expected_runner_type, expected_task):
-    config = ModelConfig(
-        model_id,
-        task="transcription",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-    )
+def test_draft_runner(model_id, expected_runner_type, expected_convert_type):
+    config = ModelConfig(model_id, runner="draft")
 
     assert config.runner_type == expected_runner_type
-    assert config.task == expected_task
-
-
-@pytest.mark.parametrize(("model_id", "bad_task"), [
-    ("Qwen/Qwen2.5-Math-RM-72B", "generate"),
-    ("Qwen/Qwen3-0.6B", "transcription"),
-])
-def test_incorrect_task(model_id, bad_task):
-    with pytest.raises(ValueError, match=r"does not support task=.*"):
-        ModelConfig(
-            model_id,
-            task=bad_task,
-            tokenizer=model_id,
-            tokenizer_mode="auto",
-            trust_remote_code=False,
-            seed=0,
-            dtype="float16",
-        )
+    assert config.convert_type == expected_convert_type
 
 
 MODEL_IDS_EXPECTED = [
@@ -195,17 +196,7 @@ MODEL_IDS_EXPECTED = [
 @pytest.mark.parametrize("model_id_expected", MODEL_IDS_EXPECTED)
 def test_disable_sliding_window(model_id_expected):
     model_id, expected = model_id_expected
-    model_config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-        disable_sliding_window=True,
-    )
+    model_config = ModelConfig(model_id, disable_sliding_window=True)
     assert model_config.max_model_len == expected
 
 
@@ -214,16 +205,7 @@ def test_get_sliding_window():
     # Test that the sliding window is correctly computed.
     # For Qwen1.5/Qwen2, get_sliding_window() should be None
     # when use_sliding_window is False.
-    qwen2_model_config = ModelConfig(
-        "Qwen/Qwen1.5-7B",
-        task="auto",
-        tokenizer="Qwen/Qwen1.5-7B",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-    )
+    qwen2_model_config = ModelConfig("Qwen/Qwen1.5-7B")
 
     qwen2_model_config.hf_config.use_sliding_window = False
     qwen2_model_config.hf_config.sliding_window = TEST_SLIDING_WINDOW
@@ -232,16 +214,7 @@ def test_get_sliding_window():
     qwen2_model_config.hf_config.use_sliding_window = True
     assert qwen2_model_config.get_sliding_window() == TEST_SLIDING_WINDOW
 
-    mistral_model_config = ModelConfig(
-        "mistralai/Mistral-7B-v0.1",
-        task="auto",
-        tokenizer="mistralai/Mistral-7B-v0.1",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-    )
+    mistral_model_config = ModelConfig("mistralai/Mistral-7B-v0.1")
     mistral_model_config.hf_config.sliding_window = None
     assert mistral_model_config.get_sliding_window() is None
 
@@ -253,16 +226,7 @@ def test_get_sliding_window():
                     reason="Xformers backend is not supported on ROCm.")
 def test_get_pooling_config():
     model_id = "sentence-transformers/all-MiniLM-L12-v2"
-    model_config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-    )
+    model_config = ModelConfig(model_id)
 
     pooling_config = model_config._init_pooler_config()
     assert pooling_config is not None
@@ -275,14 +239,7 @@ def test_get_pooling_config():
                     reason="Xformers backend is not supported on ROCm.")
 def test_get_pooling_config_from_args():
     model_id = "sentence-transformers/all-MiniLM-L12-v2"
-    model_config = ModelConfig(model_id,
-                               task="auto",
-                               tokenizer=model_id,
-                               tokenizer_mode="auto",
-                               trust_remote_code=False,
-                               seed=0,
-                               dtype="float16",
-                               revision=None)
+    model_config = ModelConfig(model_id)
 
     override_pooler_config = PoolerConfig(pooling_type='CLS', normalize=True)
     model_config.override_pooler_config = override_pooler_config
@@ -295,16 +252,8 @@ def test_get_pooling_config_from_args():
 @pytest.mark.skipif(current_platform.is_rocm(),
                     reason="Xformers backend is not supported on ROCm.")
 def test_get_bert_tokenization_sentence_transformer_config():
-    bge_model_config = ModelConfig(
-        model="BAAI/bge-base-en-v1.5",
-        task="auto",
-        tokenizer="BAAI/bge-base-en-v1.5",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-    )
+    model_id = "BAAI/bge-base-en-v1.5"
+    bge_model_config = ModelConfig(model_id)
 
     bert_bge_model_config = bge_model_config._get_encoder_config()
 
@@ -317,27 +266,13 @@ def test_rope_customization():
     TEST_ROPE_THETA = 16_000_000.0
     LONGCHAT_ROPE_SCALING = {"rope_type": "linear", "factor": 8.0}
 
-    llama_model_config = ModelConfig(
-        "meta-llama/Meta-Llama-3-8B-Instruct",
-        task="auto",
-        tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
-    )
+    llama_model_config = ModelConfig("meta-llama/Meta-Llama-3-8B-Instruct")
     assert getattr(llama_model_config.hf_config, "rope_scaling", None) is None
     assert getattr(llama_model_config.hf_config, "rope_theta", None) == 500_000
     assert llama_model_config.max_model_len == 8192
 
     llama_model_config = ModelConfig(
         "meta-llama/Meta-Llama-3-8B-Instruct",
-        task="auto",
-        tokenizer="meta-llama/Meta-Llama-3-8B-Instruct",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
         hf_overrides={
             "rope_scaling": TEST_ROPE_SCALING,
             "rope_theta": TEST_ROPE_THETA,
@@ -349,15 +284,7 @@ def test_rope_customization():
                    None) == TEST_ROPE_THETA
     assert llama_model_config.max_model_len == 16384
 
-    longchat_model_config = ModelConfig(
-        "lmsys/longchat-13b-16k",
-        task="auto",
-        tokenizer="lmsys/longchat-13b-16k",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
-    )
+    longchat_model_config = ModelConfig("lmsys/longchat-13b-16k")
     # Check if LONGCHAT_ROPE_SCALING entries are in longchat_model_config
     assert all(
         longchat_model_config.hf_config.rope_scaling.get(key) == value
@@ -366,12 +293,6 @@ def test_rope_customization():
 
     longchat_model_config = ModelConfig(
         "lmsys/longchat-13b-16k",
-        task="auto",
-        tokenizer="lmsys/longchat-13b-16k",
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
         hf_overrides={
             "rope_scaling": TEST_ROPE_SCALING,
         },
@@ -390,15 +311,7 @@ def test_rope_customization():
     ("meta-llama/Llama-3.2-11B-Vision", True),
 ])
 def test_is_encoder_decoder(model_id, is_encoder_decoder):
-    config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
-    )
+    config = ModelConfig(model_id)
 
     assert config.is_encoder_decoder == is_encoder_decoder
 
@@ -408,15 +321,7 @@ def test_is_encoder_decoder(model_id, is_encoder_decoder):
     ("Qwen/Qwen2-VL-2B-Instruct", True),
 ])
 def test_uses_mrope(model_id, uses_mrope):
-    config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        dtype="float16",
-        seed=0,
-    )
+    config = ModelConfig(model_id)
 
     assert config.uses_mrope == uses_mrope
 
@@ -426,26 +331,12 @@ def test_generation_config_loading():
 
     # When set generation_config to "vllm", the default generation config
     # will not be loaded.
-    model_config = ModelConfig(model_id,
-                               task="auto",
-                               tokenizer=model_id,
-                               tokenizer_mode="auto",
-                               trust_remote_code=False,
-                               seed=0,
-                               dtype="float16",
-                               generation_config="vllm")
+    model_config = ModelConfig(model_id, generation_config="vllm")
     assert model_config.get_diff_sampling_param() == {}
 
     # When set generation_config to "auto", the default generation config
     # should be loaded.
-    model_config = ModelConfig(model_id,
-                               task="auto",
-                               tokenizer=model_id,
-                               tokenizer_mode="auto",
-                               trust_remote_code=False,
-                               seed=0,
-                               dtype="float16",
-                               generation_config="auto")
+    model_config = ModelConfig(model_id, generation_config="auto")
 
     correct_generation_config = {
         "repetition_penalty": 1.1,
@@ -461,12 +352,6 @@ def test_generation_config_loading():
 
     model_config = ModelConfig(
         model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
         generation_config="auto",
         override_generation_config=override_generation_config)
 
@@ -479,12 +364,6 @@ def test_generation_config_loading():
     # is set, the override_generation_config should be used directly.
     model_config = ModelConfig(
         model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
         generation_config="vllm",
         override_generation_config=override_generation_config)
 
@@ -515,16 +394,7 @@ def test_load_config_pt_load_map_location(pt_load_map_location):
 def test_get_and_verify_max_len(model_id, max_model_len, expected_max_len,
                                 should_raise):
     """Test get_and_verify_max_len with different configurations."""
-    model_config = ModelConfig(
-        model_id,
-        task="auto",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
-        dtype="float16",
-        revision=None,
-    )
+    model_config = ModelConfig(model_id)
 
     if should_raise:
         with pytest.raises(ValueError):
@@ -21,13 +21,8 @@ def test_max_tokens_none():
 def model_config():
     return ModelConfig(
         MODEL_NAME,
-        task="auto",
-        tokenizer=MODEL_NAME,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
         seed=0,
         dtype="float16",
-        revision=None,
     )
 
 
@@ -695,11 +695,7 @@ def test_estimate_max_model_len(model_id, max_model_len,
     # Create a VllmConfig
     model_config = ModelConfig(
         model_id,
-        task="generate",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
+        runner="generate",
         dtype="float16",
         max_model_len=max_model_len,
     )
@@ -733,11 +729,7 @@ def test_get_max_concurrency_for_kv_cache_config():
     max_model_len = 16384
     model_config = ModelConfig(
         model_id,
-        task="generate",
-        tokenizer=model_id,
-        tokenizer_mode="auto",
-        trust_remote_code=False,
-        seed=0,
+        runner="generate",
         dtype="float16",
         max_model_len=max_model_len,
     )
@@ -1248,9 +1248,6 @@ def create_scheduler_with_priority(
     )
     model_config = ModelConfig(
         model=model,
-        task="auto",
-        tokenizer=model,
-        tokenizer_mode="auto",
         trust_remote_code=True,
         dtype="float16",
         seed=42,
@@ -59,9 +59,6 @@ def create_scheduler(
     )
     model_config = ModelConfig(
         model=model,
-        task="auto",
-        tokenizer=model,
-        tokenizer_mode="auto",
         trust_remote_code=True,
         dtype="float16",
         seed=42,
@@ -68,9 +68,6 @@ def create_vllm_config(
     )
     model_config = ModelConfig(
         model=model,
-        task="auto",
-        tokenizer=model,
-        tokenizer_mode="auto",
         trust_remote_code=True,
         dtype="float16",
         seed=42,
@@ -24,13 +24,8 @@ eagle3_dir = "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"
 
 def _create_proposer(method: str, k: int) -> EagleProposer:
     model_config = ModelConfig(model=model_dir,
-                               task="generate",
-                               max_model_len=100,
-                               tokenizer=model_dir,
-                               tokenizer_mode="auto",
-                               dtype="auto",
-                               seed=None,
-                               trust_remote_code=False)
+                               runner="generate",
+                               max_model_len=100)
 
     # Choose model directory based on method
     draft_model_dir = eagle_dir if method == "eagle" else eagle3_dir
@@ -44,14 +44,7 @@ def test_ngram_proposer():
 
 def ngram_proposer(min_n: int, max_n: int, k: int) -> NgramProposer:
     # Dummy model config. Just to set max_model_len.
-    model_config = ModelConfig(model="facebook/opt-125m",
-                               task="generate",
-                               max_model_len=100,
-                               tokenizer="facebook/opt-125m",
-                               tokenizer_mode="auto",
-                               dtype="auto",
-                               seed=None,
-                               trust_remote_code=False)
+    model_config = ModelConfig(model="facebook/opt-125m")
     return NgramProposer(
         vllm_config=VllmConfig(model_config=model_config,
                                speculative_config=SpeculativeConfig.
@@ -26,10 +26,6 @@ def get_vllm_config():
    )
    model_config = ModelConfig(
        model="facebook/opt-125m",
-        task="generate",
-        tokenizer="facebook/opt-125m",
-        tokenizer_mode="auto",
-        trust_remote_code=True,
        dtype="bfloat16",  # TPUs typically use bfloat16
        seed=42,
    )
@@ -76,10 +76,6 @@ def get_vllm_config():
    )
    model_config = ModelConfig(
        model="facebook/opt-125m",
-        task="generate",
-        tokenizer="facebook/opt-125m",
-        tokenizer_mode="auto",
-        trust_remote_code=True,
        dtype="float16",
        seed=42,
    )
vllm/config.py (530 changed lines)
@@ -26,7 +26,7 @@ from pydantic import (ConfigDict, SkipValidation, TypeAdapter, field_validator,
 from pydantic.dataclasses import dataclass
 from safetensors.torch import _TYPES as _SAFETENSORS_TO_TORCH_DTYPE
 from torch.distributed import ProcessGroup, ReduceOp
-from typing_extensions import Self, runtime_checkable
+from typing_extensions import Self, assert_never, runtime_checkable

 import vllm.envs as envs
 from vllm import version
@@ -102,12 +102,63 @@ RunnerOption = Literal["auto", "generate", "pooling", "draft"]

RunnerType = Literal["generate", "pooling", "draft"]

-_RUNNER_TASKS: dict[RunnerType, list[_ResolvedTask]] = {
+ConvertOption = Literal["auto", "none", "embed", "classify", "reward"]
+
+ConvertType = Literal["none", "embed", "classify", "reward"]
+
+_RUNNER_TASKS: dict[RunnerType, list[TaskOption]] = {
     "generate": ["generate", "transcription"],
-    "pooling": ["encode", "embed", "classify", "reward"],
+    "pooling": ["embedding", "embed", "classify", "score", "reward"],
+    "draft": ["draft"],
+}
+
+_RUNNER_CONVERTS: dict[RunnerType, list[ConvertType]] = {
+    "generate": [],
+    "pooling": ["embed", "classify", "reward"],
     "draft": [],
 }
+
+# Some model suffixes are based on auto classes from Transformers:
+# https://huggingface.co/docs/transformers/en/model_doc/auto
+# NOTE: Items higher on this list priority over lower ones
+_SUFFIX_TO_DEFAULTS: list[tuple[str, tuple[RunnerType, ConvertType]]] = [
+    ("ForCausalLM", ("generate", "none")),
+    ("ForConditionalGeneration", ("generate", "none")),
+    ("ChatModel", ("generate", "none")),
+    ("LMHeadModel", ("generate", "none")),
+    ("ForTextEncoding", ("pooling", "embed")),
+    ("EmbeddingModel", ("pooling", "embed")),
+    ("ForSequenceClassification", ("pooling", "classify")),
+    ("ForAudioClassification", ("pooling", "classify")),
+    ("ForImageClassification", ("pooling", "classify")),
+    ("ForVideoClassification", ("pooling", "classify")),
+    ("ClassificationModel", ("pooling", "classify")),
+    ("ForRewardModeling", ("pooling", "reward")),
+    ("RewardModel", ("pooling", "reward")),
+    # Let other `*Model`s take priority
+    ("Model", ("pooling", "embed")),
+]
+
+
+def iter_architecture_defaults():
+    yield from _SUFFIX_TO_DEFAULTS
+
+
+def try_match_architecture_defaults(
+    architecture: str,
+    *,
+    runner_type: Optional[RunnerType] = None,
+    convert_type: Optional[ConvertType] = None,
+) -> Optional[tuple[str, tuple[RunnerType, ConvertType]]]:
+    for suffix, (default_runner_type,
+                 default_convert_type) in iter_architecture_defaults():
+        if ((runner_type is None or runner_type == default_runner_type) and
+            (convert_type is None or convert_type == default_convert_type)
+                and architecture.endswith(suffix)):
+            return suffix, (default_runner_type, default_convert_type)
+
+    return None


 @runtime_checkable
 class SupportsHash(Protocol):
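The suffix-to-defaults lookup introduced in this hunk maps an architecture name like `LlamaForCausalLM` to a default `(runner, convert)` pair. A standalone sketch of that matching loop, trimmed to a few table entries (the full table and the real function live in `vllm/config.py`):

```python
from typing import Optional

# Trimmed copy of the suffix table from the diff; more specific
# suffixes come first because the first match wins.
_SUFFIX_TO_DEFAULTS = [
    ("ForCausalLM", ("generate", "none")),
    ("ForSequenceClassification", ("pooling", "classify")),
    ("RewardModel", ("pooling", "reward")),
    # Let other `*Model`s take priority
    ("Model", ("pooling", "embed")),
]


def try_match_architecture_defaults(
    architecture: str,
    *,
    runner_type: Optional[str] = None,
    convert_type: Optional[str] = None,
):
    # Optional filters restrict the match to entries whose defaults
    # agree with an already-chosen runner or convert type.
    for suffix, (default_runner, default_convert) in _SUFFIX_TO_DEFAULTS:
        if ((runner_type is None or runner_type == default_runner)
                and (convert_type is None or convert_type == default_convert)
                and architecture.endswith(suffix)):
            return suffix, (default_runner, default_convert)
    return None


print(try_match_architecture_defaults("LlamaForCausalLM"))
# → ('ForCausalLM', ('generate', 'none'))
print(try_match_architecture_defaults("BertModel", runner_type="pooling"))
# → ('Model', ('pooling', 'embed'))
```

The ordering of the table matters: generic suffixes such as `Model` sit at the bottom so that more specific ones are tried first.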
@@ -236,11 +287,16 @@ class ModelConfig:
    runner: RunnerOption = "auto"
    """The type of model runner to use. Each vLLM instance only supports one
    model runner, even if the same model can be used for multiple types."""
-    task: TaskOption = "auto"
-    """The task to use the model for. If the model supports more than one
-    model runner, this is used to select which model runner to run.
+    convert: ConvertOption = "auto"
+    """Convert the model using adapters defined in
+    [vllm.model_executor.models.adapters][]. The most common use case is to
+    adapt a text generation model to be used for pooling tasks."""
+    task: Optional[TaskOption] = None
+    """[DEPRECATED] The task to use the model for. If the model supports more
+    than one model runner, this is used to select which model runner to run.

-    Note that the model may support other tasks using the same model runner."""
+    Note that the model may support other tasks using the same model runner.
+    """
    tokenizer: SkipValidation[str] = None  # type: ignore
    """Name or path of the Hugging Face tokenizer to use. If unspecified, model
    name or path will be used."""
@@ -558,48 +614,103 @@ class ModelConfig:
        self.hf_image_processor_config = get_hf_image_processor_config(
            self.model, hf_token=self.hf_token, revision=self.revision)

-        # For pooling models, self.task is used to indicate the
-        # user-selected task
-        if self.task == "score":
-            if self._is_classify_task(self.architectures):
-                self.task = "classify"
-            else:
-                self.task = "embed"
-        elif self.task == "embedding":
-            msg = ("The 'embedding' task has been renamed to 'embed', please "
-                   "use the new name. The old name will be removed in v1.0.")
-            warnings.warn(msg, DeprecationWarning, stacklevel=2)
-
-            self.task = "embed"
-
-        model_info, arch = self.registry.inspect_model_cls(self.architectures)
+        architectures = self.architectures
+        registry = self.registry
+        is_generative_model = registry.is_text_generation_model(
+            architectures, self)
+        is_pooling_model = registry.is_pooling_model(architectures, self)
+
+        def _task_to_convert(task: TaskOption) -> ConvertType:
+            if task == "embedding" or task == "embed":
+                return "embed"
+            if task == "classify":
+                return "classify"
+            if task == "reward":
+                return "reward"
+            if task == "score":
+                new_task = self._get_default_pooling_task(architectures)
+                return "classify" if new_task == "classify" else "embed"
+
+            return "none"
+
+        if self.task is not None:
+            runner: RunnerOption = "auto"
+            convert: ConvertOption = "auto"
+            msg_prefix = ("The 'task' option has been deprecated and will be "
+                          "removed in v0.13.0 or v1.0, whichever comes first.")
+            msg_hint = "Please remove this option."
+
+            is_generative_task = self.task in _RUNNER_TASKS["generate"]
+            is_pooling_task = self.task in _RUNNER_TASKS["pooling"]
+
+            if is_generative_model and is_pooling_model:
+                if is_generative_task:
+                    runner = "generate"
+                    convert = "auto"
+                    msg_hint = ("Please replace this option with `--runner "
+                                "generate` to continue using this model "
+                                "as a generative model.")
+                elif is_pooling_task:
+                    runner = "pooling"
+                    convert = "auto"
+                    msg_hint = ("Please replace this option with `--runner "
+                                "pooling` to continue using this model "
+                                "as a pooling model.")
+                else:  # task == "auto"
+                    pass
+            elif is_generative_model or is_pooling_model:
+                if is_generative_task:
+                    runner = "generate"
+                    convert = "auto"
+                    msg_hint = "Please remove this option"
+                elif is_pooling_task:
+                    runner = "pooling"
+                    convert = _task_to_convert(self.task)
+                    msg_hint = ("Please replace this option with `--convert "
+                                f"{convert}` to continue using this model "
+                                "as a pooling model.")
+                else:  # task == "auto"
+                    pass
+            else:
+                raise AssertionError("The model should be a generative or "
+                                     "pooling model when task is set to "
+                                     f"{self.task!r}.")
+
+            self.runner = runner
+            self.convert = convert
+
+            msg = f"{msg_prefix} {msg_hint}"
+            warnings.warn(msg, DeprecationWarning, stacklevel=2)
+
+        self.runner_type = self._get_runner_type(architectures, self.runner)
+        self.convert_type = self._get_convert_type(architectures,
+                                                   self.runner_type,
+                                                   self.convert)
+
+        if self.runner_type == "generate" and not is_generative_model:
+            generate_converts = _RUNNER_CONVERTS["generate"]
+            if self.convert_type not in generate_converts:
+                # Currently we don't have any converters for generative models
+                raise ValueError(
+                    "This model does not support `--runner generate`.")
+        if self.runner_type == "pooling" and not is_pooling_model:
+            pooling_converts = _RUNNER_CONVERTS["pooling"]
+            if self.convert_type not in pooling_converts:
+                convert_option = "<" + "|".join(pooling_converts) + ">"
+                raise ValueError(
+                    "This model does not support `--runner pooling`. "
+                    f"You can pass `--convert {convert_option} to adapt "
+                    "it into a pooling model.")
+
+        self.supported_tasks = self._get_supported_tasks(
+            architectures, self.runner_type, self.convert_type)
+
+        # Note: Initialize these attributes early because transformers fallback
+        # may fail to load dynamic modules in child processes
+        model_info, arch = registry.inspect_model_cls(architectures, self)
        self._model_info = model_info
        self._architecture = arch
-
-        all_supported_tasks = self._get_supported_tasks(self.task)
-        logger.debug("Tasks supported by runner type: %s", all_supported_tasks)
-        supported_runner_types = self._get_supported_runner_types(
-            all_supported_tasks)
-        runner_type = self._resolve_runner(self.runner, self.task,
-                                           supported_runner_types,
-                                           all_supported_tasks)
-
-        logger.debug("Selected runner type: %s", runner_type)
-        # For pooling models, self.task is used to indicate the
-        # user-selected task
-        if runner_type == "pooling" and self.task == "auto":
-            selected_task = all_supported_tasks[runner_type][-1]
-            assert selected_task != "encode"
-            self.task = selected_task
-        self.supported_runner_types = supported_runner_types
-        self.runner_type = runner_type
-        self.supported_tasks = all_supported_tasks[runner_type]
-
-        if self.runner_type in ("draft",
-                                "generate") and self.task != "transcription":
-            self.truncation_side = "left"
-        else:
-            self.truncation_side = "right"
+        logger.info("Resolved architecture: %s", arch)

        self.pooler_config = self._init_pooler_config()
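The deprecation branch above translates an old `--task` value into the new `--runner`/`--convert` pair before warning the user. A hypothetical standalone sketch of that mapping (`map_deprecated_task` is an illustrative name, not vLLM's API; the real logic also inspects what the loaded model itself supports, and `score` resolves to classify or embed per model):

```python
# Mirrors _RUNNER_TASKS from the diff.
_RUNNER_TASKS = {
    "generate": ["generate", "transcription"],
    "pooling": ["embedding", "embed", "classify", "score", "reward"],
    "draft": ["draft"],
}


def map_deprecated_task(task: str) -> tuple[str, str]:
    """Return the (runner, convert) pair suggested for an old --task value."""
    if task in _RUNNER_TASKS["generate"]:
        # Generative tasks only need the runner; no conversion applies.
        return "generate", "auto"
    if task in _RUNNER_TASKS["pooling"]:
        # "embedding" was already a deprecated alias for "embed"; here we
        # assume "score" maps to classify, though the real code picks
        # classify or embed depending on the model architecture.
        convert = {"embedding": "embed", "score": "classify"}.get(task, task)
        return "pooling", convert
    if task == "draft":
        return "draft", "auto"
    raise ValueError(f"Unknown task: {task!r}")


print(map_deprecated_task("embed"))     # → ('pooling', 'embed')
print(map_deprecated_task("generate"))  # → ('generate', 'auto')
```

So `vllm serve ... --task embed` becomes `vllm serve ... --runner pooling --convert embed` (or just `--convert embed`, since the runner is then inferred).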
@@ -652,16 +763,10 @@ class ModelConfig:
        self.original_max_model_len = self.max_model_len
        self.max_model_len = self.get_and_verify_max_len(self.max_model_len)
        self.multimodal_config = self._init_multimodal_config()
-        self.model_supports_multimodal_raw_input = (
-            self.registry.supports_multimodal_raw_input(self.architectures))
        if not self.skip_tokenizer_init:
            self._verify_tokenizer_mode()

-        self.is_attention_free = self._init_attention_free()
-        self.is_hybrid = self._init_is_hybrid()
-        self.has_noops = self._init_has_noops()
-        self.has_inner_state = self._init_has_inner_state()

        if (not current_platform.is_neuron() and self.override_neuron_config):
            raise ValueError(
                "`override_neuron_config` is only supported on Neuron.")
@ -702,30 +807,13 @@ class ModelConfig:
|
|||||||
|
|
||||||
@property
|
@property
|
||||||
def architectures(self) -> list[str]:
|
def architectures(self) -> list[str]:
|
||||||
# architectures in the model config.
|
return getattr(self.hf_config, "architectures", [])
|
||||||
architectures = getattr(self.hf_config, "architectures", [])
|
|
||||||
# The registry assumes that it can always inspect the vLLM model class
|
|
||||||
# for a given architecture. This assumption breaks down for the
|
|
||||||
# Transformers backend, which may use a different class depending on
|
|
||||||
# the model type. To work around this, we add the correct Transformers
|
|
||||||
# backend class to the architectures list. We must do this here because
|
|
||||||
# we need access to the `hf_config` to determine the backend class.
|
|
||||||
transformers_backend_cls = self._get_transformers_backend_cls()
|
|
||||||
if (self.model_impl != ModelImpl.VLLM.value
|
|
||||||
and all(arch != transformers_backend_cls
|
|
||||||
for arch in architectures)):
|
|
||||||
architectures.append(transformers_backend_cls)
|
|
||||||
return architectures
|
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def architecture(self) -> str:
|
def architecture(self) -> str:
|
||||||
# The architecture vllm actually used.
|
"""The architecture vllm actually used."""
|
||||||
return self._architecture
|
return self._architecture
|
||||||
|
|
||||||
@property
|
|
||||||
def model_info(self):
|
|
||||||
return self._model_info
|
|
||||||
|
|
||||||
def maybe_pull_model_tokenizer_for_s3(self, model: str,
|
def maybe_pull_model_tokenizer_for_s3(self, model: str,
|
||||||
tokenizer: str) -> None:
|
tokenizer: str) -> None:
|
||||||
"""Pull model/tokenizer from S3 to temporary directory when needed.
|
"""Pull model/tokenizer from S3 to temporary directory when needed.
|
||||||
@@ -763,7 +851,7 @@ class ModelConfig:
        self.tokenizer = s3_tokenizer.dir

    def _init_multimodal_config(self) -> Optional["MultiModalConfig"]:
-        if self.registry.is_multimodal_model(self.architectures):
+        if self.registry.is_multimodal_model(self.architectures, self):
            return MultiModalConfig(
                limit_per_prompt=self.limit_mm_per_prompt,
                media_io_kwargs=self.media_io_kwargs,
@@ -819,19 +907,6 @@ class ModelConfig:

        return None

-    def _init_attention_free(self) -> bool:
-        return self.registry.is_attention_free_model(self.architectures)
-
-    def _init_is_hybrid(self) -> bool:
-        return self.registry.is_hybrid_model(self.architectures)
-
-    def _init_has_noops(self) -> bool:
-        architectures = getattr(self.hf_config, "architectures", [])
-        return self.registry.is_noops_model(architectures)
-
-    def _init_has_inner_state(self) -> bool:
-        return self.registry.model_has_inner_state(self.architectures)
-
    def _verify_tokenizer_mode(self) -> None:
        tokenizer_mode = cast(TokenizerMode, self.tokenizer_mode.lower())
        if tokenizer_mode not in get_args(TokenizerMode):
@@ -840,155 +915,168 @@ class ModelConfig:
                f"one of {get_args(TokenizerMode)}.")
        self.tokenizer_mode = tokenizer_mode

-    def _is_classify_task(self, architectures: list[str]):
-        for arch in architectures:
-            if arch.endswith("ForSequenceClassification"):
-                return True
-        return self.registry.is_cross_encoder_model(architectures)
-
-    def _get_preferred_pooling_task(
-        self,
-        architectures: list[str],
-    ) -> _ResolvedTask:
-        model_id = self.model
-        if get_pooling_config(model_id, self.revision):
-            return "embed"
-        if self.registry.is_transcription_model(architectures):
-            return "transcription"
-
-        suffix_to_preferred_task: list[tuple[str, _ResolvedTask]] = [
-            # Other models follow this pattern
-            ("EmbeddingModel", "embed"),
-            ("RewardModel", "reward"),
-        ]
-
-        for suffix, pref_task in suffix_to_preferred_task:
-            if self.architecture.endswith(suffix):
-                return pref_task
-
-        return "embed"
+    def _get_default_runner_type(
+        self,
+        architectures: list[str],
+    ) -> RunnerType:
+        registry = self.registry
+
+        # Some Sentence Transformers models use *ForCausalLM archs
+        if get_pooling_config(self.model, self.revision):
+            return "pooling"
+
+        for arch in architectures:
+            if arch in registry.get_supported_archs():
+                if registry.is_pooling_model(architectures, self):
+                    return "pooling"
+                if registry.is_text_generation_model(architectures, self):
+                    return "generate"
+
+            match = try_match_architecture_defaults(arch)
+            if match:
+                _, (runner_type, _) = match
+                return runner_type
+
+        return "generate"
+
+    def _get_runner_type(
+        self,
+        architectures: list[str],
+        runner: RunnerOption,
+    ) -> RunnerType:
+        if runner != "auto":
+            return runner
+
+        runner_type = self._get_default_runner_type(architectures)
+
+        logger.info(
+            "Resolved `--runner auto` to `--runner %s`. "
+            "Pass the value explicitly to silence this message.", runner_type)
+
+        return runner_type
+
+    def _get_default_convert_type(
+        self,
+        architectures: list[str],
+        runner_type: RunnerType,
+    ) -> ConvertType:
+        registry = self.registry
+
+        for arch in architectures:
+            if arch in registry.get_supported_archs():
+                if (runner_type == "generate"
+                        and registry.is_text_generation_model(
+                            architectures, self)):
+                    return "none"
+                if (runner_type == "pooling"
+                        and registry.is_pooling_model(architectures, self)):
+                    return "none"
+
+            match = try_match_architecture_defaults(arch,
+                                                    runner_type=runner_type)
+            if match:
+                _, (_, convert_type) = match
+                return convert_type
+
+        # This is to handle Sentence Transformers models that use *ForCausalLM
+        # and also multi-modal pooling models which are not defined as
+        # Sentence Transformers models
+        if runner_type == "pooling":
+            return "embed"
+
+        return "none"
+
+    def _get_convert_type(
+        self,
+        architectures: list[str],
+        runner_type: RunnerType,
+        convert: ConvertOption,
+    ) -> ConvertType:
+        if convert != "auto":
+            return convert
+
+        convert_type = self._get_default_convert_type(architectures,
+                                                      runner_type)
+
+        logger.info(
+            "Resolved `--convert auto` to `--convert %s`. "
+            "Pass the value explicitly to silence this message.", convert_type)
+
+        return convert_type

    def _get_supported_generation_tasks(
        self,
-        task_option: TaskOption,
+        architectures: list[str],
+        convert_type: ConvertType,
    ) -> list[_ResolvedTask]:
        registry = self.registry
-        architectures = self.architectures

-        if registry.is_transcription_only_model(architectures):
+        if registry.is_transcription_only_model(architectures, self):
            return ["transcription"]

+        # TODO: Use get_supported_generation_tasks once V0 is removed
        supported_tasks = list[_ResolvedTask]()
-        if registry.is_text_generation_model(architectures):
+        if (registry.is_text_generation_model(architectures, self)
+                or convert_type in _RUNNER_CONVERTS["generate"]):
            supported_tasks.append("generate")

-        if registry.is_transcription_model(architectures):
+        if registry.is_transcription_model(architectures, self):
            supported_tasks.append("transcription")

        return supported_tasks

+    def _get_default_pooling_task(
+        self,
+        architectures: list[str],
+    ) -> Literal["embed", "classify", "reward"]:
+        if self.registry.is_cross_encoder_model(architectures, self):
+            return "classify"
+
+        for arch in architectures:
+            match = try_match_architecture_defaults(arch,
+                                                    runner_type="pooling")
+            if match:
+                _, (_, convert_type) = match
+                assert convert_type != "none"
+                return convert_type
+
+        return "embed"
+
    def _get_supported_pooling_tasks(
        self,
-        task_option: TaskOption,
+        architectures: list[str],
+        convert_type: ConvertType,
    ) -> list[_ResolvedTask]:
        registry = self.registry
-        architectures = self.architectures

+        # TODO: Use get_supported_pooling_tasks once V0 is removed
        supported_tasks = list[_ResolvedTask]()
-        if registry.is_pooling_model(architectures):
+        if (registry.is_pooling_model(architectures, self)
+                or convert_type in _RUNNER_CONVERTS["pooling"]):
            supported_tasks.append("encode")

-        # For now, users must specify the task (other than "pooling")
-        # to use for pooling models
-        if task_option == "auto":
-            preferred_task = self._get_preferred_pooling_task(
-                architectures)
-
-            supported_tasks.append(preferred_task)
-        elif task_option in _RUNNER_TASKS["pooling"]:
-            supported_tasks.append(cast(_ResolvedTask, task_option))
+            extra_task = (self._get_default_pooling_task(architectures)
+                          if convert_type == "none" else convert_type)
+            supported_tasks.append(extra_task)

        return supported_tasks

    def _get_supported_tasks(
        self,
-        task_option: TaskOption,
-    ) -> dict[RunnerType, list[_ResolvedTask]]:
-        if self._is_classify_task(self.architectures):
-            return {"generate": [], "pooling": ["classify"], "draft": []}
-        else:
-            return {
-                "generate": self._get_supported_generation_tasks(task_option),
-                "pooling": self._get_supported_pooling_tasks(task_option),
-                "draft": ["draft"]
-            }
-
-    def _get_supported_runner_types(
-        self,
-        supported_tasks: dict[RunnerType, list[_ResolvedTask]],
-    ) -> set[RunnerType]:
-        return {
-            runner
-            for runner, runner_tasks in supported_tasks.items()
-            if len(runner_tasks) > 0
-        }
-
-    def _resolve_runner(
-        self,
-        runner_option: RunnerOption,
-        task_option: TaskOption,
-        supported_runner_types: set[RunnerType],
-        supported_tasks: dict[RunnerType, list[_ResolvedTask]],
-    ) -> RunnerType:
-        if not supported_runner_types:
-            raise ValueError("This model does not support any model runners!")
-
-        if runner_option != "auto":
-            if runner_option not in supported_runner_types:
-                raise ValueError(
-                    f"This model does not support runner={runner_option!r}. "
-                    f"Available runners: {supported_runner_types}")
-
-            return runner_option
-
-        if task_option != "auto":
-            for runner, runner_tasks in supported_tasks.items():
-                if task_option in runner_tasks:
-                    return runner
-            else:
-                task_runner: RunnerType = next(
-                    runner for runner, tasks in _RUNNER_TASKS.items()
-                    if task_option in tasks)
-                raise ValueError(
-                    f"This model does not support task={task_option!r}. "
-                    f"Available tasks for runner={task_runner!r}: "
-                    f"{supported_tasks[task_runner]}")
-
-        if "classify" in supported_tasks.get("pooling", []):
-            # When multiple pooling tasks are present, default to
-            # pooling (eg cross-encoder) for non-standard architectures.
-            return "pooling"
-
-        suffix_to_preferred_runner: list[tuple[str, RunnerType]] = [
-            ("ForCausalLM", "generate"),
-            ("ForConditionalGeneration", "generate"),
-            ("ChatModel", "generate"),
-            ("LMHeadModel", "generate"),
-            ("EmbeddingModel", "pooling"),
-            ("RewardModel", "pooling"),
-        ]
-
-        for suffix, pref_runner in suffix_to_preferred_runner:
-            if self.architecture.endswith(
-                    suffix) and pref_runner in supported_runner_types:
-                return pref_runner
-
-        if "generate" in supported_runner_types:
-            return "generate"
-        if "pooling" in supported_runner_types:
-            return "pooling"
-
-        raise AssertionError("This line should not be reached")
+        architectures: list[str],
+        runner_type: RunnerType,
+        convert_type: ConvertType,
+    ) -> list[_ResolvedTask]:
+        if runner_type == "generate":
+            return self._get_supported_generation_tasks(
+                architectures, convert_type)
+        if runner_type == "pooling":
+            return self._get_supported_pooling_tasks(architectures,
+                                                     convert_type)
+        if runner_type == "draft":
+            return ["draft"]
+
+        assert_never(runner_type)

    def _parse_quant_hf_config(self):
        quant_cfg = getattr(self.hf_config, "quantization_config", None)
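The new `_get_runner_type`/`_get_default_runner_type` pair above resolves `--runner auto` in a fixed priority order: an explicit runner wins; otherwise registry knowledge is consulted, then architecture-suffix defaults, then `"generate"` as the fallback. An illustrative standalone sketch of that order (hypothetical function and parameters, not vLLM's actual API, which works on the registry and full architecture lists):

```python
def resolve_runner(runner: str,
                   is_pooling_model: bool,
                   is_text_generation_model: bool,
                   architecture: str) -> str:
    # 1. An explicitly requested runner is used as-is.
    if runner != "auto":
        return runner
    # 2. Registry knowledge about the architecture comes next.
    if is_pooling_model:
        return "pooling"
    if is_text_generation_model:
        return "generate"
    # 3. Fall back to architecture-suffix defaults (trimmed table).
    for suffix, default_runner in [("ForCausalLM", "generate"),
                                   ("ForSequenceClassification", "pooling"),
                                   ("RewardModel", "pooling")]:
        if architecture.endswith(suffix):
            return default_runner
    # 4. Last resort: assume a generative model.
    return "generate"


print(resolve_runner("auto", False, False, "MyRewardModel"))  # → pooling
print(resolve_runner("generate", True, True, "AnyArch"))      # → generate
```

The real implementation also logs the resolved value so users know to pass `--runner` explicitly if the guess is wrong.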
@@ -1216,7 +1304,8 @@ class ModelConfig:

        pipeline_parallel_size = parallel_config.pipeline_parallel_size
        if pipeline_parallel_size > 1:
-            if not self.registry.is_pp_supported_model(self.architectures):
+            if not self.registry.is_pp_supported_model(self.architectures,
+                                                       self):
                raise NotImplementedError(
                    "Pipeline parallelism is not supported for this model. "
                    "Supported models implement the `SupportsPP` interface.")
@ -1558,17 +1647,41 @@ class ModelConfig:
|
|||||||
|
|
||||||
@property
|
@property
|
||||||
def is_cross_encoder(self) -> bool:
|
def is_cross_encoder(self) -> bool:
|
||||||
return self.task == "classify"
|
return (self._model_info.supports_cross_encoding
|
||||||
|
or self.convert_type == "classify")
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_pp_supported(self) -> bool:
|
||||||
|
return self._model_info.supports_pp
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_multimodal_raw_input_supported(self) -> bool:
|
||||||
|
return self._model_info.supports_multimodal_raw_input
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_attention_free(self) -> bool:
|
||||||
|
return self._model_info.is_attention_free
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_hybrid(self) -> bool:
|
||||||
|
return self._model_info.is_hybrid
|
||||||
|
|
||||||
|
@property
|
||||||
|
def has_noops(self) -> bool:
|
||||||
|
return self._model_info.has_noops
|
||||||
|
|
||||||
|
@property
|
||||||
|
def has_inner_state(self):
|
||||||
|
return self._model_info.has_inner_state
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_v1_compatible(self) -> bool:
|
||||||
|
return not self._model_info.supports_v0_only
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def use_mla(self) -> bool:
|
def use_mla(self) -> bool:
|
||||||
return self.is_deepseek_mla and not envs.VLLM_MLA_DISABLE
|
return self.is_deepseek_mla and not envs.VLLM_MLA_DISABLE
|
||||||
|
|
||||||
@property
|
|
||||||
def is_v1_compatible(self) -> bool:
|
|
||||||
architectures = getattr(self.hf_config, "architectures", [])
|
|
||||||
return me_models.ModelRegistry.is_v1_compatible(architectures)
|
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def is_matryoshka(self) -> bool:
|
def is_matryoshka(self) -> bool:
|
||||||
return (bool(getattr(self.hf_config, "matryoshka_dimensions", None))
|
return (bool(getattr(self.hf_config, "matryoshka_dimensions", None))
|
||||||
@@ -4769,7 +4882,10 @@ class VllmConfig:
         self.scheduler_config.max_model_len = max_model_len
 
     def try_verify_and_update_config(self):
-        architecture = getattr(self.model_config, "architecture", None)
+        if self.model_config is None:
+            return
+
+        architecture = self.model_config.architecture
         if architecture is None:
             return
 
@@ -4782,7 +4898,7 @@ class VllmConfig:
         if self.model_config.is_hybrid:
             HybridAttentionMambaModelConfig.verify_and_update_config(self)
 
-        if self.model_config.task == "classify":
+        if self.model_config.convert_type == "classify":
             # Maybe convert ForCausalLM into ForSequenceClassification model.
             from vllm.model_executor.models.adapters import (
                 SequenceClassificationConfig)
@@ -22,14 +22,15 @@ from typing_extensions import TypeIs
 
 import vllm.envs as envs
 from vllm.config import (BlockSize, CacheConfig, CacheDType, CompilationConfig,
-                         ConfigFormat, ConfigType, DecodingConfig,
-                         DetailedTraceModules, Device, DeviceConfig,
-                         DistributedExecutorBackend, GuidedDecodingBackend,
-                         GuidedDecodingBackendV1, HfOverrides, KVEventsConfig,
-                         KVTransferConfig, LoadConfig, LogprobsMode,
-                         LoRAConfig, ModelConfig, ModelDType, ModelImpl,
-                         MultiModalConfig, ObservabilityConfig, ParallelConfig,
-                         PoolerConfig, PrefixCachingHashAlgo, SchedulerConfig,
+                         ConfigFormat, ConfigType, ConvertOption,
+                         DecodingConfig, DetailedTraceModules, Device,
+                         DeviceConfig, DistributedExecutorBackend,
+                         GuidedDecodingBackend, GuidedDecodingBackendV1,
+                         HfOverrides, KVEventsConfig, KVTransferConfig,
+                         LoadConfig, LogprobsMode, LoRAConfig, ModelConfig,
+                         ModelDType, ModelImpl, MultiModalConfig,
+                         ObservabilityConfig, ParallelConfig, PoolerConfig,
+                         PrefixCachingHashAlgo, RunnerOption, SchedulerConfig,
                          SchedulerPolicy, SpeculativeConfig, TaskOption,
                          TokenizerMode, VllmConfig, get_attr_docs, get_field)
 from vllm.logger import init_logger
@@ -270,7 +271,9 @@ class EngineArgs:
         str, List[str]]] = ModelConfig.served_model_name
     tokenizer: Optional[str] = ModelConfig.tokenizer
     hf_config_path: Optional[str] = ModelConfig.hf_config_path
-    task: TaskOption = ModelConfig.task
+    runner: RunnerOption = ModelConfig.runner
+    convert: ConvertOption = ModelConfig.convert
+    task: Optional[TaskOption] = ModelConfig.task
     skip_tokenizer_init: bool = ModelConfig.skip_tokenizer_init
     enable_prompt_embeds: bool = ModelConfig.enable_prompt_embeds
     tokenizer_mode: TokenizerMode = ModelConfig.tokenizer_mode
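The hunk above splits the single deprecated `task` axis into two independent axes: `runner` (how the engine drives the model) and `convert` (how the loaded architecture is adapted). As a rough, hypothetical sketch of that split — not vLLM's actual resolution logic, and the exact mapping table below is an assumption based on the error messages elsewhere in this diff — a legacy task value could be translated like this:

```python
# Hypothetical translation of a deprecated --task value into the new
# (--runner, --convert) pair. This table is an illustration only; the
# authoritative mapping lives in vLLM's ModelConfig.
LEGACY_TASK_MAP = {
    "generate": ("generate", "none"),    # generative models need no conversion
    "embed": ("pooling", "embed"),       # pooling runner + embedding adapter
    "classify": ("pooling", "classify"),
    "reward": ("pooling", "reward"),
}


def resolve_legacy_task(task):
    """Translate a deprecated task value into (runner, convert)."""
    if task is None or task == "auto":
        # Let the engine infer both axes from the model architecture.
        return ("auto", "auto")
    if task not in LEGACY_TASK_MAP:
        raise ValueError(f"Unknown legacy task: {task!r}")
    return LEGACY_TASK_MAP[task]


print(resolve_legacy_task("embed"))
```

Keeping `task` as an `Optional[TaskOption]` field while adding `runner` and `convert`, as the hunk does, lets the old flag keep working during the deprecation window.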
@@ -461,7 +464,11 @@ class EngineArgs:
         )
         if not ('serve' in sys.argv[1:] and '--help' in sys.argv[1:]):
             model_group.add_argument("--model", **model_kwargs["model"])
-        model_group.add_argument("--task", **model_kwargs["task"])
+        model_group.add_argument("--runner", **model_kwargs["runner"])
+        model_group.add_argument("--convert", **model_kwargs["convert"])
+        model_group.add_argument("--task",
+                                 **model_kwargs["task"],
+                                 deprecated=True)
         model_group.add_argument("--tokenizer", **model_kwargs["tokenizer"])
         model_group.add_argument("--tokenizer-mode",
                                  **model_kwargs["tokenizer_mode"])
@@ -870,6 +877,8 @@ class EngineArgs:
         return ModelConfig(
             model=self.model,
             hf_config_path=self.hf_config_path,
+            runner=self.runner,
+            convert=self.convert,
             task=self.task,
             tokenizer=self.tokenizer,
             tokenizer_mode=self.tokenizer_mode,
@@ -20,8 +20,8 @@ from vllm.beam_search import (BeamSearchInstance, BeamSearchOutput,
                               create_sort_beams_key_function)
 from vllm.config import (CompilationConfig, ModelDType, TokenizerMode,
                          is_init_field)
-from vllm.engine.arg_utils import (EngineArgs, HfOverrides, PoolerConfig,
-                                   TaskOption)
+from vllm.engine.arg_utils import (ConvertOption, EngineArgs, HfOverrides,
+                                   PoolerConfig, RunnerOption)
 from vllm.engine.llm_engine import LLMEngine
 from vllm.entrypoints.chat_utils import (ChatCompletionMessageParam,
                                          ChatTemplateContentFormatOption,
@@ -170,7 +170,8 @@ class LLM:
         self,
         model: str,
         *,
-        task: TaskOption = "auto",
+        runner: RunnerOption = "auto",
+        convert: ConvertOption = "auto",
         tokenizer: Optional[str] = None,
         tokenizer_mode: TokenizerMode = "auto",
         skip_tokenizer_init: bool = False,
@@ -244,7 +245,8 @@ class LLM:
 
         engine_args = EngineArgs(
             model=model,
-            task=task,
+            runner=runner,
+            convert=convert,
             tokenizer=tokenizer,
             tokenizer_mode=tokenizer_mode,
             skip_tokenizer_init=skip_tokenizer_init,
@@ -459,18 +461,10 @@ class LLM:
         model_config = self.llm_engine.model_config
         runner_type = model_config.runner_type
         if runner_type != "generate":
-            messages = [
-                "LLM.generate() is only supported for generative models."
-            ]
-
-            if "generate" in model_config.supported_runner_types:
-                messages.append(
-                    "Your model supports the 'generate' runner, but is "
-                    f"currently initialized for the '{runner_type}' runner. "
-                    "Please initialize vLLM using `--task generate` or "
-                    "`--task transcription`.")
-
-            raise ValueError(" ".join(messages))
+            raise ValueError(
+                "LLM.generate() is only supported for generative models. "
+                "Try passing `--runner generate` to use the model as a "
+                "generative model.")
 
         if prompt_token_ids is not None:
             parsed_prompts = self._convert_v1_inputs(
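The rewritten guard above drops the old multi-part message assembly in favour of one targeted error that names the `--runner` value to pass. A minimal standalone sketch of that pattern (the function name and message wording here are illustrative, not vLLM's):

```python
# Standalone sketch (not vLLM code) of the simplified runner guard: a single
# ValueError that tells the user which --runner value to pass, instead of
# building the message from a list of parts.
def check_runner(runner_type: str, expected: str, method: str) -> None:
    if runner_type != expected:
        raise ValueError(
            f"LLM.{method}() is only supported for '{expected}' models. "
            f"Try passing `--runner {expected}` to use the model that way.")


check_runner("generate", "generate", "generate")  # matching runner: no error
```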
@@ -497,7 +491,8 @@ class LLM:
         truncate_prompt_tokens = None
         if isinstance(sampling_params, SamplingParams):
             truncate_prompt_tokens = sampling_params.truncate_prompt_tokens
-        _validate_truncation_size(self.llm_engine.model_config.max_model_len,
+
+        _validate_truncation_size(model_config.max_model_len,
                                   truncate_prompt_tokens, tokenization_kwargs)
 
         # Add any modality specific loras to the corresponding prompts
@@ -1100,16 +1095,10 @@ class LLM:
         model_config = self.llm_engine.model_config
         runner_type = model_config.runner_type
         if runner_type != "pooling":
-            messages = ["LLM.encode() is only supported for pooling models."]
-
-            if "pooling" in model_config.supported_runner_types:
-                messages.append(
-                    "Your model supports the 'pooling' runner, but is "
-                    f"currently initialized for the '{runner_type}' runner. "
-                    "Please initialize vLLM using `--task embed`, "
-                    "`--task classify`, `--task score` etc.")
-
-            raise ValueError(" ".join(messages))
+            raise ValueError(
+                "LLM.encode() is only supported for pooling models. "
+                "Try passing `--runner pooling` to use the model as a "
+                "pooling model.")
 
         if prompt_token_ids is not None:
             parsed_prompts = self._convert_v1_inputs(
@@ -1183,8 +1172,9 @@ class LLM:
             embedding vectors in the same order as the input prompts.
         """
         if "embed" not in self.supported_tasks:
-            raise ValueError("Embedding API is not supported by this model. "
-                             "Please set `--task embed`.")
+            raise ValueError(
+                "Embedding API is not supported by this model. "
+                "Try converting the model using `--convert embed`.")
 
         items = self.encode(
             prompts,
@@ -1229,7 +1219,7 @@ class LLM:
         if "classify" not in self.supported_tasks:
             raise ValueError(
                 "Classification API is not supported by this model. "
-                "Please set `--task classify`.")
+                "Try converting the model using `--convert classify`.")
 
         items = self.encode(
             prompts,
@@ -1283,27 +1273,26 @@ class LLM:
         use_tqdm: Union[bool, Callable[..., tqdm]] = True,
         lora_request: Optional[Union[list[LoRARequest], LoRARequest]] = None,
     ) -> list[ScoringRequestOutput]:
+        model_config = self.llm_engine.model_config
+
         if isinstance(tokenizer, MistralTokenizer):
             raise ValueError(
-                "Score API is only enabled for `--task embed or score`")
+                "Score API is not supported for Mistral tokenizer")
 
         if len(data_1) == 1:
             data_1 = data_1 * len(data_2)
 
         pooling_params = PoolingParams(task="score")
         tokenization_kwargs: dict[str, Any] = {}
-        _validate_truncation_size(self.llm_engine.model_config.max_model_len,
+        _validate_truncation_size(model_config.max_model_len,
                                   truncate_prompt_tokens, tokenization_kwargs)
 
         parsed_prompts = []
 
         input_pairs = [(t1, t2) for t1, t2 in zip(data_1, data_2)]
 
-        if self.llm_engine.model_config.is_multimodal_model:
+        if model_config.is_multimodal_model:
 
-            model_config = self.llm_engine.model_config
-
             for q, d in input_pairs:
                 _, engine_prompt = get_score_prompt(
                     model_config=model_config,
@@ -1314,11 +1303,9 @@ class LLM:
             )
 
             parsed_prompts.append(engine_prompt)
-
         else:
-
             for q, t in input_pairs:
-                if self.llm_engine.model_config.use_pad_token:
+                if model_config.use_pad_token:
                     # cross_encoder models defaults to using pad_token.
                     prompt_inputs = tokenizer(
                         text=q,  # type: ignore[arg-type]
@@ -1396,23 +1383,18 @@ class LLM:
         model_config = self.llm_engine.model_config
         runner_type = model_config.runner_type
         if runner_type != "pooling":
-            messages = ["LLM.score() is only supported for pooling models."]
-
-            if "pooling" in model_config.supported_runner_types:
-                messages.append(
-                    "Your model supports the 'pooling' runner, but is "
-                    f"currently initialized for the '{runner_type}' runner. "
-                    "Please initialize vLLM using `--task embed`, "
-                    "`--task classify`, `--task score` etc.")
-
-            raise ValueError(" ".join(messages))
+            raise ValueError(
+                "LLM.score() is only supported for pooling models. "
+                "Try passing `--runner pooling` to use the model as a "
+                "pooling model.")
 
         supported_tasks = self.supported_tasks
         if all(t not in supported_tasks for t in ("embed", "classify")):
             raise ValueError("Score API is not supported by this model. "
-                             "Please set `--task embed` or `--task classify`.")
+                             "Try converting the model using "
+                             "`--convert embed` or `--convert classify`.")
 
-        if (model_config.task == "classify"
+        if (model_config.is_cross_encoder
                 and getattr(model_config.hf_config, "num_labels", 0) != 1):
             raise ValueError("Score API is only enabled for num_labels == 1.")
 
@@ -1421,15 +1403,14 @@ class LLM:
         # lists of tokens to the `text` and `text_pair` kwargs
         tokenizer = self.get_tokenizer()
 
-        if not self.llm_engine.model_config.is_multimodal_model:
+        if not model_config.is_multimodal_model:
 
             def check_data_type(data: Union[SingletonPrompt,
                                             Sequence[SingletonPrompt],
                                             ScoreMultiModalParam]):
                 if isinstance(data, dict) and "content" in data:
-                    raise ValueError(
-                        f"ScoreMultiModalParam is not supported for {self.llm_engine.model_config.architecture}",  # noqa: E501
-                    )
+                    raise ValueError("ScoreMultiModalParam is not supported "
+                                     f"for {model_config.architecture}")
 
             check_data_type(data_1)
             check_data_type(data_2)
@@ -1471,7 +1452,7 @@ class LLM:
 
         _validate_score_input_lens(data_1, data_2)  # type: ignore[arg-type]
 
-        if self.llm_engine.model_config.is_cross_encoder:
+        if model_config.is_cross_encoder:
             return self._cross_encoding_score(
                 tokenizer,
                 data_1,  # type: ignore[arg-type]
@@ -1734,7 +1734,6 @@ async def init_app_state(
         state.openai_serving_models,
         request_logger=request_logger,
     ) if "transcription" in supported_tasks else None
-    state.task = model_config.task
 
     state.enable_server_load_tracking = args.enable_server_load_tracking
     state.server_load_metrics = 0
@@ -9,9 +9,8 @@ from dataclasses import dataclass, field
 from typing import Optional
 
 import torch
-import transformers
 from torch import nn
-from transformers.dynamic_module_utils import get_class_from_dynamic_module
+from typing_extensions import assert_never
 
 from vllm.attention import Attention
 from vllm.config import (ModelConfig, ModelImpl, VllmConfig,
@@ -20,13 +19,10 @@ from vllm.logger import init_logger
 from vllm.model_executor.layers.linear import QKVCrossParallelLinear
 from vllm.model_executor.layers.quantization.base_config import (
     QuantizationConfig, QuantizeMethodBase)
-from vllm.model_executor.models import ModelRegistry
 from vllm.model_executor.models.adapters import (as_embedding_model,
                                                  as_reward_model,
                                                  as_seq_cls_model)
 from vllm.model_executor.models.interfaces import SupportsQuant
-from vllm.model_executor.models.registry import (_PREVIOUSLY_SUPPORTED_MODELS,
-                                                 _TRANSFORMERS_BACKEND_MODELS)
 from vllm.utils import is_pin_memory_available
 
 logger = init_logger(__name__)
@@ -169,61 +165,6 @@ def device_loading_context(module: torch.nn.Module,
     # New parameters or parameters already on target device are untouched
 
 
-def resolve_transformers_arch(model_config: ModelConfig,
-                              architectures: list[str]):
-    if model_config.model_impl == ModelImpl.VLLM:
-        raise ValueError(
-            "Attempting to resolve architecture from the Transformers library "
-            "but the model implementation is set to vLLM. This should never "
-            "happen.")
-
-    for i, arch in enumerate(architectures):
-        if arch in _TRANSFORMERS_BACKEND_MODELS:
-            continue
-
-        if model_config.model_impl == ModelImpl.AUTO:
-            logger.warning(
-                "%s has no vLLM implementation, falling back to Transformers "
-                "implementation. Some features may not be supported and "
-                "performance may not be optimal.", arch)
-
-        auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
-                                           None) or dict()
-        # Make sure that config class is always initialized before model class,
-        # otherwise the model class won't be able to access the config class,
-        # the expected auto_map should have correct order like:
-        # "auto_map": {
-        #     "AutoConfig": "<your-repo-name>--<config-name>",
-        #     "AutoModel": "<your-repo-name>--<config-name>",
-        #     "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
-        # },
-        auto_modules = {
-            name:
-            get_class_from_dynamic_module(module,
-                                          model_config.model,
-                                          revision=model_config.revision)
-            for name, module in sorted(auto_map.items(), key=lambda x: x[0])
-        }
-        model_module = getattr(transformers, arch, None)
-        if model_module is None:
-            if "AutoModel" not in auto_map:
-                raise ValueError(
-                    f"Cannot find model module. '{arch}' is not a registered "
-                    "model in the Transformers library (only relevant if the "
-                    "model is meant to be in Transformers) and 'AutoModel' is "
-                    "not present in the model config's 'auto_map' (relevant "
-                    "if the model is custom).")
-            model_module = auto_modules["AutoModel"]
-
-        if not model_module.is_backend_compatible():
-            raise ValueError(
-                f"The Transformers implementation of '{arch}' is not "
-                "compatible with vLLM.")
-
-        architectures[i] = model_config._get_transformers_backend_cls()
-    return architectures
-
-
 def get_model_architecture(
         model_config: ModelConfig) -> tuple[type[nn.Module], str]:
     architectures = getattr(model_config.hf_config, "architectures", [])
@@ -239,56 +180,38 @@ def get_model_architecture(
         "bitsandbytes",
     ]
 
-    vllm_supported_archs = ModelRegistry.get_supported_archs()
-    is_supported = lambda arch: (arch in vllm_supported_archs and arch not in
-                                 _TRANSFORMERS_BACKEND_MODELS)
-    vllm_not_supported = not any(is_supported(arch) for arch in architectures)
-
-    if vllm_not_supported:
-        # try automatic conversion in adapters.py
-        for arch in architectures:
-            if not arch.endswith("ForSequenceClassification"):
-                continue
-
-            assert model_config.task == "classify"
-            causal_lm_arch = arch.replace("ForSequenceClassification",
-                                          "ForCausalLM")
-            causal_lm_arch_vllm_supported = (causal_lm_arch
-                                             in vllm_supported_archs)
-            if not causal_lm_arch_vllm_supported:
-                continue
-
-            architectures = [causal_lm_arch]
-            vllm_not_supported = False
-            break
-
-    if any(arch in _PREVIOUSLY_SUPPORTED_MODELS for arch in architectures):
-        previous_version = _PREVIOUSLY_SUPPORTED_MODELS[architectures[0]]
-        raise ValueError(
-            f"Model architecture {architectures[0]} was supported"
-            f" in vLLM until version {previous_version}, and is "
-            "not supported anymore. Please use an older version"
-            " of vLLM if you want to use this model architecture.")
-
-    if (model_config.model_impl == ModelImpl.TRANSFORMERS or
-            model_config.model_impl == ModelImpl.AUTO and vllm_not_supported):
-        architectures = resolve_transformers_arch(model_config, architectures)
-        logger.debug_once("Resolve transformers arch %s", str(architectures))
-    elif (model_config.quantization is not None
-          and model_config.quantization not in mixtral_supported
-          and "MixtralForCausalLM" in architectures):
+    if (model_config.quantization is not None
+            and model_config.quantization not in mixtral_supported
+            and "MixtralForCausalLM" in architectures):
         architectures = ["QuantMixtralForCausalLM"]
 
-    model_cls, arch = ModelRegistry.resolve_model_cls(architectures)
-    if model_config.task == "embed":
-        logger.debug_once("Automatic conversion using `as_embedding_model`.")
+    model_cls, arch = model_config.registry.resolve_model_cls(
+        architectures,
+        model_config=model_config,
+    )
+
+    if arch == model_config._get_transformers_backend_cls():
+        assert model_config.model_impl != ModelImpl.VLLM
+        if model_config.model_impl == ModelImpl.AUTO:
+            logger.warning_once(
+                "%s has no vLLM implementation, falling back to Transformers "
+                "implementation. Some features may not be supported and "
+                "performance may not be optimal.", arch)
+
+    convert_type = model_config.convert_type
+    if convert_type == "none":
+        pass
+    elif convert_type == "embed":
+        logger.debug_once("Converting to embedding model.")
         model_cls = as_embedding_model(model_cls)
-    elif model_config.task == "classify":
-        logger.debug_once("Automatic conversion using `as_seq_cls_model`.")
+    elif convert_type == "classify":
+        logger.debug_once("Converting to sequence classification model.")
         model_cls = as_seq_cls_model(model_cls)
-    elif model_config.task == "reward":
-        logger.debug_once("Automatic conversion using `as_reward_model`.")
+    elif convert_type == "reward":
+        logger.debug_once("Converting to reward model.")
        model_cls = as_reward_model(model_cls)
+    else:
+        assert_never(convert_type)
 
     return model_cls, arch
 
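The new `convert_type` dispatch above picks an adapter wrapper per conversion kind and falls through to `assert_never` for exhaustiveness. A self-contained sketch of the same dispatch shape, with stand-in adapter functions (the real `as_embedding_model`, `as_seq_cls_model`, and `as_reward_model` live in `vllm.model_executor.models.adapters`):

```python
# Sketch of the convert-type dispatch. The adapter functions here are
# hypothetical stand-ins that just tag the class, so the control flow can
# be exercised without vLLM installed.
def as_embedding_model(cls):
    return ("embed", cls)


def as_seq_cls_model(cls):
    return ("classify", cls)


def as_reward_model(cls):
    return ("reward", cls)


_CONVERTERS = {
    "embed": as_embedding_model,
    "classify": as_seq_cls_model,
    "reward": as_reward_model,
}


def convert_model_cls(model_cls, convert_type: str):
    # "none" leaves the resolved class untouched; anything else must be a
    # known conversion, mirroring the assert_never fall-through in the diff.
    if convert_type == "none":
        return model_cls
    if convert_type in _CONVERTERS:
        return _CONVERTERS[convert_type](model_cls)
    raise AssertionError(f"unhandled convert type: {convert_type!r}")


class DummyForCausalLM:
    pass
```

A table-driven dispatch like this stays exhaustive by construction: adding a new convert type means adding one entry, and anything unlisted fails loudly instead of being silently ignored.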
@@ -253,8 +253,10 @@ class HybridAttentionMambaModelConfig(VerifyAndUpdateConfig):
             dtype=kv_cache_dtype,
             use_mla=model_config.use_mla).page_size_bytes
 
-        model_cls = ModelRegistry.resolve_model_cls(
-            model_config._model_info.architecture)[0]
+        model_cls, _ = ModelRegistry.resolve_model_cls(
+            model_config.architecture,
+            model_config=model_config,
+        )
 
         # get mamba page size
         mamba_page_size = MambaSpec(
@@ -12,19 +12,24 @@ import sys
 import tempfile
 from abc import ABC, abstractmethod
 from collections.abc import Set
-from dataclasses import asdict, dataclass, field
+from dataclasses import dataclass, field
 from functools import lru_cache
 from typing import Callable, Optional, TypeVar, Union
 
 import torch.nn as nn
+import transformers
 
+from vllm.config import (ModelConfig, ModelImpl, iter_architecture_defaults,
+                         try_match_architecture_defaults)
 from vllm.logger import init_logger
+from vllm.transformers_utils.dynamic_module import (
+    try_get_class_from_dynamic_module)
 
 from .interfaces import (has_inner_state, has_noops, is_attention_free,
                          is_hybrid, supports_cross_encoding,
                          supports_multimodal, supports_multimodal_raw_input,
                          supports_pp, supports_transcription, supports_v0_only)
-from .interfaces_base import is_text_generation_model
+from .interfaces_base import is_pooling_model, is_text_generation_model
 
 logger = init_logger(__name__)
 
@@ -311,7 +316,7 @@ class _ModelInfo:
         return _ModelInfo(
             architecture=model.__name__,
             is_text_generation_model=is_text_generation_model(model),
-            is_pooling_model=True,  # Can convert any model into a pooling model
+            is_pooling_model=is_pooling_model(model),
             supports_cross_encoding=supports_cross_encoding(model),
             supports_multimodal=supports_multimodal(model),
             supports_multimodal_raw_input=supports_multimodal_raw_input(model),
@@ -465,6 +470,16 @@ class _ModelRegistry:
                 f"Model architectures {architectures} failed "
                 "to be inspected. Please check the logs for more details.")
 
+        for arch in architectures:
+            if arch in _PREVIOUSLY_SUPPORTED_MODELS:
+                previous_version = _PREVIOUSLY_SUPPORTED_MODELS[arch]
+
+                raise ValueError(
+                    f"Model architecture {arch} was supported in vLLM until "
+                    f"v{previous_version}, and is not supported anymore. "
+                    "Please use an older version of vLLM if you want to "
+                    "use this model architecture.")
+
         raise ValueError(
             f"Model architectures {architectures} are not supported for now. "
             f"Supported architectures: {all_supported_archs}")
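The block added above gives architectures that vLLM used to support a dedicated error naming the last supporting version, instead of the generic "not supported" message. A standalone sketch of that guard (the architecture name and version below are invented for illustration; the real table is `_PREVIOUSLY_SUPPORTED_MODELS` in the model registry):

```python
# Sketch of the dropped-architecture guard. Dict contents are made up for
# illustration only.
PREVIOUSLY_SUPPORTED = {"ExampleLegacyForCausalLM": "0.9.2"}


def check_previously_supported(architectures):
    """Raise a targeted error for architectures vLLM no longer supports."""
    for arch in architectures:
        if arch in PREVIOUSLY_SUPPORTED:
            previous_version = PREVIOUSLY_SUPPORTED[arch]
            raise ValueError(
                f"Model architecture {arch} was supported in vLLM until "
                f"v{previous_version}, and is not supported anymore. "
                "Please use an older version of vLLM.")
```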
@@ -477,174 +492,284 @@ class _ModelRegistry:
             return _try_load_model_cls(model_arch, self.models[model_arch])
 
     def _try_inspect_model_cls(self, model_arch: str) -> Optional[_ModelInfo]:
-        if model_arch in self.models:
-            return _try_inspect_model_cls(model_arch, self.models[model_arch])
-
-        if model_arch.endswith("ForSequenceClassification"):
-            causal_lm_arch = model_arch.replace("ForSequenceClassification",
-                                                "ForCausalLM")
-            if causal_lm_arch not in self.models:
+        if model_arch not in self.models:
+            return None
+
+        return _try_inspect_model_cls(model_arch, self.models[model_arch])
+
+    def _try_resolve_transformers(
+        self,
+        architecture: str,
+        model_config: ModelConfig,
+    ) -> Optional[str]:
+        if architecture in _TRANSFORMERS_BACKEND_MODELS:
+            return architecture
+
+        auto_map: dict[str, str] = getattr(model_config.hf_config, "auto_map",
+                                           None) or dict()
+
+        # Make sure that config class is always initialized before model class,
+        # otherwise the model class won't be able to access the config class,
+        # the expected auto_map should have correct order like:
+        # "auto_map": {
+        #     "AutoConfig": "<your-repo-name>--<config-name>",
+        #     "AutoModel": "<your-repo-name>--<config-name>",
+        #     "AutoModelFor<Task>": "<your-repo-name>--<config-name>",
+        # },
+        for prefix in ("AutoConfig", "AutoModel"):
+            for name, module in auto_map.items():
+                if name.startswith(prefix):
+                    try_get_class_from_dynamic_module(
+                        module,
+                        model_config.model,
+                        revision=model_config.revision,
+                        warn_on_fail=False,
+                    )
+
+        model_module = getattr(transformers, architecture, None)
+
+        if model_module is None:
+            for name, module in auto_map.items():
+                if name.startswith("AutoModel"):
+                    model_module = try_get_class_from_dynamic_module(
+                        module,
+                        model_config.model,
+                        revision=model_config.revision,
+                        warn_on_fail=True,
|
||||||
|
)
|
||||||
|
if model_module is not None:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
if model_config.model_impl != ModelImpl.TRANSFORMERS:
|
||||||
|
return None
|
||||||
|
|
||||||
|
raise ValueError(
|
||||||
|
f"Cannot find model module. {architecture!r} is not a "
|
||||||
|
"registered model in the Transformers library (only "
|
||||||
|
"relevant if the model is meant to be in Transformers) "
|
||||||
|
"and 'AutoModel' is not present in the model config's "
|
||||||
|
"'auto_map' (relevant if the model is custom).")
|
||||||
|
|
||||||
|
if not model_module.is_backend_compatible():
|
||||||
|
if model_config.model_impl != ModelImpl.TRANSFORMERS:
|
||||||
return None
|
return None
|
||||||
|
|
||||||
info = _try_inspect_model_cls(causal_lm_arch,
|
raise ValueError(
|
||||||
self.models[causal_lm_arch])
|
f"The Transformers implementation of {architecture!r} "
|
||||||
|
"is not compatible with vLLM.")
|
||||||
|
|
||||||
info = _ModelInfo(**dict(
|
return model_config._get_transformers_backend_cls()
|
||||||
asdict(info), **{
|
|
||||||
"architecture": model_arch,
|
|
||||||
"supports_cross_encoding": True
|
|
||||||
}))
|
|
||||||
return info
|
|
||||||
|
|
||||||
return None
|
def _normalize_arch(
|
||||||
|
self,
|
||||||
|
architecture: str,
|
||||||
|
model_config: ModelConfig,
|
||||||
|
) -> str:
|
||||||
|
if architecture in self.models:
|
||||||
|
return architecture
|
||||||
|
|
||||||
|
# This may be called in order to resolve runner_type and convert_type
|
||||||
|
# in the first place, in which case we consider the default match
|
||||||
|
match = try_match_architecture_defaults(
|
||||||
|
architecture,
|
||||||
|
runner_type=getattr(model_config, "runner_type", None),
|
||||||
|
convert_type=getattr(model_config, "convert_type", None),
|
||||||
|
)
|
||||||
|
if match:
|
||||||
|
suffix, _ = match
|
||||||
|
|
||||||
|
# Get the name of the base model to convert
|
||||||
|
for repl_suffix, _ in iter_architecture_defaults():
|
||||||
|
base_arch = architecture.replace(suffix, repl_suffix)
|
||||||
|
if base_arch in self.models:
|
||||||
|
return base_arch
|
||||||
|
|
||||||
|
return architecture
|
||||||
|
|
||||||
def _normalize_archs(
|
def _normalize_archs(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: list[str],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> list[str]:
|
) -> list[str]:
|
||||||
if isinstance(architectures, str):
|
|
||||||
architectures = [architectures]
|
|
||||||
if not architectures:
|
if not architectures:
|
||||||
logger.warning("No model architectures are specified")
|
logger.warning("No model architectures are specified")
|
||||||
|
|
||||||
# filter out support architectures
|
return [
|
||||||
normalized_arch = list(
|
self._normalize_arch(arch, model_config) for arch in architectures
|
||||||
filter(lambda model: model in self.models, architectures))
|
]
|
||||||
|
|
||||||
# try automatic conversion in adapters.py
|
|
||||||
for arch in architectures:
|
|
||||||
if not arch.endswith("ForSequenceClassification"):
|
|
||||||
continue
|
|
||||||
causal_lm_arch = arch.replace("ForSequenceClassification",
|
|
||||||
"ForCausalLM")
|
|
||||||
if causal_lm_arch in self.models:
|
|
||||||
normalized_arch.append(arch)
|
|
||||||
|
|
||||||
# NOTE(Isotr0py): Be careful of architectures' order!
|
|
||||||
# Make sure Transformers backend architecture is at the end of the
|
|
||||||
# list, otherwise pooling models automatic conversion will fail!
|
|
||||||
for arch in normalized_arch:
|
|
||||||
if arch.startswith("TransformersFor"):
|
|
||||||
normalized_arch.remove(arch)
|
|
||||||
normalized_arch.append(arch)
|
|
||||||
|
|
||||||
return normalized_arch
|
|
||||||
|
|
||||||
def inspect_model_cls(
|
def inspect_model_cls(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> tuple[_ModelInfo, str]:
|
) -> tuple[_ModelInfo, str]:
|
||||||
architectures = self._normalize_archs(architectures)
|
if isinstance(architectures, str):
|
||||||
|
architectures = [architectures]
|
||||||
|
|
||||||
for arch in architectures:
|
normalized_archs = self._normalize_archs(architectures, model_config)
|
||||||
model_info = self._try_inspect_model_cls(arch)
|
|
||||||
|
# Require transformers impl
|
||||||
|
if model_config.model_impl == ModelImpl.TRANSFORMERS:
|
||||||
|
arch = self._try_resolve_transformers(architectures[0],
|
||||||
|
model_config)
|
||||||
|
if arch is not None:
|
||||||
|
model_info = self._try_inspect_model_cls(arch)
|
||||||
|
if model_info is not None:
|
||||||
|
return (model_info, arch)
|
||||||
|
|
||||||
|
for arch, normalized_arch in zip(architectures, normalized_archs):
|
||||||
|
model_info = self._try_inspect_model_cls(normalized_arch)
|
||||||
if model_info is not None:
|
if model_info is not None:
|
||||||
return (model_info, arch)
|
return (model_info, arch)
|
||||||
|
|
||||||
|
# Fallback to transformers impl
|
||||||
|
if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
|
||||||
|
arch = self._try_resolve_transformers(architectures[0],
|
||||||
|
model_config)
|
||||||
|
if arch is not None:
|
||||||
|
model_info = self._try_inspect_model_cls(arch)
|
||||||
|
if model_info is not None:
|
||||||
|
return (model_info, arch)
|
||||||
|
|
||||||
return self._raise_for_unsupported(architectures)
|
return self._raise_for_unsupported(architectures)
|
||||||
|
|
||||||
def resolve_model_cls(
|
def resolve_model_cls(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> tuple[type[nn.Module], str]:
|
) -> tuple[type[nn.Module], str]:
|
||||||
architectures = self._normalize_archs(architectures)
|
if isinstance(architectures, str):
|
||||||
|
architectures = [architectures]
|
||||||
|
|
||||||
for arch in architectures:
|
normalized_archs = self._normalize_archs(architectures, model_config)
|
||||||
model_cls = self._try_load_model_cls(arch)
|
|
||||||
|
# Require transformers impl
|
||||||
|
if model_config.model_impl == ModelImpl.TRANSFORMERS:
|
||||||
|
arch = self._try_resolve_transformers(architectures[0],
|
||||||
|
model_config)
|
||||||
|
if arch is not None:
|
||||||
|
model_cls = self._try_load_model_cls(arch)
|
||||||
|
if model_cls is not None:
|
||||||
|
return (model_cls, arch)
|
||||||
|
|
||||||
|
for arch, normalized_arch in zip(architectures, normalized_archs):
|
||||||
|
model_cls = self._try_load_model_cls(normalized_arch)
|
||||||
if model_cls is not None:
|
if model_cls is not None:
|
||||||
return (model_cls, arch)
|
return (model_cls, arch)
|
||||||
|
|
||||||
|
# Fallback to transformers impl
|
||||||
|
if model_config.model_impl in (ModelImpl.AUTO, ModelImpl.TRANSFORMERS):
|
||||||
|
arch = self._try_resolve_transformers(architectures[0],
|
||||||
|
model_config)
|
||||||
|
if arch is not None:
|
||||||
|
model_cls = self._try_load_model_cls(arch)
|
||||||
|
if model_cls is not None:
|
||||||
|
return (model_cls, arch)
|
||||||
|
|
||||||
return self._raise_for_unsupported(architectures)
|
return self._raise_for_unsupported(architectures)
|
||||||
|
|
||||||
def is_text_generation_model(
|
def is_text_generation_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.is_text_generation_model
|
return model_cls.is_text_generation_model
|
||||||
|
|
||||||
def is_pooling_model(
|
def is_pooling_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.is_pooling_model
|
return model_cls.is_pooling_model
|
||||||
|
|
||||||
def is_cross_encoder_model(
|
def is_cross_encoder_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_cross_encoding
|
return model_cls.supports_cross_encoding
|
||||||
|
|
||||||
def is_multimodal_model(
|
def is_multimodal_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_multimodal
|
return model_cls.supports_multimodal
|
||||||
|
|
||||||
def supports_multimodal_raw_input(
|
def supports_multimodal_raw_input(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_multimodal_raw_input
|
return model_cls.supports_multimodal_raw_input
|
||||||
|
|
||||||
def is_pp_supported_model(
|
def is_pp_supported_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_pp
|
return model_cls.supports_pp
|
||||||
|
|
||||||
def model_has_inner_state(
|
def model_has_inner_state(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.has_inner_state
|
return model_cls.has_inner_state
|
||||||
|
|
||||||
def is_attention_free_model(
|
def is_attention_free_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.is_attention_free
|
return model_cls.is_attention_free
|
||||||
|
|
||||||
def is_hybrid_model(
|
def is_hybrid_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.is_hybrid
|
return model_cls.is_hybrid
|
||||||
|
|
||||||
def is_noops_model(
|
def is_noops_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.has_noops
|
return model_cls.has_noops
|
||||||
|
|
||||||
def is_transcription_model(
|
def is_transcription_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_transcription
|
return model_cls.supports_transcription
|
||||||
|
|
||||||
def is_transcription_only_model(
|
def is_transcription_only_model(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return model_cls.supports_transcription_only
|
return model_cls.supports_transcription_only
|
||||||
|
|
||||||
def is_v1_compatible(
|
def is_v1_compatible(
|
||||||
self,
|
self,
|
||||||
architectures: Union[str, list[str]],
|
architectures: Union[str, list[str]],
|
||||||
|
model_config: ModelConfig,
|
||||||
) -> bool:
|
) -> bool:
|
||||||
model_cls, _ = self.inspect_model_cls(architectures)
|
model_cls, _ = self.inspect_model_cls(architectures, model_config)
|
||||||
return not model_cls.supports_v0_only
|
return not model_cls.supports_v0_only
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
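The registry changes above replace the old hard-coded `ForSequenceClassification` → `ForCausalLM` rewrite with generic suffix matching against a table of architecture defaults. A minimal standalone sketch of that normalization idea, assuming an illustrative suffix table and registry (vLLM's real data lives in its adapters/config modules, and the names below are hypothetical):

```python
# Illustrative suffix table: maps a convertible suffix to the base suffix
# that the registry actually knows about.
ARCH_SUFFIX_DEFAULTS = {
    "ForSequenceClassification": "ForCausalLM",
}

# Illustrative set of registered architectures.
REGISTERED = {"LlamaForCausalLM", "BertForSequenceClassification"}


def normalize_arch(architecture: str) -> str:
    """Return a registered base architecture for a convertible one,
    or the input unchanged so the caller can raise a helpful error."""
    if architecture in REGISTERED:
        return architecture
    for suffix, base_suffix in ARCH_SUFFIX_DEFAULTS.items():
        if architecture.endswith(suffix):
            base = architecture[:-len(suffix)] + base_suffix
            if base in REGISTERED:
                return base
    return architecture
```

With these assumptions, `LlamaForSequenceClassification` normalizes to the registered `LlamaForCausalLM`, while an already-registered architecture passes through untouched.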
 vllm/transformers_utils/dynamic_module.py | 60 (new file)
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: Apache-2.0
+# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
+import os
+from typing import Optional, Union
+
+from transformers.dynamic_module_utils import get_class_from_dynamic_module
+
+import vllm.envs as envs
+from vllm.logger import init_logger
+
+logger = init_logger(__name__)
+
+
+def try_get_class_from_dynamic_module(
+    class_reference: str,
+    pretrained_model_name_or_path: str,
+    cache_dir: Optional[Union[str, os.PathLike]] = None,
+    force_download: bool = False,
+    resume_download: Optional[bool] = None,
+    proxies: Optional[dict[str, str]] = None,
+    token: Optional[Union[bool, str]] = None,
+    revision: Optional[str] = None,
+    local_files_only: bool = False,
+    repo_type: Optional[str] = None,
+    code_revision: Optional[str] = None,
+    warn_on_fail: bool = True,
+    **kwargs,
+) -> Optional[type]:
+    """
+    As [transformers.dynamic_module_utils.get_class_from_dynamic_module][],
+    but ignoring any errors.
+    """
+    try:
+        return get_class_from_dynamic_module(
+            class_reference,
+            pretrained_model_name_or_path,
+            cache_dir=cache_dir,
+            force_download=force_download,
+            resume_download=resume_download,
+            proxies=proxies,
+            token=token,
+            revision=revision,
+            local_files_only=local_files_only,
+            repo_type=repo_type,
+            code_revision=code_revision,
+            **kwargs,
+        )
+    except Exception:
+        location = "ModelScope" if envs.VLLM_USE_MODELSCOPE else "HF Hub"
+
+        if warn_on_fail:
+            logger.warning(
+                "Unable to load %s from %s on %s.",
+                class_reference,
+                pretrained_model_name_or_path,
+                location,
+                exc_info=True,
+            )
+
+        return None
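The new helper above is a thin error-swallowing wrapper around Transformers' dynamic-module loader: any failure degrades to `None`, optionally with a logged warning. The same pattern in isolation, with a hypothetical loader callable standing in for the Transformers call:

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def try_load(loader: Callable[[], type],
             *, warn_on_fail: bool = True) -> Optional[type]:
    """Run a loader, returning None (optionally logging a warning)
    instead of propagating any exception."""
    try:
        return loader()
    except Exception:
        if warn_on_fail:
            logger.warning("Unable to load class.", exc_info=True)
        return None


class Dummy:  # stands in for a dynamically loaded model class
    pass


def good_loader() -> type:
    return Dummy


def bad_loader() -> type:
    raise RuntimeError("remote code not found")
```

This makes "probe for optional remote code" callable without try/except at every call site, which is why the registry can probe `auto_map` entries with `warn_on_fail=False`.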
@@ -3,6 +3,8 @@

 from typing import Optional

+from typing_extensions import assert_never
+
 from vllm.config import LoRAConfig, ModelConfig, SchedulerConfig
 from vllm.lora.request import LoRARequest
 from vllm.transformers_utils.tokenizer import (AnyTokenizer, encode_tokens,
@@ -108,6 +110,14 @@ class TokenizerGroup:
 def init_tokenizer_from_configs(model_config: ModelConfig,
                                 scheduler_config: SchedulerConfig,
                                 lora_config: Optional[LoRAConfig]):
+    runner_type = model_config.runner_type
+    if runner_type == "generate" or runner_type == "draft":
+        truncation_side = "left"
+    elif runner_type == "pooling":
+        truncation_side = "right"
+    else:
+        assert_never(runner_type)
+
     return TokenizerGroup(
         tokenizer_id=model_config.tokenizer,
         enable_lora=bool(lora_config),
@@ -117,4 +127,4 @@ def init_tokenizer_from_configs(model_config: ModelConfig,
         tokenizer_mode=model_config.tokenizer_mode,
         trust_remote_code=model_config.trust_remote_code,
         revision=model_config.tokenizer_revision,
-        truncation_side=model_config.truncation_side)
+        truncation_side=truncation_side)
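The truncation-side selection above encodes a simple rule per runner type: generative runners truncate from the left (keeping the most recent context at the end of an over-long prompt), while pooling runners truncate from the right (keeping the beginning of the text). A toy token-level illustration of the two sides, with plain lists standing in for tokenizer behaviour:

```python
def truncate(tokens: list[int], max_len: int, side: str) -> list[int]:
    """Drop tokens from the given side until the sequence fits."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[-max_len:] if side == "left" else tokens[:max_len]


tokens = [1, 2, 3, 4, 5]
# "generate"/"draft" runners: keep the end of the prompt.
generate_view = truncate(tokens, 3, "left")
# "pooling" runners: keep the start of the text.
pooling_view = truncate(tokens, 3, "right")
```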
@@ -127,8 +127,8 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
         self.is_multimodal_model = model_config.is_multimodal_model
         self.is_pooling_model = model_config.pooler_config is not None
         self.is_encoder_only_model = False
-        self.model_supports_multimodal_raw_input = (
-            model_config.model_supports_multimodal_raw_input)
+        self.is_multimodal_raw_input_supported = (
+            model_config.is_multimodal_raw_input_supported)
         self.max_model_len = model_config.max_model_len
         self.max_num_tokens = scheduler_config.max_num_batched_tokens
         self.max_num_reqs = scheduler_config.max_num_seqs
@@ -583,7 +583,7 @@ class GPUModelRunner(LoRAModelRunnerMixin, KVConnectorModelRunnerMixin):
     ) -> dict[str, Any]:

         model_kwargs: dict[str, Any] = {}
-        if self.model_supports_multimodal_raw_input:
+        if self.is_multimodal_raw_input_supported:
             # This model requires the raw multimodal data in input.
             if scheduler_output:
                 multi_modal_kwargs_list = []