[Doc] Update pooling model docs (#22186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
parent 54de71d0df · commit 1539ced93a
@@ -120,7 +120,7 @@ A code example can be found here: <gh-file:examples/offline_inference/basic/clas
### `LLM.score`

The [score][vllm.LLM.score] method outputs similarity scores between sentence pairs.
-It is designed for embedding models and cross encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.
+It is designed for embedding models and cross-encoder models. Embedding models use cosine similarity, and [cross-encoder models](https://www.sbert.net/examples/applications/cross-encoder/README.html) serve as rerankers between candidate query-document pairs in RAG systems.

!!! note
    vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG.
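For reference, a minimal offline sketch of the `LLM.score` call described in this hunk. The model name is an arbitrary embedding checkpoint, and `runner="pooling"` is an assumption that the `--runner pooling` CLI flag discussed later in this diff has a same-named `LLM` keyword argument; adjust for your vLLM version.

```python
from vllm import LLM

# Illustrative embedding model; any embedding model from the tables below should behave similarly.
llm = LLM(model="sentence-transformers/all-MiniLM-L6-v2", runner="pooling")

# For embedding models, score() returns the cosine similarity of each sentence pair.
outputs = llm.score(
    "What is the capital of France?",
    ["The capital of Brazil is Brasilia.", "The capital of France is Paris."],
)
for output in outputs:
    print(output.outputs.score)  # one similarity score per pair
```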
@@ -311,6 +311,8 @@ See [this page](generative_models.md) for more information on how to use generat

#### Text Generation

+These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.
+
<style>
th {
white-space: nowrap;
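As context for the `LLM.generate`/`LLM.chat` line added above, a minimal offline sketch. The model name is only an illustrative instruct checkpoint, not something taken from this diff.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # example instruct model
params = SamplingParams(temperature=0.8, max_tokens=64)

# Plain completion via LLM.generate
for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)

# Chat-style prompting via LLM.chat (chat/instruct checkpoints only)
for out in llm.chat([{"role": "user", "content": "Name three French cities."}], params):
    print(out.outputs[0].text)
```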
@@ -419,7 +421,9 @@ See [this page](./pooling_models.md) for more information on how to use pooling
Since some model architectures support both generative and pooling tasks,
you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.

-#### Text Embedding
+#### Embedding

+These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) API.
+
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
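A minimal offline sketch of the `LLM.embed` API referenced in the line added above. The model name is an illustrative embedding checkpoint rather than one prescribed by this diff, and `runner="pooling"` is assumed to mirror the `--runner pooling` CLI flag.

```python
from vllm import LLM

# Illustrative embedding checkpoint; pick any model from the embedding table.
llm = LLM(model="intfloat/e5-small", runner="pooling")

outputs = llm.embed(["Hello, my name is", "The president of the United States is"])
for output in outputs:
    embedding = output.outputs.embedding  # list of floats
    print(len(embedding), embedding[:4])
```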
@@ -457,28 +461,10 @@ If your model is not in the above list, we will try to automatically convert the
[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

-#### Reward Modeling
-
-| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
-|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
-| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
-| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
-
-<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
-\* Feature support is the same as that of the original model.
-
-If your model is not in the above list, we will try to automatically convert the model using
-[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
-
-!!! important
-    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
-    e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
-
#### Classification

+These models primarily support the [`LLM.classify`](./pooling_models.md#llmclassify) API.
+
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ | |
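A minimal offline sketch of the `LLM.classify` API added above. The model name is just an illustrative sequence-classification checkpoint, and `runner="pooling"` is assumed to mirror the CLI flag.

```python
from vllm import LLM

# Illustrative classification checkpoint; substitute any model from the table above.
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")

outputs = llm.classify(["Hello, my name is", "The capital of France is Paris."])
for output in outputs:
    probs = output.outputs.probs  # class probabilities pooled from the last token
    print(len(probs), probs)
```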
@@ -491,7 +477,10 @@ If your model is not in the above list, we will try to automatically convert the
If your model is not in the above list, we will try to automatically convert the model using
[as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

-#### Sentence Pair Scoring
+#### Cross-encoder / Reranker

+Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
+These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
+
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
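For the reranking use case added above, a sketch of scoring one query against several documents with a cross-encoder from the table just below. `runner="pooling"` is again assumed to mirror the CLI flag; this is not a prescribed recipe.

```python
from vllm import LLM

# BAAI/bge-reranker-v2-m3 appears in the supported-model table in the next hunk.
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")

query = "What is a panda?"
documents = [
    "The giant panda is a bear species endemic to China.",
    "Paris is the capital of France.",
]
# Cross-encoders score each (query, document) pair jointly rather than via cosine similarity.
outputs = llm.score(query, documents)
for doc, output in sorted(zip(documents, outputs), key=lambda pair: -pair[1].outputs.score):
    print(f"{output.outputs.score:.4f}  {doc}")
```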
@@ -501,6 +490,7 @@ If your model is not in the above list, we will try to automatically convert the
| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ | ✅︎ |
| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | | |
| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | | |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./pooling_models.md#model-conversion))
\* Feature support is the same as that of the original model.
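The note whose tail appears in the next hunk loads the original `Qwen/Qwen3-Reranker-0.6B` by overriding its architecture through `--hf_overrides`. A rough offline-equivalent sketch is below, assuming the `hf_overrides` and `runner` CLI flags map to same-named `LLM` keyword arguments; the original reranker's prompt-formatting requirements are not covered here.

```python
from vllm import LLM

# Assumption: these kwargs mirror the `vllm serve ... --hf_overrides ...` command in the next hunk.
llm = LLM(
    model="Qwen/Qwen3-Reranker-0.6B",
    runner="pooling",
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)

outputs = llm.score("What is the capital of China?", ["Beijing is the capital of China."])
print(outputs[0].outputs.score)
```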
@@ -526,6 +516,28 @@ If your model is not in the above list, we will try to automatically convert the
vllm serve Qwen/Qwen3-Reranker-0.6B --hf_overrides '{"architectures": ["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
```

+#### Reward Modeling
+
+These models primarily support the [`LLM.reward`](./pooling_models.md#llmreward) API.
+
+| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
+|--------------|--------|-------------------|----------------------|---------------------------|---------------------|
+| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `LlamaForCausalLM`<sup>C</sup> | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* | \* |
+
+<sup>C</sup> Automatically converted into a reward model via `--convert reward`. ([details](./pooling_models.md#model-conversion))
+\* Feature support is the same as that of the original model.
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
+
+!!! important
+    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
+    e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
+
[](){ #supported-mm-models }

## List of Multimodal Language Models
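A minimal offline sketch of the `LLM.reward` API added in this hunk. The model choice, the `runner="pooling"` keyword, and the need for `trust_remote_code` are assumptions; the exact structure of the returned reward data is version-dependent, so it is printed rather than unpacked.

```python
from vllm import LLM

# internlm/internlm2-1_8b-reward is listed in the reward table above.
llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)

outputs = llm.reward(["The assistant answered the question correctly and politely."])
# Inspect the pooled reward data rather than assuming a specific field name.
print(outputs[0].outputs)
```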
@@ -579,6 +591,8 @@ See [this page](generative_models.md) for more information on how to use generat

#### Text Generation

+These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.
+
| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/distributed_serving.md) | [V1](gh-issue:8779) |
|--------------|--------|--------|-------------------|----------------------|---------------------------|---------------------|
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | | ✅︎ |
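For the multimodal generation case, a sketch of `LLM.chat` with an image input. The model name and image URL are placeholders (any "T + I" model from the table above should accept a similar OpenAI-style chat request), so treat this as an illustration rather than a tested recipe.

```python
from vllm import LLM, SamplingParams

# Placeholder vision-language checkpoint; substitute a model from the table above.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # placeholder URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
for out in llm.chat(messages, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```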
@@ -720,11 +734,9 @@ Speech2Text models trained specifically for Automatic Speech Recognition.

See [this page](./pooling_models.md) for more information on how to use pooling models.

-!!! important
-    Since some model architectures support both generative and pooling tasks,
-    you should explicitly specify `--runner pooling` to ensure that the model is used in pooling mode instead of generative mode.
-
-#### Text Embedding
+#### Embedding

+These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) API.

!!! note
    To get the best results, you should use pooling models that are specifically trained as such.
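A heavily hedged sketch of multimodal embedding via `LLM.embed`. Both the model and the prompt template are placeholders here: multimodal embedding models each define their own image placeholder tokens and prompt format, so consult the model card before copying this.

```python
from PIL import Image
from vllm import LLM

# Placeholder multimodal embedding checkpoint and prompt format.
llm = LLM(model="TIGER-Lab/VLM2Vec-Full", runner="pooling", trust_remote_code=True)

image = Image.open("example.jpg")  # local image path, purely illustrative
outputs = llm.embed({
    "prompt": "<|image_1|> Represent the given image for retrieval.",  # model-specific placeholder token
    "multi_modal_data": {"image": image},
})
print(len(outputs[0].outputs.embedding))
```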
@@ -742,7 +754,10 @@ The following table lists those that are tested in vLLM.

---

-#### Scoring
+#### Cross-encoder / Reranker

+Cross-encoder and reranker models are a subset of classification models that accept two prompts as input.
+These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.
+
| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|-----------------------|