[Doc] Show default pooling method in a table (#11904)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung 2025-01-10 11:25:20 +08:00


@@ -8,14 +8,14 @@ In vLLM, generative models implement the {class}`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
For generative models, the only supported `--task` option is `"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
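For example, a minimal sketch of setting it explicitly (the model name is illustrative):

```python
from vllm import LLM

# Explicitly selecting the task; for generative models this is
# normally inferred from the model architecture anyway.
llm = LLM(model="facebook/opt-125m", task="generate")
```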
## Offline Inference
The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model.
### `LLM.generate`
The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
@@ -33,7 +33,7 @@ for output in outputs:
```
You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
For example, you can use greedy sampling by setting `temperature=0`:
```python
llm = LLM(model="facebook/opt-125m")
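# (assumes `from vllm import LLM, SamplingParams` at the top of the script)
params = SamplingParams(temperature=0)  # temperature=0 selects greedy sampling

# Illustrative usage; the prompt and print format here are assumptions.
outputs = llm.generate("Hello, my name is", params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```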


@@ -14,31 +14,54 @@ As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.
```
For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:
```{list-table}
:widths: 50 25 25 25
:header-rows: 1
* - Task
- Pooling Type
- Normalization
- Softmax
* - Embedding (`embed`)
- `LAST`
- ✅︎
- ✗
* - Classification (`classify`)
- `LAST`
- ✗
- ✅︎
* - Sentence Pair Scoring (`score`)
- \*
- \*
- \*
* - Reward Modeling (`reward`)
- `ALL`
- ✗
- ✗
```
\*The default pooler is always defined by the model.
```{note}
If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
```
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
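For example (a sketch with an illustrative checkpoint), loading a Sentence Transformers model picks up its configured pooling automatically:

```python
from vllm import LLM

# The pooling method is read from the checkpoint's modules.json,
# so no manual pooler configuration is needed here.
llm = LLM(model="sentence-transformers/all-MiniLM-L6-v2", task="embed")
```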
```{tip}
You can customize the model's pooling method via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
```
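To go beyond those defaults, a sketch of the override (the model and pooling settings are illustrative, not recommendations):

```python
from vllm import LLM
from vllm.config import PoolerConfig

# Hypothetical override: force MEAN pooling with normalization,
# taking priority over the model's own default pooler.
llm = LLM(
    model="intfloat/e5-small-v2",
    task="embed",
    override_pooler_config=PoolerConfig(pooling_type="MEAN", normalize=True),
)
```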
## Offline Inference
The {class}`~vllm.LLM` class provides various methods for offline inference.
See [Engine Arguments](#engine-args) for a list of options when initializing the model.
### `LLM.encode`
The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
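A minimal sketch of calling it (the model name is illustrative, and the exact attributes on the returned outputs can vary between vLLM versions):

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
outputs = llm.encode("Hello, my name is")

for output in outputs:
    print(output.outputs)  # the pooled hidden states for this prompt
```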