Mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2026-05-23 20:04:32 +08:00)
[Misc][Doc] Add note regarding loading generation_config by default (#15281)
Signed-off-by: Roger Wang <ywang@roblox.com>
parent d6cd59f122
commit 9c5c81b0da
@@ -58,6 +58,11 @@ from vllm import LLM, SamplingParams
 ```
 
 The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
+:::{important}
+By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+
+However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
+:::
 
 ```python
 prompts = [
@@ -76,7 +81,7 @@ llm = LLM(model="facebook/opt-125m")
 ```
 
 :::{note}
-By default, vLLM downloads models from [HuggingFace](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
+By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
 :::
 
 Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -107,6 +112,11 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
 By default, the server uses a predefined chat template stored in the tokenizer.
 You can learn about overriding it [here](#chat-template).
 :::
+:::{important}
+By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+
+To disable this behavior, please pass `--generation-config vllm` when launching the server.
+:::
 
 This server can be queried in the same format as OpenAI API. For example, to list the models:
 
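The server-side note above can be exercised from a launcher script. The following is a minimal sketch of one way to assemble the `vllm serve` command line. The helper name `serve_command` is hypothetical and exists only for illustration; the model name and the `--generation-config vllm` flag are the ones quoted in the note.

```python
import shlex


def serve_command(model: str, use_vllm_defaults: bool = False) -> list[str]:
    """Build the argv list for `vllm serve` (hypothetical helper)."""
    argv = ["vllm", "serve", model]
    if use_vllm_defaults:
        # Ignore the model repository's generation_config.json and keep
        # vLLM's own default sampling parameters.
        argv += ["--generation-config", "vllm"]
    return argv


# Print the command as a single shell-ready string.
print(shlex.join(serve_command("Qwen/Qwen2.5-1.5B-Instruct", use_vllm_defaults=True)))
```

The returned list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues if a model path ever contains spaces.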

@@ -46,6 +46,11 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
+:::{important}
+By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+
+However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
+:::
 A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>
 
 ### `LLM.beam_search`
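For offline inference, the note above comes down to a single constructor argument. The sketch below shows one possible way to make that choice explicit. The helper names `build_llm_kwargs` and `make_llm` are hypothetical; only `LLM(model=...)` and `generation_config="vllm"` come from the documentation text, and the model name is borrowed from the quickstart example.

```python
from typing import Any


def build_llm_kwargs(model: str, use_vllm_defaults: bool = False) -> dict[str, Any]:
    """Return keyword arguments for vllm.LLM (hypothetical helper)."""
    kwargs: dict[str, Any] = {"model": model}
    if use_vllm_defaults:
        # Opt out of the sampling defaults in the model's generation_config.json.
        kwargs["generation_config"] = "vllm"
    return kwargs


def make_llm(model: str, use_vllm_defaults: bool = False):
    """Create the LLM instance. vLLM is imported lazily, so build_llm_kwargs
    can be inspected on a machine where vLLM is not installed."""
    from vllm import LLM

    return LLM(**build_llm_kwargs(model, use_vllm_defaults))


# Inspect the arguments without loading a model.
print(build_llm_kwargs("facebook/opt-125m", use_vllm_defaults=True))
```

Leaving `use_vllm_defaults` at `False` keeps the behavior this commit documents: the model creator's recommended sampling parameters apply whenever `SamplingParams` is not specified.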

@@ -33,7 +33,11 @@ print(completion.choices[0].message)
 vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
 You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
 :::
-
+:::{important}
+By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+
+To disable this behavior, please pass `--generation-config vllm` when launching the server.
+:::
 ## Supported APIs
 
 We currently support the following OpenAI APIs:

@@ -1023,6 +1023,13 @@ class ModelConfig:
                     "max_new_tokens")
         else:
             diff_sampling_param = {}
+
+        if diff_sampling_param:
+            logger.warning_once(
+                "Default sampling parameters have been overridden by the "
+                "model's Hugging Face generation config recommended from the "
+                "model creator. If this is not intended, please relaunch "
+                "vLLM instance with `--generation-config vllm`.")
         return diff_sampling_param
 
     @property