mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2025-12-11 03:08:49 +08:00)
Update default max_num_batch_tokens for chunked prefill (#11694)
This commit is contained in:
parent 68d37809b9
commit 2f1e8e8f54
@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the command line
 ```python
 llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
 # Set max_num_batched_tokens to tune performance.
-# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
-# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
+# NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.
+# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)
 ```
 
 By default, the vLLM scheduler prioritizes prefills and doesn't batch prefill and decode requests in the same batch.
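
For quick reference, here is a hedged end-to-end sketch of the usage that the updated snippet documents. It assumes the `vllm` package is installed and the model weights are accessible; the prompt, sampling settings, and model choice are illustrative, and `max_num_batched_tokens=2048` simply restates the new default rather than a tuned value.

```python
from vllm import LLM, SamplingParams

# Chunked prefill lets the scheduler split a long prompt into chunks and mix
# prefill and decode work in one step, bounded by max_num_batched_tokens.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # illustrative model choice
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,        # stated explicitly; matches the new default
)

params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```
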
@@ -49,13 +49,12 @@ This policy has two benefits:
 - It improves ITL and generation decode speed because decode requests are prioritized.
 - It helps achieve better GPU utilization by placing compute-bound (prefill) and memory-bound (decode) requests in the same batch.
 
-You can tune the performance by changing `max_num_batched_tokens`.
-By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
+You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
 Smaller `max_num_batched_tokens` achieves better ITL because fewer prefills interrupt decodes.
 Higher `max_num_batched_tokens` achieves better TTFT because you can put more prefill tokens into each batch.
 
 - If `max_num_batched_tokens` is the same as `max_model_len`, that is almost equivalent to the default scheduling policy (except that it still prioritizes decodes).
-- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
+- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
 
 We recommend you set `max_num_batched_tokens > 2048` for throughput.
 
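
To make the ITL/TTFT trade-off described above concrete, here is a hedged sketch of a throughput-oriented configuration. The value 4096 is an illustrative choice that satisfies the `> 2048` recommendation; it is not taken from this commit or from any benchmark.

```python
from vllm import LLM

# Throughput-oriented tuning: a larger per-step token budget packs more prefill
# work into each batch, which helps TTFT and throughput at some cost to ITL.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",   # illustrative model choice
    enable_chunked_prefill=True,
    max_num_batched_tokens=4096,        # > 2048, per the throughput recommendation above
)
```

Conversely, dropping `max_num_batched_tokens` below the default trades TTFT for ITL, since fewer prefill tokens are scheduled alongside in-flight decodes in each step.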