Mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2025-12-10 00:06:06 +08:00)
[Doc] use power of 2 (#23172)
parent 21bcc8263f
commit 2c3f557f08
````diff
@@ -48,7 +48,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
 - Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
 - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
-- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096` especially for smaller models on large GPUs.
+- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
 
 ```python
````
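For context, `max_num_batched_tokens` is the scheduler's per-step token budget discussed in the patched doc. Below is a minimal sketch of how the knob is typically passed when constructing an offline `LLM` engine; the model name and the `enable_chunked_prefill` setting are illustrative placeholders, not part of this change.

```python
from vllm import LLM, SamplingParams

# Sketch: hand the scheduler budget discussed above to the engine.
# A power-of-2 budget such as 8192 favors throughput; a smaller value
# such as 2048 favors inter-token latency (ITL).
llm = LLM(
    model="facebook/opt-125m",       # placeholder model for illustration
    max_num_batched_tokens=8192,     # the value this commit corrects (8096 -> 8192)
    enable_chunked_prefill=True,     # assumed on here; the default varies by vLLM version
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same knob is exposed on the server CLI as `--max-num-batched-tokens` when launching `vllm serve`.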