From 2c3f557f08880b3cff86470b7ee358a047072990 Mon Sep 17 00:00:00 2001
From: Tialo <65392801+Tialo@users.noreply.github.com>
Date: Tue, 19 Aug 2025 13:16:23 +0300
Subject: [PATCH] [Doc] use power of 2 (#23172)

---
 docs/configuration/optimization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 2eeb8ad25de5f..c7f50497d6ffa 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -48,7 +48,7 @@ You can tune the performance by adjusting `max_num_batched_tokens`:
 
 - Smaller values (e.g., 2048) achieve better inter-token latency (ITL) because there are fewer prefills slowing down decodes.
 - Higher values achieve better time to first token (TTFT) as you can process more prefill tokens in a batch.
-- For optimal throughput, we recommend setting `max_num_batched_tokens > 8096` especially for smaller models on large GPUs.
+- For optimal throughput, we recommend setting `max_num_batched_tokens > 8192` especially for smaller models on large GPUs.
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the V0 default scheduling policy (except that it still prioritizes decodes).
 
 ```python
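
For reviewer context (not part of the patch), a minimal sketch of how the corrected recommendation might be applied when constructing an offline vLLM engine; the model name and the value 16384 are illustrative assumptions, not taken from the patched document:

```python
# Sketch only: raising max_num_batched_tokens above 8192 for throughput-oriented
# serving, as the corrected docs suggest. Model and exact value are placeholders.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",      # placeholder model for illustration
    max_num_batched_tokens=16384,   # power of 2 above 8192, per the updated guidance
)
```

The same setting is available on the server CLI as `--max-num-batched-tokens`.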