[Doc] Clarify cudagraph capture size logic and default behavior in scheduler (#18698)

Signed-off-by: Zazzle516 <2405677060@qq.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Zazzle516 2025-09-12 07:18:09 +08:00 committed by GitHub
parent f82f7a8990
commit 7a30fa8708


@@ -3579,30 +3579,40 @@ class VllmConfig:
     def _set_cudagraph_sizes(self):
         """
-        cudagraph batchsize padding logic:
+        vLLM defines the default candidate list of batch sizes for CUDA graph
+        capture as:
 
-        `[1, 2, 4] + [8 * i for i in range(1, 1025)]` is a list of all possible
-        batch sizes that cudagraph will capture.
+        ```python
+        max_graph_size = min(max_num_seqs * 2, 512)
+        # 1, 2, 4, then multiples of 8 up to max_graph_size
+        cuda_graph_sizes = [1, 2, 4, 8, 16, 24, 32, 40, ..., max_graph_size]
+        ```
 
-        Depending on the engine's configuration of `max_num_seqs`, the
-        candidate batch sizes to capture cudagraph will shrink to the subset
-        which just cover the range of `[1, max_num_seqs]`. In the common case,
-        `max_num_seqs` is 256, and the cudagraph batch sizes will be
-        `[1, 2, 4, 8, 16, 24, 32, 40, ..., 256]`.
-
-        However, if users specify the cudagraph capture sizes through
-        compilation config, we will use the specified sizes instead.
-
         In the end, `vllm_config.compilation_config.cudagraph_capture_sizes`
         will be the final sizes to capture cudagraph (in descending order).
 
-        During runtime, if batchsize is larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        no cudagraph will be used.
-        If the batch size is no larger than
-        `vllm_config.compilation_config.cudagraph_capture_sizes`,
-        we can quickly find the padded graph size for a given batch size by
-        looking up `vllm_config.compilation_config.bs_to_padded_graph_size`.
+        These sizes are used to capture and reuse CUDA graphs for
+        performance-critical paths (e.g., decoding). Capturing enables
+        significantly faster kernel dispatch by avoiding Python overhead. The
+        list is then filtered based on `max_num_batched_tokens` (e.g., 8192 on
+        most GPUs), which controls the total allowed number of tokens in a
+        batch. Since each sequence may have a variable number of tokens, the
+        maximum usable batch size will depend on actual sequence lengths.
+
+        Example:
+            With `max_num_batched_tokens = 8192`, and typical sequences
+            averaging ~32 tokens, most practical batch sizes fall below 256.
+            However, the system will still allow capture sizes up to 512 if
+            shape and memory permit.
+
+        Note:
+            If users explicitly specify cudagraph capture sizes in the
+            compilation config, those will override this default logic.
+
+        At runtime:
+        - If batch size <= one of the `cudagraph_capture_sizes`, the closest
+          padded CUDA graph will be used.
+        - If batch size > largest `cudagraph_capture_sizes`, cudagraph will
+          not be used.
         """
         # calculate the default `batch_size_capture_list`