Mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2026-03-19 00:57:15 +08:00)
perf: allow more numba threads since only rank 0 computes drafts
Address review feedback: since draft tokens are now computed only on rank 0 and broadcast to the other ranks, we no longer need to divide the thread count by tp_size. This allows rank 0 to use up to 8 threads for numba parallel processing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <yurekami@users.noreply.github.com>
This commit is contained in:
parent
0f94a71de5
commit
a2b3480236
```diff
@@ -34,21 +34,17 @@ class NgramProposer:
         # Threshold of total number of tokens in the batch to enable
         # multi-threading in numba batch propose.
         self.num_tokens_threshold = 8192
-        tp_size = vllm_config.parallel_config.tensor_parallel_size
         cpu_count = os.cpu_count()
         # Max number of threads for numba parallel processing.
+        # Since draft tokens are computed only on rank 0 and broadcast to other
+        # ranks (for TP consistency), rank 0 can use all available threads.
         if cpu_count:
             # Divide by 2 to use physical cores
             # and not logical cores (hyper-threading).
             # Cap the number of threads to 8 to avoid using too many threads
             # since other components like frontend (incl tokenization)
             # and Structured Outputs also use multiple threads.
-            # TODO(ekagra-ranjan): bump up the cap from 1 to 8
-            # when TP parallelization for ngram is implemented.
-            self.num_numba_thread_available = min(1, (cpu_count // 2))
-            # Divide by tp_size to ensure each tensor parallel rank
-            # has some threads since all ranks will run this.
-            self.num_numba_thread_available //= tp_size
+            self.num_numba_thread_available = min(8, cpu_count // 2)
         else:
             self.num_numba_thread_available = 1

```
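The resulting thread-budget logic can be sketched standalone as below. This is a minimal illustration of the same computation, not vLLM's actual class; the function name and `max_threads` parameter are invented for the example.

```python
import os

def numba_thread_budget(max_threads: int = 8) -> int:
    """Illustrative sketch of the diff's thread-count logic
    (name and signature are hypothetical, not vLLM's API)."""
    cpu_count = os.cpu_count()
    if not cpu_count:
        # os.cpu_count() may return None when the count is undeterminable.
        return 1
    # Halve to target physical cores rather than hyper-threaded logical
    # cores, then cap so other multi-threaded components (frontend
    # tokenization, structured outputs) keep CPU headroom.
    return min(max_threads, cpu_count // 2)
```

With the tp_size division removed, this budget is computed the same way on every rank, but only rank 0 actually runs the numba batch propose; the other ranks just receive the broadcast drafts.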