Mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2026-04-01 08:07:04 +08:00)
[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
parent 4b5b8a6a3b
commit 9855aea21b
@@ -199,9 +199,13 @@ class Scheduler:
             if num_new_tokens == 0:
                 # The happens when prompt length is divisible by the block
                 # size and all blocks are cached. Now we force to recompute
-                # the last token.
-                num_computed_tokens -= 1
-                num_new_tokens = 1
+                # the last block. Note that we have to re-compute an entire
+                # block because allocate_slots() assumes num_computed_tokens
+                # is always a multiple of the block size. This limitation
+                # can potentially be removed in the future to slightly
+                # improve the performance.
+                num_computed_tokens -= self.block_size
+                num_new_tokens = self.block_size
+                computed_blocks.pop()
             num_new_tokens = min(num_new_tokens, token_budget)
             assert num_new_tokens > 0
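The fix rolls the computed-token count back by one full block (instead of one token) when the prompt is fully cache-hit, and drops the last cached block, so that `num_computed_tokens` stays a multiple of the block size as `allocate_slots()` requires. A minimal standalone sketch of this logic (not the actual vLLM `Scheduler`; the function name and parameters here are hypothetical, chosen to mirror the variables in the diff):

```python
def adjust_for_full_cache_hit(num_prompt_tokens, num_computed_tokens,
                              computed_blocks, block_size, token_budget):
    """Sketch of the fixed scheduling logic from this commit.

    When every prompt token is covered by cached blocks, force
    re-computation of the entire last block so that
    num_computed_tokens remains a multiple of block_size.
    """
    num_new_tokens = num_prompt_tokens - num_computed_tokens
    if num_new_tokens == 0:
        # Fully cached prompt: re-compute the whole last block, because
        # downstream slot allocation assumes num_computed_tokens is
        # always a multiple of the block size (per the commit message).
        num_computed_tokens -= block_size
        num_new_tokens = block_size
        computed_blocks.pop()
    num_new_tokens = min(num_new_tokens, token_budget)
    assert num_new_tokens > 0
    return num_computed_tokens, num_new_tokens

# Example: a 32-token prompt with block_size=16 and both blocks cached.
blocks = ["blk0", "blk1"]
computed, new = adjust_for_full_cache_hit(32, 32, blocks, 16, 1024)
# The scheduler now re-computes the last 16-token block, and only one
# cached block remains attached to the request.
```

Before this fix, rolling back a single token (`num_computed_tokens -= 1`) left `num_computed_tokens` at a non-multiple of the block size, violating the `allocate_slots()` assumption the new comment documents.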