[Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Cody Yu 2024-12-13 17:08:23 -08:00 committed by GitHub
parent 4b5b8a6a3b
commit 9855aea21b


@@ -199,9 +199,13 @@ class Scheduler:
         if num_new_tokens == 0:
             # This happens when prompt length is divisible by the block
             # size and all blocks are cached. Now we force to recompute
-            # the last token.
-            num_computed_tokens -= 1
-            num_new_tokens = 1
+            # the last block. Note that we have to re-compute an entire
+            # block because allocate_slots() assumes num_computed_tokens
+            # is always a multiple of the block size. This limitation
+            # can potentially be removed in the future to slightly
+            # improve the performance.
+            num_computed_tokens -= self.block_size
+            num_new_tokens = self.block_size
+            computed_blocks.pop()
         num_new_tokens = min(num_new_tokens, token_budget)
         assert num_new_tokens > 0
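The adjustment above can be sketched in isolation. The function below is a hypothetical, self-contained illustration (not the actual vLLM `Scheduler` method): when every prompt token is a prefix-cache hit, it rolls `num_computed_tokens` back by one full block, so the scheduler still has at least one token to compute and `num_computed_tokens` stays a multiple of the block size, matching the assumption attributed here to `allocate_slots()`.

```python
def adjust_for_full_cache_hit(num_prompt_tokens: int,
                              block_size: int,
                              token_budget: int) -> tuple[int, int]:
    """Return (num_computed_tokens, num_new_tokens) after the fix.

    Hypothetical sketch: assumes the prompt length is a multiple of
    block_size and that prefix caching matched every block.
    """
    # All prompt tokens were found in the prefix cache.
    num_computed_tokens = num_prompt_tokens
    num_new_tokens = num_prompt_tokens - num_computed_tokens  # == 0

    if num_new_tokens == 0:
        # Re-compute the last *block*, not just the last token, so that
        # num_computed_tokens remains a multiple of the block size.
        num_computed_tokens -= block_size
        num_new_tokens = block_size

    # Clamp to the scheduling budget, as in the diff above.
    num_new_tokens = min(num_new_tokens, token_budget)
    assert num_new_tokens > 0
    assert num_computed_tokens % block_size == 0
    return num_computed_tokens, num_new_tokens
```

For example, with a 32-token prompt, block size 16, and a large budget, the result is `(16, 16)`: the last block is scheduled for recomputation while the computed-token count stays block-aligned.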