diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md
index 270699df623e0..bd4070f381d81 100644
--- a/docs/design/prefix_caching.md
+++ b/docs/design/prefix_caching.md
@@ -213,22 +213,22 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),
 
 ![Example Time 1](../assets/design/prefix_caching/example-time-1.png)
 
-**Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
+**Time 2: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
 
-![Example Time 3](../assets/design/prefix_caching/example-time-3.png)
+![Example Time 2](../assets/design/prefix_caching/example-time-3.png)
 
-**Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
+**Time 3: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
 
-![Example Time 4](../assets/design/prefix_caching/example-time-4.png)
+![Example Time 3](../assets/design/prefix_caching/example-time-4.png)
 
-**Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.
+**Time 4: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.
 
-![Example Time 5](../assets/design/prefix_caching/example-time-5.png)
+![Example Time 4](../assets/design/prefix_caching/example-time-5.png)
 
-**Time 6: Request 1 is finished and free.**
+**Time 5: Request 1 is finished and free.**
 
-![Example Time 6](../assets/design/prefix_caching/example-time-6.png)
+![Example Time 5](../assets/design/prefix_caching/example-time-6.png)
 
-**Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
+**Time 6: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
 
-![Example Time 7](../assets/design/prefix_caching/example-time-7.png)
+![Example Time 6](../assets/design/prefix_caching/example-time-7.png)
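The block arithmetic in the example above can be checked with a small simulation. This is a minimal sketch, not vLLM's implementation. It makes the following assumptions: the free queue is a plain Python list whose front is the eviction end, a cache hit counts only full blocks of matching prefix tokens, and the helpers `num_cached_blocks` and `allocate_blocks` are hypothetical names introduced here. It also ignores reference counts and hash-based block lookup.

```python
import math


def num_cached_blocks(matching_prefix_tokens: int, block_size: int) -> int:
    # Hypothetical helper: only *full* blocks of matching tokens hit the cache,
    # so a block that matches just part of its tokens is a miss.
    return matching_prefix_tokens // block_size


def allocate_blocks(free_queue, hit_blocks, num_prompt_tokens, block_size):
    """Hypothetical helper returning (allocated_blocks, remaining_free_queue).

    Cache-hit blocks are first "touched" (removed from the free queue), then
    the remaining new blocks are popped from the front of the queue.
    """
    num_needed = math.ceil(num_prompt_tokens / block_size)
    # Touch: drop cache-hit blocks from the free queue before allocating.
    queue = [b for b in free_queue if b not in hit_blocks]
    num_new = num_needed - len(hit_blocks)
    if num_new > len(queue):
        raise RuntimeError("not enough free blocks")
    new_blocks = queue[:num_new]
    return list(hit_blocks) + new_blocks, queue[num_new:]


BLOCK_SIZE = 4

# Request 1: 10 shared prefix tokens, so only 2 full blocks (8 tokens) hit.
print(num_cached_blocks(10, BLOCK_SIZE))  # 2

# Request 2: 12 shared prefix tokens, so blocks 0, 1 and 2 hit.
free_queue = [7, 8, 9, 4, 3, 2, 6, 5, 1, 0]
hits = [0, 1, 2][: num_cached_blocks(12, BLOCK_SIZE)]
allocated, remaining = allocate_blocks(free_queue, hits, 29, BLOCK_SIZE)
print(allocated)  # [0, 1, 2, 7, 8, 9, 4, 3]
print(remaining)  # [6, 5]
```

The 29-token prompt needs 8 blocks, 3 of which are cache hits, so 5 blocks are taken from the front of the touched queue. Block 3 still held a cached prefix, which is why the text marks it as evicted; this sketch does not track cache contents, so it only reproduces the block order.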