diff --git a/docs/assets/design/v1/metrics/intervals-1.png b/docs/assets/design/metrics/intervals-1.png
similarity index 100%
rename from docs/assets/design/v1/metrics/intervals-1.png
rename to docs/assets/design/metrics/intervals-1.png
diff --git a/docs/assets/design/v1/metrics/intervals-2.png b/docs/assets/design/metrics/intervals-2.png
similarity index 100%
rename from docs/assets/design/v1/metrics/intervals-2.png
rename to docs/assets/design/metrics/intervals-2.png
diff --git a/docs/assets/design/v1/metrics/intervals-3.png b/docs/assets/design/metrics/intervals-3.png
similarity index 100%
rename from docs/assets/design/v1/metrics/intervals-3.png
rename to docs/assets/design/metrics/intervals-3.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-1.png b/docs/assets/design/prefix_caching/example-time-1.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-1.png
rename to docs/assets/design/prefix_caching/example-time-1.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-3.png b/docs/assets/design/prefix_caching/example-time-3.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-3.png
rename to docs/assets/design/prefix_caching/example-time-3.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-4.png b/docs/assets/design/prefix_caching/example-time-4.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-4.png
rename to docs/assets/design/prefix_caching/example-time-4.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-5.png b/docs/assets/design/prefix_caching/example-time-5.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-5.png
rename to docs/assets/design/prefix_caching/example-time-5.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-6.png b/docs/assets/design/prefix_caching/example-time-6.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-6.png
rename to docs/assets/design/prefix_caching/example-time-6.png
diff --git a/docs/assets/design/v1/prefix_caching/example-time-7.png b/docs/assets/design/prefix_caching/example-time-7.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/example-time-7.png
rename to docs/assets/design/prefix_caching/example-time-7.png
diff --git a/docs/assets/design/v1/prefix_caching/free.png b/docs/assets/design/prefix_caching/free.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/free.png
rename to docs/assets/design/prefix_caching/free.png
diff --git a/docs/assets/design/v1/prefix_caching/overview.png b/docs/assets/design/prefix_caching/overview.png
similarity index 100%
rename from docs/assets/design/v1/prefix_caching/overview.png
rename to docs/assets/design/prefix_caching/overview.png
diff --git a/docs/assets/design/v1/tpu/most_model_len.png b/docs/assets/design/tpu/most_model_len.png
similarity index 100%
rename from docs/assets/design/v1/tpu/most_model_len.png
rename to docs/assets/design/tpu/most_model_len.png
diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md
index 0ff0cdda380e..a2941c80bd27 100644
--- a/docs/configuration/tpu.md
+++ b/docs/configuration/tpu.md
@@ -47,7 +47,7 @@ This initial compilation time ranges significantly and is impacted by many of th

 #### max model len vs. most model len

-![most_model_len](../assets/design/v1/tpu/most_model_len.png)
+![most_model_len](../assets/design/tpu/most_model_len.png)

 If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can negatively impact performance. In these cases, you can try introducing most model len by specifying the `VLLM_TPU_MOST_MODEL_LEN` environment variable.
diff --git a/docs/design/metrics.md b/docs/design/metrics.md
index ba34c7dca001..1f65331d3c0a 100644
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -223,7 +223,7 @@ And the calculated intervals are:

 Put another way:

-![Interval calculations - common case](../../assets/design/v1/metrics/intervals-1.png)
+![Interval calculations - common case](../assets/design/metrics/intervals-1.png)

 We explored the possibility of having the frontend calculate these
 intervals using the timing of events visible by the frontend. However,
@@ -238,13 +238,13 @@ When a preemption occurs during decode, since any already
 generated tokens are reused, we consider the preemption as affecting the inter-token, decode, and inference intervals.

-![Interval calculations - preempted decode](../../assets/design/v1/metrics/intervals-2.png)
+![Interval calculations - preempted decode](../assets/design/metrics/intervals-2.png)

 When a preemption occurs during prefill (assuming such an event is possible), we consider the preemption as affecting the time-to-first-token and prefill intervals.

-![Interval calculations - preempted prefill](../../assets/design/v1/metrics/intervals-3.png)
+![Interval calculations - preempted prefill](../assets/design/metrics/intervals-3.png)

 ### Frontend Stats Collection
diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md
index fcc014cf8516..9941837bf165 100644
--- a/docs/design/prefix_caching.md
+++ b/docs/design/prefix_caching.md
@@ -125,7 +125,7 @@ There are two design points to highlight:

 As a result, we will have the following components when the KV cache manager is initialized:

-![Component Overview](../../assets/design/v1/prefix_caching/overview.png)
+![Component Overview](../assets/design/prefix_caching/overview.png)

 * Block Pool: A list of KVCacheBlock.
 * Free Block Queue: Only store the pointers of head and tail blocks for manipulations.
@@ -195,7 +195,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund

 When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and block 2, 3, 4, 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.

-![Free queue after a request us freed](../../assets/design/v1/prefix_caching/free.png)
+![Free queue after a request us freed](../assets/design/prefix_caching/free.png)

 ### Eviction (LRU)

@@ -211,24 +211,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),

 **Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.

-![Example Time 1](../../assets/design/v1/prefix_caching/example-time-1.png)
+![Example Time 1](../assets/design/prefix_caching/example-time-1.png)

 **Time 3: Request 0 makes the block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.

-![Example Time 3](../../assets/design/v1/prefix_caching/example-time-3.png)
+![Example Time 3](../assets/design/prefix_caching/example-time-3.png)

 **Time 4: Request 1 comes in with the 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.

-![Example Time 4](../../assets/design/v1/prefix_caching/example-time-4.png)
+![Example Time 4](../assets/design/prefix_caching/example-time-4.png)

 **Time 5: Request 0 is finished and free.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but block 2 and 3 are still cached). Block 0 and 1 are not added to the free queue because they are being used by Request 1.

-![Example Time 5](../../assets/design/v1/prefix_caching/example-time-5.png)
+![Example Time 5](../assets/design/prefix_caching/example-time-5.png)

 **Time 6: Request 1 is finished and free.**

-![Example Time 6](../../assets/design/v1/prefix_caching/example-time-6.png)
+![Example Time 6](../assets/design/prefix_caching/example-time-6.png)

 **Time 7: Request 2 comes in with the 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).

-![Example Time 7](../../assets/design/v1/prefix_caching/example-time-7.png)
+![Example Time 7](../assets/design/prefix_caching/example-time-7.png)
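
The prefix_caching.md hunks above describe how the free block queue orders blocks for LRU eviction: a finished request's blocks are appended to the tail in *reverse* order, and cache-hit blocks are pulled back out of the queue before allocation. Below is a minimal Python sketch of that bookkeeping that reproduces the queue orderings quoted in the Time 5-7 steps. The `FreeBlockQueue` class, its method names, and the use of an `OrderedDict` are illustrative assumptions, not vLLM's actual KV cache manager (which the doc describes as a free block queue that stores only head and tail pointers over `KVCacheBlock` objects).

```python
from collections import OrderedDict


class FreeBlockQueue:
    """Toy model of the free block queue described above (hypothetical API)."""

    def __init__(self, num_blocks: int):
        # All blocks start free, ordered 0 .. num_blocks - 1 (head to tail).
        self._queue = OrderedDict((i, None) for i in range(num_blocks))

    def allocate(self) -> int:
        # Take the block at the head of the queue, evicting whatever
        # cached prefix it may still hold.
        block_id, _ = self._queue.popitem(last=False)
        return block_id

    def touch(self, block_ids):
        # Cache-hit blocks are removed from the queue before allocation so
        # they cannot be evicted while the new request is reusing them.
        for block_id in block_ids:
            self._queue.pop(block_id, None)

    def free(self, block_ids):
        # A finished request's blocks go to the tail in *reverse* order, so
        # its last (least reusable) block is the first eviction candidate.
        for block_id in reversed(block_ids):
            self._queue[block_id] = None

    def order(self):
        return list(self._queue)


q = FreeBlockQueue(10)
for _ in range(5):
    q.allocate()          # Times 1-3: request 0 holds blocks 0-4
for _ in range(2):
    q.allocate()          # Time 4: request 1 reuses blocks 0-1 and takes 5-6
q.free([2, 3, 4])         # Time 5: request 0 finishes (0 and 1 still in use)
q.free([0, 1, 5, 6])      # Time 6: request 1 finishes
print(q.order())          # [7, 8, 9, 4, 3, 2, 6, 5, 1, 0]
q.touch([0, 1, 2])        # Time 7: request 2 hits cached blocks 0-2
print(q.order())          # [7, 8, 9, 4, 3, 6, 5]
```

Running the sketch prints the same two queue orderings the Time 7 paragraph quotes, with blocks 7, 8, 9, 4, 3 then handed out from the head for the non-cached part of request 2.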