From f64ffa8c2541ac5dabd1eed3edeeeed9c618f7a3 Mon Sep 17 00:00:00 2001
From: Brayden Zhong
Date: Sat, 1 Mar 2025 00:43:54 -0500
Subject: [PATCH] [Docs] Add `pipeline_parallel_size` to optimization docs
 (#14059)

Signed-off-by: Brayden Zhong
---
 docs/source/performance/optimization.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/performance/optimization.md b/docs/source/performance/optimization.md
index 4fbc376e1aa3..5b0f8421a51e 100644
--- a/docs/source/performance/optimization.md
+++ b/docs/source/performance/optimization.md
@@ -18,6 +18,7 @@ If you frequently encounter preemptions from the vLLM engine, consider the follo
 - Increase `gpu_memory_utilization`. The vLLM pre-allocates GPU cache by using gpu_memory_utilization% of memory. By increasing this utilization, you can provide more KV cache space.
 - Decrease `max_num_seqs` or `max_num_batched_tokens`. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
 - Increase `tensor_parallel_size`. This approach shards model weights, so each GPU has more memory available for KV cache.
+- Increase `pipeline_parallel_size`. This approach distributes model layers across GPUs, reducing the memory needed for model weights on each GPU, which indirectly leaves more memory available for KV cache.

 You can also monitor the number of preemption requests through Prometheus metrics exposed by the vLLM. Additionally, you can log the cumulative number of preemption requests by setting disable_log_stats=False.
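For context, here is a minimal sketch of how the knobs touched by this hunk could be set together through vLLM's Python `LLM` entry point. The model name and the concrete values are illustrative assumptions, not recommendations from the patched docs:

```python
from vllm import LLM

# Illustrative settings only: trade scheduler concurrency for KV-cache headroom
# and split the model across GPUs so each device keeps more memory for KV cache.
# tensor_parallel_size * pipeline_parallel_size GPUs are required (4 here).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    gpu_memory_utilization=0.95,    # pre-allocate a larger share of GPU memory for KV cache
    max_num_seqs=128,               # fewer concurrent sequences per batch
    max_num_batched_tokens=4096,    # cap on tokens scheduled per step
    tensor_parallel_size=2,         # shard model weights across 2 GPUs
    pipeline_parallel_size=2,       # distribute layers across 2 pipeline stages
)
```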