From e269be2ba2c52ced7581b8499ae19f21383f3c56 Mon Sep 17 00:00:00 2001
From: Cyrus Leung
Date: Mon, 25 Aug 2025 21:14:15 +0800
Subject: [PATCH] [Doc] Add caution for API server scale-out (#23550)

Signed-off-by: DarkLight1337
---
 docs/configuration/optimization.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 69d4de9d2f644..6c7c31f503c15 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 !!! note
     API server scale-out is only available for online inference.
 
+!!! warning
+    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
+    from request data.
+
+    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
+    to avoid CPU resource exhaustion.
+
 !!! note
     [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
     because it requires a one-to-one correspondance between API and engine core processes.
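
Reviewer note: with the default of 8 media-loading threads per API server, the `--api-server-count 4` example in the hunk header would run up to 32 such threads in total. The sketch below shows one possible way to choose a smaller value for `VLLM_MEDIA_LOADING_THREAD_COUNT`. Splitting a fixed CPU budget evenly across API servers is an assumption made here for illustration, not guidance from the vLLM docs, and `CPU_BUDGET` and `THREADS_PER_SERVER` are hypothetical names.

```shell
# Hypothetical sizing rule: divide a CPU budget evenly across API servers.
API_SERVER_COUNT=4   # matches --api-server-count in the docs example
CPU_BUDGET=16        # assumed number of cores set aside for media loading

# Integer division, with a floor of one thread per API server.
THREADS_PER_SERVER=$(( CPU_BUDGET / API_SERVER_COUNT ))
if [ "$THREADS_PER_SERVER" -lt 1 ]; then
  THREADS_PER_SERVER=1
fi

# Export so that a server launched from this shell inherits the setting.
export VLLM_MEDIA_LOADING_THREAD_COUNT="$THREADS_PER_SERVER"
echo "VLLM_MEDIA_LOADING_THREAD_COUNT=$VLLM_MEDIA_LOADING_THREAD_COUNT"

# Then launch as in the docs example (left commented out in this sketch):
# vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count "$API_SERVER_COUNT" -dp 2
```

The right budget depends on how many cores the host has and on what else competes for them, so treat the numbers above as placeholders.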