From e269be2ba2c52ced7581b8499ae19f21383f3c56 Mon Sep 17 00:00:00 2001
From: Cyrus Leung <tlleungac@connect.ust.hk>
Date: Mon, 25 Aug 2025 21:14:15 +0800
Subject: [PATCH] [Doc] Add caution for API server scale-out (#23550)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
---
 docs/configuration/optimization.md | 7 +++++++
 1 file changed, 7 insertions(+)
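
Reviewer note (not part of the change itself): the warning added in this patch implies that media-loading threads multiply with the number of API servers. The sketch below illustrates that arithmetic. `total_media_threads` and `media_threads_per_server` are hypothetical helpers, not vLLM APIs, and splitting a CPU budget evenly while capping at the default of 8 is only one possible policy, assumed here for illustration.

```python
import os
from typing import Optional

# Default number of media-loading threads per API server, as stated in the warning.
DEFAULT_MEDIA_THREADS = 8


def total_media_threads(api_server_count: int,
                        threads_per_server: int = DEFAULT_MEDIA_THREADS) -> int:
    """Total media-loading threads across all API servers."""
    return api_server_count * threads_per_server


def media_threads_per_server(api_server_count: int,
                             cpu_budget: Optional[int] = None) -> int:
    """Hypothetical policy: split a CPU budget evenly across API servers.

    The result is capped at the default (8) and never drops below 1.
    """
    if api_server_count < 1:
        raise ValueError("api_server_count must be at least 1")
    if cpu_budget is None:
        # Assumption: every visible CPU may be spent on media loading.
        cpu_budget = os.cpu_count() or 1
    return max(1, min(DEFAULT_MEDIA_THREADS, cpu_budget // api_server_count))


if __name__ == "__main__":
    servers = 4
    print("threads with defaults:", total_media_threads(servers))
    print("suggested per-server count:", media_threads_per_server(servers, cpu_budget=16))
```

Assuming `VLLM_MEDIA_LOADING_THREAD_COUNT` is read from the environment of the serving process, the suggested value could then be applied as `VLLM_MEDIA_LOADING_THREAD_COUNT=4 vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2`.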

diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 69d4de9d2f644..6c7c31f503c15 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 !!! note
     API server scale-out is only available for online inference.
 
+!!! warning
+    By default, each API server uses 8 CPU threads to load media items (e.g. images)
+    from request data.
+
+    With API server scale-out, the total thread count grows with `--api-server-count`, so
+    consider lowering `VLLM_MEDIA_LOADING_THREAD_COUNT` to avoid CPU resource exhaustion.
+
 !!! note
     [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
     because it requires a one-to-one correspondance between API and engine core processes.