diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 69d4de9d2f644..6c7c31f503c15 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 !!! note
     API server scale-out is only available for online inference.
 
+!!! warning
+    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
+    from request data.
+
+    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
+    to avoid CPU resource exhaustion.
+
 !!! note
     [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
     because it requires a one-to-one correspondance between API and engine core processes.
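
Review note: the new warning implies that media-loading threads multiply with the number of API servers. With the default of 8 threads per server, `--api-server-count 4` gives 4 × 8 = 32 loader threads in total. The sketch below is one possible way to pick a value for `VLLM_MEDIA_LOADING_THREAD_COUNT`. The helper `suggest_media_thread_count` and its even-split heuristic are assumptions made for illustration and are not part of vLLM; only the variable name and the default of 8 come from the diff.

```python
import os

# Per-server default stated in the warning added by this diff.
DEFAULT_MEDIA_THREADS = 8


def suggest_media_thread_count(api_server_count, total_cpus=None):
    """Hypothetical heuristic: split the available CPUs evenly across API servers.

    Returns a per-server thread count, capped at the documented default of 8
    and never lower than 1. This is an illustrative assumption, not vLLM logic.
    """
    if api_server_count < 1:
        raise ValueError("api_server_count must be >= 1")
    if total_cpus is None:
        # os.cpu_count() can return None, so fall back to 1.
        total_cpus = os.cpu_count() or 1
    per_server = max(1, total_cpus // api_server_count)
    return min(DEFAULT_MEDIA_THREADS, per_server)


if __name__ == "__main__":
    count = suggest_media_thread_count(api_server_count=4)
    # Print a shell line that could be run before `vllm serve`.
    print(f"export VLLM_MEDIA_LOADING_THREAD_COUNT={count}")
```

The even split is deliberately conservative: it ignores CPU time needed by the engine core and other processes, so a real deployment may want a smaller value still.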