[Core] V1: Use multiprocessing by default (#11074)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
This commit is contained in:
parent
0d8451c3a4
commit
4863e5fba5
195  docs/source/design/multiprocessing.md  Normal file
@@ -0,0 +1,195 @@
# Python Multiprocessing

## Debugging

Please see the [Debugging Tips](https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing) page for information on known issues and how to solve them.

## Introduction

*Note that source code references are to the state of the code at the time of writing in December 2024.*

The use of Python multiprocessing in vLLM is complicated by:

- The use of vLLM as a library and the inability to control the code using vLLM
- Varying levels of incompatibility between multiprocessing methods and vLLM dependencies

This document describes how vLLM deals with these challenges.

## Multiprocessing Methods

[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:

- `spawn` - spawn a new Python process. This will be the default as of Python 3.14.

- `fork` - Use `os.fork()` to fork the Python interpreter. This is the default in Python versions prior to 3.14.

- `forkserver` - Spawn a server process that will fork a new process on request.
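
For illustration (generic Python, not vLLM code), a start method can be chosen explicitly through `multiprocessing.get_context`, which is also the mechanism vLLM uses internally:

```python
# Generic Python illustration (not vLLM code): select a start method
# explicitly instead of relying on the platform default.
import multiprocessing


def worker(x: int) -> None:
    print(f"worker received {x}")


if __name__ == "__main__":
    # One of "spawn", "fork", or "forkserver".
    ctx = multiprocessing.get_context("spawn")
    proc = ctx.Process(target=worker, args=(42,))
    proc.start()
    proc.join()
```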

### Tradeoffs

`fork` is the fastest method, but is incompatible with dependencies that use threads.

`spawn` is more compatible with dependencies, but can be problematic when vLLM is used as a library. If the consuming code does not use a `__main__` guard (`if __name__ == "__main__":`), the code will be inadvertently re-executed when vLLM spawns a new process. This can lead to infinite recursion, among other problems.

`forkserver` will spawn a new server process that will fork new processes on demand. This unfortunately has the same problem as `spawn` when vLLM is used as a library. The server process is created as a spawned new process, which will re-execute code not protected by a `__main__` guard.

For both `spawn` and `forkserver`, the process must not depend on inheriting any global state as would be the case with `fork`.
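
To make the `spawn` failure mode concrete, here is a minimal hypothetical script (not from the vLLM codebase) without a `__main__` guard; the spawned child re-imports the module, re-runs the unguarded lines, and Python aborts with the bootstrapping `RuntimeError` shown later in this document:

```python
# bad_script.py -- hypothetical example of the failure mode described above.
# With "spawn", the child re-imports this module during bootstrapping, so the
# unguarded Process(...).start() below runs again in the child.
import multiprocessing

multiprocessing.set_start_method("spawn", force=True)


def task() -> None:
    print("hello from the child process")


# No `if __name__ == "__main__":` guard -- this is re-executed in every child
# and raises: "An attempt has been made to start a new process before the
# current process has finished its bootstrapping phase."
multiprocessing.Process(target=task).start()
```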

## Compatibility with Dependencies

Multiple vLLM dependencies indicate either a preference or requirement for using `spawn`:

- <https://pytorch.org/docs/stable/notes/multiprocessing.html#cuda-in-multiprocessing>
- <https://pytorch.org/docs/stable/multiprocessing.html#sharing-cuda-tensors>
- <https://docs.habana.ai/en/latest/PyTorch/Getting_Started_with_PyTorch_and_Gaudi/Getting_Started_with_PyTorch.html?highlight=multiprocessing#torch-multiprocessing-for-dataloaders>

It is perhaps more accurate to say that there are known problems with using `fork` after initializing these dependencies.
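
As a hedged illustration of the `fork` problem (assumes a CUDA-capable machine with PyTorch installed; this is not vLLM code), forking after CUDA has been initialized in the parent typically breaks CUDA in the child:

```python
# Illustration of the fork-after-CUDA-initialization problem (requires a GPU
# and PyTorch; not vLLM code). The forked child inherits the parent's CUDA
# state and typically fails with an error such as
# "Cannot re-initialize CUDA in forked subprocess".
import multiprocessing

import torch


def child() -> None:
    # CUDA work in the forked child is expected to fail.
    torch.ones(1, device="cuda")


if __name__ == "__main__":
    torch.ones(1, device="cuda")  # initialize CUDA in the parent
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=child)
    proc.start()
    proc.join()
```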

## Current State (v0)

The environment variable `VLLM_WORKER_MULTIPROC_METHOD` can be used to control which method is used by vLLM. The current default is `fork`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/envs.py#L339-L342>
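
For example (a minimal sketch), a user can override the method through the environment before vLLM creates its workers, either in the shell or at the top of a script:

```python
# Minimal sketch: override the worker start method before vLLM starts workers.
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"  # e.g. "spawn" or "fork"

# ... then import and use vLLM as usual.
```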

When we know we own the process because the `vllm` command was used, we use `spawn` because it's the most widely compatible.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/scripts.py#L123-L140>

The `multiproc_xpu_executor` forces the use of `spawn`.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/executor/multiproc_xpu_executor.py#L14-L18>

There are other miscellaneous places hard-coding the use of `spawn`:

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/distributed/device_communicators/custom_all_reduce_utils.py#L135>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/entrypoints/openai/api_server.py#L184>

Related PRs:

- <https://github.com/vllm-project/vllm/pull/8823>

## Prior State in v1

There was an environment variable to control whether multiprocessing is used in the v1 engine core, `VLLM_ENABLE_V1_MULTIPROCESSING`. This defaulted to off.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/envs.py#L452-L454>

When it was enabled, the v1 `LLMEngine` would create a new process to run the engine core.

- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L93-L95>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L70-L77>
- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45>

It was off by default for all the reasons mentioned above - compatibility with dependencies and code using vLLM as a library.

### Changes Made in v1

There is not an easy solution with Python's `multiprocessing` that will work everywhere. As a first step, we can get v1 into a state where it makes a "best effort" choice of multiprocessing method to maximize compatibility (a sketch of this logic follows the list):

- Default to `fork`.
- Use `spawn` when we know we control the main process (`vllm` was executed).
- If we detect `cuda` was previously initialized, force `spawn` and emit a warning. We know `fork` will break, so this is the best we can do.
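
A condensed, self-contained sketch of this selection logic, simplified from the `_check_multiproc_method`/`get_mp_context` helpers this change adds to `vllm/executor/multiproc_worker_utils.py` (here `torch` only stands in for vLLM's `cuda_is_initialized()` check):

```python
# Simplified sketch of the "best effort" start-method selection; not the
# exact vLLM implementation.
import multiprocessing
import os
import warnings

import torch


def _cuda_is_initialized() -> bool:
    # Stand-in for vLLM's cuda_is_initialized() helper.
    return torch.cuda.is_available() and torch.cuda.is_initialized()


def get_mp_context() -> multiprocessing.context.BaseContext:
    if (_cuda_is_initialized()
            and os.environ.get("VLLM_WORKER_MULTIPROC_METHOD") != "spawn"):
        # fork is known to break once CUDA is initialized, so force spawn.
        warnings.warn("CUDA was previously initialized; forcing the `spawn` "
                      "multiprocessing start method.")
        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    # Otherwise the default remains `fork` unless the user overrode it.
    method = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", "fork")
    return multiprocessing.get_context(method)
```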

The case that is known to still break in this scenario is code using vLLM as a library that initializes `cuda` before calling vLLM. The warning we emit should instruct users to either add a `__main__` guard or to disable multiprocessing.
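
For example (a hedged sketch of those two remediations, not vLLM documentation verbatim):

```python
# Illustrative sketch of the two remediations mentioned above.

# Option 1: protect the code that creates the LLM with a __main__ guard.
if __name__ == "__main__":
    import vllm

    llm = vllm.LLM(...)

# Option 2: disable v1 engine-core multiprocessing by setting
# VLLM_ENABLE_V1_MULTIPROCESSING=0 in the environment (this is the variable
# whose default this commit flips to "1").
```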

If that known-failure case occurs, the user will see two messages that explain what is happening. First, a log message from vLLM:

```
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
for more information.
```

Second, Python itself will raise an exception with a nice explanation:

```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
```

## Alternatives Considered

### Detect if a `__main__` guard is present

It has been suggested that we could behave better if we could detect whether code using vLLM as a library has a `__main__` guard in place. This [post on stackoverflow](https://stackoverflow.com/questions/77220442/multiprocessing-pool-in-a-python-class-without-name-main-guard) was from a library author facing the same question.

It is possible to detect whether we are in the original, `__main__` process, or a subsequent spawned process. However, it does not appear to be straightforward to detect whether a `__main__` guard is present in the code.

This option has been discarded as impractical.
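
For reference, the first half of that check is straightforward (generic Python, not vLLM code); it is the second half, inspecting the caller's source for a guard, that is not practical:

```python
# Generic Python, not vLLM code: distinguishing the original process from a
# multiprocessing child is easy...
import multiprocessing


def in_original_process() -> bool:
    # parent_process() returns None only in the process that was started
    # directly (the original __main__ process).
    return multiprocessing.parent_process() is None

# ...but there is no comparably simple way to know whether the module that
# called into vLLM wrapped its code in `if __name__ == "__main__":`.
```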

### Use `forkserver`

At first it appears that `forkserver` is a nice solution to the problem. However, the way it works presents the same challenges that `spawn` does when vLLM is used as a library.

### Force `spawn` all the time

One way to clean this up is to just force the use of `spawn` all the time and document that the use of a `__main__` guard is required when using vLLM as a library. This would unfortunately break existing code and make vLLM harder to use, violating the desire to make the `LLM` class as easy as possible to use.

Instead of pushing this on our users, we will retain the complexity to do our best to make things work.

## Future Work

We may want to consider a different worker management approach in the future that works around these challenges.

1. We could implement something `forkserver`-like, but have the process manager be something we initially launch by running our own subprocess and a custom entrypoint for worker management (launch a `vllm-manager` process).

2. We can explore other libraries that may better suit our needs. Examples to consider:

- <https://github.com/joblib/loky>
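
To make item 1 slightly more concrete, a purely hypothetical sketch (nothing below exists in vLLM today; the `vllm_manager` entrypoint and its protocol are placeholders):

```python
# Purely hypothetical sketch of the "forkserver-like" idea: start a dedicated
# manager with a plain subprocess so nothing is inherited from the caller,
# then ask it to create workers over a pipe. The `vllm_manager` module and
# the message format are placeholders, not real vLLM components.
import subprocess
import sys

manager = subprocess.Popen(
    [sys.executable, "-m", "vllm_manager"],  # placeholder entrypoint
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
manager.stdin.write(b"spawn-worker\n")  # placeholder protocol message
manager.stdin.flush()
```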

@@ -136,6 +136,62 @@ If the test script hangs or crashes, usually it means the hardware/drivers are b

Adjust ``--nproc-per-node``, ``--nnodes``, and ``--node-rank`` according to your setup, being sure to execute different commands (with different ``--node-rank``) on different nodes.

Python multiprocessing
----------------------

`RuntimeError` Exception
^^^^^^^^^^^^^^^^^^^^^^^^

If you have seen a warning in your logs like this:

.. code-block:: console

    WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
    https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
    for more information.

or an error from Python that looks like this:

.. code-block:: console

    RuntimeError:
            An attempt has been made to start a new process before the
            current process has finished its bootstrapping phase.

            This probably means that you are not using fork to start your
            child processes and you have forgotten to use the proper idiom
            in the main module:

                if __name__ == '__main__':
                    freeze_support()
                    ...

            The "freeze_support()" line can be omitted if the program
            is not going to be frozen to produce an executable.

            To fix this issue, refer to the "Safe importing of main module"
            section in https://docs.python.org/3/library/multiprocessing.html

then you must update your Python code to guard usage of ``vllm`` behind an ``if __name__ == '__main__':`` block. For example, instead of this:

.. code-block:: python

    import vllm

    llm = vllm.LLM(...)

try this instead:

.. code-block:: python

    if __name__ == '__main__':
        import vllm

        llm = vllm.LLM(...)

Known Issues
----------------------------------------

- In ``v0.5.2``, ``v0.5.3``, and ``v0.5.3.post1``, there is a bug caused by `zmq <https://github.com/zeromq/pyzmq/issues/2000>`_ , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of ``vllm`` to include the `fix <https://github.com/vllm-project/vllm/pull/6759>`_.

@@ -173,6 +173,7 @@ Documentation

   design/input_processing/model_inputs_index
   design/kernel/paged_attention
   design/multimodal/multimodal_index
   design/multiprocessing

.. For Developers: contributing to the vLLM project

@@ -232,6 +232,10 @@ class LLM:

        self.request_counter = Counter()

    def __del__(self):
        if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
            self.llm_engine.shutdown()

    @staticmethod
    def get_engine_class() -> Type[LLMEngine]:
        if envs.VLLM_USE_V1:

@@ -69,7 +69,7 @@ if TYPE_CHECKING:

    VLLM_SKIP_P2P_CHECK: bool = False
    VLLM_DISABLED_KERNELS: List[str] = []
    VLLM_USE_V1: bool = False
    VLLM_ENABLE_V1_MULTIPROCESSING: bool = False
    VLLM_ENABLE_V1_MULTIPROCESSING: bool = True
    VLLM_LOG_BATCHSIZE_INTERVAL: float = -1

@@ -460,7 +460,7 @@ environment_variables: Dict[str, Callable[[], Any]] = {

    # If set, enable multiprocessing in LLM for the V1 code path.
    "VLLM_ENABLE_V1_MULTIPROCESSING":
    lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))),
    lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "1"))),
    "VLLM_LOG_BATCHSIZE_INTERVAL":
    lambda: float(os.getenv("VLLM_LOG_BATCHSIZE_INTERVAL", "-1")),
}

@@ -274,7 +274,20 @@ def _add_prefix(file: TextIO, worker_name: str, pid: int) -> None:

    file.write = write_with_prefix  # type: ignore[method-assign]


def _check_multiproc_method():
    if (cuda_is_initialized()
            and os.environ.get("VLLM_WORKER_MULTIPROC_METHOD") != "spawn"):
        logger.warning("CUDA was previously initialized. We must use "
                       "the `spawn` multiprocessing start method. Setting "
                       "VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. "
                       "See https://docs.vllm.ai/en/latest/getting_started/"
                       "debugging.html#python-multiprocessing "
                       "for more information.")
        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def get_mp_context():
    _check_multiproc_method()
    mp_method = envs.VLLM_WORKER_MULTIPROC_METHOD
    return multiprocessing.get_context(mp_method)

@@ -284,12 +297,7 @@ def set_multiprocessing_worker_envs(parallel_config):

    in a multiprocessing environment. This should be called by the parent
    process before worker processes are created"""

    if (cuda_is_initialized()
            and os.environ.get("VLLM_WORKER_MULTIPROC_METHOD") != "spawn"):
        logger.warning("CUDA was previously initialized. We must use "
                       "the `spawn` multiprocessing start method. Setting "
                       "VLLM_WORKER_MULTIPROC_METHOD to 'spawn'.")
        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    _check_multiproc_method()

    # Configure thread parallelism if OMP_NUM_THREADS isn't set
    #

@@ -1,4 +1,3 @@

import multiprocessing
import pickle
import queue
import signal

@@ -13,6 +12,7 @@ import zmq.asyncio

from msgspec import msgpack

from vllm.config import CacheConfig, VllmConfig
from vllm.executor.multiproc_worker_utils import get_mp_context
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from vllm.v1.core.scheduler import Scheduler

@@ -210,11 +210,7 @@ class EngineCoreProc(EngineCore):

        output_path: str,
        ready_path: str,
    ) -> EngineCoreProcHandle:
        # The current process might have CUDA context,
        # so we need to spawn a new process.
        # NOTE(rob): this is a problem for using EngineCoreProc w/
        # LLM, since we need a if __name__ == "__main__" guard.
        context = multiprocessing.get_context("spawn")
        context = get_mp_context()

        process_kwargs = {
            "input_path": input_path,

@@ -159,10 +159,16 @@ class MPClient(EngineCoreClient):

        atexit.register(self.shutdown)

    def shutdown(self):
        # During final garbage collection in process shutdown, atexit may be
        # None.
        if atexit:
            # in case shutdown gets called via __del__ first
            atexit.unregister(self.shutdown)

        # Shut down the zmq context.
        self.ctx.destroy(linger=0)

        if hasattr(self, "proc_handle"):
        if hasattr(self, "proc_handle") and self.proc_handle:
            # Shutdown the process if needed.
            if self.proc_handle.proc.is_alive():
                self.proc_handle.proc.terminate()

@@ -178,8 +184,9 @@ class MPClient(EngineCoreClient):

            ]
            for ipc_socket in ipc_sockets:
                socket_file = ipc_socket.replace("ipc://", "")
                if os.path.exists(socket_file):
                if os and os.path.exists(socket_file):
                    os.remove(socket_file)
            self.proc_handle = None

    def __del__(self):
        self.shutdown()

@@ -196,3 +196,10 @@ class LLMEngine:

                             f"found type: {type(tokenizer_group)}")

        return tokenizer_group

    def __del__(self):
        self.shutdown()

    def shutdown(self):
        if engine_core := getattr(self, "engine_core", None):
            engine_core.shutdown()

@@ -163,6 +163,10 @@ class MultiprocExecutor:

        termination and kill signals if needed."""

        def wait_for_termination(procs, timeout):
            if not time:
                # If we are in late stage shutdown, the interpreter may replace
                # `time` with `None`.
                return all(not proc.is_alive() for proc in procs)
            start_time = time.time()
            while time.time() - start_time < timeout:
                if all(not proc.is_alive() for proc in procs):

@@ -187,10 +191,14 @@ class MultiprocExecutor:

        for w in self.workers:
            # Remove the zmq ipc socket file
            socket_path = w.ready_path.replace("ipc://", "")
            if os.path.exists(socket_path):
            if os and os.path.exists(socket_path):
                os.remove(socket_path)

    def shutdown(self):
        if atexit:
            # in case shutdown was called explicitly, we don't need to call it
            # again
            atexit.unregister(self.shutdown)
        """Properly shut down the executor and its workers"""
        if (hasattr(self, 'workers') and self.workers is not None):
            for w in self.workers:  #TODO: not sure if needed