diff --git a/docs/source/design/arch_overview.md b/docs/source/design/arch_overview.md
index 94bda8b5c58d5..9254afc9b1519 100644
--- a/docs/source/design/arch_overview.md
+++ b/docs/source/design/arch_overview.md
@@ -14,8 +14,14 @@ This document provides an overview of the vLLM architecture.
 vLLM provides a number of entrypoints for interacting with the system. The following diagram shows the relationship between them.

-:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
-:alt: Entrypoints Diagram
+:::{mermaid}
+flowchart TD
+    CLI["vllm CLI"] --> APIServer["OpenAI API Server"]
+    LLM["LLM Class"] --> LLMEngine
+    APIServer --> AsyncLLMEngine
+    LLMEngine --> EngineCoreClient
+    AsyncLLMEngine --> EngineCoreClient
+    EngineCoreClient --> EngineCore
 :::

 ### LLM Class
@@ -84,8 +90,14 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
 The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of the vLLM system, handling model inference and asynchronous request processing.

-:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
-:alt: LLMEngine Diagram
+:::{mermaid}
+flowchart LR
+    Processor --> EngineCoreClient
+    EngineCoreClient --> EngineCore
+    EngineCore --> Executor
+    Executor --> Worker
+    Worker --> ModelRunner
+    ModelRunner --> Model
 :::

 ### LLMEngine
@@ -104,7 +116,7 @@ processing.
 - **Output Processing**: Processes the outputs generated by the model, decoding the token IDs from a language model into human-readable text.

-The code for `LLMEngine` can be found in .
+The code for `LLMEngine` can be found in .

 ### AsyncLLMEngine
@@ -116,7 +128,7 @@ can handle multiple concurrent requests and stream outputs to clients.
 The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo API server that serves as a simpler example in .

-The code for `AsyncLLMEngine` can be found in .
+The code for `AsyncLLMEngine` can be found in .

 ## Worker
@@ -140,15 +152,29 @@ Every model runner object has one model object, which is the actual
 `torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various configurations affect the class we ultimately get.

-## Class Hierarchy
+## Class Hierarchy and vLLM V1 Architecture

-The following figure shows the class hierarchy of vLLM:
+The following diagram shows how the main classes interact:

-> :::{figure} /assets/design/hierarchy.png
-> :align: center
-> :alt: query
-> :width: 100%
-> :::
+:::{mermaid}
+classDiagram
+    class LLMEngine
+    class AsyncLLMEngine
+    class EngineCoreClient
+    class EngineCore
+    class Executor
+    class Worker
+    class ModelRunner
+    class Model
+
+    AsyncLLMEngine --> LLMEngine
+    LLMEngine --> EngineCoreClient
+    EngineCoreClient --> EngineCore
+    EngineCore --> Executor
+    Executor --> Worker
+    Worker --> ModelRunner
+    ModelRunner --> Model
+:::

 There are several important design choices behind this class hierarchy:
@@ -250,3 +276,32 @@ big problem.

 In summary, the complete config object `VllmConfig` can be treated as an
 engine-level global state that is shared among all vLLM classes.
+
+vLLM V1 introduces a streamlined engine that splits responsibilities between a thin frontend and a highly optimized backend. The design is centered on three core layers:
+
+1. **Frontend (`LLMEngine` and `AsyncLLM`)** – user-facing classes that handle tokenization, batching of incoming requests, and postprocessing of generated outputs. These classes interact with the engine core through an `EngineCoreClient`.
+2. **Engine Core** – the inner loop that schedules requests and runs the model. The core lives in `vllm/v1/engine/core.py` and exposes a lightweight API for adding requests, aborting them, or stepping the model.
+3. **Executor and Workers** – the executor (for example `MultiprocExecutor` in ) manages worker processes. Each worker controls a single accelerator device and hosts a `ModelRunner` (such as `GPUModelRunner` in ) which executes the forward pass.
+
+### EngineCore and Scheduler
+
+The `EngineCore` maintains a [`Scheduler`]() and a `KVCacheManager` (). At each iteration the scheduler chooses how many tokens to process for every active `Request`, supporting features like prefix caching, chunked prefill, and speculative decoding. Scheduled tokens are passed to the model runner, and the resulting `EngineCoreOutputs` include generated tokens and per-request events.
+The scheduler keeps separate waiting and running queues and enforces limits from
+`VllmConfig` such as `max_num_seqs` and `max_num_batched_tokens`. When GPU
+memory becomes scarce, it can preempt lower-priority requests, freeing their KV
+cache blocks before resuming them later. After a step finishes, it records
+statistics and updates each request's progress based on the returned events.
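+
+The sketch below illustrates the token-budget idea in isolation. It is a deliberately simplified, hypothetical toy (names such as `ToyScheduler` are invented for this document), not the actual `Scheduler`: the real class also tracks KV cache blocks, prefix-cache hits, preemption, and speculative tokens.
+
+```python
+from collections import deque
+from dataclasses import dataclass
+
+
+@dataclass
+class Request:
+    request_id: str
+    num_prompt_tokens: int
+    num_computed_tokens: int = 0
+
+
+class ToyScheduler:
+    """Token-budget scheduling sketch; NOT the real vllm.v1 Scheduler."""
+
+    def __init__(self, max_num_seqs: int, max_num_batched_tokens: int) -> None:
+        self.max_num_seqs = max_num_seqs
+        self.max_num_batched_tokens = max_num_batched_tokens
+        self.waiting: deque[Request] = deque()
+        self.running: list[Request] = []
+
+    def add_request(self, request: Request) -> None:
+        self.waiting.append(request)
+
+    def schedule(self) -> dict[str, int]:
+        """Decide how many tokens each request may process in this step."""
+        budget = self.max_num_batched_tokens
+        num_scheduled_tokens: dict[str, int] = {}
+
+        # Requests that are already running: continue an unfinished (chunked)
+        # prefill, or decode a single new token.
+        for req in self.running:
+            remaining_prefill = req.num_prompt_tokens - req.num_computed_tokens
+            num_tokens = min(max(remaining_prefill, 1), budget)
+            if num_tokens == 0:
+                break
+            num_scheduled_tokens[req.request_id] = num_tokens
+            budget -= num_tokens
+
+        # Admit waiting requests while token budget and sequence slots remain.
+        while self.waiting and budget > 0 and len(self.running) < self.max_num_seqs:
+            req = self.waiting[0]
+            num_tokens = min(req.num_prompt_tokens, budget)
+            self.waiting.popleft()
+            self.running.append(req)
+            num_scheduled_tokens[req.request_id] = num_tokens
+            budget -= num_tokens
+
+        return num_scheduled_tokens
+```
+
+In the real scheduler the same budgeting decision also has to respect the KV cache blocks that are actually free, which is why it is paired with the `KVCacheManager`.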
+
+### Communication via EngineCoreClient
+
+To overlap computation with I/O, the engine core often runs in a separate process. `EngineCoreClient` () forwards requests and pulls results over ZeroMQ sockets. When using multiple data-parallel ranks, `DPAsyncMPClient` manages a set of engine-core processes and aggregates their outputs.
+
+### Workers and Model Runners
+
+Workers are defined in . The default GPU worker initializes CUDA, sets up distributed communication, and hosts a `GPUModelRunner`, which loads the model, prepares KV cache memory, and executes inference kernels. The runner also handles LoRA adapters, attention backends, and CUDA graph capture.
+
+### Output Processing
+
+`OutputProcessor` () converts raw `EngineCoreOutputs` into `RequestOutput` objects, assembling logprobs, speculative tokens, and final texts. When using `AsyncLLM`, an asynchronous loop continuously fetches these outputs and streams them back to callers.
+
+This new layering keeps the hot path (`EngineCore`) minimal while letting the frontend focus on user interactions and request bookkeeping. It reduces CPU overhead and simplifies the addition of new optimizations.
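+
+The frontend/core split described in "Communication via EngineCoreClient" can be illustrated with a toy that swaps ZeroMQ for standard-library `multiprocessing` queues. Everything here (class and function names included) is hypothetical and only meant to show the control flow: the frontend enqueues requests and drains outputs while the core process loops independently.
+
+```python
+import multiprocessing as mp
+
+
+def engine_core_loop(request_q, output_q):
+    """Core-process busy loop: pull requests, pretend to run the model, push outputs."""
+    while True:
+        item = request_q.get()
+        if item is None:  # shutdown sentinel
+            break
+        request_id, prompt = item
+        output_q.put((request_id, prompt.upper()))  # stand-in for a forward pass
+
+
+class ToyEngineCoreClient:
+    """Frontend-side handle. The real EngineCoreClient speaks ZeroMQ, not mp.Queue."""
+
+    def __init__(self):
+        ctx = mp.get_context("spawn")
+        self.request_q = ctx.Queue()
+        self.output_q = ctx.Queue()
+        self.core = ctx.Process(
+            target=engine_core_loop, args=(self.request_q, self.output_q)
+        )
+        self.core.start()
+
+    def add_request(self, request_id, prompt):
+        self.request_q.put((request_id, prompt))
+
+    def get_output(self):
+        return self.output_q.get()
+
+    def shutdown(self):
+        self.request_q.put(None)
+        self.core.join()
+
+
+if __name__ == "__main__":
+    client = ToyEngineCoreClient()
+    client.add_request("req-0", "hello from the frontend")
+    print(client.get_output())  # ('req-0', 'HELLO FROM THE FRONTEND')
+    client.shutdown()
+```
+
+The real implementation adds batching, asynchronous output streaming, and multi-rank aggregation (`DPAsyncMPClient`) on top of this basic request/output pattern.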
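+
+To tie the layers together, here is a minimal offline example that drives the engine from the synchronous frontend. It uses the public `LLMEngine` interface; `facebook/opt-125m` is only a small placeholder model, and exact argument names can vary slightly between vLLM releases.
+
+```python
+from vllm import EngineArgs, LLMEngine, SamplingParams
+
+# Build the engine from a config object; the resulting VllmConfig is shared by
+# every component (frontend, engine core, workers).
+engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
+
+engine.add_request(
+    request_id="req-0",
+    prompt="The architecture of vLLM is",
+    params=SamplingParams(max_tokens=32),
+)
+
+# The frontend owns this loop: each step() hands the batch to the engine core,
+# which schedules requests, runs the model, and returns request outputs.
+while engine.has_unfinished_requests():
+    for output in engine.step():
+        if output.finished:
+            print(output.request_id, output.outputs[0].text)
+```
+
+Swapping `LLMEngine` for `AsyncLLM`/`AsyncLLMEngine` keeps the same flow but streams outputs asynchronously, which is what the OpenAI-compatible server builds on.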