mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-08-02 13:15:44 +08:00

History

[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU (#19410 )

Signed-off-by: dbyoung18 <yang5.yang@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

2025-07-07 04:32:32 +00:00

api

Migrate docs from Sphinx to MkDocs (#18145 )

2025-05-23 02:09:53 -07:00

assets

Migrate docs from Sphinx to MkDocs (#18145 )

2025-05-23 02:09:53 -07:00

[doc] Fold long code blocks to improve readability (#19926 )

2025-06-23 05:24:23 +00:00

cli

[doc] Fold long code blocks to improve readability (#19926 )

2025-06-23 05:24:23 +00:00

community

[doc] use snippets for contact us (#19944 )

2025-06-22 10:26:13 +00:00

configuration

fix[Docs]: link anchor is incorrect #20309 (#20315 )

2025-07-02 06:32:34 +00:00

contributing

[doc] small fix (#20506 )

2025-07-04 20:56:39 -07:00

deployment

[Docs] Improve frameworks/helm.md (#20113 )

2025-06-26 10:41:51 +00:00

design

fix[Docs]: link anchor is incorrect #20309 (#20315 )

2025-07-02 06:32:34 +00:00

features

[Frontend] Support image object in llm.chat (#19635 )

2025-07-06 06:47:13 +00:00

getting_started

[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU (#19410 )

2025-07-07 04:32:32 +00:00

mkdocs

[doc] fix the incorrect logo in dark mode (#20289 )

2025-07-01 08:18:09 +00:00

models

[V1] Support any head size for FlexAttention backend (#20467 )

2025-07-06 09:54:36 -07:00

serving

[Model][2/N] Automatic conversion of CrossEncoding model (#19978 )

2025-07-03 13:59:23 +00:00

training

[Doc] Move examples and further reorganize user guide (#18666 )

2025-05-26 07:38:04 -07:00

usage

[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA (#15897 )

2025-06-30 21:44:39 -07:00

.nav.yml

[Doc][TPU] Add models and features supporting matrix. (#20230 )

2025-07-02 06:33:20 +00:00

README.md

[doc] fix the incorrect logo in dark mode (#20289 )

2025-07-01 08:18:09 +00:00

README.md

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone

Star Watch Fork

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism and pipeline parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
Prefix caching support
Multi-LoRA support

For more information, check out the following:

vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
[vLLM Meetups][meetups]