Stop using title frontmatter and fix doc that can only be reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor 2025-07-08 11:27:40 +01:00 committed by GitHub
parent b4bab81660
commit b942c094e3
81 changed files with 82 additions and 238 deletions
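
Most of the hunks below repeat one mechanical change: a frontmatter block containing only `title` is dropped and the same text becomes a level-1 Markdown heading. As a rough illustration of that pattern, here is a minimal sketch. The function name `title_frontmatter_to_heading` and the regular expression are hypothetical, not taken from the commit, and the sketch assumes the frontmatter holds a single unquoted `title:` key and nothing else.

```python
import re

# Hypothetical pattern: a leading frontmatter block whose only key is `title`.
FRONTMATTER_TITLE = re.compile(r"\A---\ntitle: (?P<title>.+)\n---\n")


def title_frontmatter_to_heading(text: str) -> str:
    """Replace a title-only frontmatter block with a Markdown H1 heading."""
    match = FRONTMATTER_TITLE.match(text)
    if match is None:
        # No title-only frontmatter at the top of the page; leave it untouched.
        return text
    heading = f"# {match.group('title').strip()}\n"
    # Keep everything after the closing `---` line exactly as it was.
    return heading + text[match.end():]


if __name__ == "__main__":
    page = '---\ntitle: Contact Us\n---\n\n--8<-- "README.md:contact-us"\n'
    print(title_frontmatter_to_heading(page))
```

Pages with extra frontmatter keys, quoted titles, or an existing top-level heading would need separate handling and are left unchanged or only partly covered by this sketch.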

View File

@ -55,6 +55,7 @@ nav:
- contributing/model/registration.md - contributing/model/registration.md
- contributing/model/tests.md - contributing/model/tests.md
- contributing/model/multimodal.md - contributing/model/multimodal.md
- CI: contributing/ci
- Design Documents: - Design Documents:
- V0: design - V0: design
- V1: design/v1 - V1: design/v1

View File

@ -1,5 +1,3 @@
--- # Contact Us
title: Contact Us
---
--8<-- "README.md:contact-us" --8<-- "README.md:contact-us"

View File

@ -1,6 +1,4 @@
--- # Meetups
title: Meetups
---
We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below: We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:

View File

@ -1,6 +1,4 @@
--- # Engine Arguments
title: Engine Arguments
---
Engine arguments control the behavior of the vLLM engine. Engine arguments control the behavior of the vLLM engine.

View File

@ -1,6 +1,4 @@
--- # Server Arguments
title: Server Arguments
---
The `vllm serve` command is used to launch the OpenAI-compatible server. The `vllm serve` command is used to launch the OpenAI-compatible server.

View File

@ -1,6 +1,4 @@
--- # Benchmark Suites
title: Benchmark Suites
---
vLLM contains two sets of benchmarks: vLLM contains two sets of benchmarks:

View File

@ -1,6 +1,4 @@
--- # Update PyTorch version on vLLM OSS CI/CD
title: Update PyTorch version on vLLM OSS CI/CD
---
vLLM's current policy is to always use the latest PyTorch stable vLLM's current policy is to always use the latest PyTorch stable
release in CI/CD. It is standard practice to submit a PR to update the release in CI/CD. It is standard practice to submit a PR to update the

View File

@ -1,6 +1,4 @@
--- # Summary
title: Summary
---
!!! important !!! important
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first! Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!

View File

@ -1,6 +1,4 @@
--- # Basic Model
title: Basic Model
---
This guide walks you through the steps to implement a basic vLLM model. This guide walks you through the steps to implement a basic vLLM model.

View File

@ -1,6 +1,4 @@
--- # Multi-Modal Support
title: Multi-Modal Support
---
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md). This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).

View File

@ -1,6 +1,4 @@
--- # Registering a Model
title: Registering a Model
---
vLLM relies on a model registry to determine how to run each model. vLLM relies on a model registry to determine how to run each model.
A list of pre-registered architectures can be found [here](../../models/supported_models.md). A list of pre-registered architectures can be found [here](../../models/supported_models.md).

View File

@ -1,6 +1,4 @@
--- # Unit Testing
title: Unit Testing
---
This page explains how to write unit tests to verify the implementation of your model. This page explains how to write unit tests to verify the implementation of your model.

View File

@ -1,6 +1,4 @@
--- # Using Docker
title: Using Docker
---
[](){ #deployment-docker-pre-built-image } [](){ #deployment-docker-pre-built-image }

View File

@ -1,6 +1,5 @@
--- # Anyscale
title: Anyscale
---
[](){ #deployment-anyscale } [](){ #deployment-anyscale }
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray. [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.

View File

@ -1,6 +1,4 @@
--- # Anything LLM
title: Anything LLM
---
[Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. [Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.

View File

@ -1,6 +1,4 @@
--- # AutoGen
title: AutoGen
---
[AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans. [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.

View File

@ -1,6 +1,4 @@
--- # BentoML
title: BentoML
---
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes. [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

View File

@ -1,6 +1,4 @@
--- # Cerebrium
title: Cerebrium
---
<p align="center"> <p align="center">
<img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/> <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>

View File

@ -1,6 +1,4 @@
--- # Chatbox
title: Chatbox
---
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux. [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.

View File

@ -1,6 +1,4 @@
--- # Dify
title: Dify
---
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production. [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.

View File

@ -1,6 +1,4 @@
--- # dstack
title: dstack
---
<p align="center"> <p align="center">
<img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/> <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>

View File

@ -1,6 +1,4 @@
--- # Haystack
title: Haystack
---
# Haystack # Haystack

View File

@ -1,6 +1,4 @@
--- # Helm
title: Helm
---
A Helm chart to deploy vLLM for Kubernetes A Helm chart to deploy vLLM for Kubernetes

View File

@ -1,6 +1,4 @@
--- # LiteLLM
title: LiteLLM
---
[LiteLLM](https://github.com/BerriAI/litellm) call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.] [LiteLLM](https://github.com/BerriAI/litellm) call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]

View File

@ -1,6 +1,4 @@
--- # Lobe Chat
title: Lobe Chat
---
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework. [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.

View File

@ -1,6 +1,4 @@
--- # LWS
title: LWS
---
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference. A major use case is for multi-host/multi-node distributed inference.

View File

@ -1,6 +1,4 @@
--- # Modal
title: Modal
---
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling. vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.

View File

@ -1,6 +1,4 @@
--- # Open WebUI
title: Open WebUI
---
1. Install the [Docker](https://docs.docker.com/engine/install/) 1. Install the [Docker](https://docs.docker.com/engine/install/)

View File

@ -1,6 +1,4 @@
--- # Retrieval-Augmented Generation
title: Retrieval-Augmented Generation
---
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.

View File

@ -1,6 +1,4 @@
--- # SkyPilot
title: SkyPilot
---
<p align="center"> <p align="center">
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/> <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>

View File

@ -1,6 +1,4 @@
--- # Streamlit
title: Streamlit
---
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps. [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.

View File

@ -1,5 +1,3 @@
--- # NVIDIA Triton
title: NVIDIA Triton
---
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details. The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.

View File

@ -1,6 +1,4 @@
--- # KServe
title: KServe
---
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving. vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

View File

@ -1,6 +1,4 @@
--- # KubeAI
title: KubeAI
---
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies. [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

View File

@ -1,6 +1,4 @@
--- # Llama Stack
title: Llama Stack
---
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) . vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .

View File

@ -1,6 +1,4 @@
--- # llmaz
title: llmaz
---
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend. [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.

View File

@ -1,6 +1,4 @@
--- # Production stack
title: Production stack
---
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with: Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

View File

@ -1,6 +1,4 @@
--- # Using Kubernetes
title: Using Kubernetes
---
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes. Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.

View File

@ -1,6 +1,4 @@
--- # Using Nginx
title: Using Nginx
---
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers. This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

View File

@ -1,6 +1,4 @@
--- # Architecture Overview
title: Architecture Overview
---
This document provides an overview of the vLLM architecture. This document provides an overview of the vLLM architecture.

View File

@ -1,6 +1,4 @@
--- # Automatic Prefix Caching
title: Automatic Prefix Caching
---
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand. The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.

View File

@ -1,6 +1,4 @@
--- # Integration with HuggingFace
title: Integration with HuggingFace
---
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`. This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.

View File

@ -1,6 +1,4 @@
--- # vLLM Paged Attention
title: vLLM Paged Attention
---
Currently, vLLM utilizes its own implementation of a multi-head query Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`). attention kernel (`csrc/attention/attention_kernels.cu`).

View File

@ -1,6 +1,4 @@
--- # Multi-Modal Data Processing
title: Multi-Modal Data Processing
---
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor. To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.

View File

@ -1,6 +1,4 @@
--- # vLLM's Plugin System
title: vLLM's Plugin System
---
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM. The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.

View File

@ -1,6 +1,4 @@
--- # Automatic Prefix Caching
title: Automatic Prefix Caching
---
## Introduction ## Introduction

View File

@ -1,6 +1,4 @@
--- # Compatibility Matrix
title: Compatibility Matrix
---
The tables below show mutually exclusive features and the support on some hardware. The tables below show mutually exclusive features and the support on some hardware.

View File

@ -1,6 +1,4 @@
--- # Disaggregated Prefilling (experimental)
title: Disaggregated Prefilling (experimental)
---
This page introduces you the disaggregated prefilling feature in vLLM. This page introduces you the disaggregated prefilling feature in vLLM.

View File

@ -1,6 +1,4 @@
--- # LoRA Adapters
title: LoRA Adapters
---
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model. This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.

View File

@ -1,6 +1,4 @@
--- # Multimodal Inputs
title: Multimodal Inputs
---
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM. This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.

View File

@ -1,6 +1,4 @@
--- # Quantization
title: Quantization
---
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

View File

@ -1,6 +1,4 @@
--- # AutoAWQ
title: AutoAWQ
---
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint. Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.

View File

@ -1,6 +1,4 @@
--- # BitBLAS
title: BitBLAS
---
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations. vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.

View File

@ -1,6 +1,4 @@
--- # BitsAndBytes
title: BitsAndBytes
---
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference. vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy. BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.

View File

@ -1,6 +1,4 @@
--- # FP8 W8A8
title: FP8 W8A8
---
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8. Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.

View File

@ -1,6 +1,4 @@
--- # GGUF
title: GGUF
---
!!! warning !!! warning
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team. Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.

View File

@ -1,6 +1,4 @@
--- # GPTQModel
title: GPTQModel
---
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI. To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.

View File

@ -1,6 +1,4 @@
--- # INT4 W4A16
title: INT4 W4A16
---
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS). vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).

View File

@ -1,6 +1,4 @@
--- # INT8 W8A8
title: INT8 W8A8
---
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance. This quantization method is particularly useful for reducing model size while maintaining good performance.

View File

@ -1,6 +1,4 @@
--- # Quantized KV Cache
title: Quantized KV Cache
---
## FP8 KV Cache ## FP8 KV Cache

View File

@ -1,6 +1,4 @@
--- # AMD Quark
title: AMD Quark
---
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/), throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),

View File

@ -1,6 +1,4 @@
--- # Supported Hardware
title: Supported Hardware
---
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM: The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

View File

@ -1,6 +1,4 @@
--- # Reasoning Outputs
title: Reasoning Outputs
---
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions. vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.

View File

@ -1,6 +1,4 @@
--- # Speculative Decoding
title: Speculative Decoding
---
!!! warning !!! warning
Please note that speculative decoding in vLLM is not yet optimized and does Please note that speculative decoding in vLLM is not yet optimized and does

View File

@ -1,6 +1,4 @@
--- # Structured Outputs
title: Structured Outputs
---
vLLM supports the generation of structured outputs using vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or [xgrammar](https://github.com/mlc-ai/xgrammar) or

View File

@ -1,6 +1,4 @@
--- # Installation
title: Installation
---
vLLM supports the following hardware platforms: vLLM supports the following hardware platforms:

View File

@ -1,6 +1,4 @@
--- # Quickstart
title: Quickstart
---
This guide will help you quickly get started with vLLM to perform: This guide will help you quickly get started with vLLM to perform:

View File

@ -1,6 +1,4 @@
--- # Loading models with Run:ai Model Streamer
title: Loading models with Run:ai Model Streamer
---
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory. Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md). Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).

View File

@ -1,6 +1,4 @@
--- # Loading models with CoreWeave's Tensorizer
title: Loading models with CoreWeave's Tensorizer
---
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer). vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized

View File

@ -1,6 +1,4 @@
--- # Generative Models
title: Generative Models
---
vLLM provides first-class support for generative models, which covers most of LLMs. vLLM provides first-class support for generative models, which covers most of LLMs.

View File

@ -1,6 +1,4 @@
--- # TPU
title: TPU
---
# TPU Supported Models # TPU Supported Models
## Text-only Language Models ## Text-only Language Models

View File

@ -1,6 +1,4 @@
--- # Pooling Models
title: Pooling Models
---
vLLM also supports pooling models, including embedding, reranking and reward models. vLLM also supports pooling models, including embedding, reranking and reward models.

View File

@ -1,6 +1,4 @@
--- # Supported Models
title: Supported Models
---
vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks. vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
If a model supports more than one task, you can set the task via the `--task` argument. If a model supports more than one task, you can set the task via the `--task` argument.

View File

@ -1,6 +1,4 @@
--- # Distributed Inference and Serving
title: Distributed Inference and Serving
---
## How to decide the distributed inference strategy? ## How to decide the distributed inference strategy?

View File

@ -1,6 +1,4 @@
--- # LangChain
title: LangChain
---
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) . vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .

View File

@ -1,6 +1,4 @@
--- # LlamaIndex
title: LlamaIndex
---
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) . vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .

View File

@ -1,6 +1,4 @@
--- # Offline Inference
title: Offline Inference
---
Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class. Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
@ -23,7 +21,7 @@ The available APIs depend on the model type:
!!! info !!! info
[API Reference][offline-inference-api] [API Reference][offline-inference-api]
### Ray Data LLM API ## Ray Data LLM API
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine. Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine.
This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference: This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:

View File

@ -1,6 +1,4 @@
--- # OpenAI-Compatible Server
title: OpenAI-Compatible Server
---
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client. vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.

View File

@ -1,6 +1,4 @@
--- # Frequently Asked Questions
title: Frequently Asked Questions
---
> Q: How can I serve multiple models on a single port using the OpenAI API? > Q: How can I serve multiple models on a single port using the OpenAI API?

View File

@ -1,6 +1,4 @@
--- # Troubleshooting
title: Troubleshooting
---
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible. This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.