From b942c094e3ab905aeb16f4136353f378e17159e8 Mon Sep 17 00:00:00 2001
From: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Date: Tue, 8 Jul 2025 11:27:40 +0100
Subject: [PATCH] Stop using title frontmatter and fix doc that can only be
 reached by search (#20623)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
---
 docs/.nav.yml | 1 +
 docs/community/contact_us.md | 4 +---
 docs/community/meetups.md | 4 +---
 docs/configuration/engine_args.md | 4 +---
 docs/configuration/serve_args.md | 4 +---
 docs/contributing/benchmarks.md | 4 +---
 docs/contributing/{ci-failures.md => ci/failures.md} | 0
 docs/{ => contributing}/ci/update_pytorch_version.md | 4 +---
 docs/contributing/model/README.md | 4 +---
 docs/contributing/model/basic.md | 4 +---
 docs/contributing/model/multimodal.md | 4 +---
 docs/contributing/model/registration.md | 4 +---
 docs/contributing/model/tests.md | 4 +---
 docs/deployment/docker.md | 4 +---
 docs/deployment/frameworks/anyscale.md | 5 ++---
 docs/deployment/frameworks/anything-llm.md | 4 +---
 docs/deployment/frameworks/autogen.md | 4 +---
 docs/deployment/frameworks/bentoml.md | 4 +---
 docs/deployment/frameworks/cerebrium.md | 4 +---
 docs/deployment/frameworks/chatbox.md | 4 +---
 docs/deployment/frameworks/dify.md | 4 +---
 docs/deployment/frameworks/dstack.md | 4 +---
 docs/deployment/frameworks/haystack.md | 4 +---
 docs/deployment/frameworks/helm.md | 4 +---
 docs/deployment/frameworks/litellm.md | 4 +---
 docs/deployment/frameworks/lobe-chat.md | 4 +---
 docs/deployment/frameworks/lws.md | 4 +---
 docs/deployment/frameworks/modal.md | 4 +---
 docs/deployment/frameworks/open-webui.md | 4 +---
 .../deployment/frameworks/retrieval_augmented_generation.md | 4 +---
 docs/deployment/frameworks/skypilot.md | 4 +---
 docs/deployment/frameworks/streamlit.md | 4 +---
 docs/deployment/frameworks/triton.md | 4 +---
 docs/deployment/integrations/kserve.md | 4 +---
 docs/deployment/integrations/kubeai.md | 4 +---
 docs/deployment/integrations/llamastack.md | 4 +---
 docs/deployment/integrations/llmaz.md | 4 +---
 docs/deployment/integrations/production-stack.md | 4 +---
 docs/deployment/k8s.md | 4 +---
 docs/deployment/nginx.md | 4 +---
 docs/design/arch_overview.md | 4 +---
 docs/design/automatic_prefix_caching.md | 4 +---
 docs/design/huggingface_integration.md | 4 +---
 docs/design/kernel/paged_attention.md | 4 +---
 docs/design/mm_processing.md | 4 +---
 docs/design/plugin_system.md | 4 +---
 docs/features/automatic_prefix_caching.md | 4 +---
 docs/features/compatibility_matrix.md | 4 +---
 docs/features/disagg_prefill.md | 4 +---
 docs/features/lora.md | 4 +---
 docs/features/multimodal_inputs.md | 4 +---
 docs/features/quantization/README.md | 4 +---
 docs/features/quantization/auto_awq.md | 4 +---
 docs/features/quantization/bitblas.md | 4 +---
 docs/features/quantization/bnb.md | 4 +---
 docs/features/quantization/fp8.md | 4 +---
 docs/features/quantization/gguf.md | 4 +---
 docs/features/quantization/gptqmodel.md | 4 +---
 docs/features/quantization/int4.md | 4 +---
 docs/features/quantization/int8.md | 4 +---
 docs/features/quantization/quantized_kvcache.md | 4 +---
 docs/features/quantization/quark.md | 4 +---
 docs/features/quantization/supported_hardware.md | 4 +---
 docs/features/reasoning_outputs.md | 4 +---
 docs/features/spec_decode.md | 4 +---
 docs/features/structured_outputs.md | 4 +---
 docs/getting_started/installation/README.md | 4 +---
 docs/getting_started/quickstart.md | 4 +---
 docs/models/extensions/runai_model_streamer.md | 4 +---
 docs/models/extensions/tensorizer.md | 4 +---
 docs/models/generative_models.md | 4 +---
 docs/models/hardware_supported_models/tpu.md | 4 +---
 docs/models/pooling_models.md | 4 +---
 docs/models/supported_models.md | 4 +---
 docs/serving/distributed_serving.md | 4 +---
 docs/serving/integrations/langchain.md | 4 +---
 docs/serving/integrations/llamaindex.md | 4 +---
 docs/serving/offline_inference.md | 6 ++----
 docs/serving/openai_compatible_server.md | 4 +---
 docs/usage/faq.md | 4 +---
 docs/usage/troubleshooting.md | 4 +---
 81 files changed, 82 insertions(+), 238 deletions(-)
 rename docs/contributing/{ci-failures.md => ci/failures.md} (100%)
 rename docs/{ => contributing}/ci/update_pytorch_version.md (99%)

diff --git a/docs/.nav.yml b/docs/.nav.yml
index 06bfcc3f1effe..ab54dc3e535bd 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -55,6 +55,7 @@ nav:
     - contributing/model/registration.md
     - contributing/model/tests.md
     - contributing/model/multimodal.md
+  - CI: contributing/ci
  - Design Documents:
    - V0: design
    - V1: design/v1
diff --git a/docs/community/contact_us.md b/docs/community/contact_us.md
index f26e312b64e70..04c28cde5f6b0 100644
--- a/docs/community/contact_us.md
+++ b/docs/community/contact_us.md
@@ -1,5 +1,3 @@
----
-title: Contact Us
----
+# Contact Us
 
 --8<-- "README.md:contact-us"
diff --git a/docs/community/meetups.md b/docs/community/meetups.md
index 89de4574d79e4..e8b3a9c9c8e69 100644
--- a/docs/community/meetups.md
+++ b/docs/community/meetups.md
@@ -1,6 +1,4 @@
----
-title: Meetups
----
+# Meetups
 
 We host regular meetups in San Francisco Bay Area every 2 months. We will share the project updates from the vLLM team and have guest speakers from the industry to share their experience and insights. Please find the materials of our previous meetups below:
 
diff --git a/docs/configuration/engine_args.md b/docs/configuration/engine_args.md
index 579a4731cacae..a0e3594cd5813 100644
--- a/docs/configuration/engine_args.md
+++ b/docs/configuration/engine_args.md
@@ -1,6 +1,4 @@
----
-title: Engine Arguments
----
+# Engine Arguments
 
 Engine arguments control the behavior of the vLLM engine.
 
diff --git a/docs/configuration/serve_args.md b/docs/configuration/serve_args.md
index 4a7d771c5b8f1..142d4b8af898e 100644
--- a/docs/configuration/serve_args.md
+++ b/docs/configuration/serve_args.md
@@ -1,6 +1,4 @@
----
-title: Server Arguments
----
+# Server Arguments
 
 The `vllm serve` command is used to launch the OpenAI-compatible server.
 
diff --git a/docs/contributing/benchmarks.md b/docs/contributing/benchmarks.md
index d0fbfa13cb94a..0ebd99ba5ae12 100644
--- a/docs/contributing/benchmarks.md
+++ b/docs/contributing/benchmarks.md
@@ -1,6 +1,4 @@
----
-title: Benchmark Suites
----
+# Benchmark Suites
 
 vLLM contains two sets of benchmarks:
 
diff --git a/docs/contributing/ci-failures.md b/docs/contributing/ci/failures.md
similarity index 100%
rename from docs/contributing/ci-failures.md
rename to docs/contributing/ci/failures.md
diff --git a/docs/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md
similarity index 99%
rename from docs/ci/update_pytorch_version.md
rename to docs/contributing/ci/update_pytorch_version.md
index eb8f194557912..2327bc4b53ad2 100644
--- a/docs/ci/update_pytorch_version.md
+++ b/docs/contributing/ci/update_pytorch_version.md
@@ -1,6 +1,4 @@
----
-title: Update PyTorch version on vLLM OSS CI/CD
----
+# Update PyTorch version on vLLM OSS CI/CD
 
 vLLM's current policy is to always use the latest PyTorch stable
 release in CI/CD. It is standard practice to submit a PR to update the
diff --git a/docs/contributing/model/README.md b/docs/contributing/model/README.md
index dd0e3e701d50b..0ca77fa499db7 100644
--- a/docs/contributing/model/README.md
+++ b/docs/contributing/model/README.md
@@ -1,6 +1,4 @@
----
-title: Summary
----
+# Summary
 
 !!! important
     Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve ` works first!
diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md
index f4f3085dc4e2a..542351fd66bb0 100644
--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -1,6 +1,4 @@
----
-title: Basic Model
----
+# Basic Model
 
 This guide walks you through the steps to implement a basic vLLM model.
 
diff --git a/docs/contributing/model/multimodal.md b/docs/contributing/model/multimodal.md
index ced1480ddcc47..3295b8c711c0c 100644
--- a/docs/contributing/model/multimodal.md
+++ b/docs/contributing/model/multimodal.md
@@ -1,6 +1,4 @@
----
-title: Multi-Modal Support
----
+# Multi-Modal Support
 
 This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](../../features/multimodal_inputs.md).
 
diff --git a/docs/contributing/model/registration.md b/docs/contributing/model/registration.md
index 46f50a6ec90de..35f35ffa4cde6 100644
--- a/docs/contributing/model/registration.md
+++ b/docs/contributing/model/registration.md
@@ -1,6 +1,4 @@
----
-title: Registering a Model
----
+# Registering a Model
 
 vLLM relies on a model registry to determine how to run each model. A list of pre-registered architectures can be found [here](../../models/supported_models.md).
 
diff --git a/docs/contributing/model/tests.md b/docs/contributing/model/tests.md
index 134a73449be6d..1206ad36771ea 100644
--- a/docs/contributing/model/tests.md
+++ b/docs/contributing/model/tests.md
@@ -1,6 +1,4 @@
----
-title: Unit Testing
----
+# Unit Testing
 
 This page explains how to write unit tests to verify the implementation of your model.
 
diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md
index daf2031938647..e500751896b34 100644
--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -1,6 +1,4 @@
----
-title: Using Docker
----
+# Using Docker
 
 [](){ #deployment-docker-pre-built-image }
 
diff --git a/docs/deployment/frameworks/anyscale.md b/docs/deployment/frameworks/anyscale.md
index 2ee325782ac0c..5604f7f96157d 100644
--- a/docs/deployment/frameworks/anyscale.md
+++ b/docs/deployment/frameworks/anyscale.md
@@ -1,6 +1,5 @@
----
-title: Anyscale
----
+# Anyscale
+
 [](){ #deployment-anyscale }
 
 [Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
diff --git a/docs/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md
index 6cead082e1af0..d6b28a358cc3d 100644
--- a/docs/deployment/frameworks/anything-llm.md
+++ b/docs/deployment/frameworks/anything-llm.md
@@ -1,6 +1,4 @@
----
-title: Anything LLM
----
+# Anything LLM
 
 [Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
 
diff --git a/docs/deployment/frameworks/autogen.md b/docs/deployment/frameworks/autogen.md
index 8510d063b8391..c255a85d38401 100644
--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -1,6 +1,4 @@
----
-title: AutoGen
----
+# AutoGen
 
 [AutoGen](https://github.com/microsoft/autogen) is a framework for creating multi-agent AI applications that can act autonomously or work alongside humans.
 
diff --git a/docs/deployment/frameworks/bentoml.md b/docs/deployment/frameworks/bentoml.md
index a11fc4804e44f..9c8f2527f2e2a 100644
--- a/docs/deployment/frameworks/bentoml.md
+++ b/docs/deployment/frameworks/bentoml.md
@@ -1,6 +1,4 @@
----
-title: BentoML
----
+# BentoML
 
 [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
 
diff --git a/docs/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md
index 3a8d6627312bd..1f233c3204a15 100644
--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -1,6 +1,4 @@
----
-title: Cerebrium
----
+# Cerebrium
 
 vLLM_plus_cerebrium
 
diff --git a/docs/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md
index 0dd97633b382d..15f92ed1e34df 100644
--- a/docs/deployment/frameworks/chatbox.md
+++ b/docs/deployment/frameworks/chatbox.md
@@ -1,6 +1,4 @@
----
-title: Chatbox
----
+# Chatbox
 
 [Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.
 
diff --git a/docs/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md
index e08fdafb6c843..a3063194fb513 100644
--- a/docs/deployment/frameworks/dify.md
+++ b/docs/deployment/frameworks/dify.md
@@ -1,6 +1,4 @@
----
-title: Dify
----
+# Dify
 
 [Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
 
diff --git a/docs/deployment/frameworks/dstack.md b/docs/deployment/frameworks/dstack.md
index 750df67223cb8..23dc58c974ed8 100644
--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -1,6 +1,4 @@
----
-title: dstack
----
+# dstack
 
 vLLM_plus_dstack
 
diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md
index d069bda0e815e..a18d68142cabb 100644
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -1,6 +1,4 @@
----
-title: Haystack
----
+# Haystack
 
 # Haystack
 
diff --git a/docs/deployment/frameworks/helm.md b/docs/deployment/frameworks/helm.md
index 4dacfdf352df7..e5d44945ba725 100644
--- a/docs/deployment/frameworks/helm.md
+++ b/docs/deployment/frameworks/helm.md
@@ -1,6 +1,4 @@
----
-title: Helm
----
+# Helm
 
 A Helm chart to deploy vLLM for Kubernetes
 
diff --git a/docs/deployment/frameworks/litellm.md b/docs/deployment/frameworks/litellm.md
index 8499cebc6fd02..c7e514f2276e0 100644
--- a/docs/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -1,6 +1,4 @@
----
-title: LiteLLM
----
+# LiteLLM
 
 [LiteLLM](https://github.com/BerriAI/litellm) call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]
 
diff --git a/docs/deployment/frameworks/lobe-chat.md b/docs/deployment/frameworks/lobe-chat.md
index 22e62ad615ae5..e3e7dbe6e1e80 100644
--- a/docs/deployment/frameworks/lobe-chat.md
+++ b/docs/deployment/frameworks/lobe-chat.md
@@ -1,6 +1,4 @@
----
-title: Lobe Chat
----
+# Lobe Chat
 
 [Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
 
diff --git a/docs/deployment/frameworks/lws.md b/docs/deployment/frameworks/lws.md
index 633949bf32d8b..3319dc6c90e1e 100644
--- a/docs/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -1,6 +1,4 @@
----
-title: LWS
----
+# LWS
 
 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference.
 
diff --git a/docs/deployment/frameworks/modal.md b/docs/deployment/frameworks/modal.md
index feb6f698009d9..0ab5ed92fe6bd 100644
--- a/docs/deployment/frameworks/modal.md
+++ b/docs/deployment/frameworks/modal.md
@@ -1,6 +1,4 @@
----
-title: Modal
----
+# Modal
 
 vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
 
diff --git a/docs/deployment/frameworks/open-webui.md b/docs/deployment/frameworks/open-webui.md
index 53d21b4325611..8f27a2b9bb6ee 100644
--- a/docs/deployment/frameworks/open-webui.md
+++ b/docs/deployment/frameworks/open-webui.md
@@ -1,6 +1,4 @@
----
-title: Open WebUI
----
+# Open WebUI
 
 1. Install the [Docker](https://docs.docker.com/engine/install/)
 
diff --git a/docs/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md
index 059bdf0309723..96dd99e7118b6 100644
--- a/docs/deployment/frameworks/retrieval_augmented_generation.md
+++ b/docs/deployment/frameworks/retrieval_augmented_generation.md
@@ -1,6 +1,4 @@
----
-title: Retrieval-Augmented Generation
----
+# Retrieval-Augmented Generation
 
 [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
 
diff --git a/docs/deployment/frameworks/skypilot.md b/docs/deployment/frameworks/skypilot.md
index ffa59a17e2fa0..06e2fed38f056 100644
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -1,6 +1,4 @@
----
-title: SkyPilot
----
+# SkyPilot
 
 vLLM
 
diff --git a/docs/deployment/frameworks/streamlit.md b/docs/deployment/frameworks/streamlit.md
index 6445ab68e3411..af0f0690c68e2 100644
--- a/docs/deployment/frameworks/streamlit.md
+++ b/docs/deployment/frameworks/streamlit.md
@@ -1,6 +1,4 @@
----
-title: Streamlit
----
+# Streamlit
 
 [Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
 
diff --git a/docs/deployment/frameworks/triton.md b/docs/deployment/frameworks/triton.md
index ef6b6f9325b92..faff4a4263eb2 100644
--- a/docs/deployment/frameworks/triton.md
+++ b/docs/deployment/frameworks/triton.md
@@ -1,5 +1,3 @@
----
-title: NVIDIA Triton
----
+# NVIDIA Triton
 
 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
diff --git a/docs/deployment/integrations/kserve.md b/docs/deployment/integrations/kserve.md
index b61112b3a91bd..edf79fca4f93e 100644
--- a/docs/deployment/integrations/kserve.md
+++ b/docs/deployment/integrations/kserve.md
@@ -1,6 +1,4 @@
----
-title: KServe
----
+# KServe
 
 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
 
diff --git a/docs/deployment/integrations/kubeai.md b/docs/deployment/integrations/kubeai.md
index 37604b8feef4c..89d072215e956 100644
--- a/docs/deployment/integrations/kubeai.md
+++ b/docs/deployment/integrations/kubeai.md
@@ -1,6 +1,4 @@
----
-title: KubeAI
----
+# KubeAI
 
 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes.
 It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
diff --git a/docs/deployment/integrations/llamastack.md b/docs/deployment/integrations/llamastack.md
index cf328054621d8..28031f01f85e8 100644
--- a/docs/deployment/integrations/llamastack.md
+++ b/docs/deployment/integrations/llamastack.md
@@ -1,6 +1,4 @@
----
-title: Llama Stack
----
+# Llama Stack
 
 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .
 
diff --git a/docs/deployment/integrations/llmaz.md b/docs/deployment/integrations/llmaz.md
index 87772ec6ce088..77730a26c24fc 100644
--- a/docs/deployment/integrations/llmaz.md
+++ b/docs/deployment/integrations/llmaz.md
@@ -1,6 +1,4 @@
----
-title: llmaz
----
+# llmaz
 
 [llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed for production use. It uses vLLM as the default model serving backend.
 
diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md
index 19371061a5c10..ffec679207fd8 100644
--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -1,6 +1,4 @@
----
-title: Production stack
----
+# Production stack
 
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack).
 Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md
index 8eb69527c8472..8eb2270ab7c87 100644
--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -1,6 +1,4 @@
----
-title: Using Kubernetes
----
+# Using Kubernetes
 
 Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
 
diff --git a/docs/deployment/nginx.md b/docs/deployment/nginx.md
index 2cdf766d11950..b3178e77f845c 100644
--- a/docs/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -1,6 +1,4 @@
----
-title: Using Nginx
----
+# Using Nginx
 
 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
 
diff --git a/docs/design/arch_overview.md b/docs/design/arch_overview.md
index 27676bc2e919f..334df5dc9b7f0 100644
--- a/docs/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -1,6 +1,4 @@
----
-title: Architecture Overview
----
+# Architecture Overview
 
 This document provides an overview of the vLLM architecture.
 
diff --git a/docs/design/automatic_prefix_caching.md b/docs/design/automatic_prefix_caching.md
index 88b3d0b66e70d..60e21f6ad0fcb 100644
--- a/docs/design/automatic_prefix_caching.md
+++ b/docs/design/automatic_prefix_caching.md
@@ -1,6 +1,4 @@
----
-title: Automatic Prefix Caching
----
+# Automatic Prefix Caching
 
 The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens.
 The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
diff --git a/docs/design/huggingface_integration.md b/docs/design/huggingface_integration.md
index 100f931ec6123..7b01313ddb00a 100644
--- a/docs/design/huggingface_integration.md
+++ b/docs/design/huggingface_integration.md
@@ -1,6 +1,4 @@
----
-title: Integration with HuggingFace
----
+# Integration with HuggingFace
 
 This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
 
diff --git a/docs/design/kernel/paged_attention.md b/docs/design/kernel/paged_attention.md
index bd81d817895d5..94bfa97ee2217 100644
--- a/docs/design/kernel/paged_attention.md
+++ b/docs/design/kernel/paged_attention.md
@@ -1,6 +1,4 @@
----
-title: vLLM Paged Attention
----
+# vLLM Paged Attention
 
 Currently, vLLM utilizes its own implementation of a multi-head query attention kernel (`csrc/attention/attention_kernels.cu`).
 
diff --git a/docs/design/mm_processing.md b/docs/design/mm_processing.md
index 75c986269df5a..1e9b6ad6e821e 100644
--- a/docs/design/mm_processing.md
+++ b/docs/design/mm_processing.md
@@ -1,6 +1,4 @@
----
-title: Multi-Modal Data Processing
----
+# Multi-Modal Data Processing
 
 To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. ``) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
 
diff --git a/docs/design/plugin_system.md b/docs/design/plugin_system.md
index 35372b5ea03bc..23a05ac719ce2 100644
--- a/docs/design/plugin_system.md
+++ b/docs/design/plugin_system.md
@@ -1,6 +1,4 @@
----
-title: vLLM's Plugin System
----
+# vLLM's Plugin System
 
 The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
 
diff --git a/docs/features/automatic_prefix_caching.md b/docs/features/automatic_prefix_caching.md
index 73ff1757371fa..f3c4bdd85c378 100644
--- a/docs/features/automatic_prefix_caching.md
+++ b/docs/features/automatic_prefix_caching.md
@@ -1,6 +1,4 @@
----
-title: Automatic Prefix Caching
----
+# Automatic Prefix Caching
 
 ## Introduction
 
diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md
index d71e9fafd6298..fdd75bfe33d4c 100644
--- a/docs/features/compatibility_matrix.md
+++ b/docs/features/compatibility_matrix.md
@@ -1,6 +1,4 @@
----
-title: Compatibility Matrix
----
+# Compatibility Matrix
 
 The tables below show mutually exclusive features and the support on some hardware.
 
diff --git a/docs/features/disagg_prefill.md b/docs/features/disagg_prefill.md
index 5b45b676ee90d..c0c32594f266c 100644
--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -1,6 +1,4 @@
----
-title: Disaggregated Prefilling (experimental)
----
+# Disaggregated Prefilling (experimental)
 
 This page introduces you the disaggregated prefilling feature in vLLM.
 
diff --git a/docs/features/lora.md b/docs/features/lora.md
index 5ede7c42976c7..3e17c659655e5 100644
--- a/docs/features/lora.md
+++ b/docs/features/lora.md
@@ -1,6 +1,4 @@
----
-title: LoRA Adapters
----
+# LoRA Adapters
 
 This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
 
diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md
index 644c9d03af97c..f9df2c89c6007 100644
--- a/docs/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -1,6 +1,4 @@
----
-title: Multimodal Inputs
----
+# Multimodal Inputs
 
 This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
 
diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
index 73d54b8dca851..c30abdab5d612 100644
--- a/docs/features/quantization/README.md
+++ b/docs/features/quantization/README.md
@@ -1,6 +1,4 @@
----
-title: Quantization
----
+# Quantization
 
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
 
diff --git a/docs/features/quantization/auto_awq.md b/docs/features/quantization/auto_awq.md
index 97227e54c356c..fc998387d29aa 100644
--- a/docs/features/quantization/auto_awq.md
+++ b/docs/features/quantization/auto_awq.md
@@ -1,6 +1,4 @@
----
-title: AutoAWQ
----
+# AutoAWQ
 
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ). Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
 
diff --git a/docs/features/quantization/bitblas.md b/docs/features/quantization/bitblas.md
index 8ad1e1dea299b..ba014d28cde4a 100644
--- a/docs/features/quantization/bitblas.md
+++ b/docs/features/quantization/bitblas.md
@@ -1,6 +1,4 @@
----
-title: BitBLAS
----
+# BitBLAS
 
 vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
 
diff --git a/docs/features/quantization/bnb.md b/docs/features/quantization/bnb.md
index 11c37547863b3..3b15a6072d47a 100644
--- a/docs/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -1,6 +1,4 @@
----
-title: BitsAndBytes
----
+# BitsAndBytes
 
 vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference. BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
 
diff --git a/docs/features/quantization/fp8.md b/docs/features/quantization/fp8.md
index 03aec160ea1ca..a6c0fd78e76b6 100644
--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -1,6 +1,4 @@
----
-title: FP8 W8A8
----
+# FP8 W8A8
 
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x. Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
 
diff --git a/docs/features/quantization/gguf.md b/docs/features/quantization/gguf.md
index 564b999fecd9c..2a1c3bdd775f1 100644
--- a/docs/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -1,6 +1,4 @@
----
-title: GGUF
----
+# GGUF
 
 !!! warning
     Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
diff --git a/docs/features/quantization/gptqmodel.md b/docs/features/quantization/gptqmodel.md
index 402e0cb3b2bf9..47cb2d65bae47 100644
--- a/docs/features/quantization/gptqmodel.md
+++ b/docs/features/quantization/gptqmodel.md
@@ -1,6 +1,4 @@
----
-title: GPTQModel
----
+# GPTQModel
 
 To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
 
diff --git a/docs/features/quantization/int4.md b/docs/features/quantization/int4.md
index a76852cf82312..f26de73c2f0fa 100644
--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -1,6 +1,4 @@
----
-title: INT4 W4A16
----
+# INT4 W4A16
 
 vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
 
diff --git a/docs/features/quantization/int8.md b/docs/features/quantization/int8.md
index e1ced47ab9155..7e1cb3fee94a3 100644
--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -1,6 +1,4 @@
----
-title: INT8 W8A8
----
+# INT8 W8A8
 
 vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size while maintaining good performance.
 
diff --git a/docs/features/quantization/quantized_kvcache.md b/docs/features/quantization/quantized_kvcache.md
index 2b0622f197482..c54ec43658a43 100644
--- a/docs/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
@@ -1,6 +1,4 @@
----
-title: Quantized KV Cache
----
+# Quantized KV Cache
 
 ## FP8 KV Cache
 
diff --git a/docs/features/quantization/quark.md b/docs/features/quantization/quark.md
index 288a636326c99..2c48f9b546b83 100644
--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
@@ -1,6 +1,4 @@
----
-title: AMD Quark
----
+# AMD Quark
 
 Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve throughput while with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
diff --git a/docs/features/quantization/supported_hardware.md b/docs/features/quantization/supported_hardware.md
index d66972792d574..bb4fe5b54b57b 100644
--- a/docs/features/quantization/supported_hardware.md
+++ b/docs/features/quantization/supported_hardware.md
@@ -1,6 +1,4 @@
----
-title: Supported Hardware
----
+# Supported Hardware
 
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
 
diff --git a/docs/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md
index d6ee2955b8965..7ab7efd5e7656 100644
--- a/docs/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -1,6 +1,4 @@
----
-title: Reasoning Outputs
----
+# Reasoning Outputs
 
 vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
 
diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md
index 9c63974d0e2ad..4be6bd01a4eb7 100644
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -1,6 +1,4 @@
----
-title: Speculative Decoding
----
+# Speculative Decoding
 
 !!! warning
     Please note that speculative decoding in vLLM is not yet optimized and does
diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
index 84d6ea4fe51e4..4f737afa80f55 100644
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -1,6 +1,4 @@
----
-title: Structured Outputs
----
+# Structured Outputs
 
 vLLM supports the generation of structured outputs using
 [xgrammar](https://github.com/mlc-ai/xgrammar) or
diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md
index 274e7560e46fe..a252343dcee8a 100644
--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
@@ -1,6 +1,4 @@
----
-title: Installation
----
+# Installation
 
 vLLM supports the following hardware platforms:
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
index 2decd15f033e8..74235db16a15d 100644
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -1,6 +1,4 @@
----
-title: Quickstart
----
+# Quickstart
 
 This guide will help you quickly get started with vLLM to perform:
diff --git a/docs/models/extensions/runai_model_streamer.md b/docs/models/extensions/runai_model_streamer.md
index b0affe7a4b11d..992dddf385d0d 100644
--- a/docs/models/extensions/runai_model_streamer.md
+++ b/docs/models/extensions/runai_model_streamer.md
@@ -1,6 +1,4 @@
----
-title: Loading models with Run:ai Model Streamer
----
+# Loading models with Run:ai Model Streamer
 
 Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
 Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
diff --git a/docs/models/extensions/tensorizer.md b/docs/models/extensions/tensorizer.md
index 09afca3966e54..5aa647b199275 100644
--- a/docs/models/extensions/tensorizer.md
+++ b/docs/models/extensions/tensorizer.md
@@ -1,6 +1,4 @@
----
-title: Loading models with CoreWeave's Tensorizer
----
+# Loading models with CoreWeave's Tensorizer
 
 vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
 vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
diff --git a/docs/models/generative_models.md b/docs/models/generative_models.md
index e51b56fa6b7cf..21ad115e411a3 100644
--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -1,6 +1,4 @@
----
-title: Generative Models
----
+# Generative Models
 
 vLLM provides first-class support for generative models, which covers most of LLMs.
diff --git a/docs/models/hardware_supported_models/tpu.md b/docs/models/hardware_supported_models/tpu.md
index 1e0449b5fdeb5..da03a3b3160ad 100644
--- a/docs/models/hardware_supported_models/tpu.md
+++ b/docs/models/hardware_supported_models/tpu.md
@@ -1,6 +1,4 @@
----
-title: TPU
----
+# TPU
 
 # TPU Supported Models
 
 ## Text-only Language Models
diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md
index c659fc567927d..f0de84a66f8b0 100644
--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -1,6 +1,4 @@
----
-title: Pooling Models
----
+# Pooling Models
 
 vLLM also supports pooling models, including embedding, reranking and reward models.
diff --git a/docs/models/supported_models.md b/docs/models/supported_models.md
index 54bed5267c3b0..e003a3e31717a 100644
--- a/docs/models/supported_models.md
+++ b/docs/models/supported_models.md
@@ -1,6 +1,4 @@
----
-title: Supported Models
----
+# Supported Models
 
 vLLM supports [generative](./generative_models.md) and [pooling](./pooling_models.md) models across various tasks.
 If a model supports more than one task, you can set the task via the `--task` argument.
diff --git a/docs/serving/distributed_serving.md b/docs/serving/distributed_serving.md
index 1ba7a008734f6..8012500dfbf9f 100644
--- a/docs/serving/distributed_serving.md
+++ b/docs/serving/distributed_serving.md
@@ -1,6 +1,4 @@
----
-title: Distributed Inference and Serving
----
+# Distributed Inference and Serving
 
 ## How to decide the distributed inference strategy?
diff --git a/docs/serving/integrations/langchain.md b/docs/serving/integrations/langchain.md
index 6d45623cceb86..47074f411ac99 100644
--- a/docs/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
@@ -1,6 +1,4 @@
----
-title: LangChain
----
+# LangChain
 
 vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
diff --git a/docs/serving/integrations/llamaindex.md b/docs/serving/integrations/llamaindex.md
index 1cd36239646da..4b838cbcaa9d1 100644
--- a/docs/serving/integrations/llamaindex.md
+++ b/docs/serving/integrations/llamaindex.md
@@ -1,6 +1,4 @@
----
-title: LlamaIndex
----
+# LlamaIndex
 
 vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
diff --git a/docs/serving/offline_inference.md b/docs/serving/offline_inference.md
index 695eaa4864589..4ec879e0bc8a5 100644
--- a/docs/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
@@ -1,6 +1,4 @@
----
-title: Offline Inference
----
+# Offline Inference
 
 Offline inference is possible in your own code using vLLM's [`LLM`][vllm.LLM] class.
@@ -23,7 +21,7 @@ The available APIs depend on the model type:
 !!! info
     [API Reference][offline-inference-api]
 
-### Ray Data LLM API
+## Ray Data LLM API
 
 Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine. This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md
index 85cf08ebef11a..cebef2b6a2d60 100644
--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -1,6 +1,4 @@
----
-title: OpenAI-Compatible Server
----
+# OpenAI-Compatible Server
 
 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
diff --git a/docs/usage/faq.md b/docs/usage/faq.md
index 275a7191e60db..2c8680cb6f7b5 100644
--- a/docs/usage/faq.md
+++ b/docs/usage/faq.md
@@ -1,6 +1,4 @@
----
-title: Frequently Asked Questions
----
+# Frequently Asked Questions
 
 > Q: How can I serve multiple models on a single port using the OpenAI API?
diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index e18f808329b0b..f9ba32c58c4e1 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -1,6 +1,4 @@
----
-title: Troubleshooting
----
+# Troubleshooting
 
 This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
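Every hunk in this patch applies the same mechanical rewrite: a leading `title:` frontmatter block becomes an H1 heading. A minimal Python sketch of that transformation is below; it is illustrative only, and the `title_to_heading` helper and its regex are assumptions, not anything shipped in this patch or in vLLM's docs tooling.

```python
import re

# Matches a document that begins with exactly:
#   ---
#   title: <text>
#   ---
# [ \t]* (rather than \s*) keeps the match from swallowing the blank
# line that separates the frontmatter from the body.
FRONTMATTER_TITLE = re.compile(
    r"\A---[ \t]*\ntitle:[ \t]*(?P<title>.+?)[ \t]*\n---[ \t]*\n"
)

def title_to_heading(markdown: str) -> str:
    """Rewrite a leading `title:` frontmatter block as a `#` heading.

    Documents without such a block are returned unchanged.
    """
    return FRONTMATTER_TITLE.sub(
        lambda m: f"# {m.group('title')}\n", markdown, count=1
    )

print(title_to_heading("---\ntitle: Quickstart\n---\n\nThis guide..."))
```

Running this over each changed file (and spot-checking the result) would reproduce the bulk of the diff above, leaving only the heading-level fix in `docs/serving/offline_inference.md` to do by hand.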