diff --git a/docs/source/deployment/integrations/index.md b/docs/source/deployment/integrations/index.md
index a557456c086d2..410742b88c735 100644
--- a/docs/source/deployment/integrations/index.md
+++ b/docs/source/deployment/integrations/index.md
@@ -7,4 +7,5 @@ kserve
 kubeai
 llamastack
 llmaz
+production-stack
 :::
diff --git a/docs/source/deployment/integrations/production-stack.md b/docs/source/deployment/integrations/production-stack.md
new file mode 100644
index 0000000000000..e66e8e6a16b29
--- /dev/null
+++ b/docs/source/deployment/integrations/production-stack.md
@@ -0,0 +1,154 @@
(deployment-production-stack)=

# Production stack

Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, the [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:

* **Upstream vLLM compatibility** – It wraps around upstream vLLM without modifying its code.
* **Ease of use** – Simplified deployment via Helm charts and observability through Grafana dashboards.
* **High performance** – Optimized for LLM workloads with features like multi-model support, model-aware and prefix-aware routing, fast vLLM bootstrapping, and KV cache offloading with [LMCache](https://github.com/LMCache/LMCache), among others.

If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](https://github.com/vllm-project/production-stack), we provide a step-by-step [guide](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) and a [short video](https://www.youtube.com/watch?v=EsTJbQtzj0g) to set up everything and get started in **4 minutes**!

## Pre-requisite

Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).

## Deployment using vLLM production stack

The standard vLLM production stack install uses a Helm chart. You can run this [bash script](https://github.com/vllm-project/production-stack/blob/main/tutorials/install-helm.sh) to install Helm on your GPU server.

To install the vLLM production stack, run the following commands on your desktop:

```bash
sudo helm repo add vllm https://vllm-project.github.io/production-stack
sudo helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```

This will instantiate a vLLM-production-stack-based deployment named `vllm` that runs a small LLM (the Facebook opt-125m model).

### Validate Installation

Monitor the deployment status using:

```bash
sudo kubectl get pods
```

You will see the pods for the `vllm` deployment transition to the `Running` state:

```text
NAME                                           READY   STATUS    RESTARTS   AGE
vllm-deployment-router-859d8fb668-2x2b7        1/1     Running   0          2m38s
vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs   1/1     Running   0          2m38s
```

**NOTE**: It may take some time for the containers to download the Docker images and LLM weights.
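
If the pods stay in `Pending` or `ContainerCreating` for longer than expected, you can inspect them while they start up. A minimal check, using the serving engine pod name from the example output above (your pod name will differ):

```bash
# Show scheduling events and image pull status for the serving engine pod
sudo kubectl describe pod vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs

# Follow the container logs to watch the model download and vLLM start-up
sudo kubectl logs -f vllm-opt125m-deployment-vllm-84dfc9bd7-vb9bs
```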

### Send a Query to the Stack

Forward the `vllm-router-service` port to the host machine:

```bash
sudo kubectl port-forward svc/vllm-router-service 30080:80
```

You can then send a query to the OpenAI-compatible API to list the available models:

```bash
curl -o- http://localhost:30080/models
```

Expected output:

```json
{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}
```

To send an actual completion request, you can issue a curl request to the OpenAI-compatible `/completions` endpoint:

```bash
curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'
```

Expected output:

```json
{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
```

### Uninstall

To remove the deployment, run:

```bash
sudo helm uninstall vllm
```

------

### (Advanced) Configuring vLLM production stack

The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

```yaml
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 1

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "10Gi"
```

In this YAML configuration:
* **`modelSpec`** includes:
  * `name`: A nickname used to identify the model.
  * `repository`: The Docker repository of the vLLM image.
  * `tag`: The Docker image tag.
  * `modelURL`: The LLM model that you want to serve.
* **`replicaCount`**: Number of replicas.
* **`requestCPU` and `requestMemory`**: Specify the CPU and memory resource requests for the pod.
* **`requestGPU`**: Specifies the number of GPUs required.
* **`pvcStorage`**: Allocates persistent storage for the model.
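
To customize the deployment (for example, to serve a different model or request more resources), one option is to copy the configuration above into your own values file and re-apply it with Helm. A minimal sketch, assuming your edits are saved to a local file named `custom-values.yaml` (a placeholder name):

```bash
# Re-apply the stack with a customized values file.
# `helm upgrade --install` updates the existing `vllm` release, or installs it if absent.
sudo helm upgrade --install vllm vllm/vllm-stack -f custom-values.yaml
```

Helm merges the values you provide with the chart's defaults, so you only need to override the fields you want to change.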

**NOTE:** If you intend to set up two pods, please refer to this [YAML file](https://github.com/vllm-project/production-stack/blob/main/tutorials/assets/values-01-2pods-minimal-example.yaml).

**NOTE:** vLLM production stack offers many more features (*e.g.* CPU offloading and a wide range of routing algorithms). Please check out these [examples and tutorials](https://github.com/vllm-project/production-stack/tree/main/tutorials) and our [repo](https://github.com/vllm-project/production-stack) for more details!

diff --git a/docs/source/deployment/k8s.md b/docs/source/deployment/k8s.md
index cbc95c20ff4b3..64071ba042d0b 100644
--- a/docs/source/deployment/k8s.md
+++ b/docs/source/deployment/k8s.md
@@ -2,17 +2,21 @@
 # Using Kubernetes
 
-Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.
+Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
 
-## Prerequisites
+--------
 
-Before you begin, ensure that you have the following:
+Alternatively, you can deploy vLLM to Kubernetes using a [Helm chart](https://docs.vllm.ai/en/latest/deployment/frameworks/helm.html). There are also open-source projects available to make your deployment even smoother.
 
-- A running Kubernetes cluster
-- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at `https://github.com/NVIDIA/k8s-device-plugin/`
-- Available GPU resources in your cluster
+* [vLLM production-stack](https://github.com/vllm-project/production-stack): Born out of a Berkeley-UChicago collaboration, vLLM production stack is a project that incorporates the latest research and community efforts while delivering production-level stability and performance. Check out the [documentation page](https://docs.vllm.ai/en/latest/deployment/integrations/production-stack.html) for more details and examples.
 
-## Deployment Steps
+--------
+
+## Pre-requisite
+
+Ensure that you have a running Kubernetes environment with GPUs (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
+
+## Deployment using native K8s
 
 1. Create a PVC, Secret and Deployment for vLLM