Docs update tpu install instructions (#27824)

Signed-off-by: Rob Mulla <rob.mulla@gmail.com>
Signed-off-by: Rob Mulla <RobMulla@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Parent: d6517be3cd
Commit: 70bfbd7b16
@@ -4,7 +4,7 @@ This doc serves as a collection of handy tips for optimizing your vLLM on TPU wo
 
 ## Get started
 
-Looking for setup and installation instructions? Find them [here](../getting_started/installation/google_tpu.md).
+Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/).
 
 ### TPU workload sizing
 
@@ -2,4 +2,4 @@ nav:
 - README.md
 - gpu.md
 - cpu.md
-- google_tpu.md
+- TPU: https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/
@@ -11,7 +11,6 @@ vLLM supports the following hardware platforms:
 - [ARM AArch64](cpu.md#arm-aarch64)
 - [Apple silicon](cpu.md#apple-silicon)
 - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
-- [Google TPU](google_tpu.md)
 
 ## Hardware Plugins
 
@@ -20,6 +19,7 @@ The backends below live **outside** the main `vllm` repository and follow the
 
 | Accelerator | PyPI / package | Repository |
 |-------------|----------------|------------|
+| Google TPU | `tpu-inference` | <https://github.com/vllm-project/tpu-inference> |
 | Ascend NPU | `vllm-ascend` | <https://github.com/vllm-project/vllm-ascend> |
 | Intel Gaudi (HPU) | N/A, install from source | <https://github.com/vllm-project/vllm-gaudi> |
 | MetaX MACA GPU | N/A, install from source | <https://github.com/MetaX-MACA/vLLM-metax> |
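The new table row lists `tpu-inference` as the PyPI package for the out-of-tree Google TPU backend. As a rough, illustrative sketch (not part of this commit), installing that plugin into an existing vLLM environment might look like the following; refer to the vllm-project/tpu-inference repository for the authoritative instructions:

```bash
# Hypothetical sketch: install the out-of-tree Google TPU backend plugin.
# The package name comes from the table above; version pins and any extra
# dependencies should be taken from the tpu-inference repository.
pip install tpu-inference
```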
@@ -1,193 +0,0 @@

# Google TPU

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions, each with different hardware specifications.
For more information about TPUs, see [TPU System Architecture](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm).
For more information on the TPU versions supported with vLLM, see:

- [TPU v6e](https://cloud.google.com/tpu/docs/v6e)
- [TPU v5e](https://cloud.google.com/tpu/docs/v5e)
- [TPU v5p](https://cloud.google.com/tpu/docs/v5p)
- [TPU v4](https://cloud.google.com/tpu/docs/v4)

These TPU versions let you configure the physical arrangement of the TPU
chips, which can improve throughput and networking performance. For more
information, see:

- [TPU v6e topologies](https://cloud.google.com/tpu/docs/v6e#configurations)
- [TPU v5e topologies](https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config)
- [TPU v5p topologies](https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config)
- [TPU v4 topologies](https://cloud.google.com/tpu/docs/v4#tpu-v4-config)

To use Cloud TPUs, you need TPU quota granted to your Google Cloud Platform
project. TPU quotas specify how many TPUs you can use in a GCP project and are
specified in terms of TPU version, the number of TPUs you want to use, and quota
type. For more information, see [TPU quota](https://cloud.google.com/tpu/docs/quota#tpu_quota).

For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tpu/pricing).

You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.google.com/tpu/docs/storage-options).

!!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.

## Requirements

- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
- Python: 3.11 or newer

### Provision Cloud TPUs

You can provision Cloud TPUs using the [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest)
or the [queued resources](https://cloud.google.com/tpu/docs/queued-resources)
API (preferred). This section shows how to create TPUs using the queued resource API. For
more information about using the Cloud TPU API, see [Create a Cloud TPU using the Create Node API](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api).
Queued resources enable you to request Cloud TPU resources in a queued manner.
When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.

!!! note
    In all of the following commands, replace the ALL CAPS parameter names with
    appropriate values. See the parameter descriptions table for more information.

### Provision Cloud TPUs with GKE

For more information about using TPUs with GKE, see:

- [About TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/tpus)
- [Deploy TPU workloads in GKE Standard](https://cloud.google.com/kubernetes-engine/docs/how-to/tpus)
- [Plan for TPUs in GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus)
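To make the GKE path a little more concrete, here is a rough, hypothetical sketch of provisioning a single-host TPU v5e node pool with `gcloud`; the cluster name, zone, and machine type are placeholders, and flags and supported machine types should be verified against the GKE TPU documentation linked above:

```bash
# Hypothetical example only: create a GKE node pool backed by a single-host
# TPU v5e slice (4 chips). Names, zone, and machine type are placeholders.
gcloud container node-pools create tpu-v5e-pool \
  --cluster my-gke-cluster \
  --zone us-west4-a \
  --machine-type ct5lp-hightpu-4t \
  --num-nodes 1
```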
## Configure a new environment

### Provision a Cloud TPU with the queued resource API

Create a TPU v5e with 4 TPU chips:

```bash
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type ACCELERATOR_TYPE \
  --runtime-version RUNTIME_VERSION \
  --service-account SERVICE_ACCOUNT
```

| Parameter name | Description |
|--------------------|-------------|
| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
| TPU_NAME | The user-assigned name of the TPU, which is created when the queued resource request is allocated. |
| PROJECT_ID | Your Google Cloud project. |
| ZONE | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see [TPU regions and zones]. |
| ACCELERATOR_TYPE | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions]. |
| RUNTIME_VERSION | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). |
| SERVICE_ACCOUNT | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@<your_project_ID>.iam.gserviceaccount.com` |
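As an illustrative, hypothetical instantiation of the command above: all values below are placeholders, the accelerator type `v6e-4` and runtime version `v2-alpha-tpuv6e` follow the examples in the parameter table, and zone availability and the service account must be checked for your own project.

```bash
# Hypothetical values only; substitute your own project, zone, and service account.
gcloud alpha compute tpus queued-resources create my-vllm-qr \
  --node-id my-vllm-tpu \
  --project my-gcp-project \
  --zone us-east5-b \
  --accelerator-type v6e-4 \
  --runtime-version v2-alpha-tpuv6e \
  --service-account tpu-service-account@my-gcp-project.iam.gserviceaccount.com
```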
Connect to your TPU VM using SSH:

```bash
gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
```

!!! note
    When configuring `RUNTIME_VERSION` ("TPU software version") on GCP, ensure it matches the TPU generation you've selected by referencing the [TPU VM images] compatibility matrix. Using an incompatible version may prevent vLLM from running correctly.

[TPU versions]: https://cloud.google.com/tpu/docs/runtimes
[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones

## Set up using Python

### Pre-built wheels

Currently, there are no pre-built TPU wheels.

### Build wheel from source

Install Miniconda:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
```

Create and activate a Conda environment for vLLM:

```bash
conda create -n vllm python=3.12 -y
conda activate vllm
```

Clone the vLLM repository and go to the vLLM directory:

```bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
```

Uninstall the existing `torch` and `torch_xla` packages:

```bash
pip uninstall torch torch-xla -y
```

Install build dependencies:

```bash
pip install -r requirements/tpu.txt
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```

Run the setup script:

```bash
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
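A small optional sanity check (not part of the original guide) to confirm the editable install is importable before moving on; it only assumes the build above succeeded:

```bash
# Optional check: confirm vLLM imports and report its version.
python -c "import vllm; print(vllm.__version__)"
```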
## Set up using Docker

### Pre-built images

See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
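As a hedged illustration of that substitution (not taken verbatim from the guide), a run command for the pre-built TPU image might look like the following, reusing the flags the build-from-source section below recommends for TPU containers:

```bash
# Hypothetical example: run the pre-built TPU image with the flags recommended
# for TPU containers in the section below.
docker run --privileged --net host --shm-size=16G -it vllm/vllm-tpu
```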
### Build image from source

You can use [docker/Dockerfile.tpu](../../../docker/Dockerfile.tpu) to build a Docker image with TPU support.

```bash
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
```

Run the Docker image with the following command:

```bash
# Make sure to add `--privileged --net host --shm-size=16G`.
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```

!!! note
    Since TPU relies on XLA, which requires static shapes, vLLM bucketizes the
    possible input shapes and compiles an XLA graph for each shape. Compilation
    may take 20~30 minutes on the first run. However, it drops to ~5 minutes
    afterwards because the XLA graphs are cached on disk (in
    `VLLM_XLA_CACHE_PATH` or `~/.cache/vllm/xla_cache` by default).
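If the container's filesystem is ephemeral, one way to keep that cache across restarts is to point it at persistent storage. This is an illustrative sketch rather than part of the original guide; the host path and mount are placeholders.

```bash
# Hypothetical: persist the XLA compilation cache outside the container so the
# ~5 minute warm-start path applies after a restart. The host path is a placeholder.
docker run --privileged --net host --shm-size=16G -it \
  -v /mnt/disks/persist/xla_cache:/root/xla_cache \
  -e VLLM_XLA_CACHE_PATH=/root/xla_cache \
  vllm-tpu
```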
!!! tip
    If you encounter the following error:

    ```console
    from torch._C import * # noqa: F403
    ImportError: libopenblas.so.0: cannot open shared object file: No such
    file or directory
    ```

    Install OpenBLAS with the following command:

    ```bash
    sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
    ```
@@ -63,6 +63,17 @@ This guide will help you quickly get started with vLLM to perform:
     rocm/vllm-dev:nightly
     ```
 
+=== "Google TPU"
+
+    To run vLLM on Google TPUs, you need to install the `vllm-tpu` package.
+
+    ```bash
+    uv pip install vllm-tpu
+    ```
+
+    !!! note
+        For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
+
 !!! note
     For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
 
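As a closing, hypothetical illustration of the quickstart flow this hunk adds (not part of the commit itself): once `vllm-tpu` is installed, an OpenAI-compatible server is typically started with `vllm serve`; the model name below is only an example.

```bash
# Hypothetical follow-up to the quickstart: serve a small model after installing vllm-tpu.
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```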