diff --git a/docs/getting_started/installation/.nav.yml b/docs/getting_started/installation/.nav.yml index 7acfc015ff508..d4a727c926406 100644 --- a/docs/getting_started/installation/.nav.yml +++ b/docs/getting_started/installation/.nav.yml @@ -2,4 +2,6 @@ nav: - README.md - gpu.md - cpu.md - - ai_accelerator.md \ No newline at end of file + - google_tpu.md + - intel_gaudi.md + - aws_neuron.md diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md index 36bb16cc02249..c5348adfa5283 100644 --- a/docs/getting_started/installation/README.md +++ b/docs/getting_started/installation/README.md @@ -14,7 +14,6 @@ vLLM supports the following hardware platforms: - [ARM AArch64](cpu.md#arm-aarch64) - [Apple silicon](cpu.md#apple-silicon) - [IBM Z (S390X)](cpu.md#ibm-z-s390x) -- [Other AI accelerators](ai_accelerator.md) - - [Google TPU](ai_accelerator.md#google-tpu) - - [Intel Gaudi](ai_accelerator.md#intel-gaudi) - - [AWS Neuron](ai_accelerator.md#aws-neuron) +- [Google TPU](google_tpu.md) +- [Intel Gaudi](intel_gaudi.md) +- [AWS Neuron](aws_neuron.md) diff --git a/docs/getting_started/installation/ai_accelerator.md b/docs/getting_started/installation/ai_accelerator.md deleted file mode 100644 index a4f136a172fed..0000000000000 --- a/docs/getting_started/installation/ai_accelerator.md +++ /dev/null @@ -1,117 +0,0 @@ -# Other AI accelerators - -vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions: - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation" - -## Requirements - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements" - -## Configure a new environment - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment" - -## Set up using Python - -### Pre-built wheels - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels" - -### Build wheel from source - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source" - -## Set up using Docker - -### Pre-built images - -=== "Google TPU" - - --8<-- 
"docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images" - -### Build image from source - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source" - -## Extra information - -=== "Google TPU" - - --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information" - -=== "Intel Gaudi" - - --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information" - -=== "AWS Neuron" - - --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information" diff --git a/docs/getting_started/installation/ai_accelerator/neuron.inc.md b/docs/getting_started/installation/aws_neuron.md similarity index 61% rename from docs/getting_started/installation/ai_accelerator/neuron.inc.md rename to docs/getting_started/installation/aws_neuron.md index 3649cd328088f..6b2efd85f06b1 100644 --- a/docs/getting_started/installation/ai_accelerator/neuron.inc.md +++ b/docs/getting_started/installation/aws_neuron.md @@ -1,15 +1,14 @@ -# --8<-- [start:installation] +# AWS Neuron [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and - generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2, - and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores. - This tab describes how to set up your environment to run vLLM on Neuron. +generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2, +and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores. +This describes how to set up your environment to run vLLM on Neuron. !!! warning There are no pre-built wheels or images for this device, so you must build vLLM from source. 
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+## Requirements

 - OS: Linux
 - Python: 3.9 or newer
@@ -17,8 +16,7 @@
 - Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
 - AWS Neuron SDK 2.23

-# --8<-- [end:requirements]
-# --8<-- [start:configure-a-new-environment]
+## Configure a new environment

 ### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running
+
 ```console
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```
@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin

 NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
 library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

-# --8<-- [end:configure-a-new-environment]
-# --8<-- [start:set-up-using-python]
+## Set up using Python

-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+### Pre-built wheels

 Currently, there are no pre-built Neuron wheels.

-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+### Build wheel from source

-#### Install vLLM from source
-
-Install vllm as follows:
+To build and install vLLM from source, run:

 ```console
 git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```

-AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
- [https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
- available on vLLM V0. Please utilize the AWS Fork for the following features:
+AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2),
+which contains several features in addition to what's available on vLLM V0.
+Please use the AWS fork for the following features:

 - Llama-3.2 multi-modal support
 - Multi-node distributed inference
@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .

 Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.

-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:set-up-using-docker]
+## Set up using Docker

-# --8<-- [end:set-up-using-docker]
-# --8<-- [start:pre-built-images]
+### Pre-built images

 Currently, there are no pre-built Neuron images.

-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+### Build image from source

 See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source]
 for instructions on building the Docker image.

 Make sure to use  in place of the default Dockerfile.
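+
+As a rough sketch of what the build and run commands might look like — assuming the Neuron Dockerfile referenced above is `docker/Dockerfile.neuron` at the repository root and that the Neuron devices are exposed on the host as `/dev/neuron*` (both are assumptions, not confirmed by this page) — you could do:
+
+```console
+# Build a Neuron-enabled vLLM image (substitute the Dockerfile referenced above if the path differs)
+docker build -f docker/Dockerfile.neuron -t vllm-neuron .
+
+# Run the container, passing through the first Neuron device and using the host network
+docker run --device /dev/neuron0 --net host -it vllm-neuron
+```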
-# --8<-- [end:build-image-from-source] -# --8<-- [start:extra-information] +## Extra information [](){ #feature-support-through-nxd-inference-backend } + ### Feature support through NxD Inference backend The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend @@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include + ```console override_neuron_config={ "enable_bucketing":False, } ``` + or when launching vLLM from the CLI, pass + ```console --override-neuron-config "{\"enable_bucketing\":false}" ``` @@ -124,32 +118,30 @@ Alternatively, users can directly call the NxDI library to trace and compile you ### Known limitations - EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this - [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility) - for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI. + [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility) + for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI. - Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this - [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html) - to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM. + [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html) + to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM. - Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at - runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py) + runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py) - Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed - to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature. + to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature. - Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer - to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node) - to run. 
Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main. + to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node) + to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main. - Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches - max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt - to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support - for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is - implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic. - + max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt + to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support + for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is + implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic. ### Environment variables + - `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid - compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the - artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set, - but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts - under this specified path. + compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the + artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set, + but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts + under this specified path. - `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend). - `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend). - -# --8<-- [end:extra-information] diff --git a/docs/getting_started/installation/ai_accelerator/tpu.inc.md b/docs/getting_started/installation/google_tpu.md similarity index 89% rename from docs/getting_started/installation/ai_accelerator/tpu.inc.md rename to docs/getting_started/installation/google_tpu.md index 8bddf0bab2588..0cb10b8de835e 100644 --- a/docs/getting_started/installation/ai_accelerator/tpu.inc.md +++ b/docs/getting_started/installation/google_tpu.md @@ -1,4 +1,4 @@ -# --8<-- [start:installation] +# Google TPU Tensor Processing Units (TPUs) are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs @@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp !!! 
warning There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source. -# --8<-- [end:installation] -# --8<-- [start:requirements] +## Requirements - Google Cloud TPU VM - TPU versions: v6e, v5e, v5p, v4 @@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see: - - -# --8<-- [end:requirements] -# --8<-- [start:configure-a-new-environment] +## Configure a new environment ### Provision a Cloud TPU with the queued resource API @@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE [TPU VM images]: https://cloud.google.com/tpu/docs/runtimes [TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones -# --8<-- [end:configure-a-new-environment] -# --8<-- [start:set-up-using-python] +## Set up using Python -# --8<-- [end:set-up-using-python] -# --8<-- [start:pre-built-wheels] +### Pre-built wheels Currently, there are no pre-built TPU wheels. -# --8<-- [end:pre-built-wheels] -# --8<-- [start:build-wheel-from-source] +### Build wheel from source Install Miniconda: @@ -142,7 +137,7 @@ Install build dependencies: ```bash pip install -r requirements/tpu.txt -sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev +sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev ``` Run the setup script: @@ -151,16 +146,13 @@ Run the setup script: VLLM_TARGET_DEVICE="tpu" python -m pip install -e . ``` -# --8<-- [end:build-wheel-from-source] -# --8<-- [start:set-up-using-docker] +## Set up using Docker -# --8<-- [end:set-up-using-docker] -# --8<-- [start:pre-built-images] +### Pre-built images See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`. -# --8<-- [end:pre-built-images] -# --8<-- [start:build-image-from-source] +### Build image from source You can use to build a Docker image with TPU support. @@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu Install OpenBLAS with the following command: ```console - sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev + sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev ``` - -# --8<-- [end:build-image-from-source] -# --8<-- [start:extra-information] - -There is no extra information for this device. -# --8<-- [end:extra-information] diff --git a/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md b/docs/getting_started/installation/intel_gaudi.md similarity index 97% rename from docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md rename to docs/getting_started/installation/intel_gaudi.md index 71ec7e2cc2c6d..f5970850aae71 100644 --- a/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md +++ b/docs/getting_started/installation/intel_gaudi.md @@ -1,12 +1,11 @@ -# --8<-- [start:installation] +# Intel Gaudi -This tab provides instructions on running vLLM with Intel Gaudi devices. +This page provides instructions on running vLLM with Intel Gaudi devices. !!! warning There are no pre-built wheels or images for this device, so you must build vLLM from source. -# --8<-- [end:installation] -# --8<-- [start:requirements] +## Requirements - OS: Ubuntu 22.04 LTS - Python: 3.10 @@ -19,8 +18,7 @@ to set up the execution environment. 
To achieve the best performance, please follow the methods outlined in the [Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html). -# --8<-- [end:requirements] -# --8<-- [start:configure-a-new-environment] +## Configure a new environment ### Environment verification @@ -57,16 +55,13 @@ docker run \ vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest ``` -# --8<-- [end:configure-a-new-environment] -# --8<-- [start:set-up-using-python] +## Set up using Python -# --8<-- [end:set-up-using-python] -# --8<-- [start:pre-built-wheels] +### Pre-built wheels Currently, there are no pre-built Intel Gaudi wheels. -# --8<-- [end:pre-built-wheels] -# --8<-- [start:build-wheel-from-source] +### Build wheel from source To build and install vLLM from source, run: @@ -87,16 +82,13 @@ pip install -r requirements/hpu.txt python setup.py develop ``` -# --8<-- [end:build-wheel-from-source] -# --8<-- [start:set-up-using-docker] +## Set up using Docker -# --8<-- [end:set-up-using-docker] -# --8<-- [start:pre-built-images] +### Pre-built images Currently, there are no pre-built Intel Gaudi images. -# --8<-- [end:pre-built-images] -# --8<-- [start:build-image-from-source] +### Build image from source ```console docker build -f docker/Dockerfile.hpu -t vllm-hpu-env . @@ -113,10 +105,9 @@ docker run \ !!! tip If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered. -# --8<-- [end:build-image-from-source] -# --8<-- [start:extra-information] +## Extra information -## Supported features +### Supported features - [Offline inference][offline-inference] - Online serving via [OpenAI-Compatible Server][openai-compatible-server] @@ -130,14 +121,14 @@ docker run \ for accelerating low-batch latency and throughput - Attention with Linear Biases (ALiBi) -## Unsupported features +### Unsupported features - Beam search - LoRA adapters - Quantization - Prefill chunking (mixed-batch inferencing) -## Supported configurations +### Supported configurations The following configurations have been validated to function with Gaudi2 devices. Configurations that are not listed may or may not work. @@ -401,4 +392,3 @@ the below: higher batches. You can do that by adding `--enforce-eager` flag to server (for online serving), or by passing `enforce_eager=True` argument to LLM constructor (for offline inference). -# --8<-- [end:extra-information]
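+
+For example, a minimal sketch of starting the server in eager mode (the model name is a placeholder; substitute your own):
+
+```console
+# Disable HPU graphs to leave more device memory for larger batches
+vllm serve <your-model> --enforce-eager
+```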