[doc] split "Other AI Accelerators" tabs (#19708)
Parent: 154d063b9f
Commit: 93aee29fdb
@@ -2,4 +2,6 @@ nav:
  - README.md
  - gpu.md
  - cpu.md
  - ai_accelerator.md
  - google_tpu.md
  - intel_gaudi.md
  - aws_neuron.md
@@ -14,7 +14,6 @@ vLLM supports the following hardware platforms:
- [ARM AArch64](cpu.md#arm-aarch64)
- [Apple silicon](cpu.md#apple-silicon)
- [IBM Z (S390X)](cpu.md#ibm-z-s390x)
- [Other AI accelerators](ai_accelerator.md)
    - [Google TPU](ai_accelerator.md#google-tpu)
    - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
    - [AWS Neuron](ai_accelerator.md#aws-neuron)
- [Google TPU](google_tpu.md)
- [Intel Gaudi](intel_gaudi.md)
- [AWS Neuron](aws_neuron.md)
@@ -1,117 +0,0 @@
# Other AI accelerators

vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"

## Requirements

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"

## Configure a new environment

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"

## Set up using Python

### Pre-built wheels

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"

### Build wheel from source

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"

## Set up using Docker

### Pre-built images

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"

### Build image from source

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"

## Extra information

=== "Google TPU"

    --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"

=== "Intel Gaudi"

    --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"

=== "AWS Neuron"

    --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
@@ -1,15 +1,14 @@
# --8<-- [start:installation]
# AWS Neuron

[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This tab describes how to set up your environment to run vLLM on Neuron.
generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
This describes how to set up your environment to run vLLM on Neuron.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

# --8<-- [end:installation]
# --8<-- [start:requirements]
## Requirements

- OS: Linux
- Python: 3.9 or newer
@@ -17,8 +16,7 @@
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
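
If you want to sanity-check these requirements on an existing instance, the sketch below assumes the Neuron SDK command-line tools and pip packages are already installed; it is illustrative only and not part of this change.

```console
# List the NeuronCores visible to this instance (neuron-ls ships with the Neuron SDK tools)
neuron-ls
# Show which Neuron pip packages (and versions) are present
pip list | grep -i neuron
```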

# --8<-- [end:requirements]
# --8<-- [start:configure-a-new-environment]
## Configure a new environment

### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies

@@ -27,6 +25,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
- Once inside your instance, activate the pre-installed virtual environment for inference by running

```console
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
```
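
After activating the environment, a quick check that the NxD Inference stack is present helps catch a wrong venv early; the package names below are the standard Neuron ones and are an assumption, not part of this diff.

```console
# Confirm the NxD Inference packages are installed in the activated venv
pip show neuronx-distributed-inference torch-neuronx
```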

@@ -38,20 +37,15 @@ for alternative setup instructions including using Docker and manually installin
NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).

# --8<-- [end:configure-a-new-environment]
# --8<-- [start:set-up-using-python]
## Set up using Python

# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Pre-built wheels

Currently, there are no pre-built Neuron wheels.

# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
### Build wheel from source

#### Install vLLM from source

Install vllm as follows:
To build and install vLLM from source, run:

```console
git clone https://github.com/vllm-project/vllm.git
@@ -61,8 +55,8 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
```
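
Once the editable install finishes, a minimal smoke test (illustrative, not part of this change) is to import the package and print its version:

```console
python -c "import vllm; print(vllm.__version__)"
```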

AWS Neuron maintains a [Github fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
available on vLLM V0. Please utilize the AWS Fork for the following features:
<https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2>, which contains several features in addition to what's
available on vLLM V0. Please utilize the AWS Fork for the following features:

- Llama-3.2 multi-modal support
- Multi-node distributed inference

@@ -81,25 +75,22 @@ VLLM_TARGET_DEVICE="neuron" pip install -e .
Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardwares is not tested.

# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
## Set up using Docker

# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Pre-built images

Currently, there are no pre-built Neuron images.

# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
### Build image from source

See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.

Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.

# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
## Extra information

[](){ #feature-support-through-nxd-inference-backend }

### Feature support through NxD Inference backend

The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
@@ -108,12 +99,15 @@ to perform most of the heavy lifting which includes PyTorch model initialization
To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include

```console
override_neuron_config={
    "enable_bucketing":False,
}
```

or when launching vLLM from the CLI, pass

```console
--override-neuron-config "{\"enable_bucketing\":false}"
```
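
Putting it together, a full server launch might look like the sketch below; the model name and parallelism degree are placeholders, and only `--override-neuron-config` is taken from this page:

```console
# Illustrative only: replace <your-model> and the tensor-parallel degree with values for your setup
vllm serve <your-model> \
    --tensor-parallel-size 2 \
    --override-neuron-config "{\"enable_bucketing\":false}"
```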

@@ -124,32 +118,30 @@ Alternatively, users can directly call the NxDI library to trace and compile you
### Known limitations

- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
  for how to convert pretrained EAGLE model checkpoints to be compatible for NxDI.
- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
  runtime is not currently supported. Refer to [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py)
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge case bug in speculative decoding: An edge case failure may occur in speculative decoding when sequence length approaches
  max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
  to allocate an additional block to ensure there is enough memory for number of lookahead slots, but since we do not have good support
  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
  implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.

### Environment variables

- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
  artifacts under `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
  but the directory does not exist, or the contents are invalid, Neuron will also fallback to a new compilation and store the artifacts
  under this specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: Bucket sizes for context encoding. (Only applicable to `transformers-neuronx` backend).
- `NEURON_TOKEN_GEN_BUCKETS`: Bucket sizes for token generation. (Only applicable to `transformers-neuronx` backend).
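
As a concrete sketch for `NEURON_COMPILED_ARTIFACTS` (the artifact path and serve command are illustrative assumptions, not part of this diff), the variable is simply exported before starting the server:

```console
# Reuse previously compiled Neuron artifacts to skip compilation at startup
export NEURON_COMPILED_ARTIFACTS=/path/to/neuron-compiled-artifacts
vllm serve <your-model>
```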

# --8<-- [end:extra-information]
@@ -1,4 +1,4 @@
# --8<-- [start:installation]
# Google TPU

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs

@@ -33,8 +33,7 @@ information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp
!!! warning
    There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.

# --8<-- [end:installation]
# --8<-- [start:requirements]
## Requirements

- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4

@@ -63,8 +62,7 @@ For more information about using TPUs with GKE, see:
- <https://cloud.google.com/kubernetes-engine/docs/concepts/tpus>
- <https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus>

# --8<-- [end:requirements]
# --8<-- [start:configure-a-new-environment]
## Configure a new environment

### Provision a Cloud TPU with the queued resource API

@@ -100,16 +98,13 @@ gcloud compute tpus tpu-vm ssh TPU_NAME --project PROJECT_ID --zone ZONE
[TPU VM images]: https://cloud.google.com/tpu/docs/runtimes
[TPU regions and zones]: https://cloud.google.com/tpu/docs/regions-zones

# --8<-- [end:configure-a-new-environment]
# --8<-- [start:set-up-using-python]
## Set up using Python

# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Pre-built wheels

Currently, there are no pre-built TPU wheels.

# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
### Build wheel from source

Install Miniconda:

@@ -142,7 +137,7 @@ Install build dependencies:

```bash
pip install -r requirements/tpu.txt
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```

Run the setup script:

@@ -151,16 +146,13 @@
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```

# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
## Set up using Docker

# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Pre-built images

See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
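
For example, a run command along the lines of the sketch below should work; the flags mirror the source-built image example later on this page, and the exact image tag you need may differ:

```console
docker run --privileged --net host --shm-size=16G -it vllm/vllm-tpu
```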

# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
### Build image from source

You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.

@@ -194,11 +186,5 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu
Install OpenBLAS with the following command:

```console
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
```

# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]

There is no extra information for this device.
# --8<-- [end:extra-information]
@@ -1,12 +1,11 @@
# --8<-- [start:installation]
# Intel Gaudi

This tab provides instructions on running vLLM with Intel Gaudi devices.
This page provides instructions on running vLLM with Intel Gaudi devices.

!!! warning
    There are no pre-built wheels or images for this device, so you must build vLLM from source.

# --8<-- [end:installation]
# --8<-- [start:requirements]
## Requirements

- OS: Ubuntu 22.04 LTS
- Python: 3.10

@@ -19,8 +18,7 @@ to set up the execution environment. To achieve the best performance,
please follow the methods outlined in the
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

# --8<-- [end:requirements]
# --8<-- [start:configure-a-new-environment]
## Configure a new environment

### Environment verification

@@ -57,16 +55,13 @@ docker run \
vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```

# --8<-- [end:configure-a-new-environment]
# --8<-- [start:set-up-using-python]
## Set up using Python

# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
### Pre-built wheels

Currently, there are no pre-built Intel Gaudi wheels.

# --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source]
### Build wheel from source

To build and install vLLM from source, run:

@@ -87,16 +82,13 @@ pip install -r requirements/hpu.txt
python setup.py develop
```

# --8<-- [end:build-wheel-from-source]
# --8<-- [start:set-up-using-docker]
## Set up using Docker

# --8<-- [end:set-up-using-docker]
# --8<-- [start:pre-built-images]
### Pre-built images

Currently, there are no pre-built Intel Gaudi images.

# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
### Build image from source

```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .

@@ -113,10 +105,9 @@ docker run \
!!! tip
    If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
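
A quick way to check this (a sketch; file locations follow a stock Docker setup and are not part of this diff) is to list Docker's registered runtimes and inspect the daemon configuration:

```console
# 'habana' should appear among the registered runtimes
docker info | grep -i runtime
cat /etc/docker/daemon.json
```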

# --8<-- [end:build-image-from-source]
# --8<-- [start:extra-information]
## Extra information

## Supported features
### Supported features

- [Offline inference][offline-inference]
- Online serving via [OpenAI-Compatible Server][openai-compatible-server]

@@ -130,14 +121,14 @@
for accelerating low-batch latency and throughput
- Attention with Linear Biases (ALiBi)

## Unsupported features
### Unsupported features

- Beam search
- LoRA adapters
- Quantization
- Prefill chunking (mixed-batch inferencing)

## Supported configurations
### Supported configurations

The following configurations have been validated to function with
Gaudi2 devices. Configurations that are not listed may or may not work.

@@ -401,4 +392,3 @@ the below:
higher batches. You can do that by adding `--enforce-eager` flag to
server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference).
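
For online serving this amounts to a single flag on the server command; the model name below is a placeholder and the exact invocation is a sketch rather than part of this change:

```console
vllm serve <your-model> --enforce-eager
```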

# --8<-- [end:extra-information]