# --8<-- [start:installation]

vLLM offers basic model inferencing and serving on the Arm CPU platform, with support for NEON and the FP32, FP16 and BF16 data types.

# --8<-- [end:installation]

# --8<-- [start:requirements]

- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required

# --8<-- [end:requirements]

# --8<-- [start:set-up-using-python]

# --8<-- [end:set-up-using-python]

# --8<-- [start:pre-built-wheels]

Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels contain pre-compiled C++ binaries. Please replace `<version>` in the commands below with a specific version string (e.g., `0.11.2`).

```bash
uv pip install --pre vllm==<version>+cpu --extra-index-url https://wheels.vllm.ai/<version>%2Bcpu/
```

??? console "pip"

    ```bash
    pip install --pre vllm==<version>+cpu --extra-index-url https://wheels.vllm.ai/<version>%2Bcpu/
    ```

The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). For example, if the latest public release is `v0.6.6.post1`, `uv` allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.

!!! note
    Nightly wheels are currently unsupported for this architecture.

# --8<-- [end:pre-built-wheels]

# --8<-- [start:build-wheel-from-source]

First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

```bash
sudo apt-get update -y
sudo apt-get install -y --no-install-recommends ccache git curl wget ca-certificates gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
```

Second, clone the vLLM project:

```bash
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
```

Third, install the required dependencies:

```bash
uv pip install -r requirements/cpu-build.txt --torch-backend cpu
uv pip install -r requirements/cpu.txt --torch-backend cpu
```

??? console "pip"

    ```bash
    pip install --upgrade pip
    pip install -v -r requirements/cpu-build.txt --extra-index-url https://download.pytorch.org/whl/cpu
    pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
    ```

Finally, build and install vLLM:

```bash
VLLM_TARGET_DEVICE=cpu uv pip install . --no-build-isolation
```

If you want to develop vLLM, install it in editable mode instead:

```bash
VLLM_TARGET_DEVICE=cpu uv pip install -e . --no-build-isolation
```

Testing has been conducted on AWS Graviton3 instances for compatibility.

# --8<-- [end:build-wheel-from-source]

# --8<-- [start:pre-built-images]

Currently, there are no pre-built Arm CPU images.

# --8<-- [end:pre-built-images]

# --8<-- [start:build-image-from-source]

```bash
docker build -f docker/Dockerfile.cpu \
        --tag vllm-cpu-env .
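
# (Optional sanity check, not part of the original instructions: confirm the
#  image built above is present before launching the server.)
docker image ls vllm-cpu-env

# Illustrative values for the placeholders in the run command below
# (assumptions to adapt, not defaults): VLLM_CPU_KVCACHE_SPACE=40 reserves
# 40 GiB for the KV cache; VLLM_CPU_OMP_THREADS_BIND=0-29 pins inference
# threads to CPU cores 0-29.
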
# Launching OpenAI server
docker run --rm \
            --privileged=true \
            --shm-size=4g \
            -p 8000:8000 \
            -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
            -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
            vllm-cpu-env \
            --model=meta-llama/Llama-3.2-1B-Instruct \
            --dtype=bfloat16 \
            other vLLM OpenAI server arguments
```

!!! tip
    An alternative to `--privileged=true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.

# --8<-- [end:build-image-from-source]

# --8<-- [start:extra-information]

# --8<-- [end:extra-information]
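
Once the container is running, the server exposes vLLM's OpenAI-compatible API on port 8000. As a quick illustrative check (the model name below simply mirrors the `docker run` example above), you can send a chat completion request:

```bash
# Send a minimal chat completion request to the server started above.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
    }'
```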