--8<-- [start:installation]
vLLM supports basic model inference and serving on the x86 CPU platform, with FP32, FP16, and BF16 data types.
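For orientation, serving a model on CPU after installation can look like the sketch below. The model name and the `VLLM_CPU_KVCACHE_SPACE` / `VLLM_CPU_OMP_THREADS_BIND` environment variables are taken from the Docker example later on this page; the concrete values (8 GiB of KV cache, cores 0-7) are illustrative assumptions to adjust for your machine.

```bash
# Sketch: start the OpenAI-compatible server on CPU with BF16 weights.
# VLLM_CPU_KVCACHE_SPACE is the KV cache size in GiB (assumed value).
# VLLM_CPU_OMP_THREADS_BIND pins OpenMP threads to CPU cores (assumed range).
VLLM_CPU_KVCACHE_SPACE=8 \
VLLM_CPU_OMP_THREADS_BIND=0-7 \
vllm serve meta-llama/Llama-3.2-1B-Instruct --dtype bfloat16
```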
--8<-- [end:installation]
--8<-- [start:requirements]
- OS: Linux
- CPU flags: `avx512f` (recommended), `avx512_bf16` (optional), `avx512_vnni` (optional)

!!! tip
    Use `lscpu` to check the CPU flags.
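For example, a quick way to list the AVX-512 features reported by the kernel (the exact flag set varies by CPU):

```bash
# Extract the AVX-512-related flags from the lscpu output
lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u
```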
--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]
--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
--8<-- "docs/getting_started/installation/cpu/build.inc.md"
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
Pre-built vLLM CPU images are published at <https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo>.

!!! warning
    If you deploy the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. It is recommended to build images for such machines with the appropriate build arguments (e.g. `--build-arg VLLM_CPU_DISABLE_AVX512=true`, `--build-arg VLLM_CPU_AVX512BF16=false`, or `--build-arg VLLM_CPU_AVX512VNNI=false`) to disable the unsupported features. Note that without `avx512f`, AVX2 is used instead; this version is not recommended because it has only basic feature support.
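To fetch one of these images, a pull command might look like the sketch below. It assumes the standard ECR Public naming scheme for the gallery URL above; replace `<tag>` with an actual release tag from the gallery.

```bash
# Pull a pre-built CPU image (replace <tag> with a release tag from the gallery)
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:<tag>
```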
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_AVX512BF16=false (default)|true \
        --build-arg VLLM_CPU_AVX512VNNI=false (default)|true \
        --build-arg VLLM_CPU_DISABLE_AVX512=false (default)|true \
        --tag vllm-cpu-env \
        --target vllm-openai .
```
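For instance, a concrete invocation that enables the optional AVX512-BF16 and AVX512-VNNI paths (only do this if the target machines support those CPU flags; the choice here is illustrative):

```bash
# Example build with the optional BF16 and VNNI paths enabled
docker build -f docker/Dockerfile.cpu \
        --build-arg VLLM_CPU_AVX512BF16=true \
        --build-arg VLLM_CPU_AVX512VNNI=true \
        --tag vllm-cpu-env \
        --target vllm-openai .
```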
```bash
# Launching the OpenAI-compatible server
docker run --rm \
           --security-opt seccomp=unconfined \
           --cap-add SYS_NICE \
           --shm-size=4g \
           -p 8000:8000 \
           -e VLLM_CPU_KVCACHE_SPACE=<KV cache space> \
           -e VLLM_CPU_OMP_THREADS_BIND=<CPU cores for inference> \
           vllm-cpu-env \
           --model=meta-llama/Llama-3.2-1B-Instruct \
           --dtype=bfloat16 \
           <other vLLM OpenAI server arguments>
```
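Once the container reports that the server is ready, you can exercise the OpenAI-compatible API. The prompt and token count below are arbitrary examples:

```bash
# Query the completions endpoint of the running server
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Llama-3.2-1B-Instruct",
          "prompt": "San Francisco is a",
          "max_tokens": 16
        }'
```

--8<-- [end:build-image-from-source]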