[Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com>
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com>
This commit is contained in:
parent 08075c3448
commit 32a1ee74a0
@ -3,7 +3,13 @@

Installation with CPU
========================

vLLM initially supports basic model inferencing and serving on x86 CPU platforms, with data types FP32 and BF16. The vLLM CPU backend supports the following vLLM features:

- Tensor Parallel (``-tp = N``)
- Quantization (``INT8 W8A8, AWQ``)
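
For a quick start, here is a minimal sketch of serving a model on the CPU backend with BF16 weights; the model name and the ``VLLM_CPU_KVCACHE_SPACE`` value are placeholders to adapt to your setup:

.. code-block:: console

   $ VLLM_CPU_KVCACHE_SPACE=40 vllm serve meta-llama/Llama-2-7b-chat-hf --dtype bfloat16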

.. note::

   FP16 data type and more advanced features such as `chunked-prefill`, `prefix-caching` and `FP8 KV cache` are under development and will be available soon.

Table of contents:

@ -141,5 +147,20 @@ Performance tips

- If using the vLLM CPU backend on a multi-socket machine with NUMA, make sure to set the CPU cores via ``VLLM_CPU_OMP_THREADS_BIND`` to avoid cross-NUMA-node memory access. You can inspect the NUMA topology as sketched below.
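
  For example, on a hypothetical two-socket machine with 32 physical cores per socket (the core IDs below are only an illustration), you can inspect the topology with standard Linux tools and then bind the inference threads to the cores of a single node:

  .. code-block:: console

     $ lscpu | grep -i numa           # number of NUMA nodes and their CPU ranges
     $ numactl --hardware             # per-node CPU lists and memory sizes
     $ export VLLM_CPU_OMP_THREADS_BIND="0-31"   # cores of NUMA node 0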

CPU Backend Considerations
--------------------------

- The CPU backend significantly differs from the GPU backend since the vLLM architecture was originally optimized for GPU use. A number of optimizations are needed to enhance its performance.

- Decouple the HTTP serving components from the inference components. In a GPU backend configuration, the HTTP serving and tokenization tasks operate on the CPU while inference runs on the GPU, which typically does not pose a problem. However, in a CPU-based setup, the HTTP serving and tokenization can cause significant context switching and reduced cache efficiency. It is therefore strongly recommended to segregate these two components for improved performance (one possible approach is sketched below).
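
  As an illustration (assuming a 32-core NUMA node; the split is a placeholder, not a tuned value), leaving a couple of cores out of ``VLLM_CPU_OMP_THREADS_BIND`` keeps the HTTP serving and tokenization threads from competing with the inference threads:

  .. code-block:: console

     $ # cores 0-29 for inference threads, cores 30-31 left free for serving and tokenization
     $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-29" vllm serve meta-llama/Llama-2-7b-chat-hf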

- In a CPU-based setup with NUMA enabled, memory access performance may be largely impacted by the `topology <https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa>`_. For NUMA architecture, two optimizations are recommended: Tensor Parallel or Data Parallel.

  * Using Tensor Parallel for a latency-constrained deployment: following the GPU backend design, Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two-NUMA-node system). With the `TP feature on CPU <https://github.com/vllm-project/vllm/pull/6125>`_ merged, Tensor Parallel is supported for serving and offline inferencing. In general, each NUMA node is treated as one GPU card. Below is an example script to enable Tensor Parallel = 2 for serving:

    .. code-block:: console

       $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp

  * Using Data Parallel for maximum throughput: launch an LLM serving endpoint on each NUMA node, along with one additional load balancer to dispatch requests to those endpoints. Common solutions like `Nginx <../serving/deploying_with_nginx.html>`_ or HAProxy are recommended. The Anyscale Ray project provides a feature for LLM `serving <https://docs.ray.io/en/latest/serve/index.html>`_. Here is an example of setting up scalable LLM serving with `Ray Serve <https://github.com/intel/llm-on-ray/blob/main/docs/setup.md>`_. A minimal per-node launch is sketched below.
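
    For instance, a minimal per-node launch on a two-NUMA-node machine (core ranges and ports are assumptions to adapt) starts one endpoint per node on its own port; any of the load balancers above can then dispatch requests across the two ports:

    .. code-block:: console

       $ # endpoint bound to NUMA node 0
       $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31" vllm serve meta-llama/Llama-2-7b-chat-hf --port 8001 &
       $ # endpoint bound to NUMA node 1
       $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="32-63" vllm serve meta-llama/Llama-2-7b-chat-hf --port 8002 &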
@ -80,6 +80,7 @@ Documentation

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/deploying_with_k8s
   serving/deploying_with_nginx
   serving/distributed_serving
   serving/metrics
   serving/env_vars

docs/source/serving/deploying_with_nginx.rst (new file, 142 lines)
@ -0,0 +1,142 @@

.. _nginxloadbalancer:

Deploying with Nginx Loadbalancer
=================================

This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

Table of contents:

#. :ref:`Build Nginx Container <nginxloadbalancer_nginx_build>`
#. :ref:`Create Simple Nginx Config file <nginxloadbalancer_nginx_conf>`
#. :ref:`Build vLLM Container <nginxloadbalancer_nginx_vllm_container>`
#. :ref:`Create Docker Network <nginxloadbalancer_nginx_docker_network>`
#. :ref:`Launch vLLM Containers <nginxloadbalancer_nginx_launch_container>`
#. :ref:`Launch Nginx <nginxloadbalancer_nginx_launch_nginx>`
#. :ref:`Verify That vLLM Servers Are Ready <nginxloadbalancer_nginx_verify_nginx>`

.. _nginxloadbalancer_nginx_build:

Build Nginx Container
---------------------

This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.

.. code-block:: console

   export vllm_root=`pwd`

Create a file named ``Dockerfile.nginx``:

.. code-block:: console

   FROM nginx:latest
   RUN rm /etc/nginx/conf.d/default.conf
   EXPOSE 80
   CMD ["nginx", "-g", "daemon off;"]

Build the container:

.. code-block:: console

   docker build . -f Dockerfile.nginx --tag nginx-lb

.. _nginxloadbalancer_nginx_conf:

Create Simple Nginx Config file
-------------------------------

Create a file named ``nginx_conf/nginx.conf``. Note that you can add as many servers as you'd like. In the example below we start with two. To add more, add another ``server vllmN:8000 max_fails=3 fail_timeout=10000s;`` entry to ``upstream backend``.

.. code-block:: console

   upstream backend {
       least_conn;
       server vllm0:8000 max_fails=3 fail_timeout=10000s;
       server vllm1:8000 max_fails=3 fail_timeout=10000s;
   }
   server {
       listen 80;
       location / {
           proxy_pass http://backend;
           proxy_set_header Host $host;
           proxy_set_header X-Real-IP $remote_addr;
           proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
           proxy_set_header X-Forwarded-Proto $scheme;
       }
   }

.. _nginxloadbalancer_nginx_vllm_container:

Build vLLM Container
--------------------

.. code-block:: console

   cd $vllm_root
   docker build -f Dockerfile . --tag vllm

If you are behind a proxy, you can pass the proxy settings to the docker build command as shown below:

.. code-block:: console

   cd $vllm_root
   docker build -f Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy

.. _nginxloadbalancer_nginx_docker_network:

Create Docker Network
---------------------

.. code-block:: console

   docker network create vllm_nginx
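
To confirm that the network exists before launching the containers (an optional check, not part of the original steps), you can list it:

.. code-block:: console

   docker network ls | grep vllm_nginx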

.. _nginxloadbalancer_nginx_launch_container:

Launch vLLM Containers
----------------------

Notes:

* If you have your HuggingFace models cached somewhere else, update ``hf_cache_dir`` below.
* If you don't have an existing HuggingFace cache, you will want to start ``vllm0`` and wait for the model to finish downloading and the server to be ready. This will ensure that ``vllm1`` can leverage the model you just downloaded and it won't have to be downloaded again.
* The example below assumes a GPU backend is used. If you are using a CPU backend, remove ``--gpus all`` and add the ``VLLM_CPU_KVCACHE_SPACE`` and ``VLLM_CPU_OMP_THREADS_BIND`` environment variables to the ``docker run`` command (see the sketch after the GPU example below).
* Adjust the model name that you want to use in your vLLM servers if you don't want to use ``Llama-2-7b-chat-hf``.

.. code-block:: console

   mkdir -p ~/.cache/huggingface/hub/
   hf_cache_dir=~/.cache/huggingface/
   docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
   docker run -itd --ipc host --privileged --network vllm_nginx --gpus all --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
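
For the CPU backend, a rough equivalent of the launch above (the KV cache size and core ranges are placeholder values to tune for your machine) drops ``--gpus all`` and passes the CPU environment variables instead:

.. code-block:: console

   docker run -itd --ipc host --privileged --network vllm_nginx --shm-size=10.24gb -e VLLM_CPU_KVCACHE_SPACE=40 -e VLLM_CPU_OMP_THREADS_BIND="0-31" -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
   docker run -itd --ipc host --privileged --network vllm_nginx --shm-size=10.24gb -e VLLM_CPU_KVCACHE_SPACE=40 -e VLLM_CPU_OMP_THREADS_BIND="32-63" -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf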

.. note::

   If you are behind a proxy, you can pass the proxy settings to the docker run command via ``-e http_proxy=$http_proxy -e https_proxy=$https_proxy``.

.. _nginxloadbalancer_nginx_launch_nginx:

Launch Nginx
------------

.. code-block:: console

   docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest

.. _nginxloadbalancer_nginx_verify_nginx:

Verify That vLLM Servers Are Ready
----------------------------------

.. code-block:: console

   docker logs vllm0 | grep Uvicorn
   docker logs vllm1 | grep Uvicorn

Both outputs should look like this:

.. code-block:: console

   INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
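
As a final, optional end-to-end check (not part of the original steps), send a request through the Nginx endpoint on port 8000 and confirm that the model is listed:

.. code-block:: console

   curl http://localhost:8000/v1/models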