mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-06-06 04:49:09 +08:00
[doc] improve readability (#18675)
Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
This commit is contained in:
parent
624b77a2b3
commit
279f854519
@ -26,7 +26,12 @@ The edges of the build graph represent:
|
|||||||
> Commands to regenerate the build graph (make sure to run it **from the \`root\` directory of the vLLM repository** where the dockerfile is present):
|
> Commands to regenerate the build graph (make sure to run it **from the \`root\` directory of the vLLM repository** where the dockerfile is present):
|
||||||
>
|
>
|
||||||
> ```bash
|
> ```bash
|
||||||
> dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
|
> dockerfilegraph \
|
||||||
|
> -o png \
|
||||||
|
> --legend \
|
||||||
|
> --dpi 200 \
|
||||||
|
> --max-label-length 50 \
|
||||||
|
> --filename docker/Dockerfile
|
||||||
> ```
|
> ```
|
||||||
>
|
>
|
||||||
> or in case you want to run it directly with the docker image:
|
> or in case you want to run it directly with the docker image:
|
||||||
|
|||||||
@ -41,7 +41,10 @@ If your model imports modules that initialize CUDA, consider lazy-importing it t
|
|||||||
```python
|
```python
|
||||||
from vllm import ModelRegistry
|
from vllm import ModelRegistry
|
||||||
|
|
||||||
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
|
ModelRegistry.register_model(
|
||||||
|
"YourModelForCausalLM",
|
||||||
|
"your_code:YourModelForCausalLM"
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! warning
|
!!! warning
|
||||||
|
|||||||
@ -11,7 +11,7 @@ vLLM offers an official Docker image for deployment.
|
|||||||
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
|
The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
|
||||||
|
|
||||||
```console
|
```console
|
||||||
$ docker run --runtime nvidia --gpus all \
|
docker run --runtime nvidia --gpus all \
|
||||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||||
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
|
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
|
||||||
-p 8000:8000 \
|
-p 8000:8000 \
|
||||||
@ -23,7 +23,7 @@ $ docker run --runtime nvidia --gpus all \
|
|||||||
This image can also be used with other container engines such as [Podman](https://podman.io/).
|
This image can also be used with other container engines such as [Podman](https://podman.io/).
|
||||||
|
|
||||||
```console
|
```console
|
||||||
$ podman run --gpus all \
|
podman run --gpus all \
|
||||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||||
-p 8000:8000 \
|
-p 8000:8000 \
|
||||||
@ -73,7 +73,10 @@ You can build and run vLLM from source via the provided <gh-file:docker/Dockerfi
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
|
# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
|
||||||
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile
|
DOCKER_BUILDKIT=1 docker build . \
|
||||||
|
--target vllm-openai \
|
||||||
|
--tag vllm/vllm-openai \
|
||||||
|
--file docker/Dockerfile
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
@ -96,8 +99,8 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
|
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
|
||||||
$ python3 use_existing_torch.py
|
python3 use_existing_torch.py
|
||||||
$ DOCKER_BUILDKIT=1 docker build . \
|
DOCKER_BUILDKIT=1 docker build . \
|
||||||
--file docker/Dockerfile \
|
--file docker/Dockerfile \
|
||||||
--target vllm-openai \
|
--target vllm-openai \
|
||||||
--platform "linux/arm64" \
|
--platform "linux/arm64" \
|
||||||
@ -113,7 +116,7 @@ $ DOCKER_BUILDKIT=1 docker build . \
|
|||||||
To run vLLM with the custom-built Docker image:
|
To run vLLM with the custom-built Docker image:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
$ docker run --runtime nvidia --gpus all \
|
docker run --runtime nvidia --gpus all \
|
||||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||||
-p 8000:8000 \
|
-p 8000:8000 \
|
||||||
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
|
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
|
||||||
|
|||||||
@ -82,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the
|
|||||||
**Optional**: Serve the 70B model instead of the default 8B and use more GPU:
|
**Optional**: Serve the 70B model instead of the default 8B and use more GPU:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
|
HF_TOKEN="your-huggingface-token" \
|
||||||
|
sky launch serving.yaml \
|
||||||
|
--gpus A100:8 \
|
||||||
|
--env HF_TOKEN \
|
||||||
|
--env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
|
||||||
```
|
```
|
||||||
|
|
||||||
## Scale up to multiple replicas
|
## Scale up to multiple replicas
|
||||||
@ -155,7 +159,9 @@ run: |
|
|||||||
Start the serving the Llama-3 8B model on multiple replicas:
|
Start the serving the Llama-3 8B model on multiple replicas:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
|
HF_TOKEN="your-huggingface-token" \
|
||||||
|
sky serve up -n vllm serving.yaml \
|
||||||
|
--env HF_TOKEN
|
||||||
```
|
```
|
||||||
|
|
||||||
Wait until the service is ready:
|
Wait until the service is ready:
|
||||||
@ -318,7 +324,9 @@ run: |
|
|||||||
1. Start the chat web UI:
|
1. Start the chat web UI:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
|
sky launch \
|
||||||
|
-c gui ./gui.yaml \
|
||||||
|
--env ENDPOINT=$(sky serve status --endpoint vllm)
|
||||||
```
|
```
|
||||||
|
|
||||||
2. Then, we can access the GUI at the returned gradio link:
|
2. Then, we can access the GUI at the returned gradio link:
|
||||||
|
|||||||
@ -33,7 +33,8 @@ pip install streamlit openai
|
|||||||
streamlit run streamlit_openai_chatbot_webserver.py
|
streamlit run streamlit_openai_chatbot_webserver.py
|
||||||
|
|
||||||
# or specify the VLLM_API_BASE or VLLM_API_KEY
|
# or specify the VLLM_API_BASE or VLLM_API_KEY
|
||||||
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py
|
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
|
||||||
|
streamlit run streamlit_openai_chatbot_webserver.py
|
||||||
|
|
||||||
# start with debug mode to view more details
|
# start with debug mode to view more details
|
||||||
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
|
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
|
||||||
|
|||||||
@ -77,7 +77,11 @@ If you are behind proxy, you can pass the proxy settings to the docker build com
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
cd $vllm_root
|
cd $vllm_root
|
||||||
docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
|
docker build \
|
||||||
|
-f docker/Dockerfile . \
|
||||||
|
--tag vllm \
|
||||||
|
--build-arg http_proxy=$http_proxy \
|
||||||
|
--build-arg https_proxy=$https_proxy
|
||||||
```
|
```
|
||||||
|
|
||||||
[](){ #nginxloadbalancer-nginx-docker-network }
|
[](){ #nginxloadbalancer-nginx-docker-network }
|
||||||
@ -102,8 +106,26 @@ Notes:
|
|||||||
```console
|
```console
|
||||||
mkdir -p ~/.cache/huggingface/hub/
|
mkdir -p ~/.cache/huggingface/hub/
|
||||||
hf_cache_dir=~/.cache/huggingface/
|
hf_cache_dir=~/.cache/huggingface/
|
||||||
docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
|
docker run \
|
||||||
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
|
-itd \
|
||||||
|
--ipc host \
|
||||||
|
--network vllm_nginx \
|
||||||
|
--gpus device=0 \
|
||||||
|
--shm-size=10.24gb \
|
||||||
|
-v $hf_cache_dir:/root/.cache/huggingface/ \
|
||||||
|
-p 8081:8000 \
|
||||||
|
--name vllm0 vllm \
|
||||||
|
--model meta-llama/Llama-2-7b-chat-hf
|
||||||
|
docker run \
|
||||||
|
-itd \
|
||||||
|
--ipc host \
|
||||||
|
--network vllm_nginx \
|
||||||
|
--gpus device=1 \
|
||||||
|
--shm-size=10.24gb \
|
||||||
|
-v $hf_cache_dir:/root/.cache/huggingface/ \
|
||||||
|
-p 8082:8000 \
|
||||||
|
--name vllm1 vllm \
|
||||||
|
--model meta-llama/Llama-2-7b-chat-hf
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
@ -114,7 +136,12 @@ docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24
|
|||||||
## Launch Nginx
|
## Launch Nginx
|
||||||
|
|
||||||
```console
|
```console
|
||||||
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
|
docker run \
|
||||||
|
-itd \
|
||||||
|
-p 8000:80 \
|
||||||
|
--network vllm_nginx \
|
||||||
|
-v ./nginx_conf/:/etc/nginx/conf.d/ \
|
||||||
|
--name nginx-lb nginx-lb:latest
|
||||||
```
|
```
|
||||||
|
|
||||||
[](){ #nginxloadbalancer-nginx-verify-nginx }
|
[](){ #nginxloadbalancer-nginx-verify-nginx }
|
||||||
|
|||||||
@ -42,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"')
|
|||||||
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
|
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
|
python examples/offline_inference/llm_engine_example.py \
|
||||||
|
--model TheBloke/Llama-2-7b-Chat-AWQ \
|
||||||
|
--quantization awq
|
||||||
```
|
```
|
||||||
|
|
||||||
AWQ models are also supported directly through the LLM entrypoint:
|
AWQ models are also supported directly through the LLM entrypoint:
|
||||||
|
|||||||
@ -33,7 +33,12 @@ import torch
|
|||||||
|
|
||||||
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
|
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
|
||||||
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
|
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
|
||||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas")
|
llm = LLM(
|
||||||
|
model=model_id,
|
||||||
|
dtype=torch.bfloat16,
|
||||||
|
trust_remote_code=True,
|
||||||
|
quantization="bitblas"
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Read gptq format checkpoint
|
## Read gptq format checkpoint
|
||||||
@ -44,5 +49,11 @@ import torch
|
|||||||
|
|
||||||
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
|
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
|
||||||
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
|
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
|
||||||
llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024)
|
llm = LLM(
|
||||||
|
model=model_id,
|
||||||
|
dtype=torch.float16,
|
||||||
|
trust_remote_code=True,
|
||||||
|
quantization="bitblas",
|
||||||
|
max_model_len=1024
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|||||||
@ -27,7 +27,11 @@ from vllm import LLM
|
|||||||
import torch
|
import torch
|
||||||
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
|
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
|
||||||
model_id = "unsloth/tinyllama-bnb-4bit"
|
model_id = "unsloth/tinyllama-bnb-4bit"
|
||||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
|
llm = LLM(
|
||||||
|
model=model_id,
|
||||||
|
dtype=torch.bfloat16,
|
||||||
|
trust_remote_code=True
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
## Inflight quantization: load as 4bit quantization
|
## Inflight quantization: load as 4bit quantization
|
||||||
@ -38,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify
|
|||||||
from vllm import LLM
|
from vllm import LLM
|
||||||
import torch
|
import torch
|
||||||
model_id = "huggyllama/llama-7b"
|
model_id = "huggyllama/llama-7b"
|
||||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
|
llm = LLM(
|
||||||
quantization="bitsandbytes")
|
model=model_id,
|
||||||
|
dtype=torch.bfloat16,
|
||||||
|
trust_remote_code=True,
|
||||||
|
quantization="bitsandbytes"
|
||||||
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
## OpenAI Compatible Server
|
## OpenAI Compatible Server
|
||||||
|
|||||||
@ -14,14 +14,17 @@ To run a GGUF model with vLLM, you can download and use the local GGUF model fro
|
|||||||
```console
|
```console
|
||||||
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
|
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
|
||||||
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
|
||||||
|
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
|
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
|
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
|
||||||
|
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
|
||||||
|
--tensor-parallel-size 2
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! warning
|
!!! warning
|
||||||
@ -31,7 +34,9 @@ GGUF assumes that huggingface can convert the metadata to a config file. In case
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
|
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
|
||||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
|
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
|
||||||
|
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
|
||||||
|
--hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also use the GGUF model directly through the LLM entrypoint:
|
You can also use the GGUF model directly through the LLM entrypoint:
|
||||||
|
|||||||
@ -59,7 +59,8 @@ model.save(quant_path)
|
|||||||
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
|
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
|
python examples/offline_inference/llm_engine_example.py \
|
||||||
|
--model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
|
||||||
```
|
```
|
||||||
|
|
||||||
## Using GPTQModel with vLLM's Python API
|
## Using GPTQModel with vLLM's Python API
|
||||||
|
|||||||
@ -7,7 +7,9 @@ We recommend installing the latest torchao nightly with
|
|||||||
```console
|
```console
|
||||||
# Install the latest TorchAO nightly build
|
# Install the latest TorchAO nightly build
|
||||||
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
|
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
|
||||||
pip install --pre torchao>=10.0.0 --index-url https://download.pytorch.org/whl/nightly/cu126
|
pip install \
|
||||||
|
--pre torchao>=10.0.0 \
|
||||||
|
--index-url https://download.pytorch.org/whl/nightly/cu126
|
||||||
```
|
```
|
||||||
|
|
||||||
## Quantizing HuggingFace Models
|
## Quantizing HuggingFace Models
|
||||||
@ -20,7 +22,12 @@ from torchao.quantization import Int8WeightOnlyConfig
|
|||||||
|
|
||||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||||
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
|
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
|
||||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
|
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||||
|
model_name,
|
||||||
|
torch_dtype="auto",
|
||||||
|
device_map="auto",
|
||||||
|
quantization_config=quantization_config
|
||||||
|
)
|
||||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||||
input_text = "What are we having for dinner?"
|
input_text = "What are we having for dinner?"
|
||||||
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
|
||||||
|
|||||||
@ -27,7 +27,8 @@ vLLM currently supports the following reasoning models:
|
|||||||
To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
|
To use reasoning models, you need to specify the `--reasoning-parser` flags when making a request to the chat completion endpoint. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
|
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
|
||||||
|
--reasoning-parser deepseek_r1
|
||||||
```
|
```
|
||||||
|
|
||||||
Next, make a request to the model that should return the reasoning content in the response.
|
Next, make a request to the model that should return the reasoning content in the response.
|
||||||
|
|||||||
@ -45,8 +45,13 @@ for output in outputs:
|
|||||||
To perform the same with an online mode launch the server:
|
To perform the same with an online mode launch the server:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
|
python -m vllm.entrypoints.openai.api_server \
|
||||||
--seed 42 -tp 1 --gpu_memory_utilization 0.8 \
|
--host 0.0.0.0 \
|
||||||
|
--port 8000 \
|
||||||
|
--model facebook/opt-6.7b \
|
||||||
|
--seed 42 \
|
||||||
|
-tp 1 \
|
||||||
|
--gpu_memory_utilization 0.8 \
|
||||||
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
|
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@ -45,7 +45,15 @@ Use the following commands to run a Docker image:
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
||||||
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
docker run \
|
||||||
|
-it \
|
||||||
|
--runtime=habana \
|
||||||
|
-e HABANA_VISIBLE_DEVICES=all \
|
||||||
|
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
|
||||||
|
--cap-add=sys_nice \
|
||||||
|
--net=host \
|
||||||
|
--ipc=host \
|
||||||
|
vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
||||||
```
|
```
|
||||||
|
|
||||||
# --8<-- [end:requirements]
|
# --8<-- [end:requirements]
|
||||||
@ -91,7 +99,14 @@ Currently, there are no pre-built Intel Gaudi images.
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
|
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
|
||||||
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
|
docker run \
|
||||||
|
-it \
|
||||||
|
--runtime=habana \
|
||||||
|
-e HABANA_VISIBLE_DEVICES=all \
|
||||||
|
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
|
||||||
|
--cap-add=sys_nice \
|
||||||
|
--net=host \
|
||||||
|
--rm vllm-hpu-env
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! tip
|
!!! tip
|
||||||
|
|||||||
@ -38,7 +38,8 @@ The installation of drivers and tools wouldn't be necessary, if [Deep Learning A
|
|||||||
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
|
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
|
||||||
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
|
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
|
||||||
EOF
|
EOF
|
||||||
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
|
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB \
|
||||||
|
| sudo apt-key add -
|
||||||
|
|
||||||
# Update OS packages
|
# Update OS packages
|
||||||
sudo apt-get update -y
|
sudo apt-get update -y
|
||||||
@ -96,12 +97,17 @@ source aws_neuron_venv_pytorch/bin/activate
|
|||||||
|
|
||||||
# Install Jupyter notebook kernel
|
# Install Jupyter notebook kernel
|
||||||
pip install ipykernel
|
pip install ipykernel
|
||||||
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
|
python3.10 -m ipykernel install \
|
||||||
|
--user \
|
||||||
|
--name aws_neuron_venv_pytorch \
|
||||||
|
--display-name "Python (torch-neuronx)"
|
||||||
pip install jupyter notebook
|
pip install jupyter notebook
|
||||||
pip install environment_kernels
|
pip install environment_kernels
|
||||||
|
|
||||||
# Set pip repository pointing to the Neuron repository
|
# Set pip repository pointing to the Neuron repository
|
||||||
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
|
python -m pip config set \
|
||||||
|
global.extra-index-url \
|
||||||
|
https://pip.repos.neuron.amazonaws.com
|
||||||
|
|
||||||
# Install wget, awscli
|
# Install wget, awscli
|
||||||
python -m pip install wget
|
python -m pip install wget
|
||||||
|
|||||||
@ -55,7 +55,9 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
|
|||||||
##### Install the latest code using `pip`
|
##### Install the latest code using `pip`
|
||||||
|
|
||||||
```console
|
```console
|
||||||
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
|
pip install -U vllm \
|
||||||
|
--pre \
|
||||||
|
--extra-index-url https://wheels.vllm.ai/nightly
|
||||||
```
|
```
|
||||||
|
|
||||||
`--pre` is required for `pip` to consider pre-released versions.
|
`--pre` is required for `pip` to consider pre-released versions.
|
||||||
@ -63,7 +65,9 @@ pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
|
|||||||
Another way to install the latest code is to use `uv`:
|
Another way to install the latest code is to use `uv`:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
|
uv pip install -U vllm \
|
||||||
|
--torch-backend=auto \
|
||||||
|
--extra-index-url https://wheels.vllm.ai/nightly
|
||||||
```
|
```
|
||||||
|
|
||||||
##### Install specific revisions using `pip`
|
##### Install specific revisions using `pip`
|
||||||
@ -83,7 +87,9 @@ If you want to access the wheels for previous commits (e.g. to bisect the behavi
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
||||||
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
|
uv pip install vllm \
|
||||||
|
--torch-backend=auto \
|
||||||
|
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
|
||||||
```
|
```
|
||||||
|
|
||||||
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
|
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
|
||||||
@ -192,7 +198,11 @@ Additionally, if you have trouble building vLLM, we recommend using the NVIDIA P
|
|||||||
|
|
||||||
```console
|
```console
|
||||||
# Use `--ipc=host` to make sure the shared memory is large enough.
|
# Use `--ipc=host` to make sure the shared memory is large enough.
|
||||||
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
|
docker run \
|
||||||
|
--gpus all \
|
||||||
|
-it \
|
||||||
|
--rm \
|
||||||
|
--ipc=host nvcr.io/nvidia/pytorch:23.10-py3
|
||||||
```
|
```
|
||||||
|
|
||||||
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
|
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
|
||||||
|
|||||||
@ -91,19 +91,22 @@ Currently, there are no pre-built ROCm wheels.
|
|||||||
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
|
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
$ pip install --upgrade pip
|
pip install --upgrade pip
|
||||||
|
|
||||||
# Build & install AMD SMI
|
# Build & install AMD SMI
|
||||||
$ pip install /opt/rocm/share/amd_smi
|
pip install /opt/rocm/share/amd_smi
|
||||||
|
|
||||||
# Install dependencies
|
# Install dependencies
|
||||||
$ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
|
pip install --upgrade numba \
|
||||||
$ pip install "numpy<2"
|
scipy \
|
||||||
$ pip install -r requirements/rocm.txt
|
huggingface-hub[cli,hf_transfer] \
|
||||||
|
setuptools_scm
|
||||||
|
pip install "numpy<2"
|
||||||
|
pip install -r requirements/rocm.txt
|
||||||
|
|
||||||
# Build vLLM for MI210/MI250/MI300.
|
# Build vLLM for MI210/MI250/MI300.
|
||||||
$ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
|
||||||
$ python3 setup.py develop
|
python3 setup.py develop
|
||||||
```
|
```
|
||||||
|
|
||||||
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
||||||
@ -154,7 +157,9 @@ It is important that the user kicks off the docker build using buildkit. Either
|
|||||||
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm_base -t rocm/vllm-dev:base .
|
DOCKER_BUILDKIT=1 docker build \
|
||||||
|
-f docker/Dockerfile.rocm_base \
|
||||||
|
-t rocm/vllm-dev:base .
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Build an image with vLLM
|
#### Build an image with vLLM
|
||||||
@ -189,7 +194,11 @@ DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
|
|||||||
To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
|
To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f docker/Dockerfile.rocm -t vllm-rocm .
|
DOCKER_BUILDKIT=1 docker build \
|
||||||
|
--build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
|
||||||
|
-f docker/Dockerfile.rocm \
|
||||||
|
-t vllm-rocm \
|
||||||
|
.
|
||||||
```
|
```
|
||||||
|
|
||||||
To run the above docker image `vllm-rocm`, use the below command:
|
To run the above docker image `vllm-rocm`, use the below command:
|
||||||
|
|||||||
@ -16,19 +16,25 @@ pip3 install vllm[runai]
|
|||||||
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
|
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
|
--load-format runai_streamer
|
||||||
```
|
```
|
||||||
|
|
||||||
To run model from AWS S3 object store run:
|
To run model from AWS S3 object store run:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
|
vllm serve s3://core-llm/Llama-3-8b \
|
||||||
|
--load-format runai_streamer
|
||||||
```
|
```
|
||||||
|
|
||||||
To run model from a S3 compatible object store run:
|
To run model from a S3 compatible object store run:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
|
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
|
||||||
|
AWS_EC2_METADATA_DISABLED=true \
|
||||||
|
AWS_ENDPOINT_URL=https://storage.googleapis.com \
|
||||||
|
vllm serve s3://core-llm/Llama-3-8b \
|
||||||
|
--load-format runai_streamer
|
||||||
```
|
```
|
||||||
|
|
||||||
## Tunable parameters
|
## Tunable parameters
|
||||||
@ -39,14 +45,18 @@ You can tune `concurrency` that controls the level of concurrency and number of
|
|||||||
For reading from S3, it will be the number of client instances the host is opening to the S3 server.
|
For reading from S3, it will be the number of client instances the host is opening to the S3 server.
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
|
--load-format runai_streamer \
|
||||||
|
--model-loader-extra-config '{"concurrency":16}'
|
||||||
```
|
```
|
||||||
|
|
||||||
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
|
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
|
||||||
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
|
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
|
--load-format runai_streamer \
|
||||||
|
--model-loader-extra-config '{"memory_limit":5368709120}'
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
@ -63,7 +73,9 @@ vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
|
|||||||
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
|
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
|
vllm serve /path/to/sharded/model \
|
||||||
|
--load-format runai_streamer_sharded \
|
||||||
|
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
|
||||||
```
|
```
|
||||||
|
|
||||||
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
|
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
|
||||||
@ -71,7 +83,9 @@ To create sharded model files, you can use the script provided in <gh-file:examp
|
|||||||
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
|
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
|
||||||
|
|
||||||
```console
|
```console
|
||||||
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
|
vllm serve /path/to/sharded/model \
|
||||||
|
--load-format runai_streamer_sharded \
|
||||||
|
--model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
|
|||||||
@ -8,7 +8,9 @@ vLLM provides an HTTP server that implements OpenAI's [Completions API](https://
|
|||||||
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
|
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
|
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
|
||||||
|
--dtype auto \
|
||||||
|
--api-key token-abc123
|
||||||
```
|
```
|
||||||
|
|
||||||
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
|
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
|
||||||
@ -243,7 +245,9 @@ and passing a list of `messages` in the request. Refer to the examples below for
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
|
||||||
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
|
--trust-remote-code \
|
||||||
|
--max-model-len 4096 \
|
||||||
|
--chat-template examples/template_vlm2vec.jinja
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! warning
|
!!! warning
|
||||||
@ -285,7 +289,9 @@ and passing a list of `messages` in the request. Refer to the examples below for
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
|
||||||
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
|
--trust-remote-code \
|
||||||
|
--max-model-len 8192 \
|
||||||
|
--chat-template examples/template_dse_qwen2_vl.jinja
|
||||||
```
|
```
|
||||||
|
|
||||||
!!! warning
|
!!! warning
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user