Mirror of https://git.datalinker.icu/vllm-project/vllm.git (synced 2026-03-16 15:37:13 +08:00)
Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
parent 31c5d0a1b7
commit af107d5a0e
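This commit renames the collapsible admonition markers used throughout the docs: code samples move from generic capitalized titles such as `??? Code` to the new `code` admonition type, while commands, logs, and terminal output move to the new `console` type (optionally with a quoted title), so the two kinds of block are styled distinctly. A minimal sketch of the two forms as they read in the docs after this change; the snippet bodies and the "Command" title are illustrative placeholders taken from elsewhere in this diff:

    ??? code

        ```python
        from vllm import LLM
        ```

    ??? console "Command"

        ```bash
        vllm serve facebook/opt-125m
        ```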
@@ -16,7 +16,7 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
-??? Examples
+??? console "Examples"
@@ -57,7 +57,7 @@ By default, we optimize model inference using CUDA graphs which take up extra me
-??? Code
+??? code
@@ -129,7 +129,7 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
-??? Code
+??? code
@@ -7,7 +7,7 @@ vLLM uses the following environment variables to configure the system:
-??? Code
+??? code
@@ -95,7 +95,7 @@ For additional features and advanced configurations, refer to the official [MkDo
-??? note "Commands"
+??? console "Commands"
@@ -27,7 +27,7 @@ All vLLM modules within the model must include a `prefix` argument in their cons
-??? Code
+??? code
@@ -12,7 +12,7 @@ Further update the model as follows:
-??? Code
+??? code
@@ -41,7 +41,7 @@ Further update the model as follows:
-??? Code
+??? code
@@ -71,7 +71,7 @@ Further update the model as follows:
-??? Code
+??? code
@@ -155,7 +155,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -179,7 +179,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -217,7 +217,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -244,7 +244,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -269,7 +269,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -314,7 +314,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -344,7 +344,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -382,7 +382,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -420,7 +420,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -457,7 +457,7 @@ Assuming that the memory usage increases with the number of tokens, the dummy in
-??? Code
+??? code
@@ -546,7 +546,7 @@ return a schema of the tensors outputted by the HF processor that are related to
-??? Code
+??? code
@@ -623,7 +623,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -668,7 +668,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -698,7 +698,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -718,7 +718,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -745,7 +745,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -772,7 +772,7 @@ Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies
-??? Code
+??? code
@@ -125,7 +125,7 @@ to manually kill the profiler and generate your `nsys-rep` report.
-??? CLI example
+??? console "CLI example"
@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
-??? Command
+??? console "Command"
@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \
-??? Code
+??? code
@@ -34,7 +34,7 @@ vllm = "latest"
-??? Code
+??? code
@@ -64,7 +64,7 @@ cerebrium deploy
-??? Command
+??? console "Command"
@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference
-??? Response
+??? console "Response"
@@ -26,7 +26,7 @@ dstack init
-??? Config
+??? code "Config"
@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-
-??? Command
+??? console "Command"
@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning:
-??? Code
+??? code
@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1
-??? Code
+??? code
@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
-??? Code
+??? code
@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber
-??? Yaml
+??? code "Yaml"
@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \
-??? Output
+??? console "Output"
@@ -24,7 +24,7 @@ sky check
-??? Yaml
+??? code "Yaml"
@@ -95,7 +95,7 @@ HF_TOKEN="your-huggingface-token" \
-??? Yaml
+??? code "Yaml"
@@ -111,7 +111,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
-??? Yaml
+??? code "Yaml"
@@ -186,7 +186,7 @@ vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) R
-??? Commands
+??? console "Commands"
@@ -220,7 +220,7 @@ service:
-??? Yaml
+??? code "Yaml"
@@ -285,7 +285,7 @@ sky serve down vllm
-??? Yaml
+??? code "Yaml"
@@ -60,7 +60,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
-??? Output
+??? console "Output"
@@ -89,7 +89,7 @@ curl -X POST http://localhost:30080/completions \
-??? Output
+??? console "Output"
@@ -121,7 +121,7 @@ sudo helm uninstall vllm
-??? Yaml
+??? code "Yaml"
@@ -29,7 +29,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
-??? Config
+??? console "Config"
@@ -57,7 +57,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa
-??? Config
+??? console "Config"
@@ -36,7 +36,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb
-??? Config
+??? console "Config"
@@ -95,7 +95,7 @@ Notes:
-??? Commands
+??? console "Commands"
@@ -22,7 +22,7 @@ server.
-??? Code
+??? code
@@ -180,7 +180,7 @@ vision-language model.
-??? Code
+??? code
@@ -448,7 +448,7 @@ elements of the entire head for all context tokens. However, overall,
-??? Code
+??? code
@@ -13,7 +13,7 @@ Plugins are user-registered code that vLLM executes. Given vLLM's architecture (
-??? Code
+??? code
@@ -61,7 +61,7 @@ To address the above issues, I have designed and developed a local Tensor memory
-??? Commands
+??? console "Commands"
@@ -106,7 +106,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -128,7 +128,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -150,7 +150,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -172,7 +172,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -203,7 +203,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -225,7 +225,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -247,7 +247,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -269,7 +269,7 @@ python3 disagg_prefill_proxy_xpyd.py &
-??? Command
+??? console "Command"
@@ -304,7 +304,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
-??? Command
+??? console "Command"
@@ -28,7 +28,7 @@ A unique aspect of vLLM's `torch.compile` integration, is that we guarantee all
-??? Logs
+??? console "Logs"
@@ -110,7 +110,7 @@ Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At
-??? Logs
+??? console "Logs"
@@ -29,7 +29,7 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa
-??? Code
+??? code
@@ -70,7 +70,7 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora
-??? Command
+??? console "Command"
@@ -172,7 +172,7 @@ Alternatively, follow these example steps to implement your own plugin:
-??? Example of a simple S3 LoRAResolver implementation
+??? code "Example of a simple S3 LoRAResolver implementation"
@@ -238,7 +238,7 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
-??? Command output
+??? console "Command output"
@@ -20,7 +20,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
-??? Code
+??? code
@@ -68,7 +68,7 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
-??? Code
+??? code
@@ -146,7 +146,7 @@ for o in outputs:
-??? Code
+??? code
@@ -193,7 +193,7 @@ Full example: <gh-file:examples/offline_inference/audio_language.py>
-??? Code
+??? code
@@ -220,7 +220,7 @@ pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the cor
-??? Code
+??? code
@@ -288,7 +288,7 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
-??? Code
+??? code
@@ -366,7 +366,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
-??? Code
+??? code
@@ -430,7 +430,7 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
-??? Code
+??? code
@@ -486,7 +486,7 @@ Then, you can use the OpenAI client as follows:
-??? Code
+??? code
@@ -531,7 +531,7 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary.
-??? Code
+??? code
@@ -15,7 +15,7 @@ pip install autoawq
-??? Code
+??? code
@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \
-??? Code
+??? code
@@ -43,7 +43,7 @@ llm = LLM(
-??? Code
+??? code
@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
-??? Code
+??? code
@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-??? Code
+??? code
@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
-??? Code
+??? code
@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \
-??? Code
+??? code
@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
-??? Code
+??? code
@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
-??? Code
+??? code
@@ -141,7 +141,7 @@ lm_eval --model vllm \
-??? Code
+??? code
@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
-??? Code
+??? code
@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
-??? Code
+??? code
@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
-??? Code
+??? code
@@ -50,7 +50,7 @@ with torch.inference_mode():
-??? Code
+??? code
@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
-??? Code
+??? code
@@ -73,7 +73,7 @@ pip install llmcompressor
-??? Code
+??? code
@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
-??? Code
+??? code
@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
-??? Code
+??? code
@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
-??? Code
+??? code
@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
-??? Code
+??? code
@@ -176,7 +176,7 @@ for more exporting format details.
-??? Code
+??? code
@@ -15,7 +15,7 @@ pip install \
-??? Code
+??? code
@@ -33,7 +33,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
-??? Code
+??? code
@@ -70,7 +70,7 @@ The `reasoning_content` field contains the reasoning steps that led to the final
-??? Json
+??? console "Json"
@@ -95,7 +95,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
-??? Code
+??? code
@@ -152,7 +152,7 @@ Remember to check whether the `reasoning_content` exists in the response before
-??? Code
+??? code
@@ -200,7 +200,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
-??? Code
+??? code
@@ -258,7 +258,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
-??? Code
+??? code
@@ -18,7 +18,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
-??? Code
+??? code
@@ -62,7 +62,7 @@ python -m vllm.entrypoints.openai.api_server \
-??? Code
+??? code
@@ -103,7 +103,7 @@ Then use a client:
-??? Code
+??? code
@@ -137,7 +137,7 @@ draft models that conditioning draft predictions on both context vectors and sam
-??? Code
+??? code
@@ -185,7 +185,7 @@ A variety of speculative models of this type are available on HF hub:
-??? Code
+??? code
@@ -33,7 +33,7 @@ text.
-??? Code
+??? code
@@ -55,7 +55,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
-??? Code
+??? code
@@ -79,7 +79,7 @@ For this we can use the `guided_json` parameter in two different ways:
-??? Code
+??? code
@@ -127,7 +127,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
-??? Code
+??? code
@@ -169,7 +169,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
-??? Code
+??? code
@@ -212,7 +212,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
-??? Code
+??? code
@@ -248,7 +248,7 @@ Age: 28
-??? Code
+??? code
@@ -308,7 +308,7 @@ These parameters can be used in the same way as the parameters from the Online
-??? Code
+??? code
@@ -15,7 +15,7 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
-??? Code
+??? code
@@ -320,7 +320,7 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
-??? Code
+??? code
@@ -76,7 +76,7 @@ Currently, there are no pre-built CPU wheels.
-??? Commands
+??? console "Commands"
@@ -149,7 +149,7 @@ vllm serve facebook/opt-125m
-??? Commands
+??? console "Commands"
@@ -95,7 +95,7 @@ Currently, there are no pre-built ROCm wheels.
-??? Commands
+??? console "Commands"
@@ -206,7 +206,7 @@ DOCKER_BUILDKIT=1 docker build \
-??? Command
+??? console "Command"
@@ -237,7 +237,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
-??? Logs
+??? console "Logs"
@@ -286,7 +286,7 @@ When there's large amount of requests pending, vLLM scheduler will attempt to fi
-??? Logs
+??? console "Logs"
@@ -147,7 +147,7 @@ curl http://localhost:8000/v1/completions \
-??? Code
+??? code
@@ -186,7 +186,7 @@ curl http://localhost:8000/v1/chat/completions \
-??? Code
+??? code
@@ -39,6 +39,8 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
+  --md-admonition-icon--code: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="m11.28 3.22 4.25 4.25a.75.75 0 0 1 0 1.06l-4.25 4.25a.749.749 0 0 1-1.275-.326.75.75 0 0 1 .215-.734L13.94 8l-3.72-3.72a.749.749 0 0 1 .326-1.275.75.75 0 0 1 .734.215m-6.56 0a.75.75 0 0 1 1.042.018.75.75 0 0 1 .018 1.042L2.06 8l3.72 3.72a.749.749 0 0 1-.326 1.275.75.75 0 0 1-.734-.215L.47 8.53a.75.75 0 0 1 0-1.06Z"/></svg>');
+  --md-admonition-icon--console: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16"><path d="M0 2.75C0 1.784.784 1 1.75 1h12.5c.966 0 1.75.784 1.75 1.75v10.5A1.75 1.75 0 0 1 14.25 15H1.75A1.75 1.75 0 0 1 0 13.25Zm1.75-.25a.25.25 0 0 0-.25.25v10.5c0 .138.112.25.25.25h12.5a.25.25 0 0 0 .25-.25V2.75a.25.25 0 0 0-.25-.25ZM7.25 8a.75.75 0 0 1-.22.53l-2.25 2.25a.749.749 0 0 1-1.275-.326.75.75 0 0 1 .215-.734L5.44 8 3.72 6.28a.749.749 0 0 1 .326-1.275.75.75 0 0 1 .734.215l2.25 2.25c.141.14.22.331.22.53m1.5 1.5h3a.75.75 0 0 1 0 1.5h-3a.75.75 0 0 1 0-1.5"/></svg>');
@@ -49,6 +51,14 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
+.md-typeset .admonition.code,
+.md-typeset details.code {
+  border-color: #64dd17
+}
+.md-typeset .admonition.console,
+.md-typeset details.console {
+  border-color: #64dd17
+}
@@ -58,6 +68,14 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
+.md-typeset .code > .admonition-title,
+.md-typeset .code > summary {
+  background-color: #64dd171a;
+}
+.md-typeset .console > .admonition-title,
+.md-typeset .console > summary {
+  background-color: #64dd171a;
+}
@@ -71,6 +89,18 @@ body[data-md-color-scheme="slate"] .md-nav__item--section > label.md-nav__link .
+.md-typeset .code > .admonition-title::before,
+.md-typeset .code > summary::before {
+  background-color: #64dd17;
+  -webkit-mask-image: var(--md-admonition-icon--code);
+  mask-image: var(--md-admonition-icon--code);
+}
+.md-typeset .console > .admonition-title::before,
+.md-typeset .console > summary::before {
+  background-color: #64dd17;
+  -webkit-mask-image: var(--md-admonition-icon--console);
+  mask-image: var(--md-admonition-icon--console);
+}
@@ -85,7 +85,7 @@ and automatically applies the model's [chat template](https://huggingface.co/doc
-??? Code
+??? code
@@ -642,7 +642,7 @@ Specified using `--task generate`.
-??? Dependency versions
+??? code "Dependency versions"
@@ -13,7 +13,7 @@ pip install langchain langchain_community -q
-??? Code
+??? code
@@ -15,7 +15,7 @@ vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
-??? Code
+??? code
@@ -146,7 +146,7 @@ completion = client.chat.completions.create(
-??? Code
+??? code
@@ -185,7 +185,7 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
-??? Code
+??? code
@@ -193,7 +193,7 @@ The following [sampling parameters][sampling-params] are supported.
-??? Code
+??? code
@@ -217,7 +217,7 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
-??? Code
+??? code
@@ -225,7 +225,7 @@ The following [sampling parameters][sampling-params] are supported.
-??? Code
+??? code
@@ -268,7 +268,7 @@ and passing a list of `messages` in the request. Refer to the examples below for
-??? Code
+??? code
@@ -327,7 +327,7 @@ The following [pooling parameters][pooling-params] are supported.
-??? Code
+??? code
@@ -335,7 +335,7 @@ The following extra parameters are supported by default:
-??? Code
+??? code
@@ -358,7 +358,7 @@ Code example: <gh-file:examples/online_serving/openai_transcription_client.py>
-??? Code
+??? code
@@ -366,7 +366,7 @@ The following [sampling parameters][sampling-params] are supported.
-??? Code
+??? code
@@ -446,7 +446,7 @@ curl -v "http://127.0.0.1:8000/classify" \
-??? Response
+??? console "Response"
@@ -494,7 +494,7 @@ curl -v "http://127.0.0.1:8000/classify" \
-??? Response
+??? console "Response"
@@ -564,7 +564,7 @@ curl -X 'POST' \
-??? Response
+??? console "Response"
@@ -589,7 +589,7 @@ You can pass a string to `text_1` and a list to `text_2`, forming multiple sente
-??? Request
+??? console "Request"
@@ -606,7 +606,7 @@ The total number of pairs is `len(text_2)`.
-??? Response
+??? console "Response"
@@ -634,7 +634,7 @@ You can pass a list to both `text_1` and `text_2`, forming multiple sentence pai
-??? Request
+??? console "Request"
@@ -655,7 +655,7 @@ The total number of pairs is `len(text_2)`.
-??? Response
+??? console "Response"
@@ -716,7 +716,7 @@ Code example: <gh-file:examples/online_serving/jinaai_rerank_client.py>
-??? Request
+??? console "Request"
@@ -734,7 +734,7 @@ Result documents will be sorted by relevance, and the `index` property can be us
-??? Response
+??? console "Response"
@@ -12,7 +12,7 @@ vllm serve unsloth/Llama-3.2-1B-Instruct
-??? Output
+??? console "Output"
@@ -33,7 +33,7 @@ Then query the endpoint to get the latest metrics from the server:
-??? Code
+??? code
@@ -60,7 +60,7 @@ To identify the particular CUDA operation that causes the error, you can add `--
-??? Code
+??? code
@@ -170,7 +170,7 @@ WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
-??? Logs
+??? console "Logs"
@@ -214,7 +214,7 @@ if __name__ == '__main__':
-??? Code
+??? code
@@ -10,7 +10,7 @@ The list of data collected by the latest version of vLLM can be found here: <gh-
-??? Output
+??? console "Output"