mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2025-12-10 05:15:42 +08:00
[Doc] Improve quickstart documentation (#9256)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
This commit is contained in:
parent
ca0d92227e
commit
228cfbd03f
@ -1,38 +1,50 @@
|
|||||||
.. _quickstart:
|
.. _quickstart:
|
||||||
|
|
||||||
|
==========
|
||||||
Quickstart
|
Quickstart
|
||||||
==========
|
==========
|
||||||
|
|
||||||
This guide shows how to use vLLM to:
|
This guide will help you quickly get started with vLLM to:
|
||||||
|
|
||||||
* run offline batched inference on a dataset;
|
* :ref:`Run offline batched inference <offline_batched_inference>`
|
||||||
* build an API server for a large language model;
|
* :ref:`Run OpenAI-compatible inference <openai_compatible_server>`
|
||||||
* start an OpenAI-compatible API server.
|
|
||||||
|
|
||||||
Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
|
Prerequisites
|
||||||
|
--------------
|
||||||
|
- OS: Linux
|
||||||
|
- Python: 3.8 - 3.12
|
||||||
|
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
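The Python requirement above can be checked before installing. The following is a minimal sketch using only the standard library; the helper name ``python_is_supported`` is hypothetical and is not part of vLLM, and the bounds are copied from the prerequisites list.

```python
import sys

# Supported range taken from the prerequisites list (3.8 through 3.12).
MIN_PYTHON = (3, 8)
MAX_PYTHON = (3, 12)


def python_is_supported(version=None):
    """Return True if (major, minor) falls inside the documented range.

    Hypothetical helper for illustration; not part of vLLM.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) <= MAX_PYTHON


print("Python", sys.version.split()[0], "supported:", python_is_supported())
```

This only covers the interpreter version; the OS and GPU requirements need to be verified separately.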
|
||||||
|
|
||||||
.. note::
|
Installation
|
||||||
|
--------------
|
||||||
|
|
||||||
By default, vLLM downloads model from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:
|
You can install vLLM using pip. It's recommended to use `conda <https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html>`_ to create and manage Python environments.
|
||||||
|
|
||||||
.. code-block:: shell
|
.. code-block:: console
|
||||||
|
|
||||||
export VLLM_USE_MODELSCOPE=True
|
$ conda create -n myenv python=3.10 -y
|
||||||
|
$ conda activate myenv
|
||||||
|
$ pip install vllm
|
||||||
|
|
||||||
|
Please refer to the :ref:`installation documentation <installation>` for more details on installing vLLM.
|
||||||
|
|
||||||
|
.. _offline_batched_inference:
|
||||||
|
|
||||||
Offline Batched Inference
|
Offline Batched Inference
|
||||||
-------------------------
|
-------------------------
|
||||||
|
|
||||||
We first show an example of using vLLM for offline batched inference on a dataset. In other words, we use vLLM to generate texts for a list of input prompts.
|
With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batched inference). The example script for this section can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`__.
|
||||||
|
|
||||||
Import :class:`~vllm.LLM` and :class:`~vllm.SamplingParams` from vLLM.
|
The first line of this example imports the classes :class:`~vllm.LLM` and :class:`~vllm.SamplingParams`:
|
||||||
The :class:`~vllm.LLM` class is the main class for running offline inference with vLLM engine.
|
|
||||||
The :class:`~vllm.SamplingParams` class specifies the parameters for the sampling process.
|
- :class:`~vllm.LLM` is the main class for running offline inference with the vLLM engine.
|
||||||
|
- :class:`~vllm.SamplingParams` specifies the parameters for the sampling process.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
from vllm import LLM, SamplingParams
|
from vllm import LLM, SamplingParams
|
||||||
|
|
||||||
Define the list of input prompts and the sampling parameters for generation. The sampling temperature is set to 0.8 and the nucleus sampling probability is set to 0.95. For more information about the sampling parameters, refer to the `class definition <https://github.com/vllm-project/vllm/blob/main/vllm/sampling_params.py>`_.
|
The next section defines a list of input prompts and sampling parameters for text generation. The `sampling temperature <https://arxiv.org/html/2402.05201v1>`_ is set to ``0.8`` and the `nucleus sampling probability <https://en.wikipedia.org/wiki/Top-p_sampling>`_ is set to ``0.95``. You can find more information about the sampling parameters `here <https://docs.vllm.ai/en/stable/dev/sampling_params.html>`__.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
@ -44,46 +56,46 @@ Define the list of input prompts and the sampling parameters for generation. The
|
|||||||
]
|
]
|
||||||
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
|
||||||
|
|
||||||
Initialize vLLM's engine for offline inference with the :class:`~vllm.LLM` class and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_. The list of supported models can be found at :ref:`supported models <supported_models>`.
|
The :class:`~vllm.LLM` class initializes vLLM's engine and the `OPT-125M model <https://arxiv.org/abs/2205.01068>`_ for offline inference. The list of supported models can be found :ref:`here <supported_models>`.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
llm = LLM(model="facebook/opt-125m")
|
llm = LLM(model="facebook/opt-125m")
|
||||||
|
|
||||||
Call ``llm.generate`` to generate the outputs. It adds the input prompts to vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all the output tokens.
|
.. note::
|
||||||
|
|
||||||
|
By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_, set the environment variable ``VLLM_USE_MODELSCOPE`` before initializing the engine.
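If you prefer to set the variable from Python rather than from the shell, a minimal sketch is shown here. It assumes the value ``True`` used by the shell form ``export VLLM_USE_MODELSCOPE=True``, and that setting it before importing vLLM is sufficient.

```python
import os

# Set the variable before the engine is created; "True" mirrors the
# shell form `export VLLM_USE_MODELSCOPE=True`.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

# Import vLLM and construct the engine only after this point, e.g.:
#     from vllm import LLM
#     llm = LLM(model=...)
print("VLLM_USE_MODELSCOPE =", os.environ["VLLM_USE_MODELSCOPE"])
```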
|
||||||
|
|
||||||
|
Now, the fun part! The outputs are generated using ``llm.generate``. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all of the output tokens.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
outputs = llm.generate(prompts, sampling_params)
|
outputs = llm.generate(prompts, sampling_params)
|
||||||
|
|
||||||
# Print the outputs.
|
|
||||||
for output in outputs:
|
for output in outputs:
|
||||||
prompt = output.prompt
|
prompt = output.prompt
|
||||||
generated_text = output.outputs[0].text
|
generated_text = output.outputs[0].text
|
||||||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
|
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
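The steps in this section can be collected into one function. This is a sketch rather than the official example script: the prompts are hypothetical placeholders, the helper name ``run_offline_batch`` is made up for illustration, and vLLM is imported lazily so the module loads even where vLLM is not installed.

```python
# Sampling settings quoted in this guide.
SAMPLING_KWARGS = {"temperature": 0.8, "top_p": 0.95}

# Hypothetical prompts, for illustration only.
EXAMPLE_PROMPTS = ["Hello, my name is", "The capital of France is"]


def run_offline_batch(prompts, model="facebook/opt-125m"):
    """Generate one completion per prompt and return (prompt, text) pairs."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    outputs = llm.generate(prompts, SamplingParams(**SAMPLING_KWARGS))
    # Each RequestOutput carries the prompt and a list of completions.
    return [(out.prompt, out.outputs[0].text) for out in outputs]


# Usage (needs vLLM and a supported GPU):
#     for prompt, text in run_offline_batch(EXAMPLE_PROMPTS):
#         print(f"Prompt: {prompt!r}, Generated text: {text!r}")
```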
|
||||||
|
|
||||||
|
.. _openai_compatible_server:
|
||||||
The code example can also be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_.
|
|
||||||
|
|
||||||
OpenAI-Compatible Server
|
OpenAI-Compatible Server
|
||||||
------------------------
|
------------------------
|
||||||
|
|
||||||
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
|
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
|
||||||
By default, it starts the server at ``http://localhost:8000``. You can specify the address with ``--host`` and ``--port`` arguments. The server currently hosts one model at a time (OPT-125M in the command below) and implements `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_ endpoints. We are actively adding support for more endpoints.
|
By default, the server starts at ``http://localhost:8000``. You can specify the address with the ``--host`` and ``--port`` arguments. The server currently hosts one model at a time and implements endpoints such as `list models <https://platform.openai.com/docs/api-reference/models/list>`_, `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_, and `create completion <https://platform.openai.com/docs/api-reference/completions/create>`_.
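To make the address and endpoint layout concrete, the sketch below builds the URLs a client would call. The ``/v1/...`` paths follow the OpenAI API layout used by the ``curl`` commands in this guide; the helper ``endpoint_url`` and the dictionary keys are hypothetical names chosen for illustration.

```python
# Paths of the three endpoints named above, relative to the server root.
ENDPOINT_PATHS = {
    "list_models": "/v1/models",
    "chat_completion": "/v1/chat/completions",
    "completion": "/v1/completions",
}


def endpoint_url(name, host="localhost", port=8000):
    """Build the full URL for an endpoint; defaults match the server defaults."""
    return f"http://{host}:{port}{ENDPOINT_PATHS[name]}"


print(endpoint_url("completion"))
print(endpoint_url("list_models", host="0.0.0.0", port=9000))
```

If the server is started with different ``--host`` or ``--port`` values, pass the same values to the helper.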
|
||||||
|
|
||||||
Start the server:
|
Run the following command to start the vLLM server with the `Qwen2.5-1.5B-Instruct <https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct>`_ model:
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
$ vllm serve facebook/opt-125m
|
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
||||||
|
|
||||||
By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
|
.. note::
|
||||||
|
|
||||||
.. code-block:: console
|
By default, the server uses a predefined chat template stored in the tokenizer. You can learn about overriding it `here <https://github.com/vllm-project/vllm/blob/main/docs/source/serving/openai_compatible_server.md#chat-template>`__.
|
||||||
|
|
||||||
$ vllm serve facebook/opt-125m --chat-template ./examples/template_chatml.jinja
|
This server can be queried in the same format as the OpenAI API. For example, to list the models:
|
||||||
|
|
||||||
This server can be queried in the same format as OpenAI API. For example, list the models:
|
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
@ -91,17 +103,17 @@ This server can be queried in the same format as OpenAI API. For example, list t
|
|||||||
|
|
||||||
You can pass in the argument ``--api-key`` or the environment variable ``VLLM_API_KEY`` to make the server check for an API key in the request header.
|
You can pass in the argument ``--api-key`` or the environment variable ``VLLM_API_KEY`` to make the server check for an API key in the request header.
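When an API key is enabled, clients need to send it with each request. The sketch below assumes the key is expected in the OpenAI-style ``Authorization: Bearer <key>`` header; the key value is a placeholder and ``build_models_request`` is a hypothetical helper. The request is only constructed here, not sent.

```python
import urllib.request


def build_models_request(api_key, base_url="http://localhost:8000/v1"):
    """Build (but do not send) a list-models request carrying the API key."""
    # Assumption: the server checks an OpenAI-style bearer token.
    headers = {"Authorization": f"Bearer {api_key}"}
    return urllib.request.Request(f"{base_url}/models", headers=headers)


request = build_models_request("token-abc123")  # placeholder key
print(request.get_method(), request.full_url)

# To actually send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(response.read().decode("utf-8"))
```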
|
||||||
|
|
||||||
Using OpenAI Completions API with vLLM
|
OpenAI Completions API with vLLM
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
Query the model with input prompts:
|
Once your server is started, you can query the model with input prompts:
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
$ curl http://localhost:8000/v1/completions \
|
$ curl http://localhost:8000/v1/completions \
|
||||||
$ -H "Content-Type: application/json" \
|
$ -H "Content-Type: application/json" \
|
||||||
$ -d '{
|
$ -d '{
|
||||||
$ "model": "facebook/opt-125m",
|
$ "model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||||
$ "prompt": "San Francisco is a",
|
$ "prompt": "San Francisco is a",
|
||||||
$ "max_tokens": 7,
|
$ "max_tokens": 7,
|
||||||
$ "temperature": 0
|
$ "temperature": 0
|
||||||
@ -120,36 +132,32 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
|
|||||||
api_key=openai_api_key,
|
api_key=openai_api_key,
|
||||||
base_url=openai_api_base,
|
base_url=openai_api_base,
|
||||||
)
|
)
|
||||||
completion = client.completions.create(model="facebook/opt-125m",
|
completion = client.completions.create(model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||||
prompt="San Francisco is a")
|
prompt="San Francisco is a")
|
||||||
print("Completion result:", completion)
|
print("Completion result:", completion)
|
||||||
|
|
||||||
For a more detailed client example, refer to `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_.
|
A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.
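For environments without the ``openai`` package, the same completion request can be assembled with the standard library. This is a sketch under the assumption that the server runs at the default address; ``build_completion_request`` is a hypothetical helper, and the payload fields mirror the ``curl`` example in this section.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             model="Qwen/Qwen2.5-1.5B-Instruct",
                             base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 7,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("San Francisco is a")
print(request.get_method(), request.full_url)

# To send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         print(json.load(response))
```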
|
||||||
|
|
||||||
Using OpenAI Chat API with vLLM
|
OpenAI Chat API with vLLM
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
The vLLM server is designed to support the OpenAI Chat API, allowing you to engage in dynamic conversations with the model. The chat interface is a more interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
|
vLLM also supports the OpenAI Chat API. The chat interface is a more interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
|
||||||
|
|
||||||
Querying the model using OpenAI Chat API:
|
You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:
|
||||||
|
|
||||||
You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to communicate with the model in a chat-like interface:
|
|
||||||
|
|
||||||
.. code-block:: console
|
.. code-block:: console
|
||||||
|
|
||||||
$ curl http://localhost:8000/v1/chat/completions \
|
$ curl http://localhost:8000/v1/chat/completions \
|
||||||
$ -H "Content-Type: application/json" \
|
$ -H "Content-Type: application/json" \
|
||||||
$ -d '{
|
$ -d '{
|
||||||
$ "model": "facebook/opt-125m",
|
$ "model": "Qwen/Qwen2.5-1.5B-Instruct",
|
||||||
$ "messages": [
|
$ "messages": [
|
||||||
$ {"role": "system", "content": "You are a helpful assistant."},
|
$ {"role": "system", "content": "You are a helpful assistant."},
|
||||||
$ {"role": "user", "content": "Who won the world series in 2020?"}
|
$ {"role": "user", "content": "Who won the world series in 2020?"}
|
||||||
$ ]
|
$ ]
|
||||||
$ }'
|
$ }'
|
||||||
|
|
||||||
Python Client Example:
|
Alternatively, you can use the ``openai`` Python package:
|
||||||
|
|
||||||
Using the `openai` python package, you can also communicate with the model in a chat-like manner:
|
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
@ -164,12 +172,10 @@ Using the `openai` python package, you can also communicate with the model in a
|
|||||||
)
|
)
|
||||||
|
|
||||||
chat_response = client.chat.completions.create(
|
chat_response = client.chat.completions.create(
|
||||||
model="facebook/opt-125m",
|
model="Qwen/Qwen2.5-1.5B-Instruct",
|
||||||
messages=[
|
messages=[
|
||||||
{"role": "system", "content": "You are a helpful assistant."},
|
{"role": "system", "content": "You are a helpful assistant."},
|
||||||
{"role": "user", "content": "Tell me a joke."},
|
{"role": "user", "content": "Tell me a joke."},
|
||||||
]
|
]
|
||||||
)
|
)
|
||||||
print("Chat response:", chat_response)
|
print("Chat response:", chat_response)
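The chat history mentioned earlier is simply the ``messages`` list, resent in full on every request. The sketch below shows how it might grow across turns; ``add_turn`` is a hypothetical helper and the assistant reply is a placeholder string, since no server is contacted here.

```python
def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


history = [{"role": "system", "content": "You are a helpful assistant."}]
history = add_turn(history, "user", "Tell me a joke.")

# With a real server, the reply text would be read from the response,
# e.g. chat_response.choices[0].message.content in the openai client.
history = add_turn(history, "assistant", "<model reply goes here>")

# The follow-up question is sent together with everything before it.
history = add_turn(history, "user", "Explain why that is funny.")

print([message["role"] for message in history])
```

Passing ``history`` as the ``messages`` argument on the next call gives the model the context of the earlier exchange.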
|
||||||
|
|
||||||
For more in-depth examples and advanced features of the chat API, you can refer to the official OpenAI documentation.
|
|
||||||
|
|||||||