add doc about serving option on dstack (#3074)
Co-authored-by: Roger Wang <ywang@roblox.com>
parent: a9bcc7afb2
commit: 429d89720e
docs/source/serving/deploying_with_dstack.rst | 103 (new file)
@@ -0,0 +1,103 @@
.. _deploying_with_dstack:

Deploying with dstack
=====================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>
vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.

To install the dstack client, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server
Next, to configure your dstack project, run:

.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init
Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

.. code-block:: yaml

    type: service

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf
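If you want a quick local sanity check before provisioning, you can parse the file and confirm that the keys used above are present. This is only an optional sketch, assuming PyYAML is available in your environment; the key names simply mirror the configuration shown above.

.. code-block:: python

    # Optional sanity check for serve.dstack.yml; key names mirror the config above.
    import yaml

    with open("serve.dstack.yml") as f:
        config = yaml.safe_load(f)

    assert config["type"] == "service"
    assert config["port"] == 8000
    # The commands list should start vLLM's OpenAI-compatible server.
    assert any("vllm.entrypoints.openai.api_server" in cmd for cmd in config["commands"])
    print("serve.dstack.yml looks consistent:", config["env"])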
Then, run the following CLI for provisioning:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    ⠸ Getting run plan...
     Configuration   serve.dstack.yml
     Project         deep-diver-main
     User            deep-diver
     Min resources   2..xCPU, 8GB.., 1xGPU (24GB)
     Max price       -
     Max duration    -
     Spot policy     auto
     Retry policy    no

     #  BACKEND  REGION       INSTANCE       RESOURCES                                SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...
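Once the run is up, you can optionally confirm that the published endpoint is reachable before wiring up a client. A minimal sketch, assuming the ``requests`` package is installed; the gateway URL and access token are the same placeholders used in the client example below:

.. code-block:: python

    # Optional reachability check for the published OpenAI-compatible endpoint.
    import requests

    base_url = "https://gateway.<gateway domain>"  # placeholder: your gateway domain
    headers = {"Authorization": "Bearer <YOUR-DSTACK-SERVER-ACCESS-TOKEN>"}  # placeholder token

    # vLLM's OpenAI-compatible server lists the served model under /v1/models.
    resp = requests.get(f"{base_url}/v1/models", headers=headers, timeout=30)
    resp.raise_for_status()
    print([m["id"] for m in resp.json()["data"]])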
After the provisioning, you can interact with the model by using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)
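The same OpenAI-compatible endpoint also supports streaming responses. A minimal sketch that reuses the placeholder gateway URL, access token, and model name from the example above:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",  # placeholder: your gateway domain
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>",  # placeholder: your dstack access token
    )

    stream = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content  # each chunk carries a small piece of the reply
        if delta:
            print(delta, end="", flush=True)
    print()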
.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
@@ -9,4 +9,5 @@ Integrations
   deploying_with_triton
   deploying_with_bentoml
   deploying_with_lws
+  deploying_with_dstack
   serving_with_langchain