add doc about serving option on dstack (#3074)
Co-authored-by: Roger Wang <ywang@roblox.com>
parent: a9bcc7afb2
commit: 429d89720e
docs/source/serving/deploying_with_dstack.rst | 103 (new file)
@@ -0,0 +1,103 @@
.. _deploying_with_dstack:

Deploying with dstack
=====================

.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
    </p>
vLLM can be run on a cloud-based GPU machine with `dstack <https://dstack.ai/>`__, an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas on your cloud environment.

To install the dstack client, run:

.. code-block:: console

    $ pip install "dstack[all]"
    $ dstack server
Next, to configure your dstack project, run:

.. code-block:: console

    $ mkdir -p vllm-dstack
    $ cd vllm-dstack
    $ dstack init
Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

.. code-block:: yaml

    type: service

    python: "3.11"
    env:
      - MODEL=NousResearch/Llama-2-7b-chat-hf
    port: 8000
    resources:
      gpu: 24GB
    commands:
      - pip install vllm
      - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000
    model:
      format: openai
      type: chat
      name: NousResearch/Llama-2-7b-chat-hf
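If you want a quick local sanity check before provisioning, you can parse the file and confirm that the keys used above are present. This is only an optional sketch, assuming PyYAML is available in your environment; the key names simply mirror the configuration shown above.

.. code-block:: python

    # Optional sanity check for serve.dstack.yml; key names mirror the config above.
    import yaml

    with open("serve.dstack.yml") as f:
        config = yaml.safe_load(f)

    assert config["type"] == "service"
    assert config["port"] == 8000
    # The commands list should start vLLM's OpenAI-compatible server.
    assert any("vllm.entrypoints.openai.api_server" in cmd for cmd in config["commands"])
    print("serve.dstack.yml looks consistent:", config["env"])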
Then, run the following CLI for provisioning:

.. code-block:: console

    $ dstack run . -f serve.dstack.yml
    ⠸ Getting run plan...
     Configuration   serve.dstack.yml
     Project         deep-diver-main
     User            deep-diver
     Min resources   2..xCPU, 8GB.., 1xGPU (24GB)
     Max price       -
     Max duration    -
     Spot policy     auto
     Retry policy    no

     #  BACKEND  REGION       INSTANCE       RESOURCES                                SPOT  PRICE
     1  gcp      us-central1  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     2  gcp      us-east1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
     3  gcp      us-west1     g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100GB (disk)  yes   $0.223804
        ...
     Shown 3 of 193 offers, $5.876 max

    Continue? [y/n]: y
    ⠙ Submitting run...
    ⠏ Launching spicy-treefrog-1 (pulling)
    spicy-treefrog-1 provisioning completed (running)
    Service is published at ...
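Once the run is up, you can optionally confirm that the published endpoint is reachable before wiring up a client. A minimal sketch, assuming the ``requests`` package is installed; the gateway URL and access token are the same placeholders used in the client example below:

.. code-block:: python

    # Optional reachability check for the published OpenAI-compatible endpoint.
    import requests

    base_url = "https://gateway.<gateway domain>"  # placeholder: your gateway domain
    headers = {"Authorization": "Bearer <YOUR-DSTACK-SERVER-ACCESS-TOKEN>"}  # placeholder token

    # vLLM's OpenAI-compatible server lists the served model under /v1/models.
    resp = requests.get(f"{base_url}/v1/models", headers=headers, timeout=30)
    resp.raise_for_status()
    print([m["id"] for m in resp.json()["data"]])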
After the provisioning, you can interact with the model by using the OpenAI SDK:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>"
    )

    completion = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[
            {
                "role": "user",
                "content": "Compose a poem that explains the concept of recursion in programming.",
            }
        ]
    )

    print(completion.choices[0].message.content)
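The same OpenAI-compatible endpoint also supports streaming responses. A minimal sketch that reuses the placeholder gateway URL, access token, and model name from the example above:

.. code-block:: python

    from openai import OpenAI

    client = OpenAI(
        base_url="https://gateway.<gateway domain>",  # placeholder: your gateway domain
        api_key="<YOUR-DSTACK-SERVER-ACCESS-TOKEN>",  # placeholder: your dstack access token
    )

    stream = client.chat.completions.create(
        model="NousResearch/Llama-2-7b-chat-hf",
        messages=[{"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content  # each chunk carries a small piece of the reply
        if delta:
            print(delta, end="", flush=True)
    print()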
.. note::

    dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is intended for development purposes only. For more hands-on material on serving vLLM with dstack, check out `this repository <https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm>`__.
@@ -9,4 +9,5 @@ Integrations
   deploying_with_triton
   deploying_with_bentoml
   deploying_with_lws
+  deploying_with_dstack
   serving_with_langchain