[Doc] Update SkyPilot doc for wrong indents and instructions for update service (#4283)

parent: 281977bd6e
commit: 150a1ffbfd
@@ -159,18 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
         --model $MODEL_NAME \
         --trust-remote-code \
         --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-        2>&1 | tee api_server.log &
-
-      echo 'Waiting for vllm api server to start...'
-      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-
-      echo 'Starting gradio server...'
-      git clone https://github.com/vllm-project/vllm.git || true
-      python vllm/examples/gradio_openai_chatbot_webserver.py \
-        -m $MODEL_NAME \
-        --port 8811 \
-        --model-url http://localhost:8081/v1 \
-        --stop-token-ids 128009,128001
+        2>&1 | tee api_server.log

 .. raw:: html

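The updated `run` block keeps the vLLM API server in the foreground and drops the co-located Gradio launch. For orientation, a minimal sketch of bringing this recipe up with SkyPilot Serve, assuming the YAML is saved as `serving.yaml` (the same filename used by the update command later in this diff):

    # Sketch: launch the serve recipe; the service name "vllm" matches the status output below.
    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN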
@@ -203,8 +192,8 @@ Wait until the service is ready:

    Service Replicas
    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
-   vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})        READY   us-east4
-   vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})        READY   us-east4
+   vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
+   vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4

 .. raw:: html

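The replica table in this hunk is the output of SkyPilot's service status command; a hedged sketch of polling it until both replicas report READY:

    # Sketch: re-check the "vllm" service status every 10 seconds.
    watch -n10 sky serve status vllm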
@@ -232,19 +221,91 @@ After the service is READY, you can find a single endpoint for the service and a
         "stop_token_ids": [128009, 128001]
       }'

-To enable autoscaling, you could specify additional configs in `services`:
+To enable autoscaling, you could replace the `replicas` with the following configs in `service`:

 .. code-block:: yaml

-    services:
+    service:
       replica_policy:
-        min_replicas: 0
-        max_replicas: 3
+        min_replicas: 2
+        max_replicas: 4
         target_qps_per_replica: 2

 This will scale the service up to when the QPS exceeds 2 for each replica.


+.. raw:: html
+
+    <details>
+    <summary>Click to see the full recipe YAML</summary>
+
+
+.. code-block:: yaml
+
+    service:
+      replica_policy:
+        min_replicas: 2
+        max_replicas: 4
+        target_qps_per_replica: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+          model: $MODEL_NAME
+          messages:
+            - role: user
+              content: Hello! What is your name?
+          max_tokens: 1
+
+    resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.
+
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log
+
+
+.. raw:: html
+
+    </details>
+
+To update the service with the new config:
+
+.. code-block:: console
+
+    HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
+
+
+To stop the service:
+
+.. code-block:: console
+
+    sky serve down vllm
+
+
 **Optional**: Connect a GUI to the endpoint
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

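The new `readiness_probe` issues a real chat-completion request against each replica. A manual request mirroring that payload, sketched with curl (the endpoint address is a placeholder; use the one reported by `sky serve status` for the `vllm` service):

    # Sketch: chat-completion request mirroring the readiness probe's post_data.
    # ENDPOINT is a placeholder address, not a value taken from this diff's output.
    ENDPOINT=x.x.x.x:3031
    curl http://$ENDPOINT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "meta-llama/Meta-Llama-3-8B-Instruct",
              "messages": [{"role": "user", "content": "Hello! What is your name?"}],
              "max_tokens": 1,
              "stop_token_ids": [128009, 128001]
            }'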
@@ -259,18 +320,15 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
 .. code-block:: yaml

     envs:
-      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
       ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

     resources:
       cpus: 2

     setup: |
-      conda activate vllm
-      if [ $? -ne 0 ]; then
       conda create -n vllm python=3.10 -y
       conda activate vllm
-      fi

       # Install Gradio for web UI.
       pip install gradio openai
@@ -278,9 +336,6 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
     run: |
       conda activate vllm
       export PATH=$PATH:/sbin
-      WORKER_IP=$(hostname -I | cut -d' ' -f1)
-      CONTROLLER_PORT=21001
-      WORKER_PORT=21002

       echo 'Starting gradio server...'
       git clone https://github.com/vllm-project/vllm.git || true
@@ -290,6 +345,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
         --model-url http://$ENDPOINT/v1 \
         --stop-token-ids 128009,128001 | tee ~/gradio.log
+

 .. raw:: html

     </details>
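The GUI recipe diffed above runs as a plain SkyPilot task rather than a service. A hedged launch sketch, assuming the recipe is saved as `gui.yaml` (a hypothetical filename) and the serve endpoint is already known:

    # Sketch: launch the Gradio frontend; gui.yaml is a hypothetical name for the recipe above.
    # Replace the ENDPOINT placeholder with the vllm service endpoint.
    sky launch -c gui gui.yaml --env ENDPOINT=x.x.x.x:3031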