# vLLM CLI Guide

The `vllm` command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

```bash
vllm --help
```

Available Commands:

```bash
vllm {chat,complete,serve,bench,collect-env,run-batch}
```
## serve
Starts the vLLM OpenAI-compatible API server.

Start with a model:

```bash
vllm serve meta-llama/Llama-2-7b-hf
```

Specify the port:

```bash
vllm serve meta-llama/Llama-2-7b-hf --port 8100
```

Serve over a Unix domain socket:

```bash
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
```
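Once the server is up, any OpenAI-compatible client can talk to it. As a minimal sketch, assuming the server from the examples above is listening on the default port 8000, a completion request with `curl` looks like this:

```bash
# Illustrative request against the OpenAI-compatible /v1/completions endpoint
# (assumes the default port 8000; adjust host/port to match your server)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "San Francisco is a",
        "max_tokens": 16
      }'
```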
Check with `--help` for more options:

```bash
# To list all groups
vllm serve --help=listgroup

# To view an argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max

# To view full help with pager (less/more)
vllm serve --help=page
```
See [vllm serve](./serve.md) for the full reference of all available arguments.
## chat

Generate chat completions via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"
```
See [vllm chat](./chat.md) for the full reference of all available arguments.
## complete

Generate text completions based on the given prompt via the running API server.

```bash
# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"
```
See [vllm complete](./complete.md) for the full reference of all available arguments.
## bench

Run benchmark tests for latency, online serving throughput, and offline inference throughput.

To use the benchmark commands, install vLLM with the extra benchmark dependencies via `pip install vllm[bench]`.
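For example (the quotes keep shells such as zsh from interpreting the brackets):

```bash
# Install vLLM together with the benchmark extras
pip install "vllm[bench]"
```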
Available Commands:

```bash
vllm bench {latency, serve, throughput}
```
### latency

Benchmark the latency of a single batch of requests.

```bash
vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
### serve

Benchmark the online serving throughput.

```bash
vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host server-host \
    --port server-port \
    --random-input-len 32 \
    --random-output-len 4 \
    --num-prompts 5
```
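`vllm bench serve` generates load against an already running API server, so start one first (for example with `vllm serve` in a separate terminal) and point `--host`/`--port` at it. A minimal sketch, assuming the default port:

```bash
# Terminal 1: start the server to benchmark (listens on the default port 8000)
vllm serve meta-llama/Llama-3.2-1B-Instruct

# Terminal 2: run the benchmark command above with --host localhost --port 8000
```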
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
### throughput

Benchmark offline inference throughput.

```bash
vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
## collect-env

Collect environment information.

```bash
vllm collect-env
```
## run-batch

Run batch prompts and write the results to a file.
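The input file uses the OpenAI batch request format: one JSON object per line with a `custom_id`, `method`, `url`, and request `body`. A minimal sketch of creating such a file (the field values here are purely illustrative):

```bash
# Write a one-request batch file; real files contain one JSON request per line
cat > example_batch.jsonl << 'EOF'
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}}
EOF
```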
Running with a local file:

```bash
vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
Using a remote file:

```bash
vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
## More Help

For detailed options of any subcommand, use:

```bash
vllm <subcommand> --help
```