# vLLM CLI Guide
The `vllm` command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
```bash
vllm --help
```
Available Commands:
```bash
vllm {chat,complete,serve,bench,collect-env,run-batch}
```
## serve
Start the vLLM OpenAI-compatible API server.
Start with a model:
```bash
vllm serve meta-llama/Llama-2-7b-hf
```
Specify the port:
```bash
vllm serve meta-llama/Llama-2-7b-hf --port 8100
```
Serve over a Unix domain socket:
```bash
vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
```
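Once the server is up, it exposes an OpenAI-compatible REST API. As a quick sanity check (assuming the default host and port), you can list the served models with curl:
```bash
# Assumes the server is listening on the default http://localhost:8000
curl http://localhost:8000/v1/models
```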
Check with `--help` for more options:
```bash
# To list all groups
vllm serve --help=listgroup
# To view an argument group
vllm serve --help=ModelConfig
# To view a single argument
vllm serve --help=max-num-seqs
# To search by keyword
vllm serve --help=max
# To view the full help in a pager (less/more)
vllm serve --help=page
```
See [vllm serve](./serve.md) for the full reference of all available arguments.
## chat
Generate chat completions via the running API server.
```bash
# Directly connect to localhost API without arguments
vllm chat
# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
# Quick chat with a single prompt
vllm chat --quick "hi"
```
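For example, a minimal end-to-end flow might look like this (the model name and port are only placeholders; any model started with `vllm serve` works):
```bash
# Terminal 1: start an OpenAI-compatible server on a custom port
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8100

# Terminal 2: point the chat client at that server and send a single prompt
vllm chat --url http://localhost:8100/v1 --quick "hi"
```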
See [vllm chat](./chat.md) for the full reference of all available arguments.
## complete
Generate text completions based on the given prompt via the running API server.
```bash
# Directly connect to localhost API without arguments
vllm complete
# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
# Quick complete with a single prompt
vllm complete --quick "The future of AI is"
```
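Under the hood, `vllm complete` sends the prompt to the server's OpenAI-compatible completions endpoint, so the same request can be issued directly with curl (the default host/port and the model name below are assumptions):
```bash
# Equivalent raw request against the completions endpoint
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "The future of AI is",
        "max_tokens": 32
    }'
```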
See [vllm complete](./complete.md) for the full reference of all available arguments.
## bench
Run benchmark tests for latency, online serving throughput, and offline inference throughput.
To use the benchmark commands, install vLLM with the extra benchmark dependencies: `pip install vllm[bench]`.
Available Commands:
```bash
vllm bench {latency,serve,throughput}
```
### latency
Benchmark the latency of a single batch of requests.
```bash
vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
### serve
Benchmark the online serving throughput.
```bash
vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host {server-host} \
    --port {server-port} \
    --random-input-len 32 \
    --random-output-len 4 \
    --num-prompts 5
```
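Note that `vllm bench serve` measures an already-running server rather than starting one itself, so launch the target server first and substitute its address for `{server-host}` and `{server-port}`:
```bash
# In a separate terminal: start the server that will be benchmarked
vllm serve meta-llama/Llama-3.2-1B-Instruct
```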
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
### throughput
Benchmark offline inference throughput.
```bash
vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
## collect-env
Collect environment information.
```bash
vllm collect-env
```
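The report is printed to standard output, so it can be redirected into a file, for example when attaching environment details to a bug report:
```bash
# Save the environment report to a file for a bug report
vllm collect-env > collect_env_output.txt
```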
## run-batch
Run a batch of prompts and write the results to a file.
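Each line of the input file is a JSON request that follows the OpenAI batch file format. As a rough sketch (the model name and message contents are only placeholders), a small local input file could be created like this and then passed to `-i`:
```bash
cat <<'EOF' > my_batch.jsonl
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello world!"}], "max_tokens": 64}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is vLLM?"}], "max_tokens": 64}}
EOF
```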
Running with a local file:
```bash
vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
Running with a remote file:
```bash
vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
## More Help
For detailed options of any subcommand, use:
```bash
vllm <subcommand> --help
```