mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-06-01 15:17:05 +08:00
[doc] use MkDocs collapsible blocks - supplement (#19973)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
parent 5111642a6f
commit b82e0f82cb
@@ -61,6 +61,8 @@ To address the above issues, I have designed and developed a local Tensor memory
 
 # Install vLLM
 
+??? Commands
+
 ```shell
 # Enter the home directory or your working directory.
 cd /home
@@ -104,6 +106,8 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -124,6 +128,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
 
 ### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -144,6 +150,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
 
 ### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -164,6 +172,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
 
 ### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -193,6 +203,8 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -213,6 +225,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
 
 ### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -233,6 +247,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
 
 ### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -253,6 +269,8 @@ VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
 
 ### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
 
+??? Command
+
 ```shell
 VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
 --host 0.0.0.0 \
@@ -286,6 +304,8 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
 
 # Benchmark
 
+??? Command
+
 ```shell
 python3 benchmark_serving.py \
 --backend vllm \
@@ -28,7 +28,9 @@ A unique aspect of vLLM's `torch.compile` integration, is that we guarantee all
 
 In the very verbose logs, we can see:
 
-```
+??? Logs
+
+```text
 DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function <code object forward at 0x7f08acf40c90, file "xxx/vllm/model_executor/models/llama.py", line 339>
 
 DEBUG 03-07 03:06:54 [backends.py:370] Traced files (to be considered for compilation cache):
@@ -99,14 +101,17 @@ This time, Inductor compilation is completely bypassed, and we will load from di
 
 The above example just uses Inductor to compile for a general shape (i.e. symbolic shape). We can also use Inductor to compile for some of the specific shapes, for example:
 
-```
-vllm serve meta-llama/Llama-3.2-1B --compilation_config '{"compile_sizes": [1, 2, 4, 8]}'
+```bash
+vllm serve meta-llama/Llama-3.2-1B \
+--compilation_config '{"compile_sizes": [1, 2, 4, 8]}'
 ```
 
 Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At this time, all of the shapes in the computation graph are static and known, and we will turn on auto-tuning to tune for max performance. This can be slow when you run it for the first time, but the next time you run it, we can directly bypass the tuning and run the tuned kernel.
 
 When all the shapes are known, `torch.compile` can compare different configs, and often find some better configs to run the kernel. For example, we can see the following log:
 
+??? Logs
+
 ```
 AUTOTUNE mm(8x2048, 2048x3072)
 triton_mm_4 0.0130 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
@@ -136,8 +141,9 @@ The cudagraphs are captured and managed by the compiler backend, and replayed wh
 
 By default, vLLM will try to determine a set of sizes to capture cudagraph. You can also override it using the config `cudagraph_capture_sizes`:
 
-```
-vllm serve meta-llama/Llama-3.2-1B --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
+```bash
+vllm serve meta-llama/Llama-3.2-1B \
+--compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}'
 ```
 
 Then it will only capture cudagraph for the specified sizes. It can be useful to have fine-grained control over the cudagraph capture.
@@ -55,7 +55,7 @@ STDOUT of the console in JSON format with a log level of `INFO`.
 
 To begin, first, create an appropriate JSON logging configuration file:
 
-**/path/to/logging_config.json:**
+??? note "/path/to/logging_config.json"
 
 ```json
 {
@@ -104,7 +104,7 @@ configuration overrides the built-in default logging configuration used by vLLM.
 First, create an appropriate JSON logging configuration file that includes
 configuration for the root vLLM logger and for the logger you wish to silence:
 
-**/path/to/logging_config.json:**
+??? note "/path/to/logging_config.json"
 
 ```json
 {