Mirror of https://git.datalinker.icu/vllm-project/vllm.git, synced 2026-03-16 14:17:16 +08:00

[Doc] cleanup TPU documentation and remove outdated examples (#29048)

Signed-off-by: Rob Mulla <rob.mulla@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

This commit is contained in:
parent c7a29d2c8d
commit dd39f91edb
```diff
@@ -24,14 +24,16 @@ nav:
       - deployment/integrations
   - Training: training
   - Configuration:
     - configuration/README.md
     - configuration/*
+    - TPU: https://docs.vllm.ai/projects/tpu/en/latest/
   - Models:
     - models/supported_models.md
     - models/generative_models.md
     - models/pooling_models.md
     - models/extensions
-    - Hardware Supported Models: models/hardware_supported_models
+    - Hardware Supported Models:
+      - models/hardware_supported_models/*
+      - TPU: https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/
   - Features: features
   - Developer Guide:
     - contributing/README.md
```
@@ -1,111 +0,0 @@

# TPU Optimization Tips

This doc serves as a collection of handy tips for optimizing your vLLM on TPU workload.

## Get started

Looking for setup and installation instructions? Find them [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/).

### TPU workload sizing

When selecting the ideal number of chips for a single serving instance, account for both the model size and the average request context length. Adequate HBM for the KV cache is essential to ensure a sufficient number of concurrent requests can be processed.

The following Colab [calculator](https://colab.research.google.com/github/ericehanley/rightsize-vllm/blob/main/HBM_Calculator.ipynb) will tell you:

- KV cache size requirement per token and per request
- TPU/GPU memory consumed by the model weights
- TPU/GPU memory allocated for the KV cache
- Maximum number of requests you can approximately set (`--max-num-seqs`)

This approach serves as a general rule of thumb.
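As a rough sketch of the arithmetic such a calculator performs (the model dimensions and HBM figure below are illustrative assumptions, not measurements of any particular chip or checkpoint):

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2 accounts for the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_num_seqs_estimate(hbm_bytes, weights_bytes, avg_context_len,
                          per_token_bytes, utilization=0.9):
    # HBM left for the KV cache after loading the weights.
    kv_budget = hbm_bytes * utilization - weights_bytes
    per_request = per_token_bytes * avg_context_len
    return int(kv_budget // per_request)


# Illustrative 8B-class model in bf16: 32 layers, 8 KV heads, head_dim 128.
per_tok = kv_cache_bytes_per_token(32, 8, 128)   # 131072 bytes per token
weights = 8e9 * 2                                # ~16 GB of bf16 weights
seqs = max_num_seqs_estimate(32e9, weights, 2048, per_tok)
print(per_tok, seqs)
```

The result is only a starting point for `--max-num-seqs`; actual capacity depends on the runtime's own memory overheads.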
#### Latency-throughput tradeoff

As with rightsizing the number of chips for your workload, consider adjusting `--max-num-seqs` to fine-tune the latency-throughput balance. Decreasing `--max-num-seqs` and/or increasing the number of chips can help reduce latency.

`--max-num-seqs` defines the number of concurrent decode slots, effectively limiting the number of requests the server can process tokens for simultaneously. Increasing this value allows the server to pre-allocate more HBM to handle a higher number of concurrent requests, which can maximize overall throughput. However, this often increases the end-to-end (e2e) latency per request.

Therefore, carefully tuning `--max-num-seqs` is crucial to achieving the desired balance between latency and throughput for your specific workload.

Similarly, `--max-num-batched-tokens` can be adjusted down to improve latency, or up to improve throughput.
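As a concrete illustration of the two directions (the flag values are illustrative starting points, not recommendations, and the model is just an example):

```shell
# Throughput-leaning launch: more decode slots, larger batched-token budget.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-num-seqs 256 \
    --max-num-batched-tokens 1024

# Latency-leaning launch: fewer concurrent decodes, smaller token budget.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-num-seqs 64 \
    --max-num-batched-tokens 512
```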
#### Compilation and Caching

Coming from a GPU background, one of the key differences you'll notice with TPUs is an initial compilation step. TPUs are specialized accelerators (ASICs) that achieve maximum performance by executing pre-compiled, static computation graphs via the XLA compiler. Unlike GPUs, which can handle dynamic input shapes more flexibly, TPUs require a specific compiled graph for each tensor shape (e.g., batch size and sequence length) they process.

To manage this, vLLM performs a one-time "warmup" process when you first launch the server. During this phase, it pre-compiles the model for various common input shapes and saves the compiled graphs to a cache on disk or remote storage (located at `~/.cache/vllm/xla_cache` by default). This process can range anywhere from a few minutes to an hour, depending on the size of the model and the context length used.

Although the first compilation can take some time, all subsequent server launches load these graphs directly from the cache, eliminating the compilation time for future runs.

Use the `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (for example, when autoscaling).
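For example, pointing the cache at shared object storage so freshly scaled-up nodes skip compilation (the bucket path here is a placeholder):

```shell
# Persist compiled XLA graphs to shared storage; the bucket is a placeholder.
export VLLM_XLA_CACHE_PATH=gs://my-bucket/vllm-xla-cache
vllm serve Qwen/Qwen2.5-7B-Instruct
```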
#### Reducing compilation time

The initial compilation time varies significantly and is affected by many of the arguments discussed in this optimization doc. Factors that influence compilation time include model size and `--max-num-batched-tokens`. Other knobs you can tune include `VLLM_TPU_MOST_MODEL_LEN`.

### Optimize based on your data

#### max-model-len vs. most-model-len



If most of your requests are shorter than the maximum model length but you still need to accommodate occasional longer requests, setting a high maximum model length can hurt performance. In these cases, try introducing a "most model length" by setting the `VLLM_TPU_MOST_MODEL_LEN` environment variable.

For example, if 1% of requests are 32k tokens long and 99% are 2k, pass `--max-model-len 32768` and set `VLLM_TPU_MOST_MODEL_LEN=2048`.

Requests are subdivided into max-model-len and most-model-len categories; for the latter category you gain better performance, since the server can process more requests at a time.
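The 32k/2k example above translates into a launch command like this (model name is illustrative):

```shell
# 99% of traffic fits in 2k tokens; still accept the occasional 32k request.
VLLM_TPU_MOST_MODEL_LEN=2048 vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 32768
```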
#### Padding

For online serving with latency requirements, consider switching to bucket padding by setting the `VLLM_TPU_BUCKET_PADDING_GAP` environment variable. Because of the layout of the TPU, try using increments of 128 (e.g., 128, 256, etc.).

The server pads requests to fixed lengths before sending them to the model in order to avoid recompilation. To read more about TPU padding, see [here](https://cloud.google.com/tpu/docs/performance-guide#xla-efficiencies). Currently, there are two ways to pad requests:

1. the default exponential padding (pad to the nearest power of 2)
2. bucket padding (pad to the nearest linearly increasing bucket)

When using bucket padding, the buckets start from 16, end at `max_model_len`, and increment by `VLLM_TPU_BUCKET_PADDING_GAP`.

For example, with `max_model_len=512` and `padding_gap=64`, the buckets are [16, 32, 64, 128, 192, 256, 320, 384, 448, 512].

The fewer padding tokens you add, the less unnecessary computation the TPU does, and the better performance you get. For example, if `num_tokens=300`, exponential padding pads to 512, while the bucket padding above pads to 320.
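The bucketing described above can be sketched as follows (a minimal reconstruction inferred from this doc's description and worked example, not vLLM's actual implementation):

```python
def bucket_padding_buckets(max_model_len, padding_gap):
    """Buckets start at 16, grow exponentially up to the gap,
    then linearly by padding_gap until max_model_len."""
    buckets, size = [], 16
    while size < padding_gap:
        buckets.append(size)
        size *= 2
    size = padding_gap
    while size < max_model_len:
        buckets.append(size)
        size += padding_gap
    buckets.append(max_model_len)
    return buckets


def pad_to(num_tokens, buckets):
    # Pad to the smallest bucket that fits the request.
    return next(b for b in buckets if b >= num_tokens)


buckets = bucket_padding_buckets(512, 64)
print(buckets)               # [16, 32, 64, 128, 192, 256, 320, 384, 448, 512]
print(pad_to(300, buckets))  # 320 (exponential padding would pad to 512)
```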
However, choose the padding gap carefully. If the gap is too small, the number of buckets is large, leading to increased warmup (precompile) time and more memory to store the compiled graphs; too many compiled graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement over the default exponential padding.

#### Quantization

If possible, use the precision that matches the chip's hardware acceleration:

- v5e has int4/int8 hardware acceleration in the MXU
- v6e has int4/int8 hardware acceleration in the MXU

Supported quantized formats and features in vLLM on TPU [Jul '25]:

- INT8 W8A8
- INT8 W8A16
- FP8 KV cache
- [WIP] FP8 W8A8
- [WIP] AWQ
- [WIP] FP4 W4A8
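For instance, one of the pre-quantized INT8 W8A8 checkpoints listed elsewhere in these docs can be served directly (shown here only as an illustration):

```shell
# Pre-quantized INT8 W8A8 checkpoint, matching the int8 MXU acceleration
# on v5e/v6e.
vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8
```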
#### Parallelization

Don't set TP to less than the number of chips on a single-host deployment.

Although it's common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, create an instance with 1 or 4 chips (these are partial-host machine types).

### Tune your workloads

Although we try to provide great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.

### Future Topics We'll Cover

#### Profiling

The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance.

#### SPMD

More details to come.

**Want us to cover something that isn't listed here? Please open an issue and cite this doc. We'd love to hear your questions or tips.**
```diff
@@ -59,20 +59,23 @@ th:not(:first-child) {
 
 ### Feature x Hardware
 
-| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | Intel GPU |
+| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | Intel GPU |
-|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| ------------|
+|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------| ------------|
-| [CP](../configuration/optimization.md#chunked-prefill)    | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [CP](../configuration/optimization.md#chunked-prefill)    | [❌](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [APC](automatic_prefix_caching.md)                        | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [APC](automatic_prefix_caching.md)                        | [❌](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [LoRA](lora.md)                                           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [LoRA](lora.md)                                           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [SD](spec_decode.md)                                      | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26963) |
+| [SD](spec_decode.md)                                      | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [🟠](https://github.com/vllm-project/vllm/issues/26963) |
-| CUDA graph                                                | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
+| CUDA graph                                                | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/26970) |
-| [pooling](../models/pooling_models.md)                    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
+| [pooling](../models/pooling_models.md)                    | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| <abbr title="Encoder-Decoder Models">enc-dec</abbr>       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
+| <abbr title="Encoder-Decoder Models">enc-dec</abbr>       | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
-| [mm](multimodal_inputs.md)                                | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26965) |
+| [mm](multimodal_inputs.md)                                | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | [🟠](https://github.com/vllm-project/vllm/issues/26965) |
-| <abbr title="Logprobs">logP</abbr>                        | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
-| <abbr title="Prompt Logprobs">prmpt logP</abbr>           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
-| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ |
-| multi-step                                                | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ❌ | ✅ |
-| best-of                                                   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
-| beam-search                                               | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
-| [prompt-embeds](prompt_embeds.md)                         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | [❌](https://github.com/vllm-project/vllm/issues/25097) | ✅ |
+| [prompt-embeds](prompt_embeds.md)                         | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ |
+| <abbr title="Logprobs">logP</abbr>                        | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| <abbr title="Prompt Logprobs">prmpt logP</abbr>           | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
+| multi-step                                                | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ✅ |
+| best-of                                                   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| beam-search                                               | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
 
+!!! note
+    For information on feature support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
```
```diff
@@ -43,24 +43,27 @@ th:not(:first-child) {
 }
 </style>
 
-| Implementation        | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | Google TPU |
+| Implementation        | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU |
-|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|--------------|
+|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|
-| AWQ                   | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ |
+| AWQ                   | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
-| GPTQ                  | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ | ❌ |
+| GPTQ                  | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
-| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
-| INT8 (W8A8)           | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ | ✅︎ |
+| INT8 (W8A8)           | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ |
-| FP8 (W8A8)            | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| FP8 (W8A8)            | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
-| BitBLAS               | ✅︎ | ✅ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| BitBLAS               | ✅︎ | ✅ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
-| BitBLAS (GPTQ)        | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| BitBLAS (GPTQ)        | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
-| bitsandbytes          | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| bitsandbytes          | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
-| DeepSpeedFP           | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| DeepSpeedFP           | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
-| GGUF                  | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| GGUF                  | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
-| INC (W8A8)            | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ | ❌ |
+| INC (W8A8)            | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
 
 - Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
 - ✅︎ indicates that the quantization method is supported on the specified hardware.
 - ❌ indicates that the quantization method is not supported on the specified hardware.
 
+!!! note
+    For information on quantization support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
+
 !!! note
     This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
```
@@ -1,34 +0,0 @@

# TPU

## Supported Models

### Text-only Language Models

| Model | Architecture | Supported |
|-----------------------------------------------------|--------------------------------|-----------|
| mistralai/Mixtral-8x7B-Instruct-v0.1 | MixtralForCausalLM | 🟨 |
| mistralai/Mistral-Small-24B-Instruct-2501 | MistralForCausalLM | ✅ |
| mistralai/Codestral-22B-v0.1 | MistralForCausalLM | ✅ |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | MixtralForCausalLM | ❌ |
| meta-llama/Llama-3.3-70B-Instruct | LlamaForCausalLM | ✅ |
| meta-llama/Llama-3.1-8B-Instruct | LlamaForCausalLM | ✅ |
| meta-llama/Llama-3.1-70B-Instruct | LlamaForCausalLM | ✅ |
| meta-llama/Llama-4-* | Llama4ForConditionalGeneration | ❌ |
| microsoft/Phi-3-mini-128k-instruct | Phi3ForCausalLM | 🟨 |
| microsoft/phi-4 | Phi3ForCausalLM | ❌ |
| google/gemma-3-27b-it | Gemma3ForConditionalGeneration | 🟨 |
| google/gemma-3-4b-it | Gemma3ForConditionalGeneration | ❌ |
| deepseek-ai/DeepSeek-R1 | DeepseekV3ForCausalLM | ❌ |
| deepseek-ai/DeepSeek-V3 | DeepseekV3ForCausalLM | ❌ |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 | LlamaForCausalLM | ✅ |
| RedHatAI/Meta-Llama-3.1-70B-Instruct-quantized.w8a8 | LlamaForCausalLM | ✅ |
| Qwen/Qwen3-8B | Qwen3ForCausalLM | ✅ |
| Qwen/Qwen3-32B | Qwen3ForCausalLM | ✅ |
| Qwen/Qwen2.5-7B-Instruct | Qwen2ForCausalLM | ✅ |
| Qwen/Qwen2.5-32B | Qwen2ForCausalLM | ✅ |
| Qwen/Qwen2.5-14B-Instruct | Qwen2ForCausalLM | ✅ |
| Qwen/Qwen2.5-1.5B-Instruct | Qwen2ForCausalLM | 🟨 |

✅ Runs and optimized.
🟨 Runs and correct but not optimized to green yet.
❌ Does not pass accuracy test or does not run.
@@ -1,70 +0,0 @@

# vLLM TPU Profiling

This script is used to profile the TPU performance of vLLM for specific prefill or decode token shapes.

Note: an actual running server is a mix of both prefill of many shapes and decode of many shapes.

We assume you are on a TPU already (this was tested on TPU v6e) and have installed vLLM according to the [Google TPU installation guide](https://docs.vllm.ai/en/latest/getting_started/installation/google_tpu.html).

> In all examples below, we run several warmups beforehand (so `--enforce-eager` is okay).

## Profile Examples

### Generate Prefill Trace

This example runs Qwen/Qwen2.5-7B-Instruct with a single request of 1024 input tokens. This is set up in an attempt to profile just the prefill time and operations.

```bash
export XLA_HLO_DEBUG=1
export MODEL=Qwen/Qwen2.5-7B-Instruct
export VLLM_TPU_PROFILE_DURATION_MS=3000
export VLLM_TPU_PROFILE_DELAY_MS=0

python3 profiling.py \
    --model $MODEL \
    --input-len 1024 --output-len 1 \
    --batch-size 1 --enforce-eager \
    --max-model-len 2048 \
    --tensor-parallel-size 1 \
    --profile-result-dir profiles
```

### Generate Decode Trace

This example runs Llama 3.1 70B with a batch of 32 requests, where each has 1 input token and 128 output tokens. This is set up in an attempt to profile just the 32 decodes running in parallel by having an extremely small prefill of 1 token and setting `VLLM_TPU_PROFILE_DELAY_MS=1000` to skip the first second of inference (hopefully prefill).

```bash
export XLA_HLO_DEBUG=1
export MODEL=meta-llama/Llama-3.1-70B-Instruct
export VLLM_TPU_PROFILE_DURATION_MS=2000
export VLLM_TPU_PROFILE_DELAY_MS=1000

rm -rf ~/.cache/vllm/xla_cache
python3 profiling.py \
    --model $MODEL \
    --input-len 1 \
    --output-len 128 \
    --batch-size 32 \
    --enforce-eager \
    --profile-result-dir profiles \
    --max-model-len 2048 --tensor-parallel-size 8
```

## Visualizing the profiles

Once you have collected your profiles with this script, you can visualize them using [TensorBoard](https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm).

These are most likely the dependencies you need to install:

```bash
pip install tensorflow-cpu \
    tensorboard-plugin-profile \
    etils \
    importlib_resources
```

Then you just need to point TensorBoard to the directory where you saved the profiles and visit `http://localhost:6006/` in your browser:

```bash
tensorboard --logdir profiles/ --port 6006
```
@@ -1,110 +0,0 @@

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
import dataclasses
import os
import time

import numpy as np
import torch_xla.debug.profiler as xp
from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.inputs import PromptType
from vllm.utils.argparse_utils import FlexibleArgumentParser

DURATION_MS = int(os.getenv("VLLM_TPU_PROFILE_DURATION_MS", 3000))
DELAY_MS = int(os.getenv("VLLM_TPU_PROFILE_DELAY_MS", 0))


def main(args: argparse.Namespace):
    print(args)

    engine_args = EngineArgs.from_cli_args(args)
    llm = LLM(**dataclasses.asdict(engine_args))
    server = xp.start_server(9012)  # noqa: F841

    sampling_params = SamplingParams(
        temperature=0.0,
        ignore_eos=True,
        max_tokens=args.output_len,
    )
    print(sampling_params)
    dummy_prompt_token_ids = np.random.randint(
        10000, size=(args.batch_size, args.input_len)
    )
    dummy_prompts: list[PromptType] = [
        {"prompt_token_ids": batch} for batch in dummy_prompt_token_ids.tolist()
    ]

    def run_to_completion():
        start_time = time.perf_counter()
        llm.generate(dummy_prompts, sampling_params=sampling_params, use_tqdm=False)
        end_time = time.perf_counter()
        latency = end_time - start_time
        return latency

    # Warmup
    print("Warming up...")
    warmup_latencies = []
    for _ in tqdm(range(args.num_iters_warmup), desc="Warmup iterations"):
        warmup_latencies.append(run_to_completion())
    print(f"Average warmup latency: {np.mean(warmup_latencies):.4f}s")

    # Profile
    profile_dir = args.profile_result_dir
    print(f"Profiling (results will be saved to '{profile_dir}')...")
    # Enable tracing on server
    xp.trace_detached(
        "localhost:9012", profile_dir, delay_ms=DELAY_MS, duration_ms=DURATION_MS
    )
    if DELAY_MS == 0:
        time.sleep(1.0)
    profile_latencies = []
    for _ in tqdm(range(args.num_iters), desc="Profile iterations"):
        profile_latencies.append(run_to_completion())
    print(f"Average profile latency: {np.mean(profile_latencies):.4f}s")


def parse_args():
    parser = FlexibleArgumentParser(
        description="Benchmark the latency of processing a single batch of "
        "requests till completion."
    )
    parser.add_argument("--input-len", type=int, default=32)
    parser.add_argument("--output-len", type=int, default=128)
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument(
        "--num-iters-warmup",
        type=int,
        default=5,
        help="Number of iterations to run for warmup.",
    )
    parser.add_argument(
        "--num-iters",
        type=int,
        default=1,
        help="Number of iterations to run for profiling.",
    )
    parser.add_argument(
        "--profile-result-dir",
        type=str,
        default="profiles",
        help=(
            "path to save the pytorch profiler output. Can be visualized "
            "with ui.perfetto.dev or Tensorboard "
            "(https://cloud.google.com/tpu/docs/pytorch-xla-performance-profiling-tpu-vm)."
        ),
    )

    parser = EngineArgs.add_cli_args(parser)
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)
```
@@ -1,58 +0,0 @@

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import argparse
import os

from vllm import LLM, SamplingParams

prompts = [
    "A robot may not injure a human being",
    "It is only with the heart that one can see rightly;",
    "The greatest glory in living lies not in never falling,",
]
answers = [
    " or, through inaction, allow a human being to come to harm.",
    " what is essential is invisible to the eye.",
    " but in rising every time we fall.",
]
N = 1
# Currently, top-p sampling is disabled. `top_p` should be 1.0.
sampling_params = SamplingParams(temperature=0, top_p=1.0, n=N, max_tokens=16)


def main():
    parser = argparse.ArgumentParser(description="TPU offline inference example")
    parser.add_argument("--use-spmd", action="store_true", help="Enable SPMD mode")
    args = parser.parse_args()

    llm_args = {
        "model": "Qwen/Qwen2-1.5B-Instruct",
        "max_num_batched_tokens": 64,
        "max_num_seqs": 4,
        "max_model_len": 128,
    }
    if args.use_spmd:
        os.environ["VLLM_XLA_USE_SPMD"] = "1"
        # Can only hardcode the number of chips for now.
        # Calling xr.global_runtime_device_count() before initializing the
        # SPMD env in torch_xla will mess up the distributed env.
        llm_args["tensor_parallel_size"] = 8
        # Use Llama, for num_kv_heads = 8.
        llm_args["model"] = "meta-llama/Llama-3.1-8B-Instruct"

    # Set `enforce_eager=True` to avoid ahead-of-time compilation.
    # In real workloads, `enforce_eager` should be `False`.
    llm = LLM(**llm_args)
    outputs = llm.generate(prompts, sampling_params)
    print("-" * 50)
    for output, answer in zip(outputs, answers):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}\nGenerated text: {generated_text!r}")
        assert generated_text.startswith(answer)
        print("-" * 50)


if __name__ == "__main__":
    main()
```