[Doc] Update ai_accelerator/hpu-gaudi.inc.md (#16956)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
@@ -13,11 +13,11 @@ There are no pre-built wheels or images for this device, so you must build vLLM
- Intel Gaudi accelerator
- Intel Gaudi software version 1.18.0

-Please follow the instructions provided in the [Gaudi Installation
-Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html)
+Please follow the instructions provided in the
+[Gaudi Installation Guide](https://docs.habana.ai/en/latest/Installation_Guide/index.html)
to set up the execution environment. To achieve the best performance,
-please follow the methods outlined in the [Optimizing Training Platform
-Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).
+please follow the methods outlined in the
+[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Configure a new environment

@@ -32,15 +32,13 @@ pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloade
pip list | grep neural # verify that neural_compressor is installed
```

-Refer to [Intel Gaudi Software Stack
-Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
+Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
for more details.
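As an additional, hedged sanity check at this point, the sketch below assumes the `hl-smi` utility shipped with the Gaudi driver and the `habana_frameworks` PyTorch bridge are already installed per the guides above:

```bash
# Driver-level check: should list every Gaudi accelerator visible to the host.
hl-smi

# Bridge-level check: the import must succeed in the active Python environment.
python -c "import habana_frameworks.torch.core as htcore; print('HPU bridge OK')"
```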
### Run Docker Image

It is highly recommended to use the latest Docker image from Intel Gaudi
-vault. Refer to the [Intel Gaudi
-documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers)
+vault. Refer to the [Intel Gaudi documentation](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#pull-prebuilt-containers)
for more details.
Use the following commands to run a Docker image:
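The exact commands appear in the surrounding document; purely as a hedged sketch, a typical pull-and-run sequence looks like the following. The image tag is illustrative and should be matched to the installed Gaudi software release (1.18.0 above) and the desired PyTorch version; the runtime flags follow Intel Gaudi's container documentation.

```bash
# Illustrative tag; pick the image matching your Gaudi software and PyTorch versions from the Intel Gaudi vault.
DOCKER_IMAGE=vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

docker pull ${DOCKER_IMAGE}
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  --cap-add=sys_nice --net=host --ipc=host \
  ${DOCKER_IMAGE}
```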
@@ -278,8 +276,9 @@ Lower value corresponds to less usable graph memory reserved for prefill stage,
:::

Users can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. The strategy affects the order in which graphs are captured. There are two strategies implemented:
-\- `max_bs` - graph capture queue will sorted in descending order by their batch sizes. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1,256)`), default strategy for decode
-\- `min_tokens` - graph capture queue will be sorted in ascending order by the number of tokens each graph processes (`batch_size*sequence_length`), default strategy for prompt

+- `max_bs` - graph capture queue will be sorted in descending order by their batch sizes. Buckets with equal batch sizes are sorted by sequence length in ascending order (e.g. `(64, 128)`, `(64, 256)`, `(32, 128)`, `(32, 256)`, `(1, 128)`, `(1, 256)`), default strategy for decode
+- `min_tokens` - graph capture queue will be sorted in ascending order by the number of tokens each graph processes (`batch_size*sequence_length`), default strategy for prompt

When there is a large number of requests pending, the vLLM scheduler will attempt to fill the maximum decode batch size as soon as possible. When a request finishes, the decode batch size decreases, and vLLM will then attempt to schedule a prefill iteration for requests in the waiting queue to fill the decode batch size back to its previous state. This means that in a full load scenario, the decode batch size is often at its maximum, which makes capturing large batch size HPU Graphs crucial, as reflected by the `max_bs` strategy. On the other hand, prefills will most frequently be executed with very low batch sizes (1-4), which is reflected in the `min_tokens` strategy.
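As a hedged illustration of steering the capture order described above, the sketch below assumes the `VLLM_GRAPH_PROMPT_STRATEGY` and `VLLM_GRAPH_DECODE_STRATEGY` knobs listed under the performance tuning knobs further down this page; the model name is a placeholder.

```bash
# Assumed knob names (see the performance tuning knobs below); accepted values are min_tokens or max_bs.
export VLLM_GRAPH_PROMPT_STRATEGY=min_tokens   # prompt graphs: fewest tokens captured first
export VLLM_GRAPH_DECODE_STRATEGY=max_bs       # decode graphs: largest batch sizes captured first

# Placeholder model name, for illustration only.
vllm serve meta-llama/Llama-2-7b-hf
```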
@@ -326,8 +325,7 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi
- We recommend running inference on Gaudi 2 with `block_size` of 128
for BF16 data type. Using default values (16, 32) might lead to
sub-optimal performance due to Matrix Multiplication Engine
-under-utilization (see [Gaudi
-Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)).
+under-utilization (see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)).
- For max throughput on Llama 7B, we recommend running with batch size
of 128 or 256 and max context length of 2048 with HPU Graphs enabled.
If you encounter out-of-memory issues, see troubleshooting section.
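To make these recommendations concrete, here is a hedged launch sketch; `--dtype`, `--block-size`, `--max-model-len`, and `--max-num-seqs` are standard vLLM engine arguments, and the model name is illustrative only.

```bash
# Illustrative launch with the settings recommended above for Gaudi 2 (BF16, HPU Graphs left enabled).
vllm serve meta-llama/Llama-2-7b-hf \
  --dtype bfloat16 \
  --block-size 128 \
  --max-model-len 2048 \
  --max-num-seqs 256
```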
@@ -336,11 +334,11 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi

**Diagnostic and profiling knobs:**
-- `VLLM_PROFILER_ENABLED`: if `true`, high level profiler will be enabled. Resulting JSON traces can be viewed in [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). Disabled by default.
-- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`: if `true`, will log graph compilations per each vLLM engine step, only when there was any - highly recommended to use alongside `PT_HPU_METRICS_GC_DETAILS=1`. Disabled by default.
-- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`: if `true`, will log graph compilations per each vLLM engine step, always, even if there were none. Disabled by default.
-- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS`: if `true`, will log cpu fallbacks per each vLLM engine step, only when there was any. Disabled by default.
-- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: if `true`, will log cpu fallbacks per each vLLM engine step, always, even if there were none. Disabled by default.
+- `VLLM_PROFILER_ENABLED`: If `true`, enable the high level profiler. Resulting JSON traces can be viewed in [perfetto.habana.ai](https://perfetto.habana.ai/#!/viewer). `false` by default.
+- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION`: If `true`, log graph compilations for each vLLM engine step when any occurs. Highly recommended to use with `PT_HPU_METRICS_GC_DETAILS=1`. `false` by default.
+- `VLLM_HPU_LOG_STEP_GRAPH_COMPILATION_ALL`: If `true`, always log graph compilations for each vLLM engine step even if none occurred. `false` by default.
+- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS`: If `true`, log CPU fallbacks for each vLLM engine step when any occurs. `false` by default.
+- `VLLM_HPU_LOG_STEP_CPU_FALLBACKS_ALL`: If `true`, always log CPU fallbacks for each vLLM engine step even if none occurred. `false` by default.
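A brief, hedged usage sketch for these knobs; the served model name is a placeholder.

```bash
# Enable the high level profiler and per-step graph compilation logging for a single run.
export VLLM_PROFILER_ENABLED=true
export VLLM_HPU_LOG_STEP_GRAPH_COMPILATION=true
export PT_HPU_METRICS_GC_DETAILS=1

# Placeholder model; the resulting JSON traces can be opened in perfetto.habana.ai.
vllm serve meta-llama/Llama-2-7b-hf
```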
**Performance tuning knobs:**
@@ -381,7 +379,7 @@ INFO 08-02 17:38:43 hpu_executor.py:91] init_cache_engine took 37.92 GiB of devi

Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM execution:

-- `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used, if `1` PyTorch Lazy backend for Gaudi will be used, `1` is default
+- `PT_HPU_LAZY_MODE`: if `0`, PyTorch Eager backend for Gaudi will be used; if `1`, PyTorch Lazy backend for Gaudi will be used. `1` is the default.
- `PT_HPU_ENABLE_LAZY_COLLECTIVES`: required to be `true` for tensor parallel inference with HPU Graphs
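As a hedged example of the tensor parallel case, the sketch below combines these two variables with vLLM's standard `--tensor-parallel-size` argument; the model name and parallel degree are placeholders.

```bash
# HPU Graphs with tensor parallel inference require lazy collectives.
export PT_HPU_LAZY_MODE=1                  # PyTorch Lazy backend for Gaudi (the default)
export PT_HPU_ENABLE_LAZY_COLLECTIVES=true

# Placeholder model and tensor parallel degree.
vllm serve meta-llama/Llama-2-70b-hf --tensor-parallel-size 8
```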
## Troubleshooting: tweaking HPU graphs