diff --git a/docs/configuration/optimization.md b/docs/configuration/optimization.md
index 5564d8a81d937..5c74610ebd290 100644
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@@ -139,9 +139,9 @@ there is relatively little gain from TP. On the other hand, TP incurs significan
 overhead because of all-reduce being performed after every layer.
 
 Given this, it may be advantageous to instead shard the batched input data using TP, essentially
-performing batch-level DP. This has been shown to improve the throughput by around 10% for
+performing batch-level DP. This has been shown to improve the throughput and TTFT by around 10% for
 `tensor_parallel_size=8`. For vision encoders that use hardware-unoptimized Conv3D operations,
-batch-level DP can provide another 40% increase to throughput compared to regular TP.
+batch-level DP can provide another 40% improvement compared to regular TP.
 
 Nevertheless, since the weights of the multi-modal encoder are replicated across each TP rank,
 there will be a minor increase in memory consumption and may cause OOM if you can barely fit the model already.
@@ -172,14 +172,32 @@ Batch-level DP needs to be implemented on a per-model basis,
 and enabled by setting `supports_encoder_tp_data = True` in the model class.
 Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to use this feature.
 
-Known supported models:
+Known supported models (with corresponding benchmarks):
 
-- GLM-4.5V GLM-4.1V ()
+- dots_ocr ()
+- GLM-4.1V or above ()
 - InternVL ()
 - Kimi-VL ()
 - Llama4 ()
 - MiniCPM-V-2.5 or above (, )
-- Qwen2-VL or above (, , )
 - Step3 ()
 
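+As a minimal sketch (the model name below is just an example), batch-level DP
+can be enabled through the Python API as follows:
+
+```python
+from vllm import LLM
+
+# Shard the batched multi-modal inputs across TP ranks (batch-level DP)
+# instead of sharding the encoder weights; see the trade-offs above.
+llm = LLM(
+    model="Qwen/Qwen2.5-VL-7B-Instruct",  # any supported model listed above
+    tensor_parallel_size=8,
+    mm_encoder_tp_mode="data",
+)
+```
+
+For `vllm serve`, the same engine argument is exposed as the `--mm-encoder-tp-mode data` CLI flag.
+
 ## Input Processing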