[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2026-06-08 23:01:23 +08:00 · 2025-02-04 19:58:22 -08:00
commit 64862d106e
parent b3a0d01e45
1 changed file with 1 addition and 1 deletion
--- a/vllm/attention/ops/prefix_prefill.py
+++ b/vllm/attention/ops/prefix_prefill.py
@@ -11,7 +11,7 @@ from vllm.platforms import current_platform
 # Static kernels parameters
 BASE_BLOCK = 128 if current_platform.has_device_capability(80) else 64
-NUM_WARPS = 8
+NUM_WARPS = 4 if current_platform.is_rocm() else 8
 # To check compatibility
 IS_TURING = current_platform.get_device_capability() == (7, 5)
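
Note on the change: the sketch below shows what the new ternary does to the per-program thread count. It is a standalone illustration, not vLLM code. `select_num_warps`, `threads_per_program`, and the two warp-size constants are hypothetical names. The warp widths are assumptions: 64 for an AMD CDNA wavefront, 32 for an NVIDIA warp. The commit message gives reduced register spilling as the motivation; the arithmetic here only shows the thread counts and is not claimed to be the author's reasoning.

```python
# Hypothetical stand-alone sketch; none of these names exist in vLLM.

def select_num_warps(is_rocm: bool) -> int:
    """Mirror the ternary from the diff: 4 warps on ROCm, 8 elsewhere."""
    return 4 if is_rocm else 8


def threads_per_program(num_warps: int, warp_size: int) -> int:
    """Threads in one kernel program instance: warps times warp width."""
    return num_warps * warp_size


# Assumed warp widths (not stated in the commit).
ROCM_WARP_SIZE = 64  # wavefront width on AMD CDNA GPUs
CUDA_WARP_SIZE = 32  # warp width on NVIDIA GPUs

rocm_threads = threads_per_program(select_num_warps(True), ROCM_WARP_SIZE)
cuda_threads = threads_per_program(select_num_warps(False), CUDA_WARP_SIZE)
print(rocm_threads, cuda_threads)  # 256 256
```

Under these assumptions, the previous value of 8 would give 512 threads per program on ROCm, twice the NVIDIA figure.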