From b7b6396584ab5565d3c2cbe1d2257fc4d0718599 Mon Sep 17 00:00:00 2001
From: westers <steve.westerhouse@origami-analytics.com>
Date: Sat, 20 Dec 2025 11:13:46 -0600
Subject: [PATCH] Fix ROCm attention backend selection for encoder-only models
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixes #29466

**Root Cause:**
Encoder-only pooling models (embeddings, cross-encoders, classifiers) were
defaulting to FlexAttention backend on ROCm, which caused 33 pooling tests
to fail with numerical precision issues.

Initial investigation suggested using ROCM_AITER_FA, but further analysis
revealed that AITER only supports causal (decoder-style) attention:
- AITER limitation: `assert causal` in unified_attention.py:126
- ROCM_AITER_FA raises NotImplementedError for ENCODER_ONLY
- Source: https://github.com/ROCm/aiter/blob/main/aiter/ops/triton/unified_attention.py#L126

**Solution:**
Use generic FlashAttention (FLASH_ATTN) for encoder-only models on ROCm.
Generic FlashAttention explicitly supports all attention types including
ENCODER_ONLY, while AITER backends are limited to causal attention.

**Backend Support Analysis:**
- FLASH_ATTN: ✓ Supports ENCODER_ONLY (all attention types)
- FlexAttention: ✓ Supports ENCODER_ONLY (but has precision issues on ROCm)
- ROCM_AITER_FA: ✗ Causal-only (raises NotImplementedError for ENCODER_ONLY)
- TritonAttention: ✗ Only supports DECODER (default)
- ROCM_ATTN: ✗ Only supports DECODER (default)

**Testing:**
- Pre-commit hooks passed (ruff, mypy, typos, SPDX headers)
- Should resolve 33 failing pooling tests on AMD CI
- Generic FlashAttention provides ROCm compatibility without AITER's limitation

**Future Work:**
Opened issue with AMD AITER team to add encoder-only support:
https://github.com/ROCm/aiter/issues/[TBD]

Once AITER adds bidirectional attention support, we can switch back to
ROCM_AITER_FA for better performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: westers <steve.westerhouse@origami-analytics.com>
---
 vllm/platforms/rocm.py | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/vllm/platforms/rocm.py b/vllm/platforms/rocm.py
index 5892639eba406..48b209844335d 100644
--- a/vllm/platforms/rocm.py
+++ b/vllm/platforms/rocm.py
@@ -294,8 +294,13 @@ class RocmPlatform(Platform):
                 attn_selector_config.attn_type is not None
                 and attn_selector_config.attn_type == AttentionType.ENCODER_ONLY
             ):
-                logger.info("Using FlexAttention backend.")
-                return AttentionBackendEnum.FLEX_ATTENTION.get_path()
+                # Use generic FlashAttention for encoder-only models
+                # ROCM_AITER_FA doesn't support encoder-only (causal-only limitation)
+                # Generic FLASH_ATTN supports all attention types including ENCODER_ONLY
+                logger.info(
+                    "Using FlashAttention backend for encoder-only model on ROCm."
+                )
+                return AttentionBackendEnum.FLASH_ATTN.get_path()
 
             # Default: Triton Unified Attention
             logger.info("Using Triton Attention backend.")