westers b7b6396584 Fix ROCm attention backend selection for encoder-only models
Fixes #29466

**Root Cause:**
Encoder-only pooling models (embeddings, cross-encoders, classifiers) were
defaulting to FlexAttention backend on ROCm, which caused 33 pooling tests
to fail with numerical precision issues.

Initial investigation suggested using ROCM_AITER_FA, but further analysis
revealed that AITER only supports causal (decoder-style) attention:
- AITER limitation: `assert causal` in unified_attention.py:126
- ROCM_AITER_FA raises NotImplementedError for ENCODER_ONLY
- Source: https://github.com/ROCm/aiter/blob/main/aiter/ops/triton/unified_attention.py#L126

**Solution:**
Use generic FlashAttention (FLASH_ATTN) for encoder-only models on ROCm.
Generic FlashAttention explicitly supports all attention types including
ENCODER_ONLY, while AITER backends are limited to causal attention.

**Backend Support Analysis:**
- FLASH_ATTN: ✓ Supports ENCODER_ONLY (all attention types)
- FlexAttention: ✓ Supports ENCODER_ONLY (but has precision issues on ROCm)
- ROCM_AITER_FA: ✗ Causal-only (raises NotImplementedError for ENCODER_ONLY)
- TritonAttention: ✗ Only supports DECODER (default)
- ROCM_ATTN: ✗ Only supports DECODER (default)

**Testing:**
- Pre-commit hooks passed (ruff, mypy, typos, SPDX headers)
- Should resolve 33 failing pooling tests on AMD CI
- Generic FlashAttention provides ROCm compatibility without AITER's limitation

**Future Work:**
Opened issue with AMD AITER team to add encoder-only support:
https://github.com/ROCm/aiter/issues/[TBD]

Once AITER adds bidirectional attention support, we can switch back to
ROCM_AITER_FA for better performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Signed-off-by: westers <steve.westerhouse@origami-analytics.com>
2025-12-20 12:00:54 -06:00
..