Replace the hardcoded WARP_SIZE=32 with the architecture-dependent
WARP_SIZE macro from cuda_compat.h so that both Wave64 (MI300X/gfx942)
and Wave32 (Strix Halo/gfx1151) architectures are handled correctly.

The hardcoded value was wrong for AMD CDNA GPUs, which use 64-wide
wavefronts. The current static_assert (kWarpSize >= 4) passes for both
32 and 64, so the mismatch never surfaced at compile time, but
inconsistent WARP_SIZE definitions across the codebase are a
maintenance burden and a potential latent bug.

Changes:
- Include cuda_compat.h to pick up the WARP_SIZE macro
- Replace the local WARP_SIZE constant with kWarpSize from cuda_compat.h
- Update the static_assert and comments to use kWarpSize
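
For illustration, the shape of the change can be sketched as below. This is
a minimal sketch, not the actual header or kernel source: the preprocessor
condition (USE_ROCM, __GFX9__), the derivation of kWarpSize from WARP_SIZE,
and the helper warps_for are assumptions made for the example.

```cpp
// Stand-in for the architecture-dependent definition that cuda_compat.h is
// described as providing. The condition below is an assumption for this
// sketch, not copied from the header.
#if defined(USE_ROCM) && defined(__GFX9__)
  #define WARP_SIZE 64  // Wave64 targets such as gfx942
#else
  #define WARP_SIZE 32  // Wave32 targets and NVIDIA warps
#endif

// One shared definition instead of a local hardcoded 32 (assumed wiring).
static constexpr int kWarpSize = WARP_SIZE;
static_assert(kWarpSize >= 4, "kernel assumes at least 4 lanes per warp");

// Hypothetical helper: warps needed to cover `threads` threads (round up).
constexpr int warps_for(int threads) {
  return (threads + kWarpSize - 1) / kWarpSize;
}
```

With a single definition, any launch math written in terms of kWarpSize
follows the target's wavefront width rather than silently assuming 32.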
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>