In the ROCm PagedAttention wmma kernel, when GQA_RATIO == 1, only lane 0
loads valid Q data into the Qlocal registers. Lanes 1-15 retain whatever
stale values their registers previously held. These uninitialized values
then contaminate the wmma (Wave Matrix Multiply-Accumulate) instruction
results, causing subtle numerical accuracy issues.
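The failure mode can be illustrated with a simplified host-side model
(the names mirror the kernel, but the sentinel "stale" value and the
scalar reduction are stand-ins for real register contents and the wmma
accumulation):

```cpp
#include <array>
#include <cassert>

// Simplified host-side model of the lane-predicated Q load.
// With GQA_RATIO == 1 only lane 0 loads real data; the other
// "register" slots keep whatever they held before (simulated
// here with a sentinel garbage value).
constexpr int kLanes = 16;
constexpr int kGqaRatio = 1;

std::array<float, kLanes> load_q(bool zero_unused) {
  std::array<float, kLanes> qlocal;
  qlocal.fill(99.0f);  // stand-in for stale register contents
  for (int lane = 0; lane < kLanes; ++lane) {
    if (lane < kGqaRatio) {
      qlocal[lane] = 1.0f;  // the one valid Q value
    } else if (zero_unused) {
      qlocal[lane] = 0.0f;  // the fix: zero non-loading lanes
    }
  }
  return qlocal;
}

// A reduction over all lanes (as the wmma accumulation effectively
// performs) picks up the stale values unless they were zeroed.
float dot_all_lanes(const std::array<float, kLanes>& q) {
  float acc = 0.0f;
  for (float v : q) acc += v;
  return acc;
}
```

With zeroing, the reduction sees only the valid lane-0 value; without
it, the stale contents of lanes 1-15 leak into the result.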
The bug exists in two locations within paged_attention_ll4_kv_kernel:
1. Lines 1834-1842: _B16x16 Qlocal for 16-bit cache types
2. Lines 2610-2617: _B16x8 Qlocal for 8-bit cache types
Both locations have an `if (lane16id < GQA_RATIO)` block that loads Q data
but lack an `else` clause to zero out Qlocal for non-loading lanes.
The correct pattern already exists elsewhere in the file (lines 1067-1070)
where unused Qlocal slots are explicitly zeroed:
```cpp
} else {
Qlocal[QHLOOP - 1].xy[0] = {0};
Qlocal[QHLOOP - 1].xy[1] = {0};
}
```
This fix adds the missing `else` clauses to zero out Qlocal registers for
lanes that don't load Q data, preventing garbage values from propagating
into the attention score computation.
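A condensed sketch of the patched load pattern (the `B16x16` struct,
`load_qlocal` helper, and sentinel value are hypothetical stand-ins
for the kernel's vector register type and its real Q load; in the
actual kernel the zeroing covers each element of the Qlocal array):

```cpp
#include <cassert>
#include <cstdint>

// Condensed stand-in for the kernel's _B16x16 vector register type.
struct B16x16 {
  uint32_t xy[2];
};

constexpr int kGqaRatio = 1;

// Hypothetical helper mirroring the patched load: lanes that do not
// load Q data now explicitly zero their Qlocal registers instead of
// keeping stale contents.
B16x16 load_qlocal(int lane16id, uint32_t packed_q) {
  B16x16 qlocal{{0xDEADBEEFu, 0xDEADBEEFu}};  // simulate stale registers
  if (lane16id < kGqaRatio) {
    qlocal.xy[0] = packed_q;  // valid Q data for loading lanes
    qlocal.xy[1] = packed_q;
  } else {
    // The fix: zero out Qlocal for non-loading lanes.
    qlocal.xy[0] = 0;
    qlocal.xy[1] = 0;
  }
  return qlocal;
}
```

Every lane now writes Qlocal on every path, so the values fed into
the wmma instruction are fully determined by the kernel's inputs.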
Impact:
- Affects non-GQA models (GQA_RATIO == 1) like Llama-2
- Symptom: Random numerical drift, potential NaNs in softmax
- Fix ensures deterministic behavior across all wave lanes
Signed-off-by: c0de128 <kevin.mckay@outlook.com>