mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2026-03-19 04:37:08 +08:00
Add get_fp8_min_max() helper in quant_utils.py to centralize the FP8 min/max value logic for ROCm fnuz dtype handling.

On ROCm with torch.float8_e4m3fnuz, using PyTorch's default finfo.max (240.0) causes accuracy issues with dynamic quantization. The correct value for the fnuz dtype is 224.0.

This change:

- Adds a get_fp8_min_max(dtype) helper returning an (fp8_min, fp8_max) tuple
- Updates input_quant_fp8.py to use the helper
- Updates fp8_utils.py per_token_group_quant_fp8() to use the helper
- Updates deep_gemm.py per_block_cast_to_fp8() to use the helper
- Updates tests/kernels/quant_utils.py to use the helper

Fixes #30360

Signed-off-by: c0de128 <kevin.mckay@outlook.com>