vllm/csrc at 0ce7b952f8eafdb13a7b6de3af53157c7aae98d4 - vllm - 丝路新云-代码仓

xinyun/vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-07-11 13:17:28 +08:00

History

Cody Yu 5985e3427d

[Kernel] Vectorized FP8 quantize kernel (#5396 )

Inspired by #5146, this PR improves FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmark shows that this improved kernel can achieve 1.0x-1.5x speedup (especially when hidden size is large).

In details, we applied 3 optimizations:

- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 times to improve ILP.
- Use vectorized 4 to transfer data between HBM and SRAM.

2024-06-12 14:07:26 -07:00

..

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

[Kernel] Vectorized FP8 quantize kernel (#5396 )

2024-06-12 14:07:26 -07:00

activation_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

cache_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

cache.h

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

cuda_compat.h

[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927 )

2024-06-02 14:13:26 -07:00

cuda_utils_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

cuda_utils.h

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

custom_all_reduce_test.cu

[CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722 )

2024-05-22 07:18:41 +00:00

custom_all_reduce.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

custom_all_reduce.cuh

[CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722 )

2024-05-22 07:18:41 +00:00

dispatch_utils.h

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

layernorm_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

moe_align_block_size_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

ops.h

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

pos_encoding_kernels.cu

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

reduction_utils.cuh

[Kernel] Dynamic Per-Token Activation Quantization (#5037 )

2024-06-07 09:36:26 -07:00

registration.h

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00

torch_bindings.cpp

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )

2024-06-09 16:23:30 -04:00