vllm/kernels at 24bb4fe432fffeccf7a27270ee70aff1b1b8a89a - vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2026-08-03 00:27:13 +08:00

[Kernel] Update fused_moe tuning script for FP8 (#4457)

This PR updates the tuning script for the fused_moe kernel to support FP8 and adds configurations for TP4. In the configurations, I removed num_warps and num_stages for small batch sizes. Doing so improved performance and brought the benchmarks in that regime back on par with the earlier numbers, which makes sure this is a strict improvement over the status quo.
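
The config change described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the dictionary layout (batch size mapped to kernel parameters), the block-size values, the helper name strip_launch_params, and the threshold of 16 are all hypothetical, since the PR description does not say which batch sizes count as small. Only the key names num_warps and num_stages come from the text.

```python
# Hypothetical sketch: the layout, values, and threshold below are
# assumptions for illustration, not the configs shipped in this PR.

def strip_launch_params(configs, max_batch_size):
    """Return a copy of configs with num_warps and num_stages removed
    from every entry whose batch size is <= max_batch_size."""
    dropped = ("num_warps", "num_stages")
    result = {}
    for batch_size, params in configs.items():
        if int(batch_size) <= max_batch_size:
            # Small batch: leave these launch parameters unspecified.
            result[batch_size] = {
                key: value for key, value in params.items() if key not in dropped
            }
        else:
            # Larger batch: keep the tuned values unchanged.
            result[batch_size] = dict(params)
    return result


# Illustrative input only; these numbers are made up.
example_configs = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "num_warps": 4, "num_stages": 4},
    "256": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "num_warps": 8, "num_stages": 4},
}

tuned = strip_launch_params(example_configs, max_batch_size=16)
print(tuned)
```

The helper returns a new dictionary rather than editing in place, so the original tuned values stay available for comparison.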

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency

After this PR (with static activation scaling):

qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
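
To make the before/after comparison easier to read, the following sketch computes the relative ITL change at each qps level. The numbers are copied from the two lists above. The variable and function names are mine, not part of the PR.

```python
# ITL in milliseconds keyed by qps, copied from the lists above.
before_itl = {1: 9.8, 2: 9.7, 4: 10.1, 6: 11.9, 8: 14.0, 10: 15.7}
after_itl = {1: 9.8, 2: 9.7, 4: 10.2, 6: 11.9, 8: 11.9, 10: 12.1}


def itl_reduction_pct(before, after):
    """Percentage reduction in ITL per qps level (positive means faster)."""
    return {
        qps: 100.0 * (before[qps] - after[qps]) / before[qps]
        for qps in before
    }


reductions = itl_reduction_pct(before_itl, after_itl)
for qps, pct in reductions.items():
    print(f"qps = {qps}: {pct:+.1f}% ITL reduction")
```

On these figures the gain is concentrated at the higher request rates (qps = 8 and 10), while the low-qps results are unchanged or within 0.1 ms of the earlier numbers.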

2024-05-01 11:47:38 -07:00

benchmark_aqlm.py

[Core]refactor aqlm quant ops (#4351)

2024-04-25 15:03:56 -04:00

benchmark_mixtral_moe.py

[Kernel] Update fused_moe tuning script for FP8 (#4457)

2024-05-01 11:47:38 -07:00

benchmark_paged_attention.py

[Misc] Add indirection layer for custom ops (#3913)

2024-04-10 20:26:07 -07:00

benchmark_rope.py

[CI] Try introducing isort. (#3495)

2024-03-25 07:59:47 -07:00