mirror of
https://git.datalinker.icu/vllm-project/vllm.git
synced 2025-12-10 04:26:00 +08:00
Signed-off-by: Didier Durand <durand.didier@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
130 lines
7.8 KiB
Markdown
130 lines
7.8 KiB
Markdown
# DeepSeek DeepGEMM Kernels Benchmark
|
|
|
|
This directory includes benchmarks between DeepSeek's DeepGEMM block fp8 kernels against vLLM's existing triton and CUTLASS-based kernels.
|
|
|
|
Currently, this just includes dense GEMMs and only works on Hopper GPUs.
|
|
|
|
## Setup
|
|
|
|
You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory:
|
|
|
|
```bash
|
|
git clone --recursive https://github.com/deepseek-ai/DeepGEMM
|
|
cd DeepGEMM
|
|
python setup.py install
|
|
uv pip install -e .
|
|
```
|
|
|
|
## Usage
|
|
|
|
```console
|
|
python benchmark_fp8_block_dense_gemm.py
|
|
INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda.
|
|
===== STARTING FP8 GEMM BENCHMARK =====
|
|
PyTorch version: 2.5.1+cu124
|
|
CUDA version: 12.4
|
|
Triton version: 3.1.0
|
|
Using device: NVIDIA H100 80GB HBM3
|
|
WARNING 02-26 21:55:15 [fp8_utils.py:458] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=4096,K=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json
|
|
INFO 02-26 21:55:15 [fp8_utils.py:449] Using configuration from /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=7168,K=18432,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
|
|
WARNING 02-26 21:55:16 [fp8_utils.py:458] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=18432,K=7168,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json
|
|
WARNING 02-26 21:55:17 [fp8_utils.py:458] Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal! Config file not found at /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=24576,K=1536,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json
|
|
INFO 02-26 21:55:17 [fp8_utils.py:449] Using configuration from /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=32768,K=512,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
|
|
INFO 02-26 21:55:17 [fp8_utils.py:449] Using configuration from /home/mgoin/code/vllm/vllm/model_executor/layers/quantization/utils/configs/N=7168,K=16384,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8,block_shape=[128,128].json for W8A8 Block FP8 kernel.
|
|
|
|
===== PERFORMANCE COMPARISON =====
|
|
|
|
DeepGEMM Implementation:
|
|
+------+-------+-------+-----------+--------+--------+
|
|
| m | n | k | Time (μs) | TFLOPS | GB/s |
|
|
+------+-------+-------+-----------+--------+--------+
|
|
| 8 | 4096 | 7168 | 102.9 | 4.6 | 286.4 |
|
|
| 8 | 7168 | 18432 | 70.8 | 29.8 | 1868.8 |
|
|
| 8 | 18432 | 7168 | 69.3 | 30.5 | 1911.8 |
|
|
| 64 | 4096 | 7168 | 69.1 | 54.4 | 439.0 |
|
|
| 64 | 7168 | 18432 | 69.4 | 243.6 | 1933.6 |
|
|
| 64 | 18432 | 7168 | 70.4 | 240.3 | 1917.2 |
|
|
| 64 | 24576 | 1536 | 70.1 | 68.9 | 584.6 |
|
|
| 64 | 32768 | 512 | 68.4 | 31.4 | 307.1 |
|
|
| 64 | 7168 | 16384 | 69.5 | 216.3 | 1718.5 |
|
|
| 128 | 4096 | 7168 | 141.1 | 53.3 | 222.1 |
|
|
| 128 | 7168 | 18432 | 71.9 | 470.5 | 1896.1 |
|
|
| 128 | 18432 | 7168 | 69.3 | 488.2 | 1988.2 |
|
|
| 1024 | 4096 | 7168 | 89.7 | 670.1 | 502.5 |
|
|
| 1024 | 18432 | 7168 | 279.0 | 969.8 | 635.2 |
|
|
| 2048 | 4096 | 7168 | 175.1 | 687.0 | 347.4 |
|
|
| 4096 | 4096 | 7168 | 335.4 | 717.0 | 275.1 |
|
|
+------+-------+-------+-----------+--------+--------+
|
|
|
|
vLLM Triton Implementation:
|
|
+------+-------+-------+-----------+--------+--------+--------------+
|
|
| m | n | k | Time (μs) | TFLOPS | GB/s | vs DeepGEMM |
|
|
+------+-------+-------+-----------+--------+--------+--------------+
|
|
| 8 | 4096 | 7168 | 74.0 | 6.3 | 398.2 | 1.39x faster |
|
|
| 8 | 7168 | 18432 | 89.6 | 23.6 | 1478.1 | 0.79x slower |
|
|
| 8 | 18432 | 7168 | 113.2 | 18.7 | 1170.4 | 0.61x slower |
|
|
| 64 | 4096 | 7168 | 79.4 | 47.3 | 382.2 | 0.87x slower |
|
|
| 64 | 7168 | 18432 | 98.5 | 171.7 | 1363.0 | 0.70x slower |
|
|
| 64 | 18432 | 7168 | 119.5 | 141.5 | 1129.4 | 0.59x slower |
|
|
| 64 | 24576 | 1536 | 37.6 | 128.4 | 1089.7 | 1.86x faster |
|
|
| 64 | 32768 | 512 | 38.7 | 55.5 | 542.6 | 1.77x faster |
|
|
| 64 | 7168 | 16384 | 86.1 | 174.5 | 1386.4 | 0.81x slower |
|
|
| 128 | 4096 | 7168 | 90.7 | 82.9 | 345.4 | 1.56x faster |
|
|
| 128 | 7168 | 18432 | 144.0 | 234.9 | 946.9 | 0.50x slower |
|
|
| 128 | 18432 | 7168 | 229.5 | 147.4 | 600.1 | 0.30x slower |
|
|
| 1024 | 4096 | 7168 | 242.3 | 248.2 | 186.1 | 0.37x slower |
|
|
| 1024 | 18432 | 7168 | 897.8 | 301.4 | 197.4 | 0.31x slower |
|
|
| 2048 | 4096 | 7168 | 463.0 | 259.7 | 131.4 | 0.38x slower |
|
|
| 4096 | 4096 | 7168 | 901.8 | 266.7 | 102.3 | 0.37x slower |
|
|
+------+-------+-------+-----------+--------+--------+--------------+
|
|
|
|
vLLM CUTLASS Implementation:
|
|
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|
|
| m | n | k | Time (μs) | TFLOPS | GB/s | vs DeepGEMM | vs Triton |
|
|
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|
|
| 8 | 4096 | 7168 | 34.6 | 13.6 | 852.3 | 2.98x faster | 2.14x faster |
|
|
| 8 | 7168 | 18432 | 78.9 | 26.8 | 1677.3 | 0.90x slower | 1.13x faster |
|
|
| 8 | 18432 | 7168 | 81.2 | 26.0 | 1631.1 | 0.85x slower | 1.39x faster |
|
|
| 64 | 4096 | 7168 | 36.9 | 101.9 | 822.9 | 1.87x faster | 2.15x faster |
|
|
| 64 | 7168 | 18432 | 87.4 | 193.4 | 1535.2 | 0.79x slower | 1.13x faster |
|
|
| 64 | 18432 | 7168 | 85.0 | 199.0 | 1587.6 | 0.83x slower | 1.41x faster |
|
|
| 64 | 24576 | 1536 | 28.0 | 172.8 | 1465.8 | 2.51x faster | 1.35x faster |
|
|
| 64 | 32768 | 512 | 28.8 | 74.5 | 728.5 | 2.37x faster | 1.34x faster |
|
|
| 64 | 7168 | 16384 | 77.9 | 193.0 | 1532.8 | 0.89x slower | 1.11x faster |
|
|
| 128 | 4096 | 7168 | 39.1 | 192.4 | 802.0 | 3.61x faster | 2.32x faster |
|
|
| 128 | 7168 | 18432 | 93.7 | 360.8 | 1454.2 | 0.77x slower | 1.54x faster |
|
|
| 128 | 18432 | 7168 | 85.7 | 394.8 | 1608.0 | 0.81x slower | 2.68x faster |
|
|
| 1024 | 4096 | 7168 | 99.7 | 603.1 | 452.2 | 0.90x slower | 2.43x faster |
|
|
| 1024 | 18432 | 7168 | 331.3 | 816.7 | 534.9 | 0.84x slower | 2.71x faster |
|
|
| 2048 | 4096 | 7168 | 198.3 | 606.6 | 306.7 | 0.88x slower | 2.34x faster |
|
|
| 4096 | 4096 | 7168 | 392.2 | 613.2 | 235.3 | 0.86x slower | 2.30x faster |
|
|
+------+-------+-------+-----------+--------+--------+--------------+--------------+
|
|
|
|
===== AVERAGE PERFORMANCE =====
|
|
+----------------+------------+----------+---------------+
|
|
| Implementation | Avg TFLOPS | Avg GB/s | Avg Time (ms) |
|
|
+----------------+------------+----------+---------------+
|
|
| DeepGEMM | 310.98 | 1052.10 | 0.11 |
|
|
| vLLM Triton | 144.30 | 715.60 | 0.23 |
|
|
| vLLM CUTLASS | 286.78 | 1076.67 | 0.11 |
|
|
+----------------+------------+----------+---------------+
|
|
|
|
===== AVERAGE SPEEDUPS =====
|
|
+-----------------------------+--------------+
|
|
| Comparison | Speedup |
|
|
+-----------------------------+--------------+
|
|
| DeepGEMM vs vLLM Triton | 1.71x faster |
|
|
| DeepGEMM vs vLLM CUTLASS | 0.94x slower |
|
|
| vLLM CUTLASS vs vLLM Triton | 1.84x faster |
|
|
+-----------------------------+--------------+
|
|
|
|
===== ACCURACY COMPARISON =====
|
|
+----------------+-----------------------+
|
|
| Implementation | Avg Diff vs Reference |
|
|
+----------------+-----------------------+
|
|
| DeepGEMM | 0.000684 |
|
|
| vLLM Triton | 0.000684 |
|
|
| vLLM CUTLASS | 0.000684 |
|
|
+----------------+-----------------------+
|
|
```
|