vllm/quantization at 8d75fe48ca5f46b7af0f5201d8500b9604eed769 - vllm

mirror of https://git.datalinker.icu/vllm-project/vllm.git synced 2025-12-31 11:29:41 +08:00

Tyler Michael Smith 8d75fe48ca

[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)

Switch from torch._scaled_mm to vLLM's CUTLASS fp8 kernels when supported; we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.

See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks, and #5144 for comparisons across different GEMM sizes.

2024-06-07 08:42:35 +00:00

compressed_tensors

[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159)

2024-06-03 09:52:30 -07:00

utils

[Kernel] Add marlin_24 unit tests (#4901)

2024-05-19 11:37:34 -04:00

__init__.py

[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)

2024-06-01 14:51:10 -06:00

aqlm.py

[Core] Allow AQLM on Pascal (#5058)