BitBLAS

vLLM now supports BitBLAS for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS supports a wider range of precision combinations.

Below are the steps to utilize BitBLAS with vLLM.

pip install "bitblas>=0.1.0"
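
To confirm the installation, you can check the installed version from Python. This is a minimal sketch using only the standard library; it does not touch any vLLM API:

# Quick sanity check that the bitblas package is present and new enough.
from importlib.metadata import version

print(version("bitblas"))  # should print 0.1.0 or newer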

vLLM reads the model's config file and supports pre-quantized checkpoints.

You can find pre-quantized models on Hugging Face, for example checkpoints published in BitBLAS or GPTQ format.

Usually, these repositories have a quantize_config.json file that includes a quantization_config section.
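
If you want to inspect that configuration before loading a model, the sketch below downloads just that file from one of the example repositories used on this page. It assumes the huggingface_hub package is installed and that the repository actually ships a quantize_config.json, as described above:

import json
from huggingface_hub import hf_hub_download

# Fetch only the quantization config from the example repository.
config_path = hf_hub_download(
    repo_id="hxbgsyxh/llama-13b-4bit-g-1-bitblas",
    filename="quantize_config.json",
)
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=2))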

Read a BitBLAS-format checkpoint

from vllm import LLM
import torch

# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitblas",
)
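
Once loaded, the model is used like any other vLLM model. The short generation sketch below is illustrative only (the prompt and sampling settings are arbitrary), and the same pattern applies to the GPTQ-format example that follows:

from vllm import SamplingParams

prompts = ["What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate completions with the BitBLAS-quantized model loaded above.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)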

Read a GPTQ-format checkpoint

from vllm import LLM
import torch

# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
llm = LLM(
    model=model_id,
    dtype=torch.float16,
    trust_remote_code=True,
    quantization="bitblas",
    max_model_len=1024,
)