---
title: BitBLAS
---
[](){ #bitblas }

vLLM now supports BitBLAS for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.

!!! note
    Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
    Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
    For details see supported hardware.

Below are the steps to utilize BitBLAS with vLLM.

```console
pip install "bitblas>=0.1.0"
```
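
You can optionally verify the installation from the command line (this assumes `bitblas` exposes a `__version__` attribute, as most packages do):

```console
python -c "import bitblas; print(bitblas.__version__)"
```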

vLLM reads the model's config file and supports pre-quantized checkpoints.

You can find pre-quantized BitBLAS and GPTQ models on the Hugging Face Hub.

Usually, these repositories have a `quantize_config.json` file that includes a `quantization_config` section.
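
If you want to inspect those settings before loading a model, one option is to download the file directly from the Hub. A minimal sketch, assuming the repository ships a `quantize_config.json` (the repo id is the same example checkpoint used below):

```python
import json

from huggingface_hub import hf_hub_download

# Download quantize_config.json from the example checkpoint and print its contents.
config_path = hf_hub_download(
    repo_id="hxbgsyxh/llama-13b-4bit-g-1-bitblas",
    filename="quantize_config.json",
)
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=2))
```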

## Read bitblas format checkpoint

??? code

    ```python
    from vllm import LLM
    import torch

    # "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
    model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
    llm = LLM(
        model=model_id,
        dtype=torch.bfloat16,
        trust_remote_code=True,
        quantization="bitblas"
    )
    ```
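
Once loaded, the model behaves like any other vLLM `LLM` instance. A minimal generation sketch (the prompt and sampling settings below are arbitrary examples):

```python
from vllm import SamplingParams

# Arbitrary prompt and sampling settings, just to exercise the quantized model.
prompts = ["Explain what weight-only quantization does in one sentence."]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```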

## Read gptq format checkpoint

??? code

    ```python
    from vllm import LLM
    import torch

    # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
    model_id = "hxbgsyxh/llama-13b-4bit-g-1"
    llm = LLM(
        model=model_id,
        dtype=torch.float16,
        trust_remote_code=True,
        quantization="bitblas",
        max_model_len=1024
    )
    ```
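
The same options can also be passed on the command line if you prefer to serve the model through vLLM's OpenAI-compatible server instead of the Python API. A sketch using the example checkpoint above:

```console
vllm serve hxbgsyxh/llama-13b-4bit-g-1 \
    --quantization bitblas \
    --dtype float16 \
    --max-model-len 1024 \
    --trust-remote-code
```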