When I run the following command:

```
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
CUDA_VISIBLE_DEVICES=0,1 \
vllm serve /data/models/ollama-model/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf \
--tensor-parallel-size 2 \
--port 8132 \
--max-model-len 1024 \
--gpu-memory-utilization 0.7 \
> /data/models/qwq32-q4.log 2>&1
```

I get this error:
The tokenizer class you load from this checkpoint is 'LlamaTokenizer'.
The class this function is called from is 'Qwen2TokenizerFast'.

The tokenizer configuration embedded in the model checkpoint (qwq-32b-q4_k_m.gguf) is LlamaTokenizer, but the code actually instantiates Qwen2TokenizerFast (the Qwen tokenizer). Why does this happen?
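For what it's worth, the comment at the top of the command already hints at a workaround: vLLM can be told to load the tokenizer from the original base model instead of converting the one embedded in the GGUF file. Below is a minimal sketch of that, assuming the matching base model is Qwen/QwQ-32B on the Hugging Face Hub (substitute a local path if the machine has no Hub access) and keeping all other flags as above:

```
# Sketch: pass --tokenizer so vLLM loads the base model's tokenizer directly,
# skipping the slow and reportedly buggy GGUF tokenizer conversion that the
# comment in the original command warns about.
# Assumption: Qwen/QwQ-32B is the correct base model for this GGUF quant.
CUDA_VISIBLE_DEVICES=0,1 \
vllm serve /data/models/ollama-model/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf \
--tokenizer Qwen/QwQ-32B \
--tensor-parallel-size 2 \
--port 8132 \
--max-model-len 1024 \
--gpu-memory-utilization 0.7 \
> /data/models/qwq32-q4.log 2>&1
```

With an explicit `--tokenizer`, the warning should no longer appear, since the tokenizer class is then taken from the base model's config rather than guessed from the GGUF checkpoint's metadata.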