# GSM8K Accuracy Evaluation

This directory contains a replacement for the lm-eval-harness GSM8K evaluation. It uses a standalone GSM8K script and a vLLM server for better performance and control.

## Usage

### Run tests with pytest (like buildkite)

```bash
pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=configs/models-small.txt
```

### Run standalone evaluation script

```bash
# Start vLLM server first
vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8000

# Run evaluation
python tests/evals/gsm8k/gsm8k_eval.py --port 8000
```
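
For orientation, GSM8K accuracy is usually scored by comparing the last number in the model's completion with the reference answer, which in the dataset follows a `####` marker. The sketch below illustrates that idea only; the function names are hypothetical, and it is not claimed to match the logic inside `gsm8k_eval.py`.

```python
import re
from typing import Optional

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer from a GSM8K solution ('... #### 18')."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(completion: str) -> Optional[str]:
    """Return the last number in a completion, or None if there is none."""
    matches = NUMBER.findall(completion)
    return matches[-1].replace(",", "") if matches else None


def is_correct(completion: str, solution: str) -> bool:
    """Compare predicted and reference answers numerically."""
    predicted = predicted_answer(completion)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))


# Illustrative (made-up) completions paired with reference solutions.
samples = [
    ("She sells 9 eggs at 2 dollars each, so the answer is 18.",
     "9 * 2 = 18\n#### 18"),
    ("The answer is 5", "1 + 2 = 3\n#### 3"),
]
accuracy = sum(is_correct(c, s) for c, s in samples) / len(samples)
```

Taking the last number is a common heuristic because few-shot prompts steer the model to end with its final answer.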

## Configuration Format

Model configs in the `configs/` directory use this YAML format:

```yaml
model_name: "Qwen/Qwen2.5-1.5B-Instruct"
accuracy_threshold: 0.54  # Minimum expected accuracy
num_questions: 1319  # Number of questions (default: full test set)
num_fewshot: 5  # Few-shot examples from train set
server_args: "--max-model-len 4096 --tensor-parallel-size 2"  # Server arguments
env:  # Environment variables (optional)
  VLLM_USE_FLASHINFER_MOE_FP4: "1"
```
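
As a rough illustration of how these fields fit together, the following sketch models one config as a Python dataclass and checks a measured accuracy against `accuracy_threshold`. The class name `EvalConfig`, the `passes` method, and the choice of which fields are required are assumptions made for this example, not the test suite's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict

# Full GSM8K test set size, as noted in the config comment above.
GSM8K_TEST_SET_SIZE = 1319


@dataclass
class EvalConfig:
    """Hypothetical in-memory form of one YAML config file."""

    model_name: str
    accuracy_threshold: float
    num_fewshot: int
    num_questions: int = GSM8K_TEST_SET_SIZE  # default: full test set
    server_args: str = ""
    env: Dict[str, str] = field(default_factory=dict)  # optional

    def __post_init__(self) -> None:
        if not 0.0 <= self.accuracy_threshold <= 1.0:
            raise ValueError("accuracy_threshold must be between 0 and 1")
        if self.num_questions <= 0:
            raise ValueError("num_questions must be positive")

    def passes(self, measured_accuracy: float) -> bool:
        """True if the measured accuracy meets the minimum threshold."""
        return measured_accuracy >= self.accuracy_threshold


# The dict stands in for the result of parsing the YAML file.
config = EvalConfig(**{
    "model_name": "Qwen/Qwen2.5-1.5B-Instruct",
    "accuracy_threshold": 0.54,
    "num_fewshot": 5,
    "server_args": "--max-model-len 4096 --tensor-parallel-size 2",
    "env": {"VLLM_USE_FLASHINFER_MOE_FP4": "1"},
})
```

Here `num_questions` is omitted, so it falls back to the full test set of 1319 questions.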

The `server_args` field accepts any arguments that can be passed to `vllm serve`.

The `env` field accepts a dictionary of environment variables to set for the server process.
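
One plausible way to turn `server_args` and `env` into a server launch is sketched below: the argument string is split with `shlex`, and the `env` entries are layered over the current environment. The helper name `build_server_launch` and its `port` parameter are hypothetical; the real test harness may assemble the command differently.

```python
import os
import shlex
from typing import Dict, List, Optional, Tuple


def build_server_launch(
    model_name: str,
    server_args: str,
    env: Optional[Dict[str, str]] = None,
    port: int = 8000,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, environment) for launching `vllm serve`."""
    argv = ["vllm", "serve", model_name, "--port", str(port)]
    # shlex.split respects shell-style quoting inside the argument string.
    argv += shlex.split(server_args)
    # Start from the current environment and apply the config's overrides.
    environment = os.environ.copy()
    environment.update({key: str(value) for key, value in (env or {}).items()})
    return argv, environment


argv, environment = build_server_launch(
    "Qwen/Qwen2.5-1.5B-Instruct",
    "--max-model-len 4096 --tensor-parallel-size 2",
    env={"VLLM_USE_FLASHINFER_MOE_FP4": "1"},
)
print(shlex.join(argv))
```

The resulting pair could be passed to `subprocess.Popen(argv, env=environment)`, which scopes the variables to the server process rather than the caller.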