GIMBench is a benchmarking framework for evaluating Guided Infilling Models (GIM).
This project provides tools and benchmarks to evaluate models' ability to perform guided infilling tasks - generating text that follows specific constraints and patterns.
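To make the task concrete, here is a toy illustration of what "guided" means (plain Python, not GIMBench's API): the model must fill a blank so that the result satisfies an explicit constraint such as a regex.

```python
# Illustration only (not GIMBench's API): a guided-infilling task asks the
# model to fill a blank so that the fill satisfies an explicit constraint.
import re

template = "Invoice number: {blank}, issued 2024-01-15."
constraint = re.compile(r"INV-\d{6}")   # the blank must match this pattern

candidate = "INV-004217"                # a model-proposed fill
assert constraint.fullmatch(candidate), "fill violates the constraint"
print(template.format(blank=candidate))
```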
Install GIMBench using pip:

```bash
pip install gimbench
```

For development:

```bash
make install-dev
```

GIMBench provides several benchmark types:
- CV Parsing: Evaluate models on structured information extraction from CVs
- Regex Matching: Test models' ability to generate text matching specific patterns
- Multiple Choice QA: Assess guided generation in question-answering contexts
- Perplexity: Measure language modeling quality with constraints
- Code Infilling: Evaluate code infilling via unit-test execution (pass@k)
- SciERC Relation Extraction: Evaluate scientific relation extraction on the Hugging Face dataset Sculpt-AI/GIMBench-sci-erc
Run the MMLU-Pro benchmark:

```bash
python -m gimbench.mcqa.mmlu_pro \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1
```
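The vllm model type talks to an OpenAI-compatible server at --base_url. A quick way to confirm the endpoint is live before launching a run (a sketch using the standard openai client; vLLM servers typically accept any placeholder api_key):

```python
# Sanity-check the OpenAI-compatible endpoint passed as --base_url.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])  # should list the served model
```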
Run the GPQA Diamond benchmark:

```bash
python -m gimbench.mcqa.gpqa_diamond \
    --model_type openai \
    --model_name gpt-4 \
    --api_key YOUR_API_KEY
```
Run the GIM-SFT perplexity evaluation:

```bash
python -m gimbench.ppl.gim_sft \
    --model_type vllm-offline \
    --model_name meta-llama/Llama-3.1-8B-Instruct
```
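For readers unfamiliar with the metric: perplexity is the exponentiated average negative log-likelihood of the tokens, so lower is better. A quick worked example of the arithmetic (illustrating the metric itself, not GIMBench internals):

```python
# Perplexity from per-token log-probabilities: ppl = exp(-mean(log p)).
import math

token_logprobs = [-0.7, -1.2, -0.3, -2.1]  # log p for each token in a sequence
ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"perplexity = {ppl:.2f}")  # ~2.93 here
```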
Run the HumanEval Infilling benchmark (code generation + unit-test execution, pass@k):

```bash
# GIM-guided infilling (default), pass@1
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1
```
```bash
# Sample 20 completions per problem, report pass@1 and pass@10
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --temperature 0.8 \
    --num_samples 20 \
    --pass_k 1 10
```
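For reference, pass@k here is presumably the unbiased estimator from the HumanEval paper (Chen et al., 2021): with n samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k random draws passes. A minimal sketch of the standard formula:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples drawn per problem,
# c of them pass the unit tests; estimate P(at least one of k draws passes).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=10))  # e.g. one problem from a --num_samples 20 run
```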
```bash
# Plain LLM (no GIMKit)
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --no_gimkit
```
Run the SciERC relation extraction benchmark (Hugging Face dataset):

```bash
python -m gimbench.scierc.scierc \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --scierc_split test
```
```bash
# Plain LLM (no GIMKit)
python -m gimbench.scierc.scierc \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --scierc_split dev \
    --no_gimkit
```
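The --scierc_split flag selects a split of the Hub dataset. To inspect the data directly, something like the following should work (assuming the dev/test splits referenced above exist on the Hub):

```python
# Peek at the SciERC benchmark data on the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("Sculpt-AI/GIMBench-sci-erc", split="test")
print(len(ds), ds.column_names)
print(ds[0])  # one relation-extraction example
```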
If you need to rebuild and upload the dataset, use:

```bash
python benchmarks/GIMBench-sci-erc/1_build_dataset.py \
    --raw_dir benchmarks/GIMBench-sci-erc/data/raw_data \
    --repo_id Sculpt-AI/GIMBench-sci-erc \
    --push_to_hub
```
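Pushing with --push_to_hub requires an authenticated Hugging Face session with write access to the Sculpt-AI organization; one way to log in from Python:

```python
# Authenticate with the Hugging Face Hub before pushing the rebuilt dataset.
from huggingface_hub import login

login()  # prompts for an access token; alternatively pass token="hf_..."
```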
Run linting:

```bash
make lint
```

Fix linting issues automatically:

```bash
make lint-fix
```

Run pre-commit hooks:

```bash
make pre-commit
```