GIMBench is a benchmarking framework for evaluating Guided Infilling Models (GIM).
This project provides tools and benchmarks to evaluate models' ability to perform guided infilling tasks - generating text that follows specific constraints and patterns.
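To make the task concrete, here is a toy illustration of what "guided" means (plain Python, not GIMBench's API): the model must fill a blank so that the result satisfies an explicit constraint such as a regex.

```python
# Illustration only (not GIMBench's API): a guided-infilling task asks the
# model to fill a blank so that the fill satisfies an explicit constraint.
import re

template = "Invoice number: {blank}, issued 2024-01-15."
constraint = re.compile(r"INV-\d{6}")   # the blank must match this pattern

candidate = "INV-004217"                # a model-proposed fill
assert constraint.fullmatch(candidate), "fill violates the constraint"
print(template.format(blank=candidate))
```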
Install GIMBench using pip:

```bash
pip install gimbench
```

For development:

```bash
make install-dev
```

GIMBench provides several benchmark types:
- CV Parsing: Evaluate models on structured information extraction from CVs
- Regex Matching: Test models' ability to generate text matching specific patterns
- Multiple Choice QA: Assess guided generation in question-answering contexts
- Perplexity: Measure language modeling quality with constraints
- Code Infilling: Evaluate code infilling via unit-test execution (pass@k)
- SciERC Relation Extraction: Evaluate scientific relation extraction on the Hugging Face dataset Sculpt-AI/GIMBench-sci-erc
Run the MMLU-Pro benchmark:

```bash
python -m gimbench.mcqa.mmlu_pro \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1
```
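The vllm model type talks to an OpenAI-compatible server at --base_url. A quick way to confirm the endpoint is live before launching a run (a sketch using the standard openai client; vLLM servers typically accept any placeholder api_key):

```python
# Sanity-check the OpenAI-compatible endpoint passed as --base_url.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list()])  # should list the served model
```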
Run the GPQA Diamond benchmark:

```bash
python -m gimbench.mcqa.gpqa_diamond \
    --model_type openai \
    --model_name gpt-4 \
    --api_key YOUR_API_KEY
```
Run the GIM-SFT perplexity evaluation:

```bash
python -m gimbench.ppl.gim_sft \
    --model_type vllm-offline \
    --model_name meta-llama/Llama-3.1-8B-Instruct
```
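For readers unfamiliar with the metric: perplexity is the exponentiated average negative log-likelihood of the tokens, so lower is better. A quick worked example of the arithmetic (illustrating the metric itself, not GIMBench internals):

```python
# Perplexity from per-token log-probabilities: ppl = exp(-mean(log p)).
import math

token_logprobs = [-0.7, -1.2, -0.3, -2.1]  # log p for each token in a sequence
ppl = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"perplexity = {ppl:.2f}")  # ~2.93 here
```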
Run the HumanEval Infilling benchmark (code generation + unit-test execution, pass@k):

```bash
# GIM-guided infilling (default), pass@1
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1
```
```bash
# Sample 20 completions per problem, report pass@1 and pass@10
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --temperature 0.8 \
    --num_samples 20 \
    --pass_k 1 10
```
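For reference, pass@k here is presumably the unbiased estimator from the HumanEval paper (Chen et al., 2021): with n samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k random draws passes. A minimal sketch of the standard formula:

```python
# Unbiased pass@k estimator (Chen et al., 2021): n samples drawn per problem,
# c of them pass the unit tests; estimate P(at least one of k draws passes).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=10))  # e.g. one problem from a --num_samples 20 run
```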
```bash
# Plain LLM (no GIMKit)
python -m gimbench.code.humaneval_infilling \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --no_gimkit
```
Run the SciERC relation extraction benchmark (Hugging Face dataset):

```bash
python -m gimbench.scierc.scierc \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --scierc_split test
```
```bash
# Plain LLM (no GIMKit)
python -m gimbench.scierc.scierc \
    --model_type vllm \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --base_url http://localhost:8000/v1 \
    --scierc_split dev \
    --no_gimkit
```
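The --scierc_split flag selects a split of the Hub dataset. To inspect the data directly, something like the following should work (assuming the dev/test splits referenced above exist on the Hub):

```python
# Peek at the SciERC benchmark data on the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("Sculpt-AI/GIMBench-sci-erc", split="test")
print(len(ds), ds.column_names)
print(ds[0])  # one relation-extraction example
```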
If you need to rebuild and upload the dataset, use:

```bash
python benchmarks/GIMBench-sci-erc/1_build_dataset.py \
    --raw_dir benchmarks/GIMBench-sci-erc/data/raw_data \
    --repo_id Sculpt-AI/GIMBench-sci-erc \
    --push_to_hub
```
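Pushing with --push_to_hub requires an authenticated Hugging Face session with write access to the Sculpt-AI organization; one way to log in from Python:

```python
# Authenticate with the Hugging Face Hub before pushing the rebuilt dataset.
from huggingface_hub import login

login()  # prompts for an access token; alternatively pass token="hf_..."
```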
Run linting:

```bash
make lint
```

Fix linting issues automatically:

```bash
make lint-fix
```

Run pre-commit hooks:

```bash
make pre-commit
```