VERGE: Verification-Enhanced Generation of Multi-Hop Datasets for Evaluating Task-Specific RAG

VERGE Paper (link will be added upon acceptance)

Figure: VERGE dataset generation process

Overview

This repository contains the implementation of VERGE, a verification-enhanced methodology for generating multi-hop datasets to evaluate Retrieval-Augmented Generation (RAG) systems. VERGE generates task-specific datasets whose questions require multi-hop reasoning, a setting that existing RAG evaluation frameworks cover poorly.

Key Features

Verification Agent: Ensures generated questions necessitate genuine multi-hop reasoning and maintain factual consistency
Hierarchical Error Taxonomy: Structured analysis of RAG system failure patterns specifically in multi-hop reasoning contexts
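
To make the idea of "genuine multi-hop reasoning" concrete, here is a minimal sketch of one possible check. It is not the repository's verification agent. It assumes a hypothetical `can_answer(question, context)` judge (in practice this might be an LLM call) and treats a question as multi-hop only if no single chunk suffices while the combined chunks do.

```python
from typing import Callable, Sequence


def requires_multi_hop(
    question: str,
    chunks: Sequence[str],
    can_answer: Callable[[str, str], bool],  # hypothetical judge, e.g. an LLM call
) -> bool:
    """Return True if the question needs more than one chunk to be answered."""
    if len(chunks) < 2:
        return False
    # If any single chunk is enough, the question is effectively single-hop.
    if any(can_answer(question, chunk) for chunk in chunks):
        return False
    # The chunks together must actually support an answer.
    return can_answer(question, "\n".join(chunks))


# Toy judge for illustration only: "answerable" means both facts are present.
def toy_judge(question: str, context: str) -> bool:
    return "Paris" in context and "1889" in context


two_hop = ["The tower is in Paris.", "The tower opened in 1889."]
one_hop = ["The tower in Paris opened in 1889.", "An unrelated sentence."]
question = "In which city is the tower that opened in 1889?"
print(requires_multi_hop(question, two_hop, toy_judge))
print(requires_multi_hop(question, one_hop, toy_judge))
```

The first call accepts the question because the two facts live in different chunks; the second rejects it because one chunk already contains both.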

Repository Structure

src/
├── Chunker/                  # Document chunking scripts
├── Data/                     # Dataset download scripts
├── ExamProcesser/            # Post-generation exam processing
├── LLMServer/                # Local LLM inference wrappers (llama.cpp)
├── Solver/                   # RAG and closed-book exam solvers
├── categorise_errors.py      # Error pattern categorisation
├── generate_exam.py          # Main dataset generation pipeline
├── prompt_template.py        # Prompt templates for generation, verification, and evaluation
└── retriever.py              # Hybrid BM25 + dense retriever
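
retriever.py is described as a hybrid BM25 + dense retriever. How the repository combines the two signals is not stated here. As an illustration of one common option, the sketch below fuses two ranked lists with reciprocal rank fusion (RRF). The chunk ids, the constant `k=60`, and the function name are assumptions for illustration and are not taken from the repository.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk receives sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


bm25_ranking = ["c3", "c1", "c7"]   # hypothetical sparse (BM25) results
dense_ranking = ["c1", "c9", "c3"]  # hypothetical dense (embedding) results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that both retrievers rank highly (`c1`, `c3`) rise to the top, which is the main reason to combine sparse and dense retrieval.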

Requirements

Python 3.10+
GPU recommended for local model inference (llama.cpp supports CPU-only at reduced speed)

Quick Start

Installation

pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt_tab

Setting up the Python path

All scripts should be run from the project root with src/ on the Python path:

export PYTHONPATH=src
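
If imports fail, it can help to confirm that the variable is set as expected. The following is a small sketch, not part of the repository; the helper name `src_on_pythonpath` is hypothetical, and it assumes relative entries are resolved against the project root.

```python
import os
from pathlib import Path


def src_on_pythonpath(env=None, root=None) -> bool:
    """Check whether PYTHONPATH contains the project's src/ directory."""
    env = os.environ if env is None else env
    root = Path.cwd() if root is None else Path(root)
    entries = [e for e in env.get("PYTHONPATH", "").split(os.pathsep) if e]
    target = (root / "src").resolve()
    # Relative entries such as "src" are interpreted relative to the project root.
    return any((root / entry).resolve() == target for entry in entries)


if __name__ == "__main__":
    print("src/ on PYTHONPATH:", src_on_pythonpath())
```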

1. Download data

python src/Data/long_bench_downloader.py
python src/Data/download_documents_sec_filings.py

2. Chunk, embed and index the data

python src/Chunker/document_chunker.py

3. Generate multi-hop datasets with the verification agent

python src/generate_exam.py \
  --task_domain gov_report \
  --model_name llama_3_2_3b \
  --sample_size 700 \
  --target_hop_number 176 \
  --version v1

Supported --model_name values: llama_3_2_3b, llama_3_1_8b, gemma2_9b, ministral_8b, mistral_7b

Supported --task_domain values: gov_report, hotpotqa, multifieldqa_en, SecFilings, wiki
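
The command from step 3 can also be assembled in Python, which makes it easier to loop over domains or models. The sketch below uses only the flags and values listed above. The helper name `build_generate_command` is hypothetical and not part of the repository, and the sketch assumes the script accepts exactly these flags.

```python
import sys

# Values copied from the supported lists above.
SUPPORTED_MODELS = ("llama_3_2_3b", "llama_3_1_8b", "gemma2_9b", "ministral_8b", "mistral_7b")
SUPPORTED_DOMAINS = ("gov_report", "hotpotqa", "multifieldqa_en", "SecFilings", "wiki")


def build_generate_command(task_domain, model_name, sample_size, target_hop_number, version):
    """Return the argv list for src/generate_exam.py (hypothetical helper)."""
    if task_domain not in SUPPORTED_DOMAINS:
        raise ValueError(f"unsupported --task_domain: {task_domain!r}")
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported --model_name: {model_name!r}")
    return [
        sys.executable, "src/generate_exam.py",
        "--task_domain", task_domain,
        "--model_name", model_name,
        "--sample_size", str(sample_size),
        "--target_hop_number", str(target_hop_number),
        "--version", version,
    ]


command = build_generate_command("gov_report", "llama_3_2_3b", 700, 176, "v1")
print(" ".join(command[1:]))
```

To execute the command, pass it to `subprocess.run(command, check=True)` from the project root with `PYTHONPATH=src` set in the environment.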

4. Solve the exam (RAG setting)

python src/Solver/solve_exam_rag.py

5. Categorise error patterns

python src/categorise_errors.py

License

MIT License
