VERGE Paper (link will be added upon acceptance)
Figure: VERGE dataset generation process
This repository contains the implementation of VERGE, a verification-enhanced methodology for generating multi-hop datasets to evaluate Retrieval-Augmented Generation (RAG) systems. VERGE addresses significant methodological gaps in existing RAG evaluation frameworks by generating task-specific, multi-hop reasoning datasets.
- Verification Agent: Ensures generated questions necessitate genuine multi-hop reasoning and maintain factual consistency
- Hierarchical Error Taxonomy: Structured analysis of RAG system failure patterns specifically in multi-hop reasoning contexts
src/
├── Chunker/ # Document chunking scripts
├── Data/ # Dataset download scripts
├── ExamProcesser/ # Post-generation exam processing
├── LLMServer/ # Local LLM inference wrappers (llama.cpp)
├── Solver/ # RAG and closed-book exam solvers
├── categorise_errors.py # Error pattern categorisation
├── generate_exam.py # Main dataset generation pipeline
├── prompt_template.py # Prompt templates for generation, verification, and evaluation
└── retriever.py # Hybrid BM25 + dense retriever
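The `retriever.py` entry above describes a hybrid BM25 + dense retriever. This README does not say how the two signals are combined. One common approach is to normalise each score list and take a weighted sum, and the sketch below illustrates only that idea. The names `min_max` and `fuse_scores`, the weight `alpha`, and the toy scores are hypothetical and are not taken from `retriever.py`.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalised sparse and dense scores (assumed scheme).

    alpha weights the BM25 side and (1 - alpha) the dense side.
    A chunk missing from one list contributes 0 for that side.
    """
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    doc_ids = set(bm25_n) | set(dense_n)
    fused = {
        d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for three chunks (made-up numbers, not real retriever output).
bm25_scores = {"chunk_a": 12.0, "chunk_b": 4.0, "chunk_c": 8.0}
dense_scores = {"chunk_a": 0.20, "chunk_b": 0.90, "chunk_c": 0.76}
ranking = fuse_scores(bm25_scores, dense_scores, alpha=0.5)
print([doc_id for doc_id, _ in ranking])
```

In this toy case `chunk_c` ranks first because it scores reasonably well on both signals, whereas each of the other chunks is strong on only one. Other fusion schemes, such as reciprocal rank fusion, would be equally plausible here.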
- Python 3.10+
- GPU recommended for local model inference (llama.cpp supports CPU-only at reduced speed)
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python -m nltk.downloader punkt_tab
All scripts should be run from the project root with src/ on the Python path:
export PYTHONPATH=src
python src/Data/long_bench_downloader.py
python src/Data/download_documents_sec_filings.py
python src/Chunker/document_chunker.py
python src/generate_exam.py \
--task_domain gov_report \
--model_name llama_3_2_3b \
--sample_size 700 \
--target_hop_number 176 \
--version v1
Supported --model_name values: llama_3_2_3b, llama_3_1_8b, gemma2_9b, ministral_8b, mistral_7b
Supported --task_domain values: gov_report, hotpotqa, multifieldqa_en, SecFilings, wiki
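When launching many generation runs, it can help to check the arguments against the supported values before starting a long job. The sketch below builds the `generate_exam.py` command line from the flags shown above and rejects unsupported values early. The helper name `build_generate_command` is hypothetical, and whether `generate_exam.py` performs the same validation itself is not stated in this README.

```python
import shlex
from typing import List

# Values copied from the supported-value lists in this README.
SUPPORTED_MODELS = {
    "llama_3_2_3b", "llama_3_1_8b", "gemma2_9b", "ministral_8b", "mistral_7b",
}
SUPPORTED_DOMAINS = {
    "gov_report", "hotpotqa", "multifieldqa_en", "SecFilings", "wiki",
}


def build_generate_command(
    task_domain: str,
    model_name: str,
    sample_size: int,
    target_hop_number: int,
    version: str,
) -> List[str]:
    """Return the argv list for generate_exam.py (hypothetical helper)."""
    if task_domain not in SUPPORTED_DOMAINS:
        raise ValueError(f"unsupported --task_domain: {task_domain!r}")
    if model_name not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported --model_name: {model_name!r}")
    return [
        "python", "src/generate_exam.py",
        "--task_domain", task_domain,
        "--model_name", model_name,
        "--sample_size", str(sample_size),
        "--target_hop_number", str(target_hop_number),
        "--version", version,
    ]


# Same argument values as the example invocation in this README.
cmd = build_generate_command("gov_report", "llama_3_2_3b", 700, 176, "v1")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root, with `PYTHONPATH=src` set in the environment as described above.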
python src/Solver/solve_exam_rag.py
python src/categorise_errors.py
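The scripts in this README can be chained into a single driver. The sketch below assumes they are meant to run in the order they are listed (download, chunk, generate, solve, categorise) and that each one exits with a non-zero status on failure; neither assumption is confirmed by the README. The `STAGES` list and the `run_pipeline` function are hypothetical, and the generation arguments are the example values shown above.

```python
import os
import subprocess
import sys
from typing import List, Sequence

# Script paths from this README, in the order they are listed (assumed order).
STAGES: List[List[str]] = [
    ["src/Data/long_bench_downloader.py"],
    ["src/Data/download_documents_sec_filings.py"],
    ["src/Chunker/document_chunker.py"],
    [
        "src/generate_exam.py",
        "--task_domain", "gov_report",
        "--model_name", "llama_3_2_3b",
        "--sample_size", "700",
        "--target_hop_number", "176",
        "--version", "v1",
    ],
    ["src/Solver/solve_exam_rag.py"],
    ["src/categorise_errors.py"],
]


def run_pipeline(
    stages: Sequence[Sequence[str]], dry_run: bool = True
) -> List[List[str]]:
    """Build one command per stage and, unless dry_run, execute them in order.

    Execution stops at the first failing stage because check=True raises
    subprocess.CalledProcessError on a non-zero exit status.
    """
    env = {**os.environ, "PYTHONPATH": "src"}  # mirrors `export PYTHONPATH=src`
    commands: List[List[str]] = []
    for stage in stages:
        cmd = [sys.executable, *stage]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True, env=env)
    return commands


# Dry run: only print what would be executed, from the project root.
planned = run_pipeline(STAGES, dry_run=True)
for cmd in planned:
    print("python " + " ".join(cmd[1:]))
```

Calling `run_pipeline(STAGES, dry_run=False)` from the project root would execute the stages for real. A dry run first is a cheap way to confirm the commands before committing to model inference time.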