BenchTrace Code

This repository contains the code for BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace consists of a snapshot-reflection dataset of 1,821 annotated episodes spanning six tasks, paired with two evaluation suites: Reflection Evaluation and Evolution Evaluation.

📊 Dataset: huangjh16/BenchTrace on Hugging Face

Repository Structure

benchrogue_code/
├── data_generation/        # Dataset construction pipeline
├── evaluation/
│   ├── reflection/         # Reflection Evaluation
│   └── evolution/          # Evolution Evaluation
└── analysis/               # Analysis and plotting scripts

Note on Jericho game files: Zork and other Infocom game data (ROMs and .qzl saved-game files) are proprietary and are not included in this repository for copyright reasons. To run the Jericho evaluation, obtain the game files separately (see the Jericho project) and place them under evaluation/evolution/jericho/.

Each of the three top-level modules contains a subdirectory per task:

- `jericho`: Jericho text adventure games
- `alfworld`: ALFWorld embodied household tasks
- `babyai`: BabyAI grid-world navigation
- `bundled_web_shopping`: Bundled Web Shopping
- `group_travel_planning`: Group Travel Planning
- `science_world`: ScienceWorld

Requirements

Install dependencies:

pip install -r requirements.txt

An api_key.json file is required in the root directory with the following format:

{"openai": "<your-openai-key>", "anthropic": "<your-anthropic-key>"}
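As a minimal sketch of how a script might read this file, assuming it is passed the path to `api_key.json` and that both keys shown above are present. The helper name `load_api_keys` is hypothetical and is not taken from the repository.

```python
import json
from pathlib import Path


def load_api_keys(path="api_key.json"):
    """Read provider keys from api_key.json (hypothetical helper, not repo code)."""
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fail early if a provider named in the documented format is missing.
    missing = [name for name in ("openai", "anthropic") if name not in keys]
    if missing:
        raise KeyError(f"api_key.json is missing keys: {missing}")
    return keys
```

Calling `load_api_keys()` from the repository root would return a dictionary with the `openai` and `anthropic` entries.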

Modules

1. `data_generation/`

Scripts for constructing the BenchTrace snapshot-reflection dataset.

| Script | Description |
| --- | --- |
| `annotate_episodes.py` | Collect raw episodes from agent runs (Jericho only) |
| `format_draft.py` | Generate rule-based draft annotations from episodes (Jericho only) |
| `ai_annotate.py` | Run AI annotators (Claude Sonnet + Gemini Flash) to produce failure annotations |
| `build_dataset.py` | Assemble the final dataset from AI and human annotations |
| `compute_agreement.py` | Compute inter-annotator agreement between AI annotators |
| `annotation_ui.py` | Web UI for human annotation and adjudication (Jericho only) |
| `prompts/` | Task-specific annotation prompt for the AI annotator |

Usage (example for Jericho):

python data_generation/jericho/ai_annotate.py
python data_generation/jericho/build_dataset.py
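For intuition about the agreement step, here is a minimal sketch of Cohen's kappa, one common agreement statistic for two annotators. This README does not say which statistic `compute_agreement.py` reports, so the choice of kappa, the function name, and the flat label lists are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (illustrative only)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, and 0 indicates agreement no better than chance.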

2. `evaluation/`

`evaluation/reflection/`

Scripts for running and scoring Reflection Evaluation — measuring whether a model can correctly answer detection, localization, and diagnosis questions given an episode snapshot.

| Script | Description |
| --- | --- |
| `run_reflect_task.py` | Run reflection evaluation on a dataset split |
| `score_reflect_task.py` | Compute per-question metrics (Accuracy, Jaccard, Token F1, LLM-Judge) |
| `llm_judge.py` | LLM-as-judge scoring for diagnosis |
| `cascade_analysis.py` | Cascade (funnel) analysis across detection → localization → diagnosis |
| `prompts/` | Task-specific prompts for all three questions |

Usage:

python evaluation/reflection/jericho/run_reflect_task.py --model Qwen/Qwen3-32B
python evaluation/reflection/jericho/score_reflect_task.py
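Two of the metrics named above, Jaccard and Token F1, can be sketched as follows. This is an illustration under stated assumptions, not the code in `score_reflect_task.py`: the function names, the whitespace tokenization, the lowercasing, and the handling of empty inputs are all choices made here.

```python
from collections import Counter


def jaccard(predicted, gold):
    """Set overlap between predicted and gold items (e.g. step indices)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(pred_set & gold_set) / len(pred_set | gold_set)


def token_f1(prediction, reference):
    """Token-level F1 with whitespace tokenization (an assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Score 1.0 only when both sides are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `jaccard([3, 4, 5], [4, 5, 6])` shares two of four distinct items and yields 0.5.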

`evaluation/evolution/`

Scripts for running and scoring Evolution Evaluation — measuring whether an agent avoids a target failure instance after being exposed to relevant past snapshots.

Baselines included:

| Script | Baseline / Purpose |
| --- | --- |
| `run_non_evolution.py` | ReAct (no evolution) |
| `run_naive.py` | Naive (full history concatenation) |
| `run_evotest.py` | EvoTest |
| `run_reflexion.py` | Reflexion |
| `run_remem.py` | ReMemory |
| `run_rag.py` | RAG |
| `run_memrl.py` | MemoryRL |
| `run_autoskill.py` | AutoSkill |
| `run_agentR.py` | Agent-R (Jericho only) |
| `score_evol_eval.py` | Scoring script (not a baseline): compute score and FAR across all results |

Usage:

python evaluation/evolution/jericho/run_evotest.py --game balances --model Qwen/Qwen3-32B
python evaluation/evolution/score_evol_eval.py

3. `analysis/`

Scripts for reproducing figures and analysis results in the paper.

| Script | Description |
| --- | --- |
| `plot_cascade_crossenv.py` | Figure 2: funnel analysis across all six tasks |
| `cascade/plot_cascade_<task>.py` | Per-task funnel breakdown |
| `correlation_analysis.py` | Table 9: FAR conditioned on cumulative reflection correctness |
| `cumulative_funnel.py` | Cumulative funnel statistics |
| `funnel_corr.py` | Correlation between funnel levels and FAR |
| `all_correct_corr.py` | FAR for fully correct reflections |
| `detection_corr.py` | Detection-level correlation analysis |
| `strategy_corr.py` | Strategy failure correlation analysis |

Usage:

python analysis/plot_cascade_crossenv.py
python analysis/correlation_analysis.py
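The cumulative funnel idea behind several of these scripts can be sketched as below: an episode counts at a stage only if that stage and every earlier stage was answered correctly. The record layout (one dictionary of booleans per episode) and the function name are assumptions for illustration; the repository's scripts may store and aggregate results differently.

```python
def cumulative_funnel(records):
    """Fraction of episodes correct up to and including each stage."""
    stages = ("detection", "localization", "diagnosis")
    counts = {stage: 0 for stage in stages}
    for record in records:
        for stage in stages:
            if not record[stage]:
                break  # a miss here removes the episode from later stages
            counts[stage] += 1
    total = len(records)
    return {stage: (counts[stage] / total if total else 0.0) for stage in stages}
```

By construction the three fractions are non-increasing from detection to diagnosis, which is what gives the funnel its shape.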

Citation

@inproceedings{benchrogue2026,
  title     = {BenchTrace: Benchmarking Self-Evolution in LLM Agents},
  booktitle = {Proceedings of ACL},
  year      = {2026}
}
