Qwen Large Model Application Team, Alibaba
In this project, we provide the official code and released spreadsheet skills for Trace2Skill. Trace2Skill automatically adapts and creates agent skills from execution traces. Instead of updating skills sequentially from individual trajectories, it analyzes a pool of traces in parallel, proposes trajectory-local patches with multiple analysts, and hierarchically consolidates them into a unified, conflict-free skill directory.
The paper studies two evolution modes: skill deepening from an existing human-written skill, and skill creation from scratch from a weak initial draft. In addition to spreadsheet tasks, the paper also studies math reasoning and visual question answering.
Trace2Skill pipeline: trajectory generation -> parallel multi-agent patch proposal -> conflict-free patch consolidation
Clone this repository and install the lightweight runtime dependencies:
git clone https://github.com/Qwen-Applications/Trace2Skill.git
cd Trace2Skill
python -m pip install openai tqdm openpyxl requests diskcacheThe runners use OpenAI-compatible chat APIs by default. Set the API credentials for your provider:
export OPENAI_API_KEY=<your_api_key>
export OPENAI_BASE_URL=<optional_openai_compatible_endpoint>For local OpenAI-compatible serving, pass --api-key EMPTY and --base-url http://localhost:8000/v1 to the analysis or skill-evolution entrypoints.
Prepare a SpreadsheetBench dataset directory or JSONL file and pass it with --data_path when running evaluation. The benchmark runner uses the preloaded spreadsheet skills under spreadsheet_agent/skills/.
We release the top-performing spreadsheet skills referenced in the paper under released_skills/:
| Skill | Setting | Source traces | Location |
|---|---|---|---|
trace2skill-xlsx-35B-combined |
Self-deepen | 35B combined success/error traces | released_skills/trace2skill-xlsx-35B-combined/ |
xlsx-35B |
Self-create | 35B error traces | released_skills/xlsx-35B/ |
trace2skill-xlsx-122B-combined |
Self-deepen | 122B combined success/error traces | released_skills/trace2skill-xlsx-122B-combined/ |
xlsx-122B |
Self-create | 122B error traces | released_skills/xlsx-122B/ |
The runtime skill tree in spreadsheet_agent/skills/ includes the released xlsx-35B and xlsx-122B variants directly. The full paper release set is preserved separately in released_skills/.
From the repository root, run the SpreadsheetBench agent, evaluate outputs, analyze trajectories, and evolve skills with the following entrypoints:
| Workflow | Command | Output |
|---|---|---|
| Run SpreadsheetBench | python run_spreadsheetbench.py --data_path <dataset> --model <model> |
Spreadsheet outputs and optional trajectory logs |
| Evaluate outputs | python evaluate_with_official.py --data_path <dataset> --output_dir <outputs> |
Official-compatible evaluation results |
| Match results and logs | python analyze_results.py --help |
Failure triage records |
| Agentic error analysis | python analysis/run_error_analysis.py --help |
parsed_error_records.json |
| Single-call success analysis | python analysis/run_success_analysis_llm.py --help |
parsed_success_records.json |
| Parse error analysis | python analysis/parse_error_analysis_outputs.py --help |
Parsed error JSON |
| Parse success analysis | python analysis/parse_success_analysis_outputs.py --help |
Parsed success JSON |
| Parallel error-driven skill evolution | python -m skill_evolver.run_parallel_skill_evolution --help |
Updated skill directory |
| Parallel combined skill evolution | python -m skill_evolver.run_parallel_combined_skill_evolution --help |
Updated skill directory |
Example benchmark run:
python run_spreadsheetbench.py \
--data_path <dataset> \
--model <model> \
--output_dir outputs/spreadsheetbench \
--log_dir outputs/logsExample parallel skill evolution from parsed error records:
python -m skill_evolver.run_parallel_skill_evolution \
--input-json <analysis_output_or_parsed_error_records.json> \
--skill-dir spreadsheet_agent/skills/xlsx/ \
--model <model> \
--max-workers 4 \
--save-intermediatesThe skill-evolver entrypoints accept either parsed JSON files or the corresponding analysis output directories directly.
The spreadsheet verified data for reproduction is included under data/spreadsheetbench_verified/spreadsheetbench_verified_400. The commands below cover baseline evaluation, error/success analysis, Trace2Skill evolution, and evolved-skill evaluation. Set MODEL to your OpenAI-compatible served model name or path. The commands use the Qwen reproduction configs in gen_config/; for Gemma runs, replace them with gen_config/gemma4_instruct.json and gen_config/gemma4_thinking.json.
Set common reproduction variables:
DATA_PATH=data/spreadsheetbench_verified/spreadsheetbench_verified_400
MODEL=Qwen3.5-122B-A10B
WORKERS=128
SEED=41
CLI_ONLY_DIR=outputs/reproduce/cli_only_baseline
BASELINE_DIR=outputs/reproduce/baseline_run
GENERATION_CONFIG=gen_config/qwen3.5_35B_122B_instruct_reasoning.json
THINK_GENERATION_CONFIG=gen_config/qwen3.5_35B_122B_thinking_reasoning.jsonRun the cli_only baseline (i.e., no skill; only command line as a tool) on the held-out split (200:400) for comparison:
python run_spreadsheetbench.py \
--data_path "$DATA_PATH" \
--model "$MODEL" \
--agent cli_only \
--log_dir "$CLI_ONLY_DIR/logs" \
--log_format markdown \
--working_dir "$CLI_ONLY_DIR/work" \
--output_dir "$CLI_ONLY_DIR/outputs" \
--max_turns 100 \
--workers "$WORKERS" \
--seeds "$SEED" \
--generation_config "$GENERATION_CONFIG" \
--start_idx 200 \
--end_idx 400
python evaluate_with_official.py \
--data_path "$DATA_PATH" \
--output_dir "$CLI_ONLY_DIR/outputs" \
--verbose \
--start_idx 200 \
--end_idx 400Run the skill-preloaded spreadsheet agent on the training split (0:200) and produce the error/success analysis records:
python run_spreadsheetbench.py \
--data_path "$DATA_PATH" \
--model "$MODEL" \
--agent cli_skill_preloaded \
--log_dir "$BASELINE_DIR/logs" \
--log_format markdown \
--working_dir "$BASELINE_DIR/work" \
--output_dir "$BASELINE_DIR/outputs" \
--max_turns 100 \
--workers "$WORKERS" \
--skills_dir spreadsheet_agent/skills \
--seeds "$SEED" \
--generation_config "$GENERATION_CONFIG" \
--start_idx 0 \
--end_idx 200
python evaluate_with_official.py \
--data_path "$DATA_PATH" \
--output_dir "$BASELINE_DIR/outputs" \
--verbose \
--start_idx 0 \
--end_idx 200
python analyze_results.py \
--eval_results "$BASELINE_DIR/outputs/eval_official_results.json" \
--log_dir "$BASELINE_DIR/logs"
python analysis/run_error_analysis.py \
--data_path "$DATA_PATH" \
--work_dir "$BASELINE_DIR/work" \
--logs_dir "$BASELINE_DIR/logs" \
--output_dir "$BASELINE_DIR/error_analysis" \
--model "$MODEL" \
--workers "$WORKERS" \
--generation_config "$GENERATION_CONFIG" \
--max_turns 100
python analysis/run_success_analysis_llm.py \
--logs_dir "$BASELINE_DIR/logs" \
--output_dir "$BASELINE_DIR/success_analysis" \
--model "$MODEL" \
--max_workers "$WORKERS" \
--generation_config "$THINK_GENERATION_CONFIG"
python analysis/parse_error_analysis_outputs.py \
--input_dir "$BASELINE_DIR/error_analysis" \
--output "$BASELINE_DIR/error_analysis_parsed.json" 2>&1 \
| tee "$BASELINE_DIR/error_analysis_parsed.log"
python analysis/parse_success_analysis_outputs.py \
--input_dir "$BASELINE_DIR/success_analysis" \
--output "$BASELINE_DIR/success_analysis_parsed.json" 2>&1 \
| tee "$BASELINE_DIR/success_analysis_parsed.log"Run Trace2Skill evolution from the parsed training-split error analysis records:
PATCH_FORMAT=json
EVOLVE_ROOT=outputs/reproduce/skill_evolution_seed_${SEED}
EVOLUTION_DIR="$EVOLVE_ROOT/error_driven_skill_evolution"
EVOLVED_RUN_DIR=outputs/reproduce/evolved_run_seed_${SEED}
EVOLVED_SKILLS="$EVOLUTION_DIR/skills"
mkdir -p "$EVOLVED_SKILLS"
cp -r spreadsheet_agent/skills/. "$EVOLVED_SKILLS"
python -m skill_evolver.run_parallel_skill_evolution \
--input-json "$BASELINE_DIR/error_analysis_parsed.json" \
--skill-dir "$EVOLVED_SKILLS/xlsx" \
--model "$MODEL" \
--verbose \
--batch-size 1 \
--changelog "$EVOLUTION_DIR/change.log" \
--save-intermediates \
--intermediates-dir "$EVOLUTION_DIR/intermediates" \
--max-workers "$WORKERS" \
--prompt generic \
--generation-config "$THINK_GENERATION_CONFIG" \
--parse-failure-dir "$EVOLUTION_DIR/parse_failures" \
--patch-pipeline "$PATCH_FORMAT" \
--seed "$SEED"Because the pipeline is long and the ReAct agent can work up to 100 steps, exact 100% reproduction can be difficult even when all random seeds are controlled. The general trend and paper takeaways should still be reproducible. Also, because all skills are modified by LLMs, a single hallucinated edit can damage overall skill robustness. To reduce this risk, run the training-set validation below and verify that the Trace2Skill-evolved skill does not hurt training-set performance compared with the baseline skill before using it for final evaluation.
In the paper, we ran skill evolution plus this training-set validation with three random seeds, then selected the evolved skill with the best training-set validation performance before evaluating it on the held-out split.
For that training-set validation on the evolution split (0:200), compare the baseline and evolved skill eval_official_results.json summaries:
VALIDATION_DIR=outputs/reproduce/validation_train
python run_spreadsheetbench.py \
--data_path "$DATA_PATH" \
--model "$MODEL" \
--log_dir "$VALIDATION_DIR/baseline_logs" \
--working_dir "$VALIDATION_DIR/baseline_work" \
--output_dir "$VALIDATION_DIR/baseline_outputs" \
--max_turns 100 \
--workers "$WORKERS" \
--skills_dir spreadsheet_agent/skills \
--seeds "$SEED" \
--generation_config "$GENERATION_CONFIG" \
--start_idx 0 \
--end_idx 200
python evaluate_with_official.py \
--data_path "$DATA_PATH" \
--output_dir "$VALIDATION_DIR/baseline_outputs" \
--start_idx 0 \
--end_idx 200
python run_spreadsheetbench.py \
--data_path "$DATA_PATH" \
--model "$MODEL" \
--log_dir "$VALIDATION_DIR/evolved_logs" \
--working_dir "$VALIDATION_DIR/evolved_work" \
--output_dir "$VALIDATION_DIR/evolved_outputs" \
--max_turns 100 \
--workers "$WORKERS" \
--skills_dir "$EVOLVED_SKILLS" \
--seeds "$SEED" \
--generation_config "$GENERATION_CONFIG" \
--start_idx 0 \
--end_idx 200
python evaluate_with_official.py \
--data_path "$DATA_PATH" \
--output_dir "$VALIDATION_DIR/evolved_outputs" \
--start_idx 0 \
--end_idx 200After selecting the best seed by training-set validation, evaluate that evolved skill on the held-out split (200:400):
python run_spreadsheetbench.py \
--data_path "$DATA_PATH" \
--model "$MODEL" \
--log_dir "$EVOLVED_RUN_DIR/logs" \
--log_format markdown \
--working_dir "$EVOLVED_RUN_DIR/work" \
--output_dir "$EVOLVED_RUN_DIR/outputs" \
--max_turns 100 \
--workers "$WORKERS" \
--skills_dir "$EVOLVED_SKILLS" \
--seeds "$SEED" \
--generation_config "$GENERATION_CONFIG" \
--start_idx 200 \
--end_idx 400
python evaluate_with_official.py \
--data_path "$DATA_PATH" \
--output_dir "$EVOLVED_RUN_DIR/outputs" \
--start_idx 200 \
--end_idx 400Trace2Skill/
├── README.md
├── analysis/ # Error/success trajectory analysis scripts and prompts
│ ├── parse_error_analysis_outputs.py
│ ├── parse_success_analysis_outputs.py
│ ├── run_error_analysis.py # Agentic error analyst
│ └── run_success_analysis_llm.py # Single-call success analyst
├── released_skills/ # Released paper skill artifacts
│ ├── trace2skill-xlsx-35B-combined/
│ ├── trace2skill-xlsx-122B-combined/
│ ├── xlsx-35B/
│ └── xlsx-122B/
├── skill_evolver/ # Parallel Trace2Skill patch proposal and consolidation
│ ├── run_parallel_skill_evolution.py
│ └── run_parallel_combined_skill_evolution.py
├── spreadsheet_agent/ # SpreadsheetBench agent and runtime skills
│ ├── agents/
│ ├── skills/
│ └── tools/
├── src/react_agent/ # ReAct agent and OpenAI-compatible model clients
├── run_spreadsheetbench.py # SpreadsheetBench runner
├── evaluate_with_official.py # Official-compatible scorer wrapper
├── analyze_results.py # Result/log matching and triage
└── spreadsheetbench_support.py # Shared SpreadsheetBench utilities
Core implementations:
skill_evolver/parallel_evolving_agent.pyskill_evolver/parallel_success_evolving_agent.pyskill_evolver/skill_evolving_agent.pyanalysis/error_analysis_agent.pyspreadsheet_agent/agents/cli_skill_preloaded_agent.py
This repository focuses on the spreadsheet setting and released skills discussed in the paper, while keeping the core Trace2Skill evolution pipeline runnable. We thank the developers and communities behind the tools used by this release:
- Qwen and the Qwen application ecosystem
- SpreadsheetBench for spreadsheet-agent evaluation
- openpyxl for spreadsheet manipulation support
This project is released under the Apache License 2.0. Released skill artifacts may include their own license files where noted.
If you find our work useful in your research, please consider citing our paper:
@misc{ni2026trace2skilldistilltrajectorylocallessons,
title={Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills},
author={Jingwei Ni and Yihao Liu and Xinpeng Liu and Yutao Sun and Mengyu Zhou and Pengyu Cheng and Dexin Wang and Erchao Zhao and Xiaoxi Jiang and Guanjun Jiang},
year={2026},
eprint={2603.25158},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.25158},
}