Automatic experiment & workflow tracker for ML/AI projects — no manual logging needed. Detects experiments in real time, stores structured records, generates progress reports on demand.
Common pains in ML experimentation:
- Ran dozens of experiments, forgot the parameters and conclusions two weeks later
- Notes scattered across chat logs, terminal output, and memory — impossible to aggregate
- Writing weekly reports means digging through all experiment records from scratch
- Paper CHANGELOGs and experiment logs have inconsistent formats
workflow-tracker solves all of these automatically — detects experiment activity and silently records it. Zero extra effort.

Install:

    npx skills add workflow-tracker -g

Or from a local clone:

    git clone https://github.com/MarkD1Zzz/workflow-tracker.git
    npx skills add ./workflow-tracker -g

Requires: Node.js ≥ 18, Claude Code or compatible agent.

Triggers automatically on these signals, without interrupting your workflow:
- Running training/evaluation scripts
- Metric changes (accuracy, F1, loss, etc.)
- Parameter changes ("change lr to 0.001")
- Verbal experiment conclusions ("tried X, result was Y")
| Project Type | Detection Signal | Output Files |
|---|---|---|
| Engineering | `data/train/`, `train.py`, pipeline | `workflow.json` + `workflow.md` |
| Paper | `tex/`, manuscript, `figures/` | `CHANGELOG.md` + `experiment_log.md` |
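
The directory-based detection in the table can be approximated as follows. This is a minimal sketch, not the skill's actual logic: the function names `detect_project_type` and `output_files`, the marker lists, and the rule that paper markers take precedence are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical marker lists, loosely based on the table above.
PAPER_MARKERS = ("tex", "figures")
ENGINEERING_MARKERS = ("data/train", "train.py")


def detect_project_type(root):
    """Return 'paper', 'engineering', or 'unknown' for a project directory."""
    root = Path(root)
    # Assumption: paper markers win when both kinds are present.
    if any((root / marker).exists() for marker in PAPER_MARKERS):
        return "paper"
    if any((root / marker).exists() for marker in ENGINEERING_MARKERS):
        return "engineering"
    return "unknown"


def output_files(project_type):
    """Map a project type to the output files listed in the table."""
    return {
        "engineering": ["workflow.json", "workflow.md"],
        "paper": ["CHANGELOG.md", "experiment_log.md"],
    }.get(project_type, [])
```

Calling `output_files(detect_project_type("."))` in a repository that contains `train.py` would return the two engineering files under these assumptions.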
Phase → Task → Experiment
Each experiment auto-extracts: Hypothesis / Method / Parameters / Results (with delta) / Conclusion / Tags
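
One way to picture the three-level structure is as nested dataclasses that serialize to the same shape as `workflow.json`. The class names and defaults below are assumptions for illustration; the field names follow the sample record in this README, and `hypothesis` is added because it is listed among the extracted fields.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    date: str
    title: str
    method: str
    params: dict
    results: dict
    conclusion: str
    tags: list = field(default_factory=list)
    hypothesis: str = ""  # assumed optional; not shown in the sample JSON


@dataclass
class Task:
    name: str
    status: str = "in_progress"
    experiments: list = field(default_factory=list)


@dataclass
class Phase:
    name: str
    status: str = "in_progress"
    tasks: list = field(default_factory=list)


# Values copied from the sample workflow.json record in this README.
exp = Experiment(
    date="2026-06-14",
    title="SVM(linear) replaces MLP",
    method="SVC(kernel='linear', C=1, class_weight='balanced')",
    params={"kernel": "linear", "C": 1},
    results={"baseline": 93.75, "new": 94.79, "delta": 1.04},
    conclusion="SUCCESS",
    tags=["classifier", "svm", "breakthrough"],
)
task = Task(name="Task 1.1: Classifier Replacement", status="completed", experiments=[exp])
phase = Phase(name="Phase 1: Accuracy Optimization", tasks=[task])

# asdict() recurses through nested dataclasses, so the result is JSON-ready.
print(json.dumps(asdict(phase), indent=2))
```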
Say "generate report" to produce:
- Paper project: Update CHANGELOG.md + experiment_log.md
- Engineering project: Generate `.docx.json` + `.pptx.json` intermediate format (render with any tool later)
You: Swapped Stage 2 MLP for SVM(linear, C=1). Accuracy: 93.75% → 94.79%. SVM is deterministic.
Claude: Recorded. SVM(linear) → SUCCESS, delta +1.04pp.
→ workflow.json + workflow.md updated
You: Finished attention module ablation. SE 94.2%, CBAM 94.8%, FAA 96.1%.
Claude: Paper project detected.
→ CHANGELOG.md appended with timeline entry
→ experiment_log.md appended with detailed record
You: Generate a progress report for the last two weeks.
Claude: Generated report_20260614.docx.json + report_20260614.pptx.json
Run node render.js or python render.py to produce final files.
{
"project": "Welding Defect Classification",
"updated": "2026-06-14T14:30",
"phases": [{
"name": "Phase 1: Accuracy Optimization",
"status": "in_progress",
"tasks": [{
"name": "Task 1.1: Classifier Replacement",
"status": "completed",
"experiments": [{
"date": "2026-06-14",
"title": "SVM(linear) replaces MLP",
"method": "SVC(kernel='linear', C=1, class_weight='balanced')",
"params": {"kernel": "linear", "C": 1},
"results": {"baseline": 93.75, "new": 94.79, "delta": 1.04},
"conclusion": "SUCCESS",
"tags": ["classifier", "svm", "breakthrough"]
}]
}]
}]
}

## 2026-06-14: Attention Module Ablation
### Background
Comparing SE / CBAM / FAA attention modules on NEU-DET.
### Results
| Module | Accuracy | Delta vs SE |
|--------|----------|-------------|
| SE | 94.2% | baseline |
| CBAM | 94.8% | +0.6pp |
| FAA | 96.1% | +1.9pp |
### Conclusion
FAA significantly outperforms SE and CBAM. Ablation validates the attention redundancy hypothesis.

- Project Type Detection: Scans directory structure (`tex/` → paper, `data/train/` → engineering)
- Signal Detection: Matches experiment keywords + numeric change patterns in conversation
- Batch Writing: Accumulates experiments, writes once per round (avoids excessive IO)
- Delta Auto-Calculation: Computes difference whenever old and new values appear
- Tag Auto-Classification: Assigns tags like `architecture`, `hyperparameter-tuning`, `classifier`, `data-augmentation` based on method type
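
The numeric-change matching, delta calculation, and tag assignment described above might look roughly like this. It is a sketch under stated assumptions: the regular expression, the keyword table, and the names `extract_delta` and `classify_tags` are hypothetical, and the skill's real rules may differ.

```python
import re

# Hypothetical pattern for "<old>% → <new>%" (also accepts "->").
CHANGE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*(?:→|->)\s*(\d+(?:\.\d+)?)\s*%")

# Hypothetical keyword-to-tag table; tag names come from the list above.
TAG_KEYWORDS = {
    "classifier": ("svm", "mlp", "classifier"),
    "hyperparameter-tuning": ("lr", "learning rate", "batch size"),
    "data-augmentation": ("augmentation", "mixup"),
    "architecture": ("attention", "backbone"),
}


def extract_delta(message):
    """Return baseline, new value, and delta in percentage points, or None."""
    match = CHANGE_RE.search(message)
    if match is None:
        return None
    baseline, new = float(match.group(1)), float(match.group(2))
    return {"baseline": baseline, "new": new, "delta": round(new - baseline, 2)}


def classify_tags(message):
    """Return every tag whose keywords appear as whole words in the message."""
    text = message.lower()
    return [
        tag
        for tag, words in TAG_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(word) + r"\b", text) for word in words)
    ]


msg = "Swapped Stage 2 MLP for SVM(linear, C=1). Accuracy: 93.75% → 94.79%."
print(extract_delta(msg))
print(classify_tags(msg))
```

Whole-word matching is used so that a short keyword such as `lr` does not fire inside unrelated words.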
- Deep learning model training & tuning
- Academic paper ablation study management
- GAN/VAE/Diffusion model iteration
- Computer vision classification/detection/segmentation
- Any ML workflow that needs "what was tried → what happened → what it means" tracking
A structured six-step audit pipeline for academic manuscripts. Designed to catch the errors that survive multiple rounds of self-editing: stale counts after table row deletion, figure scripts holding outdated metric values, loss-function mismatches in comparison charts, and narrative claims that drifted out of sync with the data.
What it checks:
| Audit | Target | Example |
|---|---|---|
| Data provenance | Table rows traceable to actual experiments | "Was every ablation row actually run?" |
| Architecture attribution | Component naming matches original source | "Is this module ours or a cited baseline?" |
| Cross-file consistency | Manuscript ↔ figure scripts ↔ experiment logs | F1 values in fig1_scatter() matching Table 2 |
| Benchmark fairness | Same loss function across compared methods | "CE Loss baselines vs. CB Focal Loss ours in one chart?" |
| Stale counts | "N comparisons/runs/variants" after row deletion | \multirow{N} still correct after removing rows |
| Narrative drift | Claims between sections don't contradict | "Backbone-dependent" vs. "universally redundant" |
Workflow: Source verification → impact-surface grep analysis → one-pass batch editing (tex + tables + figure scripts) → residue sweep → script-to-table data cross-audit → memory persistence.
Trigger signals: "verify X", "was this experiment actually run?", "X never existed in this architecture", "do the figures need updating?", "check consistency".
Scope: Auto-detects the active paper project directory. Architecture facts are extracted and cached per project on first run, not hardcoded.
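
The impact-surface grep step can be illustrated with a small search that lists every file and line where a given metric value appears, so that the manuscript, figure scripts, and logs can be edited in one pass. This is only a sketch: the function name `find_value_occurrences` and the default file suffixes are assumptions, not part of the sub-skill.

```python
import re
from pathlib import Path


def find_value_occurrences(root, value, suffixes=(".tex", ".py", ".md")):
    """List (relative path, line number, line) for each line mentioning `value`."""
    # Lookarounds keep "94.2" from matching inside "194.2" or "94.25".
    pattern = re.compile(r"(?<![\d.])" + re.escape(value) + r"(?!\d)")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if pattern.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Running it for an outdated value before editing, and again afterwards as a residue sweep, shows whether any figure script or table still carries the old number.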
workflow-tracker/
├── SKILL.md # Main skill file (Claude Code entry point)
├── SKILL_EN.md # English skill definition
├── README.md # This file (English)
├── README_zh.md # Chinese documentation
├── LICENSE # MIT
├── evals.json # 6 test cases, 25 assertions
├── .gitignore
└── manuscript-check/ # Sub-skill: paper manuscript integrity checker
└── SKILL.md # Six-step verification workflow

Run the evaluation suite:

    cd workspace/iteration-2
    python grade_all.py

| Metric | Value |
|---|---|
| Avg Response Time | 131s |
| Avg Tokens | 27k |
| Pass Rate (6 evals) | 100% |
| Paper Mode | ✓ |
| JSON Intermediate Format | ✓ |
MIT © 2026
Built on the Claude Code Skills framework. Inspired by real-world experiment management needs across computer vision, defect detection, and generative model research.
- Enhanced: `manuscript-check` (script-to-table data cross-audit, benchmark fairness checks, known-pitfalls reference table)
- Fixed: Removed all hardcoded local paths and project-specific facts; replaced with auto-detection and per-project caching
- Improved: README sub-skill documentation with audit-dimension table and concrete examples

- New: `manuscript-check` sub-skill for six-step paper manuscript integrity verification
  - Source-to-manuscript cross-referencing with grep impact analysis
  - Multi-section batch editing (tex + tables + figure scripts)
  - Post-edit residue checking + narrative consistency audit
  - Automatic memory file persistence
- Improved: Bootstrap `CHANGELOG.md` + `experiment_log.md` on first paper project load

- Initial release
- Auto-detect & silent logging for ML experiments
- Dual-mode output: engineering (`workflow.json` + `workflow.md`) / paper (`CHANGELOG.md` + `experiment_log.md`)
- Three-level structure: Phase → Task → Experiment
- Report generation: `.docx.json` + `.pptx.json` intermediate format