workflow-tracker

Automatic experiment & workflow tracker for ML/AI projects — no manual logging needed. Detects experiments in real time, stores structured records, generates progress reports on demand.

Chinese documentation

Why?

Common pains in ML experimentation:

- Ran dozens of experiments, then forgot the parameters and conclusions two weeks later
- Notes scattered across chat logs, terminal output, and memory, impossible to aggregate
- Writing weekly reports means digging through all experiment records from scratch
- Paper CHANGELOGs and experiment logs have inconsistent formats

workflow-tracker addresses these by detecting experiment activity and recording it silently, with no extra effort on your part.

Install

npx skills add workflow-tracker -g

Or from a local clone:

git clone https://github.com/MarkD1Zzz/workflow-tracker.git
npx skills add ./workflow-tracker -g

Requires: Node.js ≥ 18, Claude Code or compatible agent.

Features

Auto-Detect & Silent Logging

Triggers automatically on these signals, without interrupting your workflow:

- Running training/evaluation scripts
- Metric changes (accuracy, F1, loss, etc.)
- Parameter changes ("change lr to 0.001")
- Verbal experiment conclusions ("tried X, result was Y")

Dual-Mode Output

| Project Type | Detection Signal | Output Files |
|--------------|------------------|--------------|
| Engineering | `data/train/`, `train.py`, `pipeline` | `workflow.json` + `workflow.md` |
| Paper | `tex/`, `manuscript`, `figures/` | `CHANGELOG.md` + `experiment_log.md` |
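
The detection signals in the table can be approximated with a plain directory scan. The following is a minimal sketch, not the skill's actual logic: the function name `detect_project_type`, the `"unknown"` fallback, and the choice to check paper signals before engineering signals are all assumptions made for illustration.

```python
from pathlib import Path

# Signals copied from the table above. How the skill really weighs or
# combines them is not documented here, so this ordering is an assumption.
PAPER_DIRS = ("tex", "figures")
ENGINEERING_PATHS = ("data/train", "train.py")


def detect_project_type(root: str) -> str:
    """Return "paper", "engineering", or "unknown" for a project root."""
    base = Path(root)
    # Lower-cased names of top-level entries, used for the keyword signals.
    names = [p.name.lower() for p in base.iterdir()]

    if any((base / d).is_dir() for d in PAPER_DIRS) or any("manuscript" in n for n in names):
        return "paper"
    if any((base / p).exists() for p in ENGINEERING_PATHS) or any("pipeline" in n for n in names):
        return "engineering"
    return "unknown"
```

A project that holds both `tex/` and `data/train/` is reported as a paper project in this sketch; a real implementation may resolve that case differently.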

Three-Level Structure

Phase → Task → Experiment

Each experiment auto-extracts: Hypothesis / Method / Parameters / Results (with delta) / Conclusion / Tags
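
One way to picture the three levels is as nested records. The sketch below uses Python dataclasses whose field names mirror the `workflow.json` sample in the Output Formats section; the `hypothesis` field is listed among the extracted items but does not appear in that sample, so its name and default here are assumptions.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Experiment:
    date: str
    title: str
    method: str
    params: dict
    results: dict          # e.g. {"baseline": ..., "new": ..., "delta": ...}
    conclusion: str
    tags: list[str] = field(default_factory=list)
    hypothesis: str = ""   # assumed field name, not shown in the sample JSON


@dataclass
class Task:
    name: str
    status: str
    experiments: list[Experiment] = field(default_factory=list)


@dataclass
class Phase:
    name: str
    status: str
    tasks: list[Task] = field(default_factory=list)


def phase_to_dict(phase: Phase) -> dict:
    """Convert a Phase and everything nested under it into plain dicts."""
    return asdict(phase)
```

`asdict` recurses through nested dataclasses and lists, so the result of `phase_to_dict` can be passed straight to `json.dumps`.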

Report Generation

Say "generate report" to produce:

- Paper project: Update CHANGELOG.md + experiment_log.md
- Engineering project: Generate .docx.json + .pptx.json intermediate format (render with any tool later)

Examples

Scenario 1: Engineering — Classifier Swap

You: Swapped Stage 2 MLP for SVM(linear, C=1). Accuracy: 93.75% → 94.79%. SVM is deterministic.

Claude: Recorded. SVM(linear) → SUCCESS, delta +1.04pp.
       → workflow.json + workflow.md updated

Scenario 2: Paper — Ablation Study

You: Finished attention module ablation. SE 94.2%, CBAM 94.8%, FAA 96.1%.

Claude: Paper project detected.
       → CHANGELOG.md appended with timeline entry
       → experiment_log.md appended with detailed record

Scenario 3: Report Generation

You: Generate a progress report for the last two weeks.

Claude: Generated report_20260614.docx.json + report_20260614.pptx.json
        Run node render.js or python render.py to produce final files.

Output Formats

workflow.json (Engineering)

{
  "project": "Welding Defect Classification",
  "updated": "2026-06-14T14:30",
  "phases": [{
    "name": "Phase 1: Accuracy Optimization",
    "status": "in_progress",
    "tasks": [{
      "name": "Task 1.1: Classifier Replacement",
      "status": "completed",
      "experiments": [{
        "date": "2026-06-14",
        "title": "SVM(linear) replaces MLP",
        "method": "SVC(kernel='linear', C=1, class_weight='balanced')",
        "params": {"kernel": "linear", "C": 1},
        "results": {"baseline": 93.75, "new": 94.79, "delta": 1.04},
        "conclusion": "SUCCESS",
        "tags": ["classifier", "svm", "breakthrough"]
      }]
    }]
  }]
}
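
Because `workflow.json` is plain JSON, it can be consumed by any script. The following sketch walks the Phase → Task → Experiment nesting shown above and prints one summary line per experiment; the helper names `load_workflow`, `iter_experiments`, and `summarize` are hypothetical and not part of the skill.

```python
import json


def load_workflow(path):
    """Read a workflow.json file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_experiments(workflow):
    """Yield (phase_name, task_name, experiment) for every experiment."""
    for phase in workflow.get("phases", []):
        for task in phase.get("tasks", []):
            for exp in task.get("experiments", []):
                yield phase["name"], task["name"], exp


def summarize(workflow):
    """Return one human-readable line per recorded experiment."""
    lines = []
    for _phase, task, exp in iter_experiments(workflow):
        delta = exp.get("results", {}).get("delta")
        # Show the delta with an explicit sign when one was recorded.
        suffix = "" if delta is None else f" ({delta:+.2f})"
        lines.append(f'{exp["date"]} | {task} | {exp["title"]}: {exp["conclusion"]}{suffix}')
    return lines
```

For the sample file above, `summarize(load_workflow("workflow.json"))` returns a single line: `2026-06-14 | Task 1.1: Classifier Replacement | SVM(linear) replaces MLP: SUCCESS (+1.04)`.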

CHANGELOG.md (Paper)

## 2026-06-14 — Attention Module Ablation

### Background
Comparing SE / CBAM / FAA attention modules on NEU-DET.

### Results
| Module | Accuracy | Delta vs SE |
|--------|----------|-------------|
| SE     | 94.2%    | baseline    |
| CBAM   | 94.8%    | +0.6pp      |
| FAA    | 96.1%    | +1.9pp      |

### Conclusion
FAA significantly outperforms SE and CBAM. Ablation validates the attention redundancy hypothesis.

How It Works

- Project Type Detection: Scans directory structure (tex/→paper, data/train/→engineering)
- Signal Detection: Matches experiment keywords + numeric change patterns in conversation
- Batch Writing: Accumulates experiments, writes once per round (avoids excessive IO)
- Delta Auto-Calculation: Computes difference whenever old and new values appear
- Tag Auto-Classification: Assigns tags like architecture, hyperparameter-tuning, classifier, data-augmentation based on method type
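
The delta step can be illustrated with a small pattern match over a message such as the one in Scenario 1. This is a sketch under assumptions: the regular expression, the function name `extract_delta`, and the `unit` key are invented for illustration, and the skill may recognize numeric changes in a different way.

```python
import re

# Matches "93.75% → 94.79%" or "0.91 -> 0.93". Illustrative pattern only.
CHANGE = re.compile(r"(\d+(?:\.\d+)?)\s*(%?)\s*(?:→|->)\s*(\d+(?:\.\d+)?)\s*%?")


def extract_delta(text):
    """Return baseline, new value, and their difference, or None if no change is found."""
    match = CHANGE.search(text)
    if match is None:
        return None
    old, new = float(match.group(1)), float(match.group(3))
    # Percentages are compared in percentage points (pp), as in the examples above.
    unit = "pp" if match.group(2) else ""
    return {"baseline": old, "new": new, "delta": round(new - old, 2), "unit": unit}
```

Applied to `"Accuracy: 93.75% → 94.79%"`, the function returns a baseline of 93.75, a new value of 94.79, and a delta of 1.04 with unit `pp`, which matches the `results` block in the `workflow.json` sample.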

Use Cases

- Deep learning model training & tuning
- Academic paper ablation study management
- GAN/VAE/Diffusion model iteration
- Computer vision classification/detection/segmentation
- Any ML workflow that needs "what was tried → what happened → what it means" tracking

Sub-Skills

manuscript-check — Academic Manuscript Integrity Verifier

A structured six-step audit pipeline for academic manuscripts. Designed to catch the errors that survive multiple rounds of self-editing: stale counts after table row deletion, figure scripts holding outdated metric values, loss-function mismatches in comparison charts, and narrative claims that drifted out of sync with the data.

What it checks:

| Audit | Target | Example |
|-------|--------|---------|
| Data provenance | Table rows traceable to actual experiments | "Was every ablation row actually run?" |
| Architecture attribution | Component naming matches original source | "Is this module ours or a cited baseline?" |
| Cross-file consistency | Manuscript ↔ figure scripts ↔ experiment logs | F1 values in `fig1_scatter()` matching Table 2 |
| Benchmark fairness | Same loss function across compared methods | "CE Loss baselines vs. CB Focal Loss ours in one chart?" |
| Stale counts | "N comparisons/runs/variants" after row deletion | `\multirow{N}` still correct after removing rows |
| Narrative drift | Claims between sections don't contradict | "Backbone-dependent" vs. "universally redundant" |
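
To make the stale-count audit concrete, the sketch below checks whether each `\multirow{N}` in a LaTeX table body still spans the number of rows that actually follow it. It is not the sub-skill's implementation. It assumes every `\multirow` group runs until the next `\multirow` or the end of the table, and the function name `stale_multirows` is hypothetical.

```python
import re

MULTIROW = re.compile(r"\\multirow\{(\d+)\}")
# Horizontal rules carry no data and are stripped before counting rows.
RULES = re.compile(r"\\(?:hline|toprule|midrule|bottomrule)\b|\\cline\{[^}]*\}")


def stale_multirows(tabular_body):
    """Return (row_index, declared, actual) for each multirow whose span looks stale."""
    # Rows end with a double backslash; drop rule commands and empty chunks.
    rows = [RULES.sub("", chunk).strip() for chunk in tabular_body.split(r"\\")]
    rows = [row for row in rows if row]

    starts = [i for i, row in enumerate(rows) if MULTIROW.search(row)]
    problems = []
    for pos, i in enumerate(starts):
        # Assumption: a group ends where the next multirow begins, or at the table end.
        end = starts[pos + 1] if pos + 1 < len(starts) else len(rows)
        declared = int(MULTIROW.search(rows[i]).group(1))
        if declared != end - i:
            problems.append((i, declared, end - i))
    return problems
```

A table with ungrouped trailing rows (for example a totals row) would be flagged incorrectly under this assumption, so the output is best treated as a list of places to inspect rather than confirmed errors.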

Workflow: Source verification → impact-surface grep analysis → one-pass batch editing (tex + tables + figure scripts) → residue sweep → script-to-table data cross-audit → memory persistence.

Trigger signals: "verify X", "was this experiment actually run?", "X never existed in this architecture", "do the figures need updating?", "check consistency".

Scope: Auto-detects the active paper project directory. Architecture facts are extracted and cached per project on first run, not hardcoded.

Repo Structure

workflow-tracker/
├── SKILL.md               # Main skill file (Claude Code entry point)
├── SKILL_EN.md            # English skill definition
├── README.md              # This file (English)
├── README_zh.md           # Chinese documentation
├── LICENSE                # MIT
├── evals.json             # 6 test cases, 25 assertions
├── .gitignore
└── manuscript-check/      # Sub-skill: paper manuscript integrity checker
    └── SKILL.md           # Six-step verification workflow

Development

Running Tests

cd workspace/iteration-2
python grade_all.py

Benchmark (v2)

| Metric | Value |
|--------|-------|
| Avg Response Time | 131s |
| Avg Tokens | 27k |
| Pass Rate (6 evals) | 100% |
| Paper Mode | ✓ |
| JSON Intermediate Format | ✓ |

License

MIT. See the LICENSE file.

Credits

Built on the Claude Code Skills framework. Inspired by real-world experiment management needs across computer vision, defect detection, and generative model research.

Changelog

v1.2.0 (2026-06-17)

- Enhanced: manuscript-check (script-to-table data cross-audit, benchmark fairness checks, known-pitfalls reference table)
- Fixed: Removed all hardcoded local paths and project-specific facts; replaced with auto-detection and per-project caching
- Improved: README sub-skill documentation with audit-dimension table and concrete examples

v1.1.0 (2026-06-16)

- New: manuscript-check sub-skill (six-step paper manuscript integrity verification)
  - Source-to-manuscript cross-referencing with grep impact analysis
  - Multi-section batch editing (tex + tables + figure scripts)
  - Post-edit residue checking + narrative consistency audit
  - Automatic memory file persistence
- Improved: Bootstrap CHANGELOG.md + experiment_log.md on first paper project load

v1.0.0 (2026-06-14)

- Initial release
- Auto-detect & silent logging for ML experiments
- Dual-mode output: engineering (workflow.json + workflow.md) / paper (CHANGELOG.md + experiment_log.md)
- Three-level structure: Phase → Task → Experiment
- Report generation: .docx.json + .pptx.json intermediate format
