algorithmicsuperintelligence · hdson07 · May 18, 2026
diff --git a/.claude/skills/openevolve-pipeline/SKILL.md b/.claude/skills/openevolve-pipeline/SKILL.md
@@ -0,0 +1,123 @@
+---
+name: openevolve-pipeline
+description: Scaffold an OpenEvolve solver-parameter tuning pipeline under input/<bench>/evolve/ when a user supplies a new solver + benchmark dataset. Trigger on requests like "add a new solver to openevolve", "make an evolve pipeline for <solver>", "tune <solver> parameters on this bench", "generate the cpsat-bench style scaffold for X", or whenever the user drops raw solver runs into input/<bench>/raw-data and asks for parameter tuning. The skill follows input/ADD_NEW_SOLVER.md: it interviews for solver/scoring/clustering/phase decisions, writes exactly the 4 per-bench files (config.yaml, params.json, adapter.py, _solve_worker.py) + N phase modules, and verifies the result with the _lib CLIs (sampler / self_test / rebaseline).
+---
+
+# OpenEvolve Pipeline Generator
+
+Given a new solver and a raw benchmark dataset, produce a complete per-bench tuning pipeline at `input/<bench>/evolve/`. All orchestration lives in `input/_lib/` — **never edit it**. The bench contributes only the per-bench surface from `input/ADD_NEW_SOLVER.md` §2:
+
+| file | purpose |
+|---|---|
+| `evolve/config.yaml` | bench + LLM + clustering + evaluation knobs |
+| `evolve/params.json` | rich solver parameter catalog (defaults / locked / groups) |
+| `evolve/adapter.py` | solver hooks (constants + `get_problem_size`) |
+| `evolve/_solve_worker.py` | subprocess entry — `argv = (params_json, problem_path, timeout_s)` → JSON line on stdout |
+| `evolve/phase{N}_<name>/initial_program.py` | one per phase (last phase is usually unified) |
+
+Authoritative spec: [input/ADD_NEW_SOLVER.md](../../../input/ADD_NEW_SOLVER.md). Reference implementations: [input/z3-bench/evolve/](../../../input/z3-bench/evolve/) (flat-overrides + speedup), [input/cpsat-bench/evolve/](../../../input/cpsat-bench/evolve/) (SIZE_BUCKETS + worker-axis + cost mode).
+
+Reading order before scaffolding: `references/interview-checklist.md` → `references/decision-guide.md` → `references/verify-checklist.md` → `references/gotchas.md`.
+
+## Workflow
+
+### 1. Inspect what user provided
+
+```bash
+ls input/<bench>/raw-data/ | head
+ls input/<bench>/problems.jsonl 2>/dev/null && head -1 input/<bench>/problems.jsonl | python3 -m json.tool
+```
+
+Confirm presence of `raw-data/` and `problems.jsonl`. If `problems.jsonl` is missing, stop and tell the user: `_lib/sampler.py` only reads it and does not build it. Generating it is solver-specific user work (ADD_NEW_SOLVER.md §1.4). Offer to draft a `build_problems_jsonl.py` once they describe the raw-data layout.
+
+If <10 problems: warn that quintile clustering collapses and cascade stages become noisy. Recommend 1–2 phases only.
+
+### 2. Interview
+
+Use `AskUserQuestion` (≤4 questions per call). Must-know set in [references/interview-checklist.md](references/interview-checklist.md). Skip anything already obvious from `problems.jsonl` + raw-data inspection.
+
+Highest-leverage answers:
+1. Solver binary / Python binding and how to invoke it.
+2. `problems.jsonl` field names → adapter constants (`PROBLEM_FILE_FIELD`, `STATUS_FIELD`, `STATS_FIELD`, `FEATURES_FIELD`, `OBJECTIVE_FIELD`).
+3. Decisive vs. decided result tokens.
+4. Score mode: `speedup` (z3-style: wall-clock min) vs `cost` (cpsat-style: objective gap + dtime). See [references/decision-guide.md](references/decision-guide.md).
+5. Worker-count axis? If yes → `WORKERS_KEY`, phase-level worker lock.
+6. SIZE_BUCKETS / STAGE3_OVERRIDES needed? Default off; turn on when problem-size distribution is wide and multi-modal.
+7. Phase plan: how many phases, namespace per phase, last phase unified (yes/no).
+8. Clustering: `kmeans` (default) | `quintile` | `thresholds`; feature path inside `problems.jsonl`.
+
+### 3. Scaffold
+
+Write 4 files + N phase dirs. Substitute placeholders from interview into [templates/](templates/):
+
+```
+input/<bench>/evolve/
+├── config.yaml                       ← templates/config.yaml.tmpl
+├── params.json                       ← templates/params.json.tmpl (skeleton; expand groups)
+├── adapter.py                        ← templates/adapter.py.tmpl
+├── _solve_worker.py                  ← templates/_solve_worker_py.tmpl  (or _solve_worker_cli.tmpl for binary)
+├── phase1_<name>/initial_program.py  ← templates/initial_program_simple.py.tmpl
+├── phase2_<name>/initial_program.py  ← templates/initial_program_simple.py.tmpl
+│                                       (or _cpsat.py.tmpl if SIZE_BUCKETS/worker lock)
+└── phaseN_unified/initial_program.py ← templates/initial_program_unified.py.tmpl
+```
+
+Phase modules must use `params_catalog.load_for_bench(_BENCH).defaults` for BASELINE — never hardcode a parallel default dict. Config single-source rule: see [[feedback_config_single_source]].
+
+`unified_dict_name` in `config.yaml` MUST match the EVOLVE-BLOCK dict name in the last phase (default: `UNIFIED_OVERRIDES`). Mismatch → `_lib.prepare_phase` cannot materialize the union, so the last phase silently starts empty and loses prior-phase wins.
+
+### 4. Verify (run these — do not skip)
+
+```bash
+# 0. Solver binding installed?
+python3 -c "import <solver_pkg>; print(<solver_pkg>.__version__)"   # or `command -v <binary>`
+
+# 1. Catalog load + validation
+python3 -c "
+import sys; sys.path.insert(0, 'input')
+from _lib import params_catalog
+c = params_catalog.load('input/<bench>/evolve/params.json')
+print('keys:', len(c.known_keys()), 'defaults:', len(c.defaults), 'locked:', len(c.locked))
+print('validate ok:', c.validate(c.defaults))
+print('validate bogus:', c.validate({'fake_key': 1}))
+"
+
+# 2. Clustering + stage split
+cd input && python3 -m _lib.sampler <bench>
+# Expect cache/stage{1..4}_sample.json. Inspect cluster sizes: a single all-in-one cluster
+# means the features field path is wrong or every problem's feature value is 0.
+
+# 3. BASELINE sanity on stage1
+python3 -m _lib.self_test <bench>
+# Expect result labels match + ratio in [0.5, 2.0] (WARN tolerated).
+
+# 4. Local baseline capture (10-run avg) — slow
+python3 -m _lib.rebaseline <bench>
+# Expect cache/local_baseline.json.
+
+# 5. Single-phase smoke (low iter)
+./input/run_phase.sh <bench> 1 --pin 2-3 --iterations 2
+# Expect phase1/openevolve_output/best/best_program.py + cache/phase1_best.json.
+```
+
+Each verify step gates the next. Stop and fix at the first failure — do not proceed to the next CLI hoping it will surface a clearer error.
+
+### 5. Hand off
+
+Report to user:
+- 4 files + N phase modules created at the paths above.
+- Verification results (catalog key count, sampler cluster sizes, self_test ratio).
+- Next command: `./input/run_phase.sh <bench> --pin <core-range>` for full chain.
+- `final_program.py` will land at `input/<bench>/evolve/final_program.py` after the last phase (auto via `_lib.finalize`). If `bench.solver_mode` is set to a non-default variant, the output is instead named `final_program_<mode>.py` and artifacts are isolated in `cache-<mode>/`.
+
+## Refuse / push back
+
+- `input/<bench>/raw-data/` absent → ask where dataset lives. Do not invent a layout.
+- `problems.jsonl` absent → solver-specific generation is **user** work (ADD_NEW_SOLVER.md §1.4). Optionally help draft `build_problems_jsonl.py` once they describe meta format.
+- Solver has no Python binding AND no CLI that accepts a problem file → ask user how they invoke it; cannot write `_solve_worker.py` without this.
+- Fewer than ~10 problems → warn about cascade collapse; suggest single phase.
+- User asks to edit `input/_lib/*` → refuse. `_lib` is bench-agnostic; per-bench knobs go in the 4 files only.
+
+## Gotchas
+
+See [references/gotchas.md](references/gotchas.md) — verbatim copy of ADD_NEW_SOLVER.md §6 plus failure-mode index from verify steps.
diff --git a/.claude/skills/openevolve-pipeline/references/decision-guide.md b/.claude/skills/openevolve-pipeline/references/decision-guide.md
@@ -0,0 +1,96 @@
+# Decision Guide
+
+ADD_NEW_SOLVER.md §5 — verbatim guidance for ambiguous knobs.
+
+## score_mode
+
+| Solver characteristic | Recommended mode |
+|---|---|
+| Baseline records `objective_value`; optimization problem | `cost` |
+| Sat/Unsat satisfaction; minimize wall-clock | `speedup` |
+| Has a deterministic-time counter (e.g. cpsat `deterministic_time`) | `cost` + `time_metric: dtime` |
+
+`speedup` = geomean(baseline_ms / candidate_ms). Higher better.
+`cost` = combination of objective-gap + dtime ratio. Lower better (inverted to speedup-like by scorer).
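As a rough illustration of the `speedup` definition above, the sketch below computes the geometric mean of per-problem `baseline_ms / candidate_ms` ratios. The function name `geomean_speedup` and the sample timings are made up for illustration; the actual scorer in `_lib` may treat timeouts, failures, or zero timings differently.

```python
import math

def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of baseline/candidate time ratios (hypothetical helper)."""
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Average in log space so a single very fast or very slow problem
    # cannot dominate the way it would with an arithmetic mean.
    log_sum = sum(math.log(b / c) for b, c in zip(baseline_ms, candidate_ms))
    return math.exp(log_sum / len(baseline_ms))

# Illustrative timings: one problem runs twice as fast, one twice as slow.
print(geomean_speedup([100.0, 200.0], [50.0, 400.0]))
```

With a geometric mean, a 2x win and a 2x loss cancel out, which is why the mode suits pools where per-problem runtimes differ by orders of magnitude.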
+
+## Worker axis
+
+If solver has a `num_workers`-style knob that strongly affects runtime:
+- `adapter.WORKERS_KEY = "<key>"`.
+- Each phase pins it via `PHASE_LOCKED["<key>"] = N`.
+- `_lib.evaluator_core` uses core-block allocation.
+- `_lib.rebaseline` produces `by_workers` baseline schema.
+
+Else:
+- `WORKERS_KEY = None`.
+- One core per solve; flat baseline.
+
+cpsat-bench uses W=1 (default profile) / W=8 (`OPENEVOLVE_PROFILE=large`)
+across phases. z3-bench is single-threaded.
+
+## SIZE_BUCKETS / STAGE3_OVERRIDES
+
+| Situation | Toggle |
+|---|---|
+| Problem size spans wide range (e.g. 7k–250k constraints), multi-modal score distribution | `enable_size_buckets: true` |
+| A few outlier problems dominate aggregate score | `enable_outlier_stage: true` + populate `cache/outliers.json` |
+| Pool small (<30) or uniform | Both `false` |
+
+`enable_size_buckets: true` → phase modules must use `initial_program_cpsat.py.tmpl`
+(SIZE_BUCKETS + `get_phase_size_buckets()`).
+`enable_outlier_stage: true` → add `STAGE3_OVERRIDES` + `get_phase_stage3_overrides()`.
+
+## clustering.method
+
+| Method | When |
+|---|---|
+| `kmeans` | 1D Lloyd's algorithm. Lets cluster boundaries emerge from data shape. Default. |
+| `quintile` | Rank-based equal-count splits. Use when boundary consistency across runs matters more than natural breaks. |
+| `thresholds` | User-specified cut-offs (e.g. `[50000, 150000]` → 3 buckets). Use when you have domain knowledge of regimes. |
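To make the difference between `thresholds` and `quintile` concrete, here is a minimal sketch of both bucketing rules. The helper names are hypothetical and this is not the `_lib.sampler` implementation; in particular, how a value exactly equal to a cut-off is assigned is an assumption here.

```python
from bisect import bisect_right

def bucket_by_thresholds(size, thresholds):
    """Index of the bucket `size` falls into, given ascending cut-offs."""
    # Two cut-offs give three buckets. A value equal to a cut-off goes to
    # the upper bucket in this sketch (an assumption, not confirmed).
    return bisect_right(thresholds, size)

def bucket_by_quintile(sizes):
    """Assign each value a rank-based bucket 0..4 with near-equal counts."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    buckets = [0] * len(sizes)
    for rank, idx in enumerate(order):
        # Only the rank matters, not the value, so bucket sizes stay
        # stable even when the size distribution is heavily skewed.
        buckets[idx] = rank * 5 // len(sizes)
    return buckets

print(bucket_by_thresholds(7000, [50000, 150000]))
print(bucket_by_quintile([30, 10, 20, 50, 40]))
```

The quintile rule also shows why very small pools are a problem: with fewer than five problems some buckets are necessarily empty.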
+
+## clustering.mode (optional sample-profile override)
+
+Generic `_lib.sampler` feature. `clustering.mode: <name>` selects a
+`clustering.modes.<name>` block that is **shallow-merged over the base clustering
+block**. Lets one config carry several sample profiles and switch by one field.
+Unset → base block only.
+
+- ORTHOGONAL to `solver_mode` — a sample profile (e.g. `large` = focus on
+  constraint-heavy instances) applies in any solver mode. Do NOT key the override
+  off solver_mode; keep the two knobs independent.
+- Only the keys present in the override are replaced (e.g. just `method` +
+  `thresholds` + `stage_sizes`); everything else falls through to base.
+- z3-bench uses `modes.large` (threshold bucketing, top bucket only) to focus on
+  the biggest instances when the speedup signal is dominated by them.
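The shallow-merge behaviour described above can be sketched as follows. The helper `resolve_clustering` and the config values are illustrative assumptions; only the `mode` / `modes.<name>` key names come from this guide.

```python
def resolve_clustering(clustering):
    """Return the effective clustering block (hypothetical helper)."""
    base = {k: v for k, v in clustering.items() if k not in ("mode", "modes")}
    mode = clustering.get("mode")
    override = (clustering.get("modes") or {}).get(mode) if mode else None
    if override is None:
        return base  # mode unset, or no matching modes.<name> block: base only
    # Shallow merge: each top-level key in the override replaces the base key
    # wholesale; keys absent from the override fall through unchanged.
    return {**base, **override}

# Made-up config values for illustration.
config = {
    "method": "kmeans",
    "feature": "features.num_constraints",
    "stage_sizes": [4, 8, 16, 32],
    "mode": "large",
    "modes": {"large": {"method": "thresholds", "thresholds": [50000, 150000]}},
}
effective = resolve_clustering(config)
print(effective["method"], effective["feature"])
```

Because the merge is shallow, a nested value in the override replaces the whole base value rather than being merged key by key.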
+
+## solver_mode (optional variant suffix)
+
+Generic `_lib.bench_paths` feature. `bench.solver_mode` (default unset ==
+`optimize`) does two things:
+
+1. **Artifact suffixing** so multiple modes coexist on disk:
+   `cache/`+`final_program.py` (optimize) vs `cache-<X>/`+`final_program_<X>.py`.
+   Every `_lib` CLI (sampler/rebaseline/extract_best/prepare_phase/final_verify/
+   finalize) routes through `bench_paths.cache_dir` / `variant_suffix`, so the two
+   modes' baselines and outputs never collide.
+2. **Worker branching** — `_solve_worker.py` reads the SAME sibling `config.yaml`
+   field and changes solver behavior (z3: `sat` = `z3.Solver` over
+   `parse_smt2_file`, drops `assert-soft`, no objective, `opt.*` params silently
+   dropped). `_lib.evaluator_core` warns if `optimize` + `score_mode != cost`.
+
+Use when one workload has two ways to be solved (full optimize vs feasibility-only)
+and you want both tunable without copying the bench dir. Switching = edit
+`solver_mode` + `score_mode`, then re-run sampler/rebaseline/phases (per-mode
+`cache-<X>/` keeps a dedicated baseline — **rebaseline is mandatory after switch**).
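The artifact-suffixing rule can be summarised in a few lines. This is only a sketch of the naming convention stated above; `variant_names` is a hypothetical function, not the `bench_paths` API, and it returns bare names because the parent directory is not specified here.

```python
def variant_names(solver_mode=None):
    """Cache directory name and final program file name for a solver mode."""
    # Unset and "optimize" are the same default mode and get no suffix.
    if solver_mode in (None, "", "optimize"):
        return "cache", "final_program.py"
    return f"cache-{solver_mode}", f"final_program_{solver_mode}.py"

print(variant_names())
print(variant_names("sat"))
```

Since each mode resolves to its own cache directory, a freshly selected mode starts without a baseline, which is the reason rebaseline is required after a switch.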
+
+## Existing solver reference
+
+| Solver | score_mode | Worker axis | Size buckets | Phases |
+|---|---|---|---|---|
+| z3 (`z3-bench`) | cost (optimize) / speedup (sat) | NO | NO | 4 (opt_sls + sat + smt + unified) |
+| CP-SAT (`cpsat-bench`) | cost (dtime + cost_ratio) | YES (W=1, W=8) | YES | 5 (search + presolve + lp_cuts + unified + custom_subsolvers) |
+
+z3-bench also demonstrates the optional `solver_mode` (optimize/sat) +
+`clustering.mode` (base/large) knobs above — both default-off, both config-only.
+
+Use whichever matches the new solver's profile as the structural template.
diff --git a/.claude/skills/openevolve-pipeline/references/gotchas.md b/.claude/skills/openevolve-pipeline/references/gotchas.md
@@ -0,0 +1,105 @@
+# Gotchas
+
+ADD_NEW_SOLVER.md §6 — verbatim. Re-listed here for skill-local lookup.
+
+## 1. problems.jsonl field name mismatch
+
+`adapter.PROBLEM_FILE_FIELD` / `STATUS_FIELD` MUST match the JSON keys exactly.
+Typos are the #1 failure mode. `head -1 input/<bench>/problems.jsonl | python3 -m json.tool`
+shows the canonical keys.
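A quick programmatic version of that check is sketched below. The helper `missing_fields` and the sample record are made up for illustration; in practice the expected names would be the values assigned to the adapter constants.

```python
import json

def missing_fields(first_line, expected_fields):
    """Return the expected field names absent from one problems.jsonl record."""
    record = json.loads(first_line)
    return [name for name in expected_fields if name not in record]

# Made-up record and field names, for illustration only.
sample = '{"file": "a.smt2", "status": "Sat", "features": {"num_X": 12}}'
print(missing_fields(sample, ["file", "status", "result"]))
```

An empty list means every expected key is present in the first record; it does not prove that later records are equally complete.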
+
+## 2. `features.<feature>` missing
+
+`clustering.feature: features.num_X`, but problems.jsonl entries lack
+`features.num_X` → sampler treats every problem as size=0 → all problems
+collapse into one cluster → cascade stages become meaningless.
+
+Verify with the sampler stdout: each cluster should hold a recognizable
+spread of problem SHAs. All-in-one cluster = features field path wrong.
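Before running the sampler, the feature path can be checked directly against the records. The sketch below assumes the path is a dot-separated chain of JSON object keys; `lookup_feature` and `count_missing` are hypothetical helpers, not part of `_lib`.

```python
def lookup_feature(record, dotted_path):
    """Follow a dotted path such as 'features.num_X'; None if any part is missing."""
    node = record
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def count_missing(records, dotted_path):
    """Number of records where the feature path does not resolve."""
    return sum(1 for r in records if lookup_feature(r, dotted_path) is None)

# Made-up records: the second one lacks the feature.
records = [{"features": {"num_X": 12}}, {"features": {}}]
print(count_missing(records, "features.num_X"))
```

A count equal to the number of records points at a wrong path; a small non-zero count points at incomplete records.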
+
+## 3. DECISIVE vs DECIDED confusion
+
+- DECISIVE = "solver gave an answer" (e.g. `Sat`, `Unsat`, `OPTIMAL`).
+- DECIDED  = "baseline produced a conclusive answer → regression comparable"
+  (e.g. cpsat: `INFEASIBLE` decided but only `OPTIMAL`/`FEASIBLE` decisive).
+
+Most solvers: both sets identical. cpsat: they differ.
+
+## 4. `_solve_worker.py` doesn't surface invalid params
+
+If solver silently ignores unknown keys, the catalog alone cannot catch a
+mutated illegal key. Worker MUST emit
+`{"invalid_param": "<key>", "result": "Unknown", "elapsed_ms": 0}` when
+solver rejects a key — otherwise evaluator cannot 0-score the candidate.
+
+Test: pass `{"obviously_fake_key": 1}` → worker should emit invalid_param.
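A minimal sketch of the rejection path is shown below, assuming the worker can tell which keys the solver accepts. `KNOWN_KEYS` and `invalid_param_line` are hypothetical; a real worker would more likely catch the binding's own "unknown parameter" error than keep a hard-coded set.

```python
import json

# Hypothetical set of accepted keys, for illustration only.
KNOWN_KEYS = {"timeout", "seed"}

def invalid_param_line(params):
    """Return the JSON line for the first unknown key, or None if all are known."""
    for key in params:
        if key not in KNOWN_KEYS:
            # Same shape as the record described above, so the evaluator
            # can 0-score the candidate.
            return json.dumps({"invalid_param": key, "result": "Unknown", "elapsed_ms": 0})
    return None

print(invalid_param_line({"obviously_fake_key": 1}))
```

The worker would print this line to stdout and exit before invoking the solver; when the function returns `None`, the normal solve path runs.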
+
+## 5. Phase docstring empty
+
+LLM has no other signal about phase intent. Even one line — "Phase 2: tune
+presolve.* knobs" — improves mutation quality dramatically.
+
+## 6. `unified_dict_name` mismatch
+
+`config.yaml` `bench.unified_dict_name` MUST match the EVOLVE-BLOCK dict
+name in the last phase's `initial_program.py`. Default convention:
+`UNIFIED_OVERRIDES`.
+
+Mismatch → `_lib.prepare_phase` cannot materialize the union → last phase
+starts empty and loses prior-phase wins.
+
+## 7. `worker_path` is relative to `<bench>/evolve/`
+
+`config.yaml` `bench.worker_path: _solve_worker.py` (no directory prefix
+when worker is at evolve/ root).
+
+## Verify-time additional gotchas
+
+### v1. `OPENEVOLVE_BENCH_ROOT unset` from phase module
+
+Phase module `_resolve_bench_root()` fallback walks parents looking for
+`adapter.py` + `params.json`. Fails if the phase dir is not exactly two levels
+under `<bench>`. Correct layout:
+
+```
+input/<bench>/evolve/params.json
+input/<bench>/evolve/adapter.py
+input/<bench>/evolve/phase1_x/initial_program.py    ← two levels under <bench>
+```
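To diagnose a layout problem, a parent walk like the one below can show where `adapter.py` and `params.json` are actually found. This is a stand-in written for illustration, not the real `_resolve_bench_root()`: it searches any depth, whereas the real fallback evidently expects a fixed depth. The temporary directory and bench name are made up.

```python
import tempfile
from pathlib import Path

def find_evolve_dir(start):
    """Walk up from `start` until a directory holds adapter.py and params.json."""
    for candidate in [start, *start.parents]:
        if (candidate / "adapter.py").is_file() and (candidate / "params.json").is_file():
            return candidate
    return None

# Build the expected layout in a throwaway directory and search from the phase dir.
with tempfile.TemporaryDirectory() as tmp:
    evolve = Path(tmp) / "input" / "demo-bench" / "evolve"
    phase = evolve / "phase1_x"
    phase.mkdir(parents=True)
    (evolve / "adapter.py").touch()
    (evolve / "params.json").touch()
    found = find_evolve_dir(phase)
    found_name = found.name if found else None

print(found_name)
```

If the walk finds the files more than one level above the phase directory, the phase is nested too deeply for the fixed-depth fallback.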
+
+### v2. Cascade thresholds too tight
+
+`evaluator.cascade_thresholds: [1.03, 1.03, 1.03]` means each stage demands
+≥3% improvement. Solver with high variance + few problems may never cross.
+Lower to `[1.01, 1.01, 1.01]` for noisy benches.
+
+### v3. `parallel_solvers > 1` with single-threaded baseline
+
+If baseline was captured single-threaded, running candidates with
+`parallel_solvers: N` co-locates them on shared cores → timings inflate
+vs baseline → false regression. Either pin core ranges via `--pin` or
+recapture baseline at the same parallelism.
+
+### v4. solver_mode switch without re-baseline
+
+If the bench uses `bench.solver_mode` variants, each mode keeps its OWN baseline
+in `cache-<mode>/local_baseline.json` (optimize → plain `cache/`). After switching
+`solver_mode`, the new mode's `cache-<mode>/` has no baseline → `self_test`/scoring
+compare against a stale or empty baseline. ALWAYS re-run `_lib.rebaseline <bench>`
+after a switch. The suffix isolation is automatic (`bench_paths.cache_dir`), so the
+old mode's baseline is preserved — switching back needs no recapture.
+
+### v5. clustering.mode override silently ignored
+
+`clustering.mode: <name>` only applies if a matching `clustering.modes.<name>`
+block exists. A typo (mode set, no block) → sampler prints
+`mode=... has no modes.... — using base` and falls back to the base block. Check
+sampler stdout for `clustering: applied modes.<name> override` to confirm it took.
+
+### v6. Catalog `defaults` ≠ binding's real defaults
+
+If `params.json` `defaults` includes a key whose value differs from the binding's
+real default, the BASELINE that phase modules send may diverge from what
+`_lib.rebaseline` captured. Symptom: `self_test` ratio drifts outside [0.5, 2.0].
+
+Fix: ensure `defaults` matches what the original problems.jsonl baseline run used.