Closed
45 commits
b1c43c1
[feat] initial implement for z3-solver opt pipeline
hdson-Axion May 18, 2026
8b15494
[feat] add claude code sdk backend
hdson-Axion May 18, 2026
a94cadd
Merge branch 'hdson/z3-initial-tune'
hdson-Axion May 18, 2026
626e54c
[feat] upgrade optimize loop
hdson-Axion May 18, 2026
82c3648
[feat] enable multi-pricessing
hdson-Axion May 18, 2026
89a0f0f
[feat] size-based sample selection + adaptive per-problem timeout
hdson-Axion May 19, 2026
8eb20fd
[fix] update utils
hdson-Axion May 19, 2026
ceb1150
update samples
hdson-Axion May 19, 2026
d005d86
[fix] remove rebaseline timeout
hdson-Axion May 19, 2026
32b19d3
change core idx
hdson-Axion May 19, 2026
88f0f68
use core isolation on docker
hdson-Axion May 19, 2026
78a1008
Merge branch 'main' of github.com:hdson07/openevolve
hdson-Axion May 19, 2026
cb4ca60
[feat] 4-stage cascade with runtime quintile samples + outlier filter
hdson-Axion May 19, 2026
9eaa911
Merge branch 'main' of github.com:hdson07/openevolve
hdson-Axion May 19, 2026
e33438d
add center pick option
hdson-Axion May 19, 2026
ea29ccc
update stats ratio
hdson-Axion May 19, 2026
0161fef
[fix] update scoring logic
hdson-Axion May 20, 2026
e21d456
[feat] enable cpsat bench
hdson-Axion May 20, 2026
9f789df
fix core allocation error
hdson-Axion May 20, 2026
4939185
add verify on cpsat
hdson-Axion May 21, 2026
da66606
update regression score logic
hdson-Axion May 21, 2026
89950c8
add final bench
hdson-Axion May 21, 2026
22d0a99
sample clustering
hdson-Axion May 21, 2026
045a561
cpsat update phase
hdson-Axion May 22, 2026
b550561
fix rebase logic
hdson-Axion May 22, 2026
d484866
[feat] add statistics for cp-sat
hdson-Axion May 26, 2026
225f5ea
cpsat: per-problem param tuning + outlier-only stage3
hdson-Axion May 26, 2026
1c8b7cf
update default params
hdson-Axion May 26, 2026
e51f4ab
tune stage3 sampling
hdson-Axion May 26, 2026
739fcc8
tune score logic
hdson-Axion May 26, 2026
3a4ebdd
add optimization target prompts
hdson-Axion May 26, 2026
8d77730
avoid zero sample crash:
hdson-Axion May 27, 2026
6642f78
chage claude max turn
hdson-Axion May 27, 2026
44aba61
add isolate large profile
hdson-Axion May 27, 2026
c51b489
stage 3 tune
hdson-Axion May 27, 2026
a0bb884
[docs] add openevolve study
hdson-Axion May 28, 2026
4186230
add phase5 custom subsolvers, worker logging, ignore cpsat run artifacts
hdson-Axion May 28, 2026
4dc2be8
[docs] update slides
hdson-Axion May 28, 2026
35e87cf
[feat] add small profile mode
hdson-Axion May 28, 2026
0555709
[docs] update docs
hdson-Axion Jun 1, 2026
b9e60ef
[fix] extract final solutions
hdson-Axion Jun 1, 2026
7173f9e
[refactor] unify cpsat + z3 pipelines into shared _lib platform
hdson-Axion Jun 1, 2026
1af4575
[chore] sampler stage reports, build_problems scripts, gitignore cleanup
hdson-Axion Jun 1, 2026
4837e01
[skill] add openevolve-pipeline scaffold generator
hdson-Axion Jun 1, 2026
5ecc164
[z3-bench] add reboot pipeline, sat-mode caches, and shared _lib updates
hdson-Axion Jun 9, 2026
123 changes: 123 additions & 0 deletions .claude/skills/openevolve-pipeline/SKILL.md
@@ -0,0 +1,123 @@
---
name: openevolve-pipeline
description: Scaffold an OpenEvolve solver-parameter tuning pipeline under input/<bench>/evolve/ when a user supplies a new solver + benchmark dataset. Trigger on requests like "add a new solver to openevolve", "make an evolve pipeline for <solver>", "tune <solver> parameters on this bench", "generate the cpsat-bench style scaffold for X", or whenever the user drops raw solver runs into input/<bench>/raw-data and asks for parameter tuning. The skill follows input/ADD_NEW_SOLVER.md: it interviews for solver/scoring/clustering/phase decisions, writes exactly the 4 per-bench files (config.yaml, params.json, adapter.py, _solve_worker.py) + N phase modules, and verifies the result with the _lib CLIs (sampler / self_test / rebaseline).
---

# OpenEvolve Pipeline Generator

Given a new solver and a raw benchmark dataset, produce a complete per-bench tuning pipeline at `input/<bench>/evolve/`. All orchestration lives in `input/_lib/` — **never edit it**. The bench contributes only the per-bench surface from `input/ADD_NEW_SOLVER.md` §2:

| file | purpose |
|---|---|
| `evolve/config.yaml` | bench + LLM + clustering + evaluation knobs |
| `evolve/params.json` | rich solver parameter catalog (defaults / locked / groups) |
| `evolve/adapter.py` | solver hooks (constants + `get_problem_size`) |
| `evolve/_solve_worker.py` | subprocess entry — `argv = (params_json, problem_path, timeout_s)` → JSON line on stdout |
| `evolve/phase{N}_<name>/initial_program.py` | one per phase (last phase is usually unified) |

Authoritative spec: [input/ADD_NEW_SOLVER.md](../../../input/ADD_NEW_SOLVER.md). Reference implementations: [input/z3-bench/evolve/](../../../input/z3-bench/evolve/) (flat-overrides + speedup), [input/cpsat-bench/evolve/](../../../input/cpsat-bench/evolve/) (SIZE_BUCKETS + worker-axis + cost mode).

Reading order before scaffolding: `references/interview-checklist.md` → `references/decision-guide.md` → `references/verify-checklist.md` → `references/gotchas.md`.

## Workflow

### 1. Inspect what user provided

```bash
ls input/<bench>/raw-data/ | head
ls input/<bench>/problems.jsonl 2>/dev/null && head -1 input/<bench>/problems.jsonl | python3 -m json.tool
```

Confirm presence of `raw-data/` and `problems.jsonl`. If `problems.jsonl` missing, stop and tell user — `_lib/sampler.py` only reads it (does not build it). Generation is solver-specific user work (ADD_NEW_SOLVER.md §1.4). Offer to draft a `build_problems_jsonl.py` after they describe raw-data layout.

If <10 problems: warn that quintile clustering collapses and cascade stages become noisy. Recommend 1–2 phases only.

### 2. Interview

Use `AskUserQuestion` (≤4 per call). Must-know set in [references/interview-checklist.md](references/interview-checklist.md). Skip anything already obvious from `problems.jsonl` + raw-data inspection.

Highest-leverage answers:
1. Solver binary / Python binding and how to invoke it.
2. `problems.jsonl` field names → adapter constants (`PROBLEM_FILE_FIELD`, `STATUS_FIELD`, `STATS_FIELD`, `FEATURES_FIELD`, `OBJECTIVE_FIELD`).
3. Decisive vs. decided result tokens.
4. Score mode: `speedup` (z3-style: minimize wall-clock time) vs `cost` (cpsat-style: objective gap + dtime). See [references/decision-guide.md](references/decision-guide.md).
5. Worker-count axis? If yes → `WORKERS_KEY`, phase-level worker lock.
6. SIZE_BUCKETS / STAGE3_OVERRIDES needed? Default off; turn on when problem-size distribution is wide and multi-modal.
7. Phase plan: how many phases, namespace per phase, last phase unified (yes/no).
8. Clustering: `kmeans` (default) | `quintile` | `thresholds`; feature path inside `problems.jsonl`.

### 3. Scaffold

Write 4 files + N phase dirs. Substitute placeholders from interview into [templates/](templates/):

```
input/<bench>/evolve/
├── config.yaml ← templates/config.yaml.tmpl
├── params.json ← templates/params.json.tmpl (skeleton; expand groups)
├── adapter.py ← templates/adapter.py.tmpl
├── _solve_worker.py ← templates/_solve_worker_py.tmpl (or _solve_worker_cli.tmpl for binary)
├── phase1_<name>/initial_program.py ← templates/initial_program_simple.py.tmpl
├── phase2_<name>/initial_program.py ← templates/initial_program_simple.py.tmpl
│ (or _cpsat.py.tmpl if SIZE_BUCKETS/worker lock)
└── phaseN_unified/initial_program.py ← templates/initial_program_unified.py.tmpl
```

Phase modules must use `params_catalog.load_for_bench(_BENCH).defaults` for BASELINE — never hardcode a parallel default dict. Config single-source rule: see [[feedback_config_single_source]].

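`unified_dict_name` in `config.yaml` MUST match the EVOLVE-BLOCK dict name in the last phase (default: `UNIFIED_OVERRIDES`). Mismatch → `_lib.prepare_phase` cannot materialize the union of prior-phase overrides, so the last phase silently starts empty and loses prior-phase wins.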

### 4. Verify (run these — do not skip)

```bash
# 0. Solver binding installed?
python3 -c "import <solver_pkg>; print(<solver_pkg>.__version__)" # or `command -v <binary>`

# 1. Catalog load + validation
python3 -c "
import sys; sys.path.insert(0, 'input')
from _lib import params_catalog
c = params_catalog.load('input/<bench>/evolve/params.json')
print('keys:', len(c.known_keys()), 'defaults:', len(c.defaults), 'locked:', len(c.locked))
print('validate ok:', c.validate(c.defaults))
print('validate bogus:', c.validate({'fake_key': 1}))
"

# 2. Clustering + stage split
cd input && python3 -m _lib.sampler <bench>
# Expect cache/stage{1..4}_sample.json. Inspect cluster sizes — all-in-one cluster
# = features field path wrong or every problem has 0.

# 3. BASELINE sanity on stage1
python3 -m _lib.self_test <bench>
# Expect result labels match + ratio in [0.5, 2.0] (WARN tolerated).

# 4. Local baseline capture (10-run avg) — slow
python3 -m _lib.rebaseline <bench>
# Expect cache/local_baseline.json.

# 5. Single-phase smoke (low iter)
./input/run_phase.sh <bench> 1 --pin 2-3 --iterations 2
# Expect phase1/openevolve_output/best/best_program.py + cache/phase1_best.json.
```

Each verify step gates the next. Stop and fix at the first failure — do not proceed to the next CLI hoping it will surface a clearer error.

### 5. Hand off

Report to user:
- 4 files + N phase modules created at the paths above.
- Verification results (catalog key count, sampler cluster sizes, self_test ratio).
- Next command: `./input/run_phase.sh <bench> --pin <core-range>` for full chain.
- `final_program.py` will land at `input/<bench>/evolve/final_program.py` after last phase (auto via `_lib.finalize`). If `bench.solver_mode` is set to a non-default variant, output suffixes to `final_program_<mode>.py` and artifacts isolate to `cache-<mode>/`.

## Refuse / push back

- `input/<bench>/raw-data/` absent → ask where dataset lives. Do not invent a layout.
- `problems.jsonl` absent → solver-specific generation is **user** work (ADD_NEW_SOLVER.md §1.4). Optionally help draft `build_problems_jsonl.py` once they describe meta format.
- Solver has no Python binding AND no CLI that accepts a problem file → ask user how they invoke it; cannot write `_solve_worker.py` without this.
- Fewer than ~10 problems → warn about cascade collapse; suggest single phase.
- User asks to edit `input/_lib/*` → refuse. `_lib` is bench-agnostic; per-bench knobs go in the 4 files only.

## Gotchas

See [references/gotchas.md](references/gotchas.md) — verbatim copy of ADD_NEW_SOLVER.md §6 plus failure-mode index from verify steps.
96 changes: 96 additions & 0 deletions .claude/skills/openevolve-pipeline/references/decision-guide.md
@@ -0,0 +1,96 @@
# Decision Guide

ADD_NEW_SOLVER.md §5 — verbatim guidance for ambiguous knobs.

## score_mode

| Solver characteristic | Recommended mode |
|---|---|
| Baseline records `objective_value`; optimization problem | `cost` |
| Sat/Unsat satisfaction; minimize wall-clock | `speedup` |
| Has a deterministic-time counter (e.g. cpsat `deterministic_time`) | `cost` + `time_metric: dtime` |

`speedup` = geomean(baseline_ms / candidate_ms). Higher better.
`cost` = combination of objective-gap + dtime ratio. Lower better (inverted to speedup-like by scorer).
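
As a rough illustration of the `speedup` definition above, the sketch below computes the geometric mean of per-problem `baseline_ms / candidate_ms` ratios. The function name `geomean_speedup` and the plain-list inputs are assumptions made for this example; the actual `_lib` scorer may treat timeouts, failed solves, and weighting differently.

```python
import math


def geomean_speedup(baseline_ms, candidate_ms):
    """Geometric mean of baseline_ms / candidate_ms over paired problems.

    Hypothetical helper for illustration only, not the _lib scorer.
    """
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need two non-empty lists of equal length")
    # Sum the logs of the ratios, then exponentiate the mean:
    # this is the geometric mean without multiplying many ratios together.
    log_sum = sum(math.log(b / c) for b, c in zip(baseline_ms, candidate_ms))
    return math.exp(log_sum / len(baseline_ms))


# One problem twice as fast, one unchanged: ratios 2.0 and 1.0.
print(geomean_speedup([100.0, 400.0], [50.0, 400.0]))  # about 1.414 (sqrt(2))
```

A value above 1.0 means the candidate is faster than the baseline on (geometric) average; a single large regression pulls the score down more than an arithmetic mean of ratios would.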

## Worker axis

If solver has a `num_workers`-style knob that strongly affects runtime:
- `adapter.WORKERS_KEY = "<key>"`.
- Each phase pins it via `PHASE_LOCKED["<key>"] = N`.
- `_lib.evaluator_core` uses core-block allocation.
- `_lib.rebaseline` produces `by_workers` baseline schema.

Else:
- `WORKERS_KEY = None`.
- One core per solve; flat baseline.

cpsat-bench uses W=1 (default profile) / W=8 (`OPENEVOLVE_PROFILE=large`)
across phases. z3-bench is single-threaded.

## SIZE_BUCKETS / STAGE3_OVERRIDES

| Situation | Toggle |
|---|---|
| Problem size spans wide range (e.g. 7k–250k constraints), multi-modal score distribution | `enable_size_buckets: true` |
| A few outlier problems dominate aggregate score | `enable_outlier_stage: true` + populate `cache/outliers.json` |
| Pool small (<30) or uniform | Both `false` |

`enable_size_buckets: true` → phase modules must use `initial_program_cpsat.py.tmpl`
(SIZE_BUCKETS + `get_phase_size_buckets()`).
`enable_outlier_stage: true` → add `STAGE3_OVERRIDES` + `get_phase_stage3_overrides()`.

## clustering.method

| Method | When |
|---|---|
| `kmeans` | 1D Lloyd's. Lets cluster boundaries emerge from data shape. Default. |
| `quintile` | Rank-based equal-count splits. Use when boundary consistency across runs matters more than natural breaks. |
| `thresholds` | User-specified cut-offs (e.g. `[50000, 150000]` → 3 buckets). Use when you have domain knowledge of regimes. |
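
To make the difference between `thresholds` and `quintile` concrete, here is a minimal sketch of both bucketing rules. The function names are hypothetical, and the boundary convention (a size equal to a cut-off falls into the upper bucket) is an assumption; `_lib.sampler` may resolve ties and boundaries differently.

```python
from bisect import bisect_right


def threshold_buckets(sizes, thresholds):
    """Map each size to a bucket index using user-specified cut-offs.

    Two cut-offs produce three buckets (0, 1, 2). Illustrative only.
    """
    cuts = sorted(thresholds)
    return [bisect_right(cuts, size) for size in sizes]


def quantile_buckets(sizes, k=5):
    """Rank-based split into k near-equal-count buckets (k=5 gives quintiles)."""
    # Sort problem indices by size, then assign buckets by rank, not by value.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    labels = [0] * len(sizes)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(sizes)
    return labels


print(threshold_buckets([7000, 50000, 120000, 250000], [50000, 150000]))
print(quantile_buckets([10, 20, 30, 40, 50, 60, 70, 80, 90, 100]))
```

Because `quantile_buckets` depends only on rank, its bucket counts stay stable across runs even when the size distribution is skewed, which is the boundary-consistency property the table refers to.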

## clustering.mode (optional sample-profile override)

Generic `_lib.sampler` feature. `clustering.mode: <name>` selects a
`clustering.modes.<name>` block that is **shallow-merged over the base clustering
block**. Lets one config carry several sample profiles and switch by one field.
Unset → base block only.

- ORTHOGONAL to `solver_mode` — a sample profile (e.g. `large` = focus on
constraint-heavy instances) applies in any solver mode. Do NOT key the override
off solver_mode; keep the two knobs independent.
- Only the keys present in the override are replaced (e.g. just `method` +
`thresholds` + `stage_sizes`); everything else falls through to base.
- z3-bench uses `modes.large` (threshold bucketing, top bucket only) to focus on
the biggest instances when the speedup signal is dominated by them.
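
The shallow-merge rule described above can be sketched as follows. `resolve_clustering` is a hypothetical name, the config is shown as an already-parsed dict, and the feature path `features.num_constraints` is an invented example; the real `_lib.sampler` logic may differ in detail.

```python
def resolve_clustering(clustering):
    """Return the effective clustering block after applying clustering.mode.

    Illustrative sketch: only top-level keys are replaced (shallow merge).
    """
    # The base block is everything except the mode selector and the modes table.
    base = {k: v for k, v in clustering.items() if k not in ("mode", "modes")}
    mode = clustering.get("mode")
    if not mode:
        return base  # mode unset: base block only
    override = (clustering.get("modes") or {}).get(mode)
    if override is None:
        return base  # mode set but no matching modes.<name> block: fall back to base
    return {**base, **override}  # keys in the override win; the rest fall through


config = {
    "method": "kmeans",
    "feature": "features.num_constraints",  # hypothetical feature path
    "mode": "large",
    "modes": {"large": {"method": "thresholds", "thresholds": [50000, 150000]}},
}
print(resolve_clustering(config))
```

In this example `method` is replaced and `thresholds` is added, while `feature` falls through from the base block unchanged.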

## solver_mode (optional variant suffix)

Generic `_lib.bench_paths` feature. `bench.solver_mode` (default unset ==
`optimize`) does two things:

1. **Artifact suffixing** so multiple modes coexist on disk:
`cache/`+`final_program.py` (optimize) vs `cache-<X>/`+`final_program_<X>.py`.
Every `_lib` CLI (sampler/rebaseline/extract_best/prepare_phase/final_verify/
finalize) routes through `bench_paths.cache_dir` / `variant_suffix`, so the two
modes' baselines and outputs never collide.
2. **Worker branching** — `_solve_worker.py` reads the SAME sibling `config.yaml`
field and changes solver behavior (z3: `sat` = `z3.Solver` over
`parse_smt2_file`, drops `assert-soft`, no objective, `opt.*` params silently
dropped). `_lib.evaluator_core` warns if `optimize` + `score_mode != cost`.

Use when one workload has two ways to be solved (full optimize vs feasibility-only)
and you want both tunable without copying the bench dir. Switching = edit
`solver_mode` + `score_mode`, then re-run sampler/rebaseline/phases (per-mode
`cache-<X>/` keeps a dedicated baseline — **rebaseline is mandatory after switch**).
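
A minimal sketch of the artifact-suffixing rule, assuming both artifacts sit under one directory passed in as `root`. The helper name `artifact_paths` and the `root` parameter are hypothetical; the real routing lives in `_lib.bench_paths` (`cache_dir` / `variant_suffix`), whose exact signatures are not shown here.

```python
from pathlib import Path


def artifact_paths(root, solver_mode=None):
    """Return (cache_dir, final_program) for a solver mode.

    Hypothetical helper: unset or "optimize" uses the plain names,
    any other mode gets a per-mode suffix so artifacts never collide.
    """
    root = Path(root)
    if solver_mode in (None, "optimize"):
        return root / "cache", root / "final_program.py"
    return root / f"cache-{solver_mode}", root / f"final_program_{solver_mode}.py"


print(artifact_paths("input/demo/evolve"))
print(artifact_paths("input/demo/evolve", "sat"))
```

Because each mode resolves to its own cache directory, a freshly selected mode starts without a baseline, which is why rebaseline is required after a switch.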

## Existing solver reference

| Solver | score_mode | Worker axis | Size buckets | Phases |
|---|---|---|---|---|
| z3 (`z3-bench`) | cost (optimize) / speedup (sat) | NO | NO | 4 (opt_sls + sat + smt + unified) |
| CP-SAT (`cpsat-bench`) | cost (dtime + cost_ratio) | YES (W=1, W=8) | YES | 5 (search + presolve + lp_cuts + unified + custom_subsolvers) |

z3-bench also demonstrates the optional `solver_mode` (optimize/sat) +
`clustering.mode` (base/large) knobs above — both default-off, both config-only.

Use whichever matches the new solver's profile as the structural template.
105 changes: 105 additions & 0 deletions .claude/skills/openevolve-pipeline/references/gotchas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Gotchas

ADD_NEW_SOLVER.md §6 — verbatim. Re-listed here for skill-local lookup.

## 1. problems.jsonl field name mismatch

`adapter.PROBLEM_FILE_FIELD` / `STATUS_FIELD` MUST match the JSON keys exactly.
Typos are the #1 failure mode. `head -1 input/<bench>/problems.jsonl | python3 -m json.tool`
shows the canonical keys.

## 2. `features.<feature>` missing

`clustering.feature: features.num_X`, but problems.jsonl entries lack
`features.num_X` → sampler treats every problem as size=0 → all problems
collapse into one cluster → cascade stages become meaningless.

Verify with the sampler stdout: each cluster should hold a recognizable
spread of problem SHAs. All-in-one cluster = features field path wrong.
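
Before running the sampler, a quick pre-check along these lines can catch a wrong feature path. This is a sketch under the assumption that `clustering.feature` is a dotted path into each JSON object; `lookup` and `count_missing` are hypothetical helpers, not part of `_lib`.

```python
import json


def lookup(entry, dotted_path):
    """Follow a dotted path such as 'features.num_X'; None if any hop is missing."""
    node = entry
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def count_missing(jsonl_lines, dotted_path):
    """Return (missing, total) over the non-empty lines of a problems.jsonl."""
    missing = total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        if lookup(json.loads(line), dotted_path) is None:
            missing += 1
    return missing, total


sample = ['{"features": {"num_X": 12}}', '{"features": {}}']
print(count_missing(sample, "features.num_X"))
```

To check a real file, pass the open file object instead of `sample`, for example `count_missing(open("input/<bench>/problems.jsonl"), "features.num_X")`. Any non-zero `missing` count means some problems would be clustered as size 0.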

## 3. DECISIVE vs DECIDED confusion

- DECISIVE = "solver gave an answer" (e.g. `Sat`, `Unsat`, `OPTIMAL`).
- DECIDED = "baseline produced a conclusive answer → regression comparable"
(e.g. cpsat: `INFEASIBLE` decided but only `OPTIMAL`/`FEASIBLE` decisive).

Most solvers: both sets identical. cpsat: they differ.

## 4. `_solve_worker.py` doesn't surface invalid params

If solver silently ignores unknown keys, the catalog alone cannot catch a
mutated illegal key. Worker MUST emit
`{"invalid_param": "<key>", "result": "Unknown", "elapsed_ms": 0}` when
solver rejects a key — otherwise evaluator cannot 0-score the candidate.

Test: pass `{"obviously_fake_key": 1}` → worker should emit invalid_param.
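
The sketch below shows one way a worker could surface an illegal key. It assumes `params_json` arrives as a JSON string, uses a hypothetical allow-list `KNOWN_KEYS`, and takes the solver call as an injected `solve` function; only the `invalid_param` line itself follows the shape given above, and the success-line fields are an assumption.

```python
import json
import time

KNOWN_KEYS = {"seed", "restarts"}  # hypothetical stand-in for the solver's real keys


def run_worker(argv, solve):
    """Build the JSON line for argv = (params_json, problem_path, timeout_s).

    Illustrative only: a real worker would also catch the solver's own
    rejection of a key and report it the same way.
    """
    params = json.loads(argv[0])
    problem_path, timeout_s = argv[1], float(argv[2])
    for key in params:
        if key not in KNOWN_KEYS:
            # Report the offending key so the evaluator can 0-score the candidate.
            return json.dumps({"invalid_param": key, "result": "Unknown", "elapsed_ms": 0})
    start = time.perf_counter()
    result = solve(params, problem_path, timeout_s)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return json.dumps({"result": result, "elapsed_ms": elapsed_ms})


# The fake-key test from above, with a dummy solver that is never reached.
print(run_worker(['{"obviously_fake_key": 1}', "problem.smt2", "5"],
                 solve=lambda params, path, timeout: "Sat"))
```

In an actual `_solve_worker.py`, the returned line would be printed to stdout after reading the three arguments from `sys.argv`.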

## 5. Phase docstring empty

LLM has no other signal about phase intent. Even one line — "Phase 2: tune
presolve.* knobs" — improves mutation quality dramatically.

## 6. `unified_dict_name` mismatch

`config.yaml` `bench.unified_dict_name` MUST match the EVOLVE-BLOCK dict
name in the last phase's `initial_program.py`. Default convention:
`UNIFIED_OVERRIDES`.

Mismatch → `_lib.prepare_phase` cannot materialize the union → last phase
starts empty and loses prior-phase wins.

## 7. `worker_path` is relative to `<bench>/evolve/`

`config.yaml` `bench.worker_path: _solve_worker.py` (no directory prefix
when worker is at evolve/ root).

## Verify-time additional gotchas

### v1. `OPENEVOLVE_BENCH_ROOT unset` from phase module

Phase module `_resolve_bench_root()` fallback walks parents looking for
adapter+params.json. Fails if phase dir is not exactly two levels under
`<bench>`. Correct layout:

```
input/<bench>/evolve/params.json
input/<bench>/evolve/adapter.py
input/<bench>/evolve/phase1_x/initial_program.py ← two levels under <bench>
```

### v2. Cascade thresholds too tight

`evaluator.cascade_thresholds: [1.03, 1.03, 1.03]` means each stage demands
≥3% improvement. Solver with high variance + few problems may never cross.
Lower to `[1.01, 1.01, 1.01]` for noisy benches.
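
To see why tight thresholds can stall a noisy bench, consider this sketch of a stage gate. It assumes each threshold is compared against a speedup-like stage score where 1.0 means parity with the baseline, and that a candidate stops at the first gate it fails; `stages_passed` is a hypothetical name and the real cascade logic in `_lib` may differ.

```python
def stages_passed(stage_scores, thresholds):
    """Count how many cascade gates a candidate clears, stopping at the first miss."""
    passed = 0
    for score, threshold in zip(stage_scores, thresholds):
        if score < threshold:
            break  # candidate is dropped here and never reaches later stages
        passed += 1
    return passed


noisy_scores = [1.02, 1.02, 1.02]  # a steady 2% gain on every stage
print(stages_passed(noisy_scores, [1.03, 1.03, 1.03]))  # 0: never clears the first gate
print(stages_passed(noisy_scores, [1.01, 1.01, 1.01]))  # 3: clears every gate
```

A candidate with a real but small gain is rejected outright under the tighter setting, so no later stage ever gets a chance to confirm it.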

### v3. `parallel_solvers > 1` with single-threaded baseline

If baseline was captured single-threaded, running candidates with
`parallel_solvers: N` co-locates them on shared cores → timings inflate
vs baseline → false regression. Either pin core ranges via `--pin` or
recapture baseline at the same parallelism.

### v4. solver_mode switch without re-baseline

If the bench uses `bench.solver_mode` variants, each mode keeps its OWN baseline
in `cache-<mode>/local_baseline.json` (optimize → plain `cache/`). After switching
`solver_mode`, the new mode's `cache-<mode>/` has no baseline → `self_test`/scoring
compare against a stale or empty baseline. ALWAYS re-run `_lib.rebaseline <bench>`
after a switch. The suffix isolation is automatic (`bench_paths.cache_dir`), so the
old mode's baseline is preserved — switching back needs no recapture.

### v5. clustering.mode override silently ignored

`clustering.mode: <name>` only applies if a matching `clustering.modes.<name>`
block exists. A typo (mode set, no block) → sampler prints
`mode=... has no modes.... — using base` and falls back to the base block. Check
sampler stdout for `clustering: applied modes.<name> override` to confirm it took.

### v6. Catalog `defaults` ≠ binding's real defaults

If `params.json` `defaults` includes a key whose value differs from the binding's
real default, the BASELINE that phase modules send may diverge from what
`_lib.rebaseline` captured. Symptom: `self_test` ratio drifts outside [0.5, 2.0].

Fix: ensure `defaults` is what the original problems.jsonl baseline run used.