diff --git a/pyproject.toml b/pyproject.toml
index 20aada0df..597f1de0a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -298,6 +298,8 @@ lint.per-file-ignores."tests/**" = [ "ANN", "D", "PLR2004", "PT", "S101", "T20"
 lint.per-file-ignores."tests/**/generate_patterns.py" = [ "PERF401" ]
 # Generated opset code: Allow long lines
 lint.per-file-ignores."src/winml/modelkit/analyze/onnx_opset/**" = [ "D", "E501", "N802", "N803", "N806", "TC001", "TC002", "TC003" ]
+# Research scripts: POC code, not production — exempt from all style/type/security rules
+lint.per-file-ignores."research/**" = [ "ANN", "D", "E", "N", "S", "T20", "UP", "W", "B", "C4", "FA", "I", "PERF", "PIE", "PT", "PTH", "RET", "RSE", "RUF", "SIM", "TC", "TID", "TRY", "G", "ICN", "F401", "F403", "F811" ]
 # === Import Conventions ===
 lint.flake8-bandit.check-typed-exception = true
 lint.flake8-bandit.hardcoded-tmp-directory = [ "/tmp", "/var/tmp", "C:\\Temp" ]
diff --git a/research/autoconfig/README.md b/research/autoconfig/README.md
new file mode 100644
index 000000000..2769b7fd1
--- /dev/null
+++ b/research/autoconfig/README.md
@@ -0,0 +1,350 @@
+# autoconfig — Automated Config Search POC
+
+**Status: Research POC — not production code.**
+
+This directory contains an experimental automated search system that finds the optimal
+`winml-cli` build configuration (execution provider, opset version, graph optimizations)
+for a given model on Windows hardware — without requiring the user to understand the
+underlying ORT/EP optimizer mechanics.
+
+---
+
+## What This Is
+
+`autoconfig.py` implements an Explorer/Optimizer/Reviewer loop as three explicit
+classes wired by a thin orchestrator (`main()`):
+
+1. **`Explorer`** — selects the next hypothesis from the **full OFAT search grid**
+   that the orchestrator enumerates via `build_search_space()` (~74 combinations). OFAT
+   means one factor at a time: starting from an FP32 baseline, each candidate varies a
+   single factor (opset 17–21; quantization precision fp32/fp16/int8/int16/w8a16; or one
+   graph pass). The Explorer builds the `priority_queue` and prunes refuted or no-op
+   configs using KB hard-blocks plus the Insight Engine `skip_set`. Pruning uses the
+   **baseline graph analysis**: a graph pass whose pattern is absent (e.g. no Conv→BN
+   subgraph) is cut, while passes whose pattern is present are boosted to the front.
+   The Explorer owns search *order* only; the grid itself is generated up front, without prior results.
+2. **`Optimizer`** — runs `winml build` + `winml perf` (two-phase: 200-iter CV screen → 3×500-iter full bench)
+   + `winml eval` accuracy. Produces raw measurements only. A graph pass that
+   builds to a graph identical to the baseline (`graph_is_noop`) is discarded
+   before benchmarking — it matched nothing.
+3. **`Reviewer`** — applies the `ThroughputOnly` verdict (`threshold = max(1%, 2×CV)`),
+   decides keep/discard, and drafts KB entries.
+
+The loop terminates after 30 consecutive discards (plateau detection) or when the time budget is exhausted.
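+
+The Reviewer rule and the plateau stop can be sketched as follows. This is a minimal
+sketch, not the code in `autoconfig.py`: the names `keep_threshold`, `verdict`, and
+`run_until_plateau` are hypothetical, gains and CV are assumed to be fractions
+(0.01 = 1%), and treating a gain exactly at the threshold as a keep is an assumption.
+
+```python
+from typing import Iterable
+
+MIN_GAIN = 0.01      # 1% floor from the ThroughputOnly rule
+PLATEAU_LIMIT = 30   # consecutive discards before the loop stops
+
+
+def keep_threshold(cv: float) -> float:
+    """threshold = max(1%, 2 x CV), with CV expressed as a fraction."""
+    return max(MIN_GAIN, 2.0 * cv)
+
+
+def verdict(gain: float, cv: float) -> str:
+    """Keep a hypothesis only if its gain clears the noise-aware threshold."""
+    return "KEEP" if gain >= keep_threshold(cv) else "DISCARD"
+
+
+def run_until_plateau(results: Iterable[tuple[float, float]]) -> int:
+    """Consume (gain, cv) pairs; return how many were evaluated before stopping."""
+    consecutive_discards = 0
+    evaluated = 0
+    for gain, cv in results:
+        evaluated += 1
+        if verdict(gain, cv) == "KEEP":
+            consecutive_discards = 0   # any keep resets the plateau counter
+        else:
+            consecutive_discards += 1
+            if consecutive_discards >= PLATEAU_LIMIT:
+                break
+    return evaluated
+```
+
+Tying the threshold to CV means a noisy device automatically demands a larger gain
+before a config is kept, while the 1% floor stops quiet devices from keeping trivial wins.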
+
+The same architecture (an orchestrator plus three roles) is also captured as composable
+**skill definitions** under `skills/`: an `autoconfig-orchestrator` (the brain) delegates to
+three sub-skills, `autoconfig-explorer`, `autoconfig-optimizer`, and `autoconfig-reviewer`.
+Each `SKILL.md` mirrors the corresponding class and the diagram phase.
+
+`catalog_sweep.py` is a single, JSON-driven multi-model sweep. It reads the hypothesis
+matrix, model catalog, and per-EP bench protocol from `ep_device_knowledge/<ep>_<device>.json`
+and runs them for any `--ep/--device` combination (qnn/npu, qnn/gpu, dml/gpu, cpu/cpu),
+collecting structured results in `catalog-<device-or-ep>-sweep/<model-slug>/results.json`.
+
+`analyze_graph.py` is an ONNX graph analysis helper that identifies architectural
+patterns relevant to EP optimization (Transpose sandwiches, residual branches, GELU
+variants, depthwise Conv) and surfaces gaps in `winml analyze` output.
+
+`gen_report_v3.py` generates an HTML sweep report from `results.json` files.
+
+`autoconfig_diagram.html` is an interactive architecture diagram of the Explorer/Optimizer/
+Reviewer loop.
+
+---
+
+## Key Findings — 8-Model QNN NPU Catalog Sweep (2026-06-13)
+
+### npu-001: opset 21 NHWC bypass is real — but architecture-specific
+
+Opset ≥ 21 bypasses ORT's NHWC layout transformer for QNN EP, giving a large speedup
+on **Conv + residual** models but no benefit (or slight regression) on pure transformers:
+
+| Architecture | Models | opset 21 vs opset 17 |
+|---|---|---|
+| Conv + residual | MobileViT-small, DINOv2-small | **+26–31% speedup** |
+| Pure transformer | ViT-base, YOLOS-small | neutral / slight regression |
+| BERT-family NLP | DistilBERT, MiniLM, RoBERTa | neutral (within DVFS noise) |
+| Plain Conv (ResNet) | ResNet-18 | ~+20% (h1→h3), but DVFS-dominated |
+
+Root cause: ORT's `IsSupportedOpset()` gate in `layout_transformation.cc` causes the
+NHWC layout transform to insert Transpose nodes around Conv ops. For Conv+residual
+models these Transposes cannot be cancelled, so bypassing the transform (opset 21) gives
+a cleaner HTP graph. Pure attention models have no Conv→NHWC transposes, so the bypass
+has no effect.
+
+### npu-006: Conv fusions cause a ~4800% regression on QNN NPU for Conv-dominant models
+
+`conv_bn_fusion`, `conv_add_fusion`, `conv_activation_fusion` produce fused op nodes
+that QNN EP cannot execute natively — falling back to CPU for every fused Conv:
+
+| Model | h4 (conv fusions) vs h1 (baseline) |
+|---|---|
+| ResNet-18 | **132.3 ms vs 2.72 ms (+4764% regression)** |
+| MobileViT-small | 11.36 ms vs 11.72 ms (neutral) |
+| DistilBERT | 19.59 ms vs 19.5 ms (neutral — no Conv to fuse) |
+
+This is a critical correctness/performance hazard. `winml` should detect when the target
+EP would CPU-fallback fused Conv ops and suppress incompatible fusions automatically
+(see [Feature Gaps](#feature-gaps-identified)).
+
+### npu-007: DVFS thermal noise requires session-level averaging for reliable results
+
+QNN NPU exhibits extreme DVFS thermal throttling. The coefficient of variation (CV) is
+consistently 0.10–2.0+ (10% to over 200%) across all models. Practical implications:
+
+- The CV < 15% Phase-A gate must be **disabled** for QNN NPU (it would block every model)
+- Differences < 10% between configs are **unreliable** without ≥ 1500 total iterations
+- Recommended protocol: **3 × 500-iter sessions** with 30 s cool-down; report median of
+  session p50 values
+- 30 s cool-down reduces but does not eliminate DVFS spikes
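+
+The reporting protocol above can be sketched in a few lines. This is a minimal sketch
+under stated assumptions: the function names are hypothetical (they are not the
+`bench_utils.py` API), latencies are in milliseconds, and CV is computed as the sample
+standard deviation divided by the mean.
+
+```python
+import statistics
+
+
+def coefficient_of_variation(samples: list[float]) -> float:
+    """Sample standard deviation divided by the mean (unitless fraction)."""
+    return statistics.stdev(samples) / statistics.mean(samples)
+
+
+def summarize_sessions(session_p50s_ms: list[float]) -> dict[str, float]:
+    """Report the median of per-session p50 latencies plus between-session CV."""
+    return {
+        "median_p50_ms": statistics.median(session_p50s_ms),
+        "session_cv": coefficient_of_variation(session_p50s_ms),
+    }
+
+
+def difference_is_reliable(gain: float, total_iters: int) -> bool:
+    """Rule of thumb from the list above: gains under 10% need >= 1500 iterations."""
+    return abs(gain) >= 0.10 or total_iters >= 1500
+```
+
+Taking the median across sessions means a single throttled session (one DVFS spike)
+shifts the reported value far less than it would shift a mean.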
+
+---
+
+## How to Run
+
+### Prerequisites
+
+- `winml` CLI installed and on PATH
+- Python 3.11+ with `onnx` package (`pip install onnx`)
+- For QNN experiments: Snapdragon X Elite device with QNN SDK (Hexagon HTP driver)
+
+### autoconfig.py — single-model adaptive search
+
+Configuration lives at the top of the file; edit `MODEL_ID`, `TASK`, `EP`, `DEVICE`, and `WORK_DIR` before running:
+
+```bash
+# Default: facebook/convnext-tiny-224 on CPU
+python skills/orchestrator/autoconfig.py
+```
+
+Results are written to `WORK_DIR/results.tsv` and per-hypothesis subdirectories.
+The script reads `ep_device_knowledge/<ep>_<device>.json` to prune already-refuted configurations.
+
+### catalog_sweep.py — JSON-driven multi-model sweep
+
+One driver covers every EP/device. The hypothesis matrix, model catalog, and bench
+protocol (screen/full iterations, thermal handling, effect-size gate, paired A/B,
+accuracy eval) all come from `ep_device_knowledge/<ep>_<device>.json`:
+
+```bash
+# Full QNN NPU catalog sweep (all models, ~6-8 hours on X Elite)
+python tools/catalog_sweep.py --ep qnn --device npu
+
+# CPU EP sweep, single model
+python tools/catalog_sweep.py --ep cpu --device cpu --model microsoft/resnet-18
+
+# QNN GPU sweep
+python tools/catalog_sweep.py --ep qnn --device gpu
+
+# Show the models/hypotheses configured for an EP/device
+python tools/catalog_sweep.py --ep qnn --device npu --list
+```
+
+Results land in `catalog-<device-or-ep>-sweep/<model-slug>/`: `results.json`, an HTML
+report, and `champion_<ep>_<device>.json`. The champion file is the recommended build
+config itself, a copy of the optimal hypothesis's `winml_build_config.json`, so it can be
+fed straight back to `winml build -c`. A `SUMMARY.md` is regenerated at the end of each sweep.
+
+### analyze_graph.py — ONNX graph analysis
+
+```bash
+# Edit the onnx path at the top of the file, then:
+python skills/explorer/analyze_graph.py
+```
+
+Prints Transpose patterns, residual branch structure, GELU variants, and op domain
+breakdown to stdout.
+
+---
+
+## ep_device_knowledge/ — Empirical Knowledge Base
+
+Each JSON file stores empirical findings **and** the sweep configuration for one
+EP/device combination, named `<ep>_<device>.json`:
+
+| File | EP/device |
+|---|---|
+| `cpu_cpu.json` | CPU EP (Snapdragon X Elite Oryon) |
+| `dml_gpu.json` | DirectML EP (GPU) |
+| `qnn_gpu.json` | QNN Adreno GPU |
+| `qnn_npu.json` | QNN HTP (Hexagon NPU) — most findings here |
+
+### Schema overview
+
+Each file has a `findings` array. Each finding has:
+
+```json
+{
+  "id": "npu-001",
+  "title": "...",
+  "mechanism_confirmed": true,
+  "architecture_requirement": ["has_conv_ops", "has_residual_connections"],
+  "status": "confirmed",
+  "confidence": "high"
+}
+```
+
+Each file also carries the data-driven sweep contract consumed by `catalog_sweep.py`:
+`sweep_config` (bench protocol), `hypotheses` (the h0–hN matrix with opset/optim/guards),
+`models` (the catalog), and `cross_checks` (npu-001 opset-bypass, npu-006 catastrophic
+regression, cpu-001 regression probe).
+
+Finally, each file has a `search_space_rules` object that `autoconfig.py` reads to prune
+configurations (only findings with `"mechanism_confirmed": true` are applied as pruning rules).
+
+### Adding a new finding
+
+1. Run the experiment and collect bench data
+2. Add an entry to the appropriate `ep_device_knowledge/<ep>_<device>.json` under `findings`
+3. Set `"mechanism_confirmed": false` and `"confidence": "draft"` until the mechanism
+   is understood from ORT/EP source code
+4. If the finding prunes a search dimension, add a rule under `search_space_rules`
+5. Set `"mechanism_confirmed": true` only after source code investigation confirms
+   the root cause — do NOT promote to confirmed based on benchmark numbers alone
+6. See `ep_device_knowledge/README.md` for the epistemics guidelines
+
+---
+
+## Self-Evolution Tooling
+
+This tooling implements the loop described in [`docs/self-evolution-design.html`](docs/self-evolution-design.html):
+how sweeps stabilize their own conclusions and promote findings to a draft KB without a human in the loop.
+
+### skills/optimizer/bench_utils.py — paired A/B + adaptive sampling
+
+Shared bench primitives used across sweeps:
+
+- **`paired_ab_bench(run_session, baseline, hyp, n_pairs)`** (Fix #1) — interleaves the
+  baseline and hypothesis perf sessions in one thermal window so DVFS/thermal drift appears
+  in both legs and **cancels** in the within-pair ratio. Returns mean gain, 95% CI, and a
+  verdict (`KEEP_CONFIRMED` / `MARGINAL` / `DISCARD`). This is the unbiased fix for the
+  npu-001/MobileViT failure, where a cold baseline vs warm hypothesis manufactured a fake win.
+- **`adaptive_paired_ab_bench(...)`** (Fix #2) — keeps adding pairs until the 95% CI is
+  decisive (clears the KEEP or DISCARD band) or `MAX_PAIRS` is reached. Stable models finish
+  in `MIN_PAIRS=3`; noisy ones automatically get more samples.
+- **`thermal_classify(ref_p50, cold_ref_p50)`** (Fix #5) — classifies device thermal state
+  (`COOL`/`WARM`/`HOT_RUN`) from a reference-model latency, for excluding throttled runs.
+- **`session_cv(p50s)`** — between-session coefficient of variation (the effect-size noise floor).
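+
+The drift-cancelling idea behind the paired bench can be sketched as follows. This is a
+minimal sketch, not the `bench_utils.py` implementation: `paired_gain_ci` is a
+hypothetical name, `run_session(config)` is assumed to return a p50 latency in ms, and
+the interval uses a normal approximation (z = 1.96), which is an assumption.
+
+```python
+import math
+import statistics
+from typing import Callable
+
+
+def paired_gain_ci(
+    run_session: Callable[[str], float],
+    baseline: str,
+    hyp: str,
+    n_pairs: int = 3,
+) -> tuple[float, float, float]:
+    """Interleave baseline/hypothesis sessions; return (mean gain, CI low, CI high).
+
+    Gain per pair is (baseline - hyp) / baseline, so a slowdown that hits
+    both legs of a pair equally cancels in the ratio.
+    """
+    gains = []
+    for _ in range(n_pairs):
+        base_ms = run_session(baseline)   # leg A
+        hyp_ms = run_session(hyp)         # leg B, run back to back with leg A
+        gains.append((base_ms - hyp_ms) / base_ms)
+    mean = statistics.mean(gains)
+    if n_pairs < 2:
+        return mean, -math.inf, math.inf  # one pair gives no spread estimate
+    # Normal approximation; a t-interval would be wider for small n_pairs.
+    half_width = 1.96 * statistics.stdev(gains) / math.sqrt(n_pairs)
+    return mean, mean - half_width, mean + half_width
+```
+
+A verdict can then be read off the interval, for example by keeping a config only when
+the lower bound clears the keep threshold; the exact bands are a design choice.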
+
+The QNN sweep opts into paired A/B with `--paired-ab` (default off; the validated default is
+the sequential Phase B):
+
+```bash
+python tools/catalog_sweep.py --ep qnn --device npu --model apple/mobilevit-small --task image-classification --paired-ab
+```
+
+### skills/reviewer/promote_findings.py — confidence-gated KB promotion (L1 → L4)
+
+Post-processing script (Fix #4) that reads every `catalog-*-sweep/*/results.json` and applies
+the confidence ladder, writing a **draft** to `ep_device_knowledge/_auto_promoted.json` (it never
+clobbers the curated `<ep>_<device>.json` files):
+
+| Level | Gate |
+|---|---|
+| **L1** Observed | median gain ≥ 5% on one model, one run |
+| **L2** Confirmed | hypothesis p50 range strictly below baseline range **and** gain ≥ 2×(session CV) — the same effect-size gate the sweep uses |
+| **L3** Generalized | same `(ep, flags)` reaches L2 on ≥2 distinct models of one architecture class (`model_type`) |
+| **L4** Cross-cutting | same `(ep, flags)` reaches L2 across ≥3 architecture classes |
+
+```bash
+python skills/reviewer/promote_findings.py   # writes ep_device_knowledge/_auto_promoted.json
+```
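+
+The ladder in the table can be sketched as a small classifier. This is a minimal sketch
+under stated assumptions, not the logic in `promote_findings.py`: the `Observation`
+fields and function names are hypothetical, gains and CV are fractions, and checking
+the levels from L4 downward is an assumption.
+
+```python
+from collections import defaultdict
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class Observation:
+    model: str
+    model_type: str                    # architecture class
+    median_gain: float                 # fraction, 0.05 == 5%
+    hyp_range: tuple[float, float]     # (min, max) hypothesis session p50, ms
+    base_range: tuple[float, float]    # (min, max) baseline session p50, ms
+    session_cv: float                  # fraction
+
+
+def reaches_l2(obs: Observation) -> bool:
+    """Hypothesis range strictly below baseline range and gain >= 2 x session CV."""
+    separated = obs.hyp_range[1] < obs.base_range[0]
+    return separated and obs.median_gain >= 2 * obs.session_cv
+
+
+def ladder_level(observations: list[Observation]) -> int:
+    """Return 0-4 for one (ep, flags) candidate across all observed models."""
+    l2_models_by_class: dict[str, set[str]] = defaultdict(set)
+    for obs in observations:
+        if reaches_l2(obs):
+            l2_models_by_class[obs.model_type].add(obs.model)
+    if len(l2_models_by_class) >= 3:
+        return 4   # L2 across >= 3 architecture classes
+    if any(len(models) >= 2 for models in l2_models_by_class.values()):
+        return 3   # L2 on >= 2 distinct models of one class
+    if l2_models_by_class:
+        return 2
+    if any(obs.median_gain >= 0.05 for obs in observations):
+        return 1   # observed, but not yet separated from noise
+    return 0
+```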
+
+A human applies the promotion checklist in [`ep_device_knowledge/README.md`](ep_device_knowledge/README.md)
+(paired A/B, clean baseline, effect-size > noise floor, independent reruns, baseline-drift
+check) before merging any auto-promoted candidate into the curated KB.
+
+### skills/explorer/analyze_insight.py — architecture-based pruning (Fix #3)
+
+`build_insight()` fuses graph fingerprint + `winml analyze` + KB rules into a `skip_set`
+(hypotheses to prune) and `priority_boosts` (reordering), cutting the 14-hypothesis matrix
+to the few that matter per architecture.
+
+---
+
+## Feature Gaps Identified
+
+Four actionable gaps in `winml-cli` surfaced by this research:
+
+1. **FusedConv detection in `winml analyze`** — `analyze` should detect Conv ops that
+   would CPU-fallback on QNN NPU after fusion (npu-006), and either warn or suppress
+   incompatible fusions in the generated build config.
+
+2. **DVFS-aware perf** — `winml perf` should support `--thermal-stabilization` mode
+   that waits for device temperature to stabilize before measurements, and should report
+   confidence intervals rather than a single p50.
+
+3. **Budget-aware sweep** — `tools/catalog_sweep.py` exhausts the 20-min budget on models
+   with a baseline over 50 ms after just 2 hypotheses (YOLOS: 78 ms × 3×500 iters ≈ 117 s of inference plus 3 × 30 s cool-down ≈ 207 s/hypothesis).
+   A `--quick` flag that reduces to 1×200-iter for large models is needed.
+
+4. **Benefit-gated fusion in `winml analyze`** — the analyzer currently auto-applies a fusion
+   whenever the graph pattern matches, but a fusion *firing* (op count drops / graph topology
+   changes after the flag) does **not** imply a perf win. Many fusions fire cleanly yet land
+   within measurement noise (e.g. BERT/ConvNeXt on QNN NPU — graph changes, p50 unchanged, see
+   npu-011). The analyzer should: (a) confirm a fusion actually fired by diffing pre/post-optimize
+   op counts and graph topology (not just pattern-match the input graph), and (b) gate retention
+   of that fusion on a measured perf delta beyond the noise band — applied-but-not-beneficial
+   fusions should be dropped (or flagged) rather than kept, since they add build cost and EP risk
+   for no return. This research records such cases so they can train that benefit gate.
+
+---
+
+## Directory Layout
+
+```
+research/autoconfig/
+├── README.md                    ← this file
+│
+├── skills/                      ← the agent loop, one folder per role (each has SKILL.md + its scripts)
+│   ├── orchestrator/            ← the brain: Phase 0–3 lifecycle
+│   │   ├── SKILL.md
+│   │   └── autoconfig.py        ← adaptive single-model search loop (Explorer/Optimizer/Reviewer classes)
+│   ├── explorer/                ← "what to try next": priority_queue + skip_set
+│   │   ├── SKILL.md
+│   │   ├── analyze_insight.py   ← graph + analyze + KB → skip_set / priority_boosts
+│   │   └── analyze_graph.py     ← ONNX graph pattern analysis helper
+│   ├── optimizer/               ← "run it": build → screen → full bench → eval
+│   │   ├── SKILL.md
+│   │   └── bench_utils.py       ← shared bench primitives (paired A/B, adaptive, thermal, verdict)
+│   └── reviewer/                ← "judge it": ThroughputOnly verdict + KB draft
+│       ├── SKILL.md
+│       └── promote_findings.py  ← L1→L4 confidence-gated KB promotion (draft sink)
+│
+├── lib/                         ← shared, role-agnostic helpers
+│   ├── report_gen.py            ← HTML/markdown report rendering
+│   └── gen_model_report.py      ← per-model report builder used by the sweeps
+│
+├── tools/                       ← batch drivers and one-off utilities
+│   ├── catalog_sweep.py         ← JSON-driven multi-model sweep (--ep/--device, --paired-ab)
+│   ├── validation_sweep.py      ← re-runs to validate KB findings
+│   └── gen_report_v3.py         ← legacy HTML report generator
+│
+├── docs/                        ← design docs (self-evolution, agent, skills, cross-device)
+│   └── autoconfig_diagram.html  ← Explorer/Optimizer/Reviewer architecture diagram
+│
+├── ep_device_knowledge/
+│   ├── README.md                ← epistemics guidelines + promotion checklist
+│   ├── _auto_promoted.json      ← promote_findings.py output (auto-generated draft)
+│   ├── cpu_cpu.json             ← CPU EP findings + sweep config (ConvNext, 6 findings)
+│   ├── dml_gpu.json             ← DirectML EP findings + sweep config
+│   ├── qnn_gpu.json             ← QNN Adreno GPU findings + sweep config
+│   └── qnn_npu.json             ← QNN HTP NPU findings + sweep config (npu-001 … npu-007)
+│
+├── catalog-qnn-sweep/           ← QNN NPU sweep results (also catalog-cpu-sweep/, catalog-gpu-sweep/)
+│   ├── SUMMARY.md               ← 8-model sweep results and cross-model analysis
+│   ├── apple--mobilevit-small/  ← per-model tuning products live together:
+│   │   ├── results.json         ←   benchmark results + verdicts
+│   │   ├── report.html          ←   per-model HTML report
+│   │   └── champion_qnn_npu.json ←  recommended build config (raw winml_build_config.json)
+│   ├── facebook--dinov2-small/
+│   ├── microsoft--resnet-18/
+│   ├── google--vit-base-patch16-224/
+│   ├── deepset--roberta-base-squad2/
+│   ├── distilbert--distilbert-base-uncased-finetuned-sst-2-english/
+│   ├── sentence-transformers--all-MiniLM-L6-v2/
+│   └── hustvl--yolos-small/
+│
+└── catalog-cpu-sweep/, catalog-gpu-sweep/  ← analogous per-model results for CPU / QNN GPU
+```
diff --git a/research/autoconfig/catalog-cpu-sweep/.gitignore b/research/autoconfig/catalog-cpu-sweep/.gitignore
new file mode 100644
index 000000000..b3b91d38b
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/.gitignore
@@ -0,0 +1,10 @@
+# Hypothesis build artifacts (large binary files)
+h*/
+_tmp_config/
+# Raw perf session files
+full_perf_s*.json
+screen_perf.json
+confirm_s*.json
+# Model weight files
+*.data
+*.onnx
diff --git a/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html
new file mode 100644
index 000000000..96988c64b
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html
@@ -0,0 +1,598 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>CPU CPU Optimization Report — apple/mobilevit-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>CPU CPU Optimization Report — apple/mobilevit-small</h1>
+  <div class="subtitle">mobilevit arch · 2026-06-18 · 14 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+12.3%</div>
+      <div class="kpi-sub">Champion: h7</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">73.17 ms → 64.14 ms</div>
+      <div class="kpi-sub">Latency reduction: 9.03 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">CPU / CPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h7</div>
+      <div class="kpi-sub">opset 17 + bias_softmax_fusion</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">14</div>
+      <div class="kpi-sub">0 KEEP / 12 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>apple/mobilevit-small</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>mobilevit</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>cpu</td></tr><tr><th>Device</th><td>cpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 634" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="608" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="608" class="tick-line" /><text x="150.0" y="628" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="608" class="tick-line" /><text x="280.0" y="628" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="608" class="tick-line" /><text x="410.0" y="628" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="608" class="tick-line" /><text x="540.0" y="628" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="608" class="tick-line" /><text x="670.0" y="628" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (opset 17, autoconf defaults)
+status=OK  verdict=BASELINE
+p50=73.17 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (opset 17, autoc…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=87.48 ms  gain=-19.6%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="384.6" y="96.0" width="25.4" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="376.6" y="112.0" text-anchor="end" class="value-text">-19.6%</text></g><g><title>h2: opset 19 (cpu-001 risk — transformer test)
+status=OK  verdict=DISCARD
+p50=79.83 ms  gain=-9.1%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19 (cpu-001 risk — …</text><rect x="398.2" y="136.0" width="11.8" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="390.2" y="152.0" text-anchor="end" class="value-text">-9.1%</text></g><g><title>h3: opset 21 (cpu-001 risk — transformer test)
+status=OK  verdict=DISCARD
+p50=78.59 ms  gain=-7.4%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (cpu-001 risk — …</text><rect x="400.4" y="176.0" width="9.6" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="392.4" y="192.0" text-anchor="end" class="value-text">-7.4%</text></g><g><title>h4: opset 17 + attention_fusion
+status=OK  verdict=DISCARD
+p50=77.77 ms  gain=-6.3%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="401.8" y="216.0" width="8.2" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="393.8" y="232.0" text-anchor="end" class="value-text">-6.3%</text></g><g><title>h5: opset 17 + skip_layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=80.51 ms  gain=-10.0%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + skip_layer_nor…</text><rect x="396.9" y="256.0" width="13.1" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="388.9" y="272.0" text-anchor="end" class="value-text">-10.0%</text></g><g><title>h6: opset 17 + layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=803.22 ms  gain=-997.8%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="150.0" y="296.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="312.0" text-anchor="end" class="value-text">-997.8%</text></g><g><title>h7: opset 17 + bias_softmax_fusion
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=64.14 ms  gain=-63.4%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="327.6" y="336.0" width="82.4" height="24" fill="#e53935" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="319.6" y="352.0" text-anchor="end" class="value-text">-63.4%</text></g><g><title>h8: opset 17 + matmul_add_fusion (cpu-002 guarded)
+status=SKIPPED_CPU002  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + matmul_add_fus…</text></g><g><title>h9: opset 17 + matmul_transpose_fusion
+status=OK  verdict=DISCARD
+p50=194.19 ms  gain=-165.4%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="195.0" y="416.0" width="215.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="187.0" y="432.0" text-anchor="end" class="value-text">-165.4%</text></g><g><title>h10: opset 17 + attention + skip_layer_norm + layer_norm
+status=OK  verdict=DISCARD
+p50=193.64 ms  gain=-164.7%</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + attention + sk…</text><rect x="195.9" y="456.0" width="214.1" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="187.9" y="472.0" text-anchor="end" class="value-text">-164.7%</text></g><g><title>h11: opset 17 + nchwc_transformer (Conv-heavy models)
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + nchwc_transfor…</text><rect x="364.0" y="496.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="512.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="364.0" y="536.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="552.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h13: opset 17 + gelu_fusion explicit
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="568.0" width="748" height="40" class="row-bg" /><text x="8" y="584.0" class="hyp-label">h13</text><text x="8" y="597.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="364.0" y="576.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="592.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (opset 17, autoconf defaults)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">73.17 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[73.17 · 72.10 · 80.23]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">baseline reference</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">87.48 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[87.48 · 89.86 · 57.04]</span></td>
+          <td><span class="gain-neg">-19.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">79.83 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[74.50 · 86.26 · 79.83]</span></td>
+          <td><span class="gain-neg">-9.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">78.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[67.43 · 84.27 · 78.59]</span></td>
+          <td><span class="gain-neg">-7.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">77.77 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[83.44 · 70.19 · 77.77]</span></td>
+          <td><span class="gain-neg">-6.3%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">skip_layer_norm_fusion</span></td>
+          <td class="p50-cell">80.51 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[80.51 · 60.10 · 785.99]</span></td>
+          <td><span class="gain-neg">-10.0%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">803.22 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[817.62 · 803.22 · 184.94]</span></td>
+          <td><span class="gain-neg">-997.8%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h7</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">64.14 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[60.64 · 64.14 · 119.52 · 239.25 · 279.32]</span></td>
+          <td><span class="gain-neg">-63.4%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">2/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">SKIPPED_CPU002</span></td>
+          <td class="conf-cell">guarded skip</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">194.19 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[194.19 · 175.97 · 203.41]</span></td>
+          <td><span class="gain-neg">-165.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + attention + skip_layer_norm + layer_norm</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span>, <span class="flag-pill">skip_layer_norm_fusion</span>, <span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">193.64 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[261.24 · 189.74 · 193.64]</span></td>
+          <td><span class="gain-neg">-164.7%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + nchwc_transformer (Conv-heavy models)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h13</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions; the median across sessions is used for comparison
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-19.6%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19 (cpu-001 risk — transformer test)</td>
+              <td class="gain-neg">-9.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (cpu-001 risk — transformer test)</td>
+              <td class="gain-neg">-7.4%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-neg">-6.3%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-neg">-10.0%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-neg">-997.8%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-neg">-63.4%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>2/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-165.4%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + attention + skip_layer_norm + layer_norm</td>
+              <td class="gain-neg">-164.7%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + nchwc_transformer (Conv-heavy models)</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h13</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Skipped</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (opset 17, autoconf defaults)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>baseline reference</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+              <td>—</td>
+              <td>SKIPPED_CPU002</td>
+              <td>guarded skip</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json
new file mode 100644
index 000000000..173f889f9
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json
@@ -0,0 +1,232 @@
+{
+  "model_id": "apple/mobilevit-small",
+  "task": "image-classification",
+  "model_type": "mobilevit",
+  "timestamp": "2026-06-18T15:29:58",
+  "ep": "cpu",
+  "device": "cpu",
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline (opset 17, autoconf defaults)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 66.804,
+      "screen_cv": 3.1693311777737856,
+      "full_p50s_ms": [
+        73.166,
+        72.1,
+        80.234
+      ],
+      "median_p50_ms": 73.166,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 69.007,
+      "screen_cv": 3.623472981001927,
+      "full_p50s_ms": [
+        87.48,
+        89.858,
+        57.036
+      ],
+      "median_p50_ms": 87.48,
+      "gain_vs_baseline_pct": -19.56,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19 (cpu-001 risk — transformer test)",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 78.369,
+      "screen_cv": 3.1204047518789317,
+      "full_p50s_ms": [
+        74.505,
+        86.262,
+        79.826
+      ],
+      "median_p50_ms": 79.826,
+      "gain_vs_baseline_pct": -9.1,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (cpu-001 risk — transformer test)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 41.225,
+      "screen_cv": 5.67767131594906,
+      "full_p50s_ms": [
+        67.43,
+        84.267,
+        78.586
+      ],
+      "median_p50_ms": 78.586,
+      "gain_vs_baseline_pct": -7.41,
+      "verdict": "DISCARD"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 57.061,
+      "screen_cv": 4.881863269133033,
+      "full_p50s_ms": [
+        83.444,
+        70.192,
+        77.772
+      ],
+      "median_p50_ms": 77.772,
+      "gain_vs_baseline_pct": -6.3,
+      "verdict": "DISCARD"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "skip_layer_norm_fusion": true
+      },
+      "screen_p50_ms": 72.701,
+      "screen_cv": 3.3349472496939523,
+      "full_p50s_ms": [
+        80.514,
+        60.097,
+        785.991
+      ],
+      "median_p50_ms": 80.514,
+      "gain_vs_baseline_pct": -10.04,
+      "verdict": "DISCARD"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 759.837,
+      "screen_cv": 0.4795699603994014,
+      "full_p50s_ms": [
+        817.624,
+        803.217,
+        184.944
+      ],
+      "median_p50_ms": 803.217,
+      "gain_vs_baseline_pct": -997.8,
+      "verdict": "DISCARD"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 52.703,
+      "screen_cv": 0.4295580896723146,
+      "full_p50s_ms": [
+        60.637,
+        64.137,
+        119.521
+      ],
+      "median_p50_ms": 64.137,
+      "gain_vs_baseline_pct": 12.34,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        239.249,
+        279.325
+      ],
+      "all_p50s_ms": [
+        60.637,
+        64.137,
+        119.521,
+        239.249,
+        279.325
+      ],
+      "overall_median_p50_ms": 119.521,
+      "overall_gain_pct": -63.36,
+      "sessions_above_threshold": 2,
+      "total_sessions": 5
+    },
+    "h8": {
+      "status": "SKIPPED_CPU002",
+      "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)",
+      "opset": 17,
+      "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 153.131,
+      "screen_cv": 1.1719312222868001,
+      "full_p50s_ms": [
+        194.194,
+        175.965,
+        203.405
+      ],
+      "median_p50_ms": 194.194,
+      "gain_vs_baseline_pct": -165.42,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "OK",
+      "label": "opset 17 + attention + skip_layer_norm + layer_norm",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 202.155,
+      "screen_cv": 1.1776211322994732,
+      "full_p50s_ms": [
+        261.236,
+        189.739,
+        193.641
+      ],
+      "median_p50_ms": 193.641,
+      "gain_vs_baseline_pct": -164.66,
+      "verdict": "DISCARD"
+    },
+    "h11": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + nchwc_transformer (Conv-heavy models)",
+      "opset": 17,
+      "build_error": "     device                                            \n⏳ Optimize  Optimizing ONNX graph...\n   Analyzing 395 nodes  (iter 1/3)\n   Patterns\n     Matmul Add  → matmul_add_fusion\n   Optimizing  (applying autoconf)\n     {'matmul_add_fusion': True}Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h12": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "build_error": "pple--mobilevit-small\\h12\\export.onnx\n(21.6 MB)\n[06/18/26 16:34:49] ERROR    ✗ ort_graph failed: [Errno 28] No space left on   \n                             device                                            \n⏳ Optimize  Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h13": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "build_error": "📂 Output:    \nC:\\tmp\\autoconfig-demo\\catalog-cpu-sweep\\apple--mobilevit-small\\h13\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    }
+  },
+  "baseline_p50_ms": 73.166,
+  "best_p50_ms": 64.137,
+  "best_hypothesis": "h7",
+  "best_gain_pct": 12.34,
+  "errors": [
+    "h11: BUILD_FAIL",
+    "h12: BUILD_FAIL",
+    "h13: BUILD_FAIL"
+  ],
+  "baseline_opset": 17
+}
diff --git a/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html
new file mode 100644
index 000000000..df542812b
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html
@@ -0,0 +1,598 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>CPU Optimization Report — facebook/dinov2-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>CPU Optimization Report — facebook/dinov2-small</h1>
+  <div class="subtitle">dinov2 arch · 2026-06-18 · 14 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">112.60 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">CPU / CPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">14</div>
+      <div class="kpi-sub">0 KEEP / 12 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>facebook/dinov2-small</td></tr><tr><th>Task</th><td>image-feature-extraction</td></tr><tr><th>Arch type</th><td>dinov2</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>cpu</td></tr><tr><th>Device</th><td>cpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 634" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="608" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="608" class="tick-line" /><text x="150.0" y="628" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="608" class="tick-line" /><text x="280.0" y="628" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="608" class="tick-line" /><text x="410.0" y="628" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="608" class="tick-line" /><text x="540.0" y="628" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="608" class="tick-line" /><text x="670.0" y="628" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (opset 17, autoconf defaults)
+status=OK  verdict=BASELINE
+p50=112.60 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (opset 17, autoc…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=762.81 ms  gain=-577.5%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="150.0" y="96.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="112.0" text-anchor="end" class="value-text">-577.5%</text></g><g><title>h2: opset 19 (cpu-001 risk — transformer test)
+status=OK  verdict=CPU001_REGRESSION
+p50=1106.11 ms  gain=-882.4%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19 (cpu-001 risk — …</text><rect x="150.0" y="136.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="152.0" text-anchor="end" class="value-text">-882.4%</text></g><g><title>h3: opset 21 (cpu-001 risk — transformer test)
+status=OK  verdict=CPU001_REGRESSION
+p50=1095.19 ms  gain=-872.6%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (cpu-001 risk — …</text><rect x="150.0" y="176.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="192.0" text-anchor="end" class="value-text">-872.6%</text></g><g><title>h4: opset 17 + attention_fusion
+status=OK  verdict=DISCARD
+p50=1083.83 ms  gain=-862.6%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="150.0" y="216.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="232.0" text-anchor="end" class="value-text">-862.6%</text></g><g><title>h5: opset 17 + skip_layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=1103.07 ms  gain=-879.6%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + skip_layer_nor…</text><rect x="150.0" y="256.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="272.0" text-anchor="end" class="value-text">-879.6%</text></g><g><title>h6: opset 17 + layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=148.70 ms  gain=-32.1%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="368.3" y="296.0" width="41.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="360.3" y="312.0" text-anchor="end" class="value-text">-32.1%</text></g><g><title>h7: opset 17 + bias_softmax_fusion
+status=OK  verdict=DISCARD
+p50=1121.98 ms  gain=-896.4%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="150.0" y="336.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="352.0" text-anchor="end" class="value-text">-896.4%</text></g><g><title>h8: opset 17 + matmul_add_fusion (cpu-002 guarded)
+status=SKIPPED_CPU002  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + matmul_add_fus…</text></g><g><title>h9: opset 17 + matmul_transpose_fusion
+status=OK  verdict=DISCARD
+p50=186.48 ms  gain=-65.6%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="324.7" y="416.0" width="85.3" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="316.7" y="432.0" text-anchor="end" class="value-text">-65.6%</text></g><g><title>h10: opset 17 + attention + skip_layer_norm + layer_norm
+status=OK  verdict=DISCARD
+p50=136.57 ms  gain=-21.3%</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + attention + sk…</text><rect x="382.3" y="456.0" width="27.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="374.3" y="472.0" text-anchor="end" class="value-text">-21.3%</text></g><g><title>h11: opset 17 + nchwc_transformer (Conv-heavy models)
+status=OK  verdict=DISCARD
+p50=157.51 ms  gain=-39.9%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + nchwc_transfor…</text><rect x="358.2" y="496.0" width="51.8" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="350.2" y="512.0" text-anchor="end" class="value-text">-39.9%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=DISCARD
+p50=154.59 ms  gain=-37.3%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="361.5" y="536.0" width="48.5" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="353.5" y="552.0" text-anchor="end" class="value-text">-37.3%</text></g><g><title>h13: opset 17 + gelu_fusion explicit
+status=OK  verdict=DISCARD
+p50=154.10 ms  gain=-36.9%</title><rect x="0" y="568.0" width="748" height="40" class="row-bg" /><text x="8" y="584.0" class="hyp-label">h13</text><text x="8" y="597.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="362.1" y="576.0" width="47.9" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="354.1" y="592.0" text-anchor="end" class="value-text">-36.9%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (opset 17, autoconf defaults)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">112.60 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[142.03 · 105.56 · 112.60]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">762.81 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[150.63 · 1123.34 · 762.81]</span></td>
+          <td><span class="gain-neg">-577.5%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">1106.11 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1106.11 · 1104.49 · 1164.20]</span></td>
+          <td><span class="gain-neg">-882.4%</span></td>
+          <td><span class="">CPU001_REGRESSION</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">1095.19 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1057.56 · 1095.19 · 1128.22]</span></td>
+          <td><span class="gain-neg">-872.6%</span></td>
+          <td><span class="">CPU001_REGRESSION</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">1083.83 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1086.54 · 1068.75 · 1083.83]</span></td>
+          <td><span class="gain-neg">-862.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">skip_layer_norm_fusion</span></td>
+          <td class="p50-cell">1103.07 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1119.95 · 1103.07 · 161.83]</span></td>
+          <td><span class="gain-neg">-879.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">148.70 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[142.60 · 155.01 · 148.70]</span></td>
+          <td><span class="gain-neg">-32.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">1121.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[899.91 · 1145.34 · 1121.98]</span></td>
+          <td><span class="gain-neg">-896.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">SKIPPED_CPU002</span></td>
+          <td class="conf-cell">guarded skip</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">186.48 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[161.47 · 186.48 · 334.34]</span></td>
+          <td><span class="gain-neg">-65.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + attention + skip_layer_norm + layer_norm</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span>, <span class="flag-pill">skip_layer_norm_fusion</span>, <span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">136.57 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[121.38 · 167.90 · 136.57]</span></td>
+          <td><span class="gain-neg">-21.3%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + nchwc_transformer (Conv-heavy models)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">nchwc_transformer</span></td>
+          <td class="p50-cell">157.51 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[157.51 · 192.39 · 157.25]</span></td>
+          <td><span class="gain-neg">-39.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">154.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[175.29 · 143.11 · 154.59]</span></td>
+          <td><span class="gain-neg">-37.3%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h13</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">154.10 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[146.72 · 163.78 · 154.10]</span></td>
+          <td><span class="gain-neg">-36.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-577.5%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19 (cpu-001 risk — transformer test)</td>
+              <td class="gain-neg">-882.4%</td>
+              <td>CPU001_REGRESSION</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (cpu-001 risk — transformer test)</td>
+              <td class="gain-neg">-872.6%</td>
+              <td>CPU001_REGRESSION</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-neg">-862.6%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-neg">-879.6%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-neg">-32.1%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-neg">-896.4%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-65.6%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + attention + skip_layer_norm + layer_norm</td>
+              <td class="gain-neg">-21.3%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + nchwc_transformer (Conv-heavy models)</td>
+              <td class="gain-neg">-39.9%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-neg">-37.3%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h13</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-neg">-36.9%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (opset 17, autoconf defaults)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+              <td class="gain-pos">—</td>
+              <td>SKIPPED_CPU002</td>
+              <td>guarded skip</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json
new file mode 100644
index 000000000..88067a09f
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json
@@ -0,0 +1,249 @@
+{
+  "model_id": "facebook/dinov2-small",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-18T12:25:19",
+  "ep": "cpu",
+  "device": "cpu",
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline (opset 17, autoconf defaults)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 1108.058,
+      "screen_cv": 0.303283763124313,
+      "full_p50s_ms": [
+        142.033,
+        105.561,
+        112.599
+      ],
+      "median_p50_ms": 112.599,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 114.372,
+      "screen_cv": 3.0164201028223694,
+      "full_p50s_ms": [
+        150.633,
+        1123.338,
+        762.812
+      ],
+      "median_p50_ms": 762.812,
+      "gain_vs_baseline_pct": -577.46,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19 (cpu-001 risk — transformer test)",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 918.187,
+      "screen_cv": 0.6378613506834665,
+      "full_p50s_ms": [
+        1106.113,
+        1104.489,
+        1164.205
+      ],
+      "median_p50_ms": 1106.113,
+      "gain_vs_baseline_pct": -882.35,
+      "verdict": "CPU001_REGRESSION"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (cpu-001 risk — transformer test)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 1139.678,
+      "screen_cv": 0.23544106317749397,
+      "full_p50s_ms": [
+        1057.558,
+        1095.186,
+        1128.223
+      ],
+      "median_p50_ms": 1095.186,
+      "gain_vs_baseline_pct": -872.64,
+      "verdict": "CPU001_REGRESSION"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 1093.504,
+      "screen_cv": 0.2851768260564205,
+      "full_p50s_ms": [
+        1086.54,
+        1068.752,
+        1083.83
+      ],
+      "median_p50_ms": 1083.83,
+      "gain_vs_baseline_pct": -862.56,
+      "verdict": "DISCARD"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "skip_layer_norm_fusion": true
+      },
+      "screen_p50_ms": 1099.529,
+      "screen_cv": 0.3173004077200328,
+      "full_p50s_ms": [
+        1119.951,
+        1103.065,
+        161.832
+      ],
+      "median_p50_ms": 1103.065,
+      "gain_vs_baseline_pct": -879.64,
+      "verdict": "DISCARD"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 881.27,
+      "screen_cv": 0.6731739421516676,
+      "full_p50s_ms": [
+        142.596,
+        155.014,
+        148.704
+      ],
+      "median_p50_ms": 148.704,
+      "gain_vs_baseline_pct": -32.07,
+      "verdict": "DISCARD"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 107.327,
+      "screen_cv": 0.367950282780661,
+      "full_p50s_ms": [
+        899.911,
+        1145.338,
+        1121.982
+      ],
+      "median_p50_ms": 1121.982,
+      "gain_vs_baseline_pct": -896.44,
+      "verdict": "DISCARD"
+    },
+    "h8": {
+      "status": "SKIPPED_CPU002",
+      "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)",
+      "opset": 17,
+      "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 102.861,
+      "screen_cv": 0.5473794732697523,
+      "full_p50s_ms": [
+        161.473,
+        186.476,
+        334.336
+      ],
+      "median_p50_ms": 186.476,
+      "gain_vs_baseline_pct": -65.61,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "OK",
+      "label": "opset 17 + attention + skip_layer_norm + layer_norm",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 168.419,
+      "screen_cv": 3.5594440057238192,
+      "full_p50s_ms": [
+        121.378,
+        167.902,
+        136.572
+      ],
+      "median_p50_ms": 136.572,
+      "gain_vs_baseline_pct": -21.29,
+      "verdict": "DISCARD"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + nchwc_transformer (Conv-heavy models)",
+      "opset": 17,
+      "extra_optim": {
+        "nchwc_transformer": true
+      },
+      "screen_p50_ms": 156.796,
+      "screen_cv": 2.250503839383658,
+      "full_p50s_ms": [
+        157.508,
+        192.392,
+        157.246
+      ],
+      "median_p50_ms": 157.508,
+      "gain_vs_baseline_pct": -39.88,
+      "verdict": "DISCARD"
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 159.442,
+      "screen_cv": 3.556904705159243,
+      "full_p50s_ms": [
+        175.292,
+        143.108,
+        154.593
+      ],
+      "median_p50_ms": 154.593,
+      "gain_vs_baseline_pct": -37.3,
+      "verdict": "DISCARD"
+    },
+    "h13": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 175.256,
+      "screen_cv": 2.8835132606016343,
+      "full_p50s_ms": [
+        146.716,
+        163.783,
+        154.105
+      ],
+      "median_p50_ms": 154.105,
+      "gain_vs_baseline_pct": -36.86,
+      "verdict": "DISCARD"
+    }
+  },
+  "baseline_p50_ms": 112.599,
+  "best_p50_ms": null,
+  "best_hypothesis": null,
+  "best_gain_pct": null,
+  "errors": [],
+  "baseline_opset": 17
+}
diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html
new file mode 100644
index 000000000..0d6b9c240
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html
@@ -0,0 +1,268 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>CPU Optimization Report — microsoft/rad-dino</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>CPU Optimization Report — microsoft/rad-dino</h1>
+  <div class="subtitle">dinov2 arch · 2026-06-18 · 0 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">— → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">CPU / CPU</div>
+      <div class="kpi-sub">Baseline opset —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">0</div>
+      <div class="kpi-sub">0 KEEP / 0 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>microsoft/rad-dino</td></tr><tr><th>Task</th><td>image-feature-extraction</td></tr><tr><th>Arch type</th><td>dinov2</td></tr><tr><th>Baseline opset</th><td>—</td></tr><tr><th>EP</th><td>cpu</td></tr><tr><th>Device</th><td>cpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 74" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="48" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="48" class="tick-line" /><text x="150.0" y="68" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="48" class="tick-line" /><text x="280.0" y="68" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="48" class="tick-line" /><text x="410.0" y="68" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="48" class="tick-line" /><text x="540.0" y="68" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="48" class="tick-line" /><text x="670.0" y="68" text-anchor="middle" class="tick-label">200%</text></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json
new file mode 100644
index 000000000..d4a47523c
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json
@@ -0,0 +1,16 @@
+{
+  "model_id": "microsoft/rad-dino",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-18T16:53:15",
+  "ep": "cpu",
+  "device": "cpu",
+  "hypotheses": {},
+  "baseline_p50_ms": null,
+  "best_p50_ms": null,
+  "best_hypothesis": null,
+  "best_gain_pct": null,
+  "errors": [
+    "base config generation failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html
new file mode 100644
index 000000000..54658bb64
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html
@@ -0,0 +1,616 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>CPU Optimization Report — microsoft/resnet-18</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>CPU Optimization Report — microsoft/resnet-18</h1>
+  <div class="subtitle">resnet arch · 2026-06-18 · 14 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+92.5%</div>
+      <div class="kpi-sub">Champion: h9</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">237.47 ms → 17.80 ms</div>
+      <div class="kpi-sub">Latency reduction: 219.68 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">CPU / CPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h9</div>
+      <div class="kpi-sub">opset 17 + matmul_transpose_fusion</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">14</div>
+      <div class="kpi-sub">6 KEEP / 1 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>microsoft/resnet-18</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>resnet</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>cpu</td></tr><tr><th>Device</th><td>cpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 634" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="608" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="608" class="tick-line" /><text x="150.0" y="628" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="608" class="tick-line" /><text x="280.0" y="628" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="608" class="tick-line" /><text x="410.0" y="628" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="608" class="tick-line" /><text x="540.0" y="628" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="608" class="tick-line" /><text x="670.0" y="628" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (opset 17, autoconf defaults)
+status=OK  verdict=BASELINE
+p50=237.47 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (opset 17, autoc…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=244.96 ms  gain=-3.1%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="405.9" y="96.0" width="4.1" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.9" y="112.0" text-anchor="end" class="value-text">-3.1%</text></g><g><title>h2: opset 19 (cpu-001 risk — transformer test)
+status=OK  verdict=MARGINAL
+p50=231.69 ms  gain=+2.4%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19 (cpu-001 risk — …</text><rect x="410.0" y="136.0" width="3.2" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="421.2" y="152.0" text-anchor="start" class="value-text">+2.4%</text></g><g><title>h3: opset 21 (cpu-001 risk — transformer test)
+status=OK  verdict=MARGINAL
+p50=226.69 ms  gain=+4.5%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (cpu-001 risk — …</text><rect x="410.0" y="176.0" width="5.9" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="423.9" y="192.0" text-anchor="start" class="value-text">+4.5%</text></g><g><title>h4: opset 17 + attention_fusion
+status=OK  verdict=MARGINAL
+p50=231.07 ms  gain=+2.7%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="410.0" y="216.0" width="3.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="421.5" y="232.0" text-anchor="start" class="value-text">+2.7%</text></g><g><title>h5: opset 17 + skip_layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=226.59 ms  gain=+4.6%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + skip_layer_nor…</text><rect x="410.0" y="256.0" width="6.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="424.0" y="272.0" text-anchor="start" class="value-text">+4.6%</text></g><g><title>h6: opset 17 + layer_norm_fusion
+status=OK  verdict=KEEP_CONFIRMED
+p50=212.70 ms  gain=+15.7%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="410.0" y="296.0" width="20.3" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="438.3" y="312.0" text-anchor="start" class="value-text">+15.7%</text></g><g><title>h7: opset 17 + bias_softmax_fusion
+status=OK  verdict=MARGINAL
+p50=227.78 ms  gain=+4.1%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="410.0" y="336.0" width="5.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="423.3" y="352.0" text-anchor="start" class="value-text">+4.1%</text></g><g><title>h8: opset 17 + matmul_add_fusion (cpu-002 guarded)
+status=SKIPPED_CPU002  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + matmul_add_fus…</text></g><g><title>h9: opset 17 + matmul_transpose_fusion
+status=OK  verdict=KEEP_CONFIRMED
+p50=17.80 ms  gain=+89.8%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="416.0" width="116.7" height="24" fill="#43a047" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="534.7" y="432.0" text-anchor="start" class="value-text">+89.8%</text></g><g><title>h10: opset 17 + attention + skip_layer_norm + layer_norm
+status=OK  verdict=KEEP_CONFIRMED
+p50=20.09 ms  gain=+91.5%</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + attention + sk…</text><rect x="410.0" y="456.0" width="119.0" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="537.0" y="472.0" text-anchor="start" class="value-text">+91.5%</text></g><g><title>h11: opset 17 + nchwc_transformer (Conv-heavy models)
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=40.87 ms  gain=+84.5%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + nchwc_transfor…</text><rect x="410.0" y="496.0" width="109.8" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="527.8" y="512.0" text-anchor="start" class="value-text">+84.5%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=KEEP_CONFIRMED
+p50=36.91 ms  gain=+84.5%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="410.0" y="536.0" width="109.8" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="527.8" y="552.0" text-anchor="start" class="value-text">+84.5%</text></g><g><title>h13: opset 17 + gelu_fusion explicit
+status=OK  verdict=KEEP_CONFIRMED
+p50=26.39 ms  gain=+88.9%</title><rect x="0" y="568.0" width="748" height="40" class="row-bg" /><text x="8" y="584.0" class="hyp-label">h13</text><text x="8" y="597.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="576.0" width="115.6" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="533.6" y="592.0" text-anchor="start" class="value-text">+88.9%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (opset 17, autoconf defaults)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">237.47 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[237.47 · 230.21 · 238.44]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">244.96 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[221.85 · 252.22 · 244.96]</span></td>
+          <td><span class="gain-neg">-3.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">231.69 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[209.29 · 231.69 · 238.07]</span></td>
+          <td><span class="gain-pos">+2.4%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (cpu-001 risk — transformer test)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">226.69 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[218.89 · 226.69 · 230.42]</span></td>
+          <td><span class="gain-pos">+4.5%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">231.07 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[209.43 · 231.07 · 235.40]</span></td>
+          <td><span class="gain-pos">+2.7%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">skip_layer_norm_fusion</span></td>
+          <td class="p50-cell">226.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[207.97 · 226.59 · 227.74]</span></td>
+          <td><span class="gain-pos">+4.6%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">212.70 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[200.31 · 212.70 · 215.09 · 40.37 · 24.96]</span></td>
+          <td><span class="gain-pos">+15.7%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">227.78 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[222.57 · 245.29 · 227.78]</span></td>
+          <td><span class="gain-pos">+4.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">SKIPPED_CPU002</span></td>
+          <td class="conf-cell">guarded skip</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h9</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">17.80 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[24.22 · 11.52 · 17.80 · 186.46 · 197.36]</span></td>
+          <td><span class="gain-pos">+89.8%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + attention + skip_layer_norm + layer_norm</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span>, <span class="flag-pill">skip_layer_norm_fusion</span>, <span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">20.09 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[20.09 · 14.91 · 43.27 · 18.86 · 39.40]</span></td>
+          <td><span class="gain-pos">+91.5%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + nchwc_transformer (Conv-heavy models)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">nchwc_transformer</span></td>
+          <td class="p50-cell">40.87 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[30.78 · 40.87 · 230.91 · 36.88 · 27.17]</span></td>
+          <td><span class="gain-pos">+84.5%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">4/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">36.91 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[36.91 · 21.65 · 40.88 · 26.59 · 38.94]</span></td>
+          <td><span class="gain-pos">+84.5%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h13</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">26.39 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[219.22 · 26.39 · 20.94 · 18.75 · 215.34]</span></td>
+          <td><span class="gain-pos">+88.9%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">✅ Effective Optimizations</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-pos">+15.7%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+89.8%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + attention + skip_layer_norm + layer_norm</td>
+              <td class="gain-pos">+91.5%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + nchwc_transformer (Conv-heavy models)</td>
+              <td class="gain-pos">+84.5%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>4/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-pos">+84.5%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h13</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+88.9%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-3.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (opset 17, autoconf defaults)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19 (cpu-001 risk — transformer test)</td>
+              <td class="gain-pos">+2.4%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (cpu-001 risk — transformer test)</td>
+              <td class="gain-pos">+4.5%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">+2.7%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-pos">+4.6%</td>
+              <td>MARGINAL</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">+4.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + matmul_add_fusion (cpu-002 guarded)</td>
+              <td class="gain-pos">—</td>
+              <td>SKIPPED_CPU002</td>
+              <td>guarded skip</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json
new file mode 100644
index 000000000..9be730ae4
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json
@@ -0,0 +1,339 @@
+{
+  "model_id": "microsoft/resnet-18",
+  "task": "image-classification",
+  "model_type": "resnet",
+  "timestamp": "2026-06-18T11:14:10",
+  "ep": "cpu",
+  "device": "cpu",
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline (opset 17, autoconf defaults)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 231.091,
+      "screen_cv": 0.634823511084378,
+      "full_p50s_ms": [
+        237.472,
+        230.213,
+        238.44
+      ],
+      "median_p50_ms": 237.472,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 236.135,
+      "screen_cv": 0.6215427615558896,
+      "full_p50s_ms": [
+        221.852,
+        252.225,
+        244.959
+      ],
+      "median_p50_ms": 244.959,
+      "gain_vs_baseline_pct": -3.15,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19 (cpu-001 risk — transformer test)",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 228.935,
+      "screen_cv": 0.6700941315220477,
+      "full_p50s_ms": [
+        209.29,
+        231.693,
+        238.073
+      ],
+      "median_p50_ms": 231.693,
+      "gain_vs_baseline_pct": 2.43,
+      "verdict": "MARGINAL"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (cpu-001 risk — transformer test)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 222.347,
+      "screen_cv": 0.6050137847598573,
+      "full_p50s_ms": [
+        218.891,
+        226.688,
+        230.417
+      ],
+      "median_p50_ms": 226.688,
+      "gain_vs_baseline_pct": 4.54,
+      "verdict": "MARGINAL"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 229.793,
+      "screen_cv": 0.6810172633631137,
+      "full_p50s_ms": [
+        209.431,
+        231.069,
+        235.402
+      ],
+      "median_p50_ms": 231.069,
+      "gain_vs_baseline_pct": 2.7,
+      "verdict": "MARGINAL"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "skip_layer_norm_fusion": true
+      },
+      "screen_p50_ms": 188.605,
+      "screen_cv": 0.8141088518331964,
+      "full_p50s_ms": [
+        207.967,
+        226.586,
+        227.739
+      ],
+      "median_p50_ms": 226.586,
+      "gain_vs_baseline_pct": 4.58,
+      "verdict": "MARGINAL"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 206.291,
+      "screen_cv": 0.6984017722537581,
+      "full_p50s_ms": [
+        200.308,
+        212.704,
+        215.094
+      ],
+      "median_p50_ms": 212.704,
+      "gain_vs_baseline_pct": 10.43,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        40.366,
+        24.962
+      ],
+      "all_p50s_ms": [
+        200.308,
+        212.704,
+        215.094,
+        40.366,
+        24.962
+      ],
+      "overall_median_p50_ms": 200.308,
+      "overall_gain_pct": 15.65,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 176.804,
+      "screen_cv": 0.8944367774484739,
+      "full_p50s_ms": [
+        222.575,
+        245.29,
+        227.782
+      ],
+      "median_p50_ms": 227.782,
+      "gain_vs_baseline_pct": 4.08,
+      "verdict": "MARGINAL"
+    },
+    "h8": {
+      "status": "SKIPPED_CPU002",
+      "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)",
+      "opset": 17,
+      "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 15.5,
+      "screen_cv": 0.6570967741935484,
+      "full_p50s_ms": [
+        24.223,
+        11.524,
+        17.797
+      ],
+      "median_p50_ms": 17.797,
+      "gain_vs_baseline_pct": 92.51,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        186.462,
+        197.357
+      ],
+      "all_p50s_ms": [
+        24.223,
+        11.524,
+        17.797,
+        186.462,
+        197.357
+      ],
+      "overall_median_p50_ms": 24.223,
+      "overall_gain_pct": 89.8,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h10": {
+      "status": "OK",
+      "label": "opset 17 + attention + skip_layer_norm + layer_norm",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 24.828,
+      "screen_cv": 0.5822458514580312,
+      "full_p50s_ms": [
+        20.086,
+        14.909,
+        43.266
+      ],
+      "median_p50_ms": 20.086,
+      "gain_vs_baseline_pct": 91.54,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        18.859,
+        39.401
+      ],
+      "all_p50s_ms": [
+        20.086,
+        14.909,
+        43.266,
+        18.859,
+        39.401
+      ],
+      "overall_median_p50_ms": 20.086,
+      "overall_gain_pct": 91.54,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + nchwc_transformer (Conv-heavy models)",
+      "opset": 17,
+      "extra_optim": {
+        "nchwc_transformer": true
+      },
+      "screen_p50_ms": 14.073,
+      "screen_cv": 0.8100618205073545,
+      "full_p50s_ms": [
+        30.776,
+        40.872,
+        230.911
+      ],
+      "median_p50_ms": 40.872,
+      "gain_vs_baseline_pct": 82.79,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        36.88,
+        27.171
+      ],
+      "all_p50s_ms": [
+        30.776,
+        40.872,
+        230.911,
+        36.88,
+        27.171
+      ],
+      "overall_median_p50_ms": 36.88,
+      "overall_gain_pct": 84.47,
+      "sessions_above_threshold": 4,
+      "total_sessions": 5
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 10.858,
+      "screen_cv": 0.7146804199668446,
+      "full_p50s_ms": [
+        36.911,
+        21.651,
+        40.879
+      ],
+      "median_p50_ms": 36.911,
+      "gain_vs_baseline_pct": 84.46,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        26.592,
+        38.939
+      ],
+      "all_p50s_ms": [
+        36.911,
+        21.651,
+        40.879,
+        26.592,
+        38.939
+      ],
+      "overall_median_p50_ms": 36.911,
+      "overall_gain_pct": 84.46,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h13": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 183.865,
+      "screen_cv": 0.9105920104424441,
+      "full_p50s_ms": [
+        219.217,
+        26.395,
+        20.936
+      ],
+      "median_p50_ms": 26.395,
+      "gain_vs_baseline_pct": 88.89,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        18.747,
+        215.344
+      ],
+      "all_p50s_ms": [
+        219.217,
+        26.395,
+        20.936,
+        18.747,
+        215.344
+      ],
+      "overall_median_p50_ms": 26.395,
+      "overall_gain_pct": 88.89,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    }
+  },
+  "baseline_p50_ms": 237.472,
+  "best_p50_ms": 17.797,
+  "best_hypothesis": "h9",
+  "best_gain_pct": 92.51,
+  "errors": [],
+  "baseline_opset": 17
+}
diff --git a/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
new file mode 100644
index 000000000..801d03710
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
@@ -0,0 +1,268 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>CPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>CPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</h1>
+  <div class="subtitle">bert arch · 2026-06-18 · 0 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">— → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">CPU / CPU</div>
+      <div class="kpi-sub">Baseline opset —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">0</div>
+      <div class="kpi-sub">0 KEEP / 0 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>sentence-transformers/all-MiniLM-L6-v2</td></tr><tr><th>Task</th><td>sentence-similarity</td></tr><tr><th>Arch type</th><td>bert</td></tr><tr><th>Baseline opset</th><td>—</td></tr><tr><th>EP</th><td>cpu</td></tr><tr><th>Device</th><td>cpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 74" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="48" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="48" class="tick-line" /><text x="150.0" y="68" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="48" class="tick-line" /><text x="280.0" y="68" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="48" class="tick-line" /><text x="410.0" y="68" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="48" class="tick-line" /><text x="540.0" y="68" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="48" class="tick-line" /><text x="670.0" y="68" text-anchor="middle" class="tick-label">200%</text></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions; their median is used for comparison
+      </div>
+    </section>
+
+
+
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
new file mode 100644
index 000000000..a174931d9
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
@@ -0,0 +1,16 @@
+{
+  "model_id": "sentence-transformers/all-MiniLM-L6-v2",
+  "task": "sentence-similarity",
+  "model_type": "bert",
+  "timestamp": "2026-06-18T16:52:49",
+  "ep": "cpu",
+  "device": "cpu",
+  "hypotheses": {},
+  "baseline_p50_ms": null,
+  "best_p50_ms": null,
+  "best_hypothesis": null,
+  "best_gain_pct": null,
+  "errors": [
+    "base config generation failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/.gitignore b/research/autoconfig/catalog-gpu-sweep/.gitignore
new file mode 100644
index 000000000..b3b91d38b
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/.gitignore
@@ -0,0 +1,10 @@
+# Hypothesis build artifacts (large binary files)
+h*/
+_tmp_config/
+# Raw perf session files
+full_perf_s*.json
+screen_perf.json
+confirm_s*.json
+# Model weight files
+*.data
+*.onnx
diff --git a/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html
new file mode 100644
index 000000000..c17fca4e0
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — BAAI/bge-small-en-v1.5</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — BAAI/bge-small-en-v1.5</h1>
+  <div class="subtitle">bert arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">52.63 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">1 BASELINE / 0 KEEP / 3 MARGINAL / 4 DISCARD / 5 FAIL</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>BAAI/bge-small-en-v1.5</td></tr><tr><th>Task</th><td>sentence-similarity</td></tr><tr><th>Arch type</th><td>bert</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=52.63 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=53.34 ms  gain=-1.4%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="408.2" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.2" y="112.0" text-anchor="end" class="value-text">-1.4%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=53.36 ms  gain=-1.4%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="408.2" y="136.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.2" y="152.0" text-anchor="end" class="value-text">-1.4%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=MARGINAL
+p50=52.54 ms  gain=+0.2%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="410.0" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.2" y="192.0" text-anchor="start" class="value-text">+0.2%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=DISCARD
+p50=52.81 ms  gain=-0.3%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="409.5" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.5" y="232.0" text-anchor="end" class="value-text">-0.3%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=MARGINAL
+p50=52.57 ms  gain=+0.1%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="410.0" y="256.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.1" y="272.0" text-anchor="start" class="value-text">+0.1%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=DISCARD
+p50=52.70 ms  gain=-0.1%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="409.8" y="296.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.8" y="312.0" text-anchor="end" class="value-text">-0.1%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=52.62 ms  gain=+0.0%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="410.0" y="336.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.0" y="352.0" text-anchor="start" class="value-text">+0.0%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="364.0" y="416.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="432.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text><rect x="364.0" y="456.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="472.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="364.0" y="496.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="512.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="364.0" y="536.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="552.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">52.63 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[57.21 · 52.63 · 51.96]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">53.34 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[53.69 · 52.32 · 53.34]</span></td>
+          <td><span class="gain-neg">-1.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">53.36 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.84 · 53.36 · 53.40]</span></td>
+          <td><span class="gain-neg">-1.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">52.54 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.19 · 53.58 · 52.54]</span></td>
+          <td><span class="gain-pos">+0.2%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">52.81 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.81 · 53.63 · 52.17]</span></td>
+          <td><span class="gain-neg">-0.3%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">52.57 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.99 · 52.27 · 52.57]</span></td>
+          <td><span class="gain-pos">+0.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">52.70 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.83 · 53.41 · 52.70]</span></td>
+          <td><span class="gain-neg">-0.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">52.62 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[54.30 · 52.52 · 52.62]</span></td>
+          <td><span class="gain-pos">+0.0%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions; their median is used for comparison
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-1.4%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-1.4%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-pos">+0.2%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-0.3%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">+0.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-neg">-0.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json
new file mode 100644
index 000000000..ad8809324
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json
@@ -0,0 +1,187 @@
+{
+  "model_id": "BAAI/bge-small-en-v1.5",
+  "task": "sentence-similarity",
+  "model_type": "bert",
+  "timestamp": "2026-06-18T00:17:47",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 54.32,
+      "screen_cv": 0.7272091310751105,
+      "full_p50s_ms": [
+        57.207,
+        52.628,
+        51.964
+      ],
+      "median_p50_ms": 52.628,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 54.548,
+      "screen_cv": 0.2691757717973161,
+      "full_p50s_ms": [
+        53.686,
+        52.321,
+        53.336
+      ],
+      "median_p50_ms": 53.336,
+      "gain_vs_baseline_pct": -1.35,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 53.712,
+      "screen_cv": 0.11630548108430146,
+      "full_p50s_ms": [
+        52.844,
+        53.359,
+        53.4
+      ],
+      "median_p50_ms": 53.359,
+      "gain_vs_baseline_pct": -1.39,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 53.406,
+      "screen_cv": 0.1399842714301764,
+      "full_p50s_ms": [
+        52.192,
+        53.582,
+        52.542
+      ],
+      "median_p50_ms": 52.542,
+      "gain_vs_baseline_pct": 0.16,
+      "verdict": "MARGINAL"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 52.792,
+      "screen_cv": 0.18718745264434003,
+      "full_p50s_ms": [
+        52.812,
+        53.63,
+        52.173
+      ],
+      "median_p50_ms": 52.812,
+      "gain_vs_baseline_pct": -0.35,
+      "verdict": "DISCARD"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 52.42,
+      "screen_cv": 0.1541205646699733,
+      "full_p50s_ms": [
+        52.991,
+        52.271,
+        52.571
+      ],
+      "median_p50_ms": 52.571,
+      "gain_vs_baseline_pct": 0.11,
+      "verdict": "MARGINAL"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 52.712,
+      "screen_cv": 0.15228031567764452,
+      "full_p50s_ms": [
+        51.826,
+        53.412,
+        52.698
+      ],
+      "median_p50_ms": 52.698,
+      "gain_vs_baseline_pct": -0.13,
+      "verdict": "DISCARD"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 58.252,
+      "screen_cv": 0.19027672869601042,
+      "full_p50s_ms": [
+        54.301,
+        52.525,
+        52.622
+      ],
+      "median_p50_ms": 52.622,
+      "gain_vs_baseline_pct": 0.01,
+      "verdict": "MARGINAL"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "BUILD_FAIL",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "build_error": "AI--bge-small-en-v1.5\\h9\\export.onnx\n(127.5 MB)\n[06/18/26 00:59:02] ERROR    ✗ ort_graph failed: [Errno 28] No space left on   \n                             device                                            \n⏳ Optimize  Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h10": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose",
+      "opset": 17,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h11": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h12": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 52.628,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": 0.16,
+  "feature_gaps": [],
+  "errors": [
+    "h8: screen bench failed",
+    "h9: BUILD_FAIL",
+    "h10: BUILD_FAIL",
+    "h11: BUILD_FAIL",
+    "h12: BUILD_FAIL"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html
new file mode 100644
index 000000000..6e4b98e27
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — apple/mobilevit-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — apple/mobilevit-small</h1>
+  <div class="subtitle">mobilevit arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">17.98 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">0 KEEP / 5 DISCARD / 5 MARGINAL / 2 BENCH_FAIL</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>apple/mobilevit-small</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>mobilevit</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=17.98 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=MARGINAL
+p50=17.73 ms  gain=+1.4%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.9" y="112.0" text-anchor="start" class="value-text">+1.4%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=18.28 ms  gain=-1.6%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="407.9" y="136.0" width="2.1" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="399.9" y="152.0" text-anchor="end" class="value-text">-1.6%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=DISCARD
+p50=18.60 ms  gain=-3.4%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="405.6" y="176.0" width="4.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.6" y="192.0" text-anchor="end" class="value-text">-3.4%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=MARGINAL
+p50=17.74 ms  gain=+1.4%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.8" y="232.0" text-anchor="start" class="value-text">+1.4%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=DISCARD
+p50=18.14 ms  gain=-0.9%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="408.0" y="256.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.0" y="272.0" text-anchor="end" class="value-text">-0.9%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=MARGINAL
+p50=17.67 ms  gain=+1.8%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="410.0" y="296.0" width="2.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="420.3" y="312.0" text-anchor="start" class="value-text">+1.8%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=17.83 ms  gain=+0.9%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="410.0" y="336.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.1" y="352.0" text-anchor="start" class="value-text">+0.9%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=DISCARD
+p50=19.22 ms  gain=-6.9%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="401.0" y="416.0" width="9.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="393.0" y="432.0" text-anchor="end" class="value-text">-6.9%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=MARGINAL
+p50=17.79 ms  gain=+1.1%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.4" y="512.0" text-anchor="start" class="value-text">+1.1%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=DISCARD
+p50=18.57 ms  gain=-3.2%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="405.8" y="536.0" width="4.2" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.8" y="552.0" text-anchor="end" class="value-text">-3.2%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">17.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.20 · 17.98 · 17.77]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">17.73 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.66 · 17.73 · 17.56]</span></td>
+          <td><span class="gain-pos">+1.4%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">18.28 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.16 · 18.38 · 18.28]</span></td>
+          <td><span class="gain-neg">-1.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">18.60 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.19 · 18.85 · 18.60]</span></td>
+          <td><span class="gain-neg">-3.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">17.74 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[17.74 · 17.61 · 18.28]</span></td>
+          <td><span class="gain-pos">+1.4%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">18.14 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[20.60 · 18.14 · 17.86]</span></td>
+          <td><span class="gain-neg">-0.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">17.67 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.07 · 17.67 · 17.66]</span></td>
+          <td><span class="gain-pos">+1.8%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">17.83 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[17.66 · 17.83 · 20.32]</span></td>
+          <td><span class="gain-pos">+0.9%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">19.22 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[18.16 · 19.24 · 19.22]</span></td>
+          <td><span class="gain-neg">-6.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">17.79 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[17.68 · 18.41 · 17.79]</span></td>
+          <td><span class="gain-pos">+1.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">18.57 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[17.71 · 18.88 · 18.57]</span></td>
+          <td><span class="gain-neg">-3.2%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-neg">-3.4%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-neg">-6.9%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-neg">-3.2%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Bench Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td class="">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+1.4%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-1.6%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+1.4%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-neg">-0.9%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">+1.8%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-pos">+0.9%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+1.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json
new file mode 100644
index 000000000..0d75e8b9d
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json
@@ -0,0 +1,219 @@
+{
+  "model_id": "apple/mobilevit-small",
+  "task": "image-classification",
+  "model_type": "mobilevit",
+  "timestamp": "2026-06-18T01:40:29",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 21.759,
+      "screen_cv": 0.17119352911438945,
+      "full_p50s_ms": [
+        18.204,
+        17.985,
+        17.773
+      ],
+      "median_p50_ms": 17.985,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 18.003,
+      "screen_cv": 0.17624840304393713,
+      "full_p50s_ms": [
+        18.657,
+        17.727,
+        17.557
+      ],
+      "median_p50_ms": 17.727,
+      "gain_vs_baseline_pct": 1.43,
+      "verdict": "MARGINAL"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 18.53,
+      "screen_cv": 0.15267134376686453,
+      "full_p50s_ms": [
+        18.162,
+        18.381,
+        18.281
+      ],
+      "median_p50_ms": 18.281,
+      "gain_vs_baseline_pct": -1.65,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 18.209,
+      "screen_cv": 0.2452633313196771,
+      "full_p50s_ms": [
+        18.188,
+        18.851,
+        18.6
+      ],
+      "median_p50_ms": 18.6,
+      "gain_vs_baseline_pct": -3.42,
+      "verdict": "DISCARD"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 17.775,
+      "screen_cv": 0.15651195499296766,
+      "full_p50s_ms": [
+        17.74,
+        17.609,
+        18.28
+      ],
+      "median_p50_ms": 17.74,
+      "gain_vs_baseline_pct": 1.36,
+      "verdict": "MARGINAL"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 17.942,
+      "screen_cv": 0.3691896109686768,
+      "full_p50s_ms": [
+        20.597,
+        18.141,
+        17.859
+      ],
+      "median_p50_ms": 18.141,
+      "gain_vs_baseline_pct": -0.87,
+      "verdict": "DISCARD"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 20.757,
+      "screen_cv": 0.15112973936503346,
+      "full_p50s_ms": [
+        18.068,
+        17.671,
+        17.662
+      ],
+      "median_p50_ms": 17.671,
+      "gain_vs_baseline_pct": 1.75,
+      "verdict": "MARGINAL"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 17.683,
+      "screen_cv": 0.21947633320137985,
+      "full_p50s_ms": [
+        17.655,
+        17.827,
+        20.316
+      ],
+      "median_p50_ms": 17.827,
+      "gain_vs_baseline_pct": 0.88,
+      "verdict": "MARGINAL"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 18.238,
+      "screen_cv": 0.10938699418795922,
+      "full_p50s_ms": [
+        18.161,
+        19.242,
+        19.224
+      ],
+      "median_p50_ms": 19.224,
+      "gain_vs_baseline_pct": -6.89,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 17.604,
+      "screen_cv": 0.1246875710065894,
+      "full_p50s_ms": [
+        17.678,
+        18.414,
+        17.788
+      ],
+      "median_p50_ms": 17.788,
+      "gain_vs_baseline_pct": 1.1,
+      "verdict": "MARGINAL"
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 17.827,
+      "screen_cv": 0.200201940876199,
+      "full_p50s_ms": [
+        17.706,
+        18.881,
+        18.57
+      ],
+      "median_p50_ms": 18.57,
+      "gain_vs_baseline_pct": -3.25,
+      "verdict": "DISCARD"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 17.985,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": -3.42,
+  "feature_gaps": [],
+  "errors": [
+    "h8: screen bench failed",
+    "h10: screen bench failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html
new file mode 100644
index 000000000..abb7d9fcd
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — deepset/roberta-base-squad2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — deepset/roberta-base-squad2</h1>
+  <div class="subtitle">roberta arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">99.53 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">0 KEEP / 2 DISCARD / 7 BUILD_FAIL</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>deepset/roberta-base-squad2</td></tr><tr><th>Task</th><td>question-answering</td></tr><tr><th>Arch type</th><td>roberta</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=99.53 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=MARGINAL
+p50=98.59 ms  gain=+0.9%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.2" y="112.0" text-anchor="start" class="value-text">+0.9%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=100.33 ms  gain=-0.8%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="409.0" y="136.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.0" y="152.0" text-anchor="end" class="value-text">-0.8%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=DISCARD
+p50=100.67 ms  gain=-1.1%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="408.5" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.5" y="192.0" text-anchor="end" class="value-text">-1.1%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=MARGINAL
+p50=98.44 ms  gain=+1.1%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.4" y="232.0" text-anchor="start" class="value-text">+1.1%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=MARGINAL
+p50=98.59 ms  gain=+0.9%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="410.0" y="256.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.2" y="272.0" text-anchor="start" class="value-text">+0.9%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="364.0" y="296.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="312.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="364.0" y="336.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="352.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text><rect x="364.0" y="376.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="392.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="364.0" y="416.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="432.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text><rect x="364.0" y="456.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="472.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="364.0" y="496.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="512.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="364.0" y="536.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="552.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">99.53 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[99.95 · 97.75 · 99.53]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">98.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[99.11 · 98.16 · 98.59]</span></td>
+          <td><span class="gain-pos">+0.9%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">100.33 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[100.33 · 99.66 · 101.42]</span></td>
+          <td><span class="gain-neg">-0.8%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">100.67 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[100.42 · 100.67 · 100.98]</span></td>
+          <td><span class="gain-neg">-1.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">98.44 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[98.04 · 99.49 · 98.44]</span></td>
+          <td><span class="gain-pos">+1.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">98.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[98.59 · 98.91 · 98.56]</span></td>
+          <td><span class="gain-pos">+0.9%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Build Failures (not benchmarked)</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Baseline / Marginal / Discarded</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td>+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+0.9%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-0.8%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-neg">-1.1%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+1.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">+0.9%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json
new file mode 100644
index 000000000..773322212
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json
@@ -0,0 +1,167 @@
+{
+  "model_id": "deepset/roberta-base-squad2",
+  "task": "question-answering",
+  "model_type": "roberta",
+  "timestamp": "2026-06-18T02:23:50",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 98.117,
+      "screen_cv": 0.15088109094244626,
+      "full_p50s_ms": [
+        99.948,
+        97.755,
+        99.535
+      ],
+      "median_p50_ms": 99.535,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 98.107,
+      "screen_cv": 0.14872537127829819,
+      "full_p50s_ms": [
+        99.112,
+        98.16,
+        98.593
+      ],
+      "median_p50_ms": 98.593,
+      "gain_vs_baseline_pct": 0.95,
+      "verdict": "MARGINAL"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 107.597,
+      "screen_cv": 0.21958790672602396,
+      "full_p50s_ms": [
+        100.327,
+        99.658,
+        101.422
+      ],
+      "median_p50_ms": 100.327,
+      "gain_vs_baseline_pct": -0.8,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 100.15,
+      "screen_cv": 0.16429355966050924,
+      "full_p50s_ms": [
+        100.42,
+        100.667,
+        100.984
+      ],
+      "median_p50_ms": 100.667,
+      "gain_vs_baseline_pct": -1.14,
+      "verdict": "DISCARD"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 97.954,
+      "screen_cv": 0.14972333952671663,
+      "full_p50s_ms": [
+        98.044,
+        99.494,
+        98.442
+      ],
+      "median_p50_ms": 98.442,
+      "gain_vs_baseline_pct": 1.1,
+      "verdict": "MARGINAL"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 102.402,
+      "screen_cv": 0.22213433331380247,
+      "full_p50s_ms": [
+        98.593,
+        98.912,
+        98.564
+      ],
+      "median_p50_ms": 98.593,
+      "gain_vs_baseline_pct": 0.95,
+      "verdict": "MARGINAL"
+    },
+    "h6": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "build_error": "ion_mask     [1, 512]       int32\n   Output:       start_logits\n                 end_logits\n   📦 Artifact:   \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h6\\export\n.onnx  (474.9 MB)\n⏳ Optimize  Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h7": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "build_error": "put:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h7\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h8": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "build_error": "put:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h8\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h9": {
+      "status": "BUILD_FAIL",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "build_error": "put:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h9\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h10": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose",
+      "opset": 17,
+      "build_error": "ut:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h10\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h11": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "build_error": "ut:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h11\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h12": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "build_error": "ut:    \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h12\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 99.535,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": -1.14,
+  "feature_gaps": [],
+  "errors": [
+    "h6: BUILD_FAIL",
+    "h7: BUILD_FAIL",
+    "h8: BUILD_FAIL",
+    "h9: BUILD_FAIL",
+    "h10: BUILD_FAIL",
+    "h11: BUILD_FAIL",
+    "h12: BUILD_FAIL"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html
new file mode 100644
index 000000000..bf073f126
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — deepset/tinyroberta-squad2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — deepset/tinyroberta-squad2</h1>
+  <div class="subtitle">roberta arch · 2026-06-17 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">51.17 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">0 KEEP / 4 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>deepset/tinyroberta-squad2</td></tr><tr><th>Task</th><td>question-answering</td></tr><tr><th>Arch type</th><td>roberta</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=51.17 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=MARGINAL
+p50=51.14 ms  gain=+0.1%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.1" y="112.0" text-anchor="start" class="value-text">+0.1%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=52.25 ms  gain=-2.1%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="407.2" y="136.0" width="2.8" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="399.2" y="152.0" text-anchor="end" class="value-text">-2.1%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=DISCARD
+p50=52.54 ms  gain=-2.7%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="406.5" y="176.0" width="3.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.5" y="192.0" text-anchor="end" class="value-text">-2.7%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=MARGINAL
+p50=50.67 ms  gain=+1.0%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.3" y="232.0" text-anchor="start" class="value-text">+1.0%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=DISCARD
+p50=51.58 ms  gain=-0.8%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="409.0" y="256.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.0" y="272.0" text-anchor="end" class="value-text">-0.8%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=MARGINAL
+p50=51.06 ms  gain=+0.2%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="410.0" y="296.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.3" y="312.0" text-anchor="start" class="value-text">+0.2%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=50.63 ms  gain=+1.1%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="410.0" y="336.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.4" y="352.0" text-anchor="start" class="value-text">+1.1%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=DISCARD
+p50=52.58 ms  gain=-2.8%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="406.4" y="416.0" width="3.6" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.4" y="432.0" text-anchor="end" class="value-text">-2.8%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=MARGINAL
+p50=50.50 ms  gain=+1.3%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.7" y="512.0" text-anchor="start" class="value-text">+1.3%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=MARGINAL
+p50=51.02 ms  gain=+0.3%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="410.0" y="536.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.4" y="552.0" text-anchor="start" class="value-text">+0.3%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">51.17 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.17 · 51.24 · 50.41]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">51.14 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.52 · 51.14 · 51.37]</span></td>
+          <td><span class="gain-pos">+0.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">52.25 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.06 · 53.10 · 52.25]</span></td>
+          <td><span class="gain-neg">-2.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">52.54 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.54 · 55.41 · 51.71]</span></td>
+          <td><span class="gain-neg">-2.7%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">50.67 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.39 · 50.67 · 51.56]</span></td>
+          <td><span class="gain-pos">+1.0%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">51.58 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.86 · 51.58 · 50.47]</span></td>
+          <td><span class="gain-neg">-0.8%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">51.06 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.06 · 50.97 · 52.08]</span></td>
+          <td><span class="gain-pos">+0.2%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">50.63 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.63 · 50.47 · 51.42]</span></td>
+          <td><span class="gain-pos">+1.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">52.58 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[52.58 · 51.76 · 57.06]</span></td>
+          <td><span class="gain-neg">-2.8%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">50.50 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.34 · 50.50 · 51.30]</span></td>
+          <td><span class="gain-pos">+1.3%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">51.02 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.02 · 52.29 · 50.83]</span></td>
+          <td><span class="gain-pos">+0.3%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-2.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-neg">-2.7%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-neg">-2.8%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td>+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+0.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+1.0%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-neg">-0.8%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">+0.2%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-pos">+1.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+1.3%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-pos">+0.3%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json
new file mode 100644
index 000000000..dfa47962b
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json
@@ -0,0 +1,219 @@
+{
+  "model_id": "deepset/tinyroberta-squad2",
+  "task": "question-answering",
+  "model_type": "roberta",
+  "timestamp": "2026-06-17T23:13:59",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 51.003,
+      "screen_cv": 0.1682450051957728,
+      "full_p50s_ms": [
+        51.171,
+        51.243,
+        50.412
+      ],
+      "median_p50_ms": 51.171,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 53.124,
+      "screen_cv": 0.19249303516301483,
+      "full_p50s_ms": [
+        50.523,
+        51.142,
+        51.373
+      ],
+      "median_p50_ms": 51.142,
+      "gain_vs_baseline_pct": 0.06,
+      "verdict": "MARGINAL"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 52.106,
+      "screen_cv": 0.15598971327678193,
+      "full_p50s_ms": [
+        51.063,
+        53.096,
+        52.254
+      ],
+      "median_p50_ms": 52.254,
+      "gain_vs_baseline_pct": -2.12,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 52.129,
+      "screen_cv": 0.20215235281705002,
+      "full_p50s_ms": [
+        52.541,
+        55.415,
+        51.708
+      ],
+      "median_p50_ms": 52.541,
+      "gain_vs_baseline_pct": -2.68,
+      "verdict": "DISCARD"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 50.249,
+      "screen_cv": 0.1313060956436944,
+      "full_p50s_ms": [
+        50.388,
+        50.669,
+        51.56
+      ],
+      "median_p50_ms": 50.669,
+      "gain_vs_baseline_pct": 0.98,
+      "verdict": "MARGINAL"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 52.692,
+      "screen_cv": 0.18913687087223865,
+      "full_p50s_ms": [
+        51.86,
+        51.58,
+        50.474
+      ],
+      "median_p50_ms": 51.58,
+      "gain_vs_baseline_pct": -0.8,
+      "verdict": "DISCARD"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 51.657,
+      "screen_cv": 0.15649379561337282,
+      "full_p50s_ms": [
+        51.058,
+        50.966,
+        52.08
+      ],
+      "median_p50_ms": 51.058,
+      "gain_vs_baseline_pct": 0.22,
+      "verdict": "MARGINAL"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 50.528,
+      "screen_cv": 0.14144632678910704,
+      "full_p50s_ms": [
+        50.635,
+        50.467,
+        51.424
+      ],
+      "median_p50_ms": 50.635,
+      "gain_vs_baseline_pct": 1.05,
+      "verdict": "MARGINAL"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 51.784,
+      "screen_cv": 0.1952340491271435,
+      "full_p50s_ms": [
+        52.576,
+        51.761,
+        57.061
+      ],
+      "median_p50_ms": 52.576,
+      "gain_vs_baseline_pct": -2.75,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 50.985,
+      "screen_cv": 0.13755025988035696,
+      "full_p50s_ms": [
+        50.344,
+        50.501,
+        51.304
+      ],
+      "median_p50_ms": 50.501,
+      "gain_vs_baseline_pct": 1.31,
+      "verdict": "MARGINAL"
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 50.361,
+      "screen_cv": 0.18784376799507557,
+      "full_p50s_ms": [
+        51.016,
+        52.289,
+        50.832
+      ],
+      "median_p50_ms": 51.016,
+      "gain_vs_baseline_pct": 0.3,
+      "verdict": "MARGINAL"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 51.171,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": -2.68,
+  "feature_gaps": [],
+  "errors": [
+    "h8: screen bench failed",
+    "h10: screen bench failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html
new file mode 100644
index 000000000..99b67bcf8
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — facebook/dinov2-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — facebook/dinov2-small</h1>
+  <div class="subtitle">dinov2 arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+16.7%</div>
+      <div class="kpi-sub">Champion: h12</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">26.37 ms → 21.98 ms</div>
+      <div class="kpi-sub">Latency reduction: 4.39 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h12</div>
+      <div class="kpi-sub">opset 17 + transpose_optimizer</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">5 KEEP_CONFIRMED · 2 MARGINAL_UNCONFIRMED · 0 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>facebook/dinov2-small</td></tr><tr><th>Task</th><td>image-feature-extraction</td></tr><tr><th>Arch type</th><td>dinov2</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=26.37 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=24.91 ms  gain=+6.9%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="9.0" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="427.0" y="112.0" text-anchor="start" class="value-text">+6.9%</text></g><g><title>h2: opset 19
+status=OK  verdict=MARGINAL
+p50=25.42 ms  gain=+3.6%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="410.0" y="136.0" width="4.7" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="422.7" y="152.0" text-anchor="start" class="value-text">+3.6%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=MARGINAL
+p50=26.05 ms  gain=+1.2%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="410.0" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.6" y="192.0" text-anchor="start" class="value-text">+1.2%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=KEEP_CONFIRMED
+p50=24.14 ms  gain=+9.4%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="216.0" width="12.2" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="430.2" y="232.0" text-anchor="start" class="value-text">+9.4%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=23.59 ms  gain=+11.3%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="410.0" y="256.0" width="14.7" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="432.7" y="272.0" text-anchor="start" class="value-text">+11.3%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=KEEP_CONFIRMED
+p50=24.69 ms  gain=+6.5%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="410.0" y="296.0" width="8.4" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="426.4" y="312.0" text-anchor="start" class="value-text">+6.5%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=25.30 ms  gain=+4.1%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="410.0" y="336.0" width="5.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="423.3" y="352.0" text-anchor="start" class="value-text">+4.1%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=KEEP_CONFIRMED
+p50=22.98 ms  gain=+12.3%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="410.0" y="416.0" width="16.0" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="434.0" y="432.0" text-anchor="start" class="value-text">+12.3%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=KEEP_CONFIRMED
+p50=22.72 ms  gain=+16.2%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="21.1" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="439.1" y="512.0" text-anchor="start" class="value-text">+16.2%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=KEEP_CONFIRMED
+p50=21.98 ms  gain=+16.7%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="410.0" y="536.0" width="21.7" height="24" fill="#43a047" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="439.7" y="552.0" text-anchor="start" class="value-text">+16.7%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">26.37 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[23.28 · 26.70 · 26.37]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">—</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">24.91 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[24.91 · 25.30 · 24.55 · 22.38 · 21.70]</span></td>
+          <td><span class="gain-pos">+6.9%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">4/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">25.42 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[26.47 · 24.46 · 25.42]</span></td>
+          <td><span class="gain-pos">+3.6%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">26.05 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[24.24 · 26.17 · 26.05]</span></td>
+          <td><span class="gain-pos">+1.2%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">24.14 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[24.14 · 24.63 · 23.45 · 21.29 · 23.90]</span></td>
+          <td><span class="gain-pos">+9.4%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">23.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[23.59 · 23.39 · 25.55 · 22.78 · 23.18]</span></td>
+          <td><span class="gain-pos">+11.3%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">4/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">24.69 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[24.69 · 24.81 · 24.67 · 21.98 · 22.55]</span></td>
+          <td><span class="gain-pos">+6.5%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">25.30 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[26.77 · 25.30 · 23.94]</span></td>
+          <td><span class="gain-pos">+4.1%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">22.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[22.98 · 22.44 · 23.15 · 23.13 · 23.92]</span></td>
+          <td><span class="gain-pos">+12.3%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">22.72 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[23.02 · 21.82 · 22.72 · 21.74 · 22.10]</span></td>
+          <td><span class="gain-pos">+16.2%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h12</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">21.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[22.32 · 21.71 · 21.98 · 21.67 · 23.43]</span></td>
+          <td><span class="gain-pos">+16.7%</span></td>
+          <td><span class="verdict-keep">KEEP_CONFIRMED</span></td>
+          <td class="conf-cell">5/5 sessions confirm</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
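+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions &nbsp;·&nbsp; Median p50 is the median of the first three sessions; for hypotheses with five sessions, Gain % is computed from the median of all five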
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">✅ Effective Optimizations</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+6.9%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>4/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+9.4%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">+11.3%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>4/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">+6.5%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-pos">+12.3%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+16.2%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-pos">+16.7%</td>
+              <td>KEEP_CONFIRMED</td>
+              <td>5/5 sessions confirm</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Bench Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td>+0.0%</td>
+              <td>BASELINE</td>
+              <td>—</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-pos">+3.6%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-pos">+1.2%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-pos">+4.1%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td>—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json
new file mode 100644
index 000000000..b3ada263f
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json
@@ -0,0 +1,324 @@
+{
+  "model_id": "facebook/dinov2-small",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-18T09:31:21",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 23.84,
+      "screen_cv": 0.22185402684563757,
+      "full_p50s_ms": [
+        23.282,
+        26.705,
+        26.372
+      ],
+      "median_p50_ms": 26.372,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 23.814,
+      "screen_cv": 0.2982279331485681,
+      "full_p50s_ms": [
+        24.915,
+        25.298,
+        24.546
+      ],
+      "median_p50_ms": 24.915,
+      "gain_vs_baseline_pct": 5.52,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        22.377,
+        21.697
+      ],
+      "all_p50s_ms": [
+        24.915,
+        25.298,
+        24.546,
+        22.377,
+        21.697
+      ],
+      "overall_median_p50_ms": 24.546,
+      "overall_gain_pct": 6.92,
+      "sessions_above_threshold": 4,
+      "total_sessions": 5
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 26.353,
+      "screen_cv": 0.22968921944370657,
+      "full_p50s_ms": [
+        26.467,
+        24.459,
+        25.421
+      ],
+      "median_p50_ms": 25.421,
+      "gain_vs_baseline_pct": 3.61,
+      "verdict": "MARGINAL"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 25.534,
+      "screen_cv": 0.25432756324900135,
+      "full_p50s_ms": [
+        24.236,
+        26.174,
+        26.051
+      ],
+      "median_p50_ms": 26.051,
+      "gain_vs_baseline_pct": 1.22,
+      "verdict": "MARGINAL"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 23.241,
+      "screen_cv": 0.19310700916483803,
+      "full_p50s_ms": [
+        24.144,
+        24.633,
+        23.453
+      ],
+      "median_p50_ms": 24.144,
+      "gain_vs_baseline_pct": 8.45,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        21.288,
+        23.896
+      ],
+      "all_p50s_ms": [
+        24.144,
+        24.633,
+        23.453,
+        21.288,
+        23.896
+      ],
+      "overall_median_p50_ms": 23.896,
+      "overall_gain_pct": 9.39,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 23.289,
+      "screen_cv": 0.17308600626905404,
+      "full_p50s_ms": [
+        23.589,
+        23.385,
+        25.548
+      ],
+      "median_p50_ms": 23.589,
+      "gain_vs_baseline_pct": 10.55,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        22.777,
+        23.185
+      ],
+      "all_p50s_ms": [
+        23.589,
+        23.385,
+        25.548,
+        22.777,
+        23.185
+      ],
+      "overall_median_p50_ms": 23.385,
+      "overall_gain_pct": 11.33,
+      "sessions_above_threshold": 4,
+      "total_sessions": 5
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 23.287,
+      "screen_cv": 0.22261347532958303,
+      "full_p50s_ms": [
+        24.686,
+        24.808,
+        24.666
+      ],
+      "median_p50_ms": 24.686,
+      "gain_vs_baseline_pct": 6.39,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        21.979,
+        22.546
+      ],
+      "all_p50s_ms": [
+        24.686,
+        24.808,
+        24.666,
+        21.979,
+        22.546
+      ],
+      "overall_median_p50_ms": 24.666,
+      "overall_gain_pct": 6.47,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 43.267,
+      "screen_cv": 0.5303580095684932,
+      "full_p50s_ms": [
+        26.767,
+        25.295,
+        23.936
+      ],
+      "median_p50_ms": 25.295,
+      "gain_vs_baseline_pct": 4.08,
+      "verdict": "MARGINAL"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 23.101,
+      "screen_cv": 0.1016839097874551,
+      "full_p50s_ms": [
+        22.982,
+        22.438,
+        23.149
+      ],
+      "median_p50_ms": 22.982,
+      "gain_vs_baseline_pct": 12.85,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        23.132,
+        23.917
+      ],
+      "all_p50s_ms": [
+        22.982,
+        22.438,
+        23.149,
+        23.132,
+        23.917
+      ],
+      "overall_median_p50_ms": 23.132,
+      "overall_gain_pct": 12.29,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h10": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 22.655,
+      "screen_cv": 0.15378503641580224,
+      "full_p50s_ms": [
+        23.022,
+        21.821,
+        22.718
+      ],
+      "median_p50_ms": 22.718,
+      "gain_vs_baseline_pct": 13.86,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        21.742,
+        22.096
+      ],
+      "all_p50s_ms": [
+        23.022,
+        21.821,
+        22.718,
+        21.742,
+        22.096
+      ],
+      "overall_median_p50_ms": 22.096,
+      "overall_gain_pct": 16.21,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 22.066,
+      "screen_cv": 0.2796156983594671,
+      "full_p50s_ms": [
+        22.323,
+        21.709,
+        21.977
+      ],
+      "median_p50_ms": 21.977,
+      "gain_vs_baseline_pct": 16.67,
+      "verdict": "KEEP_CONFIRMED",
+      "confirm_p50s_ms": [
+        21.667,
+        23.431
+      ],
+      "all_p50s_ms": [
+        22.323,
+        21.709,
+        21.977,
+        21.667,
+        23.431
+      ],
+      "overall_median_p50_ms": 21.977,
+      "overall_gain_pct": 16.67,
+      "sessions_above_threshold": 5,
+      "total_sessions": 5
+    }
+  },
+  "best_hypothesis": "h12",
+  "baseline_p50_ms": 26.372,
+  "best_p50_ms": 21.977,
+  "best_gain_pct": 16.67,
+  "opset21_gain_pct": 1.22,
+  "feature_gaps": [],
+  "errors": [
+    "h8: screen bench failed",
+    "h10: screen bench failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html
new file mode 100644
index 000000000..ca69b133a
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — microsoft/rad-dino</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — microsoft/rad-dino</h1>
+  <div class="subtitle">dinov2 arch · 2026-06-17 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">321.26 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">0 KEEP / 8 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>microsoft/rad-dino</td></tr><tr><th>Task</th><td>image-feature-extraction</td></tr><tr><th>Arch type</th><td>dinov2</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=321.26 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=338.66 ms  gain=-5.4%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="403.0" y="96.0" width="7.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="395.0" y="112.0" text-anchor="end" class="value-text">-5.4%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=331.58 ms  gain=-3.2%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="405.8" y="136.0" width="4.2" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.8" y="152.0" text-anchor="end" class="value-text">-3.2%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=DISCARD
+p50=329.70 ms  gain=-2.6%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="406.6" y="176.0" width="3.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.6" y="192.0" text-anchor="end" class="value-text">-2.6%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=DISCARD
+p50=324.80 ms  gain=-1.1%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="408.6" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.6" y="232.0" text-anchor="end" class="value-text">-1.1%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=DISCARD
+p50=329.01 ms  gain=-2.4%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="406.9" y="256.0" width="3.1" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.9" y="272.0" text-anchor="end" class="value-text">-2.4%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=DISCARD
+p50=329.61 ms  gain=-2.6%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="406.6" y="296.0" width="3.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.6" y="312.0" text-anchor="end" class="value-text">-2.6%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=327.27 ms  gain=-1.9%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="407.6" y="336.0" width="2.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="399.6" y="352.0" text-anchor="end" class="value-text">-1.9%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=DISCARD
+p50=324.74 ms  gain=-1.1%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="408.6" y="416.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.6" y="432.0" text-anchor="end" class="value-text">-1.1%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=MARGINAL
+p50=314.84 ms  gain=+2.0%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="2.6" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="420.6" y="512.0" text-anchor="start" class="value-text">+2.0%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=MARGINAL
+p50=316.97 ms  gain=+1.3%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="410.0" y="536.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.7" y="552.0" text-anchor="start" class="value-text">+1.3%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">321.26 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[318.51 · 324.03 · 321.26]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">338.66 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[420.76 · 338.66 · 331.40]</span></td>
+          <td><span class="gain-neg">-5.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">331.58 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[337.96 · 331.58 · 328.05]</span></td>
+          <td><span class="gain-neg">-3.2%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">329.70 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[326.33 · 334.93 · 329.70]</span></td>
+          <td><span class="gain-neg">-2.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">324.80 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[324.80 · 324.38 · 326.37]</span></td>
+          <td><span class="gain-neg">-1.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">329.01 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[321.67 · 332.71 · 329.01]</span></td>
+          <td><span class="gain-neg">-2.4%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">329.61 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[331.67 · 327.97 · 329.61]</span></td>
+          <td><span class="gain-neg">-2.6%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">327.27 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[327.27 · 324.65 · 327.86]</span></td>
+          <td><span class="gain-neg">-1.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">324.74 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[319.64 · 324.74 · 328.56]</span></td>
+          <td><span class="gain-neg">-1.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">314.84 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[314.87 · 314.84 · 313.88]</span></td>
+          <td><span class="gain-pos">+2.0%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">316.97 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[320.64 · 316.97 · 311.98]</span></td>
+          <td><span class="gain-pos">+1.3%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-5.4%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-3.2%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-neg">-2.6%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-neg">-2.4%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-neg">-2.6%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Bench Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-1.1%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-neg">-1.9%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-pos">—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-neg">-1.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td class="gain-pos">—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+2.0%</td>
+              <td>MARGINAL</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-pos">+1.3%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json
new file mode 100644
index 000000000..13721bfac
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json
@@ -0,0 +1,219 @@
+{
+  "model_id": "microsoft/rad-dino",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-17T21:21:07",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 311.746,
+      "screen_cv": 0.07637307295041477,
+      "full_p50s_ms": [
+        318.508,
+        324.031,
+        321.256
+      ],
+      "median_p50_ms": 321.256,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 321.096,
+      "screen_cv": 0.13311595286144953,
+      "full_p50s_ms": [
+        420.756,
+        338.659,
+        331.4
+      ],
+      "median_p50_ms": 338.659,
+      "gain_vs_baseline_pct": -5.42,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 334.966,
+      "screen_cv": 0.11764776126532245,
+      "full_p50s_ms": [
+        337.958,
+        331.58,
+        328.045
+      ],
+      "median_p50_ms": 331.58,
+      "gain_vs_baseline_pct": -3.21,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 328.867,
+      "screen_cv": 0.09647060969936173,
+      "full_p50s_ms": [
+        326.329,
+        334.932,
+        329.704
+      ],
+      "median_p50_ms": 329.704,
+      "gain_vs_baseline_pct": -2.63,
+      "verdict": "DISCARD"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 321.486,
+      "screen_cv": 0.1220675239357235,
+      "full_p50s_ms": [
+        324.795,
+        324.376,
+        326.365
+      ],
+      "median_p50_ms": 324.795,
+      "gain_vs_baseline_pct": -1.1,
+      "verdict": "DISCARD"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 327.99,
+      "screen_cv": 0.11450958870697277,
+      "full_p50s_ms": [
+        321.67,
+        332.714,
+        329.006
+      ],
+      "median_p50_ms": 329.006,
+      "gain_vs_baseline_pct": -2.41,
+      "verdict": "DISCARD"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 326.056,
+      "screen_cv": 0.11823429104202961,
+      "full_p50s_ms": [
+        331.665,
+        327.97,
+        329.607
+      ],
+      "median_p50_ms": 329.607,
+      "gain_vs_baseline_pct": -2.6,
+      "verdict": "DISCARD"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 320.193,
+      "screen_cv": 0.12970302286433497,
+      "full_p50s_ms": [
+        327.267,
+        324.65,
+        327.859
+      ],
+      "median_p50_ms": 327.267,
+      "gain_vs_baseline_pct": -1.87,
+      "verdict": "DISCARD"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 320.802,
+      "screen_cv": 0.12593125978017594,
+      "full_p50s_ms": [
+        319.641,
+        324.735,
+        328.564
+      ],
+      "median_p50_ms": 324.735,
+      "gain_vs_baseline_pct": -1.08,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 312.178,
+      "screen_cv": 0.12257750385997732,
+      "full_p50s_ms": [
+        314.865,
+        314.838,
+        313.876
+      ],
+      "median_p50_ms": 314.838,
+      "gain_vs_baseline_pct": 2.0,
+      "verdict": "MARGINAL"
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 314.241,
+      "screen_cv": 0.12399082233063159,
+      "full_p50s_ms": [
+        320.636,
+        316.974,
+        311.984
+      ],
+      "median_p50_ms": 316.974,
+      "gain_vs_baseline_pct": 1.33,
+      "verdict": "MARGINAL"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 321.256,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": -2.63,
+  "feature_gaps": [],
+  "errors": [
+    "h8: screen bench failed",
+    "h10: screen bench failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html
new file mode 100644
index 000000000..80f9a441c
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html
@@ -0,0 +1,595 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — microsoft/resnet-18</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — microsoft/resnet-18</h1>
+  <div class="subtitle">resnet arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+8.4%</div>
+      <div class="kpi-sub">Champion: h12</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">6.82 ms → 6.25 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.57 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h12</div>
+      <div class="kpi-sub">opset 17 + transpose_optimizer</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">2 effective / 2 ineffective</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>microsoft/resnet-18</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>resnet</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=6.82 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=MARGINAL
+p50=6.64 ms  gain=+2.7%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="3.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="421.5" y="112.0" text-anchor="start" class="value-text">+2.7%</text></g><g><title>h2: opset 19
+status=OK  verdict=DISCARD
+p50=6.88 ms  gain=-0.9%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="408.8" y="136.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.8" y="152.0" text-anchor="end" class="value-text">-0.9%</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=OK  verdict=MARGINAL
+p50=6.60 ms  gain=+3.3%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="410.0" y="176.0" width="4.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="422.3" y="192.0" text-anchor="start" class="value-text">+3.3%</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=OK  verdict=MARGINAL
+p50=6.56 ms  gain=+3.8%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="410.0" y="216.0" width="5.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="423.0" y="232.0" text-anchor="start" class="value-text">+3.8%</text></g><g><title>h5: opset 17 + attention_fusion
+status=OK  verdict=MARGINAL
+p50=6.59 ms  gain=+3.4%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="410.0" y="256.0" width="4.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="422.4" y="272.0" text-anchor="start" class="value-text">+3.4%</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=OK  verdict=MARGINAL
+p50=6.52 ms  gain=+4.5%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="410.0" y="296.0" width="5.8" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="423.8" y="312.0" text-anchor="start" class="value-text">+4.5%</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=7.11 ms  gain=-4.2%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="404.6" y="336.0" width="5.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="396.6" y="352.0" text-anchor="end" class="value-text">-4.2%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=OK  verdict=MARGINAL
+p50=6.78 ms  gain=+0.7%</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text><rect x="410.0" y="376.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.9" y="392.0" text-anchor="start" class="value-text">+0.7%</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=DISCARD
+p50=7.37 ms  gain=-8.0%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="399.6" y="416.0" width="10.4" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="391.6" y="432.0" text-anchor="end" class="value-text">-8.0%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=OK  verdict=MARGINAL
+p50=6.76 ms  gain=+1.0%</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text><rect x="410.0" y="456.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="419.3" y="472.0" text-anchor="start" class="value-text">+1.0%</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=6.39 ms  gain=+6.4%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="8.3" height="24" fill="#43a047" stroke="none" stroke-width="0" rx="4" /><text x="426.3" y="512.0" text-anchor="start" class="value-text">+6.4%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=MARGINAL_UNCONFIRMED
+p50=6.25 ms  gain=+5.7%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="410.0" y="536.0" width="7.4" height="24" fill="#43a047" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="425.4" y="552.0" text-anchor="start" class="value-text">+5.7%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.82 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.82 · 6.92 · 6.04]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.64 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.64 · 6.74 · 6.01]</span></td>
+          <td><span class="gain-pos">+2.7%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.88 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[7.39 · 6.56 · 6.88]</span></td>
+          <td><span class="gain-neg">-0.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.60 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.60 · 6.78 · 6.41]</span></td>
+          <td><span class="gain-pos">+3.3%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">6.56 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.56 · 6.53 · 7.55]</span></td>
+          <td><span class="gain-pos">+3.8%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">6.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.57 · 7.50 · 6.59]</span></td>
+          <td><span class="gain-pos">+3.4%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">6.52 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.11 · 6.52 · 6.68]</span></td>
+          <td><span class="gain-pos">+4.5%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">7.11 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[7.11 · 6.88 · 7.23]</span></td>
+          <td><span class="gain-neg">-4.2%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">skip_layer_norm_fusion</span></td>
+          <td class="p50-cell">6.78 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.34 · 6.78 · 7.06]</span></td>
+          <td><span class="gain-pos">+0.7%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">7.37 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[7.45 · 6.43 · 7.37]</span></td>
+          <td><span class="gain-neg">-8.0%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span>, <span class="flag-pill">skip_layer_norm_fusion</span>, <span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">6.76 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.76 · 8.16 · 6.38]</span></td>
+          <td><span class="gain-pos">+1.0%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">6.39 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.48 · 6.17 · 6.39 · 6.48 · 6.35]</span></td>
+          <td><span class="gain-pos">+6.4%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">4/5 sessions confirm</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h12</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">6.25 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.44 · 6.25 · 6.22 · 8.72 · 6.87]</span></td>
+          <td><span class="gain-pos">+5.7%</span></td>
+          <td><span class="">MARGINAL_UNCONFIRMED</span></td>
+          <td class="conf-cell">3/5 sessions confirm</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latency of each individual bench session; the median across sessions is used for comparison
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">✅ Effective Optimizations</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+6.4%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>4/5 sessions confirm</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-pos">+5.7%</td>
+              <td>MARGINAL_UNCONFIRMED</td>
+              <td>3/5 sessions confirm</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-neg">-4.2%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-neg">-8.0%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+2.7%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-0.9%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-pos">+3.3%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">+3.8%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">+3.4%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">+4.5%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-pos">+0.7%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td class="gain-pos">+1.0%</td>
+              <td>MARGINAL</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json
new file mode 100644
index 000000000..400cf7e3a
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json
@@ -0,0 +1,276 @@
+{
+  "model_id": "microsoft/resnet-18",
+  "task": "image-classification",
+  "model_type": "resnet",
+  "timestamp": "2026-06-18T09:05:04",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 7.266,
+      "screen_cv": 0.713047068538398,
+      "full_p50s_ms": [
+        6.823,
+        6.916,
+        6.04
+      ],
+      "median_p50_ms": 6.823,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 7.584,
+      "screen_cv": 0.2866561181434599,
+      "full_p50s_ms": [
+        6.638,
+        6.738,
+        6.012
+      ],
+      "median_p50_ms": 6.638,
+      "gain_vs_baseline_pct": 2.71,
+      "verdict": "MARGINAL"
+    },
+    "h2": {
+      "status": "OK",
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": null,
+      "screen_p50_ms": 7.013,
+      "screen_cv": 0.348353058605447,
+      "full_p50s_ms": [
+        7.392,
+        6.557,
+        6.884
+      ],
+      "median_p50_ms": 6.884,
+      "gain_vs_baseline_pct": -0.89,
+      "verdict": "DISCARD"
+    },
+    "h3": {
+      "status": "OK",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "extra_optim": null,
+      "screen_p50_ms": 7.382,
+      "screen_cv": 0.38715795177458684,
+      "full_p50s_ms": [
+        6.6,
+        6.775,
+        6.409
+      ],
+      "median_p50_ms": 6.6,
+      "gain_vs_baseline_pct": 3.27,
+      "verdict": "MARGINAL"
+    },
+    "h4": {
+      "status": "OK",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 7.024,
+      "screen_cv": 0.3361332574031891,
+      "full_p50s_ms": [
+        6.562,
+        6.53,
+        7.551
+      ],
+      "median_p50_ms": 6.562,
+      "gain_vs_baseline_pct": 3.83,
+      "verdict": "MARGINAL"
+    },
+    "h5": {
+      "status": "OK",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 7.344,
+      "screen_cv": 0.6030773420479303,
+      "full_p50s_ms": [
+        6.574,
+        7.504,
+        6.594
+      ],
+      "median_p50_ms": 6.594,
+      "gain_vs_baseline_pct": 3.36,
+      "verdict": "MARGINAL"
+    },
+    "h6": {
+      "status": "OK",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      },
+      "screen_p50_ms": 7.052,
+      "screen_cv": 0.34160521837776514,
+      "full_p50s_ms": [
+        6.114,
+        6.516,
+        6.682
+      ],
+      "median_p50_ms": 6.516,
+      "gain_vs_baseline_pct": 4.5,
+      "verdict": "MARGINAL"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 6.483,
+      "screen_cv": 0.9310504396112911,
+      "full_p50s_ms": [
+        7.109,
+        6.881,
+        7.234
+      ],
+      "median_p50_ms": 7.109,
+      "gain_vs_baseline_pct": -4.19,
+      "verdict": "DISCARD"
+    },
+    "h8": {
+      "status": "OK",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "skip_layer_norm_fusion": true
+      },
+      "screen_p50_ms": 6.841,
+      "screen_cv": 0.2964478877357111,
+      "full_p50s_ms": [
+        6.339,
+        6.777,
+        7.058
+      ],
+      "median_p50_ms": 6.777,
+      "gain_vs_baseline_pct": 0.67,
+      "verdict": "MARGINAL"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 6.98,
+      "screen_cv": 0.6378223495702006,
+      "full_p50s_ms": [
+        7.448,
+        6.432,
+        7.368
+      ],
+      "median_p50_ms": 7.368,
+      "gain_vs_baseline_pct": -7.99,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "OK",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "matmul_transpose_fusion": true
+      },
+      "screen_p50_ms": 5.897,
+      "screen_cv": 0.9113108360183143,
+      "full_p50s_ms": [
+        6.756,
+        8.163,
+        6.381
+      ],
+      "median_p50_ms": 6.756,
+      "gain_vs_baseline_pct": 0.98,
+      "verdict": "MARGINAL"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 6.974,
+      "screen_cv": 0.8368224835101807,
+      "full_p50s_ms": [
+        6.482,
+        6.175,
+        6.386
+      ],
+      "median_p50_ms": 6.386,
+      "gain_vs_baseline_pct": 6.4,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        6.48,
+        6.348
+      ],
+      "all_p50s_ms": [
+        6.482,
+        6.175,
+        6.386,
+        6.48,
+        6.348
+      ],
+      "overall_median_p50_ms": 6.386,
+      "overall_gain_pct": 6.4,
+      "sessions_above_threshold": 4,
+      "total_sessions": 5
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 5.992,
+      "screen_cv": 0.3384512683578104,
+      "full_p50s_ms": [
+        6.437,
+        6.251,
+        6.224
+      ],
+      "median_p50_ms": 6.251,
+      "gain_vs_baseline_pct": 8.38,
+      "verdict": "MARGINAL_UNCONFIRMED",
+      "confirm_p50s_ms": [
+        8.718,
+        6.869
+      ],
+      "all_p50s_ms": [
+        6.437,
+        6.251,
+        6.224,
+        8.718,
+        6.869
+      ],
+      "overall_median_p50_ms": 6.437,
+      "overall_gain_pct": 5.66,
+      "sessions_above_threshold": 3,
+      "total_sessions": 5
+    }
+  },
+  "best_hypothesis": "h12",
+  "baseline_p50_ms": 6.823,
+  "best_p50_ms": 6.251,
+  "best_gain_pct": 8.38,
+  "opset21_gain_pct": 3.27,
+  "feature_gaps": [],
+  "errors": []
+}
diff --git a/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
new file mode 100644
index 000000000..b7cfc5441
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
@@ -0,0 +1,577 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN GPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN GPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</h1>
+  <div class="subtitle">bert arch · 2026-06-18 · 13 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">Champion: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">27.93 ms → —</div>
+      <div class="kpi-sub">Latency reduction: —</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / GPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">—</div>
+      <div class="kpi-sub">—</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">13</div>
+      <div class="kpi-sub">0 KEEP / 9 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>sentence-transformers/all-MiniLM-L6-v2</td></tr><tr><th>Task</th><td>sentence-similarity</td></tr><tr><th>Arch type</th><td>bert</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>gpu</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 594" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="568" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="568" class="tick-line" /><text x="150.0" y="588" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="568" class="tick-line" /><text x="280.0" y="588" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="568" class="tick-line" /><text x="410.0" y="588" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="568" class="tick-line" /><text x="540.0" y="588" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="568" class="tick-line" /><text x="670.0" y="588" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline FP32 (no quant, no compile)
+status=OK  verdict=BASELINE
+p50=27.93 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline FP32 (no quant, …</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=DISCARD
+p50=32.66 ms  gain=-16.9%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="388.0" y="96.0" width="22.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="380.0" y="112.0" text-anchor="end" class="value-text">-16.9%</text></g><g><title>h2: opset 19
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="364.0" y="136.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="152.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h3: opset 21 (tests gpu-006)
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests gpu-006)</text><rect x="364.0" y="176.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="192.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h4: opset 17 + matmul_transpose_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + matmul_transpo…</text><rect x="364.0" y="216.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="232.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h5: opset 17 + attention_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 17 + attention_fusi…</text><rect x="364.0" y="256.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="272.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h6: opset 17 + bias_softmax_fusion
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 17 + bias_softmax_f…</text><rect x="364.0" y="296.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="312.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h7: opset 17 + layer_norm_fusion
+status=OK  verdict=DISCARD
+p50=28.55 ms  gain=-2.2%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 17 + layer_norm_fus…</text><rect x="407.1" y="336.0" width="2.9" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="399.1" y="352.0" text-anchor="end" class="value-text">-2.2%</text></g><g><title>h8: opset 17 + skip_layer_norm_fusion
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 17 + skip_layer_nor…</text></g><g><title>h9: opset 21 + matmul_transpose + attention_fusion
+status=OK  verdict=DISCARD
+p50=29.08 ms  gain=-4.1%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="404.6" y="416.0" width="5.4" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="396.6" y="432.0" text-anchor="end" class="value-text">-4.1%</text></g><g><title>h10: opset 17 + ln + skip_ln + matmul_transpose
+status=BENCH_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + ln + skip_ln +…</text></g><g><title>h11: opset 17 + gelu_fusion explicit
+status=OK  verdict=MARGINAL
+p50=27.38 ms  gain=+2.0%</title><rect x="0" y="488.0" width="748" height="40" class="row-bg" /><text x="8" y="504.0" class="hyp-label">h11</text><text x="8" y="517.0" class="hyp-sub">opset 17 + gelu_fusion ex…</text><rect x="410.0" y="496.0" width="2.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="420.5" y="512.0" text-anchor="start" class="value-text">+2.0%</text></g><g><title>h12: opset 17 + transpose_optimizer
+status=OK  verdict=DISCARD
+p50=28.86 ms  gain=-3.3%</title><rect x="0" y="528.0" width="748" height="40" class="row-bg" /><text x="8" y="544.0" class="hyp-label">h12</text><text x="8" y="557.0" class="hyp-sub">opset 17 + transpose_opti…</text><rect x="405.7" y="536.0" width="4.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.7" y="552.0" text-anchor="end" class="value-text">-3.3%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline FP32 (no quant, no compile)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">27.93 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[27.93 · 27.93 · 28.94]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">BASELINE</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">32.66 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[32.66 · 44.52 · 31.94]</span></td>
+          <td><span class="gain-neg">-16.9%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests gpu-006)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + matmul_transpose_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 17 + attention_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 17 + bias_softmax_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 17 + layer_norm_fusion</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">layer_norm_fusion</span></td>
+          <td class="p50-cell">28.55 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[27.27 · 28.55 · 28.84]</span></td>
+          <td><span class="gain-neg">-2.2%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 17 + skip_layer_norm_fusion</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span>, <span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">29.08 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[29.98 · 27.45 · 29.08]</span></td>
+          <td><span class="gain-neg">-4.1%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + ln + skip_ln + matmul_transpose</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="verdict-discard">BENCH_FAIL</span></td>
+          <td class="conf-cell">bench failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h11</span></td>
+          <td class="label-cell">opset 17 + gelu_fusion explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">gelu_fusion</span></td>
+          <td class="p50-cell">27.38 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[27.38 · 26.66 · 27.49]</span></td>
+          <td><span class="gain-pos">+2.0%</span></td>
+          <td><span class="">MARGINAL</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h12</span></td>
+          <td class="label-cell">opset 17 + transpose_optimizer</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">transpose_optimizer</span></td>
+          <td class="p50-cell">28.86 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[28.86 · 29.80 · 28.35]</span></td>
+          <td><span class="gain-neg">-3.3%</span></td>
+          <td><span class="verdict-discard">DISCARD</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-16.9%</td>
+              <td>DISCARD</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-pos">—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests gpu-006)</td>
+              <td class="gain-pos">—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + matmul_transpose_fusion</td>
+              <td class="gain-pos">—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 17 + attention_fusion</td>
+              <td class="gain-pos">—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 17 + bias_softmax_fusion</td>
+              <td class="gain-pos">—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 17 + layer_norm_fusion</td>
+              <td class="gain-neg">-2.2%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + matmul_transpose + attention_fusion</td>
+              <td class="gain-neg">-4.1%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h12</span></td>
+              <td>opset 17 + transpose_optimizer</td>
+              <td class="gain-neg">-3.3%</td>
+              <td>DISCARD</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Bench Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline FP32 (no quant, no compile)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>BASELINE</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 17 + skip_layer_norm_fusion</td>
+              <td class="gain-pos">—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + ln + skip_ln + matmul_transpose</td>
+              <td class="gain-pos">—</td>
+              <td>BENCH_FAIL</td>
+              <td>bench failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h11</span></td>
+              <td>opset 17 + gelu_fusion explicit</td>
+              <td class="gain-pos">+2.0%</td>
+              <td>MARGINAL</td>
+              <td>ranges separated</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
new file mode 100644
index 000000000..e7ecb0f24
--- /dev/null
+++ b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
@@ -0,0 +1,168 @@
+{
+  "model_id": "sentence-transformers/all-MiniLM-L6-v2",
+  "task": "sentence-similarity",
+  "model_type": "bert",
+  "timestamp": "2026-06-18T10:30:56",
+  "ep": "qnn",
+  "device": "gpu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 28.907,
+      "screen_cv": 0.3101670875566472,
+      "full_p50s_ms": [
+        27.929,
+        27.929,
+        28.942
+      ],
+      "median_p50_ms": 27.929,
+      "verdict": "BASELINE"
+    },
+    "h1": {
+      "status": "OK",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": null,
+      "screen_p50_ms": 33.164,
+      "screen_cv": 0.5540344952357978,
+      "full_p50s_ms": [
+        32.658,
+        44.515,
+        31.943
+      ],
+      "median_p50_ms": 32.658,
+      "gain_vs_baseline_pct": -16.93,
+      "verdict": "DISCARD"
+    },
+    "h2": {
+      "status": "BUILD_FAIL",
+      "label": "opset 19",
+      "opset": 19,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h3": {
+      "status": "BUILD_FAIL",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h4": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "build_error": "   Supported tasks are: feature-extraction,          \n                             fill-mask, multiple-choice, question-answering,   \n                             text-classification, token-classification.        \n⏳ Export  Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n"
+    },
+    "h5": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "build_error": "..Error: Build failed: ONNX Runtime optimization failed: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed. | Pipe: ort_graph | Model info: {'optimization_level': 2, 'disabled_count': 40} | Caused by: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed.\n"
+    },
+    "h6": {
+      "status": "BUILD_FAIL",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "build_error": "e}Error: Build failed: ONNX Runtime optimization failed: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed. | Pipe: ort_graph | Model info: {'optimization_level': 2, 'disabled_count': 37} | Caused by: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed.\n"
+    },
+    "h7": {
+      "status": "OK",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "extra_optim": {
+        "layer_norm_fusion": true
+      },
+      "screen_p50_ms": 27.998,
+      "screen_cv": 0.7119437102650189,
+      "full_p50s_ms": [
+        27.269,
+        28.545,
+        28.837
+      ],
+      "median_p50_ms": 28.545,
+      "gain_vs_baseline_pct": -2.21,
+      "verdict": "DISCARD"
+    },
+    "h8": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + skip_layer_norm_fusion"
+    },
+    "h9": {
+      "status": "OK",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      },
+      "screen_p50_ms": 28.12,
+      "screen_cv": 0.3956258890469417,
+      "full_p50s_ms": [
+        29.983,
+        27.454,
+        29.083
+      ],
+      "median_p50_ms": 29.083,
+      "gain_vs_baseline_pct": -4.13,
+      "verdict": "DISCARD"
+    },
+    "h10": {
+      "status": "BENCH_FAIL",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose"
+    },
+    "h11": {
+      "status": "OK",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "extra_optim": {
+        "gelu_fusion": true
+      },
+      "screen_p50_ms": 26.486,
+      "screen_cv": 0.2676508344030809,
+      "full_p50s_ms": [
+        27.382,
+        26.663,
+        27.486
+      ],
+      "median_p50_ms": 27.382,
+      "gain_vs_baseline_pct": 1.96,
+      "verdict": "MARGINAL"
+    },
+    "h12": {
+      "status": "OK",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "extra_optim": {
+        "transpose_optimizer": true
+      },
+      "screen_p50_ms": 31.432,
+      "screen_cv": 0.36580554848561975,
+      "full_p50s_ms": [
+        28.86,
+        29.805,
+        28.349
+      ],
+      "median_p50_ms": 28.86,
+      "gain_vs_baseline_pct": -3.33,
+      "verdict": "DISCARD"
+    }
+  },
+  "best_hypothesis": null,
+  "baseline_p50_ms": 27.929,
+  "best_p50_ms": null,
+  "best_gain_pct": null,
+  "opset21_gain_pct": null,
+  "feature_gaps": [],
+  "errors": [
+    "h2: BUILD_FAIL",
+    "h3: BUILD_FAIL",
+    "h4: BUILD_FAIL",
+    "h5: BUILD_FAIL",
+    "h6: BUILD_FAIL",
+    "h8: screen bench failed",
+    "h10: screen bench failed"
+  ]
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/.gitignore b/research/autoconfig/catalog-qnn-sweep/.gitignore
new file mode 100644
index 000000000..29bb809b7
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/.gitignore
@@ -0,0 +1,3 @@
+# Ignore per-hypothesis build artifacts from validation_sweep.py
+# (ONNX model files, calibration data, perf session JSONs)
+val_h*/
diff --git a/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json b/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json
new file mode 100644
index 000000000..fed23f364
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json
@@ -0,0 +1,31 @@
+{
+  "model_id": "BAAI/bge-small-en-v1.5",
+  "task": "sentence-similarity",
+  "hypotheses": {
+    "h0": {
+      "description": "opset17 no opts",
+      "model_file": "quantized.onnx",
+      "screen_p50_ms": 9.208,
+      "screen_cv": 0.3059,
+      "full_p50s_ms": [
+        10.516,
+        10.323,
+        11.01
+      ],
+      "avg_p50_ms": 10.616
+    },
+    "h3": {
+      "description": "opset21 no opts",
+      "model_file": "quantized.onnx",
+      "screen_p50_ms": 9.562,
+      "screen_cv": 0.2575,
+      "full_p50s_ms": [
+        10.253,
+        9.331,
+        9.937
+      ],
+      "avg_p50_ms": 9.84
+    }
+  },
+  "opset21_gain_pct": 7.31
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/SUMMARY.md b/research/autoconfig/catalog-qnn-sweep/SUMMARY.md
new file mode 100644
index 000000000..fca9f0439
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/SUMMARY.md
@@ -0,0 +1,175 @@
+# QNN NPU Optimization Sweep — Catalog Models
+
+Generated: 2026-06-22T12:29:01  
+EP: `qnn` / device: `npu`  
+Bench protocol: Phase-A 200 iters (high CV expected on QNN NPU — DVFS), Phase-B 500x3 sessions, 30s cool-down  
+npu-001 criterion: median >=5% gain AND ranges non-overlapping  
+npu-006 criterion: Conv% of ops; h4/h5 marked catastrophic if >=5x baseline  
+Effect-size gate: gain reliable only if gain% >= 2×(session-CV) AND ranges separated  
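+
+A minimal sketch of the effect-size gate described above, not the sweep script's actual code. The function names are hypothetical, and it assumes the session CV is a fraction (0.05 means 5%) that is scaled to percent before comparison with the gain:
+
+```python
+from statistics import median
+
+
+def ranges_separated(baseline_ms, candidate_ms):
+    """True when the slowest candidate session beats the fastest baseline session."""
+    return max(candidate_ms) < min(baseline_ms)
+
+
+def is_reliable_gain(baseline_ms, candidate_ms, session_cv):
+    """Apply the effect-size gate to two lists of per-session p50 latencies.
+
+    Assumption: session_cv is a fraction (0.05 == 5%), so it is scaled to
+    percent before being compared with the gain.
+    """
+    base = median(baseline_ms)
+    gain_pct = (base - median(candidate_ms)) / base * 100.0
+    return gain_pct >= 2.0 * session_cv * 100.0 and ranges_separated(baseline_ms, candidate_ms)
+
+
+# Hypothetical sessions (not measured data): roughly 20.8% gain, separated ranges.
+baseline = [10.0, 10.2, 10.1]
+candidate = [8.0, 8.1, 7.9]
+print(is_reliable_gain(baseline, candidate, session_cv=0.05))  # gate needs >= 10% gain
+print(is_reliable_gain(baseline, candidate, session_cv=0.15))  # gate needs >= 30% gain
+```
+
+Under this reading, a 2-3% gain measured at a session CV of 0.05 or higher would not pass the gate, which matches the "within noise" entries in the table below.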
+
+---
+
+## Per-Model Results
+
+| Model | Conv% | Baseline p50 | Best p50 | Best config | Gain% | Reliable? | npu-001? | npu-006 regression? | Notes |
+|-------|-------|-------------|----------|-------------|-------|-----------|----------|---------------------|-------|
+| `apple/mobilevit-small` | 2% | 5.5 ms | 5.4 ms | h3 (opset 21 (tests npu-001 bypass)) | 2.8% | ⚠️ within noise | neutral | no | none |
+| `deepset/roberta-base-squad2` | N/A | 14.9 ms | 14.7 ms | h1 (opset 17 explicit) | 1.5% | N/A | neutral | no | Model timed out at 1466s (before h4); Model timed out at 1466s (before h5) |
+| `distilbert/distilbert-base-uncased-finetuned-sst-2-english` | N/A | 19.5 ms | 19.5 ms | h2 (opset 19) | 0.0% | N/A | neutral | no | Model timed out at 1385s (before h5) |
+| `facebook/dinov2-small` | N/A | 6.6 ms | 5.0 ms | h3 (opset 21 (tests npu-001 bypass)) | 24.1% | N/A | YES (median) | no | Model timed out at 1333s (before h4); Model timed out at 1333s (before h5) |
+| `google/vit-base-patch16-224` | N/A | 9.0 ms | 9.0 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | NO | no | h2: BUILD_FAIL; Model timed out at 1204s (before h4); Model timed out at 1204s ( |
+| `hustvl/yolos-small` | 0% | 49.6 ms | 48.6 ms | h3 (opset 21 (tests npu-001 bypass)) | 2.0% | ⚠️ within noise | N/A | no | h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos i |
+| `microsoft/resnet-18` | N/A | 1.0 ms | 1.0 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | YES (median) | no | Model timed out at 1560s (before h5) |
+| `sentence-transformers/all-MiniLM-L6-v2` | N/A | 5.8 ms | 5.8 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | neutral | no | Model timed out at 1346s (before h5) |
+
+## Hypothesis Breakdown per Model
+
+### apple/mobilevit-small
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 5.0 | 5.5 | 0.093 | OK | — |
+| h1 (opset 17 explicit) | 17 | 5.8 | 5.6 | 0.304 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 5.8 | 6.6 | 0.120 | OK | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 5.2 | 5.4 | 0.163 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | 17 | 6.7 | 6.5 | 0.181 | OK_HIGH_CV ⚡DVFS | — |
+| h5 (opset 21 + conv fusions) | 21 | 6.2 | 6.7 | 0.153 | OK_HIGH_CV ⚡DVFS | — |
+| h6 (opset 21 + matmul_transpose_fusion) | 21 | 5.9 | 6.2 | 0.229 | OK_HIGH_CV ⚡DVFS | — |
+| h7 (opset 21 + bias_softmax_fusion) | 21 | 4.6 | 6.4 | 0.043 | OK | — |
+| h8 (opset 21 + attention_fusion) | 21 | 6.5 | 5.8 | 0.455 | OK_HIGH_CV ⚡DVFS | — |
+| h9 (opset 21 + highdimRTR_lowdimRTR) | 21 | 5.7 | 6.5 | 0.190 | OK_HIGH_CV ⚡DVFS | — |
+| h10 (opset 17 + conv_add_fusion only) | 17 | 6.7 | 5.9 | 0.188 | OK_HIGH_CV ⚡DVFS | — |
+
+### deepset/roberta-base-squad2
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 14.9 | 14.9 | 0.119 | OK | — |
+| h1 (opset 17 explicit) | 17 | 14.7 | 14.7 | 0.129 | OK | — |
+| h2 (opset 19) | 19 | 15.3 | 14.9 | 0.234 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 14.8 | 14.9 | 0.116 | OK | — |
+| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+### distilbert/distilbert-base-uncased-finetuned-sst-2-english
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 19.5 | 19.5 | 0.156 | OK_HIGH_CV ⚡DVFS | — |
+| h1 (opset 17 explicit) | 17 | 19.7 | 19.5 | 0.272 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 19.4 | 19.5 | 0.195 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 19.4 | 19.5 | 0.290 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | 17 | 19.4 | 19.6 | 0.237 | OK_HIGH_CV ⚡DVFS | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+### facebook/dinov2-small
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 7.2 | 6.6 | 0.344 | OK_HIGH_CV ⚡DVFS | — |
+| h1 (opset 17 explicit) | 17 | 4.9 | 7.2 | 0.457 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 7.0 | 7.2 | 1.805 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 9.4 | 5.0 | 0.936 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+### google/vit-base-patch16-224
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 9.2 | 9.0 | 1.289 | OK_HIGH_CV ⚡DVFS | 0.740 |
+| h1 (opset 17 explicit) | 17 | 9.7 | 9.3 | 0.743 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | — | — | ? | BUILD_FAIL | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 11.6 | 10.0 | 2.159 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+### hustvl/yolos-small
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 48.7 | 49.6 | 0.067 | OK | — |
+| h1 (opset 17 explicit) | 17 | 66.4 | 65.9 | 0.226 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | ? | — | — | ? | TIMEOUT | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 48.8 | 48.6 | 0.050 | OK | — |
+| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+| h6 (opset 21 + matmul_transpose_fusion) | 21 | 49.0 | 50.0 | 0.048 | OK | — |
+| h7 (opset 21 + bias_softmax_fusion) | 21 | 49.0 | 51.6 | 0.062 | OK | — |
+| h8 (opset 21 + attention_fusion) | 21 | 51.3 | 49.5 | 0.078 | OK | — |
+
+### microsoft/resnet-18
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 4.0 | 1.0 | 1.690 | OK_HIGH_CV ⚡DVFS | 0.660 |
+| h1 (opset 17 explicit) | 17 | 3.1 | 2.7 | 2.036 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 4.0 | 1.1 | 1.517 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 3.0 | 2.2 | 1.176 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | 17 | 128.1 | 132.3 | 1.405 | OK_HIGH_CV ⚡DVFS | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+### sentence-transformers/all-MiniLM-L6-v2
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 5.9 | 5.8 | 0.222 | OK_HIGH_CV ⚡DVFS | — |
+| h1 (opset 17 explicit) | 17 | 5.9 | 5.9 | 0.999 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 5.3 | 6.0 | 0.205 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 6.0 | 5.9 | 1.127 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | 17 | 5.5 | 6.0 | 0.134 | OK | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+---
+
+## Cross-Model Patterns
+
+### npu-001: Does opset 21 bypass help broadly?
+
+- **Helps (2 models):** `facebook/dinov2-small`, `microsoft/resnet-18`
+- **Hurts (1 model):** `google/vit-base-patch16-224`
+- **Neutral (4 models):** `apple/mobilevit-small`, `deepset/roberta-base-squad2`, `distilbert/distilbert-base-uncased-finetuned-sst-2-english`, `sentence-transformers/all-MiniLM-L6-v2`
+- **N/A (1 model):** `hustvl/yolos-small`
+
+> **Finding**: Mixed results (2 help, 1 hurt, 4 neutral, 1 N/A). Architecture-dependent. Confirm ORT `kMaxSupportedOpset` version before drawing conclusions.
+
+### Feature Gaps
+
+- No feature gaps observed
+
+### Build / Compatibility Issues
+
+**`deepset/roberta-base-squad2`**
+  - Model timed out at 1466s (before h4)
+  - Model timed out at 1466s (before h5)
+**`distilbert/distilbert-base-uncased-finetuned-sst-2-english`**
+  - Model timed out at 1385s (before h5)
+**`facebook/dinov2-small`**
+  - Model timed out at 1333s (before h4)
+  - Model timed out at 1333s (before h5)
+**`google/vit-base-patch16-224`**
+  - h2: BUILD_FAIL
+  - Model timed out at 1204s (before h4)
+  - Model timed out at 1204s (before h5)
+**`hustvl/yolos-small`**
+  - h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos is 0.1% conv / 99.9% transformer, so conv-fusion and intermediate-opset hypotheses are low expected-value).
+**`microsoft/resnet-18`**
+  - Model timed out at 1560s (before h5)
+**`sentence-transformers/all-MiniLM-L6-v2`**
+  - Model timed out at 1346s (before h5)
+
+---
+
+## Updated Recommendations for `ep_knowledge/qnn_npu.json`
+
+Based on this cross-architecture sweep:
+
+- **npu-001**: Broaden scope beyond ConvNext. Architectures that benefit: facebook/dinov2-small, microsoft/resnet-18. Update `scope` field and set `gate1_statistical` confidence accordingly.
+- **search_space_rules.opset.recommended_order**: Retain `[21, 17]` as default order.
+
+### Conv Fusion Findings (h4 vs h1, h5 vs h3)
+
+- **`apple/mobilevit-small`**: conv-fusions on opset17: -16.0% (5.6→6.5ms); conv-fusions on opset21: -25.3% (5.4→6.7ms)
+- **`distilbert/distilbert-base-uncased-finetuned-sst-2-english`**: conv-fusions on opset17: -0.5% (19.5→19.6ms)
+- **`microsoft/resnet-18`**: conv-fusions on opset17: -4771.1% (2.7→132.3ms)
+- **`sentence-transformers/all-MiniLM-L6-v2`**: conv-fusions on opset17: -1.5% (5.9→6.0ms)
diff --git a/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md b/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md
new file mode 100644
index 000000000..0dc697d3e
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md
@@ -0,0 +1,108 @@
+# Validation Sweep Results — QNN NPU (2026-06-16)
+
+**Device:** Snapdragon X Elite X1E80100  
+**ORT:** onnxruntime-windowsml==1.24.5  
+**QNN SDK:** 2.2450.47.0  
+**Protocol:** 3 × 500 iters, 30s cool-down, `quantized.onnx` (W8A16), `--no-compile`  
+**Script:** `validation_sweep.py` — targeted 4-hypothesis sweep (h0/h1/h3/h4)
+
+## Hypothesis Matrix
+
+| ID | Config | Purpose |
+|----|--------|---------|
+| h0 | auto-config baseline (W8A16, opset auto) | baseline reference |
+| h1 | opset 17 explicit (W8A16) | npu-001 baseline |
+| h3 | opset 21 (W8A16) | **npu-001 test** — does opset21 help? |
+| h4 | opset 17 + conv fusions | **npu-006 test** — do conv fusions regress? |
+
+---
+
+## Results by Model
+
+### facebook/dinov2-base (ViT-B DINOv2, image-feature-extraction)
+
+| Hyp | Median p50 | Sessions (ms) | CV note |
+|-----|-----------|---------------|---------|
+| h0 auto | 38.68 ms | [38.99, 38.68, 36.26] | stable (stale build artifact) |
+| **h1 opset17** | **34.56 ms** | [34.56, 34.67, 33.15] | rock stable |
+| **h3 opset21** | **26.23 ms** | [33.00, 26.22, 26.23] | s0 elevated (JIT warmup), s1+s2 stable |
+| h4 fusions | 25.92 ms | [26.06, 25.92, 25.87] | rock stable |
+
+**npu-001: opset21 → +24.1% speedup** `(34.56 → 26.23ms)`  
+**npu-006: conv fusions → -25% (fusions FASTER, not regression)** — DINOv2 is attention-dominant, few Conv ops to fuse
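+
+As a worked check of the npu-001 figure above, the sketch below recomputes the gain from the session values in the table. The helper name `gain_pct` is hypothetical; the formula (relative reduction in median p50 versus h1) is assumed from the reported numbers:
+
+```python
+from statistics import median
+
+
+def gain_pct(baseline_ms, candidate_ms):
+    """Relative latency reduction in percent; positive means the candidate is faster."""
+    return (baseline_ms - candidate_ms) / baseline_ms * 100.0
+
+
+# Session p50s copied from the dinov2-base table above.
+h1_sessions = [34.56, 34.67, 33.15]  # opset 17
+h3_sessions = [33.00, 26.22, 26.23]  # opset 21
+
+h1 = median(h1_sessions)  # 34.56
+h3 = median(h3_sessions)  # 26.23
+print(f"{gain_pct(h1, h3):+.1f}%")  # +24.1%
+```
+
+Using the median rather than the mean keeps the elevated first h3 session (33.00 ms) from pulling the result down.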
+
+---
+
+### microsoft/rad-dino (ViT-L DINOv2 medical, image-feature-extraction)
+
+| Hyp | Median p50 | Sessions (ms) | CV note |
+|-----|-----------|---------------|---------|
+| **h1 opset17** | **274.98 ms** | [274.98, 274.56, 275.10] | CV=0.009, CPU-deterministic |
+| **h3 opset21** | **275.36 ms** | [275.30, 275.36, 275.56] | CV=0.022 |
+
+**npu-001: -0.1% — NEUTRAL (CPU-bound)**  
+Model runs entirely on CPU (~275ms). QNN NPU does not accelerate rad-dino (possibly because ViT-L is too large or uses incompatible ops; the cause was not investigated). Opset has no effect when the model is CPU-bound.
+
+---
+
+### facebook/dino-vitb16 (plain DINO ViT-B/16, image-feature-extraction)
+
+| Hyp | Median p50 | Sessions (ms) | CV note |
+|-----|-----------|---------------|---------|
+| **h1 opset17** | **19.92 ms** | [19.92, 19.97, 19.90] | rock stable |
+| **h3 opset21** | **20.07 ms** | [20.20, 20.07, 19.99] | rock stable |
+| h4 fusions | 20.12 ms | [20.12, 20.04, 20.41] | rock stable |
+
+**npu-001: -0.7% — NEUTRAL** ← **critical control**  
+**npu-006: +1.0% — NEUTRAL** (no Conv layers to fuse beyond the patch-embed Conv, whose fusion is benign)
+
+---
+
+## Cross-Model Summary — npu-001 (opset21 vs opset17)
+
+| Model | Architecture | opset17 (h1) | opset21 (h3) | Gain | Verdict |
+|-------|-------------|-------------|-------------|------|---------|
+| facebook/dinov2-small | DINOv2 ViT-S | 7.18 ms* | 4.98 ms* | **+30.6%** | ✅ CONFIRMED |
+| facebook/dinov2-base | DINOv2 ViT-B | 34.56 ms | 26.23 ms | **+24.1%** | ✅ CONFIRMED |
+| apple/mobilevit-small | Conv+Attn hybrid | 11.72 ms* | 8.62 ms* | **+26.5%** ⚠️ | 🟡 LIKELY (DVFS spike in h1) |
+| facebook/dino-vitb16 | plain ViT-B/16 | 19.92 ms | 20.07 ms | **-0.7%** | ❌ NEUTRAL — critical control |
+| microsoft/rad-dino | ViT-L DINOv2 | 274.98 ms | 275.36 ms | **-0.1%** | ⬛ CPU-BOUND (untestable) |
+| google/vit-base-patch16-224 | plain ViT-B | n/a | n/a | **-7.4%** ⚠️* | ❌ REGRESSION |
+
+_*Original catalog_qnn_sweep.py data (optimized.onnx, not quantized.onnx — different pipeline)_
+
+**Key architectural discriminant:** opset21 consistently helps the **DINOv2 family** (+24-31%) but has **no measurable effect on the plain ViT control** (dino-vitb16: -0.7%, noise-level). This is NOT a general ViT property. DINOv2-specific op patterns presumably explain the difference; the mechanism is TBD.
+
+---
+
+## Cross-Model Summary — npu-006 (conv fusions)
+
+| Model | Architecture | h1 no-fusions | h4 fusions | Regression | Verdict |
+|-------|-------------|--------------|-----------|------------|---------|
+| microsoft/resnet-18 | Conv-dominant | ~1–4 ms* | 132–135 ms* | **+4900%** 🔥 | ✅ CATASTROPHIC |
+| apple/mobilevit-small | Conv+Attn | ~10–12 ms* | ~10–12 ms* | **≈0%** | 🟢 SAFE |
+| facebook/dinov2-base | DINOv2 ViT-B | 34.56 ms | 25.92 ms | **-25%** (faster) | 🟢 SAFE / beneficial |
+| facebook/dino-vitb16 | plain ViT-B | 19.92 ms | 20.12 ms | **+1.0%** | 🟢 SAFE (neutral) |
+
+_*Original catalog_qnn_sweep.py data_
+
+**Conclusion:** In this sample, conv fusions regress only the Conv-dominant model (ResNet-18). Attention-dominant models (DINOv2, ViT) are safe or slightly benefit. The hazard appears to grow with Conv op density, though four models are too few to establish proportionality.
+
+---
+
+## Bugs Found and Fixed in validation_sweep.py
+
+| Bug | Impact | Fix |
+|-----|--------|-----|
+| `bench_screen` parsed `d.get("p50_ms")` instead of `d["latency_ms"]["p50"]` | All hypotheses marked BENCH_FAIL in v1/v2 runs | Fixed to read nested `latency_ms.p50` |
+| Reuse check triggered on any `.onnx` (including truncated `export.onnx`) | h1 was benchmarked on FP32 unoptimized model | Changed to require `quantized.onnx` or `optimized.onnx` |
+| Model file selection preferred `optimized.onnx` over `quantized.onnx` alphabetically | Benchmarked FP32 graph instead of W8A16 quantized | Fixed to explicitly prefer `quantized` > `optimized` > other |
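+
+The fixes in the table can be sketched as follows. This is an illustration of the described behaviour, not the code in `validation_sweep.py`; the function names `pick_model_file` and `read_p50` are hypothetical:
+
+```python
+# Explicit preference order from the table: quantized > optimized > other.
+PREFERRED = ("quantized.onnx", "optimized.onnx")
+
+
+def pick_model_file(file_names):
+    """Choose which ONNX file to benchmark from the names in a build directory."""
+    onnx_files = [name for name in file_names if name.endswith(".onnx")]
+    for preferred in PREFERRED:
+        if preferred in onnx_files:
+            return preferred
+    # Fall back to any other .onnx file, or None when there is none.
+    return sorted(onnx_files)[0] if onnx_files else None
+
+
+def can_reuse_build(file_names):
+    """Reuse a build only when a quantized or optimized model is present."""
+    return any(name in PREFERRED for name in file_names)
+
+
+def read_p50(bench_result):
+    """Read the nested latency_ms.p50 value; return None when it is missing."""
+    latency = bench_result.get("latency_ms")
+    if not isinstance(latency, dict):
+        return None
+    return latency.get("p50")
+
+
+print(pick_model_file(["export.onnx", "optimized.onnx", "quantized.onnx"]))
+print(can_reuse_build(["export.onnx"]))
+print(read_p50({"latency_ms": {"p50": 26.23}}))
+```
+
+Returning `None` from `read_p50` lets the caller tell a missing measurement apart from a parsed one, rather than silently treating both as a bench failure.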
+
+---
+
+## Known Limitations
+
+1. **`--no-compile` throughout**: All runs omit `winml compile` (pre-built QNN context binary). Production use would include compile, which npu-003 suggests adds ~1.7x additional speedup. The npu-001 ratio is expected to hold with compile enabled, but this was not verified here; absolute latencies will be lower.
+2. **3 sessions only**: DVFS on QNN NPU can cause a latency spike in any single session. With only 3 sessions, the median is still skewed if 2 of the 3 sessions spike. See h3 dinov2-base: s0=33ms (warmup effect) vs s1+s2=26ms.
+3. **rad-dino untestable**: When a model falls back entirely to CPU, no NPU-related findings can be extracted. The reason for CPU fallback (model size? unsupported ops?) was not investigated.
+4. **dinov2-small not re-validated with v2 pipeline**: The original +30.6% result was from `catalog_qnn_sweep.py` using `optimized.onnx`. The v2 pipeline uses `quantized.onnx`. For full comparability, dinov2-small should be re-run with `validation_sweep.py`.
diff --git a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json
new file mode 100644
index 000000000..72a1a9465
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json
@@ -0,0 +1,59 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          256,
+          256
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "logits"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "image-classification",
+    "model_name": "apple/mobilevit-small"
+  },
+  "compile": null,
+  "loader": {
+    "task": "image-classification",
+    "model_class": "AutoModelForImageClassification",
+    "model_type": "mobilevit"
+  }
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html
new file mode 100644
index 000000000..85d8074da
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html
@@ -0,0 +1,535 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — apple/mobilevit-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — apple/mobilevit-small</h1>
+  <div class="subtitle">mobilevit arch · 2026-06-22 · 11 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+2.8%</div>
+      <div class="kpi-sub">Champion: h3 · ⚠ neutral within noise — ship baseline</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">5.51 ms → 5.36 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.15 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h0 (baseline)</div>
+      <div class="kpi-sub">⚠ neutral within noise — ship baseline</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">11</div>
+      <div class="kpi-sub">0 KEEP / 8 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>apple/mobilevit-small</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>mobilevit</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>Conv%</th><td>2.5%</td></tr><tr><th>npu-006 risk</th><td>LOW</td></tr><tr><th>npu-001 note</th><td>neutral</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 514" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="488" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="488" class="tick-line" /><text x="150.0" y="508" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="488" class="tick-line" /><text x="280.0" y="508" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="488" class="tick-line" /><text x="410.0" y="508" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="488" class="tick-line" /><text x="540.0" y="508" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="488" class="tick-line" /><text x="670.0" y="508" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK  verdict=—
+p50=5.51 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=5.61 ms  gain=-1.9%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="407.5" y="96.0" width="2.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="399.5" y="112.0" text-anchor="end" class="value-text">-1.9%</text></g><g><title>h2: opset 19
+status=OK  verdict=—
+p50=6.59 ms  gain=-19.5%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="384.6" y="136.0" width="25.4" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="376.6" y="152.0" text-anchor="end" class="value-text">-19.5%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=5.36 ms  gain=+2.8%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="410.0" y="176.0" width="3.7" height="24" fill="#90a4ae" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="421.7" y="192.0" text-anchor="start" class="value-text">+2.8%</text></g><g><title>h4: opset 17 + conv fusions
+status=OK_HIGH_CV  verdict=—
+p50=6.51 ms  gain=-18.2%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text><rect x="386.3" y="216.0" width="23.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="378.3" y="232.0" text-anchor="end" class="value-text">-18.2%</text></g><g><title>h5: opset 21 + conv fusions
+status=OK_HIGH_CV  verdict=—
+p50=6.71 ms  gain=-21.8%</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text><rect x="381.7" y="256.0" width="28.3" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="373.7" y="272.0" text-anchor="end" class="value-text">-21.8%</text></g><g><title>h6: opset 21 + matmul_transpose_fusion
+status=OK_HIGH_CV  verdict=—
+p50=6.22 ms  gain=-12.8%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="393.3" y="296.0" width="16.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="385.3" y="312.0" text-anchor="end" class="value-text">-12.8%</text></g><g><title>h7: opset 21 + bias_softmax_fusion
+status=OK  verdict=—
+p50=6.43 ms  gain=-16.7%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 21 + bias_softmax_f…</text><rect x="388.3" y="336.0" width="21.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="380.3" y="352.0" text-anchor="end" class="value-text">-16.7%</text></g><g><title>h8: opset 21 + attention_fusion
+status=OK_HIGH_CV  verdict=—
+p50=5.75 ms  gain=-4.4%</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 21 + attention_fusi…</text><rect x="404.3" y="376.0" width="5.7" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="396.3" y="392.0" text-anchor="end" class="value-text">-4.4%</text></g><g><title>h9: opset 21 + highdimRTR_lowdimRTR
+status=OK_HIGH_CV  verdict=—
+p50=6.54 ms  gain=-18.8%</title><rect x="0" y="408.0" width="748" height="40" class="row-bg" /><text x="8" y="424.0" class="hyp-label">h9</text><text x="8" y="437.0" class="hyp-sub">opset 21 + highdimRTR_low…</text><rect x="385.6" y="416.0" width="24.4" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="377.6" y="432.0" text-anchor="end" class="value-text">-18.8%</text></g><g><title>h10: opset 17 + conv_add_fusion only
+status=OK_HIGH_CV  verdict=—
+p50=5.88 ms  gain=-6.7%</title><rect x="0" y="448.0" width="748" height="40" class="row-bg" /><text x="8" y="464.0" class="hyp-label">h10</text><text x="8" y="477.0" class="hyp-sub">opset 17 + conv_add_fusio…</text><rect x="401.3" y="456.0" width="8.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="393.3" y="472.0" text-anchor="end" class="value-text">-6.7%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.51 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[4.98 · 5.51 · 5.72]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.61 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.63 · 5.31 · 5.61]</span></td>
+          <td><span class="gain-neg">-1.9%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.68 · 6.59 · 5.29]</span></td>
+          <td><span class="gain-neg">-19.5%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h3</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.36 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.36 · 5.26 · 5.89]</span></td>
+          <td><span class="gain-pos">+2.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">conv_bn_fusion</span>, <span class="flag-pill">conv_add_fusion</span>, <span class="flag-pill">conv_activation_fusion</span></td>
+          <td class="p50-cell">6.51 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.43 · 6.84 · 6.51]</span></td>
+          <td><span class="gain-neg">-18.2%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">conv_bn_fusion</span>, <span class="flag-pill">conv_add_fusion</span>, <span class="flag-pill">conv_activation_fusion</span></td>
+          <td class="p50-cell">6.71 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.60 · 6.84 · 6.71]</span></td>
+          <td><span class="gain-neg">-21.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">6.22 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.76 · 6.26 · 6.22]</span></td>
+          <td><span class="gain-neg">-12.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 21 + bias_softmax_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">6.43 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.43 · 6.47 · 5.59]</span></td>
+          <td><span class="gain-neg">-16.7%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 21 + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">5.75 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.75 · 5.67 · 6.72]</span></td>
+          <td><span class="gain-neg">-4.4%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h9</span></td>
+          <td class="label-cell">opset 21 + highdimRTR_lowdimRTR</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">highdimRTR_lowdimRTR</span></td>
+          <td class="p50-cell">6.54 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.63 · 5.54 · 6.54]</span></td>
+          <td><span class="gain-neg">-18.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h10</span></td>
+          <td class="label-cell">opset 17 + conv_add_fusion only</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><span class="flag-pill">conv_add_fusion</span></td>
+          <td class="p50-cell">5.88 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.88 · 6.07 · 5.55]</span></td>
+          <td><span class="gain-neg">-6.7%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-19.5%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td class="gain-neg">-18.2%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td class="gain-neg">-21.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 21 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-12.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 21 + bias_softmax_fusion</td>
+              <td class="gain-neg">-16.7%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 21 + attention_fusion</td>
+              <td class="gain-neg">-4.4%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h9</span></td>
+              <td>opset 21 + highdimRTR_lowdimRTR</td>
+              <td class="gain-neg">-18.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h10</span></td>
+              <td>opset 17 + conv_add_fusion only</td>
+              <td class="gain-neg">-6.7%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-1.9%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-pos">+2.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json
new file mode 100644
index 000000000..8736e8048
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json
@@ -0,0 +1,274 @@
+{
+  "model_id": "apple/mobilevit-small",
+  "task": "image-classification",
+  "model_type": "mobilevit",
+  "timestamp": "2026-06-22T08:34:17",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "conv_pct": 2.5,
+  "npu006_risk": false,
+  "npu006_regression": false,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 5.05,
+        "cv": 0.0935,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          4.976,
+          5.51,
+          5.716
+        ],
+        "median_p50_ms": 5.51
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17,
+      "extra_optim": {}
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.844,
+        "cv": 0.3039,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.634,
+          5.307,
+          5.614
+        ],
+        "median_p50_ms": 5.614
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": {}
+    },
+    "h2": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 5.81,
+        "cv": 0.1203,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          6.678,
+          6.586,
+          5.293
+        ],
+        "median_p50_ms": 6.586
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19,
+      "extra_optim": {}
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.218,
+        "cv": 0.1631,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.355,
+          5.256,
+          5.895
+        ],
+        "median_p50_ms": 5.355
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21,
+      "extra_optim": {}
+    },
+    "h4": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 6.73,
+        "cv": 0.1811,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          6.43,
+          6.842,
+          6.515
+        ],
+        "median_p50_ms": 6.515
+      },
+      "accuracy": null,
+      "label": "opset 17 + conv fusions",
+      "opset": 17,
+      "extra_optim": {
+        "conv_bn_fusion": true,
+        "conv_add_fusion": true,
+        "conv_activation_fusion": true
+      },
+      "npu006_expected_regression": false
+    },
+    "h5": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 6.187,
+        "cv": 0.1526,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.604,
+          6.837,
+          6.711
+        ],
+        "median_p50_ms": 6.711
+      },
+      "accuracy": null,
+      "label": "opset 21 + conv fusions",
+      "opset": 21,
+      "extra_optim": {
+        "conv_bn_fusion": true,
+        "conv_add_fusion": true,
+        "conv_activation_fusion": true
+      },
+      "npu006_expected_regression": false
+    },
+    "h6": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.921,
+        "cv": 0.2292,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.762,
+          6.263,
+          6.218
+        ],
+        "median_p50_ms": 6.218
+      },
+      "accuracy": null,
+      "label": "opset 21 + matmul_transpose_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    "h7": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 4.618,
+        "cv": 0.0427,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          6.431,
+          6.47,
+          5.586
+        ],
+        "median_p50_ms": 6.431
+      },
+      "accuracy": null,
+      "label": "opset 21 + bias_softmax_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      }
+    },
+    "h8": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 6.451,
+        "cv": 0.4551,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.75,
+          5.675,
+          6.718
+        ],
+        "median_p50_ms": 5.75
+      },
+      "accuracy": null,
+      "label": "opset 21 + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "attention_fusion": true
+      }
+    },
+    "h9": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.72,
+        "cv": 0.1899,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          6.627,
+          5.535,
+          6.545
+        ],
+        "median_p50_ms": 6.545
+      },
+      "accuracy": null,
+      "label": "opset 21 + highdimRTR_lowdimRTR",
+      "opset": 21,
+      "extra_optim": {
+        "highdimRTR_lowdimRTR": true
+      }
+    },
+    "h10": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 6.726,
+        "cv": 0.1875,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.879,
+          6.067,
+          5.55
+        ],
+        "median_p50_ms": 5.879
+      },
+      "accuracy": null,
+      "label": "opset 17 + conv_add_fusion only",
+      "opset": 17,
+      "extra_optim": {
+        "conv_add_fusion": true
+      }
+    }
+  },
+  "best_hypothesis": "h3",
+  "baseline_p50_ms": 5.51,
+  "best_p50_ms": 5.355,
+  "best_gain_pct": 2.81,
+  "npu001_generalized": "neutral",
+  "npu001_ranges_non_overlapping": false,
+  "feature_gaps": [],
+  "errors": [],
+  "best_gain_noise_floor_pct": 14.14,
+  "best_gain_ranges_separated": false,
+  "best_gain_reliable": false,
+  "best_gain_verdict": "NEUTRAL_WITHIN_NOISE"
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html
new file mode 100644
index 000000000..d316dc973
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html
@@ -0,0 +1,412 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — deepset/roberta-base-squad2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — deepset/roberta-base-squad2</h1>
+  <div class="subtitle">roberta arch · 2026-06-13 · 6 hypotheses attempted (4 measured, 2 timed out)</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+1.5%</div>
+      <div class="kpi-sub">Champion: h1</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">14.94 ms → 14.72 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.23 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h1</div>
+      <div class="kpi-sub">opset 17 + autoconf defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">0 KEEP / 0 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>deepset/roberta-base-squad2</td></tr><tr><th>Task</th><td>question-answering</td></tr><tr><th>Arch type</th><td>roberta</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>neutral</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK  verdict=—
+p50=14.94 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK  verdict=—
+p50=14.72 ms  gain=+1.5%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="410.0" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="420.0" y="112.0" text-anchor="start" class="value-text">+1.5%</text></g><g><title>h2: opset 19
+status=OK_HIGH_CV  verdict=—
+p50=14.88 ms  gain=+0.4%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="410.0" y="136.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.6" y="152.0" text-anchor="start" class="value-text">+0.4%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK  verdict=—
+p50=14.92 ms  gain=+0.1%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="410.0" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.2" y="192.0" text-anchor="start" class="value-text">+0.1%</text></g><g><title>h4: opset 17 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">14.94 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[14.94 · 14.71 · 14.97]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h1</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">14.72 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[14.64 · 14.87 · 14.72]</span></td>
+          <td><span class="gain-pos">+1.5%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">14.88 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[14.95 · 14.88 · 14.83]</span></td>
+          <td><span class="gain-pos">+0.4%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">14.92 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[16.68 · 14.74 · 14.92]</span></td>
+          <td><span class="gain-pos">+0.1%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-pos">+1.5%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-pos">+0.4%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-pos">+0.1%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json
new file mode 100644
index 000000000..fa8a959f4
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json
@@ -0,0 +1,106 @@
+{
+  "model_id": "deepset/roberta-base-squad2",
+  "task": "question-answering",
+  "model_type": "roberta",
+  "timestamp": "2026-06-13T16:21:18",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 14.919,
+        "cv": 0.1188,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          14.941,
+          14.711,
+          14.97
+        ],
+        "median_p50_ms": 14.941
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 14.747,
+        "cv": 0.1286,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          14.645,
+          14.873,
+          14.716
+        ],
+        "median_p50_ms": 14.716
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 15.309,
+        "cv": 0.2344,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          14.951,
+          14.877,
+          14.834
+        ],
+        "median_p50_ms": 14.877
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19
+    },
+    "h3": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 14.798,
+        "cv": 0.1159,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          16.685,
+          14.743,
+          14.919
+        ],
+        "median_p50_ms": 14.919
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "TIMEOUT",
+      "label": "opset 17 + conv fusions"
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h1",
+  "baseline_p50_ms": 14.941,
+  "best_p50_ms": 14.716,
+  "best_gain_pct": 1.51,
+  "npu001_generalized": "neutral",
+  "feature_gaps": [],
+  "errors": [
+    "Model timed out at 1466s (before h4)",
+    "Model timed out at 1466s (before h5)"
+  ]
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html
new file mode 100644
index 000000000..9566543c7
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html
@@ -0,0 +1,412 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — distilbert/distilbert-base-uncased-finetuned-sst-2-english</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — distilbert/distilbert-base-uncased-finetuned-sst-2-english</h1>
+  <div class="subtitle">distilbert arch · 2026-06-13 · 6 hypotheses attempted (5 measured, 1 timed out)</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+0.0%</div>
+      <div class="kpi-sub">Champion: h2</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">19.48 ms → 19.48 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.00 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h2</div>
+      <div class="kpi-sub">opset 19 + autoconf defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">0 KEEP / 0 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>distilbert/distilbert-base-uncased-finetuned-sst-2-english</td></tr><tr><th>Task</th><td>text-classification</td></tr><tr><th>Arch type</th><td>distilbert</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>neutral</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK_HIGH_CV  verdict=—
+p50=19.48 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=19.50 ms  gain=-0.1%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="409.9" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.9" y="112.0" text-anchor="end" class="value-text">-0.1%</text></g><g><title>h2: opset 19
+status=OK_HIGH_CV  verdict=—
+p50=19.48 ms  gain=+0.0%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="410.0" y="136.0" width="2.0" height="24" fill="#90a4ae" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="418.0" y="152.0" text-anchor="start" class="value-text">+0.0%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=19.50 ms  gain=-0.1%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="409.8" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.8" y="192.0" text-anchor="end" class="value-text">-0.1%</text></g><g><title>h4: opset 17 + conv fusions
+status=OK_HIGH_CV  verdict=—
+p50=19.59 ms  gain=-0.6%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text><rect x="409.3" y="216.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.3" y="232.0" text-anchor="end" class="value-text">-0.6%</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">19.48 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[19.51 · 19.46 · 19.48]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">19.50 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[19.50 · 19.42 · 19.52]</span></td>
+          <td><span class="gain-neg">-0.1%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h2</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">19.48 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[19.47 · 19.68 · 19.48]</span></td>
+          <td><span class="gain-pos">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">19.50 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[19.59 · 19.45 · 19.50]</span></td>
+          <td><span class="gain-neg">-0.1%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">19.59 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[19.59 · 19.63 · 19.50]</span></td>
+          <td><span class="gain-neg">-0.6%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">no data (timed out)</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latency of each individual bench session; the median across sessions is used for comparison
+      </div>
+    </section>
+
+
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-0.1%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-neg">-0.1%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td class="gain-neg">-0.6%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>no data (timed out)</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json
new file mode 100644
index 000000000..9d10a6736
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json
@@ -0,0 +1,124 @@
+{
+  "model_id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+  "task": "text-classification",
+  "model_type": "distilbert",
+  "timestamp": "2026-06-13T15:34:52",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.511,
+        "cv": 0.156,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.512,
+          19.459,
+          19.48
+        ],
+        "median_p50_ms": 19.48
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.721,
+        "cv": 0.2715,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.498,
+          19.417,
+          19.519
+        ],
+        "median_p50_ms": 19.498
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.431,
+        "cv": 0.1945,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.471,
+          19.684,
+          19.477
+        ],
+        "median_p50_ms": 19.477
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.443,
+        "cv": 0.2903,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.591,
+          19.447,
+          19.505
+        ],
+        "median_p50_ms": 19.505
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.404,
+        "cv": 0.237,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.588,
+          19.628,
+          19.502
+        ],
+        "median_p50_ms": 19.588
+      },
+      "accuracy": null,
+      "label": "opset 17 + conv fusions",
+      "opset": 17
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h2",
+  "baseline_p50_ms": 19.48,
+  "best_p50_ms": 19.477,
+  "best_gain_pct": 0.02,
+  "npu001_generalized": "neutral",
+  "feature_gaps": [],
+  "errors": [
+    "Model timed out at 1385s (before h5)"
+  ]
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json b/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json
new file mode 100644
index 000000000..b8c34f0d3
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json
@@ -0,0 +1,92 @@
+{
+  "model_id": "facebook/dino-vitb16",
+  "task": "image-feature-extraction",
+  "model_type": "vit",
+  "timestamp": "2026-06-16T18:19:46",
+  "ep": "qnn",
+  "device": "npu",
+  "validation_sweep": true,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 20.367,
+        "cv": 0.2452,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          20.037,
+          20.009,
+          20.048
+        ],
+        "median_p50_ms": 20.037
+      },
+      "label": "baseline (auto-config, W8A16)",
+      "opset": "auto"
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 20.027,
+        "cv": 0.4804,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          19.924,
+          19.975,
+          19.897
+        ],
+        "median_p50_ms": 19.924
+      },
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 20.369,
+        "cv": 0.9085,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          20.197,
+          20.071,
+          19.988
+        ],
+        "median_p50_ms": 20.071
+      },
+      "label": "opset 21 (tests npu-001)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 19.871,
+        "cv": 0.3492,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          20.123,
+          20.037,
+          20.413
+        ],
+        "median_p50_ms": 20.123
+      },
+      "label": "opset 17 + conv fusions",
+      "opset": 17
+    }
+  },
+  "errors": [],
+  "npu001_opset21_vs_17_gain_pct": -0.7,
+  "npu001_note": "opset21 median 20.071ms vs opset17 19.924ms = -0.7%",
+  "npu006_conv_fusion_regression_pct": 1.0,
+  "npu006_note": "conv fusions median 20.123ms vs no-fusion 19.924ms = +1.0%"
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json
new file mode 100644
index 000000000..416ddce95
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json
@@ -0,0 +1,92 @@
+{
+  "model_id": "facebook/dinov2-base",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-16T16:12:15",
+  "ep": "qnn",
+  "device": "npu",
+  "validation_sweep": true,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 41.108,
+        "cv": 1.2524,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          38.991,
+          38.68,
+          36.256
+        ],
+        "median_p50_ms": 38.68
+      },
+      "label": "baseline (auto-config, W8A16)",
+      "opset": "auto"
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 36.348,
+        "cv": 0.7429,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          34.556,
+          34.668,
+          33.148
+        ],
+        "median_p50_ms": 34.556
+      },
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 32.742,
+        "cv": 0.8357,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          33.001,
+          26.224,
+          26.227
+        ],
+        "median_p50_ms": 26.227
+      },
+      "label": "opset 21 (tests npu-001)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 25.83,
+        "cv": 0.1082,
+        "stable": true,
+        "note": null
+      },
+      "full": {
+        "p50s_ms": [
+          26.064,
+          25.921,
+          25.872
+        ],
+        "median_p50_ms": 25.921
+      },
+      "label": "opset 17 + conv fusions",
+      "opset": 17
+    }
+  },
+  "errors": [],
+  "npu001_opset21_vs_17_gain_pct": 24.1,
+  "npu001_note": "opset21 median 26.227ms vs opset17 34.556ms = +24.1%",
+  "npu006_conv_fusion_regression_pct": -25.0,
+  "npu006_note": "conv fusions median 25.921ms vs no-fusion 34.556ms = -25.0%"
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html
new file mode 100644
index 000000000..432deb35e
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html
@@ -0,0 +1,448 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — facebook/dinov2-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — facebook/dinov2-small</h1>
+  <div class="subtitle">dinov2 arch · 2026-06-13 · 6 hypotheses tested (2 timed out)</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+24.1%</div>
+      <div class="kpi-sub">Champion: h3</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">6.56 ms → 4.98 ms</div>
+      <div class="kpi-sub">Latency reduction: 1.58 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h3</div>
+      <div class="kpi-sub">opset 21 + autoconf defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">1 KEEP / 2 DISCARD / 2 TIMEOUT</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>facebook/dinov2-small</td></tr><tr><th>Task</th><td>image-feature-extraction</td></tr><tr><th>Arch type</th><td>dinov2</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>True</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK_HIGH_CV  verdict=—
+p50=6.56 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=7.18 ms  gain=-9.4%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="397.8" y="96.0" width="12.2" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="389.8" y="112.0" text-anchor="end" class="value-text">-9.4%</text></g><g><title>h2: opset 19
+status=OK_HIGH_CV  verdict=—
+p50=7.19 ms  gain=-9.6%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="397.5" y="136.0" width="12.5" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="389.5" y="152.0" text-anchor="end" class="value-text">-9.6%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=4.98 ms  gain=+24.1%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="410.0" y="176.0" width="31.4" height="24" fill="#43a047" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="449.4" y="192.0" text-anchor="start" class="value-text">+24.1%</text></g><g><title>h4: opset 17 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">6.56 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.56 · 6.35 · 12.41]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">7.18 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[7.18 · 6.39 · 9.44]</span></td>
+          <td><span class="gain-neg">-9.4%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">7.19 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[8.45 · 7.19 · 6.19]</span></td>
+          <td><span class="gain-neg">-9.6%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h3</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">4.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[4.98 · 4.88 · 6.88]</span></td>
+          <td><span class="gain-pos">+24.1%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">no data (timed out)</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">not run (timed out)</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions (their median is used for comparison)
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">✅ Effective Optimizations</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-pos">+24.1%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-9.4%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-9.6%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>not run (timed out)</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>not run (timed out)</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json
new file mode 100644
index 000000000..521b465de
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json
@@ -0,0 +1,109 @@
+{
+  "model_id": "facebook/dinov2-small",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-13T14:49:59",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 7.213,
+        "cv": 0.3437,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          6.561,
+          6.353,
+          12.408
+        ],
+        "median_p50_ms": 6.561
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 4.897,
+        "cv": 0.4572,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          7.176,
+          6.392,
+          9.436
+        ],
+        "median_p50_ms": 7.176
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 6.953,
+        "cv": 1.8047,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          8.454,
+          7.191,
+          6.194
+        ],
+        "median_p50_ms": 7.191
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 9.432,
+        "cv": 0.936,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          4.977,
+          4.876,
+          6.884
+        ],
+        "median_p50_ms": 4.977
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "TIMEOUT",
+      "label": "opset 17 + conv fusions"
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h3",
+  "baseline_p50_ms": 6.561,
+  "best_p50_ms": 4.977,
+  "best_gain_pct": 24.14,
+  "npu001_generalized": true,
+  "feature_gaps": [],
+  "errors": [
+    "Model timed out at 1333s (before h4)",
+    "Model timed out at 1333s (before h5)"
+  ]
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html
new file mode 100644
index 000000000..a66c1b47d
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html
@@ -0,0 +1,430 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — google/vit-base-patch16-224</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — google/vit-base-patch16-224</h1>
+  <div class="subtitle">vit arch · 2026-06-13 · 6 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+0.0%</div>
+      <div class="kpi-sub">Champion: h0</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">9.04 ms → 9.04 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.00 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h0</div>
+      <div class="kpi-sub">opset 17 + auto-config defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">0 KEEP / 3 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>google/vit-base-patch16-224</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>vit</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>False</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK_HIGH_CV  verdict=—
+p50=9.04 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=9.33 ms  gain=-3.2%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="405.8" y="96.0" width="4.2" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="397.8" y="112.0" text-anchor="end" class="value-text">-3.2%</text></g><g><title>h2: opset 19
+status=BUILD_FAIL  verdict=—
+p50=—  gain=—</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="364.0" y="136.0" width="92" height="24" fill="url(#buildFailPattern)" stroke="#78909c" stroke-width="1.5" rx="4" /><text x="410.0" y="152.0" text-anchor="middle" class="build-fail-text">BUILD_FAIL</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=10.02 ms  gain=-10.8%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="395.9" y="176.0" width="14.1" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="387.9" y="192.0" text-anchor="end" class="value-text">-10.8%</text></g><g><title>h4: opset 17 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h0</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">9.04 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[9.04 · 8.60 · 9.78]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">9.33 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[9.33 · 12.72 · 9.06]</span></td>
+          <td><span class="gain-neg">-3.2%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">BUILD_FAIL</td>
+          <td class="sessions-cell"><span style="color:#c62828;font-weight:700">BUILD_FAIL</span></td>
+          <td>—</td>
+          <td><span class="verdict-discard">BUILD_FAIL</span></td>
+          <td class="conf-cell">build failed</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">10.02 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[15.27 · 10.02 · 7.81]</span></td>
+          <td><span class="gain-neg">-10.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">not run (timed out)</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">not run (timed out)</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions (their median is used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-3.2%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td>—</td>
+              <td>BUILD_FAIL</td>
+              <td>build failed</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-neg">-10.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>not run (timed out)</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>not run (timed out)</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json
new file mode 100644
index 000000000..42edb241b
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json
@@ -0,0 +1,96 @@
+{
+  "model_id": "google/vit-base-patch16-224",
+  "task": "image-classification",
+  "model_type": "vit",
+  "timestamp": "2026-06-13T14:05:37",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 9.245,
+        "cv": 1.2887,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          9.039,
+          8.6,
+          9.779
+        ],
+        "median_p50_ms": 9.039
+      },
+      "accuracy": 0.74,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 9.656,
+        "cv": 0.7434,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          9.33,
+          12.723,
+          9.064
+        ],
+        "median_p50_ms": 9.33
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "BUILD_FAIL",
+      "label": "opset 19",
+      "opset": 19,
+      "build_error": "[log tail truncated; signed-URL query string (policy, Signature, Key-Pair-Id) redacted]"
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 11.564,
+        "cv": 2.1585,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          15.271,
+          10.019,
+          7.808
+        ],
+        "median_p50_ms": 10.019
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "TIMEOUT",
+      "label": "opset 17 + conv fusions"
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h0",
+  "baseline_p50_ms": 9.039,
+  "best_p50_ms": 9.039,
+  "best_gain_pct": 0.0,
+  "npu001_generalized": false,
+  "feature_gaps": [],
+  "errors": [
+    "h2: BUILD_FAIL",
+    "Model timed out at 1204s (before h4)",
+    "Model timed out at 1204s (before h5)"
+  ]
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json
new file mode 100644
index 000000000..3e73b6c4f
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json
@@ -0,0 +1,62 @@
+{
+  "export": {
+    "opset_version": 17,
+    "batch_size": 1,
+    "export_params": true,
+    "do_constant_folding": true,
+    "verbose": false,
+    "dynamo": false,
+    "enable_hierarchy_tags": true,
+    "clean_onnx": false,
+    "hierarchy_tag_format": "full",
+    "input_tensors": [
+      {
+        "name": "pixel_values",
+        "dtype": "float32",
+        "shape": [
+          1,
+          3,
+          512,
+          864
+        ],
+        "value_range": [
+          0,
+          1
+        ]
+      }
+    ],
+    "output_tensors": [
+      {
+        "name": "logits"
+      },
+      {
+        "name": "pred_boxes"
+      }
+    ]
+  },
+  "optim": {},
+  "quant": {
+    "mode": "qdq",
+    "samples": 10,
+    "calibration_method": "minmax",
+    "weight_type": "uint8",
+    "activation_type": "uint16",
+    "per_channel": false,
+    "symmetric": false,
+    "save_calibration": false,
+    "distribution": "uniform",
+    "seed": null,
+    "calibration_load_path": null,
+    "calibration_save_path": null,
+    "op_types_to_quantize": null,
+    "nodes_to_exclude": null,
+    "task": "object-detection",
+    "model_name": "hustvl/yolos-small"
+  },
+  "compile": null,
+  "loader": {
+    "task": "object-detection",
+    "model_class": "AutoModelForObjectDetection",
+    "model_type": "yolos"
+  }
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html
new file mode 100644
index 000000000..c9422c1ad
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html
@@ -0,0 +1,493 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — hustvl/yolos-small</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — hustvl/yolos-small</h1>
+  <div class="subtitle">auto arch · 2026-06-22 · 9 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card good">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+2.0%</div>
+      <div class="kpi-sub">Champion: h3 · ⚠ neutral within noise — ship baseline</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">49.60 ms → 48.60 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.99 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h0 (baseline)</div>
+      <div class="kpi-sub">⚠ neutral within noise — ship baseline</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">9</div>
+      <div class="kpi-sub">0 KEEP / 2 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>hustvl/yolos-small</td></tr><tr><th>Task</th><td>object-detection</td></tr><tr><th>Arch type</th><td>auto</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>Conv%</th><td>0.1%</td></tr><tr><th>npu-006 risk</th><td>LOW</td></tr><tr><th>npu-001 note</th><td>N/A (high-CV opset17 reference)</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 434" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="408" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="408" class="tick-line" /><text x="150.0" y="428" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="408" class="tick-line" /><text x="280.0" y="428" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="408" class="tick-line" /><text x="410.0" y="428" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="408" class="tick-line" /><text x="540.0" y="428" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="408" class="tick-line" /><text x="670.0" y="428" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK  verdict=—
+p50=49.60 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=65.89 ms  gain=-32.8%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="367.3" y="96.0" width="42.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="359.3" y="112.0" text-anchor="end" class="value-text">-32.8%</text></g><g><title>h2: opset 19
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK  verdict=—
+p50=48.60 ms  gain=+2.0%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="410.0" y="176.0" width="2.6" height="24" fill="#90a4ae" stroke="#1e88e5" stroke-width="4" rx="4" /><text x="420.6" y="192.0" text-anchor="start" class="value-text">+2.0%</text></g><g><title>h4: opset 17 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g><g><title>h6: opset 21 + matmul_transpose_fusion
+status=OK  verdict=—
+p50=49.96 ms  gain=-0.7%</title><rect x="0" y="288.0" width="748" height="40" class="row-bg" /><text x="8" y="304.0" class="hyp-label">h6</text><text x="8" y="317.0" class="hyp-sub">opset 21 + matmul_transpo…</text><rect x="409.1" y="296.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.1" y="312.0" text-anchor="end" class="value-text">-0.7%</text></g><g><title>h7: opset 21 + bias_softmax_fusion
+status=OK  verdict=—
+p50=51.63 ms  gain=-4.1%</title><rect x="0" y="328.0" width="748" height="40" class="row-bg" /><text x="8" y="344.0" class="hyp-label">h7</text><text x="8" y="357.0" class="hyp-sub">opset 21 + bias_softmax_f…</text><rect x="404.7" y="336.0" width="5.3" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="396.7" y="352.0" text-anchor="end" class="value-text">-4.1%</text></g><g><title>h8: opset 21 + attention_fusion
+status=OK  verdict=—
+p50=49.53 ms  gain=+0.1%</title><rect x="0" y="368.0" width="748" height="40" class="row-bg" /><text x="8" y="384.0" class="hyp-label">h8</text><text x="8" y="397.0" class="hyp-sub">opset 21 + attention_fusi…</text><rect x="410.0" y="376.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="418.2" y="392.0" text-anchor="start" class="value-text">+0.1%</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="">
+          <td><span class="hyp-pill">h0</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">49.60 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.72 · 49.32 · 49.60]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">65.89 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[67.96 · 65.68 · 65.89]</span></td>
+          <td><span class="gain-neg">-32.8%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h3</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">48.60 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[48.74 · 48.60 · 48.60]</span></td>
+          <td><span class="gain-pos">+2.0%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h6</span></td>
+          <td class="label-cell">opset 21 + matmul_transpose_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">matmul_transpose_fusion</span></td>
+          <td class="p50-cell">49.96 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[50.15 · 49.57 · 49.96]</span></td>
+          <td><span class="gain-neg">-0.7%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h7</span></td>
+          <td class="label-cell">opset 21 + bias_softmax_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">bias_softmax_fusion</span></td>
+          <td class="p50-cell">51.63 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[49.06 · 51.63 · 51.90]</span></td>
+          <td><span class="gain-neg">-4.1%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h8</span></td>
+          <td class="label-cell">opset 21 + attention_fusion</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><span class="flag-pill">attention_fusion</span></td>
+          <td class="p50-cell">49.53 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[51.23 · 49.20 · 49.53]</span></td>
+          <td><span class="gain-pos">+0.1%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-32.8%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h7</span></td>
+              <td>opset 21 + bias_softmax_fusion</td>
+              <td class="gain-neg">-4.1%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td>+0.0%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-pos">+2.0%</td>
+              <td>OK</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h6</span></td>
+              <td>opset 21 + matmul_transpose_fusion</td>
+              <td class="gain-neg">-0.7%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h8</span></td>
+              <td>opset 21 + attention_fusion</td>
+              <td class="gain-pos">+0.1%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json
new file mode 100644
index 000000000..2af500c4e
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json
@@ -0,0 +1,183 @@
+{
+  "model_id": "hustvl/yolos-small",
+  "task": "object-detection",
+  "model_type": "auto",
+  "timestamp": "2026-06-22T12:06:44",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 48.663,
+        "cv": 0.0666,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          50.715,
+          49.324,
+          49.598
+        ],
+        "median_p50_ms": 49.598
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17,
+      "extra_optim": {}
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 66.447,
+        "cv": 0.2261,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          67.96,
+          65.68,
+          65.888
+        ],
+        "median_p50_ms": 65.888
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "extra_optim": {}
+    },
+    "h2": {
+      "status": "TIMEOUT",
+      "label": "opset 19"
+    },
+    "h3": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 48.787,
+        "cv": 0.0503,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          48.741,
+          48.599,
+          48.605
+        ],
+        "median_p50_ms": 48.605
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21,
+      "extra_optim": {},
+      "paired_ab": {
+        "gains_pct": [
+          -0.39,
+          -1.85,
+          -0.05,
+          -0.42,
+          -0.16,
+          -0.31,
+          -0.73,
+          -0.22
+        ],
+        "mean_gain_pct": -0.52,
+        "ci_half_95": 0.4,
+        "n_pairs": 8,
+        "verdict": "MARGINAL"
+      }
+    },
+    "h4": {
+      "status": "TIMEOUT",
+      "label": "opset 17 + conv fusions"
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    },
+    "h6": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 49.012,
+        "cv": 0.048,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          50.151,
+          49.574,
+          49.956
+        ],
+        "median_p50_ms": 49.956
+      },
+      "accuracy": null,
+      "label": "opset 21 + matmul_transpose_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    "h7": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 49.042,
+        "cv": 0.0618,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          49.06,
+          51.631,
+          51.895
+        ],
+        "median_p50_ms": 51.631
+      },
+      "accuracy": null,
+      "label": "opset 21 + bias_softmax_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "bias_softmax_fusion": true
+      }
+    },
+    "h8": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 51.292,
+        "cv": 0.078,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          51.226,
+          49.202,
+          49.531
+        ],
+        "median_p50_ms": 49.531
+      },
+      "accuracy": null,
+      "label": "opset 21 + attention_fusion",
+      "opset": 21,
+      "extra_optim": {
+        "attention_fusion": true
+      }
+    }
+  },
+  "best_hypothesis": "h3",
+  "baseline_p50_ms": 49.598,
+  "best_p50_ms": 48.605,
+  "best_gain_pct": 2.0,
+  "npu001_generalized": "N/A (high-CV opset17 reference)",
+  "feature_gaps": [],
+  "errors": [
+    "h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos is 0.1% conv / 99.9% transformer, so conv-fusion and intermediate-opset hypotheses are low expected-value)."
+  ],
+  "conv_pct": 0.1,
+  "npu006_risk": false,
+  "npu006_regression": false,
+  "best_gain_reliable": false,
+  "best_gain_verdict": "NEUTRAL_WITHIN_NOISE",
+  "best_gain_noise_floor_pct": 2.95,
+  "best_gain_ranges_separated": true,
+  "npu001_ranges_non_overlapping": true
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json b/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json
new file mode 100644
index 000000000..20cf14836
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json
@@ -0,0 +1,71 @@
+{
+  "model_id": "microsoft/rad-dino",
+  "task": "image-feature-extraction",
+  "model_type": "dinov2",
+  "timestamp": "2026-06-16T16:43:10",
+  "ep": "qnn",
+  "device": "npu",
+  "validation_sweep": true,
+  "hypotheses": {
+    "h0": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 274.506,
+        "cv": 0.0134,
+        "stable": true,
+        "note": null
+      },
+      "full": {
+        "p50s_ms": [
+          274.727,
+          274.621,
+          274.949
+        ],
+        "median_p50_ms": 274.727
+      },
+      "label": "baseline (auto-config, W8A16)",
+      "opset": "auto"
+    },
+    "h1": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 274.204,
+        "cv": 0.0088,
+        "stable": true,
+        "note": null
+      },
+      "full": {
+        "p50s_ms": [
+          274.979,
+          274.557,
+          275.099
+        ],
+        "median_p50_ms": 274.979
+      },
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h3": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 275.269,
+        "cv": 0.0222,
+        "stable": true,
+        "note": null
+      },
+      "full": {
+        "p50s_ms": [
+          275.298,
+          275.355,
+          275.564
+        ],
+        "median_p50_ms": 275.355
+      },
+      "label": "opset 21 (tests npu-001)",
+      "opset": 21
+    }
+  },
+  "errors": [],
+  "npu001_opset21_vs_17_gain_pct": -0.1,
+  "npu001_note": "opset21 median 275.355ms vs opset17 274.979ms = -0.1%"
+}
diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html
new file mode 100644
index 000000000..8a6e36f71
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html
@@ -0,0 +1,430 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — microsoft/resnet-18</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — microsoft/resnet-18</h1>
+  <div class="subtitle">resnet arch · 2026-06-13 · 6 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+0.0%</div>
+      <div class="kpi-sub">Champion: h0</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">0.96 ms → 0.96 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.00 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h0</div>
+      <div class="kpi-sub">opset 17 + auto-config defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">0 KEEP / 4 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>microsoft/resnet-18</td></tr><tr><th>Task</th><td>image-classification</td></tr><tr><th>Arch type</th><td>resnet</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>True</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK_HIGH_CV  verdict=—
+p50=0.96 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=2.72 ms  gain=-181.7%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="173.7" y="96.0" width="236.3" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="165.7" y="112.0" text-anchor="end" class="value-text">-181.7%</text></g><g><title>h2: opset 19
+status=OK_HIGH_CV  verdict=—
+p50=1.15 ms  gain=-19.0%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="385.3" y="136.0" width="24.7" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="377.3" y="152.0" text-anchor="end" class="value-text">-19.0%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=2.17 ms  gain=-125.6%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="246.7" y="176.0" width="163.3" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="238.7" y="192.0" text-anchor="end" class="value-text">-125.6%</text></g><g><title>h4: opset 17 + conv fusions
+status=OK_HIGH_CV  verdict=—
+p50=132.30 ms  gain=-13624.1%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text><rect x="150.0" y="216.0" width="260.0" height="24" fill="#e53935" stroke="none" stroke-width="0" rx="4" /><text x="142.0" y="232.0" text-anchor="end" class="value-text">-13624.1%</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Status</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h0</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">0.96 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1.31 · 0.95 · 0.96]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">2.72 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[0.99 · 4.00 · 2.72]</span></td>
+          <td><span class="gain-neg">-181.7%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">1.15 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1.15 · 1.11 · 1.95]</span></td>
+          <td><span class="gain-neg">-19.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">2.17 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[1.05 · 2.17 · 4.11]</span></td>
+          <td><span class="gain-neg">-125.6%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">132.30 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[132.30 · 134.97 · 130.67]</span></td>
+          <td><span class="gain-neg">-13624.1%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are the p50 latencies of individual bench sessions; their median is the value used for comparison
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Status</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-181.7%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-19.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-neg">-125.6%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td class="gain-neg">-13624.1%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges separated</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Status</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td>—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json
new file mode 100644
index 000000000..555428793
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json
@@ -0,0 +1,124 @@
+{
+  "model_id": "microsoft/resnet-18",
+  "task": "image-classification",
+  "model_type": "resnet",
+  "timestamp": "2026-06-13T13:38:52",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 4.031,
+        "cv": 1.6902,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          1.311,
+          0.952,
+          0.964
+        ],
+        "median_p50_ms": 0.964
+      },
+      "accuracy": 0.66,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 3.111,
+        "cv": 2.0363,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          0.99,
+          4.003,
+          2.716
+        ],
+        "median_p50_ms": 2.716
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 3.992,
+        "cv": 1.5168,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          1.147,
+          1.114,
+          1.947
+        ],
+        "median_p50_ms": 1.147
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 2.968,
+        "cv": 1.1762,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          1.054,
+          2.175,
+          4.107
+        ],
+        "median_p50_ms": 2.175
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 128.104,
+        "cv": 1.4049,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          132.3,
+          134.97,
+          130.669
+        ],
+        "median_p50_ms": 132.3
+      },
+      "accuracy": null,
+      "label": "opset 17 + conv fusions",
+      "opset": 17
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h0",
+  "baseline_p50_ms": 0.964,
+  "best_p50_ms": 0.964,
+  "best_gain_pct": 0.0,
+  "npu001_generalized": true,
+  "feature_gaps": [],
+  "errors": [
+    "Model timed out at 1560s (before h5)"
+  ]
+}
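Reviewer note (commentary, not part of the patch): the Gain % and Confidence columns in the resnet-18 report appear to be reproducible from the `results.json` above. The sketch below is a minimal check under two assumptions: gain is the relative change in median session p50 against the h0 baseline, and the confidence label is an interval-overlap test on the min/max session p50s. The helper names `gain_pct` and `confidence` are hypothetical and are not taken from `gen_model_report.py`.

```python
from statistics import median

# Session p50 latencies (ms) copied from the resnet-18 results.json above.
SESSIONS = {
    "h0": [1.311, 0.952, 0.964],
    "h1": [0.99, 4.003, 2.716],
    "h2": [1.147, 1.114, 1.947],
    "h3": [1.054, 2.175, 4.107],
    "h4": [132.3, 134.97, 130.669],
}


def gain_pct(baseline_ms, candidate_ms):
    """Relative latency gain in percent; negative means the candidate is slower."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


def confidence(baseline, candidate):
    """Assumed rule: label by whether the [min, max] session ranges intersect."""
    overlap = min(candidate) <= max(baseline) and min(baseline) <= max(candidate)
    return "ranges overlap" if overlap else "ranges separated"


baseline = SESSIONS["h0"]
rows = {
    hyp: (round(gain_pct(median(baseline), median(p50s)), 1), confidence(baseline, p50s))
    for hyp, p50s in SESSIONS.items()
}
for hyp, (gain, conf) in rows.items():
    print(f"{hyp}: gain={gain:+.1f}%  {conf}")
```

Under these assumptions the computed rows agree with the report for h0 through h4. Note that h1 to h3 all overlap the baseline range, so their large negative gains are within session-to-session noise rather than clear regressions.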
diff --git a/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json b/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json
new file mode 100644
index 000000000..ad2ca7a54
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json
@@ -0,0 +1,31 @@
+{
+  "model_id": "rizvandwiki/gender-classification",
+  "task": "image-classification",
+  "hypotheses": {
+    "h0": {
+      "description": "opset17 no opts",
+      "model_file": "quantized.onnx",
+      "screen_p50_ms": 29.602,
+      "screen_cv": 0.5068,
+      "full_p50s_ms": [
+        14.151,
+        14.942,
+        13.889
+      ],
+      "avg_p50_ms": 14.327
+    },
+    "h3": {
+      "description": "opset21 no opts",
+      "model_file": "quantized.onnx",
+      "screen_p50_ms": 15.056,
+      "screen_cv": 0.579,
+      "full_p50s_ms": [
+        13.698,
+        13.921,
+        13.868
+      ],
+      "avg_p50_ms": 13.829
+    }
+  },
+  "opset21_gain_pct": 3.48
+}
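Reviewer note (commentary, not part of the patch): unlike the resnet-18 `results.json`, which stores `median_p50_ms`, this file stores `avg_p50_ms`. The sketch below assumes `avg_p50_ms` is the arithmetic mean of `full_p50s_ms` rounded to three decimals and that `opset21_gain_pct` is the relative reduction of the h3 mean against the h0 mean; neither rule is confirmed by the generating script.

```python
from statistics import mean

# Session p50 latencies (ms) copied from results_new.json above.
h0_p50s = [14.151, 14.942, 13.889]  # opset17 no opts
h3_p50s = [13.698, 13.921, 13.868]  # opset21 no opts

# Assumed: avg_p50_ms is the mean of the sessions, rounded to 3 decimals.
h0_avg = round(mean(h0_p50s), 3)
h3_avg = round(mean(h3_p50s), 3)

# Assumed: gain is the relative latency reduction of h3 versus h0, in percent.
opset21_gain_pct = round((h0_avg - h3_avg) / h0_avg * 100.0, 2)

print(h0_avg, h3_avg, opset21_gain_pct)
```

With three sessions per hypothesis the mean and median can differ noticeably when one session is an outlier, so figures from this file are not directly comparable to the median-based figures in the per-model reports.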
diff --git a/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
new file mode 100644
index 000000000..edf2604a2
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html
@@ -0,0 +1,430 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>QNN NPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</title>
+  <style>
+    * { box-sizing: border-box; margin: 0; padding: 0; }
+    body {
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }
+    h1 { font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }
+    .subtitle { color: #5f6c80; font-size: 12px; margin-bottom: 24px; }
+    .section-card {
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }
+    .kpi-grid {
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }
+    .kpi-card {
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }
+    .kpi-label {
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }
+    .kpi-value {
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }
+    .kpi-card.good .kpi-value { color: #2e7d32; }
+    .kpi-card.bad .kpi-value { color: #c62828; }
+    .kpi-sub { color: #6b7c93; font-size: 11px; }
+    .section-title {
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }
+    .characteristics-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .characteristics-table th,
+    .characteristics-table td {
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }
+    .characteristics-table th {
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table {
+      width: 100%;
+      border-collapse: collapse;
+    }
+    .report-table th {
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }
+    .report-table td {
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }
+    .report-table tr:hover td { background: #f8fbff; }
+    .champion-row td { background: #e8f1fd; }
+    .hyp-pill {
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }
+    .gain-pos { color: #2e7d32; font-weight: 700; }
+    .gain-neg { color: #c62828; font-weight: 700; }
+    .chart-wrap {
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }
+    .chart-svg { width: 100%; min-width: 760px; display: block; }
+    .axis-label { fill: #486581; font-size: 11px; font-weight: 700; }
+    .tick-label { fill: #7b8794; font-size: 10px; }
+    .tick-line { stroke: #d9e2ec; stroke-width: 1; }
+    .center-line { stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }
+    .row-bg { fill: transparent; }
+    .hyp-label { fill: #102a43; font-size: 12px; font-weight: 800; }
+    .hyp-sub { fill: #7b8794; font-size: 10px; }
+    .baseline-bar { stroke: #546e7a; stroke-width: 3; }
+    .value-text { fill: #102a43; font-size: 11px; font-weight: 700; }
+    .build-fail-text { fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }
+    .gap-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }
+    .gap-card {
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }
+    .flag-pill {
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }
+    .runs-val {
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }
+    .hyp-detail-table .label-cell { font-size: 11.5px; max-width: 220px; }
+    .hyp-detail-table .opset-cell { text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }
+    .hyp-detail-table .flags-cell { min-width: 140px; }
+    .hyp-detail-table .p50-cell { font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }
+    .hyp-detail-table .sessions-cell { min-width: 160px; }
+    .hyp-detail-table .conf-cell { font-size: 11px; color: #546e7a; }
+    .verdict-keep { color: #2e7d32; font-weight: 700; }
+    .verdict-discard { color: #c62828; font-weight: 700; }
+    .footer {
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }
+    @media (max-width: 1200px) {
+      .kpi-grid { grid-template-columns: repeat(2, minmax(0, 1fr)); }
+    }
+    @media (max-width: 720px) {
+      .kpi-grid { grid-template-columns: 1fr; }
+      body { padding: 18px 14px 28px; }
+    }
+  </style>
+</head>
+<body>
+  <h1>QNN NPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2</h1>
+  <div class="subtitle">bert arch · 2026-06-13 · 6 hypotheses tested</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card neutral">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">+0.0%</div>
+      <div class="kpi-sub">Champion: h0</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">5.81 ms → 5.81 ms</div>
+      <div class="kpi-sub">Latency reduction: 0.00 ms</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">QNN / NPU</div>
+      <div class="kpi-sub">Baseline opset 17</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">h0</div>
+      <div class="kpi-sub">opset 17 + auto-config defaults</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">6</div>
+      <div class="kpi-sub">0 KEEP / 2 DISCARD</div>
+    </div>
+  </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        <tr><th>Model ID</th><td>sentence-transformers/all-MiniLM-L6-v2</td></tr><tr><th>Task</th><td>sentence-similarity</td></tr><tr><th>Arch type</th><td>bert</td></tr><tr><th>Baseline opset</th><td>17</td></tr><tr><th>EP</th><td>qnn</td></tr><tr><th>Device</th><td>npu</td></tr><tr><th>npu-001 note</th><td>neutral</td></tr>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        <svg class="chart-svg" viewBox="0 0 748 314" role="img" aria-label="Hypothesis gain chart"><defs><pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)"><rect width="8" height="8" fill="#cfd8dc"></rect><rect width="3" height="8" fill="#90a4ae"></rect></pattern></defs><text x="0" y="20" class="axis-label">Hypothesis</text><text x="150" y="20" class="axis-label">Gain vs baseline (%)</text><line x1="410.0" y1="40" x2="410.0" y2="288" class="center-line" /><line x1="150.0" y1="44" x2="150.0" y2="288" class="tick-line" /><text x="150.0" y="308" text-anchor="middle" class="tick-label">-200%</text><line x1="280.0" y1="44" x2="280.0" y2="288" class="tick-line" /><text x="280.0" y="308" text-anchor="middle" class="tick-label">-100%</text><line x1="410.0" y1="44" x2="410.0" y2="288" class="tick-line" /><text x="410.0" y="308" text-anchor="middle" class="tick-label">0%</text><line x1="540.0" y1="44" x2="540.0" y2="288" class="tick-line" /><text x="540.0" y="308" text-anchor="middle" class="tick-label">100%</text><line x1="670.0" y1="44" x2="670.0" y2="288" class="tick-line" /><text x="670.0" y="308" text-anchor="middle" class="tick-label">200%</text><g><title>h0: baseline (auto-config, W8A16)
+status=OK_HIGH_CV  verdict=—
+p50=5.81 ms  gain=+0.0%</title><rect x="0" y="48.0" width="748" height="40" class="row-bg" /><text x="8" y="64.0" class="hyp-label">h0</text><text x="8" y="77.0" class="hyp-sub">baseline (auto-config, W8…</text><line x1="410.0" y1="56.0" x2="410.0" y2="80.0" class="baseline-bar" /><text x="418.0" y="72.0" text-anchor="start" class="value-text">0.0%</text></g><g><title>h1: opset 17 explicit
+status=OK_HIGH_CV  verdict=—
+p50=5.88 ms  gain=-1.2%</title><rect x="0" y="88.0" width="748" height="40" class="row-bg" /><text x="8" y="104.0" class="hyp-label">h1</text><text x="8" y="117.0" class="hyp-sub">opset 17 explicit</text><rect x="408.4" y="96.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="400.4" y="112.0" text-anchor="end" class="value-text">-1.2%</text></g><g><title>h2: opset 19
+status=OK_HIGH_CV  verdict=—
+p50=5.98 ms  gain=-3.0%</title><rect x="0" y="128.0" width="748" height="40" class="row-bg" /><text x="8" y="144.0" class="hyp-label">h2</text><text x="8" y="157.0" class="hyp-sub">opset 19</text><rect x="406.2" y="136.0" width="3.8" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.2" y="152.0" text-anchor="end" class="value-text">-3.0%</text></g><g><title>h3: opset 21 (tests npu-001 bypass)
+status=OK_HIGH_CV  verdict=—
+p50=5.85 ms  gain=-0.7%</title><rect x="0" y="168.0" width="748" height="40" class="row-bg" /><text x="8" y="184.0" class="hyp-label">h3</text><text x="8" y="197.0" class="hyp-sub">opset 21 (tests npu-001 b…</text><rect x="409.0" y="176.0" width="2.0" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="401.0" y="192.0" text-anchor="end" class="value-text">-0.7%</text></g><g><title>h4: opset 17 + conv fusions
+status=OK  verdict=—
+p50=5.97 ms  gain=-2.7%</title><rect x="0" y="208.0" width="748" height="40" class="row-bg" /><text x="8" y="224.0" class="hyp-label">h4</text><text x="8" y="237.0" class="hyp-sub">opset 17 + conv fusions</text><rect x="406.5" y="216.0" width="3.5" height="24" fill="#90a4ae" stroke="none" stroke-width="0" rx="4" /><text x="398.5" y="232.0" text-anchor="end" class="value-text">-2.7%</text></g><g><title>h5: opset 21 + conv fusions
+status=TIMEOUT  verdict=—
+p50=—  gain=—</title><rect x="0" y="248.0" width="748" height="40" class="row-bg" /><text x="8" y="264.0" class="hyp-label">h5</text><text x="8" y="277.0" class="hyp-sub">opset 21 + conv fusions</text></g></svg>
+      </div>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Status</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+        <tr class="champion-row">
+          <td><span class="hyp-pill">h0</span> <span style="color:#1976d2;font-weight:900">★</span></td>
+          <td class="label-cell">baseline (auto-config, W8A16)</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.81 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.81 · 5.65 · 5.83]</span></td>
+          <td><span class="">+0.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h1</span></td>
+          <td class="label-cell">opset 17 explicit</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.88 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.81 · 5.88 · 5.91]</span></td>
+          <td><span class="gain-neg">-1.2%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h2</span></td>
+          <td class="label-cell">opset 19</td>
+          <td class="opset-cell">19</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.98 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[5.98 · 5.80 · 6.02]</span></td>
+          <td><span class="gain-neg">-3.0%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h3</span></td>
+          <td class="label-cell">opset 21 (tests npu-001 bypass)</td>
+          <td class="opset-cell">21</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.85 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.00 · 5.85 · 5.84]</span></td>
+          <td><span class="gain-neg">-0.7%</span></td>
+          <td><span class="">OK_HIGH_CV</span></td>
+          <td class="conf-cell">ranges separated</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h4</span></td>
+          <td class="label-cell">opset 17 + conv fusions</td>
+          <td class="opset-cell">17</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">5.97 ms</td>
+          <td class="sessions-cell"><span class="runs-val">[6.06 · 5.97 · 5.47]</span></td>
+          <td><span class="gain-neg">-2.7%</span></td>
+          <td><span class="">OK</span></td>
+          <td class="conf-cell">ranges overlap</td>
+        </tr>
+        <tr class="">
+          <td><span class="hyp-pill">h5</span></td>
+          <td class="label-cell">opset 21 + conv fusions</td>
+          <td class="opset-cell">—</td>
+          <td class="flags-cell"><em style="color:#bbb">not stored</em></td>
+          <td class="p50-cell">—</td>
+          <td class="sessions-cell">—</td>
+          <td>—</td>
+          <td><span class="">TIMEOUT</span></td>
+          <td class="conf-cell">single-point only</td>
+        </tr>
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+
+
+
+    <section class="section-card">
+      <div class="section-title">❌ Ineffective or Harmful</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="">
+              <td><span class="hyp-pill">h2</span></td>
+              <td>opset 19</td>
+              <td class="gain-neg">-3.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h4</span></td>
+              <td>opset 17 + conv fusions</td>
+              <td class="gain-neg">-2.7%</td>
+              <td>OK</td>
+              <td>ranges overlap</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+    <section class="section-card">
+      <div class="section-title">⚪ Neutral / Build Fail</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+
+            <tr class="champion-row">
+              <td><span class="hyp-pill">h0</span></td>
+              <td>baseline (auto-config, W8A16)</td>
+              <td class="gain-pos">+0.0%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h1</span></td>
+              <td>opset 17 explicit</td>
+              <td class="gain-neg">-1.2%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges overlap</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h3</span></td>
+              <td>opset 21 (tests npu-001 bypass)</td>
+              <td class="gain-neg">-0.7%</td>
+              <td>OK_HIGH_CV</td>
+              <td>ranges separated</td>
+            </tr>
+
+            <tr class="">
+              <td><span class="hyp-pill">h5</span></td>
+              <td>opset 21 + conv fusions</td>
+              <td class="gain-pos">—</td>
+              <td>TIMEOUT</td>
+              <td>single-point only</td>
+            </tr>
+
+        </tbody>
+      </table>
+    </section>
+
+
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
diff --git a/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
new file mode 100644
index 000000000..67483f470
--- /dev/null
+++ b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json
@@ -0,0 +1,123 @@
+{
+  "model_id": "sentence-transformers/all-MiniLM-L6-v2",
+  "task": "sentence-similarity",
+  "model_type": "bert",
+  "timestamp": "2026-06-13T15:58:36",
+  "ep": "qnn",
+  "device": "npu",
+  "baseline_opset": 17,
+  "hypotheses": {
+    "h0": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.934,
+        "cv": 0.2221,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.808,
+          5.647,
+          5.829
+        ],
+        "median_p50_ms": 5.808
+      },
+      "accuracy": null,
+      "label": "baseline (auto-config, W8A16)",
+      "opset": 17
+    },
+    "h1": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.851,
+        "cv": 0.9986,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.814,
+          5.88,
+          5.912
+        ],
+        "median_p50_ms": 5.88
+      },
+      "accuracy": null,
+      "label": "opset 17 explicit",
+      "opset": 17
+    },
+    "h2": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.309,
+        "cv": 0.2051,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          5.98,
+          5.799,
+          6.021
+        ],
+        "median_p50_ms": 5.98
+      },
+      "accuracy": null,
+      "label": "opset 19",
+      "opset": 19
+    },
+    "h3": {
+      "status": "OK_HIGH_CV",
+      "screen": {
+        "p50_ms": 5.959,
+        "cv": 1.1272,
+        "stable": false,
+        "note": "DVFS noise — high CV expected on QNN NPU"
+      },
+      "full": {
+        "p50s_ms": [
+          6.0,
+          5.851,
+          5.844
+        ],
+        "median_p50_ms": 5.851
+      },
+      "accuracy": null,
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21
+    },
+    "h4": {
+      "status": "OK",
+      "screen": {
+        "p50_ms": 5.478,
+        "cv": 0.1344,
+        "stable": true
+      },
+      "full": {
+        "p50s_ms": [
+          6.059,
+          5.966,
+          5.469
+        ],
+        "median_p50_ms": 5.966
+      },
+      "accuracy": null,
+      "label": "opset 17 + conv fusions",
+      "opset": 17
+    },
+    "h5": {
+      "status": "TIMEOUT",
+      "label": "opset 21 + conv fusions"
+    }
+  },
+  "best_hypothesis": "h0",
+  "baseline_p50_ms": 5.808,
+  "best_p50_ms": 5.808,
+  "best_gain_pct": 0.0,
+  "npu001_generalized": "neutral",
+  "feature_gaps": [],
+  "errors": [
+    "Model timed out at 1346s (before h5)"
+  ]
+}
diff --git a/research/autoconfig/docs/agent-design.html b/research/autoconfig/docs/agent-design.html
new file mode 100644
index 000000000..ae2a050d4
--- /dev/null
+++ b/research/autoconfig/docs/agent-design.html
@@ -0,0 +1,426 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>WinML CLI Agent Design Doc</title>
+<style>
+:root {
+  --bg: #ffffff;
+  --fg: #1f2328;
+  --muted: #59636e;
+  --border: #d1d9e0;
+  --accent: #0969da;
+  --code-bg: #f6f8fa;
+  --table-stripe: #f6f8fa;
+  --sidebar-bg: #f6f8fa;
+}
+@media (prefers-color-scheme: dark) {
+  :root {
+    --bg: #0d1117; --fg: #e6edf3; --muted: #9198a1; --border: #30363d;
+    --accent: #4493f8; --code-bg: #161b22; --table-stripe: #161b22; --sidebar-bg: #161b22;
+  }
+}
+* { box-sizing: border-box; }
+html { scroll-behavior: smooth; }
+body {
+  margin: 0; background: var(--bg); color: var(--fg);
+  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", "Noto Sans", Helvetica, Arial, sans-serif;
+  font-size: 16px; line-height: 1.65;
+}
+.layout { display: flex; max-width: 1450px; margin: 0 auto; }
+nav.toc {
+  width: 300px; flex-shrink: 0; position: sticky; top: 0; align-self: flex-start;
+  height: 100vh; overflow-y: auto; padding: 24px 16px 60px; border-right: 1px solid var(--border);
+  background: var(--sidebar-bg); font-size: 13.5px;
+}
+nav.toc .toc-title {
+  font-weight: 700; font-size: 14px; text-transform: uppercase;
+  letter-spacing: .04em; color: var(--muted); margin: 0 0 12px;
+}
+nav.toc ul { list-style: none; margin: 0; padding-left: 14px; }
+nav.toc > ul { padding-left: 0; }
+nav.toc li { margin: 2px 0; }
+nav.toc a { color: var(--fg); text-decoration: none; display: block; padding: 3px 8px; border-radius: 6px; }
+nav.toc a:hover { background: rgba(99,110,123,.15); color: var(--accent); }
+main { flex: 1; min-width: 0; padding: 40px 56px 120px; }
+@media (max-width: 1000px) {
+  nav.toc { display: none; }
+  main { padding: 24px; }
+}
+h1, h2, h3, h4 { line-height: 1.3; margin-top: 1.8em; margin-bottom: .6em; font-weight: 650; scroll-margin-top: 16px; }
+h1 { font-size: 2em; border-bottom: 2px solid var(--border); padding-bottom: .3em; margin-top: 0; }
+h2 { font-size: 1.5em; border-bottom: 1px solid var(--border); padding-bottom: .25em; }
+h3 { font-size: 1.2em; }
+h4 { font-size: 1.02em; }
+a { color: var(--accent); }
+p, li { overflow-wrap: break-word; }
+code {
+  font-family: "SF Mono", "Cascadia Code", Consolas, "Liberation Mono", Menlo, monospace;
+  font-size: 85%; background: var(--code-bg); padding: .2em .4em; border-radius: 6px;
+}
+table { border-collapse: collapse; width: 100%; margin: 1em 0; display: block; overflow-x: auto; font-size: 14px; }
+th, td { border: 1px solid var(--border); padding: 7px 12px; text-align: left; vertical-align: top; }
+th { background: var(--sidebar-bg); font-weight: 650; }
+tr:nth-child(even) td { background: var(--table-stripe); }
+blockquote { margin: 1em 0; padding: .2em 1em; color: var(--muted); border-left: 4px solid var(--border); }
+.headerlink { text-decoration: none; opacity: 0; margin-left: .4em; font-weight: 400; }
+h1:hover .headerlink, h2:hover .headerlink, h3:hover .headerlink, h4:hover .headerlink { opacity: .5; }
+.user-input-strip {
+  margin: 1em 0;
+  border: 3px solid #4a56b8;
+  border-radius: 14px;
+  padding: 14px 18px;
+  display: flex;
+  align-items: center;
+  gap: 10px;
+  flex-wrap: wrap;
+}
+.user-input-strip .label {
+  font-weight: 700;
+}
+.chip {
+  display: inline-block;
+  background: #eef0f8;
+  color: #4a56b8;
+  border-radius: 8px;
+  padding: 2px 8px;
+  font-size: 90%;
+  font-weight: 600;
+}
+</style>
+</head>
+<body>
+<div class="layout">
+<nav class="toc">
+  <p class="toc-title">Contents</p>
+  <div class="toc">
+    <ul>
+      <li><a href="#problem-statement">Problem statement</a></li>
+      <li><a href="#key-user-scenarios">Key user scenarios</a></li>
+      <li><a href="#execution-plan-and-deliverables">Execution plan and deliverables</a></li>
+      <li><a href="#winml-cli-vs-olive">winml-cli vs Olive</a></li>
+      <li><a href="#design-principles">Design principles</a></li>
+      <li><a href="#solution">Solution</a><ul>
+        <li><a href="#diagram-walkthrough">Diagram walkthrough</a></li>
+        <li><a href="#autoconfig-positioning">Autoconfig positioning</a></li>
+        <li><a href="#loop-v3-vs-agent-layer">Loop v3 vs agent layer</a></li>
+      </ul></li>
+      <li><a href="#input">Input</a></li>
+      <li><a href="#output">Output</a></li>
+      <li><a href="#roles-and-responsibilities">Roles and responsibilities</a></li>
+      <li><a href="#auto-research-inspired-policy">Auto-research-inspired policy</a></li>
+      <li><a href="#how-it-works">How it works</a><ul>
+        <li><a href="#lifecycle-orchestration">Lifecycle orchestration</a></li>
+        <li><a href="#cross-device-integration">Cross-device integration</a></li>
+        <li><a href="#self-evolution-integration">Self-evolution integration</a></li>
+        <li><a href="#evidence-constraints">Evidence constraints</a></li>
+        <li><a href="#key-concerns">Key concerns</a></li>
+        <li><a href="#open-questions">Open questions</a></li>
+      </ul></li>
+      <li><a href="#references">References</a></li>
+    </ul>
+  </div>
+</nav>
+<main>
+<h1 id="winml-cli-agent-design-doc">WinML CLI Agent Design Doc<a class="headerlink" href="#winml-cli-agent-design-doc" title="Permanent link">&para;</a></h1>
+<p><strong>Status:</strong> Draft (research POC) · <strong>Updated:</strong> 2026-06-21</p>
+
+<h2 id="problem-statement">Problem statement<a class="headerlink" href="#problem-statement" title="Permanent link">&para;</a></h2>
+<p>
+Although winml-cli provides default config generation, real deployment needs usually require model- and target-specific trade-offs.
+The same default can behave very differently when the objective shifts among latency, accuracy, and memory, or when the target EP or device changes.
+</p>
+<table>
+  <thead><tr><th>Customer signal</th><th>Observed ask</th><th>Design implication</th></tr></thead>
+  <tbody>
+    <tr><td>Teams / ecosystem feedback</td><td>Need usable CLI/tooling for model analysis, optimization, and benchmarking; WinML CLI introduced as the recommended entry point.</td><td>Agent layer must turn primitives into guided workflows, not only raw command output.</td></tr>
+    <tr><td>Canva / Affinity</td><td>One universal model across IHVs, near-real-time performance, minimal per-vendor tuning, better debuggability than black-box behavior.</td><td>Cross-device confidence + explainable diagnostics are core requirements.</td></tr>
+    <tr><td>Adobe</td><td>DML memory footprint and GPU↔CPU fallback ping-pong called out as major blockers.</td><td>Need EP-behavior visibility, fallback analysis, and actionable optimization guidance.</td></tr>
+    <tr><td>CyberLink</td><td>Need parity vs native runtimes, one model across silicon, and minimal engineering overhead (auto EP preference).</td><td>Agent must optimize for portability + performance while reducing expert intervention.</td></tr>
+  </tbody>
+</table>
+<p>
+Problem focus in this design:
+</p>
+<ol>
+  <li><strong>Default config is not always enough.</strong> Users have different constraints (perf, accuracy, memory, cross-EP/device portability), and trade-offs should be made from actual runtime evidence on target devices, sometimes requiring coordinated cross-device tuning.</li>
+  <li><strong>Negative optimization exists on some models.</strong> Current measurements show default config can regress specific models; this must be identified systematically and resolved with explainable diagnostics.</li>
+  <li><strong>Optimization behavior changes over time.</strong> EP/driver/winml version upgrades can shift optimal settings; the system should capture these shifts and stay up-to-date instead of freezing historical assumptions.</li>
+  <li><strong>Analyzer → optimizer coverage is still incomplete.</strong> Not all meaningful fusion opportunities are currently detected and translated into optimization actions; we need to identify which missed fusion opportunities matter most on real models and prioritize them.</li>
+</ol>
+
+<h2 id="key-user-scenarios">Key user scenarios<a class="headerlink" href="#key-user-scenarios" title="Permanent link">&para;</a></h2>
+<table>
+  <thead><tr><th>Scenario</th><th>Example ask</th><th>Expected outcome</th></tr></thead>
+  <tbody>
+    <tr>
+      <td>Constraint-driven config search</td>
+      <td>"Find a ConvNeXt config with accuracy drop &lt; 10% and memory &lt; 800MB."</td>
+      <td>Feasible config set + ranked recommendation + trade-off table.</td>
+    </tr>
+    <tr>
+      <td>Cross-device / cross-EP search</td>
+      <td>"Find a ConvNeXt model/config that runs on 3 NPUs." / "Find a model that can run on all EPs."</td>
+      <td>Portability-aware recommendation, fallback chain, and confidence by device/EP scope.</td>
+    </tr>
+    <tr>
+      <td>Model optimization upgrade</td>
+      <td>"Find a model/config faster than current one with accuracy drop &lt; 5%."</td>
+      <td>Candidate replacements with verified speedup, bounded accuracy loss, and migration guidance.</td>
+    </tr>
+  </tbody>
+</table>
+
+<h2 id="execution-plan-and-deliverables">Execution plan and deliverables<a class="headerlink" href="#execution-plan-and-deliverables" title="Permanent link">&para;</a></h2>
+<p>
+Execution will use <strong>built-in model recipe tuning</strong> as the driver task. Each iteration improves two things at once:
+the skill's reasoning quality and the per-model built-in recipe quality.
+</p>
+<ul>
+  <li>Run iterative tuning on the built-in model set across target EP/device scopes.</li>
+  <li>Capture gaps in analyzer/optimizer coverage and convert high-impact misses into prioritized feature work.</li>
+  <li>Feed validated results back to recipes, EP-specific knowledge, and skill evaluation signals.</li>
+</ul>
+
+<div style="border:1px solid var(--border); border-radius:10px; padding:14px; margin:14px 0 18px; overflow-x:auto;">
+  <svg width="100%" height="240" viewBox="0 0 980 240" preserveAspectRatio="none" style="display:block; min-width:900px; font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;">
+    <defs>
+      <marker id="arr" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+        <polygon points="0 0,8 3,0 6" fill="#94a3b8"></polygon>
+      </marker>
+    </defs>
+    <rect x="24" y="24" width="190" height="44" rx="8" fill="#eef0f8" stroke="#d1d9e0"></rect>
+    <text x="119" y="51" text-anchor="middle" font-size="12" fill="#334">Built-in model recipe tuning</text>
+    <rect x="266" y="24" width="160" height="44" rx="8" fill="#eef0f8" stroke="#d1d9e0"></rect>
+    <text x="346" y="51" text-anchor="middle" font-size="12" fill="#334">Run + measure</text>
+    <rect x="468" y="24" width="200" height="44" rx="8" fill="#eef0f8" stroke="#d1d9e0"></rect>
+    <text x="568" y="51" text-anchor="middle" font-size="12" fill="#334">Analyze misses / regressions</text>
+    <rect x="720" y="24" width="230" height="44" rx="8" fill="#eef0f8" stroke="#d1d9e0"></rect>
+    <text x="835" y="51" text-anchor="middle" font-size="12" fill="#334">Update skill + recipes + requirements</text>
+
+    <line x1="214" y1="46" x2="266" y2="46" stroke="#94a3b8" stroke-width="2" marker-end="url(#arr)"></line>
+    <line x1="426" y1="46" x2="468" y2="46" stroke="#94a3b8" stroke-width="2" marker-end="url(#arr)"></line>
+    <line x1="668" y1="46" x2="720" y2="46" stroke="#94a3b8" stroke-width="2" marker-end="url(#arr)"></line>
+    <path d="M 948 70 C 948 112, 750 126, 210 126 C 110 126, 70 110, 60 88" fill="none" stroke="#7aa2f7" stroke-width="2" stroke-dasharray="6 4" marker-end="url(#arr)"></path>
+    <text x="746" y="116" font-size="11" fill="#4a5f8a">iterate with next recipe/model batch</text>
+
+    <!-- Outcome pyramid (4 layers) -->
+    <polygon points="250,220 730,220 660,182 320,182" fill="#e6f4ea" stroke="#d1d9e0"></polygon>
+    <polygon points="320,182 660,182 610,146 370,146" fill="#e8eefc" stroke="#d1d9e0"></polygon>
+    <polygon points="370,146 610,146 560,112 420,112" fill="#f5f0ff" stroke="#d1d9e0"></polygon>
+    <polygon points="420,112 560,112 490,72" fill="#efe6ff" stroke="#d1d9e0"></polygon>
+
+    <text x="490" y="205" text-anchor="middle" font-size="11" fill="#334">Improved built-in model recipes</text>
+    <text x="490" y="167" text-anchor="middle" font-size="11" fill="#334">EP-specific optimization experience / knowledge</text>
+    <text x="490" y="133" text-anchor="middle" font-size="11" fill="#334">WinML CLI feature requirements</text>
+    <text x="490" y="98" text-anchor="middle" font-size="11" fill="#334">Self-evaluated skills</text>
+  </svg>
+</div>
+
+<h2 id="winml-cli-vs-olive">winml-cli vs Olive<a class="headerlink" href="#winml-cli-vs-olive" title="Permanent link">&para;</a></h2>
+<p>
+The overlap is real at the implementation level (both build on the ORT ecosystem), but the optimization philosophy is intentionally different.
+<strong>winml-cli focuses on a limited set of high-impact, commonly needed tuning levers with regular verification</strong>.
+<strong>Olive is a broader optimization framework with deeper knobs and stronger low-level control</strong>.
+</p>
+<table>
+  <thead><tr><th>Dimension</th><th>Olive</th><th>winml-cli (+ agent layer)</th></tr></thead>
+  <tbody>
+    <tr><td>Tuning scope</td><td>Comprehensive optimization surface, including many advanced pass combinations</td><td>Curated, high-impact optimization set for common Windows deployment paths</td></tr>
+    <tr><td>Control depth</td><td>Fine-grained and expert-oriented</td><td>Constrained and opinionated by design to reduce operational complexity</td></tr>
+    <tr><td>Verification model</td><td>User-driven validation strategy</td><td>Regular verification built into flow (baseline checks, eval gates, confidence-aware decisions)</td></tr>
+    <tr><td>Primary investment</td><td>Broad model transformation and deep optimization control</td><td>Stronger debugging/diagnostic capability, explainability, and safer decision support</td></tr>
+    <tr><td>Deep model adjustment ownership</td><td>First-class (advanced pass-level tuning)</td><td>Often delegated to Olive / Mobius for heavy model surgery</td></tr>
+    <tr><td>Primary user</td><td>ML/optimization engineer</td><td>WinApp developer / product engineer prioritizing time-to-ship</td></tr>
+    <tr><td>Default UX goal</td><td>Maximum controllability</td><td>Minimum engineering effort with explainable, reliable outcomes</td></tr>
+  </tbody>
+</table>
+
+<h2 id="design-principles">Design principles<a class="headerlink" href="#design-principles" title="Permanent link">&para;</a></h2>
+<ol>
+  <li><strong>Agent for judgment, tools for computation</strong>: keep heavy search deterministic and use the agent for diagnosis, prioritization, and explanation.</li>
+  <li><strong>Lifecycle orchestration first</strong>: one orchestrator role spans Intake → Insight → Opt Loop → Outcome.</li>
+  <li><strong>Evidence over intuition</strong>: all recommendations are backed by validation signals and confidence semantics.</li>
+  <li><strong>Cross-device by default</strong>: design for deployment fleets, not only the developer machine.</li>
+  <li><strong>Self-evolving knowledge</strong>: findings are promoted through confidence levels before broad reuse.</li>
+</ol>
+<blockquote>
+Lifecycle visual reference: <a href="./autoconfig_diagram.html"><code>autoconfig_diagram.html</code></a>.
+</blockquote>
+
+<h2 id="solution">Solution<a class="headerlink" href="#solution" title="Permanent link">&para;</a></h2>
+<p>
+The solution is the lifecycle shown in <a href="./autoconfig_diagram.html"><code>autoconfig_diagram.html</code></a>:
+an orchestrated pipeline from intake to outcome, with an explicit optimization loop in the middle.
+</p>
+
+<h3 id="diagram-walkthrough">Diagram walkthrough<a class="headerlink" href="#diagram-walkthrough" title="Permanent link">&para;</a></h3>
+<ol>
+  <li><strong>Phase 0 · Intake:</strong> establish baseline + correctness contract, and load resume state.</li>
+  <li><strong>Phase 1 · Insight:</strong> collect profiling/analyzer/graph evidence and generate <code>hypothesis_pool</code>.</li>
+  <li><strong>Phase 2 · Opt Loop:</strong> Explorer → Optimizer → Reviewer repeatedly evaluate candidate deltas.</li>
+  <li><strong>Phase 3 · Outcome:</strong> emit champion config, report, experiment artifacts, and KB draft findings.</li>
+</ol>
+<p>
+The key is the <strong>loop</strong>: hypotheses are not run once in a fixed sequence. Reviewer verdicts feed back into the next Explorer iteration until stop conditions are met (objective reached, queue exhausted, or plateau).
+</p>
+
+<h3 id="autoconfig-positioning">Autoconfig positioning<a class="headerlink" href="#autoconfig-positioning" title="Permanent link">&para;</a></h3>
+<p>
+Autoconfig is a <strong>sub-tool</strong>, not the primary UX entry point. The agent uses it for targeted sweeps over
+<code>EP × opset × graph options</code>, then returns an explainable report with a feasible-options comparison table.
+Correctness validation (<code>winml eval</code>) is mandatory before recommendation.
+</p>
+
+<h3 id="loop-v3-vs-agent-layer">Loop v3 vs agent layer<a class="headerlink" href="#loop-v3-vs-agent-layer" title="Permanent link">&para;</a></h3>
+<p>Autoconfig loop v3 already improved core execution quality (thresholded verdicts, early exit, crash-resume, KB-guided pruning, DVFS-aware handling). The agent layer still adds capabilities the loop alone does not provide:</p>
+<ol>
+  <li><strong>Architecture-aware reasoning</strong>: explain why a hypothesis exists for this model, not only run it.</li>
+  <li><strong>Failure explanation</strong>: convert DISCARD/failure traces into actionable diagnosis.</li>
+  <li><strong>Cross-device confidence</strong>: reason about deployment behavior beyond the local machine.</li>
+  <li><strong>Adaptive strategy</strong>: stop/reprioritize based on evidence trajectory, not only fixed counters.</li>
+  <li><strong>Knowledge narration</strong>: present promoted findings in developer-readable form, not just raw artifacts.</li>
+</ol>
+
+<h2 id="input">Input<a class="headerlink" href="#input" title="Permanent link">&para;</a></h2>
+<div class="user-input-strip">
+  <span style="font-size:20px;line-height:1;">&#128100;</span>
+  <span class="label">User input</span>
+  <span>&mdash; Model ID + Target EP + Objective:</span>
+  <span class="chip">accuracy-primary</span>
+  <span class="chip">latency-primary</span>
+  <span class="chip">Pareto</span>
+  <span>+ optional budget / accuracy floor</span>
+</div>
+
+<h2 id="output">Output<a class="headerlink" href="#output" title="Permanent link">&para;</a></h2>
+<table>
+  <thead><tr><th>Output</th><th>Source in autoconfig_diagram</th><th>Value</th></tr></thead>
+  <tbody>
+    <tr><td>Champion config</td><td><code>config_&lt;ep&gt;_optimal.json</code></td><td>Directly consumable best-known config</td></tr>
+    <tr><td>HTML benchmark report + comparison table</td><td><code>report.html</code> with experiment chart/table</td><td>Explainable recommendation and tradeoff visibility</td></tr>
+    <tr><td>Experiment artifacts</td><td><code>experiments/&lt;n&gt;/</code>, plus loop telemetry in <code>results.tsv</code> and <code>session.json</code></td><td>Audit trail, reproducibility, crash-resume continuity</td></tr>
+    <tr><td>KB draft entry</td><td><code>ep_knowledge/&lt;ep&gt;.json</code> with new entries marked <code>status="draft"</code></td><td>Feeds confidence-gated knowledge evolution</td></tr>
+    <tr><td>Feature requirements</td><td>Issue references for capability gaps (e.g. fused-op diagnostics, DVFS-aware perf)</td><td>Turns findings into product backlog action</td></tr>
+  </tbody>
+</table>
+
+<h2 id="roles-and-responsibilities">Roles and responsibilities<a class="headerlink" href="#roles-and-responsibilities" title="Permanent link">&para;</a></h2>
+<table>
+  <thead><tr><th>Role</th><th>Responsibility</th><th>Key artifacts</th></tr></thead>
+  <tbody>
+    <tr><td>Orchestrator</td><td>Controls phase transitions, loop gates, stop conditions, and resume behavior</td><td><code>session.json</code>, run state, final synthesis</td></tr>
+    <tr><td>Explorer</td><td>Builds and ranks candidate experiments from <code>hypothesis_pool</code> under KB constraints</td><td><code>skip_set</code>, <code>priority_queue</code>, candidate deltas</td></tr>
+    <tr><td>Optimizer</td><td>Executes build/perf/eval for each candidate and records measurement evidence</td><td>perf/eval logs, <code>results.tsv</code>, experiment folders</td></tr>
+    <tr><td>Reviewer</td><td>Applies acceptance policy and returns verdict/suggestions to next loop iteration</td><td>KEEP/MARGINAL/DISCARD outcomes + rationale</td></tr>
+  </tbody>
+</table>
+<p>
+The Insight phase can leverage <strong>debug-model</strong> as a sub-skill to enrich failure analysis and bottleneck interpretation before hypothesis ranking.
+</p>
+
+<h2 id="auto-research-inspired-policy">Auto-research-inspired policy<a class="headerlink" href="#auto-research-inspired-policy" title="Permanent link">&para;</a></h2>
+<p>
+This design borrows from auto-research thinking: Explorer expands the search, while termination and acceptance are governed by explicit policy rather than ad-hoc trial and error.
+</p>
+<h3 id="search-space-definition">Search space definition<a class="headerlink" href="#search-space-definition" title="Permanent link">&para;</a></h3>
+<p>Explorer currently mutates all major tunable knobs exposed in the winml build path, including:</p>
+<ol>
+  <li>opset version</li>
+  <li><code>winml optimize</code> options</li>
+  <li>quantization parameters</li>
+  <li>ORT runtime configuration options</li>
+</ol>
+<h3 id="termination-condition">Termination condition<a class="headerlink" href="#termination-condition" title="Permanent link">&para;</a></h3>
+<p>The loop stops when all experiments considered worth exploring have been executed (or when global stop conditions trigger).</p>
+<h3 id="acceptance-condition">Acceptance condition<a class="headerlink" href="#acceptance-condition" title="Permanent link">&para;</a></h3>
+<p>
+An experiment is accepted only when <strong>performance, accuracy, and memory</strong> all satisfy user requirements, and the observed performance gain is <strong>stable</strong> rather than noise.
+</p>
+
+<h2 id="how-it-works">How it works<a class="headerlink" href="#how-it-works" title="Permanent link">&para;</a></h2>
+<h3 id="lifecycle-orchestration">Lifecycle orchestration<a class="headerlink" href="#lifecycle-orchestration" title="Permanent link">&para;</a></h3>
+<p>
+The orchestrator controls phase transitions and loop gates. Explorer/Optimizer/Reviewer execute the optimization loop,
+while outcome synthesis consolidates the recommendation and its evidence into final outputs.
+</p>
+<ul>
+  <li><strong>Phase governance:</strong> enforce intake prerequisites (baseline + correctness contract) before optimization starts.</li>
+  <li><strong>Loop governance:</strong> drive <code>priority_queue</code> consumption, enforce stop conditions (objective met, plateau, queue empty), and keep run state resumable via <code>session.json</code>.</li>
+  <li><strong>Decision governance:</strong> ensure each recommendation is backed by benchmark evidence and clear verdict logic, then package it into operator-friendly outputs.</li>
+  <li><strong>Reliability governance:</strong> preserve crash-resume semantics and avoid losing completed experiments when long sweeps are interrupted.</li>
+</ul>
+<p>
+In practice, the Orchestrator is the control plane and Explorer/Optimizer/Reviewer are the execution plane. This separation keeps compute deterministic while allowing higher-level strategy updates without rewriting bench primitives.
+</p>
+
+<h3 id="cross-device-integration">Cross-device integration<a class="headerlink" href="#cross-device-integration" title="Permanent link">&para;</a></h3>
+<p>
+From <code>cross-device-design.html</code>: treat <code>winml serve</code> Phase 0 endpoints as distributed workers,
+optimize a joint objective across devices, and add a device axis to portability confidence.
+</p>
+<ul>
+  <li><strong>Execution model:</strong> each device runs <code>winml serve</code> as a worker; orchestrator fans out build/perf/eval calls and aggregates results centrally.</li>
+  <li><strong>Objective model:</strong> replace single-host “best local config” with a weighted multi-device objective (latency/accuracy/coverage by deployment mix).</li>
+  <li><strong>Portability model:</strong> mark findings with device scope (local-only vs cross-device stable) so recommendations can express confidence per hardware tier.</li>
+  <li><strong>Operational model:</strong> generate fallback chains (for example QNN → DML → CPU) when a single universal winner is not feasible.</li>
+</ul>
+<p>
+This directly addresses customer requests for “one model across IHVs” and reduced manual per-vendor tuning by shifting complexity from app teams into orchestration logic.
+</p>
+
+<h3 id="self-evolution-integration">Self-evolution integration<a class="headerlink" href="#self-evolution-integration" title="Permanent link">&para;</a></h3>
+<p>
+From <code>self-evolution-design.html</code>: use a paired A/B protocol and adaptive sampling to stabilize conclusions,
+then promote findings through L1→L5 confidence levels before broad KB reuse.
+</p>
+<ul>
+  <li><strong>Measurement robustness:</strong> paired A/B reduces thermal/order bias; adaptive session count increases confidence only when needed.</li>
+  <li><strong>Knowledge quality:</strong> promotion gates prevent noisy one-off wins from entering reusable KB rules prematurely.</li>
+  <li><strong>Search efficiency:</strong> once findings are promoted, <code>skip_set</code> and ranking improve, reducing wasted experiments in future runs.</li>
+  <li><strong>Governance loop:</strong> each sweep contributes structured evidence back to KB, making later recommendations faster and more reliable.</li>
+</ul>
+<p>
+The result is a closed-loop system: run experiments → accumulate evidence → promote stable patterns → improve next orchestration cycle.
+</p>
+
+<h3 id="evidence-constraints">Evidence constraints<a class="headerlink" href="#evidence-constraints" title="Permanent link">&para;</a></h3>
+<table>
+  <thead><tr><th>Finding</th><th>Implication</th><th>Required control</th></tr></thead>
+  <tbody>
+    <tr><td><code>npu-001</code></td><td>opset 21 benefits Conv+residual patterns</td><td>Keep opset as first-class search lever</td></tr>
+    <tr><td><code>npu-006</code></td><td>Conv fusion can cause catastrophic NPU regressions</td><td>Hard-block risky fusion hypotheses</td></tr>
+    <tr><td><code>npu-007</code></td><td>DVFS distorts naive perf conclusions</td><td>Use DVFS-aware bench protocol + confidence gating</td></tr>
+  </tbody>
+</table>
+
+<h3 id="key-concerns">Key concerns<a class="headerlink" href="#key-concerns" title="Permanent link">&para;</a></h3>
+<table>
+  <thead><tr><th>Concern</th><th>Mitigation in design</th></tr></thead>
+  <tbody>
+    <tr><td>Device heterogeneity may invalidate local optimum</td><td>Cross-Device Confidence Agent + multi-device objective and fallback chains</td></tr>
+    <tr><td>Trust/auditability of recommendations</td><td>Require provenance artifacts and report-level explanation</td></tr>
+    <tr><td>Noise-driven false wins</td><td>DVFS-aware protocol, thresholded verdict policy, confidence gates</td></tr>
+    <tr><td>Overlap concerns with Olive</td><td>Differentiate on UX/explainability and Windows deployment reasoning</td></tr>
+  </tbody>
+</table>
+
+<h3 id="open-questions">Open questions<a class="headerlink" href="#open-questions" title="Permanent link">&para;</a></h3>
+<ol>
+  <li>Should this ship as <code>winml agent</code> or as agent-assist modes on existing commands?</li>
+  <li>How should cross-device execution be provisioned: local lab fleet, cloud runners, or hybrid?</li>
+  <li>What is the minimal offline fallback for restricted environments?</li>
+</ol>
+
+<h2 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h2>
+<ul>
+  <li><a href="./agent-design.md"><code>agent-design.md</code></a></li>
+  <li><a href="./autoconfig_diagram.html"><code>autoconfig_diagram.html</code></a></li>
+  <li><a href="./cross-device-design.html"><code>cross-device-design.html</code></a></li>
+  <li><a href="./self-evolution-design.html"><code>self-evolution-design.html</code></a></li>
+</ul>
+</main>
+</div>
+</body>
+</html>
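Review note on the "Reliability governance" bullet in the file above: one common way to get crash-resume semantics is to write `session.json` to a temporary file and then rename it over the previous copy, so an interrupted run leaves either the old state or the new one on disk. The sketch below illustrates that pattern only; it is not taken from `autoconfig.py`, and the function names and the `completed` key are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_session_atomic(path, state):
    """Write state as JSON to a temp file, then rename it over the target."""
    path = Path(path)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(state, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # swap the new file in as a single step
    except BaseException:
        # Remove the partial temp file; the previous session file is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_session(path):
    """Return the saved state, or an empty state (hypothetical shape) on first run."""
    path = Path(path)
    if not path.exists():
        return {"completed": []}
    return json.loads(path.read_text(encoding="utf-8"))
```

On resume, an orchestrator could call `load_session()` and skip every experiment already listed under `completed`, then call `save_session_atomic()` after each finished experiment.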
diff --git a/research/autoconfig/docs/agent-design.md b/research/autoconfig/docs/agent-design.md
new file mode 100644
index 000000000..688dad029
--- /dev/null
+++ b/research/autoconfig/docs/agent-design.md
@@ -0,0 +1,254 @@
+# WinML CLI Agent Design
+
+> Status: Draft — 2026-06-17 (updated: autoconfig loop V3 changes incorporated)
+> Context: Strategic design for the agent layer of winml-cli
+
+---
+
+## 1. Context: Why Agent Matters for winml-cli
+
+### 1.1 winml-cli vs Olive — The Real Distinction
+
+Microsoft Olive already exists as a pass-based optimization framework supporting QNN, DML, and other Windows EPs. The temptation is to dismiss winml-cli's agent as redundant with Olive. That would be wrong — the distinction is fundamental:
+
+| Dimension | Olive | winml-cli |
+| --- | --- | --- |
+| Target user | ML engineer who understands ORT internals | WinApp developer who wants their model to work on Windows |
+| Workflow | Compose passes manually, specify EP upfront | `config` + `build` — two commands, full pipeline |
+| Hardware selection | Manual EP specification | `--device auto` — detects hardware, selects EP |
+| Explainability | Silent pipeline output | Designed for transparency |
+| Windows-first | Cross-platform, Windows supported | Built exclusively for Windows hardware diversity |
+| Operator diagnostics | Not available | `winml analyze` — operator linting, EP compatibility |
+| Agent-ready | Not designed for it | First-class design goal |
+
+**Analogy:** Olive is webpack (powerful, expert-configured); winml-cli is Vite (opinionated, works for most cases out of the box).
+
+### 1.2 The Core Gap Agent Should Fill
+
+WinApp developers lack access to a senior ML engineer who:
+
+- Knows why a model fails on QNN NPU for this specific operator pattern
+- Can read an error message and immediately know the root cause
+- Understands which optimization knob to turn for which problem
+- Knows how a config that works on Snapdragon X Elite will behave on Intel Meteor Lake
+
+**The agent's job is to be that person.**
+
+---
+
+## 2. Agent Design Philosophy
+
+### 2.1 The Improved Loop (autoconfig V3) vs The Agent Layer
+
+The autoconfig search loop has been significantly improved since the initial draft. As of v3 (`59e7329d`):
+
+**What the improved loop does well:**
+- Statistical significance via `ThroughputOnly` verdict policy: `improvement > max(1% floor, 2× screen_CV)` — noise-level deltas no longer pass as KEEP
+- Screen early exit: if screen improvement < 1%, skip 3× full bench — saves 25–90 min per rejected hypothesis
+- Crash-resume via `session.json`: atomic state persistence, restartable without re-running completed experiments
+- KB-guided search: `ep_knowledge/*.json` confirmed rules prune the search space before any experiment runs
+- DVFS-aware bench protocol: npu-007 CV gate disabled on QNN NPU; 3× 500-iter sessions with cool-down
+- npu-006 guard: Conv% > 20% → hard-block conv fusions before they cause 4900% regression
+
+**What still requires the agent layer:**
+
+The loop is a *computation engine*, not an *intelligence layer*. It needs an agent because:
+
+1. **No architecture-aware hypothesis generation** — hypotheses are hardcoded per EP, not generated from model analysis. An attention-heavy model gets the same hypotheses as a Conv-heavy one.
+2. **No failure explanation** — DISCARD is logged but not explained. Developers can't learn from results without reading raw JSON.
+3. **No cross-device reasoning** — a config found on Snapdragon X Elite has unknown behavior on Intel Meteor Lake. The loop can't tell you that.
+4. **No adaptive stopping** — 30-DISCARD plateau is a static heuristic. An agent would recognize when all architectural levers for this model/EP pair have been exhausted.
+5. **No KB self-update** — KB is manually maintained. An agent with memory extraction (cf. AgenticGPUOptimizer `memory_extractor.py`) would auto-update `ep_knowledge/*.json` after each run.
+
+The revised framing: **autoconfig is a sub-tool that the agent invokes and explains, not a headless replacement for the agent**.
+
+### 2.2 The Wrong Design (Original Autoconfig)
+
+The *original* autoconfig ran a **headless search loop** with no statistical significance, no crash-resume, and no KB-guided pruning:
+`Explorer → Optimizer → Reviewer → repeat`
+
+**Problems that were present (now fixed in V3):**
+
+- No statistical significance — 1% hardcoded floor meant noise-level deltas passed as KEEP
+- No screen early exit — every hypothesis ran 3× full bench regardless of screen result
+- No crash-resume — an interrupted run lost all state
+- All optim keys in kebab-case → `build_config()` silently used snake_case lookups → every hypothesis ran as baseline (critical bug, fixed)
+
+**Remaining problems (require agent layer to fix):**
+
+- Driving the benchmark loop with an LLM agent is wasteful: a Python script runs it faster, cheaper, and more reliably, so the loop is worth keeping but the LLM overhead is not
+- Results (config files) are not auditable — developer cannot verify why a config was chosen
+- No explainability — developer doesn't understand what was decided or why
+- Treats developer as absent; no collaborative interaction
+- The "agentic" overhead (LLM inference cost per loop iteration) adds nondeterminism without intelligence
+
+Autoconfig search is useful as a **sub-tool**, not as the primary value proposition of the agent layer.
+
+### 2.3 The Right Design: Diagnosis + Guidance over Search
+
+Agent excels at **judgment, diagnosis, and explanation** — not computation. The redesign centers on:
+
+> **When a developer encounters a problem, the agent gives explanation + executable next step — not a config file.**
+
+#### Design Principles
+
+1. **Explain, don't just output**  
+   Instead of silently picking an EP, say: *"I picked QNN EP because your device has a Qualcomm NPU. Operator coverage is 97% — the remaining 3% fall back to CPU, which is acceptable for these specific ops."*
+2. **Fix, don't just diagnose**  
+   When an incompatible operator is found, apply the graph transformation — don't just flag it.
+3. **Developer talks, agent acts**  
+   The agent is interactive and conversational. Developer says "this model is slow on GPU" → agent asks clarifying questions, runs targeted experiments, explains findings.
+4. **Progressive trust**  
+   Show confidence levels. Be explicit about uncertainty. Let the developer see what the agent is doing. Never give false precision (e.g., "Config A is 3% faster" when standard deviation is 5%).
+5. **Windows device diversity as first-class concern**  
+   Always reason about what happens on devices the developer doesn't have — not just the machine the agent runs on.
+
+---
+
+## 3. Agent Types
+
+### 3.1 Diagnostic Agent *(highest priority)*
+
+**Trigger:** Model fails to load, crashes at inference, throws EP compatibility error  
+**Developer question:** "My model fails on QNN NPU — why? What do I do?"
+
+**Agent responsibilities:**
+
+- Parse error message → identify root cause (unsupported op, shape mismatch, driver version, etc.)
+- Analyze model graph → enumerate incompatible operators per EP
+- Propose and apply concrete fix (graph transformation, operator substitution, fallback EP)
+- Verify fix with `winml eval` accuracy check
+
+**Why Olive does not cover this:** Olive doesn't converse, doesn't diagnose, doesn't explain. It fails silently or produces a broken model.
+
+**Example interaction:**
+
+```text
+Developer: winml build failed. Error: "QNNExecutionProvider: Unsupported op at node /conv/Conv_3"
+Agent: Found it. Conv_3 has dynamic padding — QNN NPU requires static shapes.
+       I'll apply DynamicToFixedShape transform and re-run the compile.
+       [applies fix] → Build succeeded. NPU latency: 12.3ms. Accuracy delta: 0.01%.
+```
+
+---
+
+### 3.2 Decision Guidance Agent
+
+**Trigger:** Developer is at a decision point in the pipeline (which EP? which precision? to quantize or not?)  
+**Developer question:** "I don't know what options to pick. What's the tradeoff?"
+
+**Agent responsibilities:**
+
+- Run quick comparative benchmarks (not exhaustive search)
+- Present tradeoffs with numbers: latency gain vs accuracy delta vs model size
+- Make a recommendation with reasoning, not just a number
+- Let developer override with understanding of consequences
+
+**Key difference from autoconfig:** This is interactive and decision-oriented, not headless. The developer is in the loop.
+
+---
+
+### 3.3 Cross-Device Confidence Agent *(winml-cli unique)*
+
+**Trigger:** Developer has a working config, asks "will this work on my users' devices?"  
+**Developer question:** "My app ships on many Windows hardware configs. Will this be okay?"
+
+**Agent responsibilities:**
+
+- Given a config optimized for Device A, reason about behavior on Device B, C...
+- Identify configs that are device-specific (compiled QNN binaries only work on Qualcomm)
+- Generate multi-device config with automatic EP fallback chain (QNN → DML → CPU)
+- Surface warnings: "This config will fail on Intel Meteor Lake — here's the fallback"
+
+**Why this matters:** WinApp developers ship to millions of devices. No other tool addresses Windows hardware diversity in the deployment sense.
+
+---
+
+### 3.4 Regression Detection Agent *(CI/CD scenario)*
+
+**Trigger:** ORT version bump, driver update, or scheduled CI run  
+**Developer question:** "Something changed — my model got slower / broke"
+
+**Agent responsibilities:**
+
+- Compare before/after perf numbers with statistical validity (not point estimates)
+- Correlate change with known ORT/EP changelog entries
+- Identify which layer / operator regressed using profiler output
+- Propose workaround or file structured bug report
+
+---
+
+### 3.5 Model Recommendation Agent *(developer onboarding scenario)*
+
+**Trigger:** Developer describes a problem they want to solve but doesn't know which model to use  
+**Developer question:** "I want to add real-time object detection to my app. What model should I use?"
+
+**Agent responsibilities:**
+
+- Understand the developer's task from natural language (classification, detection, segmentation, NLP, etc.)
+- Query `winml catalog` to enumerate available built-in models for that task
+- Factor in the developer's current hardware (`winml sys`): which EPs are available, memory constraints, NPU presence
+- Recommend the best-fit model with reasoning: why this model, expected latency on their device, accuracy tier
+- Output a ready-to-run command so the developer can immediately try it
+
+**Why this is valuable:** Developers often don't know which model architecture fits their use case, let alone which one performs well on their specific Windows hardware. The catalog has verified, pre-tested models; the agent bridges "I have a problem" to "here is a model you can ship today."
+
+**Example interaction:**
+
+```text
+Developer: I want to classify product images in my e-commerce app. Should be fast on low-end laptops.
+Agent: For image classification on low-end hardware without an NPU, I recommend:
+
+  → microsoft/resnet-50  (catalog built-in)
+    - Task: image-classification
+    - Latency on CPU: ~18ms p50 (tested on Intel Core i5)
+    - Size: 98MB FP32, 25MB w8a16
+    - Best EP on your machine: CPUExecutionProvider
+
+  Runner-up: google/vit-base-patch16-224 (better accuracy, ~2x slower on CPU)
+
+  To build and benchmark:
+    winml build -c $(winml config -m microsoft/resnet-50 --device auto) -o resnet_out/
+    winml perf -m resnet_out/model.onnx --device auto --iterations 100
+```
+
+**What makes this different from a search engine:** The recommendation is hardware-aware: the same question asked on a machine with a Qualcomm NPU would surface a different model (or a different EP for the same model) with different expected numbers. It is a contextual match, not a static lookup.
+
+---
+
+## 4. Role of Autoconfig (Perf Search) in This Design
+
+Autoconfig (opset × EP × opt\_level search) is **not abandoned**; it becomes a sub-tool invoked by the agents above when appropriate.
+
+**When it's invoked:**
+
+- Diagnostic Agent: "Let me try a few configs to see if this performance issue is config-related"
+- Decision Guidance Agent: "Let me run a quick sweep across opt\_level to show you the tradeoff"
+
+**What changes:**
+
+- It's not the primary entry point
+- Its output is explained, not just surfaced as a config file
+- It runs fewer, targeted experiments (not exhaustive grid search)
+- Accuracy validation (`winml eval`) is mandatory before any recommendation
+
+---
+
+## 5. Key Concerns to Track
+
+| Concern | Mitigation |
+| --- | --- |
+| Device heterogeneity: config found on the developer's machine may not generalize | Cross-Device Confidence Agent explicitly addresses this; output includes device scope |
+| Trust/auditability: developer can't verify agent recommendation | All recommendations include reasoning + confidence + "how I tested this" |
+| Olive overlap at implementation layer | winml-cli uses ORT under the hood like Olive; the differentiation is UX + Windows-first + explainability, not reimplementing optimization passes |
+| Accuracy validation | `winml eval` is mandatory in every agent loop that modifies the model |
+| Agent hallucinating perf numbers | All perf claims require at least 1000 iterations and must report p50/p90/p99 with standard deviation |
+
+---
+
+## 6. Open Questions
+
+1. **Scope**: Should the agent be a CLI mode (`winml agent`) or embedded into existing commands (`winml build --agent`)?
+2. **Olive relationship**: Should winml-cli contribute opset search back to Olive, or maintain it independently? Needs alignment with Olive team.
+3. **Offline / no-LLM mode**: Should the agent work without LLM (rule-based fallback) for air-gapped CI environments?
+4. **Multi-device testing**: Cross-Device Confidence Agent requires access to multiple devices or a device simulation layer — how to implement?
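Review note on `agent-design.md`, section 2.1: the `ThroughputOnly` verdict rule `improvement > max(1% floor, 2× screen_CV)` and the screen early exit could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in `autoconfig.py`: the function names are hypothetical, and comparing p50 latency and defining improvement as the relative latency reduction are assumptions the document does not spell out.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless noise measure)."""
    return stdev(samples) / mean(samples)


def relative_improvement(baseline_p50_ms, candidate_p50_ms):
    """Fraction by which the candidate lowers p50 latency (assumed metric)."""
    return (baseline_p50_ms - candidate_p50_ms) / baseline_p50_ms


def verdict_threshold(screen_cv, floor=0.01, cv_multiplier=2.0):
    """Threshold from the doc: max(1% floor, 2 x screen CV)."""
    return max(floor, cv_multiplier * screen_cv)


def should_skip_full_bench(screen_improvement, floor=0.01):
    """Screen early exit: below the 1% floor, the full bench is not run."""
    return screen_improvement < floor


def passes_gate(improvement, screen_cv):
    """True when the measured gain clears the noise-aware threshold."""
    return improvement > verdict_threshold(screen_cv)
```

Under these assumptions, a 5% gain measured with a screen CV of 1% clears the 2% threshold, while the same 5% gain with a screen CV of 3% does not, because the threshold rises to 6%.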
diff --git a/research/autoconfig/docs/autoconfig_diagram.html b/research/autoconfig/docs/autoconfig_diagram.html
new file mode 100644
index 000000000..9b5b9e69a
--- /dev/null
+++ b/research/autoconfig/docs/autoconfig_diagram.html
@@ -0,0 +1,451 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>autoconfig Skill — Architecture</title>
+<style>
+  * { box-sizing: border-box; margin: 0; padding: 0; }
+  body {
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+    background: #f4f6f9;
+    padding: 32px 24px;
+    color: #1a1a2e;
+  }
+  h1 { font-size: 18px; font-weight: 700; color: #1a1a2e; margin-bottom: 4px; }
+  .subtitle { font-size: 12px; color: #666; margin-bottom: 6px; }
+  .version-tag {
+    display: inline-block; background: #e8eaf6; color: #3949ab;
+    border-radius: 4px; padding: 2px 8px; font-size: 10px; font-weight: 700;
+    letter-spacing: 0.5px; margin-bottom: 24px;
+  }
+
+  /* ── Layout ── */
+  .diagram { display: flex; flex-direction: column; align-items: center; gap: 0; }
+  .lifecycle-flow {
+    width: 100%; max-width: 960px; box-sizing: border-box;
+    position: relative; padding-left: 54px;
+  }
+  .orchestrator-line {
+    position: absolute; left: 20px; top: 0; bottom: 0;
+    width: 14px; border-radius: 999px;
+    background: linear-gradient(180deg, #4a79d9, #3f6fcc);
+    border: 1.5px solid #7ea6f0;
+  }
+  .orchestrator-label {
+    position: absolute; left: 4px; top: 10px;
+    writing-mode: vertical-rl; text-orientation: mixed; transform: rotate(180deg);
+    font-size: 11px; font-weight: 700; letter-spacing: 0.8px;
+    color: #1f3f83; background: #e8eaf6; border: 1px solid #c5cae9; border-radius: 6px;
+    padding: 6px 4px;
+  }
+
+  /* ── Phase strips ── */
+  .phase-row {
+    display: flex; align-items: stretch; width: 100%; max-width: 900px; margin-bottom: 0;
+  }
+  .phase-label {
+    writing-mode: vertical-rl; text-orientation: mixed; transform: rotate(180deg);
+    font-size: 10px; font-weight: 700; letter-spacing: 1.2px; text-transform: uppercase;
+    padding: 12px 8px; border-radius: 8px 0 0 8px; min-width: 32px;
+    display: flex; align-items: center; justify-content: center; flex-shrink: 0;
+  }
+  .phase-body {
+    flex: 1; padding: 16px 20px; border-radius: 0 8px 8px 0;
+    display: flex; align-items: center; gap: 12px; flex-wrap: wrap;
+  }
+
+  /* ── Phase colors ── */
+  .p0 .phase-label { background: #e8eaf6; color: #3949ab; }
+  .p0 .phase-body  { background: #f3f4fc; border: 1.5px solid #c5cae9; }
+  .p1 .phase-label { background: #e0f2f1; color: #00695c; }
+  .p1 .phase-body  { background: #f1faf9; border: 1.5px solid #b2dfdb; }
+  .p2 .phase-label { background: #fff3e0; color: #e65100; }
+  .p2 .phase-body  { background: #fffbf5; border: 1.5px solid #ffe0b2; }
+  .p3 .phase-label { background: #fce4ec; color: #880e4f; }
+  .p3 .phase-body  { background: #fff5f8; border: 1.5px solid #f8bbd0; }
+
+  /* ── Connector arrows ── */
+  .arrow {
+    display: flex; justify-content: center; align-items: center;
+    height: 28px; width: 100%; max-width: 900px; position: relative;
+  }
+  .arrow::before {
+    content: ''; position: absolute; left: 50%; top: 0; bottom: 0;
+    width: 2px; background: #aab;
+  }
+  .arrow-head {
+    position: relative; z-index: 1; background: #f4f6f9;
+    padding: 0 6px; font-size: 18px; color: #889; line-height: 1;
+  }
+
+  /* ── Generic box ── */
+  .box {
+    background: #fff; border-radius: 8px; padding: 10px 14px;
+    border: 1.5px solid #dde; font-size: 12px; min-width: 130px;
+  }
+  .box-title {
+    font-weight: 700; font-size: 11px; margin-bottom: 5px;
+    display: flex; align-items: center; gap: 6px;
+  }
+  .box ul { padding-left: 14px; color: #445; line-height: 1.6; font-size: 11px; }
+  .box code {
+    font-family: "Cascadia Code","Fira Code",monospace; font-size: 10px;
+    background: #f0f2f5; padding: 1px 4px; border-radius: 3px; color: #2d4a8a;
+  }
+
+  /* ── Loop (Phase 2) ── */
+  .loop-container { width: 100%; display: flex; gap: 10px; align-items: flex-start; }
+  .loop-agents { flex: 1; display: flex; flex-direction: column; gap: 8px; }
+  .experiment-loop {
+    border: 2px dashed #ffcc80; border-radius: 12px; padding: 10px 12px;
+    display: flex; flex-direction: column; gap: 8px; background: #fffaf3;
+  }
+  .experiment-loop-title {
+    font-size: 11px; font-weight: 700; color: #bf360c;
+    text-transform: uppercase; letter-spacing: 0.6px;
+  }
+  .rel-note {
+    align-self: center; font-size: 10px; color: #5c6bc0; background: #eef2ff;
+    border: 1px dashed #9fa8da; border-radius: 999px; padding: 2px 10px;
+  }
+  .rel-note.data { color: #00897b; border-color: #80cbc4; background: #edfdfb; }
+  .loop-agent {
+    background: #fff; border-radius: 8px; padding: 10px 14px;
+    border: 1.5px solid #ffe0b2; font-size: 11px;
+  }
+  .loop-agent .agent-title { font-weight: 700; font-size: 11px; margin-bottom: 4px; color: #bf360c; }
+  .loop-agent ul { padding-left: 14px; color: #445; line-height: 1.65; }
+  .loop-agent .new-badge {
+    display: inline-block; background: #e8f5e9; color: #2e7d32; border: 1px solid #a5d6a7;
+    border-radius: 4px; padding: 0 5px; font-size: 9px; font-weight: 700;
+    letter-spacing: 0.5px; margin-left: 6px; vertical-align: middle;
+  }
+  .loop-side { display: flex; flex-direction: column; gap: 8px; min-width: 158px; }
+  .stop-box {
+    background: #fff; border: 1.5px dashed #aab; border-radius: 8px;
+    padding: 10px 14px; font-size: 11px; color: #556;
+  }
+  .stop-box .stop-title { font-weight: 700; margin-bottom: 4px; }
+  .stop-box ul { padding-left: 14px; line-height: 1.65; }
+  .loop-return {
+    margin-top: 2px; display: flex; align-items: center; justify-content: center; gap: 8px;
+    font-size: 10.5px; font-weight: 700; color: #bf360c;
+    border: 1.5px dashed #bf360c; border-radius: 999px; padding: 4px 10px; background: #fff3e0;
+  }
+  .loop-return .loop-icon { font-size: 13px; line-height: 1; }
+  .mini-arrow { text-align: center; font-size: 13px; color: #e65100; line-height: 1; margin: -2px 0; }
+
+  /* verdict pill */
+  .verdict-row {
+    display: flex; gap: 6px; flex-wrap: wrap; margin-top: 5px; padding-left: 4px;
+  }
+  .pill {
+    padding: 2px 8px; border-radius: 10px; font-size: 10px; font-weight: 700;
+    letter-spacing: 0.3px;
+  }
+  .pill-keep    { background: #e8f5e9; color: #2e7d32; border: 1px solid #a5d6a7; }
+  .pill-marginal{ background: #fff8e1; color: #f57f17; border: 1px solid #ffe082; }
+  .pill-discard { background: #fce4ec; color: #880e4f; border: 1px solid #f48fb1; }
+  .pill-early   { background: #e3f2fd; color: #1565c0; border: 1px solid #90caf9; }
+  .pill-crash   { background: #efebe9; color: #4e342e; border: 1px solid #bcaaa4; }
+
+  /* ── Output badges ── */
+  .output-badges { display: flex; gap: 10px; flex-wrap: wrap; }
+  .badge {
+    background: #fff; border-radius: 8px; padding: 8px 14px;
+    border: 1.5px solid #f48fb1; font-size: 11px; min-width: 160px;
+  }
+  .badge .badge-title {
+    font-weight: 700; font-size: 10px; color: #880e4f;
+    margin-bottom: 3px; text-transform: uppercase; letter-spacing: 0.5px;
+  }
+  .badge code {
+    display: block; font-family: "Cascadia Code",monospace; font-size: 10px;
+    background: #fce4ec; padding: 2px 5px; border-radius: 3px; margin-top: 2px; color: #6a0e30;
+  }
+
+  /* ── User input ── */
+  .user-input {
+    max-width: 900px; width: 100%; background: #fff; border: 2px solid #3949ab;
+    border-radius: 10px; padding: 12px 20px; display: flex; align-items: center; gap: 20px;
+  }
+  .user-input .ui-icon { font-size: 24px; }
+  .user-input .ui-text { font-size: 12px; }
+  .user-input .ui-text strong { font-size: 13px; }
+  .tag {
+    display: inline-block; background: #e8eaf6; color: #3949ab;
+    border-radius: 4px; padding: 2px 6px; font-size: 10px; font-weight: 600; margin: 2px 3px 2px 0;
+  }
+
+  /* profile-result box */
+  .profile-result {
+    background: #e0f7fa; border: 1.5px solid #80cbc4; border-radius: 6px;
+    padding: 6px 10px; font-size: 10.5px; color: #004d40; min-width: 190px;
+  }
+  .profile-result strong { font-size: 11px; display: block; margin-bottom: 3px; }
+</style>
+</head>
+<body>
+
+<h1>autoconfig — Skill Architecture</h1>
+<p class="subtitle">Profile-guided autonomous config search for WinApp developers</p>
+<div class="version-tag">v3 · 2026-06-17 · AgenticGPUOptimizer V2 patterns applied</div>
+
+<div class="diagram">
+
+  <!-- USER INPUT -->
+  <div class="user-input">
+    <div class="ui-icon">&#128100;</div>
+    <div class="ui-text">
+      <strong>User input</strong> &nbsp;&mdash;&nbsp;
+      Model ID &nbsp;+&nbsp; Target EP &nbsp;+&nbsp; Objective:
+      <span class="tag">accuracy-primary</span>
+      <span class="tag">latency-primary</span>
+      <span class="tag">Pareto</span>
+      &nbsp;+ optional budget / accuracy floor
+    </div>
+  </div>
+
+  <div class="arrow"><div class="arrow-head">&#8595;</div></div>
+
+  <div class="lifecycle-flow">
+  <div class="orchestrator-line"></div>
+  <div class="orchestrator-label">Orchestrator</div>
+
+  <!-- PHASE 0 -->
+  <div class="phase-row p0">
+    <div class="phase-label">Phase 0 · Intake</div>
+    <div class="phase-body" style="flex-wrap:nowrap;align-items:stretch">
+      <div class="box" style="flex:1">
+        <div class="box-title">Inspect</div>
+        <ul>
+          <li><code>winml inspect</code></li>
+          <li>EP availability check</li>
+          <li>Load <code>session.json</code> (crash-resume)</li>
+        </ul>
+      </div>
+      <div style="color:#aab;font-size:18px;display:flex;align-items:center">&#8594;</div>
+      <div class="box" style="flex:1">
+        <div class="box-title">Baseline Build</div>
+        <ul>
+          <li><code>winml build</code> (opset17, no quant)</li>
+          <li>Record baseline p50</li>
+        </ul>
+      </div>
+      <div style="color:#aab;font-size:18px;display:flex;align-items:center">&#8594;</div>
+      <div class="box" style="flex:1">
+        <div class="box-title">Correctness Contract</div>
+        <ul>
+          <li><code>winml eval --mode compare</code></li>
+          <li>Reference: original ONNX or HF PyTorch</li>
+          <li>Lock cosine similarity = 1.000</li>
+        </ul>
+      </div>
+    </div>
+  </div>
+
+  <div class="arrow"><div class="arrow-head">&#8595;</div></div>
+
+  <!-- PHASE 1 -->
+  <div class="phase-row p1">
+    <div class="phase-label">Phase 1 · Insight</div>
+    <div class="phase-body" style="flex-direction:column;gap:14px">
+
+      <div style="display:flex;gap:10px;width:100%;align-items:flex-start">
+        <div class="box" style="flex:1">
+          <div class="box-title">Runtime Profile</div>
+          <ul>
+            <li><code>winml perf --profile</code> (pending #158)</li>
+            <li>Per-op kernel time, bottleneck %</li>
+          </ul>
+        </div>
+        <div class="box" style="flex:1">
+          <div class="box-title">Static Analyzer</div>
+          <ul>
+            <li><code>winml analyze --ep &lt;ep&gt;</code></li>
+            <li>Conv% &rarr; npu-006 risk flag</li>
+            <li>Partial-support op list</li>
+          </ul>
+        </div>
+        <div class="box" style="flex:1">
+          <div class="box-title">Graph Analysis</div>
+          <ul>
+            <li>Op counts by type</li>
+            <li>Fusion opportunities</li>
+            <li>Static vs dynamic axes</li>
+          </ul>
+        </div>
+      </div>
+
+      <div style="display:flex;align-items:center;gap:8px;padding:0 4px">
+        <div style="flex:1;height:1.5px;background:#a5d6a7"></div>
+        <div style="font-size:11px;color:#2e7d32;font-weight:600">Insight Engine &mdash; hypothesis_pool (unfiltered candidates)</div>
+        <div style="flex:1;height:1.5px;background:#a5d6a7"></div>
+      </div>
+
+    </div>
+  </div>
+
+  <div class="arrow"><div class="arrow-head">&#8595;</div></div>
+
+  <!-- PHASE 2 -->
+  <div class="phase-row p2">
+    <div class="phase-label">Phase 2 · Opt Loop</div>
+    <div class="phase-body">
+      <div class="loop-container">
+
+        <div class="loop-agents">
+          <div class="experiment-loop">
+            <div class="experiment-loop-title">Experiment loop (until stop condition)</div>
+
+            <div class="loop-agent">
+              <div class="agent-title">Explorer</div>
+              <div style="display:flex;gap:8px;margin-bottom:8px">
+                <div class="box" style="flex:1;border-color:#80cbc4;background:#e0f7fa;padding:7px 10px">
+                  <div class="box-title" style="color:#004d40;font-size:10px;margin-bottom:2px">skip_set</div>
+                  <div style="font-size:10.5px;color:#004d40">KB hard-block pruning after hypothesis generation</div>
+                </div>
+                <div class="box" style="flex:1;border-color:#a5d6a7;background:#f1faf9;padding:7px 10px">
+                  <div class="box-title" style="color:#1b5e20;font-size:10px;margin-bottom:2px">priority_queue</div>
+                  <div style="font-size:10.5px;color:#1b5e20">Ranked hypotheses after pruning</div>
+                </div>
+              </div>
+              <ul>
+                <li>Skip completed iters from <code>session.json</code> <span class="new-badge">NEW</span></li>
+                <li>Load <code>hypothesis_pool</code> from Insight Engine</li>
+                <li>Apply KB hard blocks &rarr; <code>skip_set</code></li>
+                <li>Rank remaining hypotheses &rarr; <code>priority_queue</code></li>
+                <li>Pop next hypothesis from <code>priority_queue</code></li>
+                <li>Build <code>config.json</code> delta</li>
+              </ul>
+            </div>
+
+            <div class="rel-note">spawn per experiment &darr;</div>
+            <div class="mini-arrow">&#8595;</div>
+
+            <div class="loop-agent">
+              <div class="agent-title">Optimizer</div>
+              <ul>
+                <li><code>winml build -c config.json</code></li>
+                <li><strong>Phase A — screen</strong> (200 iters): CV gate for CPU/GPU; <em>disabled</em> for QNN NPU (DVFS)</li>
+                <li><strong>Early exit</strong> <span class="new-badge">NEW</span>: screen &Delta; &lt; 1% &rarr; DISCARD, skip full bench</li>
+                <li><strong>Phase B — full bench</strong> (3 &times; 1000 iters, 60s cool-down)</li>
+                <li><code>winml eval</code> &rarr; accuracy gate</li>
+              </ul>
+            </div>
+
+            <div class="rel-note data">benchmark + accuracy data &darr;</div>
+            <div class="mini-arrow">&#8595;</div>
+
+            <div class="loop-agent">
+              <div class="agent-title">Reviewer &mdash; ThroughputOnly <span class="new-badge">NEW</span></div>
+              <ul>
+                <li><code>threshold = max(1%, 2.0 &times; CV)</code></li>
+              </ul>
+              <div class="verdict-row">
+                <span class="pill pill-keep">KEEP &gt;1.5&times;thr</span>
+                <span class="pill pill-marginal">MARGINAL 1&times;&ndash;1.5&times;</span>
+                <span class="pill pill-discard">DISCARD</span>
+                <span class="pill pill-early">EARLY DISCARD</span>
+                <span class="pill pill-crash">ACC/BUILD FAIL</span>
+              </div>
+            </div>
+            <div class="loop-return"><span class="loop-icon">&#8634;</span> reviewer verdict / suggestions &rarr; back to Explorer (next iteration)</div>
+          </div>
+
+          <div class="loop-agent" style="border-color:#b2dfdb;background:#f1faf9">
+            <div class="agent-title" style="color:#006064">Crash-Resume <span class="new-badge">NEW</span></div>
+            <ul>
+              <li>Atomic write after every experiment</li>
+              <li>Stores: completed iters, baseline/best p50, discard counters</li>
+            </ul>
+          </div>
+
+        </div>
+
+        <div class="loop-side">
+          <div class="stop-box">
+            <div class="stop-title">Stop conditions</div>
+            <ul>
+              <li>Objective met</li>
+              <li>30 consecutive DISCARDs</li>
+              <li>Queue empty</li>
+              <li>User stops</li>
+            </ul>
+          </div>
+          <div class="stop-box" style="background:#fffde7;border-color:#ffe082">
+            <div class="stop-title" style="color:#f57f17">results.tsv</div>
+            config &middot; screen_p50 &middot; median_p50<br>
+            CV &middot; delta_pct &middot; status
+          </div>
+          <div class="stop-box" style="background:#f3f4fc;border-color:#c5cae9;margin-top:4px">
+            <div class="stop-title" style="color:#3949ab">session.json</div>
+            completed_iters<br>
+            baseline/best p50<br>
+            discard counters
+          </div>
+          <div class="stop-box" style="background:#fce4ec;border-color:#f48fb1;margin-top:4px">
+            <div class="stop-title" style="color:#880e4f">ep_knowledge/</div>
+            New entries as<br>
+            <strong>status="draft"</strong>
+          </div>
+        </div>
+
+      </div>
+    </div>
+  </div>
+
+  <div class="arrow"><div class="arrow-head">&#8595;</div></div>
+
+  <!-- PHASE 3 -->
+  <div class="phase-row p3">
+    <div class="phase-label">Phase 3 · Outcome</div>
+    <div class="phase-body">
+      <div class="output-badges">
+        <div class="badge">
+          <div class="badge-title">Champion Config</div>
+          Best config + provenance
+          <code>config_&lt;ep&gt;_optimal.json</code>
+        </div>
+        <div class="badge">
+          <div class="badge-title">HTML Report</div>
+          Chart + experiment table
+          <code>report.html</code>
+        </div>
+        <div class="badge">
+          <div class="badge-title">Experiment Artifacts</div>
+          Per-hypothesis logs
+          <code>experiments/&lt;n&gt;/</code>
+        </div>
+        <div class="badge">
+          <div class="badge-title">KB Draft Entry</div>
+          New findings, promoted after Gate 2
+          <code>ep_knowledge/&lt;ep&gt;.json</code>
+        </div>
+        <div class="badge">
+          <div class="badge-title">Feature Requirements</div>
+          Issues filed per finding
+          <code>#NNN &middot; &lt;feature gap title&gt;</code>
+        </div>
+      </div>
+    </div>
+  </div>
+  </div>
+
+  <!-- Footnote -->
+  <div style="max-width:900px;width:100%;margin-top:22px;font-size:11px;color:#778;line-height:1.8;border-top:1px dashed #ccd;padding-top:14px">
+    <strong>v3 · 2026-06-17:</strong>
+    ThroughputOnly verdict policy (threshold = max(1%, 2&times;CV));
+    screen early exit (&Delta;&lt;1% skips full bench, saves ~25&ndash;90 min);
+    crash-resume via atomic session.json.
+    &nbsp;&middot;&nbsp;
+    <strong>Key constraints:</strong>
+    npu-006 (Conv%&gt;20% &rarr; block conv fusions);
+    npu-007 (CV gate off on NPU);
+    cpu-001 (opset17 on CPU);
+    gpu-004 (no quant on QNN GPU).
+  </div>
+
+</div>
+</body>
+</html>
diff --git a/research/autoconfig/docs/cross-device-design.html b/research/autoconfig/docs/cross-device-design.html
new file mode 100644
index 000000000..04cf3cdf3
--- /dev/null
+++ b/research/autoconfig/docs/cross-device-design.html
@@ -0,0 +1,696 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>autoconfig Skill — Cross-Device / Cross-EP Auto-Config Design</title>
+<style>
+  * { box-sizing: border-box; margin: 0; padding: 0; }
+  body {
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+    background: #f4f6f9;
+    padding: 32px 24px;
+    color: #1a1a2e;
+    font-size: 13px;
+    line-height: 1.5;
+  }
+
+  /* ── Header ── */
+  h1 { font-size: 20px; font-weight: 700; color: #1a1a2e; margin-bottom: 4px; }
+  .subtitle { font-size: 12px; color: #666; margin-bottom: 6px; }
+  .tag {
+    display: inline-block; border-radius: 4px; padding: 2px 8px;
+    font-size: 10px; font-weight: 700; letter-spacing: 0.5px; margin-right: 6px; margin-bottom: 24px;
+  }
+  .tag-design  { background: #e8eaf6; color: #3949ab; }
+  .tag-status  { background: #fff3e0; color: #e65100; }
+  .tag-core    { background: #e0f2f1; color: #00695c; }
+
+  /* ── Section headers ── */
+  h2 {
+    font-size: 13px; font-weight: 700; color: #1a1a2e;
+    text-transform: uppercase; letter-spacing: 0.8px;
+    margin: 32px 0 14px;
+    padding-bottom: 6px;
+    border-bottom: 2px solid #e0e4ef;
+  }
+  h3 { font-size: 13px; font-weight: 700; color: #3949ab; margin: 18px 0 8px; }
+
+  /* ── Cards ── */
+  .card-grid { display: grid; gap: 12px; }
+  .card-grid-2 { grid-template-columns: 1fr 1fr; }
+  .card-grid-3 { grid-template-columns: 1fr 1fr 1fr; }
+  .card-grid-4 { grid-template-columns: 1fr 1fr 1fr 1fr; }
+
+  .card {
+    background: #fff; border-radius: 10px;
+    padding: 16px 18px;
+    border: 1px solid #e0e4ef;
+  }
+  .card.red    { border-left: 4px solid #ef5350; }
+  .card.orange { border-left: 4px solid #ff9800; }
+  .card.green  { border-left: 4px solid #43a047; }
+  .card.blue   { border-left: 4px solid #1976d2; }
+  .card.purple { border-left: 4px solid #7b1fa2; }
+  .card.teal   { border-left: 4px solid #00897b; }
+  .card.grey   { border-left: 4px solid #90a4ae; }
+
+  .card-title { font-size: 11px; font-weight: 700; text-transform: uppercase; letter-spacing: 0.6px; margin-bottom: 6px; }
+  .card.red    .card-title { color: #c62828; }
+  .card.orange .card-title { color: #e65100; }
+  .card.green  .card-title { color: #2e7d32; }
+  .card.blue   .card-title { color: #0d47a1; }
+  .card.purple .card-title { color: #6a1b9a; }
+  .card.teal   .card-title { color: #00695c; }
+  .card.grey   .card-title { color: #546e7a; }
+
+  .card p { font-size: 12px; color: #444; margin-bottom: 6px; }
+  .card p:last-child { margin-bottom: 0; }
+  .card ul { margin: 4px 0 0 16px; font-size: 12px; color: #444; }
+  .card li { margin-bottom: 3px; }
+
+  /* ── Tables ── */
+  table { width: 100%; border-collapse: collapse; font-size: 12px; }
+  th { background: #e8eaf6; color: #3949ab; font-weight: 700; text-align: left; padding: 8px 12px; font-size: 11px; text-transform: uppercase; letter-spacing: 0.5px; }
+  td { padding: 8px 12px; border-bottom: 1px solid #e0e4ef; vertical-align: top; }
+  tr:last-child td { border-bottom: none; }
+  tr:nth-child(even) td { background: #f9fafc; }
+
+  .sev-high { color: #c62828; font-weight: 700; }
+  .sev-med  { color: #e65100; font-weight: 700; }
+  .sev-low  { color: #2e7d32; font-weight: 700; }
+  .yes { color: #2e7d32; font-weight: 700; }
+  .no  { color: #c62828; font-weight: 700; }
+  .maybe { color: #e65100; font-weight: 700; }
+
+  /* ── Code blocks ── */
+  pre {
+    background: #1e1e2e; color: #cdd6f4;
+    border-radius: 8px; padding: 14px 16px;
+    font-size: 11.5px; line-height: 1.6;
+    overflow-x: auto; margin-top: 8px;
+  }
+  code { font-family: "Cascadia Code", "Fira Code", "Consolas", monospace; }
+  .inline-code { background: #eef0f8; color: #3949ab; border-radius: 4px; padding: 1px 5px; font-size: 11.5px; }
+  .c-kw { color: #cba6f7; }
+  .c-fn { color: #89b4fa; }
+  .c-st { color: #a6e3a1; }
+  .c-cm { color: #6c7086; font-style: italic; }
+  .c-nu { color: #fab387; }
+  .c-tp { color: #f5c2e7; }
+
+  /* ── Flow / fleet diagram ── */
+  .flow { display: flex; align-items: center; gap: 0; flex-wrap: wrap; margin-top: 10px; }
+  .flow-box {
+    background: #fff; border: 1px solid #c5cae9; border-radius: 8px;
+    padding: 10px 14px; font-size: 11px; text-align: center; min-width: 120px;
+  }
+  .flow-box strong { display: block; font-size: 11px; color: #1a1a2e; }
+  .flow-box span { font-size: 10px; color: #666; display: block; margin-top: 2px; }
+  .flow-box.new { border-color: #42a5f5; background: #e3f2fd; }
+  .flow-box.new strong { color: #0d47a1; }
+  .flow-arrow { font-size: 18px; color: #9fa8da; padding: 0 6px; }
+
+  .worker {
+    border-radius: 8px; padding: 10px 12px; font-size: 11px; text-align: center;
+    border: 1px solid #cfd8dc; background: #fff; min-width: 132px;
+  }
+  .worker strong { display: block; font-size: 11px; }
+  .worker span { font-size: 10px; color: #666; display: block; margin-top: 2px; }
+  .worker.npu  { border-top: 3px solid #ab47bc; }
+  .worker.gpu  { border-top: 3px solid #42a5f5; }
+  .worker.cpu  { border-top: 3px solid #90a4ae; }
+
+  /* ── Priority / pill labels ── */
+  .p0 { background: #ffebee; color: #c62828; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+  .p1 { background: #fff8e1; color: #e65100; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+  .p2 { background: #e8f5e9; color: #2e7d32; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+
+  .pill { display: inline-block; border-radius: 20px; padding: 1px 8px; font-size: 10px; font-weight: 700; letter-spacing: 0.3px; }
+  .pill-new   { background: #e3f2fd; color: #1565c0; }
+  .pill-exist { background: #e8f5e9; color: #2e7d32; }
+  .pill-mod   { background: #fff3e0; color: #e65100; }
+
+  .check { color: #43a047; font-weight: 700; }
+  .cross { color: #e53935; font-weight: 700; }
+
+  .note {
+    background: #fff9c4; border-left: 3px solid #f9a825;
+    border-radius: 0 8px 8px 0; padding: 10px 14px;
+    font-size: 12px; color: #555; margin-top: 12px;
+  }
+  .note strong { color: #f57f17; }
+  .note.teal { background: #e0f2f1; border-left-color: #00897b; }
+  .note.teal strong { color: #00695c; }
+
+  /* ── Scenario block ── */
+  .scenario {
+    background: #fff; border-radius: 10px; border: 1px solid #e0e4ef;
+    border-left: 4px solid #00897b; padding: 18px 20px; margin-top: 14px;
+  }
+  .scenario-q {
+    font-size: 14px; font-weight: 700; color: #00695c; margin-bottom: 4px;
+  }
+  .scenario-tag { font-size: 10px; font-weight: 700; color: #00897b; text-transform: uppercase; letter-spacing: 0.6px; }
+  .scenario h4 { font-size: 11px; font-weight: 700; text-transform: uppercase; letter-spacing: 0.5px; color: #455a64; margin: 12px 0 5px; }
+  .scenario p, .scenario li { font-size: 12px; color: #444; }
+  .scenario ul { margin: 0 0 0 16px; }
+  .scenario li { margin-bottom: 4px; }
+
+  /* ── Tabs ── */
+  .tabs { display: flex; gap: 4px; margin-bottom: 8px; border-bottom: 2px solid #e0e4ef; max-width: 1000px; flex-wrap: wrap; }
+  .tab-btn {
+    appearance: none; border: none; background: none; cursor: pointer;
+    font-family: inherit; font-size: 12px; font-weight: 700;
+    color: #888; padding: 9px 16px; border-radius: 8px 8px 0 0;
+    border-bottom: 2px solid transparent; margin-bottom: -2px; letter-spacing: 0.3px;
+  }
+  .tab-btn:hover { color: #00695c; background: #e0f2f1; }
+  .tab-btn.active { color: #00695c; border-bottom-color: #00897b; background: #fff; }
+  .tab-panel { display: none; }
+  .tab-panel.active { display: block; }
+  .tab-panel > h2:first-child { margin-top: 18px; }
+
+  @media (max-width: 760px) {
+    .card-grid-2, .card-grid-3, .card-grid-4 { grid-template-columns: 1fr; }
+  }
+</style>
+</head>
+<body>
+
+<h1>autoconfig Skill — Cross-Device / Cross-EP Auto-Config Design</h1>
+<p class="subtitle">Turning the single-machine sweep into a fleet-wide, multi-objective tuner — orchestrated through <code>winml serve</code></p>
+<span class="tag tag-design">DESIGN</span>
+<span class="tag tag-core">CORE IDEA: winml serve Phase 0 = the fleet worker (no new mode)</span>
+<span class="tag tag-status">PROPOSAL → V3 ROADMAP</span>
+
+<div class="tabs">
+  <button class="tab-btn active" data-tab="arch">1 · Problem &amp; Architecture</button>
+  <button class="tab-btn" data-tab="objective">2 · Objective &amp; Portability</button>
+  <button class="tab-btn" data-tab="scenarios">3 · User Scenarios</button>
+  <button class="tab-btn" data-tab="plan">4 · Plan &amp; Conclusions</button>
+</div>
+
+<!-- ═══════════════════ TAB 1: ARCHITECTURE ═══════════════════ -->
+<div class="tab-panel active" data-tab="arch">
+
+<h2>1 · The Gap — Single-Device Search Can't Answer Fleet Questions</h2>
+
+<p style="margin-bottom:14px; color:#444; max-width:980px;">
+Every sweep today (<code>catalog_qnn_sweep.py</code>, <code>catalog_gpu_sweep.py</code>, <code>catalog_cpu_sweep.py</code>)
+runs on <strong>one machine</strong> and optimizes for <strong>one (EP, device)</strong> pair. But a WinApp developer ships to
+<strong>millions of heterogeneous Windows devices</strong>. The champion config found on a Snapdragon X Elite NPU has
+<em>unknown</em> behavior on an Intel Lunar Lake NPU, an AMD Ryzen&nbsp;AI XDNA part, or a CPU-only budget laptop.</p>
+
+<table>
+  <thead><tr><th>Limitation Today</th><th>Why It Hurts</th><th>Severity</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><strong>One device per sweep</strong></td>
+      <td>Champion config is only validated where the sweep ran; portability is a guess.</td>
+      <td><span class="sev-high">CRITICAL</span></td>
+    </tr>
+    <tr>
+      <td><strong>No joint objective</strong></td>
+      <td>Running N independent sweeps yields N local optima — never a config that is <em>jointly</em> good across the fleet.</td>
+      <td><span class="sev-high">HIGH</span></td>
+    </tr>
+    <tr>
+      <td><strong>Can't co-locate hardware</strong></td>
+      <td>You cannot put a Qualcomm NPU, Intel NPU, and AMD NPU in one box. The search must reach across machines.</td>
+      <td><span class="sev-high">HIGH</span></td>
+    </tr>
+    <tr>
+      <td><strong>Serial wall-clock</strong></td>
+      <td>A 14-hypothesis sweep takes ~6–8 h on one device; covering the same matrix on a 3-device fleet one device at a time takes roughly 3× as long, because nothing runs in parallel.</td>
+      <td><span class="sev-med">MEDIUM</span></td>
+    </tr>
+    <tr>
+      <td><strong>KB has no device axis</strong></td>
+      <td><code>ep_knowledge/*.json</code> findings are implicitly "true on X Elite". There is no way to express "true across 3 NPU SKUs".</td>
+      <td><span class="sev-med">MEDIUM</span></td>
+    </tr>
+  </tbody>
+</table>
+
+<h2>2 · Core Idea — <code>winml serve</code> (Phase 0) <em>is already</em> the tune worker</h2>
+
+<p style="margin-bottom:12px; color:#444; max-width:980px;">
+<code>winml serve</code> (in <code>commands/serve.py</code>, currently in <code>_DISABLED_COMMANDS</code> — implemented, not yet shipped)
+has a <strong>Phase 0 mode</strong> that exposes <em>every winml command as an HTTP endpoint</em> — including <code>build</code>,
+<code>perf</code>, <code>eval</code>, and <code>sys</code>. Bound to <code>--host 0.0.0.0</code> it is reachable across machines.
+That is the whole worker: <strong>no new flag, no new RPC protocol</strong>. Each physical device in a lab/fleet just runs
+<code>winml serve --host 0.0.0.0</code>; a central <strong>orchestrator</strong> — a thin layer over the existing sweep loop —
+acts as an HTTP client, treating the fleet as one distributed benchmark backend.</p>
+
+<div class="note teal" style="max-width:1000px; margin-bottom:14px;">
+  <strong>Earlier drafts proposed a dedicated <code>winml serve --tune</code> mode. It was dropped</strong> — Phase 0 already
+  exposes build/perf/eval/sys over HTTP, <code>--idle-timeout 0</code> already keeps the session warm during a sweep, and
+  <code>--host 0.0.0.0</code> already opens the network. A <code>--tune</code> flag added surface area without adding capability.
+</div>
+
+<div class="card-grid card-grid-2" style="margin-bottom:14px;">
+  <div class="card red">
+    <div class="card-title">❌ Today: local CLI calls</div>
+    <p>The sweep calls <code>winml build</code> + <code>winml perf</code> as local subprocesses. The (EP, device) pair is whichever machine the sweep was launched on.</p>
+    <pre><code><span class="c-cm"># single host, single EP</span>
+b = <span class="c-fn">run_perf_session</span>(baseline)  <span class="c-cm"># local</span>
+h = <span class="c-fn">run_perf_session</span>(hyp)       <span class="c-cm"># local</span></code></pre>
+  </div>
+  <div class="card teal">
+    <div class="card-title">✅ Proposed: fan out to serve workers</div>
+    <p>The orchestrator dispatches the <strong>same</strong> job to every device that owns a relevant EP — in parallel — and collects a result vector keyed by device.</p>
+    <pre><code><span class="c-cm"># N hosts, N EPs, one round-trip</span>
+results = <span class="c-fn">fleet.bench</span>(baseline, hyp,
+            devices=[<span class="c-st">"npu-a"</span>,<span class="c-st">"npu-b"</span>,<span class="c-st">"npu-c"</span>])
+<span class="c-cm"># → {device: gain_vector}</span></code></pre>
+  </div>
+</div>
+
+<h3>Fleet topology</h3>
+<div style="background:#fff; border:1px solid #e0e4ef; border-radius:10px; padding:18px; max-width:1000px;">
+  <div class="flow" style="justify-content:center;">
+    <div class="flow-box new">
+      <strong>autoconfig orchestrator</strong>
+      <span>job scheduler · result aggregator<br>multi-device objective</span>
+    </div>
+  </div>
+  <div style="text-align:center; font-size:20px; color:#9fa8da; margin:6px 0;">↓ &nbsp; HTTP client calls to Phase 0 endpoints: <code>/build</code> · <code>/perf</code> · <code>/eval</code> · <code>/sys</code> &nbsp; ↓</div>
+  <div class="flow" style="justify-content:center; gap:12px;">
+    <div class="worker npu"><strong>winml serve --host 0.0.0.0</strong><span>Snapdragon X Elite<br>QNN · Hexagon HTP NPU</span></div>
+    <div class="worker npu"><strong>winml serve --host 0.0.0.0</strong><span>Intel Lunar Lake<br>OpenVINO · NPU</span></div>
+    <div class="worker npu"><strong>winml serve --host 0.0.0.0</strong><span>AMD Ryzen AI<br>Vitis AI · XDNA NPU</span></div>
+    <div class="worker gpu"><strong>winml serve --host 0.0.0.0</strong><span>discrete / iGPU<br>DML · GPU</span></div>
+    <div class="worker cpu"><strong>winml serve --host 0.0.0.0</strong><span>budget laptop<br>CPU only · 8 GB</span></div>
+  </div>
+  <p style="text-align:center; font-size:11px; color:#888; margin-top:10px;">
+    Both arms of each <strong>Paired A/B comparison run on the same worker</strong> (so DVFS/thermal drift cancels <em>per device</em>), and the worker returns only result JSON; raw weights never travel on the hot path.
+  </p>
+</div>
+
+<div class="note teal" style="max-width:1000px;">
+  <strong>Why this is cheap to build:</strong> the worker already exists (Phase 0 HTTP wrapper), and the orchestrator reuses
+  everything from <code>self-evolution-design.html</code> — Paired A/B bench, adaptive <code>n_sessions</code>, the confidence
+  ladder, champion-config output. The only genuinely new pieces are <em>(a)</em> the fleet client + scheduler,
+  <em>(b)</em> the multi-device objective function, and <em>(c)</em> a device axis on the KB. Networking is plumbing; the
+  intellectual work is the objective and the portability taxonomy (Tab&nbsp;2).
+</div>
+
+<h2>3 · Distributed Bench Protocol — Reuse Phase 0 Endpoints</h2>
+
+<p style="margin-bottom:10px; color:#444; max-width:980px;">
+No new protocol is invented. Phase 0 already maps each winml command to an HTTP endpoint; the orchestrator just calls them
+remotely instead of as local subprocesses. The mapping below is <strong>existing commands over the existing wrapper</strong> —
+the only orchestrator-side convention is that both arms of a <strong>Paired A/B</strong> comparison run on the same worker, so thermal/DVFS drift cancels per device.</p>
+
+<table>
+  <thead><tr><th>Phase 0 endpoint</th><th>Backing command</th><th>Used for</th><th>Notes</th></tr></thead>
+  <tbody>
+    <tr><td><code>GET /sys</code></td><td><span class="pill pill-exist">winml sys</span></td><td>Hardware fingerprint: EP list, NPU SKU, driver versions, RAM, ISA</td><td>Orchestrator builds the device matrix from these.</td></tr>
+    <tr><td><code>POST /build</code></td><td><span class="pill pill-exist">winml build</span></td><td>Compile a candidate config (opset, EP flags, graph passes, precision)</td><td>Artifact stays on the worker; orchestrator references it by output dir.</td></tr>
+    <tr><td><code>POST /perf</code></td><td><span class="pill pill-exist">winml perf</span></td><td>Latency distribution: p50/p90/p99, CV, CPU-fallback%</td><td>Orchestrator drives the A/B pairing by sequencing calls to a single worker; thermal/DVFS drift cancels on that device.</td></tr>
+    <tr><td><code>POST /eval</code></td><td><span class="pill pill-exist">winml eval</span></td><td>Accuracy, top-1/top-5 delta, cosine vs FP baseline</td><td>Mandatory before any cross-device recommendation.</td></tr>
+  </tbody>
+</table>
+
+<div class="note teal" style="max-width:1000px;">
+  <strong>Two small gaps worth noting</strong> (neither needs a new mode): <em>(1)</em> Phase 0 today exposes commands one call at a time —
+  a <em>config-hash artifact cache</em> on the worker (compile once, perf many) would be a nice server-side optimization but is not required for correctness;
+  <em>(2)</em> the Paired A/B <em>sequencing</em> lives in the orchestrator, which means it must trust the worker not to interleave other jobs mid-pair —
+  enforced by scheduling discipline (one bench job per worker at a time), not by a flag.
+</div>
+
+<div class="note" style="max-width:1000px;">
+  <strong>Weight transfer is off the hot path.</strong> The model is pushed to each worker <em>once</em> (or pulled from a shared
+  catalog URL); thereafter only config specs and result JSON cross the wire.
+</div>
+
+</div><!-- end arch -->
+
+<!-- ═══════════════════ TAB 2: OBJECTIVE & PORTABILITY ═══════════════════ -->
+<div class="tab-panel" data-tab="objective">
+
+<h2>4 · The Hard Part — Multi-Device Objective</h2>
+
+<p style="margin-bottom:12px; color:#444; max-width:980px;">
+Single-device search optimizes a scalar (latency). A fleet produces a <strong>metric vector per device</strong>
+— so "best" is no longer a single number. The orchestrator scores each candidate config <code>c</code> against
+the device set <code>D</code>, then selects by the aggregation strategy the user asked for.</p>
+
+<pre><code><span class="c-cm"># per device d, candidate config c → metric tuple</span>
+m(c, d) = (latency_p50_ms, accuracy_delta, peak_mem_mb, cpu_fallback_pct, portability_class)
+
+<span class="c-kw">def</span> <span class="c-fn">score</span>(c, D, strategy, constraints):
+    rows = [<span class="c-fn">fleet_metric</span>(c, d) <span class="c-kw">for</span> d <span class="c-kw">in</span> D]
+    <span class="c-cm"># hard gates first — any device violating a constraint disqualifies c</span>
+    <span class="c-kw">if</span> <span class="c-kw">any</span>(<span class="c-fn">violates</span>(r, constraints[d]) <span class="c-kw">for</span> r, d <span class="c-kw">in</span> <span class="c-fn">zip</span>(rows, D)):
+        <span class="c-kw">return</span> <span class="c-tp">DISQUALIFIED</span>
+    <span class="c-kw">if</span> strategy == <span class="c-st">"worst_case"</span>:                 <span class="c-cm"># Scenario A: best for N NPUs</span>
+        <span class="c-kw">return</span> <span class="c-fn">max</span>(r.latency_p50_ms <span class="c-kw">for</span> r <span class="c-kw">in</span> rows)  <span class="c-cm"># minimize the slowest device</span>
+    <span class="c-kw">if</span> strategy == <span class="c-st">"weighted"</span>:                   <span class="c-cm"># Scenario B: balance tiers (w = tier weights)</span>
+        <span class="c-kw">return</span> <span class="c-fn">sum</span>(w[d] * <span class="c-fn">norm</span>(r) <span class="c-kw">for</span> r, d <span class="c-kw">in</span> <span class="c-fn">zip</span>(rows, D))</code></pre>
+
+<div class="card-grid card-grid-3" style="margin-top:14px;">
+  <div class="card teal">
+    <div class="card-title">Worst-case (minimax)</div>
+    <p>Minimize the <strong>slowest</strong> device's latency. Guarantees a floor for every device in the set. Used for "best for N NPUs".</p>
+  </div>
+  <div class="card blue">
+    <div class="card-title">Weighted-sum</div>
+    <p>Tier weights (e.g. low-end 0.7 / high-end 0.3) with hard per-device constraints. Used for the balanced high/low scenario.</p>
+  </div>
+  <div class="card purple">
+    <div class="card-title">Pareto frontier</div>
+    <p>Return the non-dominated set over {perf, accuracy, mem} plus the <strong>knee point</strong>, so the agent can explain tradeoffs rather than hide them.</p>
+  </div>
+</div>
+
+<h2>5 · Config Portability Taxonomy</h2>
+
+<p style="margin-bottom:12px; color:#444; max-width:980px;">
+"One config for 3 NPUs" is ambiguous until you classify <em>what</em> is portable. Graph-level decisions travel; compiled
+vendor binaries do not. The orchestrator tags every champion with a portability class.</p>
+
+<table>
+  <thead><tr><th>Class</th><th>What's shared</th><th>Travels across…</th><th>Example</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><span class="pill pill-exist">PORTABLE</span></td>
+      <td>opset, graph passes, precision — pure ONNX-level decisions</td>
+      <td><span class="yes">any device + any EP</span></td>
+      <td>opset 21 NHWC-bypass (npu-001); w8a16 quantization</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-mod">EP-PORTABLE</span></td>
+      <td>Same EP family + flags; recompiled per device</td>
+      <td><span class="maybe">same EP vendor, different SKU</span></td>
+      <td>QNN HTP flags shared across two Hexagon SKUs</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-new">DEVICE-LOCKED</span></td>
+      <td>Compiled context binary / vendor blob</td>
+      <td><span class="no">one device + driver only</span></td>
+      <td>QNN context binary; OpenVINO compiled blob</td>
+    </tr>
+  </tbody>
+</table>
+
+<div class="note" style="max-width:1000px;">
+  <strong>Key consequence:</strong> for a heterogeneous-vendor NPU set (Qualcomm + Intel + AMD), a literal "single config" can only
+  be <span class="pill pill-exist">PORTABLE</span> — i.e. shared <em>graph-level</em> choices (opset, passes, precision) plus a
+  <strong>per-device EP selection</strong>. The orchestrator searches the portable dimensions <em>jointly</em> and locks the EP per device.
+  The deliverable is therefore <em>"one shared config + N compiled artifacts"</em>, not one binary.
+</div>
+
+<h2>6 · Cross-Device KB — Adding the Device Axis</h2>
+
+<p style="margin-bottom:10px; color:#444; max-width:980px;">
+The confidence ladder from <code>self-evolution-design.html</code> generalizes findings along the <em>architecture</em> axis
+(2+ models → arch rule). The fleet adds an orthogonal <strong>device</strong> axis: a finding confirmed on 3 NPU SKUs is a
+<em>device-general</em> rule.</p>
+
+<table>
+  <thead><tr><th>Field (new)</th><th>Meaning</th><th>Promotion gate</th></tr></thead>
+  <tbody>
+    <tr><td><code>device_scope</code></td><td>List of (EP, SKU, driver) where the finding holds</td><td>Recorded per confirming worker</td></tr>
+    <tr><td><code>device_general</code></td><td>Holds across ≥3 SKUs of the same EP class</td><td>≥3 <code>device_scope</code> entries, same verdict sign</td></tr>
+    <tr><td><code>cross_ep</code></td><td>Holds across ≥2 EP vendors (truly portable)</td><td>Confirmed on ≥2 distinct EP families</td></tr>
+  </tbody>
+</table>
+
+<div class="note teal" style="max-width:1000px;">
+  <strong>Payoff:</strong> once a rule is <code>cross_ep</code>, the orchestrator can <em>predict</em> it on an unseen device and
+  skip that hypothesis on new fleet members — the device-axis analogue of the L5 predictive tier. The fleet is what lets a
+  finding earn that scope in the first place.
+</div>
+
+</div><!-- end objective -->
+
+<!-- ═══════════════════ TAB 3: SCENARIOS ═══════════════════ -->
+<div class="tab-panel" data-tab="scenarios">
+
+<h2>7 · Marked User Scenarios</h2>
+<p style="margin-bottom:6px; color:#444; max-width:980px;">These are the two scenarios this design must serve directly. Both reduce to the same engine; they differ only in the <strong>strategy</strong> and <strong>constraints</strong> passed to <code>score()</code>.</p>
+
+<!-- Scenario A -->
+<div class="scenario">
+  <div class="scenario-tag">Scenario A · worst-case / minimax</div>
+  <div class="scenario-q">"Find me a config best for 3 NPUs."</div>
+
+  <h4>Fleet</h4>
+  <p>3 NPU workers, e.g. Snapdragon X Elite (QNN/Hexagon), Intel Lunar Lake (OpenVINO/NPU), and AMD Ryzen AI (Vitis AI/XDNA), each running <code>winml serve --host 0.0.0.0</code>.</p>
+
+  <h4>Objective</h4>
+  <ul>
+    <li><strong>strategy = worst_case</strong>: minimize the <em>slowest</em> NPU's p50 latency, so every device gets an acceptable floor.</li>
+    <li><strong>hard gate</strong>: <code>accuracy_delta ≤ ε</code> and <code>cpu_fallback_pct ≈ 0</code> on each NPU (a config that falls back to CPU on even one vendor is disqualified; cf. the npu-006 conv-fusion hazard).</li>
+  </ul>
+
+  <h4>How it runs</h4>
+  <ul>
+    <li>Orchestrator searches the <span class="pill pill-exist">PORTABLE</span> dimensions (opset, graph passes, precision) <strong>jointly</strong>; locks the best EP per NPU vendor.</li>
+    <li>Each candidate is fanned out to all 3 workers; the minimax score is the slowest of the 3.</li>
+    <li>Portability verdict surfaced: shared graph config travels, but each NPU keeps its own compiled artifact (<span class="pill pill-new">DEVICE-LOCKED</span> binary).</li>
+  </ul>
+
+  <h4>Output</h4>
+  <pre><code>champion (shared, PORTABLE):  opset=21, passes=[layout_bypass], precision=w8a16
+per-device EP:   { snapdragon: QNN-HTP, lunar_lake: OpenVINO-NPU, ryzen_ai: Vitis-XDNA }
+worst-case p50:  14.9 ms   (slowest = ryzen_ai)
+per-device p50:  { snapdragon: 11.2, lunar_lake: 13.4, ryzen_ai: 14.9 } ms
+accuracy_delta:  all ≤ 0.4%   ✔ gate passed
+artifacts:       3 compiled bundles (one per NPU)</code></pre>
+</div>
+
+<!-- Scenario B -->
+<div class="scenario" style="border-left-color:#1976d2;">
+  <div class="scenario-tag" style="color:#1565c0;">Scenario B · constrained weighted-sum</div>
+  <div class="scenario-q" style="color:#0d47a1;">"Find me a config that balances high-end and low-end machines — perf / accuracy / memory requirements are xxx."</div>
+
+  <h4>Fleet (spans tiers)</h4>
+  <p><strong>High-end:</strong> NPU + GPU, 32 GB. &nbsp;<strong>Low-end:</strong> CPU-only, 8 GB. Both running <code>winml serve --host 0.0.0.0</code>.</p>
+
+  <h4>Constraints (the "xxx" — supplied by the developer)</h4>
+  <table style="margin-top:4px;">
+    <thead><tr><th>Requirement</th><th>Low-end (binding)</th><th>High-end</th></tr></thead>
+    <tbody>
+      <tr><td>perf (p50)</td><td><code>≤ 30 ms</code></td><td><code>≤ 10 ms</code></td></tr>
+      <tr><td>accuracy_delta</td><td><code>≤ 1%</code></td><td><code>≤ 1%</code></td></tr>
+      <tr><td>peak memory</td><td><code>≤ 2 GB</code></td><td><code>≤ 8 GB</code></td></tr>
+    </tbody>
+  </table>
+
+  <h4>How it runs</h4>
+  <ul>
+    <li><strong>strategy = weighted</strong> with tier weights; the low-end device is usually the <em>binding</em> constraint.</li>
+    <li>Constrained multi-objective: satisfy every hard gate on the low-end box first, then maximize high-end perf within the feasible set.</li>
+    <li>The orchestrator may discover that <strong>no single config</strong> satisfies both — in which case it returns a <strong>tier-conditional config map</strong> (one config per tier) with the EP fallback chain, not a forced compromise.</li>
+  </ul>
+
+  <h4>Output — two shapes, agent picks the honest one</h4>
+  <div class="card-grid card-grid-2" style="margin-top:6px;">
+    <div class="card green">
+      <div class="card-title">Single balanced config (if feasible)</div>
+      <pre><code>config: opset=21, w8a16, layout_bypass
+low-end CPU:  p50 27 ms · mem 1.6 GB · Δacc 0.6%  ✔
+high-end NPU: p50  8 ms · mem 2.1 GB · Δacc 0.6%  ✔
+binding:  low-end perf (27/30 ms)</code></pre>
+    </div>
+    <div class="card orange">
+      <div class="card-title">Tier-conditional map (if not)</div>
+      <pre><code>low-end:  { ep: CPU,  precision: w8a16, opset: 21 }
+high-end: { ep: QNN,  precision: fp16,  opset: 21 }
+fallback chain:  QNN → DML → CPU
+note: no single config meets the 2 GB cap
+      on low-end at fp16 — split required.</code></pre>
+    </div>
+  </div>
+  <p style="margin-top:8px; font-size:12px; color:#444;">The tier-conditional path connects directly to the <strong>Cross-Device Confidence Agent</strong> in <code>agent-design.md §3.3</code> — same EP-fallback-chain concept, now produced from real fleet measurements instead of reasoning.</p>
+</div>
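+<p style="margin-top:8px; font-size:12px; color:#444; max-width:1000px;">The feasibility check and the &ldquo;binding&rdquo; line in the single balanced config card can be sketched in Python. This is an illustrative sketch, assuming hypothetical helper names (<code>violations</code>, <code>binding_constraint</code>, <code>weighted_p50</code>) and hypothetical metric keys; the limits come from the constraints table and the measurements from the card above, while the tier weights passed to <code>weighted_p50</code> are left to the caller because the document does not fix them.</p>
```python
# Limits from the Scenario B constraints table (lower is better for every metric).
CONSTRAINTS = {
    "low-end": {"p50_ms": 30.0, "accuracy_delta_pct": 1.0, "peak_mem_gb": 2.0},
    "high-end": {"p50_ms": 10.0, "accuracy_delta_pct": 1.0, "peak_mem_gb": 8.0},
}


def violations(measured, limits):
    """Names of the metrics whose measured value exceeds its limit."""
    return [metric for metric, limit in limits.items() if measured[metric] > limit]


def binding_constraint(per_tier, constraints):
    """Return (tier, metric, utilization) for the tightest limit.

    Returns None when any tier violates a hard constraint, which is the case
    where a tier-conditional map is needed instead of a single config.
    """
    tightest = None
    for tier, limits in constraints.items():
        if violations(per_tier[tier], limits):
            return None
        for metric, limit in limits.items():
            used = per_tier[tier][metric] / limit
            if tightest is None or used > tightest[2]:
                tightest = (tier, metric, used)
    return tightest


def weighted_p50(per_tier, weights):
    """Weighted mean of per-tier p50; the weights are supplied by the caller."""
    total = sum(weights.values())
    return sum(w * per_tier[tier]["p50_ms"] for tier, w in weights.items()) / total


# Measurements from the "Single balanced config" card.
balanced = {
    "low-end": {"p50_ms": 27.0, "accuracy_delta_pct": 0.6, "peak_mem_gb": 1.6},
    "high-end": {"p50_ms": 8.0, "accuracy_delta_pct": 0.6, "peak_mem_gb": 2.1},
}
tier, metric, used = binding_constraint(balanced, CONSTRAINTS)
print(tier, metric, round(used, 2))
```
+<p style="margin-top:8px; font-size:12px; color:#444; max-width:1000px;">For the card's numbers the tightest limit is low-end p50 at 27/30 ms, which agrees with the reported binding constraint. Checking feasibility before scoring keeps the weighted sum from trading a hard-constraint violation on one tier for speed on another.</p>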
+
+<div class="note teal" style="max-width:1000px;">
+  <strong>One engine, two questions.</strong> Scenario A is <code>strategy=worst_case</code> over a homogeneous-role (all-NPU) set;
+  Scenario B is <code>strategy=weighted</code> + hard constraints over a heterogeneous-tier set. Nothing in the search loop,
+  bench protocol, or KB changes between them — only the objective arguments.
+</div>
+
+<h2>8 · Skill Outcome — Possible Results &amp; Trade-offs</h2>
+
+<p style="margin-bottom:10px; color:#444; max-width:1000px;">
+The skill does <strong>not</strong> silently pick one answer shape. It runs the fleet, then presents <em>all</em> outcomes that
+satisfy the hard constraints, with the trade-offs spelled out — <strong>the user makes the final call.</strong> "Best" is a
+judgement (simplicity vs peak perf vs maintenance cost) that only the developer can make.</p>
+
+<table>
+  <thead><tr>
+    <th>Outcome shape</th><th>What you ship</th><th>Pros</th><th>Cons</th><th>Best when</th>
+  </tr></thead>
+  <tbody>
+    <tr>
+      <td><span class="pill pill-exist">Single portable config</span></td>
+      <td>1 shared graph config (opset/passes/precision) + per-device EP + N compiled artifacts</td>
+      <td class="yes">One source of truth; one config to reason about; portable across the set</td>
+      <td class="no">Compromises perf on every device; still N binaries to build/store</td>
+      <td>Homogeneous-role fleet (e.g. all NPUs); you want simplicity over peak perf</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-exist">Single balanced config</span></td>
+      <td>1 config that meets every hard constraint across all tiers</td>
+      <td class="yes">Simplest possible — one config, possibly one artifact path if EP shared</td>
+      <td class="no">Often optimal on <em>no</em> device; may be <strong>infeasible</strong> under tight constraints</td>
+      <td>Tiers share an EP and constraints are loose enough to leave slack</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-mod">Tier-conditional map</span></td>
+      <td>One config per tier + EP fallback chain (QNN → DML → CPU)</td>
+      <td class="yes">Each tier near-optimal; honest when no single config can satisfy all</td>
+      <td class="maybe">More configs to maintain/test; runtime needs device→tier routing logic</td>
+      <td>Wide tier spread (NPU↔CPU) and/or tight perf/memory budgets</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-new">Pareto frontier set</span></td>
+      <td>The non-dominated set over {perf, accuracy, mem} + the knee point</td>
+      <td class="yes">Full visibility into every trade-off; nothing hidden</td>
+      <td class="maybe">No single answer — the user must choose; more to digest</td>
+      <td>Priorities are unclear up front; you want to explore before committing</td>
+    </tr>
+    <tr>
+      <td><span class="pill pill-new">Per-device champion</span></td>
+      <td>Independent best config for each device (N configs, no sharing)</td>
+      <td class="yes">Max achievable perf on every single device</td>
+      <td class="no">Worst portability; N configs + N artifacts to manage; no shared story</td>
+      <td>You already ship per-SKU builds, so maintenance cost is already paid</td>
+    </tr>
+  </tbody>
+</table>
+
+<div class="note" style="max-width:1000px;">
+  <strong>The skill's job is to make the trade-off legible, not to decide it.</strong> Every row above comes with measured
+  per-device numbers (latency, accuracy delta, peak mem, portability class) so the developer chooses with evidence — e.g.
+  <em>"the single portable config is 18% slower on the fast NPU but saves me a second build pipeline — I'll take it."</em>
+</div>
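+<p style="margin-top:8px; font-size:12px; color:#444; max-width:1000px;">The &ldquo;Pareto frontier set&rdquo; row refers to the non-dominated set over {perf, accuracy, mem}. A minimal Python sketch of that filter follows, assuming all three metrics are lower-is-better; <code>dominates</code> and <code>pareto_front</code> are hypothetical names, the candidate numbers are illustrative and not measurements, and knee-point selection is not shown.</p>
```python
def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one.

    All metrics are treated as lower-is-better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the sorted names of candidates that no other candidate dominates.

    points maps a config name to (p50_ms, accuracy_delta_pct, peak_mem_gb).
    """
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# Illustrative values only (not taken from any sweep).
candidates = {
    "fp32": (30.0, 0.0, 4.0),
    "fp16": (18.0, 0.2, 2.4),
    "w8a16": (12.0, 0.6, 1.6),
    "w8a16_slow": (14.0, 0.6, 1.8),
}
print(pareto_front(candidates))
```
+<p style="margin-top:8px; font-size:12px; color:#444; max-width:1000px;">Here <code>w8a16_slow</code> is dropped because <code>w8a16</code> is at least as good on every metric, while <code>fp32</code>, <code>fp16</code> and <code>w8a16</code> all stay on the frontier because each wins on at least one axis.</p>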
+
+</div><!-- end scenarios -->
+
+<!-- ═══════════════════ TAB 4: PLAN & CONCLUSIONS ═══════════════════ -->
+<div class="tab-panel" data-tab="plan">
+
+<h2>9 · Implementation Plan</h2>
+
+<table>
+  <thead><tr><th>Priority</th><th>Component</th><th>File(s)</th><th>Status</th><th>Key change</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Enable <code>winml serve</code> Phase 0</td>
+      <td><code>cli.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Remove <code>serve</code> from <code>_DISABLED_COMMANDS</code>; verify that the build/perf/eval/sys endpoints are reachable over <code>0.0.0.0</code></td>
+    </tr>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Fleet client + scheduler</td>
+      <td><code>fleet.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Register workers from <code>/sys</code>, fan out jobs, collect result vectors</td>
+    </tr>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Multi-device objective</td>
+      <td><code>fleet_objective.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td><code>score(c, D, strategy, constraints)</code> → worst_case / weighted / Pareto</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Reuse Paired A/B per worker</td>
+      <td><code>sweep_utils.py</code> <span class="pill pill-exist">REUSE</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Worker runs existing Paired A/B locally; orchestrator only aggregates</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Portability classifier</td>
+      <td><code>fleet_objective.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Tag champion PORTABLE / EP-PORTABLE / DEVICE-LOCKED; emit per-device artifacts</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Device axis in KB</td>
+      <td><code>ep_knowledge/*.json</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Add <code>device_scope · device_general · cross_ep</code>; promote_findings reads them</td>
+    </tr>
+    <tr>
+      <td><span class="p2">P2</span></td>
+      <td>Tier-conditional config map output</td>
+      <td><code>fleet.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>When no single config is feasible, emit per-tier map + EP fallback chain</td>
+    </tr>
+    <tr>
+      <td><span class="p2">P2</span></td>
+      <td>Cross-device prediction</td>
+      <td><code>analyze_insight.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Use <code>cross_ep</code> rules to skip hypotheses on new fleet members</td>
+    </tr>
+  </tbody>
+</table>
+
+<h2>10 · Conclusions</h2>
+
+<div class="card-grid card-grid-2">
+  <div class="card teal">
+    <div class="card-title">✅ winml serve makes this a small build</div>
+    <p>The worker already exists (Phase 0 HTTP wrapper) and the orchestrator reuses Paired A/B, adaptive sampling, the confidence ladder, and champion output. The only new code is a fleet client plus an objective function.</p>
+  </div>
+  <div class="card orange">
+    <div class="card-title">⚠ The hard part is the objective, not the network</div>
+    <p>Networking is plumbing. The real design work is the multi-device objective (minimax vs weighted vs Pareto) and the portability taxonomy that decides what "one config" even means across vendors.</p>
+  </div>
+  <div class="card blue">
+    <div class="card-title">📦 Two honest deliverable shapes</div>
+    <p>A fleet answer is either <strong>one shared (portable) config + N compiled artifacts</strong>, or a <strong>tier-conditional config map</strong> with an EP fallback chain. Forcing a single binary across vendors is dishonest — the system must be allowed to return a map.</p>
+  </div>
+  <div class="card purple">
+    <div class="card-title">🧠 The fleet is what grows the KB's device axis</div>
+    <p>Only a multi-device fleet can promote a finding from "true on X Elite" to <code>cross_ep</code>. That scope, in turn, lets the system <em>predict</em> behavior on unseen devices — closing the loop with the Cross-Device Confidence Agent.</p>
+  </div>
+</div>
+
+<div class="note teal" style="max-width:1000px; margin-top:16px;">
+  <strong>Bottom line:</strong> <code>winml serve</code> (Phase 0) + a fleet orchestrator + a multi-device objective converts autoconfig
+  from a single-device tuner into a fleet tuner that answers the two questions developers actually ask — <em>"best for N NPUs"</em>
+  and <em>"balance my high-end and low-end fleet under perf/accuracy/memory budgets"</em> — with measured, auditable, per-device evidence.
+</div>
+
+<h3>Honest limitations</h3>
+<ul style="margin:8px 0 0 18px; font-size:12px; color:#444; max-width:1000px;">
+  <li><strong>DVFS is still per-device.</strong> Thermal noise doesn't disappear; each worker must still run the full Paired A/B protocol locally, so wall-clock is bounded by the slowest/noisiest device.</li>
+  <li><strong>Heterogeneous-vendor "single config" is a graph-level config, not a binary.</strong> Compiled NPU blobs are <span class="pill pill-new">DEVICE-LOCKED</span> by construction.</li>
+  <li><strong>Fleet provisioning is a real cost.</strong> The value of cross-device search scales with how representative the fleet is of the shipping device population. A 3-device fleet generalizes better than a single device, but it is still only a sample of that population.</li>
+  <li><strong>Driver/EP version drift</strong> across workers can confound results; <code>/sys</code> fingerprints must be recorded with every finding so the KB stays honest.</li>
+</ul>
+
+</div><!-- end plan -->
+
+<br><br>
+<div style="font-size:10px; color:#aaa; text-align:right;">Generated 2026-06-18 · research/autoconfig/docs/cross-device-design.html</div>
+
+<script>
+  document.querySelectorAll('.tab-btn').forEach(function (btn) {
+    btn.addEventListener('click', function () {
+      var tab = btn.dataset.tab;
+      document.querySelectorAll('.tab-btn').forEach(function (b) {
+        b.classList.toggle('active', b === btn);
+      });
+      document.querySelectorAll('.tab-panel').forEach(function (p) {
+        p.classList.toggle('active', p.dataset.tab === tab);
+      });
+    });
+  });
+</script>
+
+</body>
+</html>
diff --git a/research/autoconfig/docs/ep-findings-summary.html b/research/autoconfig/docs/ep-findings-summary.html
new file mode 100644
index 000000000..26e5d6ce6
--- /dev/null
+++ b/research/autoconfig/docs/ep-findings-summary.html
@@ -0,0 +1,941 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>WinML EP Findings — Validated Catalog</title>
+<style>
+  * { box-sizing: border-box; margin: 0; padding: 0; }
+  body {
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+    background: #f4f6f9;
+    padding: 32px 24px;
+    color: #1a1a2e;
+    font-size: 13px;
+  }
+  h1 { font-size: 20px; font-weight: 700; margin-bottom: 4px; }
+  .meta { font-size: 11px; color: #778; margin-bottom: 28px; line-height: 1.6; }
+
+  .ep-section { margin-bottom: 32px; }
+  .ep-header {
+    display: flex; align-items: center; gap: 12px;
+    padding: 10px 18px; border-radius: 10px 10px 0 0;
+    font-weight: 700; font-size: 13px;
+  }
+  .ep-body { border-radius: 0 0 10px 10px; overflow: hidden; }
+
+  .npu  .ep-header { background: #e8f5e9; color: #1b5e20; border: 1.5px solid #a5d6a7; border-bottom: none; }
+  .npu  .ep-body   { border: 1.5px solid #a5d6a7; border-top: none; }
+  .cpu  .ep-header { background: #e8eaf6; color: #3949ab; border: 1.5px solid #9fa8da; border-bottom: none; }
+  .cpu  .ep-body   { border: 1.5px solid #9fa8da; border-top: none; }
+  .dml  .ep-header { background: #e3f2fd; color: #1565c0; border: 1.5px solid #90caf9; border-bottom: none; }
+  .dml  .ep-body   { border: 1.5px solid #90caf9; border-top: none; }
+  .gpu  .ep-header { background: #fff3e0; color: #e65100; border: 1.5px solid #ffcc80; border-bottom: none; }
+  .gpu  .ep-body   { border: 1.5px solid #ffcc80; border-top: none; }
+
+  .single-model { display: none; }
+  body.show-single .single-model { display: grid; }
+  .single-model .find-id { color: #c5cae9; }
+
+  .sm-divider {
+    display: flex; align-items: center; gap: 10px;
+    padding: 6px 16px;
+    background: #fafafa; border-top: 1px solid #eef;
+    font-size: 10px; color: #aab;
+  }
+  .sm-divider .sm-count {
+    background: #f0f2f5; color: #778; border-radius: 8px;
+    padding: 1px 7px; font-weight: 700;
+  }
+  body.show-single .sm-divider { display: none; }
+
+  .toggle-bar {
+    display: flex; align-items: center; gap: 12px;
+    max-width: 900px; margin: 0 0 20px 0;
+  }
+  .toggle-btn {
+    padding: 6px 14px; border-radius: 20px; border: 1.5px solid #9fa8da;
+    background: #e8eaf6; color: #3949ab; font-size: 11px; font-weight: 700;
+    cursor: pointer; transition: background 0.15s;
+  }
+  .toggle-btn:hover { background: #c5cae9; }
+  .toggle-note { font-size: 11px; color: #aab; }
+
+  .finding {
+    display: grid;
+    grid-template-columns: 28px 70px 1fr 220px;
+    gap: 0;
+    background: #fff;
+    border-bottom: 1px solid #eef;
+    align-items: stretch;
+    min-height: 52px;
+  }
+  .finding:last-child { border-bottom: none; }
+  .finding:hover { background: #f8f9fd; }
+
+  .find-id {
+    display: flex; align-items: center; justify-content: center;
+    font-size: 9px; font-weight: 700; color: #aab;
+    padding: 0 4px;
+    border-right: 1px solid #eef;
+    writing-mode: vertical-rl;
+    letter-spacing: 1px;
+  }
+  .find-conf {
+    display: flex; align-items: center; justify-content: center;
+    padding: 8px 10px; border-right: 1px solid #eef; min-width: 70px;
+  }
+  .find-body { padding: 10px 14px; }
+  .find-title { font-weight: 700; font-size: 12px; margin-bottom: 3px; }
+  .find-detail { font-size: 11px; color: #556; line-height: 1.5; }
+  .find-detail code {
+    font-family: "Cascadia Code","Fira Code",monospace; font-size: 10px;
+    background: #f0f2f5; padding: 1px 4px; border-radius: 3px; color: #2d4a8a;
+  }
+  .find-detail .data { color: #2e7d32; font-weight: 600; }
+  .find-detail .warn { color: #c62828; font-weight: 600; }
+  .find-action { padding: 10px 14px; min-width: 180px; border-left: 1px solid #eef; }
+  .find-action .action-label {
+    font-size: 9px; text-transform: uppercase; letter-spacing: 0.8px;
+    color: #aab; font-weight: 700; margin-bottom: 4px;
+  }
+  .find-action .action-text { font-size: 11px; color: #334; line-height: 1.5; }
+
+  .conf {
+    padding: 3px 7px; border-radius: 10px; font-size: 9px; font-weight: 700;
+    letter-spacing: 0.4px; text-align: center; line-height: 1.3;
+  }
+  .conf-high     { background: #e8f5e9; color: #1b5e20; border: 1px solid #a5d6a7; }
+  .conf-medium   { background: #fff8e1; color: #f57f17; border: 1px solid #ffe082; }
+  .conf-low      { background: #fce4ec; color: #880e4f; border: 1px solid #f48fb1; }
+
+  .scope {
+    display: inline-block; background: #f0f2f5; color: #556;
+    border-radius: 4px; padding: 1px 5px; font-size: 10px;
+    margin-top: 3px; font-style: italic;
+  }
+
+  .fr-section { margin-top: 40px; }
+  .fr-header {
+    font-weight: 700; font-size: 14px; margin-bottom: 12px;
+    padding-bottom: 6px; border-bottom: 2px solid #dde;
+    display: flex; align-items: center; gap: 8px;
+  }
+  .fr-table { width: 100%; border-collapse: collapse; }
+  .fr-table th {
+    text-align: left; font-size: 11px; text-transform: uppercase; letter-spacing: 0.8px;
+    color: #778; font-weight: 700; padding: 8px 12px;
+    border-bottom: 2px solid #dde; background: #f4f6f9;
+  }
+  .fr-table td {
+    padding: 10px 12px; border-bottom: 1px solid #eef;
+    vertical-align: top; font-size: 12px;
+  }
+  .fr-table tr:hover td { background: #f8f9fd; }
+  .issue-link {
+    display: inline-block; background: #e8eaf6; color: #3949ab;
+    border-radius: 4px; padding: 2px 7px; font-size: 11px; font-weight: 700;
+    text-decoration: none;
+  }
+  .issue-link:hover { background: #c5cae9; }
+  .no-issue {
+    display: inline-block; background: #f5f5f5; color: #888;
+    border-radius: 4px; padding: 2px 7px; font-size: 11px; font-style: italic;
+  }
+  .pri { padding: 2px 8px; border-radius: 10px; font-size: 10px; font-weight: 700; }
+  .pri-p0 { background: #fce4ec; color: #880e4f; border: 1px solid #f48fb1; }
+  .pri-p1 { background: #fff3e0; color: #e65100; border: 1px solid #ffcc80; }
+  .pri-p2 { background: #e8eaf6; color: #3949ab; border: 1px solid #9fa8da; }
+  .pri-p3 { background: #f5f5f5; color: #666; border: 1px solid #ddd; }
+  .fr-table td:first-child { font-weight: 600; min-width: 120px; }
+  .blocked { color: #c62828; font-size: 11px; }
+
+  .stats-bar {
+    display: flex; gap: 20px; margin-bottom: 28px;
+    padding: 14px 18px; background: #fff; border-radius: 10px;
+    border: 1.5px solid #dde;
+  }
+  .stat { text-align: center; }
+  .stat-n { font-size: 22px; font-weight: 800; color: #1a1a2e; line-height: 1; }
+  .stat-l { font-size: 10px; color: #778; margin-top: 3px; text-transform: uppercase; letter-spacing: 0.6px; }
+
+  .caveat {
+    background: #fff8e1; border: 1.5px solid #ffe082; border-radius: 8px;
+    padding: 10px 16px; font-size: 11px; color: #6d4c41; margin-bottom: 20px;
+    line-height: 1.6;
+  }
+  .caveat strong { color: #e65100; }
+
+  .in-progress-banner {
+    background: #e3f2fd; border: 1.5px solid #90caf9; border-radius: 6px;
+    padding: 7px 14px; font-size: 11px; color: #0d47a1; margin: 6px 0 0 0;
+    line-height: 1.5;
+  }
+</style>
+</head>
+<body>
+
+<h1>WinML EP Findings &mdash; Validated Catalog</h1>
+<p class="meta">
+  Hardware: Snapdragon X Elite CRD &nbsp;|&nbsp; ORT: 1.24.5 (onnxruntime-windowsml) &nbsp;|&nbsp;
+  QNN SDK: Hexagon HTP (NPU) + Adreno X1-85 (GPU) &nbsp;|&nbsp;
+  Last updated: 2026-06-22 &nbsp;|&nbsp; 15 models (QNN NPU), 8 models (QNN GPU), 4 models (CPU partial) &nbsp;|&nbsp; npu-001 revised: MobileViT opset21 NEUTRAL on clean rerun
+</p>
+
+<div class="stats-bar">
+  <div class="stat"><div class="stat-n">31</div><div class="stat-l">total findings</div></div>
+  <div class="stat"><div class="stat-n" style="color:#1b5e20">11</div><div class="stat-l">visible (multi-model / cross-EP)</div></div>
+  <div class="stat"><div class="stat-n" style="color:#9fa8da">20</div><div class="stat-l">hidden (single-model)</div></div>
+  <div class="stat"><div class="stat-n">15</div><div class="stat-l">NPU models tested</div></div>
+  <div class="stat"><div class="stat-n">8</div><div class="stat-l">GPU models tested</div></div>
+  <div class="stat"><div class="stat-n">8</div><div class="stat-l">feature requests</div></div>
+</div>
+
+<div class="caveat">
+  <strong>Scope warning:</strong> All findings come from one hardware device (Snapdragon X Elite CRD, Oryon CPU + Adreno X1-85 + Hexagon HTP NPU).
+  The DML EP is not available on this device (package conflict with onnxruntime-windowsml).
+  QNN NPU: 15 models tested (8 catalog + 3 recipe + 4 validation). QNN GPU: 8 models (full catalog sweep 2026-06-18).
+  CPU: partial sweep (4/8 models done: ResNet-18/MobileViT/DINOv2/rad-dino; BERT/NLP in progress as of 2026-06-18).
+  Always re-validate on new model architectures before using these findings to prune the search space.
+</div>
+
+<div class="toggle-bar">
+  <button class="toggle-btn" onclick="document.body.classList.toggle('show-single');this.textContent=document.body.classList.contains('show-single')?'Hide single-model findings':'Show single-model findings'">
+    Show single-model findings
+  </button>
+  <button class="toggle-btn" id="lowconf-btn" onclick="toggleLowConf()">
+    Show low-confidence findings
+  </button>
+  <span class="toggle-note">Showing only multi-model / cross-EP findings by default. Single-model and <strong>LOW-confidence</strong> findings are hidden.</span>
+</div>
+
+<!-- QNN NPU -->
+<div class="ep-section npu">
+  <div class="ep-header">
+    QNN NPU &nbsp;&mdash;&nbsp; Hexagon HTP (Snapdragon X Elite)
+    &nbsp;<span style="font-weight:400;font-size:11px;color:#388e3c">15 models, 3&times;500-iter sessions, 30s cool-down | h0-h10 hypotheses (catalog_qnn_sweep.py)</span>
+  </div>
+  <div class="ep-body">
+
+    <div class="finding">
+      <div class="find-id">npu-006</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">Full conv fusion pack causes catastrophic CPU fallback on Conv-dominant models (~130x regression)</div>
+        <div class="find-detail">
+          ResNet-18 full pack (<code>conv-bn + conv-add + conv-activation fusions</code>):
+          <span class="data">3-session p50 = [132, 135, 131]ms</span> vs baseline ~1ms.
+          <span class="warn">~130x regression, near-zero CV = deterministic CPU fallback.</span>
+          DINOv2-base (Conv%&lt;1%): fusion is neutral.
+          ORT <code>FusedConv</code> op produced by full pack is not dispatchable by QNN EP.
+          <strong>Refinement 2026-06-17:</strong> <code>conv_add_fusion</code> alone (h10 ResNet-18) = <span class="data">+0.93% NEUTRAL</span>. Regression requires full pack that creates FusedConv.
+          <div><span class="scope">Scope: Conv-dominant models when Conv% &gt; 20%. Not applicable to transformer or NLP models.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          <strong>Hard-block</strong> full conv fusion pack when Conv% &gt; 20%.
+          <code>conv_add_fusion</code> alone is safe.
+          Gate in <code>catalog_qnn_sweep.py</code> via <code>count_conv_pct()</code>.
+        </div>
+      </div>
+    </div>
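+    <p class="find-detail" style="padding:10px 14px; background:#fff;">The npu-006 hard-block can be sketched in Python as below. This is an assumption-laden sketch, not the code in <code>catalog_qnn_sweep.py</code>: <code>conv_pct</code> is a hypothetical stand-in for <code>count_conv_pct()</code> (whose real signature is not shown here), Conv% is assumed to mean the share of <code>Conv</code> nodes among all graph nodes, the pass label <code>conv_fusion_full_pack</code> is invented for illustration, and the op lists are made up.</p>
```python
from collections import Counter

FULL_CONV_FUSION_PACK = "conv_fusion_full_pack"  # hypothetical pass label
CONV_PCT_BLOCK_THRESHOLD = 20.0  # threshold from npu-006: block when Conv% > 20


def conv_pct(op_types):
    """Percentage of nodes whose op type is Conv (assumed definition of Conv%)."""
    counts = Counter(op_types)
    total = sum(counts.values())
    return 100.0 * counts["Conv"] / total if total else 0.0


def allowed_passes(op_types, requested):
    """Drop the full conv fusion pack on Conv-dominant graphs; keep everything else."""
    if conv_pct(op_types) > CONV_PCT_BLOCK_THRESHOLD:
        return [p for p in requested if p != FULL_CONV_FUSION_PACK]
    return list(requested)


# Made-up op lists standing in for a Conv-dominant CNN and a transformer.
cnn_like = ["Conv"] * 20 + ["Relu"] * 17 + ["Add"] * 8 + ["MaxPool", "Gemm"]
transformer_like = ["MatMul"] * 60 + ["Add"] * 39 + ["Conv"]

requested = ["conv_add_fusion", FULL_CONV_FUSION_PACK]
print(allowed_passes(cnn_like, requested))
print(allowed_passes(transformer_like, requested))
```
+    <p class="find-detail" style="padding:10px 14px; background:#fff;">The CNN-like graph keeps only <code>conv_add_fusion</code>, which npu-006 reports as safe on its own, while the transformer-like graph (Conv% of 1) keeps both passes.</p>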
+
+    <div class="finding">
+      <div class="find-id">npu-007</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">DVFS thermal noise makes CV-based stability gating unreliable on QNN NPU</div>
+        <div class="find-detail">
+          Across all catalog models, within-session CV ranges 0.1&ndash;2.0+ even on a warm device.
+          CV gate (&lt;15%) blocks most valid candidates &mdash; the noise is DVFS, not model instability.
+          Reliable signal: 3+ independent sessions &times; 500+ iters with 30s cool-down. Use median p50 across sessions.
+          Differences &lt;10% are within the noise floor.
+          <div><span class="scope">Scope: General &mdash; all models on QNN NPU / Snapdragon X Elite.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          CV gate DISABLED for QNN NPU (<code>SCREEN_CV_MAX_NPU = 999.0</code>).
+          Always run 3&times;500 Phase B regardless of screen CV.
+          Feature request: <code>winml perf --sessions 3 --cool-down 30s</code> (#155).
+        </div>
+      </div>
+    </div>
+
+    <div class="finding">
+      <div class="find-id">npu-001</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>empirical</span></div>
+      <div class="find-body">
+        <div class="find-title">opset 21 export gives +24&ndash;31% speedup on DINOv2-family models &mdash; mechanism UNKNOWN, NOT a general ViT property; MobileViT benefit NOT reproduced on clean rerun</div>
+        <div class="find-detail">
+          DINOv2-small: <span class="data">opset17 7.2ms &rarr; opset21 5.0ms (+30.6%)</span>.
+          DINOv2-base: <span class="data">opset17 34.6ms &rarr; opset21 26.2ms (+24.1%)</span>.
+          MobileViT-small: <span class="warn">REVISED to NEUTRAL.</span> The original +28.6% / +42.1% (matmul_transpose) gains were measured against an inflated ~12ms baseline. A clean from-scratch 11-hypothesis rerun (2026-06-22, fresh winml config+build, 3&times;500-iter) gave <span class="data">baseline 5.51ms &rarr; opset21 5.355ms = +2.81% with overlapping session ranges</span>; matmul_transpose (h6) = 6.218ms, which is SLOWER than baseline. The earlier &ldquo;win&rdquo; was a DVFS/thermal baseline artifact.
+          <span class="warn">Critical controls: dino-vitb16 &minus;0.7% NEUTRAL; ViT-base &minus;10.8% HURTS; all NLP tested neutral. Not a general ViT property.</span>
+          Also: bias_softmax_fusion adds +14% incremental on DINOv2 on top of opset21 (npu-009). Original kMaxSupportedOpset bypass mechanism INVALIDATED (ORT 1.24.5 has kMaxSupportedOpset&ge;23).
+          <div><span class="scope">Scope: DINOv2-family confirmed. MobileViT REVISED to neutral. NOT plain ViT (ViT-base HURTS), NOT NLP.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          DINOv2-family: try <code>opset21 + bias_softmax_fusion</code> bundle first.
+          MobileViT-class hybrids: do NOT assume opset21 helps &mdash; clear the effect-size gate (gain &ge; 2&times;session-CV AND ranges separated) before trusting any win.
+          Plain ViT: SKIP &mdash; confirmed harmful.
+          NLP: SKIP &mdash; consistently neutral.
+          Architecture check required.
+        </div>
+      </div>
+    </div>
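+    <p class="find-detail" style="padding:10px 14px; background:#fff;">The effect-size gate named in npu-001 (gain &ge; 2&times;session-CV AND ranges separated), applied to the per-session p50 values that npu-007 recommends, can be sketched in Python. This is one possible reading rather than the implemented gate: the function names are hypothetical, &ldquo;session-CV&rdquo; is assumed to be the coefficient of variation of the per-session p50 values (the larger of baseline and candidate), and only the medians 5.51 and 5.355 come from the rerun above; the other session values are invented for illustration.</p>
```python
from statistics import mean, median, pstdev


def gain_pct(baseline_ms, candidate_ms):
    """Relative latency gain in percent; positive means the candidate is faster."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


def session_cv_pct(session_p50s):
    """Coefficient of variation across per-session p50 values, in percent."""
    return pstdev(session_p50s) / mean(session_p50s) * 100.0


def clears_effect_size_gate(baseline_sessions, candidate_sessions):
    """Gain >= 2 x session-CV AND the candidate range sits fully below the baseline range."""
    gain = gain_pct(median(baseline_sessions), median(candidate_sessions))
    cv = max(session_cv_pct(baseline_sessions), session_cv_pct(candidate_sessions))
    separated = max(candidate_sessions) < min(baseline_sessions)
    return gain >= 2.0 * cv and separated


# Medians match the MobileViT rerun; the outer session values are invented.
mobilevit_base = [5.40, 5.51, 5.62]
mobilevit_opset21 = [5.30, 5.355, 5.45]

print(round(gain_pct(5.51, 5.355), 2))
print(clears_effect_size_gate(mobilevit_base, mobilevit_opset21))
```
+    <p class="find-detail" style="padding:10px 14px; background:#fff;">The first line reproduces the +2.81% figure from the rerun. With overlapping session ranges the gate returns <code>False</code>, so a gain of this size is treated as neutral rather than as a win.</p>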
+
+    <div class="finding">
+      <div class="find-id">npu-010</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>cross-EP</span></div>
+      <div class="find-body">
+        <div class="find-title">highdimRTR causes regression on CNN-ViT hybrids: &minus;19% NPU, &minus;6.9% GPU &mdash; spurious +36 Reshape insertion</div>
+        <div class="find-detail">
+          MobileViT NPU (h9 opset21+highdimRTR): <span class="warn">median 14.4ms vs baseline 12.1ms = &minus;18.9%.</span>
+          MobileViT GPU (h9 bundle): <span class="warn">19.2ms vs 18.0ms = &minus;6.89%.</span>
+          ONNX diff: h9 graph has <strong>+36 extra Reshape nodes</strong> (108&rarr;144). The 12 original RTR patterns are UNCHANGED.
+          Root cause: highdimRTR misidentifies <code>Gemm&rarr;Reshape&rarr;Transpose</code> sequences in MobileViT hybrid unfold. Inserts Reshape intermediaries after Gemm &rarr; breaks dispatch merging &rarr; extra DMA.
+          Contrast: DINOv2 (pure ViT): <span class="data">h9 = +38.1% NPU</span> &mdash; pure ViT benefits.
+          <div><span class="scope">Scope: CNN-ViT hybrids with <code>Gemm&rarr;Reshape&rarr;Transpose</code> unfold. Pure-ViT models benefit. Architecture-dependent.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          Hard-block highdimRTR for <code>Gemm&rarr;Reshape&rarr;Transpose</code> hybrid unfold models.
+          <code>analyze_insight.py</code> must detect and add highdimRTR to <code>skip_set</code>.
+          Safe for pure-ViT (DINOv2 +38%). Architecture check required.
+        </div>
+      </div>
+    </div>
+
+    <div class="sm-divider">
+      <span class="sm-count">5 single-model findings hidden (npu-002/003/004/008/009)</span>
+      <span>Click &ldquo;Show single-model findings&rdquo; above to expand</span>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">npu-002</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">W8A16 quantization gives ~1.9x speedup over FP32 on QNN NPU</div>
+        <div class="find-detail">
+          ConvNext FP32: <span class="data">19.4ms &rarr; W8A16: 10.3ms (1.9x)</span>. 1 model only.
+          <div><span class="scope">Scope: ConvNext only for magnitude. Mechanism generalizes; magnitude does not.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Always quantize for QNN NPU. W8A16 is the starting point.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">npu-003</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">winml compile (EPContext) adds ~1.7x speedup on top of W8A16</div>
+        <div class="find-detail">
+          ConvNext W8A16: <span class="data">10.3ms &rarr; EPContext: 6.0ms (1.7x)</span>. 1 model only.
+          <div><span class="scope">Scope: ConvNext only for magnitude. Mechanism generalizes to all QNN NPU models.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Always run <code>winml compile</code> after finding best config for QNN NPU.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">npu-004</div>
+      <div class="find-conf"><span class="conf conf-low">LOW<br>anecdote</span></div>
+      <div class="find-body">
+        <div class="find-title">W8A8 may cause accuracy collapse on models with LN+GELU (UNVALIDATED)</div>
+        <div class="find-detail">
+          Experiment aborted early &mdash; no accuracy numbers preserved. Recalled anecdote only.
+          <span class="warn">Do NOT skip W8A8 without running eval first.</span>
+          <div><span class="scope">Scope: UNVALIDATED. ConvNext only.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Treat as anecdotal. Run W8A8 eval before deciding.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">npu-008</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">microsoft/rad-dino fails to build on QNN NPU across ALL opset variants (access violation rc=0xC0000005)</div>
+        <div class="find-detail">
+          <span class="warn">winml build crash (rc=0xC0000005) for opset 17, 19, and 21 on QNN NPU.</span>
+          rad-dino is ViT-L scale (large model, non-standard medical imaging architecture).
+          Builds successfully on CPU EP (~275ms). QNN GPU also returns BUILD_FAIL for all hypotheses.
+          <div><span class="scope">Scope: microsoft/rad-dino only. Likely unsupported op or tensor size in QNN SDK for ViT-L scale.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Route rad-dino to CPU EP only. Feature gap: <code>winml build</code> should fast-fail with a diagnostic instead of crashing with an access violation.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">npu-009</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">bias_softmax_fusion adds +14% incremental speedup on DINOv2 NPU when combined with opset21</div>
+        <div class="find-detail">
+          DINOv2-small h7 (<code>opset21 + bias_softmax_fusion</code>): <span class="data">p50=4.03ms (+38.6% total)</span>
+          vs h3 (opset21 only): 4.98ms. Incremental gain: <span class="data">+14.1%</span>.
+          Outperforms attention_fusion (h8=+28.4%) and matmul_transpose (h6=+24.8%) on DINOv2.
+          Mechanism: folds <code>Add(qk, bias)+Softmax</code> &rarr; single FusedSoftmax with native HTP path.
+          <div><span class="scope">Scope: DINOv2-small confirmed. Not tested on DINOv2-base or plain ViT.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          For DINOv2-family: use <code>opset21 + bias_softmax_fusion</code> bundle.
+          Prioritize over attention_fusion (outperformed h8 by 10pp).
+        </div>
+      </div>
+    </div>
+
+  </div>
+</div>
+
+<!-- CPU -->
+<div class="ep-section cpu">
+  <div class="ep-header">
+    CPU EP &nbsp;&mdash;&nbsp; Oryon CPU (Snapdragon X Elite)
+    &nbsp;<span style="font-weight:400;font-size:11px;color:#5c6bc0">4 models done (ResNet/MobileViT/DINOv2/rad-dino); BERT/NLP in progress | 3&times;300-iter sessions, Phase C</span>
+  </div>
+  <div class="ep-body">
+
+    <!-- cpu-006: cross-EP opset isolation -->
+    <div class="finding">
+      <div class="find-id">cpu-006</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>empirical</span></div>
+      <div class="find-body">
+        <div class="find-title">CPU EP, QNN GPU, and QNN NPU respond DIFFERENTLY to opset changes &mdash; EP isolation is mandatory</div>
+        <div class="find-detail">
+          CPU opset17 vs 21 (ConvNext): <span class="warn">3.9x SLOWER at opset21.</span>
+          CPU opset17 vs 21 (DINOv2): <span class="warn">~10x SLOWER at opset21 (cpu-001/cpu-009).</span>
+          QNN GPU opset17 vs 21: <span class="data">neutral-to-slightly-negative (&minus;5.4% to +3.3%)</span> across 7 models.
+          QNN NPU opset17 vs 21 (DINOv2): <span class="data">+24% FASTER.</span>
+          Same opset change, three different outcomes on the same chip. DINOv2 is 24% faster on NPU but ~10x slower on CPU.
+          <div><span class="scope">Scope: Meta-rule about EP isolation. Applies to all models.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          NEVER transfer opset findings across EPs.
+          Always validate per EP independently.
+          CPU, GPU, and NPU search spaces are fully independent.
+        </div>
+      </div>
+    </div>
+
+    <!-- cpu-007: HUGE win on ResNet, VISIBLE -->
+    <div class="finding">
+      <div class="find-id">cpu-007</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>KEEP_CONFIRMED<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">matmul_transpose_fusion gives +92% speedup on ResNet-18 CPU EP (237ms &rarr; 17.8ms) &mdash; baseline config is severely suboptimal</div>
+        <div class="find-detail">
+          ResNet-18 h9 (<code>matmul_transpose_fusion</code>): <span class="data">17.8ms vs auto-config baseline 237ms = +92.51% KEEP_CONFIRMED</span> (all 5 Phase C sessions passed).
+          Also confirmed: h12 (<code>transpose_optimizer</code>) +84.46%, h13 (<code>gelu_fusion</code>) +88.89%, h10 (bundle) +91.54%, h6 (<code>layer_norm_fusion</code>) +10.43%.
+          <span class="warn">237ms baseline = severely suboptimal auto-config for ResNet-18 on CPU. matmul_transpose_fusion enables BLAS-level transposed GEMM dispatch that ORT cannot reach with unfused MatMul+Transpose pairs.</span>
+          <div><span class="scope">Scope: ResNet-18 confirmed. Models with unfused MatMul+Transpose chains likely benefit. DINOv2: NOT applicable (all fusion flags regress DINOv2 on CPU via cpu-001 interference).</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          For ResNet-class on CPU: apply <code>matmul_transpose_fusion + transpose_optimizer + gelu_fusion</code> bundle.
+          <strong>Critical</strong>: verify whether auto-config baseline is always this suboptimal for ResNet.
+          Needs re-test with fresh config to confirm finding is reproducible.
+        </div>
+      </div>
+    </div>
+
+    <!-- cpu-008: MobileViT layer_norm catastrophe, VISIBLE -->
+    <div class="finding">
+      <div class="find-id">cpu-008</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">layer_norm_fusion causes &minus;997% regression on MobileViT CPU EP (73ms &rarr; 803ms) &mdash; wrong LN pattern match</div>
+        <div class="find-detail">
+          MobileViT h6 (<code>layer_norm_fusion</code>): <span class="warn">803ms vs baseline 73ms = &minus;997% DISCARD.</span>
+          Also: matmul_transpose_fusion &minus;165%, attention_fusion bundle &minus;164%, skip_layer_norm_fusion &minus;10%.
+          Only <code>bias_softmax_fusion</code> helps: <span class="data">64ms (+12.3% MARGINAL_UNCONFIRMED)</span>.
+          Mechanism: MobileViT places LayerNorm after Conv2D outputs (CNN-ViT hybrid). layer_norm_fusion expects pure transformer LN (post-MLP). Fusing the wrong pattern creates an op the CPU runtime cannot dispatch to an optimized kernel.
+          <div><span class="scope">Scope: CNN-ViT hybrid models (MobileViT). Pure transformers (BERT/ViT) are expected safe.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          Block <code>layer_norm_fusion</code>, <code>matmul_transpose_fusion</code>, and <code>attention_fusion</code> for CNN-ViT hybrid models on CPU.
+          <code>analyze_insight.py</code> must detect CNN-ViT hybrid architecture and skip these fusions.
+        </div>
+      </div>
+    </div>
+
+    <div style="padding:10px 16px;background:#fff;border-bottom:1px solid #eef;">
+      <div class="in-progress-banner">
+        &#x1F50D; <strong>CPU sweep: BERT/NLP models in progress (2026-06-18)</strong> &mdash; roberta-base-squad2, tinyroberta, bge-small, MiniLM-L6-v2.
+        Key open question: does cpu-001 (opset regression) fire on pure-BERT models (sparse Transpose), or only on Transpose-heavy architectures?
+        Expected finding: BERT is safe at opset19/21; attention_fusion may help BERT significantly.
+      </div>
+    </div>
+
+    <div class="sm-divider">
+      <span class="sm-count">5 single-model findings hidden (cpu-001/002/005/009 &mdash; single-arch, + cpu-004 anecdote)</span>
+      <span>Click &ldquo;Show single-model findings&rdquo; above to expand</span>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">cpu-001</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed<br>ConvNext+DINOv2</span></div>
+      <div class="find-body">
+        <div class="find-title">opset 19+ causes 3&ndash;10x slowdown on models with dense Transpose graphs &mdash; NOT ConvNext-specific</div>
+        <div class="find-detail">
+          ConvNext: <span class="warn">opset19=160ms (3.7x), opset21=170ms (3.9x)</span> vs baseline 43.7ms.
+          DINOv2-small: <span class="warn">opset19=1106ms (9.8x), opset21=1095ms (9.7x)</span> vs baseline 112ms &mdash; <code>CPU001_REGRESSION</code> verdict.
+          ResNet-18: <span class="data">opset19=231ms (+2.4% neutral), opset21=227ms (+4.5% neutral)</span> &mdash; NOT affected.
+          MobileViT: opset19=&minus;9.1% (mild, not catastrophic).
+          Pattern: models with &ge;49 Transpose nodes (ConvNext, DINOv2) hit cpu-001; sparse-Transpose models (ResNet) do not.
+          BERT/NLP pending (expected neutral based on Transformer LN-dominant graph with few Transposes).
+          <div><span class="scope">Scope: Dense-Transpose models confirmed (ConvNext, DINOv2). ResNet confirmed safe. BERT/NLP pending.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Default to opset17 for CPU EP. For DINOv2/ConvNext: hard-block opset19+. For ResNet: opset is safely neutral.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">cpu-009</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>1 model</span></div>
+      <div class="find-body">
+        <div class="find-title">DINOv2 CPU EP is constrained to auto-config only &mdash; ANY explicit flag causes catastrophic regression</div>
+        <div class="find-detail">
+          DINOv2-small: auto-config baseline=112ms. h1 opset17-explicit: <span class="warn">762ms (&minus;577%)</span>.
+          h2/h3 opset19/21: <span class="warn">~1100ms (&minus;880% CPU001_REGRESSION)</span>.
+          h4 attention_fusion: <span class="warn">1083ms (&minus;862%)</span>. h7 bias_softmax_fusion: <span class="warn">1121ms (&minus;896%)</span>.
+          Even <em>forcing opset17 explicitly</em> (h1) regresses &minus;577% vs auto-config. The auto-config default presumably uses a specific graph optimization path that any explicit override disrupts (mechanism not confirmed).
+          <div><span class="scope">Scope: DINOv2-small confirmed. Likely generalizes to all pure-ViT models with dense Transpose graphs on CPU.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">For DINOv2/ViT-class on CPU: use auto-config ONLY. Do not force any opset. Do not apply any fusion flags. All deviations regress.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">cpu-002</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">matmul_add_fusion increases CPU latency by 87% on models where ORT L2 already produced Gemm nodes</div>
+        <div class="find-detail">
+          ConvNext: <span class="warn">p50=81.7ms vs baseline 43.7ms (+87%).</span>
+          ORT L2 already converts MatMul+Add &rarr; Gemm at baseline. Applying fusion on top conflicts.
+          <code>catalog_cpu_sweep.py</code> auto-skips via <code>_model_has_gemm()</code> guard.
+          <div><span class="scope">Scope: Models where ORT L2 baseline already has Gemm nodes.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Skip <code>matmul_add_fusion</code> when model.onnx already contains Gemm. Guard implemented.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">cpu-005</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">Baseline (no extra flags) is optimal for ConvNext CPU &mdash; graph pass sweep is wasted</div>
+        <div class="find-detail">
+          22-experiment ablation: no flag improved p50 beyond noise. The 43.7ms baseline is the floor.
+          ORT L2 already applies gelu_fusion and MatMul&rarr;Gemm.
+          <div><span class="scope">Scope: ConvNext-class. Transformer: awaiting CPU catalog sweep completion.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">For CPU + ConvNext: skip graph pass sweep. ResNet: apply matmul_transpose_fusion (cpu-007).</div>
+      </div>
+    </div>
+
+  </div>
+</div>
+
+<!-- DML -->
+<div class="ep-section dml">
+  <div class="ep-header">
+    DML EP &nbsp;&mdash;&nbsp; Adreno X1-85 via Direct3D 12
+    &nbsp;<span style="font-weight:400;font-size:11px;color:#1976d2">1 model only (facebook/convnext-tiny-224). DML not available on test device (onnxruntime-windowsml package conflict)</span>
+  </div>
+  <div class="ep-body">
+
+    <div class="sm-divider">
+      <span class="sm-count">3 single-model findings hidden (dml-001/002/003 &mdash; ConvNext only)</span>
+      <span>Click &ldquo;Show single-model findings&rdquo; above to expand</span>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">dml-001</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>stability</span></div>
+      <div class="find-body">
+        <div class="find-title">DML is more stable than QNN GPU &mdash; p50 difference is within noise</div>
+        <div class="find-detail">
+          DML FP32: p50=16.9ms, std=0.52. QNN GPU FP32: p50=17.7ms, std=0.97.
+          p50 diff = <span class="warn">0.82&sigma; of QNN GPU &mdash; distributions OVERLAP. Not a separable p50 advantage.</span>
+          DML is meaningfully more stable: CV (std / p50) is 3% vs 5.5% for QNN GPU.
+          <div><span class="scope">Scope: Adreno X1-85, ConvNext. 3-run comparison only.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Prefer DML over QNN GPU for lower tail latency (p90). Do NOT claim DML is faster based on p50 alone.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">dml-002</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>1 run</span></div>
+      <div class="find-body">
+        <div class="find-title">NHWC transformer increases latency variance on DML &mdash; p50 neutral, p90 +19%</div>
+        <div class="find-detail">
+          DML NHWC: p50=16.5ms, <span class="warn">p90=21.0ms (+19%), std=1.89 (3.6x worse)</span>.
+          <div><span class="scope">Scope: Adreno X1-85 + DML, ConvNext.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Do NOT apply <code>nhwc-transformer</code> for DML EP. p90 +19% is unacceptable.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">dml-003</div>
+      <div class="find-conf"><span class="conf conf-low">LOW<br>CLI gap</span></div>
+      <div class="find-body">
+        <div class="find-title">DML FP16 gives ~1.4x speedup with clean unimodal distribution &mdash; BLOCKED (#867)</div>
+        <div class="find-detail">
+          DML FP16 (Python hack): <span class="data">p50=11.8ms, p90=12.8ms, std=0.66</span> vs FP32 16.9ms.
+          <span class="warn">Cannot reproduce with winml CLI today. Blocked on <strong>#867</strong> (--precision fp16).</span>
+          <div><span class="scope">Scope: Adreno X1-85 + DML. 1 experiment only.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Marked SKIPPED until <strong>#867</strong> ships. FP16 is the primary DML lever.</div>
+      </div>
+    </div>
+
+  </div>
+</div>
+
+<!-- QNN GPU -->
+<div class="ep-section gpu">
+  <div class="ep-header">
+    QNN GPU EP &nbsp;&mdash;&nbsp; Adreno X1-85 via QNN SDK
+    &nbsp;<span style="font-weight:400;font-size:11px;color:#bf360c">8 models, 3&times;300-iter sessions + Phase C confirmation | catalog_gpu_sweep.py h0-h12 (2026-06-18)</span>
+  </div>
+  <div class="ep-body">
+
+    <div class="finding">
+      <div class="find-id">gpu-004</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">W8A8 QDQ hangs indefinitely on QNN GPU EP (QNN SDK limitation)</div>
+        <div class="find-detail">
+          Any W8A8 QDQ-annotated ONNX passed to QNN GPU EP &rarr; infinite hang.
+          <code>winml build</code> already protects via <code>_patch_device()</code> (quant=null for GPU).
+          Fast-fail enhancement: <strong>#868</strong>.
+          <div><span class="scope">Scope: QNN GPU EP. QNN SDK limitation. Not a concern in normal user path.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Skip ALL quantization for QNN GPU EP. <code>winml build</code> protects automatically. Tracked: <strong>#868</strong>.</div>
+      </div>
+    </div>
+
+    <div class="finding">
+      <div class="find-id">gpu-006</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed<br>7 models</span></div>
+      <div class="find-body">
+        <div class="find-title">opset 21 is neutral-to-negative on QNN GPU &mdash; CONFIRMED across 7 models</div>
+        <div class="find-detail">
+          Full sweep 2026-06-18 (h3 = opset21):
+          DINOv2 <span class="data">+1.2%</span>, ResNet-18 <span class="data">+3.3%</span> (both MARGINAL),
+          MobileViT <span class="warn">&minus;3.4%</span>, roberta <span class="warn">&minus;1.1%</span>,
+          tinyroberta <span class="warn">&minus;2.7%</span>, rad-dino <span class="warn">&minus;2.6%</span>, bge-small +0.2%.
+          Range: &minus;5.4% to +3.3%. No KEEP verdict. All MARGINAL or DISCARD.
+          <span class="warn">Opposite of QNN NPU: DINOv2 +24% on NPU (cpu-006) vs +1.2% on GPU.</span>
+          <div><span class="scope">Scope: QNN GPU EP. Confirmed across 7 diverse architectures (CNN, ViT, transformer, NLP).</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          Do NOT try opset 19 or 21 for QNN GPU. Default to opset 17.
+          Rule confirmed &mdash; remove opset sweep from GPU search space.
+        </div>
+      </div>
+    </div>
+
+    <div class="finding">
+      <div class="find-id">gpu-007</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed<br>DINOv2</span></div>
+      <div class="find-body">
+        <div class="find-title">transpose_optimizer gives +8&ndash;17% on Conv/ViT models on QNN GPU &mdash; KEEP_CONFIRMED on DINOv2</div>
+        <div class="find-detail">
+          DINOv2-small h12 (<code>transpose_optimizer</code>): <span class="data">26.4ms &rarr; 22.0ms = +16.67%</span> (KEEP_CONFIRMED &mdash; all 5 sessions passed Phase C).
+          ResNet-18 h12: <span class="data">6.82ms &rarr; 6.25ms = +8.38%</span> (MARGINAL_UNCONFIRMED &mdash; Phase C inconclusive).
+          gelu_fusion explicit (h11) also KEEP_CONFIRMED on DINOv2: <span class="data">+13.86%</span>.
+          NLP models (roberta, bge-small): mostly BUILD_FAIL with transpose_optimizer &mdash; likely IR incompatibility.
+          <div><span class="scope">Scope: Conv-dominant and ViT models. NLP: BUILD_FAIL needs investigation.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">
+          Apply <code>transpose_optimizer</code> as default for QNN GPU on Conv/ViT models.
+          Skip for NLP models until BUILD_FAIL is resolved.
+          Feature gap: diagnose why transpose_optimizer causes BUILD_FAIL on BERT/RoBERTa.
+        </div>
+      </div>
+    </div>
+
+    <div class="finding">
+      <div class="find-id">gpu-008</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>cross-EP</span></div>
+      <div class="find-body">
+        <div class="find-title">highdimRTR causes &minus;6.9% regression on MobileViT GPU &mdash; same root cause as npu-010</div>
+        <div class="find-detail">
+          MobileViT h9 (bundle including highdimRTR): <span class="warn">p50=19.2ms vs baseline 18.0ms = &minus;6.89% DISCARD.</span>
+          Less severe than on NPU (&minus;19%), likely due to lower DMA sensitivity on Adreno vs Hexagon HTP.
+          Root cause: same +36 spurious Reshape nodes confirmed by npu-010 ONNX diff.
+          <div><span class="scope">Scope: CNN-ViT hybrids with Gemm&rarr;Reshape&rarr;Transpose unfold. See npu-010 for full mechanism.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Same block rule as npu-010. Cross-EP: same architecture check protects both GPU and NPU sweeps.</div>
+      </div>
+    </div>
+
+    <div class="sm-divider">
+      <span class="sm-count">4 single-model findings hidden (gpu-001/002/003/005 &mdash; ConvNext only)</span>
+      <span>Click &ldquo;Show single-model findings&rdquo; above to expand</span>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">gpu-001</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">FP32 baseline is already optimal for ConvNext on QNN GPU &mdash; no optimization pass helps</div>
+        <div class="find-detail">
+          11-pass sweep on ConvNext: all 0% node reduction or worse. 251/0/0/0 analyze output.
+          <div><span class="scope">Scope: ConvNext-class. Transformer models may benefit (see gpu-007).</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Skip all graph pass experiments for QNN GPU on ConvNext-class. FP16 is the only remaining lever (#867).</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">gpu-002</div>
+      <div class="find-conf"><span class="conf conf-medium">MEDIUM<br>consistent</span></div>
+      <div class="find-body">
+        <div class="find-title">NHWC transformer hurts QNN GPU on Adreno &mdash; ~10% worse p50, +21% p90</div>
+        <div class="find-detail">
+          NHWC: <span class="warn">p50=19.5ms (+10%), p90=23.8ms (+21%), std=3.43 (3.5x worse)</span>.
+          <div><span class="scope">Scope: Adreno X1-85 + QNN GPU.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Do NOT apply <code>nhwc-transformer</code> for QNN GPU EP.</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">gpu-005</div>
+      <div class="find-conf"><span class="conf conf-high">HIGH<br>confirmed</span></div>
+      <div class="find-body">
+        <div class="find-title">gelu_fusion improves latency STABILITY on QNN GPU &mdash; not p50</div>
+        <div class="find-detail">
+          Unfused GELU (287 nodes): p50=17.4ms, <span class="warn">p90=29.2ms, std=5.90</span>.
+          Fused GELU (251 nodes): p50=17.7ms, <span class="data">p90=19.7ms (&minus;33%), std=0.97 (6x lower)</span>.
+          <div><span class="scope">Scope: Any model with GELU activations on QNN GPU.</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Always apply <code>gelu_fusion</code> for QNN GPU (stability benefit, not p50).</div>
+      </div>
+    </div>
+
+    <div class="finding single-model">
+      <div class="find-id">gpu-003</div>
+      <div class="find-conf"><span class="conf conf-low">LOW<br>1 run</span></div>
+      <div class="find-body">
+        <div class="find-title">winml compile regresses QNN GPU by ~34% &mdash; single experiment, low confidence</div>
+        <div class="find-detail">
+          FP32 + compile: <span class="warn">p50=23.7ms vs baseline 17.7ms (+34%)</span>. Single experiment only.
+          <div><span class="scope">Scope: QNN GPU EP. On QNN NPU, compile helped on the one model tested (npu-003).</span></div>
+        </div>
+      </div>
+      <div class="find-action">
+        <div class="action-label">Autoconfig action</div>
+        <div class="action-text">Avoid <code>winml compile</code> for QNN GPU EP. Re-validate if behavior changes.</div>
+      </div>
+    </div>
+
+  </div>
+</div>
+
+
+<!-- FEATURE REQUESTS -->
+<div class="fr-section">
+  <div class="fr-header">
+    Feature Requests &amp; CLI Gaps
+    <span style="font-weight:400;font-size:11px;color:#778">&mdash; required to complete the autoconfig skill</span>
+  </div>
+
+  <table class="fr-table">
+    <thead>
+      <tr>
+        <th>Feature</th>
+        <th>Issue</th>
+        <th>Priority</th>
+        <th>Motivation</th>
+        <th>EP Finding</th>
+        <th>Status</th>
+      </tr>
+    </thead>
+    <tbody>
+
+      <tr>
+        <td>winml perf multi-session bench protocol<br><code>--sessions N --cool-down S</code></td>
+        <td><a class="issue-link" href="#">#155</a></td>
+        <td><span class="pri pri-p0">P0</span></td>
+        <td>npu-007: reliable QNN NPU measurement requires 3 independent sessions with 30s cool-down. Single-session p50 is meaningless due to DVFS. catalog_qnn_sweep.py works around this but CLI support is needed for production autoconfig.</td>
+        <td>npu-007</td>
+        <td>OPEN</td>
+      </tr>
+
+      <tr>
+        <td>winml analyze: detect FusedConv and warn when Conv%&nbsp;&gt;&nbsp;threshold pre-build</td>
+        <td><span class="no-issue">not filed</span></td>
+        <td><span class="pri pri-p0">P0</span></td>
+        <td>npu-006: conv fusions create FusedConv ops that QNN EP cannot dispatch, causing 130x regression. autoconfig guards via Conv% counter but a CLI-level lint in <code>winml analyze --ep qnn</code> would make this generally available without custom Python code.</td>
+        <td>npu-006</td>
+        <td class="blocked">NEEDS ISSUE</td>
+      </tr>
+
+      <tr>
+        <td>winml analyze: detect Gemm&rarr;Reshape&rarr;Transpose hybrid unfold; warn before applying highdimRTR</td>
+        <td><span class="no-issue">not filed</span></td>
+        <td><span class="pri pri-p1">P1</span></td>
+        <td>npu-010 / gpu-008: highdimRTR inserts +36 spurious Reshape nodes on CNN-ViT hybrids (MobileViT), causing &minus;19% NPU / &minus;6.9% GPU regression. analyze_insight.py adds to skip_set internally, but a <code>winml analyze --flag highdimRTR_lowdimRTR</code> lint check would make this available to all users.</td>
+        <td>npu-010, gpu-008</td>
+        <td class="blocked">NEEDS ISSUE</td>
+      </tr>
+
+      <tr>
+        <td>winml build --precision fp16</td>
+        <td><a class="issue-link" href="#">#867</a></td>
+        <td><span class="pri pri-p1">P1</span></td>
+        <td>dml-003: DML FP16 gives ~1.4x speedup with clean distribution, only achievable via Python workaround. Same for QNN GPU (FP16 is the only remaining lever after all graph passes exhausted).</td>
+        <td>dml-003, gpu-001</td>
+        <td>OPEN</td>
+      </tr>
+
+      <tr>
+        <td>winml perf --profile (per-op kernel time)</td>
+        <td><a class="issue-link" href="#">#158</a></td>
+        <td><span class="pri pri-p1">P1</span></td>
+        <td>Phase 1 Insight in autoconfig needs dynamic op-dominance data (Gemm% vs Conv% vs Attention%) to prioritize hypotheses. POC bridges via static analyze_graph.py, but dynamic profiling is needed for accurate attribution.</td>
+        <td>all EPs</td>
+        <td>OPEN</td>
+      </tr>
+
+      <tr>
+        <td>winml build --json (structured output)</td>
+        <td><a class="issue-link" href="#">#443</a></td>
+        <td><span class="pri pri-p2">P2</span></td>
+        <td>autoconfig parses winml build stdout to detect failures &mdash; fragile string parsing. A --json flag should emit per-step status, elapsed time, and output artifact paths. Would enable precise partial-failure detection and resume.</td>
+        <td>all EPs</td>
+        <td>OPEN</td>
+      </tr>
+
+      <tr>
+        <td>winml eval --mode compare: support local PyTorch model as reference</td>
+        <td><span class="no-issue">not filed</span></td>
+        <td><span class="pri pri-p2">P2</span></td>
+        <td>autoconfig correctness gate requires HuggingFace model ID as golden reference. Local .pt/.pth files and custom fine-tunes are not supported, blocking cosine-similarity correctness checks for non-HF models.</td>
+        <td>all EPs</td>
+        <td class="blocked">NEEDS ISSUE</td>
+      </tr>
+
+    </tbody>
+  </table>
+</div>
+
+<div style="margin-top:32px;padding-top:14px;border-top:1px dashed #ccd;font-size:11px;color:#aab;line-height:1.8">
+  <strong>How to read confidence levels:</strong>
+  <span class="conf conf-high">HIGH confirmed</span> = mechanism understood + data from &ge;3 independent sessions with non-overlapping ranges.
+  &nbsp;
+  <span class="conf conf-medium">MEDIUM empirical</span> = data is reliable but mechanism unconfirmed or from 1 model only.
+  &nbsp;
+  <span class="conf conf-low">LOW</span> = single experiment, anecdote, or CLI gap blocking proper validation.
+  <br>
+  All findings from Snapdragon X Elite CRD (Oryon CPU + Adreno X1-85 GPU + Hexagon HTP NPU).
+  ORT 1.24.5 (onnxruntime-windowsml). Findings may not generalize to x86 hardware or older ORT versions.
+</div>
+<script>
+(function () {
+  // Auto-tag any finding card that carries a LOW confidence badge.
+  document.querySelectorAll('.finding').forEach(function (f) {
+    if (f.querySelector('.conf-low')) f.classList.add('low-conf');
+  });
+})();
+
+// LOW-confidence findings are hidden by default. When hidden we force an inline
+// display:none so they stay hidden regardless of the single-model toggle; when
+// shown we clear the inline style and let the single-model CSS rules apply.
+var lowConfShown = false;
+function applyLowConf() {
+  document.querySelectorAll('.finding.low-conf').forEach(function (f) {
+    f.style.display = lowConfShown ? '' : 'none';
+  });
+}
+function toggleLowConf() {
+  lowConfShown = !lowConfShown;
+  document.getElementById('lowconf-btn').textContent =
+    lowConfShown ? 'Hide low-confidence findings' : 'Show low-confidence findings';
+  applyLowConf();
+}
+applyLowConf();
+</script>
+
+</body>
+</html>
diff --git a/research/autoconfig/docs/ep-knowledge-review.md b/research/autoconfig/docs/ep-knowledge-review.md
new file mode 100644
index 000000000..288467396
--- /dev/null
+++ b/research/autoconfig/docs/ep-knowledge-review.md
@@ -0,0 +1,246 @@
+# EP Knowledge Base — Critical Review
+
+> Date: 2026-06-16
+> Reviewer: internal audit
+> Scope: `ep_knowledge/qnn_npu.json` findings npu-001 through npu-007
+>
+> This document records issues found in the original KB entries and the
+> reasoning behind corrections applied in the June 2026 update.
+
+---
+
+## Summary of Issues Found
+
+| Finding | Status Before Review | Issue | Corrected Status |
+|---------|---------------------|-------|-----------------|
+| npu-001 | `mechanism_confirmed: true` | ORT version used almost certainly has kMaxSupportedOpset ≥ 22, so the bypass mechanism should not apply; ResNet-18 data is noise | `mechanism_confirmed: false`, mechanism UNKNOWN |
+| npu-002 | scope: "General / most vision models" | Tested on 1 model only (ConvNext) | scope narrowed to ConvNext |
+| npu-003 | scope: "General / all QNN NPU" | Tested on 1 model only (ConvNext) | scope narrowed to ConvNext |
+| npu-004 | confidence: "medium" | No recorded data; experiment aborted before measurements saved | confidence: "very_low / anecdote" |
+| npu-005 | confidence: "medium" | Compares ORT QNN EP vs qairt native stack — different compilation pipeline entirely | added fairness caveat |
+| npu-006 | `mechanism_confirmed: false` | Observation is solid (3-session consistent). Mechanism is unconfirmed but regression is unambiguous | no change to confirmed status; added session evidence |
+| npu-007 | `mechanism_confirmed: true` | Solid, confirmed across all 8 models | no change |
+
+---
+
+## Detailed Analysis
+
+### npu-001 — opset 21 speedup
+
+#### ORT version issue (critical)
+
+The catalog sweep used `onnxruntime-windowsml==1.24.5`. The npu-001 mechanism
+explanation relies on ORT's `kMaxSupportedOpset` gate:
+
+> "On older ORT where kMaxSupportedOpset < 21, opset 21 models bypass the
+> NCHW→NHWC layout transformer entirely."
+
+But the `kMaxSupportedOpset` version table (from `cpu.json`) shows:
+
+| ORT version | kMaxSupportedOpset |
+|-------------|-------------------|
+| v1.14.x | 18 |
+| v1.16.x | 19 |
+| v1.17.x | 20 |
+| v1.18.x | 21 |
+| main_HEAD | 26 |
+
+At ORT 1.24.x, `kMaxSupportedOpset` is almost certainly ≥ 22 (not yet verified
+against the shipped binary). If so, BOTH opset 17 and opset 21 models go through the
+NHWC layout transform in the sweep's ORT version and **the "bypass" mechanism does not apply.**
+
+Consequence: `mechanism_confirmed` must be `false`. The speedup for DINOv2 and
+MobileViT is empirically real but the cause is **unknown**. The ORT source code
+investigation confirmed the bypass mechanism for *older* ORT versions, not for
+the ORT version actually used.
+
+Possible alternative mechanisms (uninvestigated):
+1. PyTorch ONNX exporter produces a structurally different graph at opset 21
+   (different op decompositions, fewer reshape/squeeze nodes)
+2. QNN EP's graph partitioner behaves differently with opset 21 operator
+   semantics even when the NHWC transform fires
+3. Quantization calibration path differs between opset export versions
+4. The NHWC transform at opset 21 still inserts fewer Transposes for some reason
+   despite firing (investigation needed via optimized graph dump)
+
+#### ResNet-18 data is noise-dominated
+
+ResNet-18 baseline p50 is ~1ms. At this latency, the 3×500-iter protocol
+produces per-session p50s that vary 4x between sessions:
+
+```
+h1 (opset17): sessions = [0.990, 4.003, 2.716] ms  ← 4x range
+h3 (opset21): sessions = [1.054, 2.175, 4.107] ms  ← 4x range
+```
+
+The two distributions fully overlap. Declaring a "+20.2% speedup" from comparing
+medians (2.716 vs 2.175ms) is not statistically valid. This data point is
+**removed** from `validated_models.benefits_from_opset21`.
+
+To get reliable data for ResNet-18, a minimum of ~3000 iterations per session
+and ≥ 5 sessions would be needed.
+
+#### MobileViT DVFS spike in h1
+
+h1 (opset17) sessions: [10.557, 11.721, **27.436**] ms
+
+The third session at 27.4ms is a clear DVFS thermal event (2.4x spike). The
+median (11.721ms) is upward-biased by this session. The "true" opset17 p50 is
+likely ~11ms, making the "+26.5%" speedup calculation overstated. A more
+conservative estimate is ~20-22%.
+
+However, h3 (opset21) sessions [10.814, 8.625, 8.449] show two highly consistent
+low-latency sessions. The speedup is real, magnitude uncertain (~20-26%).
+
+#### DINOv2 — most reliable evidence for npu-001
+
+- h1 (opset17): [7.176, 6.392, 9.436] ms, range 6.4–9.4 ms
+- h3 (opset21): [4.977, 4.876, 6.884] ms, range 4.9–6.9 ms
+
+The two ranges overlap only at the extremes (h3 max 6.884 ms exceeds h1 min
+6.392 ms). h3 sessions 1 and 2 (4.977, 4.876 ms) are tightly clustered at ~4.9 ms,
+well below the h1 range. The speedup appears real (≥24% vs h1's non-spiked
+sessions, up to 31% vs h1 median).
+
+DINOv2-small's benefit is notable because it is primarily a Vision Transformer —
+it has a patch embedding Conv layer but attention-dominant compute. Why opset21
+helps DINOv2 but NOT ViT-base is unknown. This architecture distinction needs
+investigation.
+
+#### Updated empirical claim for npu-001
+
+**Observable fact**: For DINOv2-small and MobileViT-small on QNN NPU (ORT 1.24.5,
+Snapdragon X Elite), using opset 21 export instead of opset 17 produces a
+consistent latency reduction of ~20-31% across 3-session benchmarks.
+
+**What is NOT known**: Why this occurs in ORT 1.24.x where the kMaxSupportedOpset
+bypass should not apply.
+
+**What needs investigation**:
+1. Dump optimized.onnx for both opset17 and opset21 DINOv2, count Transpose nodes
+   — if opset21 has fewer Transposes, explains speedup via a different mechanism
+2. Verify ORT 1.24.x kMaxSupportedOpset value from compiled binary
+3. Test 3+ additional Conv+residual models: EfficientNet-B0, MobileNet-V3,
+   ConvNeXt-tiny (already done for CPU; needs QNN NPU validation)
+
+---
+
+### npu-002 — W8A16 speedup over FP32
+
+**Issue**: Scope states "General (applies to most vision models on QNN NPU)".
+Evidence base: 1 model (ConvNext), 1 device.
+
+The 1.9x speedup is plausible given the HTP architecture (INT8 weight path), but
+the magnitude varies by model: a model with few weight-heavy ops (e.g., pure
+attention) may see less speedup than a Conv-heavy model. "Most vision models"
+is an over-claim.
+
+**Correction**: Scope narrowed to "ConvNext — single model validation". The
+catalog sweep provides indirect evidence (all 8 models used W8A16 and ran
+faster than FP32 would on HTP) but no direct FP32 comparison baseline for
+those models.
+
+---
+
+### npu-003 — compile speedup
+
+**Issue**: Scope states "General (applies to all QNN NPU deployments)". Evidence
+base: 1 model (ConvNext), 1 device.
+
+The compile (EPContext) mechanism is well-understood and applies generally, but
+the 1.7x magnitude is model-specific. Models with simpler graphs may see less
+benefit; models with many ops may see more.
+
+**Correction**: Scope narrowed. The mechanism claim ("eliminates JIT partitioning")
+is generally correct; the magnitude claim (1.7x) is ConvNext-specific.
+
+---
+
+### npu-004 — W8A8 accuracy collapse
+
+**Issue**: The observation is "Exact numbers not recorded — aborted early." This
+is an anecdote, not a finding. The confidence of "medium" is unjustified without
+data.
+
+The claim may well be correct (W8A8 on LN+GELU is problematic), but without
+recorded accuracy numbers it cannot be treated as a KB finding.
+
+**Correction**: Confidence downgraded to "very_low". The finding is relabeled
+as an unrecorded anecdote pending a proper experiment with recorded numbers.
+
+---
+
+### npu-006 — conv fusions catastrophic regression
+
+This finding is the **most statistically solid** in the entire KB:
+
+- ResNet-18 h4 sessions: [132.3, 134.97, 130.669] ms, CV = 0.016 (extremely stable)
+- ResNet-18 h1 sessions: [0.990, 4.003, 2.716] ms, median 2.716 ms
+
+Even the most conservative pairing, the slowest h1 session (4.003 ms) against the
+fastest h4 session (130.669 ms), is a ~33x regression; best h1 vs worst h4 is 136x.
+The 3-session consistency of h4 (~130-135 ms) with near-zero variance is unusual
+for QNN NPU (all other hypotheses show high CV). This suggests the fused ops cause a
+deterministic CPU fallback with no DVFS noise, consistent with the mechanism hypothesis.
+
+The only issue is "mechanism_confirmed: false" — the CPU fallback has not been
+verified via EP partition dump. The regression is unambiguous; the mechanism is
+a strong hypothesis.
+
+**No changes needed** except documenting the 3-session evidence more explicitly.
+
+---
+
+## Additional Models Needed for Validation
+
+### For npu-001 (opset21 benefit for Conv+residual)
+
+| Model | Why useful | Predicted result |
+|-------|-----------|-----------------|
+| `microsoft/efficientnet-b0` | Conv-dominant, no residual-add structure | uncertain |
+| `microsoft/mobilenet-v3-small` | Conv-dominant + SE blocks | likely benefits |
+| `timm/convnextv2-nano` | ConvNext variant, already confirmed for ConvNext | should benefit |
+| `facebook/deit-small-patch16-224` | Pure ViT (no Conv), similar to ViT-base | should be neutral |
+| `timm/regnetx-002` | ResNet-like but with group Conv | uncertain |
+
+Goal: determine whether the benefit is "Conv+residual" or something more specific
+to the DINOv2/MobileViT architectures (e.g., hybrid Conv+attention).
+
+### For npu-006 (conv fusions)
+
+| Model | Why useful | Predicted result |
+|-------|-----------|-----------------|
+| `microsoft/efficientnet-b0` | Conv+BN heavy (many fuseable patterns) | should regress |
+| `google/mobilenet-v2-1.0-224` | Depthwise Conv dominant | should regress |
+| `timm/vgg16` | Pure Conv-BN | should regress |
+| `microsoft/beit-base-patch16-224` | Pure transformer | should be neutral |
+
+Goal: confirm that the regression generalizes to all Conv-dominant models, not
+just ResNet-18.
+
+### For npu-002/003 (W8A16 and compile)
+
+Run FP32 vs W8A16 and W8A16 vs W8A16+compile on at least:
+- `apple/mobilevit-small` (already benchmarked W8A16; need FP32 baseline)
+- `microsoft/resnet-18` (same)
+- `facebook/dinov2-small` (same)
+
+This would promote npu-002 and npu-003 from "1-model observations" to
+"catalog-validated" findings.
+
+---
+
+## Minimum Experiment Protocol for Validation
+
+For any new model added to the KB:
+
+1. Run 3 independent sessions × 500 iters with 30s cool-down (npu-007 protocol)
+2. Record raw per-session p50s, not just the median
+3. Verify session-to-session range is < 50% of the median before reporting a gain
+4. For sub-2ms models: increase to 3 sessions × 2000 iters minimum
+5. Always dump the optimized graph (`--save-optimized-model`) for opset comparison
+6. Record ORT version (`winml --version`) at experiment time in the finding
+
+---
+
+*This review should be repeated after any ORT or QNN SDK version update.*
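Review note (commentary between file diffs, not part of the patch): steps 2 and 3 of the protocol above, together with the range comparison used in the npu-001 analysis, could be checked mechanically along the following lines. This is only a sketch. The function names (`session_stats`, `ranges_overlap`, `gain_is_reportable`, `gain_pct`) are hypothetical and are not taken from `autoconfig.py`; the 50% threshold is the one stated in step 3, and the session values are the DINOv2 and ResNet-18 p50s quoted in the review.

```python
from statistics import median


def session_stats(p50s_ms):
    """Summarise raw per-session p50 latencies (protocol step 2)."""
    med = median(p50s_ms)
    spread = max(p50s_ms) - min(p50s_ms)
    return {"median": med, "range": spread, "rel_range": spread / med}


def ranges_overlap(a_ms, b_ms):
    """True when the [min, max] intervals of two runs intersect."""
    return min(a_ms) <= max(b_ms) and min(b_ms) <= max(a_ms)


def gain_is_reportable(baseline_ms, candidate_ms, max_rel_range=0.5):
    """Protocol step 3: each run needs a range below 50% of its own median."""
    return all(
        session_stats(run)["rel_range"] < max_rel_range
        for run in (baseline_ms, candidate_ms)
    )


def gain_pct(baseline_ms, candidate_ms):
    """Latency reduction of the candidate median relative to the baseline median."""
    base = median(baseline_ms)
    return (base - median(candidate_ms)) / base * 100.0


# Session p50s quoted in the review (ms).
dinov2_h1 = [7.176, 6.392, 9.436]
dinov2_h3 = [4.977, 4.876, 6.884]
resnet_h1 = [0.990, 4.003, 2.716]
resnet_h3 = [1.054, 2.175, 4.107]

print(gain_is_reportable(dinov2_h1, dinov2_h3), round(gain_pct(dinov2_h1, dinov2_h3), 1))
print(gain_is_reportable(resnet_h1, resnet_h3))
print(ranges_overlap(dinov2_h1, dinov2_h3))
```

Under these assumptions the DINOv2 pair passes the step 3 gate while the ResNet-18 pair does not, which matches the review's conclusions. The DINOv2 ranges still intersect slightly, so a stricter non-overlap criterion would have to be applied separately via `ranges_overlap`.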
diff --git a/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json b/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json
new file mode 100644
index 000000000..320f4b475
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json
@@ -0,0 +1,59 @@
+{
+  "issue_number": 921,
+  "github_url": "https://github.com/microsoft/winml-cli/issues/921",
+  "title": "analyze: detect Gemm→Reshape→Transpose hybrid-unfold pattern; warn before applying highdimRTR",
+  "status": "OPEN",
+  "labels": ["static-analyzer", "graph-optimizer", "P2", "triaged"],
+  "filed_date": "2026-06-18",
+  "category": "analyze",
+  "source_findings": ["npu-010", "gpu-008"],
+  "affected_eps": ["qnn_npu", "qnn_gpu"],
+  "affected_arch": ["mobilevit", "cnn_vit_hybrid"],
+  "summary": "analyze_insight.py does not detect Gemm→Reshape→Transpose unfold blocks (CNN-ViT hybrid fingerprint). When highdimRTR_lowdimRTR is applied to models with this pattern, it inserts ~36 spurious Reshape nodes after Gemm layers, increasing memory traffic and causing regression.",
+  "root_cause": "MobileViT's CNN encoder implements a sliding-window unfold via Gemm→Reshape→Transpose. highdimRTR_lowdimRTR misidentifies these as optimizable RTR chains. The optimizer tries to lower dimensionality but fails — in the process inserting layout-conversion Reshape nodes before/after each Gemm-unfold block to meet expected tensor formats. Net effect: +36 extra nodes, more DMA traffic on NPU/GPU.",
+  "measured_impact": [
+    {
+      "model": "apple/mobilevit-small",
+      "ep": "qnn_npu",
+      "hypothesis": "h9",
+      "baseline_ms": 26.6,
+      "result_ms": 31.8,
+      "gain_pct": -19.5,
+      "verdict": "DISCARD",
+      "protocol": "3x500 iters, Phase C confirmed",
+      "date": "2026-06-17"
+    },
+    {
+      "model": "apple/mobilevit-small",
+      "ep": "qnn_gpu",
+      "hypothesis": "h9",
+      "baseline_ms": null,
+      "result_ms": null,
+      "gain_pct": -6.9,
+      "verdict": "DISCARD",
+      "protocol": "3x300 iters, Phase C confirmed",
+      "date": "2026-06-18"
+    },
+    {
+      "model": "facebook/dinov2-small",
+      "ep": "qnn_npu",
+      "hypothesis": "h9",
+      "baseline_ms": null,
+      "result_ms": null,
+      "gain_pct": 38.1,
+      "verdict": "KEEP_CONFIRMED",
+      "note": "Pure-ViT — no Gemm-unfold blocks. highdimRTR works correctly here.",
+      "protocol": "3x500 iters",
+      "date": "2026-06-17"
+    }
+  ],
+  "fix_needed": {
+    "file": "analyze_insight.py",
+    "function": "detect_fusion_candidates",
+    "description": "Add a pass that counts Gemm→Reshape→Transpose chains. If count > 0, emit FusionCandidate with tag 'highdimRTR_risky' and add hypothesis h9 to skip_set.",
+    "code_sketch": "for node in graph.node:\n  if node.op_type == 'Reshape':\n    pred = producer.get(node.input[0])\n    if pred and pred.op_type in ('Gemm', 'MatMul'):\n      consumer = _single_consumer(node)\n      if consumer and consumer.op_type == 'Transpose':\n        gemm_unfold_count += 1"
+  },
+  "discriminator": "Gemm→Reshape→Transpose count > 0 → add highdimRTR to skip_set; count == 0 → highdimRTR is a candidate (may give +38% on pure-ViT)",
+  "related_issues": [180],
+  "notes": "Issue #180 is a companion question about whether unmergeable RTR patterns should be surfaced by the pattern matcher. This issue is about pre-detection of the *source* pattern before rewrite is attempted."
+}
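Review note (commentary between file diffs, not part of the patch): the `code_sketch` string above relies on a `producer` map and a `_single_consumer` helper that the record does not define. A self-contained version might look like the sketch below. It assumes nodes expose `op_type`, `input` and `output` the way an ONNX `NodeProto` does; the `Node` dataclass is a stand-in so the example runs without the `onnx` package, and `count_gemm_unfold` and `skip_set_for` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for an ONNX NodeProto: only the fields the sketch reads."""
    op_type: str
    input: list = field(default_factory=list)
    output: list = field(default_factory=list)


def count_gemm_unfold(nodes):
    """Count Gemm/MatMul -> Reshape -> Transpose chains."""
    # Map each tensor name to its producing node and to its consuming nodes.
    producer = {out: n for n in nodes for out in n.output}
    consumers = {}
    for n in nodes:
        for name in n.input:
            consumers.setdefault(name, []).append(n)

    count = 0
    for node in nodes:
        if node.op_type != "Reshape" or not node.input:
            continue
        pred = producer.get(node.input[0])
        if pred is None or pred.op_type not in ("Gemm", "MatMul"):
            continue
        users = consumers.get(node.output[0], []) if node.output else []
        # Mirrors the sketch's _single_consumer: require exactly one consumer.
        if len(users) == 1 and users[0].op_type == "Transpose":
            count += 1
    return count


def skip_set_for(nodes):
    """Discriminator from the issue: any unfold chain puts h9 in the skip set."""
    return {"h9"} if count_gemm_unfold(nodes) > 0 else set()


hybrid = [
    Node("Gemm", ["x", "w"], ["g"]),
    Node("Reshape", ["g", "shape"], ["r"]),
    Node("Transpose", ["r"], ["t"]),
]
pure_vit = [
    Node("MatMul", ["x", "w"], ["m"]),
    Node("Add", ["m", "b"], ["y"]),
]
print(skip_set_for(hybrid), skip_set_for(pure_vit))
```

With a real model the same function could be fed `model.graph.node`, but that wiring is an assumption and has not been checked against `analyze_insight.py`.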
diff --git a/research/autoconfig/docs/feature-gaps/README.md b/research/autoconfig/docs/feature-gaps/README.md
new file mode 100644
index 000000000..9e349faf8
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/README.md
@@ -0,0 +1,68 @@
+# Feature Gap Issues — WinML autoconfig Research
+
+Each issue is a separate JSON file in this directory. Filed issues have `issue_number` set;
+pending issues have `issue_number: null`.
+
+## JSON Schema
+
+```jsonc
+{
+  "issue_number": 921,            // null if not yet filed
+  "github_url": "https://...",    // null if pending
+  "title": "...",
+  "status": "OPEN | CLOSED | PENDING",
+  "labels": ["..."],
+  "filed_date": "YYYY-MM-DD",     // null if pending
+  "category": "analyze | build | optimize | perf | ...",
+  "source_findings": ["npu-010"], // KB finding IDs that motivated this issue
+  "affected_eps": ["qnn_npu"],
+  "affected_arch": ["mobilevit"],
+  "summary": "One paragraph",
+  "root_cause": "Detailed explanation",
+  "measured_impact": [
+    {
+      "model": "apple/mobilevit-small",
+      "ep": "qnn_npu",
+      "hypothesis": "h9",
+      "baseline_ms": 26.6,
+      "result_ms": 31.8,
+      "gain_pct": -19.5,
+      "verdict": "DISCARD",
+      "protocol": "3x500 iters",
+      "date": "YYYY-MM-DD"
+    }
+  ],
+  "fix_needed": {
+    "file": "analyze_insight.py",
+    "function": "...",
+    "description": "...",
+    "code_sketch": "..."  // optional
+  },
+  "discriminator": "How to detect this case at analysis time",
+  "related_issues": [180],
+  "notes": "..."
+}
+```
+
+## Index
+
+| File | Issue | Status | Category | Source Findings |
+|---|---|---|---|---|
+| `921-analyze-highdimRTR-hybrid-unfold.json` | [#921](https://github.com/microsoft/winml-cli/issues/921) | OPEN | analyze | npu-010, gpu-008 |
+| `pending-cpu001-opset-regression-warning.json` | pending | PENDING | build | cpu-001 |
+| `pending-cpu008-layer-norm-fusion-guard.json` | pending | PENDING | optimize | cpu-008 |
+| `pending-npu006-fusedconv-unfuse.json` | pending | PENDING | optimize | npu-006 |
+| `pending-npu007-dvfs-protocol-flag.json` | pending | PENDING | perf | npu-007 |
+
+## How to file a pending issue
+
+```bash
+gh issue create --repo microsoft/winml-cli \
+  --title "<title from json>" \
+  --body "$(cat pending-<name>.json | python -c 'import json,sys; d=json.load(sys.stdin); print(d["summary"] + "\n\n" + d["root_cause"])')" \
+  --label "P2,triaged"
+
+# Then update the JSON:
+# - Set issue_number, github_url, status = "OPEN", filed_date
+# - Rename file from pending-* to <number>-<slug>.json
+```
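Review note (commentary between file diffs, not part of the patch): composing the issue body in Python sidesteps the shell-quoting pitfalls of the one-liner above. The sketch below only builds the argument list for `gh issue create`; it does not invoke `gh`. The helper name `gh_issue_args` is hypothetical, and the example record uses placeholder text rather than the contents of any real pending file.

```python
import json


def gh_issue_args(record, repo="microsoft/winml-cli"):
    """Build an argv list for `gh issue create` from one feature-gap record."""
    if record.get("issue_number") is not None:
        raise ValueError("record is already filed")
    body = record["summary"] + "\n\n" + record["root_cause"]
    args = ["gh", "issue", "create", "--repo", repo,
            "--title", record["title"], "--body", body]
    # One --label per entry so labels containing spaces survive intact.
    for label in record.get("labels", []):
        args += ["--label", label]
    return args


# Placeholder record; a real one would come from json.load() on a pending-*.json file.
example = json.loads("""
{
  "issue_number": null,
  "title": "example title",
  "labels": ["dev experience", "P2"],
  "summary": "Example summary paragraph.",
  "root_cause": "Example root-cause paragraph."
}
""")
args = gh_issue_args(example)
print(args)
```

The resulting list could be passed to `subprocess.run(args, check=True)` once the record has been reviewed. Taking labels from the record, rather than hard-coding `P2,triaged`, keeps a P1 record such as the npu-006 one from being mislabelled.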
diff --git a/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json b/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json
new file mode 100644
index 000000000..738050e6d
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json
@@ -0,0 +1,67 @@
+{
+  "issue_number": null,
+  "github_url": null,
+  "title": "winml build: warn when opset 19/21 regresses dense-Transpose models on CPU EP",
+  "status": "PENDING",
+  "labels": ["bug", "cpu", "dev experience", "P2"],
+  "filed_date": null,
+  "category": "build",
+  "source_findings": ["cpu-001"],
+  "affected_eps": ["cpu"],
+  "affected_arch": ["convnext", "dinov2", "dense_transpose_vit"],
+  "summary": "Auto-config baseline for ConvNext and DINOv2 uses a special Transpose-optimizer bypass path. Any explicit opset override (17, 19, or 21) disrupts this path and causes 3–10x slowdown on CPU EP. Users have no warning when this happens.",
+  "root_cause": "ORT's CPU EP Transpose optimizer uses a code path that only activates when opset is left at the ONNX model's native value. Forcing an explicit opset — even opset 17 (same as baseline) — triggers a different code path that materializes all Transpose operations explicitly, causing catastrophic memory overhead on dense-Transpose graph topologies.",
+  "measured_impact": [
+    {
+      "model": "microsoft/convnext-base",
+      "ep": "cpu",
+      "hypothesis": "h2 (opset19)",
+      "baseline_ms": null,
+      "gain_pct": -290.0,
+      "verdict": "DISCARD",
+      "protocol": "3x300 iters",
+      "date": "2026-06 (prior sweep)"
+    },
+    {
+      "model": "facebook/dinov2-small",
+      "ep": "cpu",
+      "hypothesis": "h1 (opset17 explicit)",
+      "baseline_ms": 112.6,
+      "result_ms": 762.0,
+      "gain_pct": -577.0,
+      "verdict": "DISCARD",
+      "note": "Even forcing opset17 (same as baseline) causes 6.8x regression — it's the explicitness of the override, not the version number.",
+      "protocol": "3x300 iters",
+      "date": "2026-06-18"
+    },
+    {
+      "model": "facebook/dinov2-small",
+      "ep": "cpu",
+      "hypothesis": "h2 (opset19)",
+      "baseline_ms": 112.6,
+      "result_ms": 1106.0,
+      "gain_pct": -882.0,
+      "verdict": "DISCARD",
+      "note": "9.8x slowdown — cpu-001 fires on DINOv2 as hard as ConvNext.",
+      "protocol": "3x300 iters",
+      "date": "2026-06-18"
+    },
+    {
+      "model": "microsoft/resnet-18",
+      "ep": "cpu",
+      "hypothesis": "h2 (opset19)",
+      "baseline_ms": null,
+      "gain_pct": 2.4,
+      "verdict": "MARGINAL",
+      "note": "ResNet is SAFE — sparse Transpose graph, opset changes neutral to slightly positive.",
+      "date": "2026-06-18"
+    }
+  ],
+  "fix_needed": {
+    "file": "winml build (pipeline)",
+    "description": "When user requests opset override on a dense-Transpose model targeting CPU EP, emit a warning: 'cpu-001: opset override may disrupt ORT Transpose optimizer bypass path on this model. Baseline auto-config is recommended for CPU EP with dense-Transpose architectures.'"
+  },
+  "discriminator": "Transpose count >= 49 AND ep == 'cpu' → warn before applying opset override",
+  "related_issues": [],
+  "notes": "cpu-001 was originally documented as ConvNext-specific. 2026-06-18 sweep confirms it fires equally on DINOv2 (both have dense Transpose graphs). Scope updated in cpu.json."
+}
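Review note (commentary between file diffs, not part of the patch): the `gain_pct` values and the "6.8x" / "9.8x" figures in this record are consistent with the arithmetic below. The sign convention (negative means slower than baseline) is inferred from the recorded numbers rather than stated anywhere in the patch, and the function names are hypothetical. `should_warn` restates the record's discriminator.

```python
def gain_pct(baseline_ms, result_ms):
    """Positive = faster than baseline, negative = slower (inferred convention)."""
    return (baseline_ms - result_ms) / baseline_ms * 100.0


def slowdown_factor(baseline_ms, result_ms):
    """How many times slower the result is than the baseline."""
    return result_ms / baseline_ms


def should_warn(transpose_count, ep, threshold=49):
    """Discriminator from the record: dense-Transpose graph on the CPU EP."""
    return ep == "cpu" and transpose_count >= threshold


# DINOv2-small rows from measured_impact (baseline 112.6 ms).
for label, result_ms in (("opset17 explicit", 762.0), ("opset19", 1106.0)):
    print(label,
          round(gain_pct(112.6, result_ms)),
          round(slowdown_factor(112.6, result_ms), 1))
```

Both rows reproduce the recorded values (-577 and -882 percent, 6.8x and 9.8x), so the record is internally consistent under this convention.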
diff --git a/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json b/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json
new file mode 100644
index 000000000..a9d57b3f7
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json
@@ -0,0 +1,46 @@
+{
+  "issue_number": null,
+  "github_url": null,
+  "title": "winml optimize: guard layer_norm_fusion against CNN-ViT hybrid LayerNorm patterns",
+  "status": "PENDING",
+  "labels": ["bug", "cpu", "graph-optimizer", "P2"],
+  "filed_date": null,
+  "category": "optimize",
+  "source_findings": ["cpu-008"],
+  "affected_eps": ["cpu"],
+  "affected_arch": ["mobilevit", "cnn_vit_hybrid"],
+  "summary": "layer_norm_fusion causes catastrophic regression (-997%) on MobileViT CPU. The CNN encoder's LayerNorm is applied to different tensor shapes than standard Transformer LayerNorm — the fusion pattern mismatch results in incorrect kernel selection and extreme slowdown.",
+  "root_cause": "MobileViT uses LayerNorm inside its CNN unfold blocks on patch-level feature tensors (e.g., [B, N_patches, C]). Standard layer_norm_fusion targets [B, seq, hidden] Transformer LN. The fusion incorrectly matches MobileViT LN due to shape similarity, but the fused kernel paths produce much slower execution on the CNN-LN tensor layout.",
+  "measured_impact": [
+    {
+      "model": "apple/mobilevit-small",
+      "ep": "cpu",
+      "hypothesis": "h6 (layer_norm_fusion)",
+      "baseline_ms": 73.0,
+      "result_ms": 803.0,
+      "gain_pct": -997.8,
+      "verdict": "DISCARD",
+      "note": "11x slowdown. Most severe regression observed in the entire CPU catalog sweep.",
+      "protocol": "3x300 iters",
+      "date": "2026-06-18"
+    },
+    {
+      "model": "apple/mobilevit-small",
+      "ep": "cpu",
+      "hypothesis": "h9 (matmul_transpose_fusion)",
+      "baseline_ms": 73.0,
+      "gain_pct": -165.0,
+      "verdict": "DISCARD",
+      "note": "Also regresses badly — consistent with CNN-ViT hybrid pattern mismatch theme.",
+      "date": "2026-06-18"
+    }
+  ],
+  "fix_needed": {
+    "file": "analyze_insight.py",
+    "description": "Detect Gemm→Reshape→Transpose unfold blocks (same as #921 discriminator). If present, add layer_norm_fusion and matmul_transpose_fusion to CPU skip_set.",
+    "note": "Same discriminator as issue #921 — models with CNN-ViT hybrid fingerprint should have both highdimRTR (NPU/GPU) AND layer_norm_fusion (CPU) in their skip_set."
+  },
+  "discriminator": "Gemm→Reshape→Transpose count > 0 AND ep == 'cpu' → skip layer_norm_fusion, matmul_transpose_fusion",
+  "related_issues": [921],
+  "notes": "This is the CPU analog of issue #921. The Gemm-unfold fingerprint is the common discriminator for both. A single detection in analyze_insight.py can feed both skip_sets."
+}
diff --git a/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json b/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json
new file mode 100644
index 000000000..042f65dd2
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json
@@ -0,0 +1,46 @@
+{
+  "issue_number": null,
+  "github_url": null,
+  "title": "winml optimize: add FusedConv detection and unfuse path for QNN EP",
+  "status": "PENDING",
+  "labels": ["bug", "qnn", "graph-optimizer", "P1"],
+  "filed_date": null,
+  "category": "optimize",
+  "source_findings": ["npu-006"],
+  "affected_eps": ["qnn_npu", "qnn_gpu"],
+  "affected_arch": ["resnet", "cnn_dense", "convnext"],
+  "summary": "Conv fusions (conv-bn + conv-add + conv-activation) produce FusedConv nodes that QNN EP cannot dispatch, causing CPU fallback and catastrophic regression (up to 4900%). winml optimize should detect FusedConv nodes when targeting QNN EP and either block the fusion or unfuse them post-build.",
+  "root_cause": "ORT graph optimizer's conv fusion pass combines Conv+BatchNorm+Add+Activation into a single FusedConv node. QNN EP's op support list does not include FusedConv — it falls back to CPU EP for these nodes, which defeats the purpose of QNN execution. The full pack (all 3 fusions together) is catastrophic; individual fusions (conv_add alone) are neutral-to-safe.",
+  "measured_impact": [
+    {
+      "model": "microsoft/resnet-18",
+      "ep": "qnn_npu",
+      "hypothesis": "h4 (full conv fusion pack)",
+      "baseline_ms": 7.23,
+      "result_ms": 361.5,
+      "gain_pct": -4899.0,
+      "verdict": "DISCARD",
+      "note": "4900% regression — pure CPU fallback due to FusedConv unsupported by QNN EP.",
+      "protocol": "3x500 iters",
+      "date": "2026-06 (prior sweep)"
+    },
+    {
+      "model": "microsoft/resnet-18",
+      "ep": "qnn_npu",
+      "hypothesis": "h10 (conv_add_fusion only)",
+      "baseline_ms": 7.23,
+      "gain_pct": 0.93,
+      "verdict": "NEUTRAL",
+      "note": "conv_add alone is SAFE — only the full 3-fusion pack creates FusedConv.",
+      "date": "2026-06-17"
+    }
+  ],
+  "fix_needed": {
+    "description": "Option A (preferred): In winml build pipeline, after applying conv fusions, scan graph for FusedConv nodes. If EP is QNN (NPU or GPU), unfuse them back to Conv+BN+Add+Activation before compilation.",
+    "alternative": "Option B: Block conv_bn_fusion + conv_activation_fusion flags when EP=QNN. Allow conv_add_fusion alone (confirmed safe).",
+    "detection": "ORT graph: node.op_type == 'FusedConv' — these nodes should never reach QNN EP compilation."
+  },
+  "discriminator": "Conv% of total ops > 20% AND ep in ('qnn_npu', 'qnn_gpu') → warn before applying full conv fusion pack",
+  "related_issues": [],
+  "notes": "npu-006 refinement (2026-06-17): conv_add_fusion alone is neutral (+0.93%) and safe. Only the combination (conv-bn + conv-add + conv-activation) creates FusedConv. The catalog sweep h4/h5 guard already warns based on Conv% threshold."
+}
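Review note (commentary between file diffs, not part of the patch): Option B and the detection rule in `fix_needed` are simple enough to express as two small predicates. The sketch below assumes the fusion flags are passed around as plain strings with the names used in this record (`conv_bn_fusion`, `conv_add_fusion`, `conv_activation_fusion`); the function names are hypothetical.

```python
QNN_EPS = {"qnn_npu", "qnn_gpu"}
BLOCKED_ON_QNN = {"conv_bn_fusion", "conv_activation_fusion"}


def filter_fusions(ep, requested):
    """Option B: drop the fusions that lead to FusedConv when targeting QNN."""
    if ep not in QNN_EPS:
        return list(requested)
    return [f for f in requested if f not in BLOCKED_ON_QNN]


def has_fused_conv(op_types):
    """Detection rule from the record, applied to a list of node op_type strings."""
    return "FusedConv" in op_types


full_pack = ["conv_bn_fusion", "conv_add_fusion", "conv_activation_fusion"]
print(filter_fusions("qnn_npu", full_pack))
print(filter_fusions("cpu", full_pack))
print(has_fused_conv(["Conv", "FusedConv", "Relu"]))
```

Only `conv_add_fusion` survives for the QNN targets, which matches the record's note that this fusion alone measured neutral.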
diff --git a/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json b/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json
new file mode 100644
index 000000000..6f6be79cc
--- /dev/null
+++ b/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json
@@ -0,0 +1,48 @@
+{
+  "issue_number": null,
+  "github_url": null,
+  "title": "winml perf: add --dvfs-protocol flag for reliable QNN NPU benchmarking (multi-session + cool-down)",
+  "status": "PENDING",
+  "labels": ["dev experience", "qnn", "NPU", "P2"],
+  "filed_date": null,
+  "category": "perf",
+  "source_findings": ["npu-007"],
+  "affected_eps": ["qnn_npu"],
+  "affected_arch": ["all"],
+  "summary": "Single-session winml perf on QNN NPU can show ±30% variance due to DVFS (Dynamic Voltage/Frequency Scaling) thermal noise. Reliable benchmarking requires 3+ independent sessions with 30s cool-down between sessions. This protocol should be built into winml perf as a first-class flag.",
+  "root_cause": "Snapdragon X Elite HTP (QNN NPU) uses DVFS to manage power/thermal. After sustained inference load, the NPU frequency drops to manage heat. A single bench session may start at full frequency (fast) and end at throttled frequency (slow), making the result unrepresentative. Multi-session protocol with cool-down ensures each session starts at a consistent thermal baseline.",
+  "measured_impact": [
+    {
+      "description": "Single-session CV on MobileViT QNN NPU",
+      "cv_observed": "0.37 (37%)",
+      "note": "CV > 0.15 is common on QNN NPU — current catalog_qnn_sweep.py Phase A screen marks this as 'DVFS noise — high CV expected' and always proceeds to Phase B regardless."
+    },
+    {
+      "description": "Multi-session stability",
+      "note": "3x500 iters with 30s cool-down between sessions achieves consistent results. All 3 session p50s should be compared (range non-overlap criterion) rather than using a single p50."
+    }
+  ],
+  "proposed_api": {
+    "flag": "--dvfs-protocol",
+    "description": "When set, winml perf runs N_sessions independent sessions with cool_down_s seconds between them. Reports median of session p50s, range, and CV-per-session.",
+    "defaults": {
+      "n_sessions": 3,
+      "iterations_per_session": 500,
+      "cool_down_s": 30,
+      "warmup": 10
+    },
+    "output": {
+      "session_p50s_ms": [6.9, 7.0, 8.0],
+      "median_p50_ms": 7.0,
+      "range_ms": 1.1,
+      "dvfs_stable": true
+    }
+  },
+  "fix_needed": {
+    "file": "winml perf command",
+    "description": "Add --dvfs-protocol flag. When active: run N sessions with cool-down, report median + range. Also report whether ranges overlap with a reference run (useful for A/B comparison)."
+  },
+  "discriminator": "ep == 'qnn_npu' → recommend --dvfs-protocol for reliable results",
+  "related_issues": [],
+  "notes": "The autoconfig catalog_qnn_sweep.py already implements this protocol internally (FULL_SESSIONS=3, COOL_DOWN_S=30, FULL_ITERS=500). Promoting to winml perf as a first-class flag would let users get reliable NPU numbers without needing the full autoconfig sweep infrastructure."
+}
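Review note (commentary between file diffs, not part of the patch): the proposed `--dvfs-protocol` behaviour can be sketched as a small driver that runs sessions with a cool-down in between and emits the fields listed under `proposed_api.output`. The record does not define `dvfs_stable`, so the rule used here (range below 50% of the median, borrowed from the review protocol) is an assumption, as are the function name and the injected `run_session` and `sleep` callables.

```python
import time
from statistics import median


def run_dvfs_protocol(run_session, n_sessions=3, cool_down_s=30, sleep=time.sleep):
    """Run independent sessions with a cool-down in between and summarise them.

    run_session is a caller-supplied callable returning one session's p50 in ms.
    """
    p50s = []
    for i in range(n_sessions):
        if i > 0:
            sleep(cool_down_s)  # pause so each session starts from a cooler state
        p50s.append(run_session())
    med = median(p50s)
    spread = max(p50s) - min(p50s)
    return {
        "session_p50s_ms": p50s,
        "median_p50_ms": med,
        "range_ms": round(spread, 3),
        # Assumption: "stable" means the range is below 50% of the median.
        "dvfs_stable": spread < 0.5 * med,
    }


# Replay the example output values with a fake session and a no-op sleep.
fake_sessions = iter([6.9, 7.0, 8.0])
report = run_dvfs_protocol(lambda: next(fake_sessions), sleep=lambda s: None)
print(report)
```

Injecting `sleep` keeps the sketch testable without waiting out real cool-downs; a real implementation would call the benchmark inside `run_session` and keep the default `time.sleep`.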
diff --git a/research/autoconfig/docs/self-evolution-design.html b/research/autoconfig/docs/self-evolution-design.html
new file mode 100644
index 000000000..28d15ee9d
--- /dev/null
+++ b/research/autoconfig/docs/self-evolution-design.html
@@ -0,0 +1,698 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>autoconfig Skill — Self-Evolution Design</title>
+<style>
+  * { box-sizing: border-box; margin: 0; padding: 0; }
+  body {
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+    background: #f4f6f9;
+    padding: 32px 24px;
+    color: #1a1a2e;
+    font-size: 13px;
+    line-height: 1.5;
+  }
+
+  /* ── Header ── */
+  h1 { font-size: 20px; font-weight: 700; color: #1a1a2e; margin-bottom: 4px; }
+  .subtitle { font-size: 12px; color: #666; margin-bottom: 6px; }
+  .tag {
+    display: inline-block; border-radius: 4px; padding: 2px 8px;
+    font-size: 10px; font-weight: 700; letter-spacing: 0.5px; margin-right: 6px; margin-bottom: 24px;
+  }
+  .tag-design { background: #e8eaf6; color: #3949ab; }
+  .tag-status { background: #fff3e0; color: #e65100; }
+
+  /* ── Section headers ── */
+  h2 {
+    font-size: 13px; font-weight: 700; color: #1a1a2e;
+    text-transform: uppercase; letter-spacing: 0.8px;
+    margin: 32px 0 14px;
+    padding-bottom: 6px;
+    border-bottom: 2px solid #e0e4ef;
+  }
+  h3 { font-size: 13px; font-weight: 700; color: #3949ab; margin-bottom: 8px; }
+
+  /* ── Cards ── */
+  .card-grid {
+    display: grid; gap: 12px;
+  }
+  .card-grid-2 { grid-template-columns: 1fr 1fr; }
+  .card-grid-3 { grid-template-columns: 1fr 1fr 1fr; }
+  .card-grid-4 { grid-template-columns: 1fr 1fr 1fr 1fr; }
+
+  .card {
+    background: #fff; border-radius: 10px;
+    padding: 16px 18px;
+    border: 1px solid #e0e4ef;
+  }
+  .card.red    { border-left: 4px solid #ef5350; }
+  .card.orange { border-left: 4px solid #ff9800; }
+  .card.green  { border-left: 4px solid #43a047; }
+  .card.blue   { border-left: 4px solid #1976d2; }
+  .card.purple { border-left: 4px solid #7b1fa2; }
+  .card.grey   { border-left: 4px solid #90a4ae; }
+
+  .card-title { font-size: 11px; font-weight: 700; text-transform: uppercase; letter-spacing: 0.6px; margin-bottom: 6px; }
+  .card.red    .card-title { color: #c62828; }
+  .card.orange .card-title { color: #e65100; }
+  .card.green  .card-title { color: #2e7d32; }
+  .card.blue   .card-title { color: #0d47a1; }
+  .card.purple .card-title { color: #6a1b9a; }
+  .card.grey   .card-title { color: #546e7a; }
+
+  .card p { font-size: 12px; color: #444; margin-bottom: 6px; }
+  .card p:last-child { margin-bottom: 0; }
+
+  /* ── Confidence Ladder ── */
+  .ladder { display: flex; flex-direction: column; gap: 0; max-width: 860px; }
+  .ladder-rung {
+    display: flex; align-items: stretch;
+  }
+  .rung-level {
+    width: 52px; flex-shrink: 0;
+    display: flex; align-items: center; justify-content: center;
+    font-size: 16px; font-weight: 900; border-radius: 10px 0 0 10px;
+    color: #fff;
+  }
+  .rung-body {
+    flex: 1; padding: 12px 16px; border-radius: 0 10px 10px 0;
+    border: 1px solid transparent; border-left: none;
+  }
+  .rung-name { font-size: 11px; font-weight: 700; text-transform: uppercase; letter-spacing: 0.6px; }
+  .rung-desc { font-size: 12px; color: #444; margin-top: 3px; }
+  .rung-gate { font-size: 11px; font-style: italic; color: #666; margin-top: 3px; }
+  .ladder-connector {
+    width: 52px; flex-shrink: 0; display: flex; justify-content: center;
+    align-items: center; color: #bbb; font-size: 18px; height: 18px; margin: -1px 0;
+  }
+
+  .l1 { background: #90a4ae; }  .l1-body { background: #f5f7f8; border-color: #cfd8dc; }
+  .l2 { background: #42a5f5; }  .l2-body { background: #e3f2fd; border-color: #bbdefb; }
+  .l3 { background: #66bb6a; }  .l3-body { background: #e8f5e9; border-color: #c8e6c9; }
+  .l4 { background: #ffa726; }  .l4-body { background: #fff8e1; border-color: #ffe082; }
+  .l5 { background: #ab47bc; }  .l5-body { background: #f3e5f5; border-color: #e1bee7; }
+
+  /* ── Pain point table ── */
+  table { width: 100%; border-collapse: collapse; font-size: 12px; }
+  th { background: #e8eaf6; color: #3949ab; font-weight: 700; text-align: left; padding: 8px 12px; font-size: 11px; text-transform: uppercase; letter-spacing: 0.5px; }
+  td { padding: 8px 12px; border-bottom: 1px solid #e0e4ef; vertical-align: top; }
+  tr:last-child td { border-bottom: none; }
+  tr:nth-child(even) td { background: #f9fafc; }
+
+  .sev-high { color: #c62828; font-weight: 700; }
+  .sev-med  { color: #e65100; font-weight: 700; }
+  .sev-low  { color: #2e7d32; font-weight: 700; }
+
+  /* ── Code blocks ── */
+  pre {
+    background: #1e1e2e; color: #cdd6f4;
+    border-radius: 8px; padding: 14px 16px;
+    font-size: 11.5px; line-height: 1.6;
+    overflow-x: auto; margin-top: 8px;
+  }
+  code { font-family: "Cascadia Code", "Fira Code", "Consolas", monospace; }
+  .c-kw { color: #cba6f7; }   /* keyword */
+  .c-fn { color: #89b4fa; }   /* function */
+  .c-st { color: #a6e3a1; }   /* string */
+  .c-cm { color: #6c7086; font-style: italic; }  /* comment */
+  .c-nu { color: #fab387; }   /* number */
+  .c-tp { color: #f5c2e7; }   /* type */
+
+  /* ── Flow diagram ── */
+  .flow { display: flex; align-items: center; gap: 0; flex-wrap: wrap; margin-top: 10px; }
+  .flow-box {
+    background: #fff; border: 1px solid #c5cae9; border-radius: 8px;
+    padding: 10px 14px; font-size: 11px; text-align: center; min-width: 120px;
+  }
+  .flow-box strong { display: block; font-size: 11px; color: #1a1a2e; }
+  .flow-box span { font-size: 10px; color: #666; display: block; margin-top: 2px; }
+  .flow-box.new { border-color: #42a5f5; background: #e3f2fd; }
+  .flow-box.new strong { color: #0d47a1; }
+  .flow-arrow { font-size: 18px; color: #9fa8da; padding: 0 6px; }
+  .flow-label { font-size: 10px; color: #666; text-align: center; margin-top: 4px; }
+
+  /* ── Priority table ── */
+  .p0 { background: #ffebee; color: #c62828; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+  .p1 { background: #fff8e1; color: #e65100; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+  .p2 { background: #e8f5e9; color: #2e7d32; font-weight: 700; border-radius: 4px; padding: 2px 6px; font-size: 10px; }
+
+  .check { color: #43a047; font-weight: 700; }
+  .cross  { color: #e53935; font-weight: 700; }
+  .wip    { color: #fb8c00; font-weight: 700; }
+
+  /* ── Inline labels ── */
+  .pill {
+    display: inline-block; border-radius: 20px; padding: 1px 8px;
+    font-size: 10px; font-weight: 700; letter-spacing: 0.3px;
+  }
+  .pill-new    { background: #e3f2fd; color: #1565c0; }
+  .pill-exist  { background: #e8f5e9; color: #2e7d32; }
+  .pill-mod    { background: #fff3e0; color: #e65100; }
+
+  .note {
+    background: #fff9c4; border-left: 3px solid #f9a825;
+    border-radius: 0 8px 8px 0; padding: 10px 14px;
+    font-size: 12px; color: #555; margin-top: 12px;
+  }
+  .note strong { color: #f57f17; }
+
+  /* ── Tabs ── */
+  .tabs {
+    display: flex; gap: 4px; margin-bottom: 8px;
+    border-bottom: 2px solid #e0e4ef; max-width: 900px;
+  }
+  .tab-btn {
+    appearance: none; border: none; background: none; cursor: pointer;
+    font-family: inherit; font-size: 12px; font-weight: 700;
+    color: #888; padding: 9px 16px; border-radius: 8px 8px 0 0;
+    border-bottom: 2px solid transparent; margin-bottom: -2px;
+    letter-spacing: 0.3px;
+  }
+  .tab-btn:hover { color: #3949ab; background: #eef0f8; }
+  .tab-btn.active { color: #3949ab; border-bottom-color: #3949ab; background: #fff; }
+  .tab-btn .badge {
+    display: inline-block; background: #e8eaf6; color: #3949ab;
+    border-radius: 10px; padding: 0 6px; font-size: 10px; margin-left: 6px;
+  }
+  .tab-panel { display: none; }
+  .tab-panel.active { display: block; }
+  .tab-panel > h2:first-child { margin-top: 18px; }
+
+  @media (max-width: 720px) {
+    .card-grid-2, .card-grid-3, .card-grid-4 { grid-template-columns: 1fr; }
+  }
+</style>
+</head>
+<body>
+
+<h1>autoconfig Skill — Self-Evolution Design</h1>
+<p class="subtitle">How the sweep loop learns, stabilizes, and improves itself over time</p>
+<span class="tag tag-design">DESIGN</span>
+<span class="tag tag-status">POC → V2 ROADMAP</span>
+
+<div class="tabs">
+  <button class="tab-btn active" data-tab="overview">Overview &amp; Roadmap</button>
+  <button class="tab-btn" data-tab="fixes">The 5 Fixes<span class="badge">code</span></button>
+</div>
+
+<!-- ═══════════════════ TAB: OVERVIEW (part A) ═══════════════════ -->
+<div class="tab-panel active" data-tab="overview">
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>1 · Problem Statement</h2>
+
+<p style="margin-bottom:14px; color:#444;">Performance noise makes sweep conclusions <strong>unreliable</strong> — results vary run-to-run, causing false KB promotions and wasted sweep time.</p>
+
+<table>
+  <thead><tr><th>Pain Point</th><th>Root Cause</th><th>Severity</th><th>Gap</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><strong>DVFS thermal noise</strong></td>
+      <td>NPU frequency scales with temp; same model 2× slower when hot</td>
+      <td><span class="sev-high">CRITICAL</span></td>
+      <td>Fixed 30s cool-down ignores actual temperature</td>
+    </tr>
+    <tr>
+      <td><strong>Sequence bias</strong></td>
+      <td>Baseline runs cold, late hypotheses run hot — unfair</td>
+      <td><span class="sev-high">HIGH</span></td>
+      <td>Systematic order bias, no mitigation</td>
+    </tr>
+    <tr>
+      <td><strong>No paired comparison</strong></td>
+      <td>Baseline &amp; hypothesis measured at different thermal moments</td>
+      <td><span class="sev-high">HIGH</span></td>
+      <td>Delta confounds drift with gain (unpaired)</td>
+    </tr>
+    <tr>
+      <td><strong>Fixed sample size</strong></td>
+      <td>3 sessions regardless of variance</td>
+      <td><span class="sev-med">MEDIUM</span></td>
+      <td>No adaptive sampling during sweep</td>
+    </tr>
+    <tr>
+      <td><strong>Manual KB promotion</strong></td>
+      <td>Findings written by hand from logs</td>
+      <td><span class="sev-med">MEDIUM</span></td>
+      <td>KB grows only when a human reads logs</td>
+    </tr>
+    <tr>
+      <td><strong>No prioritization</strong></td>
+      <td>All 14 hypotheses run per model, even irrelevant ones</td>
+      <td><span class="sev-low">LOW-MED</span></td>
+      <td>Sweeps don't consume skip_set yet</td>
+    </tr>
+  </tbody>
+</table>
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>2 · Confidence-Gated Promotion (L1 → L5)</h2>
+
+<p style="margin-bottom:14px; color:#444;">Findings climb confidence levels via quantitative gates — no manual judgement. KB holds L3+ only.</p>
+
+<div class="ladder">
+
+  <div class="ladder-rung">
+    <div class="rung-level l1">L1</div>
+    <div class="rung-body l1-body">
+      <div class="rung-name">Observed — Single Model, Single Run</div>
+      <div class="rung-desc">Beats baseline in one sweep. Stored in <code>results.json</code>.</div>
+      <div class="rung-gate">Gate: median gain &gt; 5%.</div>
+    </div>
+  </div>
+
+  <div class="ladder-rung" style="height:14px;">
+    <div class="ladder-connector">↓</div>
+    <div style="flex:1; display:flex; align-items:center; padding-left:10px; font-size:10px; color:#888; font-style:italic;">
+      Paired A/B bench → range non-overlap
+    </div>
+  </div>
+
+  <div class="ladder-rung">
+    <div class="rung-level l2">L2</div>
+    <div class="rung-body l2-body">
+      <div class="rung-name">Confirmed — Statistically Robust</div>
+      <div class="rung-desc">All hypothesis p50s beat all baseline p50s; 95% CI excludes 0.</div>
+      <div class="rung-gate">Gate: max(hyp_p50s) &lt; min(baseline_p50s).</div>
+    </div>
+  </div>
+
+  <div class="ladder-rung" style="height:14px;">
+    <div class="ladder-connector">↓</div>
+    <div style="flex:1; display:flex; align-items:center; padding-left:10px; font-size:10px; color:#888; font-style:italic;">
+      promote_findings.py — same flags on 2+ models, same arch
+    </div>
+  </div>
+
+  <div class="ladder-rung">
+    <div class="rung-level l3">L3</div>
+    <div class="rung-body l3-body">
+      <div class="rung-name">Generalized — Architecture Rule</div>
+      <div class="rung-desc">Same flags give L2 gains on ≥2 models of one arch class. Written to <code>ep_knowledge/&lt;ep&gt;.json</code>.</div>
+      <div class="rung-gate">Gate: ≥2 L2 with same (flags, arch_class).</div>
+    </div>
+  </div>
+
+  <div class="ladder-rung" style="height:14px;">
+    <div class="ladder-connector">↓</div>
+    <div style="flex:1; display:flex; align-items:center; padding-left:10px; font-size:10px; color:#888; font-style:italic;">
+      promote_findings.py — confirmed across arch classes
+    </div>
+  </div>
+
+  <div class="ladder-rung">
+    <div class="rung-level l4">L4</div>
+    <div class="rung-body l4-body">
+      <div class="rung-name">Cross-Cutting Rule</div>
+      <div class="rung-desc">Applies across ≥3 arch classes; scope broadens to EP-wide.</div>
+      <div class="rung-gate">Gate: ≥3 L2 across ≥3 arch_class values.</div>
+    </div>
+  </div>
+
+  <div class="ladder-rung" style="height:14px;">
+    <div class="ladder-connector">↓</div>
+    <div style="flex:1; display:flex; align-items:center; padding-left:10px; font-size:10px; color:#888; font-style:italic;">
+      analyze_insight.py predicts from graph fingerprint
+    </div>
+  </div>
+
+  <div class="ladder-rung">
+    <div class="rung-level l5">L5</div>
+    <div class="rung-body l5-body">
+      <div class="rung-name">Predictive — No Sweep Required</div>
+      <div class="rung-desc">Graph fingerprint predicts whether a hypothesis helps or hurts before it runs; the sweep skips that run and emits the optimal config.</div>
+      <div class="rung-gate">Gate: L4 rule + pattern match, &lt;5% false-positive on held-out models.</div>
+    </div>
+  </div>
+
+</div>
+
+<div class="note" style="max-width:860px; margin-top:16px;">
+  <strong>Current state (2026-06-18):</strong> Most findings sit at L2, assigned manually. npu-001 &amp; npu-007 have L3-grade evidence. <code>promote_findings.py</code> and L5 prediction are not yet built.
+</div>
+
+</div><!-- end overview part A -->
+
+<!-- ═══════════════════ TAB: THE 5 FIXES ═══════════════════ -->
+<div class="tab-panel" data-tab="fixes">
+
+<p style="margin:14px 0; color:#444; max-width:900px;">Each fix targets one or more pain points from §1: <strong>#1</strong> thermal noise, sequence bias, and unpaired comparison · <strong>#2</strong> fixed sample size · <strong>#3</strong> no prioritization · <strong>#4</strong> manual KB promotion · <strong>#5</strong> thermal noise (calibration backstop).</p>
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>Fix #1 — Paired A/B Bench Protocol</h2>
+
+<div class="card-grid card-grid-2" style="margin-bottom:14px;">
+  <div class="card red">
+    <div class="card-title">❌ Current: Sequential (Biased)</div>
+    <p>All baseline runs, then all hypothesis runs. Baseline cold, hypothesis warm — "gain" includes thermal drift.</p>
+    <pre><code><span class="c-cm"># device heats up →</span>
+h0: [base] [base] [base]   <span class="c-cm"># cool</span>
+h6: [hyp]  [hyp]  [hyp]    <span class="c-cm"># warm</span>
+<span class="c-cm"># delta = optimization + drift (confounded)</span></code></pre>
+  </div>
+  <div class="card green">
+    <div class="card-title">✅ New: Paired A/B (Unbiased)</div>
+    <p>Each pair runs baseline then hypothesis in one thermal window. Average the within-pair ratios so that slow drift between pairs largely cancels.</p>
+    <pre><code>pair_n: [base] → [hyp]   <span class="c-cm"># ratio = (base-hyp)/base</span>
+gain = mean(ratios) ± 95% CI
+<span class="c-cm"># drift appears in both → cancels</span></code></pre>
+  </div>
+</div>
+
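+<pre><code><span class="c-kw">import</span> math
+<span class="c-kw">import</span> statistics
+<span class="c-kw">import</span> time
+
+<span class="c-cm"># run_perf_session() is defined elsewhere; assumed to return a latency in ms (falsy on failure).</span>
+<span class="c-kw">def</span> <span class="c-fn">paired_ab_bench</span>(baseline, hyp, n_pairs=<span class="c-nu">3</span>, iters=<span class="c-nu">500</span>, cool_down_s=<span class="c-nu">30</span>) -> <span class="c-tp">dict</span>: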
+    <span class="c-cm">"""Interleaved A/B bench → gains_pct list + CI + verdict."""</span>
+    gains = []
+    <span class="c-kw">for</span> i <span class="c-kw">in</span> <span class="c-fn">range</span>(n_pairs):
+        b = <span class="c-fn">run_perf_session</span>(baseline, iters)
+        h = <span class="c-fn">run_perf_session</span>(hyp, iters)
+        <span class="c-kw">if</span> b <span class="c-kw">and</span> h: gains.append((b - h) / b * <span class="c-nu">100</span>)
+        <span class="c-kw">if</span> i &lt; n_pairs - <span class="c-nu">1</span>: time.sleep(cool_down_s)
+    <span class="c-kw">if not</span> gains: <span class="c-kw">return</span> {<span class="c-st">"verdict"</span>: <span class="c-st">"BENCH_FAIL"</span>}
+    mean = statistics.mean(gains)
+    ci   = <span class="c-nu">1.96</span> * statistics.stdev(gains) / math.sqrt(len(gains)) <span class="c-kw">if</span> len(gains) > <span class="c-nu">1</span> <span class="c-kw">else</span> <span class="c-nu">999</span>
+    verdict = (<span class="c-st">"KEEP_CONFIRMED"</span> <span class="c-kw">if</span> mean - ci > <span class="c-nu">5</span> <span class="c-kw">else</span>
+               <span class="c-st">"DISCARD"</span> <span class="c-kw">if</span> mean + ci &lt; -<span class="c-nu">2</span> <span class="c-kw">else</span> <span class="c-st">"MARGINAL"</span>)
+    <span class="c-kw">return</span> {<span class="c-st">"gains_pct"</span>: gains, <span class="c-st">"mean_gain_pct"</span>: <span class="c-fn">round</span>(mean, <span class="c-nu">2</span>),
+            <span class="c-st">"ci_half_95"</span>: <span class="c-fn">round</span>(ci, <span class="c-nu">2</span>), <span class="c-st">"verdict"</span>: verdict}</code></pre>
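+<p style="margin:12px 0; color:#444; max-width:900px;">With only three pairs, the normal-approximation multiplier 1.96 used above understates the interval; a Student-t multiplier (4.303 for two degrees of freedom) is the usual small-sample choice. The helper below is a hypothetical illustration, not existing code: <code>ci_half_95</code> and <code>T_CRIT_975</code> are names assumed here, and whether the +5% / -2% thresholds should be retuned for the wider interval is left open.</p>

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# Standard table values; df 1..7 covers 2 to 8 pairs.
T_CRIT_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365}


def ci_half_95(gains: list[float]) -> float:
    """Half-width of a 95% CI for the mean of `gains` (hypothetical helper)."""
    n = len(gains)
    if n < 2:
        return float("inf")  # a single pair carries no variance estimate
    # Beyond the table, fall back to the normal value (a slight underestimate).
    t_crit = T_CRIT_975.get(n - 1, 1.96)
    return t_crit * statistics.stdev(gains) / math.sqrt(n)


# Illustrative gains only, not measured results.
example_gains = [8.0, 10.0, 12.0]
print(round(statistics.mean(example_gains), 2), round(ci_half_95(example_gains), 2))
```

+<p style="margin:12px 0; color:#444; max-width:900px;">For these illustrative gains the t-based half-width is about 4.97 points, against about 2.26 with the 1.96 multiplier, so a borderline KEEP_CONFIRMED at three pairs is much less certain than the normal approximation suggests.</p>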
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>Fix #2 — Adaptive n_sessions</h2>
+
+<p style="margin-bottom:10px; color:#444;">Keep sampling until the 95% CI is decisive — or budget runs out — instead of a fixed N.</p>
+
+<div class="card-grid card-grid-2">
+  <div class="card blue">
+    <div class="card-title">Stopping Criterion</div>
+    <p><strong>Stop early:</strong> <code>CI_lower &gt; +5%</code> (KEEP) or <code>CI_upper &lt; -2%</code> (DISCARD).</p>
+    <p><strong>Force stop</strong> at <code>MAX_PAIRS = 8</code> → MARGINAL.</p>
+    <p>Stable models finish in 3 pairs; noisy ones get more automatically.</p>
+  </div>
+  <div class="card blue">
+    <div class="card-title">Budget Allocation</div>
+    <p><strong>Priority queue:</strong> test highest-prior hypotheses first.</p>
+    <p>Once a KEEP_CONFIRMED is found, remaining hypotheses get fewer pairs (quick reject/confirm).</p>
+  </div>
+</div>
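+<p style="margin:12px 0; color:#444; max-width:900px;">A minimal sketch of the stopping rule, under stated assumptions: <code>adaptive_pairs</code> and the <code>run_pair</code> callable are hypothetical names, <code>MIN_PAIRS = 3</code> is inferred from "stable models finish in 3 pairs", and the interval uses the same 1.96 multiplier as <code>paired_ab_bench</code>. Cool-down between pairs is omitted for brevity.</p>

```python
import math
import statistics
from typing import Callable

MAX_PAIRS = 8  # force-stop budget stated in the Stopping Criterion card
MIN_PAIRS = 3  # assumed floor before any verdict is attempted


def adaptive_pairs(run_pair: Callable[[], float],
                   keep_pct: float = 5.0, discard_pct: float = -2.0) -> dict:
    """Sample paired gains until the 95% CI is decisive or the budget runs out.

    `run_pair` is a hypothetical callable returning one within-pair gain in percent.
    """
    gains: list[float] = []
    while len(gains) < MAX_PAIRS:
        gains.append(run_pair())
        if len(gains) < MIN_PAIRS:
            continue
        mean = statistics.mean(gains)
        half = 1.96 * statistics.stdev(gains) / math.sqrt(len(gains))
        if mean - half > keep_pct:        # CI_lower > +5%
            return {"verdict": "KEEP_CONFIRMED", "n_pairs": len(gains)}
        if mean + half < discard_pct:     # CI_upper < -2%
            return {"verdict": "DISCARD", "n_pairs": len(gains)}
    return {"verdict": "MARGINAL", "n_pairs": len(gains)}


# Illustrative gains only, not measured results.
_fake_gains = iter([10.0, 10.5, 9.5])
print(adaptive_pairs(lambda: next(_fake_gains)))
```

+<p style="margin:12px 0; color:#444; max-width:900px;">Passing the measurement as a callable keeps the stopping logic testable without hardware; a real sweep would wrap one baseline/hypothesis pair in it.</p>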
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>Fix #3 — Architecture-Based Hypothesis Pruning</h2>
+
+<p style="margin-bottom:10px; color:#444;">Sweeps consume <code>analyze_insight.py</code> graph patterns to skip irrelevant/harmful hypotheses — cutting 14 to 4–5 per model.</p>
+
+<table>
+  <thead><tr><th>Architecture Class</th><th>Graph Fingerprint</th><th>Skip (known harmful)</th><th>Prioritize (likely helpful)</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><strong>Pure ViT</strong><br><small>DINOv2, ViT-B, YOLOS</small></td>
+      <td>Dense Transpose (≥49), no Gemm-unfold blocks</td>
+      <td>conv fusions (h4/h5/h10), layer_norm_fusion</td>
+      <td>opset21 (npu-001), highdimRTR (L4), bias_softmax</td>
+    </tr>
+    <tr>
+      <td><strong>Pure CNN</strong><br><small>ResNet-18, EfficientNet</small></td>
+      <td>Conv% &gt; 20%, sparse Transpose</td>
+      <td>attention_fusion, highdimRTR, bias_softmax</td>
+      <td>matmul_transpose_fusion (cpu-007), opset19/21 safe</td>
+    </tr>
+    <tr>
+      <td><strong>CNN-ViT Hybrid</strong><br><small>MobileViT, EfficientFormer</small></td>
+      <td>Gemm→Reshape→Transpose unfold blocks present</td>
+      <td>highdimRTR <span style="color:#c62828">⚠ -19% NPU</span>, layer_norm_fusion <span style="color:#c62828">⚠ -997% CPU</span>, matmul_transpose CPU</td>
+      <td>opset21 + matmul_transpose_fusion (NPU h6: +42%)</td>
+    </tr>
+    <tr>
+      <td><strong>BERT / NLP Encoder</strong><br><small>BERT, RoBERTa, DistilBERT, MiniLM</small></td>
+      <td>Attention pattern, sparse Transpose, Add→Softmax</td>
+      <td>conv fusions, layer_norm_fusion (BERT LN≠CV LN), opset21 (cpu-001 on dense-Transpose subclass)</td>
+      <td>attention_fusion, bias_softmax_fusion (npu-009 +14%)</td>
+    </tr>
+    <tr>
+      <td><strong>Dense-Transpose ViT</strong><br><small>ConvNext, DINOv2-style</small></td>
+      <td>Transpose count ≥ 49 AND Gemm-unfold absent</td>
+      <td>opset19/21 on CPU <span style="color:#c62828">⚠ cpu-001 10× slowdown</span></td>
+      <td>opset17 explicit (baseline), highdimRTR</td>
+    </tr>
+  </tbody>
+</table>
+
+<pre style="margin-top:12px;"><code><span class="c-kw">def</span> <span class="c-fn">get_hypothesis_skip_set</span>(model_type, candidates) -> <span class="c-tp">set</span>[<span class="c-tp">str</span>]:
+    <span class="c-st">"""Skip hypotheses by arch fingerprint + KB rules."""</span>
+    skip, tags = <span class="c-fn">set</span>(), {c.tag <span class="c-kw">for</span> c <span class="c-kw">in</span> candidates}
+    <span class="c-kw">if</span> <span class="c-st">"conv_dense"</span> <span class="c-kw">in</span> tags:                  skip |= {<span class="c-st">"h4"</span>, <span class="c-st">"h5"</span>}   <span class="c-cm"># npu-006</span>
+    <span class="c-kw">if</span> <span class="c-st">"gemm_reshape_transpose_unfold"</span> <span class="c-kw">in</span> tags: skip.add(<span class="c-st">"h9"</span>)       <span class="c-cm"># npu-010 highdimRTR</span>
+    <span class="c-kw">if</span> model_type == <span class="c-st">"mobilevit"</span>:             skip.add(<span class="c-st">"h6"</span>)       <span class="c-cm"># cpu-008 layer_norm</span>
+    <span class="c-kw">if</span> <span class="c-st">"dense_transpose"</span> <span class="c-kw">in</span> tags:             skip |= {<span class="c-st">"h2"</span>, <span class="c-st">"h3"</span>}   <span class="c-cm"># cpu-001 opset19/21</span>
+    <span class="c-kw">return</span> skip</code></pre>
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>Fix #4 — promote_findings.py</h2>
+
+<p style="margin-bottom:10px; color:#444;">Post-processing script: reads all <code>results.json</code>, applies the confidence ladder, auto-updates the KB.</p>
+
+<svg width="860" height="110" viewBox="0 0 860 110" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="max-width:100%;display:block;margin-top:10px;">
+  <defs>
+    <marker id="arr-pf" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+  <!-- Box 1: results.json -->
+  <rect x="10" y="15" width="170" height="72" rx="8" fill="#f5f7f8" stroke="#cfd8dc" stroke-width="1.5"/>
+  <text x="95" y="43" text-anchor="middle" font-size="11.5" font-weight="700" fill="#37474f">results.json × N</text>
+  <text x="95" y="58" text-anchor="middle" font-size="10.5" fill="#666">catalog-*-sweep/</text>
+  <text x="95" y="72" text-anchor="middle" font-size="10.5" fill="#666">*/results.json</text>
+  <!-- Arrow 1→2 -->
+  <line x1="180" y1="51" x2="218" y2="51" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-pf)"/>
+  <!-- Box 2: promote_findings.py -->
+  <rect x="220" y="15" width="190" height="72" rx="8" fill="#e3f2fd" stroke="#90caf9" stroke-width="1.5"/>
+  <text x="315" y="40" text-anchor="middle" font-size="11.5" font-weight="700" fill="#0d47a1">promote_findings.py</text>
+  <text x="315" y="56" text-anchor="middle" font-size="10.5" fill="#1565c0">L1→L2: range non-overlap</text>
+  <text x="315" y="70" text-anchor="middle" font-size="10.5" fill="#1565c0">L2→L3: 2+ models, same arch</text>
+  <!-- Arrow 2→3 -->
+  <line x1="410" y1="51" x2="448" y2="51" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-pf)"/>
+  <!-- Box 3: ep_knowledge -->
+  <rect x="450" y="15" width="175" height="72" rx="8" fill="#e8f5e9" stroke="#a5d6a7" stroke-width="1.5"/>
+  <text x="537" y="40" text-anchor="middle" font-size="11.5" font-weight="700" fill="#1b5e20">ep_knowledge/*.json</text>
+  <text x="537" y="56" text-anchor="middle" font-size="10.5" fill="#2e7d32">auto-generated findings</text>
+  <text x="537" y="70" text-anchor="middle" font-size="10.5" fill="#2e7d32">with evidence list</text>
+  <!-- Arrow 3→4 -->
+  <line x1="625" y1="51" x2="663" y2="51" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-pf)"/>
+  <!-- Box 4: analyze_insight -->
+  <rect x="665" y="15" width="185" height="72" rx="8" fill="#e8eaf6" stroke="#9fa8da" stroke-width="1.5"/>
+  <text x="757" y="40" text-anchor="middle" font-size="11.5" font-weight="700" fill="#283593">analyze_insight.py</text>
+  <text x="757" y="56" text-anchor="middle" font-size="10.5" fill="#3949ab">reads KB → skip_set</text>
+  <text x="757" y="70" text-anchor="middle" font-size="10.5" fill="#3949ab">for next sweep</text>
+</svg>
+
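+<pre style="margin-top:12px;"><code><span class="c-kw">import</span> json
+<span class="c-kw">from</span> collections <span class="c-kw">import</span> defaultdict
+
+<span class="c-cm"># sweep_dirs: iterable of pathlib.Path, one directory per sweep.</span>
+<span class="c-kw">def</span> <span class="c-fn">collect_l2_candidates</span>(sweep_dirs):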
+    <span class="c-st">"""L1→L2: range non-overlap for a single model."""</span>
+    out = []
+    <span class="c-kw">for</span> d <span class="c-kw">in</span> sweep_dirs:
+        r = json.loads((d / <span class="c-st">"results.json"</span>).read_text())
+        base = r[<span class="c-st">"hypotheses"</span>][<span class="c-st">"h0"</span>][<span class="c-st">"full"</span>][<span class="c-st">"p50s_ms"</span>]
+        <span class="c-kw">for</span> h_id, h <span class="c-kw">in</span> r[<span class="c-st">"hypotheses"</span>].items():
+            <span class="c-kw">if</span> h.get(<span class="c-st">"verdict"</span>) == <span class="c-st">"KEEP_CONFIRMED"</span> <span class="c-kw">and</span> <span class="c-fn">max</span>(h[<span class="c-st">"full"</span>][<span class="c-st">"p50s_ms"</span>]) &lt; <span class="c-fn">min</span>(base):
+                out.append({<span class="c-st">"arch"</span>: r[<span class="c-st">"model_type"</span>], <span class="c-st">"ep"</span>: r[<span class="c-st">"ep"</span>],
+                            <span class="c-st">"flags"</span>: h.get(<span class="c-st">"extra_optim"</span>, {}), <span class="c-st">"gain_pct"</span>: h[<span class="c-st">"mean_gain_pct"</span>]})
+    <span class="c-kw">return</span> out
+
+<span class="c-kw">def</span> <span class="c-fn">promote_to_l3</span>(l2s):
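+    <span class="c-st">"""L2→L3: same flags on ≥2 models of one arch class."""</span>
+    <span class="c-cm"># Counts L2 entries, so this assumes one sweep dir per model; dedupe by model otherwise.</span>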
+    g = <span class="c-fn">defaultdict</span>(<span class="c-tp">list</span>)
+    <span class="c-kw">for</span> c <span class="c-kw">in</span> l2s:
+        g[(c[<span class="c-st">"ep"</span>], <span class="c-fn">frozenset</span>(c[<span class="c-st">"flags"</span>].items()), c[<span class="c-st">"arch"</span>])].append(c)
+    <span class="c-kw">return</span> [{<span class="c-st">"ep"</span>: ep, <span class="c-st">"flags"</span>: <span class="c-fn">dict</span>(f), <span class="c-st">"arch"</span>: a, <span class="c-st">"evidence"</span>: ev}
+            <span class="c-kw">for</span> (ep, f, a), ev <span class="c-kw">in</span> g.items() <span class="c-kw">if</span> len(ev) >= <span class="c-nu">2</span>]</code></pre>
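+<p style="margin:12px 0; color:#444; max-width:900px;">The L4 gate from §2 (≥3 L2 findings across ≥3 arch_class values) could follow the same grouping pattern. This is a sketch only: <code>promote_to_l4</code> is a hypothetical name, it assumes the entry shape emitted by <code>collect_l2_candidates</code> (<code>arch</code>, <code>ep</code>, <code>flags</code>, <code>gain_pct</code>), and it assumes flag values are hashable. The flag names in the example are illustrative.</p>

```python
from collections import defaultdict


def promote_to_l4(l2s: list[dict]) -> list[dict]:
    """L4 gate sketch: same (ep, flags) at L2 on >=3 findings spanning >=3 arch classes."""
    groups = defaultdict(list)
    for c in l2s:
        # Group by EP and flag set only, so evidence from different arch classes pools.
        groups[(c["ep"], frozenset(c["flags"].items()))].append(c)
    rules = []
    for (ep, flags), evidence in groups.items():
        arch_classes = {c["arch"] for c in evidence}
        if len(evidence) >= 3 and len(arch_classes) >= 3:
            rules.append({"ep": ep, "flags": dict(flags),
                          "arch_classes": sorted(arch_classes), "evidence": evidence})
    return rules


# Illustrative entries only, not measured results.
example = [{"arch": a, "ep": "qnn", "flags": {"opset": 21}, "gain_pct": 7.0}
           for a in ("vit", "cnn", "bert")]
print([r["arch_classes"] for r in promote_to_l4(example)])
```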
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>Fix #5 — Thermal Reference Model (P2)</h2>
+
+<div class="card-grid card-grid-2">
+  <div class="card orange">
+    <div class="card-title">Concept</div>
+    <p>Run a fixed tiny model (100 iters) before each session — its latency proxies device thermal state.</p>
+    <p>Store <code>thermal_ref_p50_ms</code>; normalize gains by it for valid cross-run comparison.</p>
+  </div>
+  <div class="card orange">
+    <div class="card-title">When to Use</div>
+    <p><strong>Cool</strong> (≤ 1.05× cold): proceed.</p>
+    <p><strong>Hot</strong> (&gt; 1.3× cold): wait 60s, retry up to 3×, then flag "HOT_RUN".</p>
+    <p>HOT_RUN sessions excluded from L2 promotion.</p>
+  </div>
+</div>
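+<p style="margin:12px 0; color:#444; max-width:900px;">The cool/hot rule could be sketched as below. <code>thermal_gate</code> and <code>measure_ref_p50_ms</code> are hypothetical names (the plan refers to <code>thermal_calibrate(ep, device)</code>, whose signature is not specified here), and the "WARM" label for the 1.05× to 1.3× band is an assumption, since the cards above do not say how that band is handled.</p>

```python
import time
from typing import Callable

COOL_RATIO = 1.05  # at or below this multiple of the cold reference: cool
HOT_RATIO = 1.30   # above this multiple: hot


def thermal_gate(measure_ref_p50_ms: Callable[[], float], cold_p50_ms: float,
                 wait_s: float = 60.0, max_retries: int = 3,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Classify thermal state from a reference-model latency, waiting while hot.

    `measure_ref_p50_ms` is a hypothetical callable that runs the tiny reference
    model and returns its p50 latency in milliseconds.
    """
    retries = 0
    while True:
        ref = measure_ref_p50_ms()
        ratio = ref / cold_p50_ms
        if ratio <= HOT_RATIO:
            # Assumption: the unspecified middle band proceeds, labeled "WARM".
            state = "COOL" if ratio <= COOL_RATIO else "WARM"
            return {"state": state, "thermal_ref_p50_ms": ref, "retries": retries}
        if retries >= max_retries:
            # Still hot after the allowed retries: flag the session.
            return {"state": "HOT_RUN", "thermal_ref_p50_ms": ref, "retries": retries}
        sleep(wait_s)
        retries += 1


# Illustrative readings only; the no-op sleep keeps the demo instant.
_readings = iter([1.4, 1.35, 1.0])
print(thermal_gate(lambda: next(_readings), cold_p50_ms=1.0, sleep=lambda s: None))
```

+<p style="margin:12px 0; color:#444; max-width:900px;">Injecting <code>sleep</code> lets the retry logic be exercised without waiting; sessions returned as HOT_RUN would then be excluded from L2 promotion as described above.</p>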
+
+</div><!-- end fixes panel -->
+
+<!-- ═══════════════════ TAB: OVERVIEW (part B) ═══════════════════ -->
+<div class="tab-panel active" data-tab="overview">
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>3 · Full Self-Evolution Loop</h2>
+
+<svg width="900" height="190" viewBox="0 0 900 190" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="max-width:100%;display:block;margin-top:4px;">
+  <defs>
+    <marker id="arr-evo" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+    <marker id="arr-evo-up" markerWidth="8" markerHeight="6" refX="1" refY="3" orient="auto">
+      <polygon points="8 0,0 3,8 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+
+  <!-- Box 1: analyze_insight.py -->
+  <rect x="10" y="15" width="155" height="80" rx="8" fill="#e8eaf6" stroke="#9fa8da" stroke-width="1.5"/>
+  <text x="87" y="40" text-anchor="middle" font-size="11" font-weight="700" fill="#283593">analyze_insight.py</text>
+  <text x="87" y="56" text-anchor="middle" font-size="10" fill="#3949ab">graph fingerprint</text>
+  <text x="87" y="69" text-anchor="middle" font-size="10" fill="#3949ab">→ skip_set + priority</text>
+
+  <!-- Arrow 1→2 -->
+  <line x1="165" y1="55" x2="200" y2="55" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-evo)"/>
+
+  <!-- Box 2: catalog_*_sweep.py -->
+  <rect x="202" y="15" width="163" height="80" rx="8" fill="#e8f5e9" stroke="#a5d6a7" stroke-width="1.5"/>
+  <text x="283" y="40" text-anchor="middle" font-size="11" font-weight="700" fill="#1b5e20">catalog_*_sweep.py</text>
+  <text x="283" y="56" text-anchor="middle" font-size="10" fill="#2e7d32">Paired A/B · adaptive n</text>
+  <text x="283" y="69" text-anchor="middle" font-size="10" fill="#2e7d32">4–5 hyps (pruned)</text>
+
+  <!-- Arrow 2→3 -->
+  <line x1="365" y1="55" x2="400" y2="55" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-evo)"/>
+
+  <!-- Box 3: results.json -->
+  <rect x="402" y="15" width="152" height="80" rx="8" fill="#f5f7f8" stroke="#cfd8dc" stroke-width="1.5"/>
+  <text x="478" y="40" text-anchor="middle" font-size="11" font-weight="700" fill="#37474f">results.json</text>
+  <text x="478" y="56" text-anchor="middle" font-size="10" fill="#546e7a">verdict · CI · gain</text>
+  <text x="478" y="69" text-anchor="middle" font-size="10" fill="#546e7a">extra_optim stored</text>
+
+  <!-- Arrow 3→4 -->
+  <line x1="554" y1="55" x2="589" y2="55" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-evo)"/>
+
+  <!-- Box 4: promote_findings.py -->
+  <rect x="591" y="15" width="165" height="80" rx="8" fill="#e3f2fd" stroke="#90caf9" stroke-width="1.5"/>
+  <text x="673" y="40" text-anchor="middle" font-size="11" font-weight="700" fill="#0d47a1">promote_findings.py</text>
+  <text x="673" y="56" text-anchor="middle" font-size="10" fill="#1565c0">L1→L2→L3→L4</text>
+  <text x="673" y="69" text-anchor="middle" font-size="10" fill="#1565c0">auto-updates KB</text>
+
+  <!-- Arrow 4→5 -->
+  <line x1="756" y1="55" x2="791" y2="55" stroke="#7986cb" stroke-width="1.5" marker-end="url(#arr-evo)"/>
+
+  <!-- Box 5: ep_knowledge -->
+  <rect x="793" y="15" width="97" height="80" rx="8" fill="#f3e5f5" stroke="#ce93d8" stroke-width="1.5"/>
+  <text x="841" y="40" text-anchor="middle" font-size="10.5" font-weight="700" fill="#4a148c">ep_knowledge/</text>
+  <text x="841" y="54" text-anchor="middle" font-size="10" fill="#6a1b9a">*.json</text>
+  <text x="841" y="68" text-anchor="middle" font-size="10" fill="#6a1b9a">L3+ findings</text>
+  <text x="841" y="81" text-anchor="middle" font-size="10" fill="#6a1b9a">config_optimal</text>
+
+  <!-- Feedback arrow: bottom of box5 → bottom of box1 -->
+  <path d="M 841,95 L 841,148 L 87,148 L 87,95"
+        stroke="#7986cb" stroke-width="1.5" stroke-dasharray="5,3"
+        fill="none" marker-end="url(#arr-evo-up)"/>
+  <!-- Feedback label -->
+  <rect x="330" y="138" width="240" height="18" rx="4" fill="#ede7f6"/>
+  <text x="450" y="151" text-anchor="middle" font-size="10" fill="#4527a0" font-style="italic">skip_set feeds back → sweeps get shorter</text>
+</svg>
+
+<!-- ═══════════════════════════════════════════════════════════ -->
+<h2>4 · Implementation Plan</h2>
+
+<table>
+  <thead><tr><th>Priority</th><th>Component</th><th>File(s)</th><th>Status</th><th>Key change</th></tr></thead>
+  <tbody>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Paired A/B bench primitive</td>
+      <td><code>sweep_utils.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td><code>paired_ab_bench(baseline_path, hyp_path, n_pairs, iters)</code> → verdict + CI</td>
+    </tr>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Adaptive n_sessions</td>
+      <td><code>sweep_utils.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Stop when CI_lower &gt; 5% or CI_upper &lt; -2%, max 8 pairs</td>
+    </tr>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>promote_findings.py</td>
+      <td><code>promote_findings.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>L1→L2→L3→L4 gates, auto-write to ep_knowledge/*.json</td>
+    </tr>
+    <tr>
+      <td><span class="p0">P0</span></td>
+      <td>Champion config output</td>
+      <td>All 3 sweep scripts <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>After sweep: write <code>config_&lt;ep&gt;_&lt;device&gt;_optimal.json</code> from best hypothesis</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Architecture hypothesis pruning</td>
+      <td><code>analyze_insight.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td><code>get_hypothesis_skip_set(model_type, candidates)</code> → set of h_ids to skip</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Wire Paired A/B into QNN sweep</td>
+      <td><code>catalog_qnn_sweep.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Replace <code>bench_full()</code> with <code>paired_ab_bench()</code>; add <code>PAIRED_AB=True</code> flag</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Wire Paired A/B into GPU + CPU sweeps</td>
+      <td><code>catalog_gpu_sweep.py</code>, <code>catalog_cpu_sweep.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Same as QNN; import from sweep_utils</td>
+    </tr>
+    <tr>
+      <td><span class="p1">P1</span></td>
+      <td>Feature gaps issue log</td>
+      <td><code>docs/feature-gaps/issues.md</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Persistent log of all research-derived GitHub issues with context + date</td>
+    </tr>
+    <tr>
+      <td><span class="p2">P2</span></td>
+      <td>Thermal reference model</td>
+      <td><code>sweep_utils.py</code> <span class="pill pill-new">NEW</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td><code>thermal_calibrate(ep, device)</code> → thermal_ref_p50_ms, HOT_RUN detection</td>
+    </tr>
+    <tr>
+      <td><span class="p2">P2</span></td>
+      <td>L5 prediction in analyze_insight</td>
+      <td><code>analyze_insight.py</code> <span class="pill pill-mod">MOD</span></td>
+      <td><span class="cross">TODO</span></td>
+      <td>Read L4 KB rules → predict winner before sweep; emit champion config directly</td>
+    </tr>
+  </tbody>
+</table>
+
+<div class="note" style="max-width:900px; margin-top:16px;">
+  <strong>Key insight:</strong> Once Paired A/B + promote_findings.py exist, the system self-corrects — each sweep adds L2 candidates that reinforce or surface KB rules. skip_set grows richer, sweeps get shorter and more reliable, with no human in the loop between runs.
+</div>
+
+</div><!-- end overview part B -->
+
+<br><br>
+<div style="font-size:10px; color:#aaa; text-align:right;">Generated 2026-06-18 · research/autoconfig/docs/self-evolution-design.html</div>
+
+<script>
+  document.querySelectorAll('.tab-btn').forEach(function (btn) {
+    btn.addEventListener('click', function () {
+      var tab = btn.dataset.tab;
+      document.querySelectorAll('.tab-btn').forEach(function (b) {
+        b.classList.toggle('active', b === btn);
+      });
+      document.querySelectorAll('.tab-panel').forEach(function (p) {
+        p.classList.toggle('active', p.dataset.tab === tab);
+      });
+    });
+  });
+</script>
+
+</body>
+</html>
diff --git a/research/autoconfig/ep_device_knowledge/README.md b/research/autoconfig/ep_device_knowledge/README.md
new file mode 100644
index 000000000..51310f233
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/README.md
@@ -0,0 +1,56 @@
+# Per-EP Empirical Knowledge Base
+
+Each JSON file stores empirical findings for one EP/device combination.
+
+## ⚠️ CRITICAL EPISTEMICS
+
+These findings are **observational hypotheses, not ground truth**. They were derived
+from a small number of experiments on a few models (originally ConvNext-tiny only) on a
+single device (Snapdragon X Elite CRD). Every finding carries a `confidence` field and a `falsified_by`
+field. Before using a finding to prune a search space, check:
+
+1. **Is the model architecture similar?** (ConvNext ≠ BERT ≠ ResNet)
+2. **Is the hardware the same?** (X Elite CRD ≠ X Plus ≠ X1E-80-100)
+3. **Is the ORT/QNN SDK version the same?**
+4. **Is the mechanism confirmed?** (see `mechanism_confirmed` field)
+
+**Dialectical rule**: A finding that prunes a search dimension must be re-enabled
+if a new experiment on a new model/hardware contradicts it. Findings degrade over time
+as ORT and QNN SDK versions change.
+
+## ✅ Promotion checklist (before a finding becomes a pruning rule)
+
+These rules exist because of the **npu-001 / MobileViT failure**: a `+26.5%` opset-21
+"win" was recorded from a single sweep whose baseline (~12 ms) was silently inflated by
+DVFS/thermal throttling. A clean from-scratch rerun (2026-06-22) measured the baseline at
+~5.5 ms and the same config at +2.8% — fully within noise. The fake gain came from a
+**polluted baseline and a cross-run comparison**, the two least reliable things on a
+DVFS NPU. To avoid recording artifacts as findings, a result must clear ALL of these
+before its `confidence` is raised above `draft` / before it is used to prune search space:
+
+1. **Paired / same-thermal-window measurement.** Compare a config against its baseline
+   measured in the *same* thermal window (interleave A/B/A/B), and compare the
+   within-window **delta** — never an absolute baseline carried over from another run.
+2. **Clean baseline gate.** Reject the whole comparison if the baseline session-to-session
+   CV is high or contains a >2σ spike. A noisy baseline poisons every ratio derived from it.
+3. **Effect size > noise floor.** Require `gain% >= 2 × (session-to-session CV)` AND
+   non-overlapping session p50 ranges. A sub-5% median win on QNN NPU is noise by default.
+   (`catalog_sweep.py` now emits `best_gain_verdict`: `RELIABLE` /
+   `NEUTRAL_WITHIN_NOISE` / `UNRELIABLE_RANGES_OVERLAP` for exactly this.)
+4. **Independent reruns, then tiered confidence.** A single sweep is **L1 (draft)** only.
+   Promote to **L3** only after ≥N independent reruns (fresh build) agree in direction;
+   reach **L5** only after cross-time / cross-device stability. Only ≥L3 findings may be
+   used to prune the search space (see `docs/self-evolution-design.html`, L1–L5).
+5. **Track absolute-baseline drift.** Record each model's absolute baseline over time. If
+   the baseline shifts beyond threshold between runs, **invalidate dependent findings** and
+   re-measure — a baseline that moves 2× is itself a regression signal, not a constant.
+
+> One-line rule: on DVFS hardware, trust only **same-window paired deltas that exceed the
+> noise floor and reproduce across independent reruns** — never single-run absolute
+> baselines or cross-run ratios.
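+
+The effect-size gate in item 3 can be sketched as below. This is a minimal illustration,
+not the code in `catalog_sweep.py`: the function name `gain_verdict`, its parameters, and
+the exact way the noise floor is computed are assumptions made for this example. Only the
+verdict strings and the `2 × CV` multiplier come from the checklist above.
+
+```python
+from statistics import mean, median, stdev
+
+
+def gain_verdict(baseline_p50s, hyp_p50s, cv_mult=2.0):
+    """Classify a gain from per-session p50 latencies (ms) taken in one thermal window.
+
+    Hypothetical helper: the real sweep scripts may compute these values differently.
+    """
+    base_med = median(baseline_p50s)
+    # Positive gain means the hypothesis is faster than the baseline.
+    gain_pct = 100.0 * (base_med - median(hyp_p50s)) / base_med
+    # Noise floor: cv_mult times the baseline's session-to-session CV, in percent.
+    noise_floor_pct = 100.0 * cv_mult * stdev(baseline_p50s) / mean(baseline_p50s)
+    # Ranges are separated only if the slowest hypothesis session still beats
+    # the fastest baseline session.
+    ranges_separated = max(hyp_p50s) < min(baseline_p50s)
+
+    if gain_pct < noise_floor_pct:
+        verdict = "NEUTRAL_WITHIN_NOISE"
+    elif not ranges_separated:
+        verdict = "UNRELIABLE_RANGES_OVERLAP"
+    else:
+        verdict = "RELIABLE"
+    return verdict, gain_pct, noise_floor_pct
+```
+
+Both lists need at least two sessions, because `stdev` is undefined for a single value.
+Under this sketch, only a `RELIABLE` verdict would let a result move on to the
+independent-rerun step in item 4; the other two verdicts leave it at `draft`.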
+
+## Files
+- `qnn_npu.json` — QNN HTP (NPU) EP findings
+- `qnn_gpu.json` — QNN GPU EP findings
+- `dml.json`     — DirectML EP findings
+- `cpu_cpu.json` — CPU EP findings
diff --git a/research/autoconfig/ep_device_knowledge/_auto_promoted.json b/research/autoconfig/ep_device_knowledge/_auto_promoted.json
new file mode 100644
index 000000000..90147b3b0
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/_auto_promoted.json
@@ -0,0 +1,305 @@
+{
+  "_meta": {
+    "generated_by": "promote_findings.py",
+    "status": "draft",
+    "note": "Auto-generated promotion candidates. NOT curated KB. Apply the promotion checklist in ep_device_knowledge/README.md (paired A/B, clean baseline, effect-size > noise floor, independent reruns, baseline-drift check) before merging into <ep>_<device>.json.",
+    "gates": {
+      "L1_gain_pct": 5.0,
+      "L2_effect_size_cv_mult": 2.0,
+      "L3_min_models": 2,
+      "L4_min_arch_classes": 3
+    }
+  },
+  "L1_observed": [
+    {
+      "model_id": "apple/mobilevit-small",
+      "arch_class": "mobilevit",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h7",
+      "label": "opset 17 + bias_softmax_fusion",
+      "flags": "bias_softmax_fusion=True + opset=17",
+      "gain_pct": 12.34,
+      "noise_floor_pct": 81.13,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h6",
+      "label": "opset 17 + layer_norm_fusion",
+      "flags": "layer_norm_fusion=True + opset=17",
+      "gain_pct": 10.43,
+      "noise_floor_pct": 7.58,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h9",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "flags": "matmul_transpose_fusion=True + opset=17",
+      "gain_pct": 92.51,
+      "noise_floor_pct": 71.15,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h10",
+      "label": "opset 17 + attention + skip_layer_norm + layer_norm",
+      "flags": "attention_fusion=True + layer_norm_fusion=True + opset=17 + skip_layer_norm_fusion=True",
+      "gain_pct": 91.54,
+      "noise_floor_pct": 115.77,
+      "ranges_separated": true,
+      "level": 1
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h11",
+      "label": "opset 17 + nchwc_transformer (Conv-heavy models)",
+      "flags": "nchwc_transformer=True + opset=17",
+      "gain_pct": 82.79,
+      "noise_floor_pct": 223.59,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "flags": "opset=17 + transpose_optimizer=True",
+      "gain_pct": 84.46,
+      "noise_floor_pct": 61.25,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h13",
+      "label": "opset 17 + gelu_fusion explicit",
+      "flags": "gelu_fusion=True + opset=17",
+      "gain_pct": 88.89,
+      "noise_floor_pct": 254.22,
+      "ranges_separated": true,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h4",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "flags": "matmul_transpose_fusion=True + opset=17",
+      "gain_pct": 8.45,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h5",
+      "label": "opset 17 + attention_fusion",
+      "flags": "attention_fusion=True + opset=17",
+      "gain_pct": 10.55,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h6",
+      "label": "opset 17 + bias_softmax_fusion",
+      "flags": "bias_softmax_fusion=True + opset=17",
+      "gain_pct": 6.39,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h9",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "flags": "attention_fusion=True + matmul_transpose_fusion=True + opset=21",
+      "gain_pct": 12.85,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": true,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h11",
+      "label": "opset 17 + gelu_fusion explicit",
+      "flags": "gelu_fusion=True + opset=17",
+      "gain_pct": 13.86,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": true,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "flags": "opset=17 + transpose_optimizer=True",
+      "gain_pct": 16.67,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/rad-dino",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h11",
+      "label": "opset 17 + gelu_fusion explicit",
+      "flags": "gelu_fusion=True + opset=17",
+      "gain_pct": 2.0,
+      "noise_floor_pct": 1.72,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h11",
+      "label": "opset 17 + gelu_fusion explicit",
+      "flags": "gelu_fusion=True + opset=17",
+      "gain_pct": 6.4,
+      "noise_floor_pct": 14.6,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "flags": "opset=17 + transpose_optimizer=True",
+      "gain_pct": 8.38,
+      "noise_floor_pct": 14.6,
+      "ranges_separated": false,
+      "level": 1
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "npu",
+      "hyp_id": "h3",
+      "label": "opset 21 (tests npu-001 bypass)",
+      "flags": "opset=21",
+      "gain_pct": 24.14,
+      "noise_floor_pct": 81.45,
+      "ranges_separated": false,
+      "level": 1
+    }
+  ],
+  "L2_confirmed_single_model": [
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h6",
+      "label": "opset 17 + layer_norm_fusion",
+      "flags": "layer_norm_fusion=True + opset=17",
+      "gain_pct": 10.43,
+      "noise_floor_pct": 7.58,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h9",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "flags": "matmul_transpose_fusion=True + opset=17",
+      "gain_pct": 92.51,
+      "noise_floor_pct": 71.15,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/resnet-18",
+      "arch_class": "resnet",
+      "ep": "cpu",
+      "device": "cpu",
+      "hyp_id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "flags": "opset=17 + transpose_optimizer=True",
+      "gain_pct": 84.46,
+      "noise_floor_pct": 61.25,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "facebook/dinov2-small",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "flags": "opset=17 + transpose_optimizer=True",
+      "gain_pct": 16.67,
+      "noise_floor_pct": 14.83,
+      "ranges_separated": true,
+      "level": 2
+    },
+    {
+      "model_id": "microsoft/rad-dino",
+      "arch_class": "dinov2",
+      "ep": "qnn",
+      "device": "gpu",
+      "hyp_id": "h11",
+      "label": "opset 17 + gelu_fusion explicit",
+      "flags": "gelu_fusion=True + opset=17",
+      "gain_pct": 2.0,
+      "noise_floor_pct": 1.72,
+      "ranges_separated": true,
+      "level": 2
+    }
+  ],
+  "L3_generalized_arch_rule": [],
+  "L4_cross_cutting_rule": []
+}
diff --git a/research/autoconfig/ep_device_knowledge/cpu_cpu.json b/research/autoconfig/ep_device_knowledge/cpu_cpu.json
new file mode 100644
index 000000000..9ed9fe3f3
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/cpu_cpu.json
@@ -0,0 +1,401 @@
+{
+  "_meta": {
+    "ep": "cpu",
+    "device": "cpu",
+    "hardware": "Snapdragon X Elite CRD (Oryon CPU)",
+    "ort_version": "1.x (check winml version at experiment time)",
+    "model": "facebook/convnext-tiny-224 (original ablation; later findings such as cpu-007 and cpu-008 come from other models, see models_tested)",
+    "last_updated": "2026-06-18",
+    "epistemics_warning": "⚠️ The original ConvNext findings come from a rigorous 3-run ablation on 1 model; later catalog_cpu_sweep findings add a few more models. All data is still from 1 device. CPU behavior can differ significantly between x86 and ARM (Oryon). Check architecture before applying rules.",
+    "models_tested": [
+      "facebook/convnext-tiny-224 (original ablation)",
+      "microsoft/resnet-18 (catalog_cpu_sweep 2026-06-18)",
+      "apple/mobilevit-small (catalog_cpu_sweep 2026-06-18)",
+      "facebook/dinov2-small (catalog_cpu_sweep 2026-06-18)",
+      "deepset/roberta-base-squad2 (sweep in progress)",
+      "deepset/tinyroberta-squad2 (sweep in progress)",
+      "BAAI/bge-small-en-v1.5 (sweep in progress)",
+      "sentence-transformers/all-MiniLM-L6-v2 (sweep in progress)"
+    ]
+  },
+  "sweep_config": {
+    "results_dir": "catalog-cpu-sweep",
+    "quant": false,
+    "compile": false,
+    "screen": {
+      "warmup": 10,
+      "iters": 200,
+      "cv_max": 0.1,
+      "thermal_aware": false
+    },
+    "full": {
+      "warmup": 10,
+      "iters": 300,
+      "sessions": 3,
+      "cool_down_s": 2
+    },
+    "confirm_sessions": 2,
+    "min_improvement_pct": 5.0,
+    "effect_size_gate": false,
+    "effect_size_cv_mult": 2.0,
+    "accuracy_eval": false,
+    "eval_samples": 50,
+    "paired_ab_available": false,
+    "baseline_priority": [
+      "h0"
+    ],
+    "timeouts": {
+      "config_s": 300,
+      "build_s": 600,
+      "bench_s": 480,
+      "eval_s": 360,
+      "model_s": null
+    }
+  },
+  "hypotheses": [
+    {
+      "id": "h0",
+      "label": "baseline (opset 17, autoconf defaults)",
+      "opset": null,
+      "optim": null
+    },
+    {
+      "id": "h1",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "optim": null
+    },
+    {
+      "id": "h2",
+      "label": "opset 19 (cpu-001 risk - transformer test)",
+      "opset": 19,
+      "optim": null
+    },
+    {
+      "id": "h3",
+      "label": "opset 21 (cpu-001 risk - transformer test)",
+      "opset": 21,
+      "optim": null
+    },
+    {
+      "id": "h4",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "optim": {
+        "attention_fusion": true
+      }
+    },
+    {
+      "id": "h5",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "skip_layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h6",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h7",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "optim": {
+        "bias_softmax_fusion": true
+      }
+    },
+    {
+      "id": "h8",
+      "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)",
+      "opset": 17,
+      "optim": {
+        "matmul_add_fusion": true
+      },
+      "guard": {
+        "type": "skip_if_gemm",
+        "finding": "cpu-002"
+      }
+    },
+    {
+      "id": "h9",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    {
+      "id": "h10",
+      "label": "opset 17 + attention + skip_layer_norm + layer_norm",
+      "opset": 17,
+      "optim": {
+        "attention_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h11",
+      "label": "opset 17 + nchwc_transformer (Conv-heavy models)",
+      "opset": 17,
+      "optim": {
+        "nchwc_transformer": true
+      }
+    },
+    {
+      "id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "optim": {
+        "transpose_optimizer": true
+      }
+    },
+    {
+      "id": "h13",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "optim": {
+        "gelu_fusion": true
+      }
+    },
+    {
+      "id": "h14",
+      "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)",
+      "opset": null,
+      "optim": null,
+      "build_flags": [
+        "--no-analyze"
+      ]
+    }
+  ],
+  "models": [
+    {
+      "id": "microsoft/resnet-18",
+      "task": "image-classification",
+      "model_type": "resnet"
+    },
+    {
+      "id": "apple/mobilevit-small",
+      "task": "image-classification",
+      "model_type": "mobilevit"
+    },
+    {
+      "id": "facebook/dinov2-small",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "task": "text-classification",
+      "model_type": "distilbert"
+    },
+    {
+      "id": "sentence-transformers/all-MiniLM-L6-v2",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    },
+    {
+      "id": "deepset/roberta-base-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "microsoft/rad-dino",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "deepset/tinyroberta-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "BAAI/bge-small-en-v1.5",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    }
+  ],
+  "cross_checks": [
+    {
+      "id": "cpu-001",
+      "type": "regression_probe",
+      "hypotheses": [
+        "h2",
+        "h3"
+      ],
+      "gain_threshold_pct": -50.0,
+      "label": "opset 19/21 regression on Transpose-dense models"
+    }
+  ],
+  "findings": [
+    {
+      "id": "cpu-001",
+      "title": "opset 19+ causes 3-10x slowdown on models with Transpose-heavy graphs (ConvNext + DINOv2 confirmed) — NOT ConvNext-specific",
+      "observation": "ConvNext: opset17=43.7ms, opset19=160ms (3.7x), opset21=170ms (3.9x). DINOv2-small catalog_cpu_sweep 2026-06-18: baseline (auto-config)=112.6ms, opset19=1106ms (9.8x CPU001_REGRESSION), opset21=1095ms (9.7x). CRITICAL: cpu-001 is NOT ConvNext-specific. DINOv2 is a pure-ViT model with no ConvNext architecture overlap. ResNet-18: opset17=237ms, opset19=231ms (+2.4% neutral), opset21=226ms (+4.5% neutral) — ResNet NOT affected. MobileViT: opset19=-9.1%, opset21=-7.4% (mild slowdown, not catastrophic). Pattern: models with dense Transpose usage (DINOv2, ConvNext) hit cpu-001; models with sparse Transpose (ResNet) do not.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "Original hypothesis: ORT C++ Transpose Optimizer has a kMaxSupportedOpset gate (optimizer_api.h). If model opset > kMaxSupportedOpset, Transpose Optimizer is skipped silently. ConvNext has 42 Transpose nodes — without optimization, each executes as a full memory-layout copy. HOWEVER: the non-monotonic recovery at opset 22 (85ms vs 160-170ms at opset 19-21) is inconsistent with a simple binary gate. If the gate fires for opset > N, opset 22 should behave identically to opset 19. The actual mechanism is more complex. Additionally, ORT 1.24.x has kMaxSupportedOpset >= 23 confirmed (separate NHWC gate) — the Transpose Optimizer gate threshold may differ but is unverified.",
+      "action_for_autoconfig": "For CPU EP: default to opset 17. The empirical data (ConvNext and DINOv2 both regress 3-10x at opset 19 and 21; ResNet-18 is neutral) shows no model where opset 19+ clearly beats opset 17. Do NOT try opset 19+. The mechanism is uncertain but the practical conclusion is solid.",
+      "confidence": "high on empirical observation (consistent regressions at opset 19 and 21 on 2 models: ConvNext and DINOv2). Low on mechanism: the gate hypothesis does not fully explain the non-monotonic opset 22 partial recovery.",
+      "falsified_by": null,
+      "scope": "Models with dense Transpose graphs (ConvNext + DINOv2 confirmed). ResNet-18 is NOT affected. MobileViT mildly affected. BERT/RoBERTa unknown (sweep in progress 2026-06-18).",
+      "ort_kMaxSupportedOpset_by_version": {
+        "note": "These values are for the NHWC layout_transformation gate, NOT the Transpose Optimizer gate. The two constants may differ within the same ORT release.",
+        "v1.14.x": 18,
+        "v1.16.x": 19,
+        "v1.17.x": 20,
+        "v1.18.x": 21,
+        "v1.24.x": ">= 23 (confirmed for NHWC gate; Transpose Optimizer gate unknown)",
+        "main_HEAD": 26
+      },
+      "do_not_generalize_to": "QNN NPU EP or DML EP — kMaxSupportedOpset is a CPU-only ORT optimizer gate. These EPs have their own kernel dispatch unaffected by this.",
+      "validated_regressions": [
+        "facebook/convnext-tiny-224: opset19 3.7x, opset21 3.9x",
+        "facebook/dinov2-small: opset19 9.8x, opset21 9.7x (CPU001_REGRESSION)"
+      ],
+      "validated_neutral": [
+        "microsoft/resnet-18: opset19 +2.4% neutral, opset21 +4.5% neutral",
+        "apple/mobilevit-small: opset19 -9.1%, opset21 -7.4% (mild, not catastrophic)"
+      ],
+      "pending": "BERT/RoBERTa/MiniLM (sweep in progress 2026-06-18 — expected: neutral based on few Transpose nodes)",
+      "last_updated": "2026-06-18"
+    },
+    {
+      "id": "cpu-002",
+      "title": "matmul_add_fusion is a CONFIRMED REGRESSION on ConvNext CPU (+38ms, ~87%)",
+      "observation": "matmul_add_fusion: p50=81.7ms, runs=[63.0, 70.8, 111.2ms]. Baseline p50=43.7ms. All 3 runs far above highest baseline run (45.4ms).",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "ORT baseline already converts MatMul+Add→Gemm (37 Gemm in model.onnx). Applying matmul_add_fusion on top may create redundant kernel dispatch or conflicting operator mapping. Requires profiling to confirm.",
+      "action_for_autoconfig": "Do NOT apply matmul_add_fusion for CPU EP on models where baseline already uses Gemm (check model.onnx for Gemm nodes before applying this pass).",
+      "confidence": "high — 3 independent runs, all far above baseline; direction is unambiguous",
+      "falsified_by": null,
+      "scope": "ConvNext and models where ORT L2 baseline already fuses MatMul+Add→Gemm",
+      "do_not_generalize_to": "Models where baseline does NOT have Gemm (the pass may legitimately help there)"
+    },
+    {
+      "id": "cpu-003",
+      "title": "transpose_optimizer is neutral on ConvNext CPU (NOT +270ms as previously reported)",
+      "observation": "winml perf (warmup=10, iter=50): 42.3 / 52.3 / 41.8ms — overlapping baseline. Earlier winml eval-based measurement showed +270ms — this was a measurement artifact.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "winml eval includes HF preprocessing + model load + no warmup. The +270ms was preprocessing overhead, not inference regression. Pure inference measurement (winml perf) shows no effect.",
+      "action_for_autoconfig": "transpose_optimizer is neutral for ConvNext CPU — neither helpful nor harmful. Can be omitted from search space.",
+      "confidence": "high — measurement methodology confirmed; tool comparison validated",
+      "falsified_by": "Earlier winml eval measurement — RETRACTED. Use winml perf for all latency comparisons.",
+      "scope": "ConvNext CPU",
+      "measurement_lesson": "Always use winml perf (warmup=10, iter=50) for latency experiments. Never use winml eval latency to compare configs."
+    },
+    {
+      "id": "cpu-004",
+      "title": "nchwc_transformer is neutral on ConvNext CPU",
+      "observation": "nchwc: 43.4 / 48.0 / 44.7ms — overlapping baseline (42.5–45.4ms). No improvement.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "NCHWc SIMD layout benefits Conv-heavy models. ConvNext has 22 Conv nodes but 57.7% of kernel time is Gemm. The bottleneck is not memory layout but compute throughput — NCHWc doesn't help.",
+      "action_for_autoconfig": "nchwc_transformer is low-priority for ConvNext-class models. Profile first — if Conv% > 40%, try nchwc. If Gemm% > 50%, skip.",
+      "confidence": "medium — 3 runs, neutral result; mechanism is a hypothesis",
+      "falsified_by": null,
+      "scope": "ConvNext CPU (Gemm-dominated, not Conv-dominated)"
+    },
+    {
+      "id": "cpu-005",
+      "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU",
+      "observation": "No flag in 22-experiment ablation improved p50 beyond noise. Baseline p50=43.7ms is the floor.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "ORT L2 baseline already applies gelu_fusion and MatMul→Gemm before any user flags. The effective optimization space is narrow for ConvNext on CPU. Compute bottleneck (Gemm=57.7%) is not addressable via graph passes.",
+      "action_for_autoconfig": "For CPU EP on ConvNext-class models: skip optimization pass sweep. Go directly to quantization experiments.",
+      "confidence": "high — 22 experiments, no improvement found",
+      "falsified_by": null,
+      "scope": "ConvNext-class vision models on CPU",
+      "do_not_generalize_to": "BERT/Transformer models where attention_fusion + skip_layer_norm can significantly help"
+    },
+    {
+      "id": "cpu-006",
+      "title": "CPU EP opset 21 is 3.9x SLOWER — opposite of QNN NPU behavior",
+      "observation": "CPU opset 21: p50=170ms. CPU opset 17: p50=43.7ms. QNN NPU opset 21 (DINOv2): p50=26ms (~24% FASTER than opset 17 at 34ms). Note: the NPU and CPU experiments used DIFFERENT models (CPU=ConvNext, NPU=DINOv2) — the comparison is directional only, not quantitative.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "CPU regression from Transpose Optimizer bypass (see cpu-001 — mechanism uncertain). QNN NPU speedup from unknown cause (original Transpose bypass hypothesis invalidated; Transpose counts identical in opset17/21 graphs). The key insight is that CPU and QNN NPU respond oppositely to opset changes, regardless of the root cause.",
+      "action_for_autoconfig": "EP ISOLATION: CPU opset findings MUST NOT influence QNN NPU search space, and vice versa. Always validate per EP independently.",
+      "confidence": "high on empirical observation. Low on mechanism for both directions.",
+      "falsified_by": null,
+      "scope": "ALL — this is a meta-rule about EP isolation, not model-specific"
+    },
+    {
+      "id": "cpu-007",
+      "title": "matmul_transpose_fusion gives +92% speedup on ResNet-18 CPU EP (237ms -> 17.8ms)",
+      "confidence": "high — KEEP_CONFIRMED (all 5 sessions passed Phase C)",
+      "scope": "Conv-dominant models with MatMul+Transpose sequences. ResNet-18 confirmed. DINOv2 tested but ALL fusion flags regressed (cpu-001 interference). MobileViT partial.",
+      "observation": "catalog_cpu_sweep 2026-06-18: ResNet-18 h9 (opset17+matmul_transpose_fusion): median_p50=17.797ms vs baseline 237.472ms = +92.51% KEEP_CONFIRMED. Also: h12 (transpose_optimizer) +84.46% KEEP_CONFIRMED, h13 (gelu_fusion) +88.89% KEEP_CONFIRMED, h10 (bundle) +91.54% KEEP_CONFIRMED, h6 (layer_norm_fusion) +10.43% KEEP_CONFIRMED. All 5 phase-C sessions passed.",
+      "mechanism_hypothesis": "ResNet-18 on CPU at default config has 237ms latency (extremely slow for a tiny model). matmul_transpose_fusion folds MatMul+Transpose into a transposed GEMM call, enabling BLAS-level fused execution. ORT CPU provider has a highly optimized transposed-matmul path. The baseline 237ms suggests the default config exports with a suboptimal graph (possibly unfused MatMul+Transpose pairs that prevent BLAS dispatch).",
+      "mechanism_confirmed": false,
+      "baseline_note": "ResNet-18 baseline=237ms on CPU is extremely slow (17.8ms after optimization = 13x speedup). This suggests the default auto-config for ResNet-18 on CPU is severely suboptimal. The baseline uses auto-config which may not be correctly detecting the model architecture for CPU optimization.",
+      "affected_models": [
+        "microsoft/resnet-18 (+92.51% KEEP_CONFIRMED)"
+      ],
+      "autoconfig_action": "For ResNet-18 class models on CPU: apply matmul_transpose_fusion (h9) + transpose_optimizer (h12) + gelu_fusion (h13) bundle. Test h10 bundle for single combined build.",
+      "added": "2026-06-18",
+      "source": "catalog_cpu_sweep.py h0-h13 sweep"
+    },
+    {
+      "id": "cpu-008",
+      "title": "layer_norm_fusion causes catastrophic -997% regression on MobileViT CPU EP (73ms -> 803ms)",
+      "confidence": "high — 3-session consistent",
+      "scope": "CNN-ViT hybrid models where layer_norm_fusion mismatches the LN implementation. MobileViT confirmed. Pure transformer (BERT/ViT) expected safe.",
+      "observation": "catalog_cpu_sweep 2026-06-18: MobileViT h6 (opset17+layer_norm_fusion): median_p50=803.217ms vs baseline 73.166ms = -997.8% DISCARD. 3-session consistent. For comparison: bias_softmax_fusion (h7) = 64.137ms (+12.34% MARGINAL_UNCONFIRMED). layer_norm_fusion, skip_layer_norm_fusion, attention_fusion, matmul_transpose_fusion all severely regress MobileViT on CPU.",
+      "mechanism_hypothesis": "MobileViT uses a hybrid CNN-ViT architecture where LayerNorm is placed after Conv2D outputs. layer_norm_fusion expects pure transformer LN sequences (MLP-style). Fusing the wrong LN pattern creates a combined op that the CPU runtime cannot dispatch to an optimized kernel path, forcing fallback to element-wise operations.",
+      "mechanism_confirmed": false,
+      "affected_models": [
+        "apple/mobilevit-small (-997% layer_norm, -165% matmul_transpose, -164% attention bundle)"
+      ],
+      "autoconfig_action": "Block layer_norm_fusion for CNN-ViT hybrid models. Also block matmul_transpose_fusion and attention_fusion for MobileViT-class models on CPU. analyze_insight.py should detect CNN-ViT hybrid architecture and skip these fusions.",
+      "added": "2026-06-18",
+      "source": "catalog_cpu_sweep.py h0-h13 sweep"
+    },
+    {
+      "id": "cpu-009",
+      "title": "cpu-001 opset regression fires on DINOv2 pure-ViT: ~10x slowdown at opset19/21 on CPU EP",
+      "confidence": "high — CPU001_REGRESSION verdict confirmed (pattern matches ConvNext)",
+      "scope": "Pure-ViT models with dense Transpose graphs on CPU EP. DINOv2-small confirmed. BERT/NLP expected neutral (sparse Transpose). ResNet-18 confirmed neutral.",
+      "observation": "catalog_cpu_sweep 2026-06-18: DINOv2-small h2 (opset19): 1106ms vs baseline 112ms (-882% CPU001_REGRESSION). h3 (opset21): 1095ms (-873%). h4 attention_fusion: 1083ms (-862%). h7 bias_softmax_fusion: 1121ms (-896%). The baseline (auto-config, opset not forced) = 112ms. Any forced opset or attention-style fusion causes catastrophic regression. Also: h1 opset17-explicit = 762ms (-577%) — even forcing opset17 explicitly regresses DINOv2 vs auto-config baseline.",
+      "mechanism_note": "DINOv2 has 169 Reshape nodes in opset21 vs 121 in opset17, plus a dense Transpose population (49 nodes). The hypothesized cpu-001 mechanism (Transpose Optimizer bypass, still uncertain per cpu-006) appears to apply here as strongly as on ConvNext. The auto-config baseline (h0) at 112ms is already the optimized path; every tested deviation from auto-config triggered a regression.",
+      "autoconfig_action": "For DINOv2/ViT-class on CPU EP: use auto-config default opset ONLY. Do not force any opset. Do not apply attention_fusion or bias_softmax_fusion (all regress DINOv2 on CPU). CPU EP for DINOv2 is constrained to baseline config only.",
+      "added": "2026-06-18",
+      "source": "catalog_cpu_sweep.py h0-h13 sweep"
+    }
+  ],
+  "search_space_rules": {
+    "opset": {
+      "recommended_order": [
+        17
+      ],
+      "skip": [
+        "19, 20, 21, 22 — kMaxSupportedOpset regression (cpu-001). Only safe to try if ORT version's kMaxSupportedOpset >= target."
+      ],
+      "dialectical_note": "⚠️ This rule is ORT-version dependent. Check kMaxSupportedOpset for the shipping ORT build before skipping higher opsets."
+    },
+    "quantization": {
+      "recommended": "w8a8 (CPU benefits most from small model size)",
+      "dialectical_note": "⚠️ W8A8 on CPU not yet validated for ConvNext. General guidance — run accuracy gate."
+    },
+    "compile": {
+      "always_run": false,
+      "skip": true,
+      "dialectical_note": "⚠️ winml compile targets QNN EPContext. Not applicable to CPU EP."
+    },
+    "graph_passes": {
+      "recommended": "autoconf defaults only",
+      "skip": [
+        "matmul_add_fusion if model already has Gemm (cpu-002)",
+        "nchwc_transformer if Gemm% > 50% in profile (cpu-004)"
+      ],
+      "dialectical_note": "⚠️ Skip rules are Gemm-bottleneck specific. Conv-heavy models may still benefit from nchwc_transformer."
+    }
+  },
+  "meta_lessons": {
+    "measurement_discipline": "Always use winml perf (warmup=10, iter=50) for latency. Never use winml eval latency. See cpu-003.",
+    "ep_isolation": "CPU findings (especially opset regression) DO NOT transfer to QNN NPU or DML. Each EP has its own optimizer path. See cpu-006.",
+    "baseline_check": "Before applying any fusion flag, check model.onnx for existing fused ops. If Gemm already present, matmul_add_fusion is likely a no-op or regression."
+  }
+}
diff --git a/research/autoconfig/ep_device_knowledge/dml_gpu.json b/research/autoconfig/ep_device_knowledge/dml_gpu.json
new file mode 100644
index 000000000..21f3361f6
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/dml_gpu.json
@@ -0,0 +1,271 @@
+{
+  "_meta": {
+    "ep": "dml",
+    "device": "gpu",
+    "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / DirectML via D3D12)",
+    "ort_version": "1.x with onnxruntime-directml package",
+    "model": "facebook/convnext-tiny-224 (ALL findings from this model only)",
+    "last_updated": "2026-06-17",
+    "epistemics_warning": "⚠️ DML experiments required swapping onnxruntime-directml for onnxruntime (Python package conflict). Results reflect DML EP behavior via winml's DML DLL, not the Python onnxruntime-directml package directly. Re-validate if package setup changes."
+  },
+  "sweep_config": {
+    "results_dir": "catalog-dml-sweep",
+    "quant": false,
+    "compile": false,
+    "screen": {
+      "warmup": 20,
+      "iters": 200,
+      "cv_max": 0.15,
+      "thermal_aware": false
+    },
+    "full": {
+      "warmup": 20,
+      "iters": 300,
+      "sessions": 3,
+      "cool_down_s": 5
+    },
+    "confirm_sessions": 2,
+    "min_improvement_pct": 5.0,
+    "effect_size_gate": false,
+    "effect_size_cv_mult": 2.0,
+    "accuracy_eval": false,
+    "eval_samples": 50,
+    "paired_ab_available": false,
+    "baseline_priority": [
+      "h0"
+    ],
+    "timeouts": {
+      "config_s": 300,
+      "build_s": 600,
+      "bench_s": 480,
+      "eval_s": 360,
+      "model_s": null
+    }
+  },
+  "hypotheses": [
+    {
+      "id": "h0",
+      "label": "baseline FP32 (auto-config, no compile)",
+      "opset": null,
+      "optim": null
+    },
+    {
+      "id": "h1",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "optim": null
+    },
+    {
+      "id": "h2",
+      "label": "opset 19",
+      "opset": 19,
+      "optim": null
+    },
+    {
+      "id": "h3",
+      "label": "opset 21 (tests dml-005)",
+      "opset": 21,
+      "optim": null
+    },
+    {
+      "id": "h4",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "optim": {
+        "transpose_optimizer": true
+      }
+    },
+    {
+      "id": "h5",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h6",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "skip_layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h7",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    {
+      "id": "h8",
+      "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)",
+      "opset": null,
+      "optim": null,
+      "build_flags": [
+        "--no-analyze"
+      ]
+    }
+  ],
+  "models": [
+    {
+      "id": "microsoft/resnet-18",
+      "task": "image-classification",
+      "model_type": "resnet"
+    },
+    {
+      "id": "google/vit-base-patch16-224",
+      "task": "image-classification",
+      "model_type": "vit"
+    },
+    {
+      "id": "apple/mobilevit-small",
+      "task": "image-classification",
+      "model_type": "mobilevit"
+    },
+    {
+      "id": "facebook/dinov2-small",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "hustvl/yolos-small",
+      "task": "object-detection",
+      "model_type": "yolos"
+    },
+    {
+      "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "task": "text-classification",
+      "model_type": "distilbert"
+    },
+    {
+      "id": "sentence-transformers/all-MiniLM-L6-v2",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    },
+    {
+      "id": "deepset/roberta-base-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "microsoft/rad-dino",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "deepset/tinyroberta-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "BAAI/bge-small-en-v1.5",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    }
+  ],
+  "cross_checks": [
+    {
+      "id": "dml-005",
+      "type": "opset_bypass",
+      "candidate": "h3",
+      "stress_ref": "h1",
+      "baseline_ref": "h0"
+    }
+  ],
+  "findings": [
+    {
+      "id": "dml-001",
+      "title": "DML FP32 is more stable than QNN GPU FP32 — p50 difference is within noise",
+      "observation": "DML FP32: p50=16.9ms, p90=17.7ms, std=0.52. QNN GPU FP32: p50=17.7ms, p90=19.7ms, std=0.97. p50 diff = 0.8ms = 0.82σ of QNN GPU measurement — distributions OVERLAP. NOT a separable performance difference. DML is meaningfully more stable (std 0.52 vs 0.97, CV 3% vs 5.5%).",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "DML JIT-compiles HLSL shaders at model load time — shader compilation done once, producing stable execution. QNN GPU EP does graph partitioning at each session creation — more overhead and jitter.",
+      "action_for_autoconfig": "CORRECTED: Do NOT claim DML is faster than QNN GPU based on this data — the 0.8ms difference is within noise. DML IS more stable (lower CV). Prefer DML for lower tail latency (p90) and variance. p50 advantage is unconfirmed.",
+      "confidence": "low on p50 speedup (not statistically separable). Medium on stability advantage (std 0.52 vs 0.97 is real difference even if p50 overlaps).",
+      "falsified_by": "Statistical analysis: 0.8ms diff < 1σ of GPU measurement. Removed from 'DML is faster' claims.",
+      "scope": "Adreno X1-85, ConvNext-class models, 3-run comparison (insufficient for definitive p50 ranking)",
+      "do_not_generalize_to": "NVIDIA/Intel GPUs (QNN GPU not available there anyway)"
+    },
+    {
+      "id": "dml-002",
+      "title": "NHWC transformer increases latency variance on DML — p50 is neutral or marginally better",
+      "observation": "DML NHWC: p50=16.5ms (-0.4ms vs baseline 16.9ms), p90=21.0ms (+19% vs baseline 17.7ms), std=1.89 (3.6x worse than FP32 baseline 0.52). NOTE: p50 is marginally BETTER with NHWC, not worse. The regression is in tail latency and variance.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "D3D12 on Adreno X1-85 handles tensor layouts internally via HLSL shaders. Adding explicit ORT NHWC Transposes does not improve memory alignment for DML but adds dispatch overhead that occasionally causes scheduling jitter, inflating p90 and std.",
+      "action_for_autoconfig": "Do NOT apply nhwc-transformer for DML EP if tail latency stability matters. p50 may be marginally better but p90 is 19% worse and std is 3.6x worse. For applications sensitive to worst-case latency, NHWC is harmful.",
+      "confidence": "low — single run comparison, different baselines (run_count unspecified). Direction for variance is clear; p50 benefit is marginal and unreliable.",
+      "falsified_by": null,
+      "scope": "Adreno X1-85 + DML, ConvNext",
+      "do_not_generalize_to": "NVIDIA GPUs (NHWC may help with CUDNN)"
+    },
+    {
+      "id": "dml-003",
+      "title": "DML FP16 gives ~1.4x speedup with no DVFS bimodality (unlike QNN GPU FP16)",
+      "observation": "DML FP16 (via Python hack, not official CLI): p50=11.8ms, p90=12.8ms, std=0.66. Clean unimodal distribution.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "DML HLSL shader compilation locks in FP16 compute paths at load time — no dynamic voltage/frequency switching surprises. QNN GPU FP16 showed DVFS bimodal distribution (some runs in high-power state, some in low-power state).",
+      "action_for_autoconfig": "FP16 is the primary optimization lever for DML. Unblock via #867 (--precision fp16 flag).",
+      "confidence": "low — experiment used Python hack (not official winml CLI). Mark as SKIPPED/CLI-gap until #867 ships.",
+      "falsified_by": null,
+      "scope": "Adreno X1-85 + DML",
+      "tracked_issue": "#867",
+      "cli_gap": true,
+      "cli_gap_note": "⚠️ This finding was produced via a Python workaround, not winml CLI. Cannot be reproduced with winml build today. Blocked on #867."
+    },
+    {
+      "id": "dml-004",
+      "title": "winml analyze returns 0/0/0/251 (all Unknown) for DML EP — no rule data",
+      "observation": "winml analyze --ep dml outputs: supported=0, partial=0, unsupported=0, unknown=251.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "DML EP supports all standard ONNX ops by design (D3D12 universal op coverage). winml analyze has no DML-specific rule data file. This is a cosmetic gap — DML actually runs all ops natively.",
+      "action_for_autoconfig": "Do not use winml analyze output to prune search space for DML. Assume all ops supported.",
+      "confidence": "high — confirmed by DML running all 251 ops with no CPU fallback",
+      "falsified_by": null,
+      "scope": "DML EP (all models)",
+      "tracked_issue": "not filed — cosmetic gap, low priority"
+    },
+    {
+      "id": "dml-005",
+      "title": "opset 21 on DML not yet validated",
+      "observation": "opset 21 sweep only run on QNN NPU. DML behavior with opset 21 is unknown.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "DML uses D3D12 dispatch — different from QNN EP kernel registry. opset 21 speedup on QNN NPU may not apply.",
+      "action_for_autoconfig": "Include opset 21 in DML search sweep. No prior data — must run experiment.",
+      "confidence": "low — no data",
+      "falsified_by": null,
+      "scope": "UNKNOWN — needs experiment"
+    }
+  ],
+  "search_space_rules": {
+    "opset": {
+      "recommended_order": [
+        17,
+        21
+      ],
+      "rationale": "dml-005: unknown. Include both in sweep.",
+      "dialectical_note": "⚠️ No data on DML + opset 21. Do not assume NPU behavior transfers."
+    },
+    "quantization": {
+      "recommended": "fp16 (when #867 ships)",
+      "skip": [
+        "w8a8",
+        "w8a16 — quantization rarely helps on GPU via DML"
+      ],
+      "dialectical_note": "⚠️ Quantization skip is based on general DML behavior. Some models with large weights may benefit from W8A16 even on DML. Test empirically."
+    },
+    "compile": {
+      "always_run": false,
+      "skip": true,
+      "dialectical_note": "⚠️ DML uses HLSL, not QNN binary compilation. winml compile targets QNN EPContext only. Not applicable to DML."
+    },
+    "graph_passes": {
+      "recommended": "autoconf defaults only",
+      "skip": [
+        "nhwc-transformer (dml-002)"
+      ],
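+      "dialectical_note": "⚠️ Same direction as QNN GPU (gpu-002): NHWC hurts on Adreno, though on DML the penalty shows up in tail latency and variance rather than p50 (dml-002). NVIDIA/Intel may differ."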
+    }
+  }
+}
diff --git a/research/autoconfig/ep_device_knowledge/qnn_gpu.json b/research/autoconfig/ep_device_knowledge/qnn_gpu.json
new file mode 100644
index 000000000..697390ce7
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/qnn_gpu.json
@@ -0,0 +1,366 @@
+{
+  "_meta": {
+    "ep": "qnn",
+    "device": "gpu",
+    "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / QNN GPU EP)",
+    "ort_version": "1.x (check winml version at experiment time)",
+    "qnn_sdk_version": "unknown — check QnnSystem.dll version",
+    "model": "8 models (catalog sweep 2026-06-18)",
+    "last_updated": "2026-06-18",
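+    "epistemics_warning": "⚠️ All findings are hypotheses derived from a small model set on 1 device: the 2026-06-18 catalog sweep covers the 8 models listed in models_tested, while the ConvNext findings (e.g. gpu-001) rest on that single model. Confidence levels reflect mechanism understanding, not universal applicability. GPU EP behavior varies significantly by model architecture and Adreno driver version.",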
+    "models_tested": [
+      "facebook/dinov2-small",
+      "microsoft/resnet-18",
+      "apple/mobilevit-small",
+      "deepset/roberta-base-squad2",
+      "deepset/tinyroberta-squad2",
+      "BAAI/bge-small-en-v1.5",
+      "sentence-transformers/all-MiniLM-L6-v2",
+      "microsoft/rad-dino"
+    ]
+  },
+  "sweep_config": {
+    "results_dir": "catalog-gpu-sweep",
+    "quant": false,
+    "compile": false,
+    "screen": {
+      "warmup": 20,
+      "iters": 200,
+      "cv_max": 0.15,
+      "thermal_aware": false
+    },
+    "full": {
+      "warmup": 20,
+      "iters": 300,
+      "sessions": 3,
+      "cool_down_s": 5
+    },
+    "confirm_sessions": 2,
+    "min_improvement_pct": 5.0,
+    "effect_size_gate": false,
+    "effect_size_cv_mult": 2.0,
+    "accuracy_eval": false,
+    "eval_samples": 50,
+    "paired_ab_available": false,
+    "baseline_priority": [
+      "h0"
+    ],
+    "timeouts": {
+      "config_s": 300,
+      "build_s": 600,
+      "bench_s": 480,
+      "eval_s": 360,
+      "model_s": null
+    }
+  },
+  "hypotheses": [
+    {
+      "id": "h0",
+      "label": "baseline FP32 (no quant, no compile)",
+      "opset": null,
+      "optim": null
+    },
+    {
+      "id": "h1",
+      "label": "opset 17 explicit",
+      "opset": 17,
+      "optim": null
+    },
+    {
+      "id": "h2",
+      "label": "opset 19",
+      "opset": 19,
+      "optim": null
+    },
+    {
+      "id": "h3",
+      "label": "opset 21 (tests gpu-006)",
+      "opset": 21,
+      "optim": null
+    },
+    {
+      "id": "h4",
+      "label": "opset 17 + matmul_transpose_fusion",
+      "opset": 17,
+      "optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    {
+      "id": "h5",
+      "label": "opset 17 + attention_fusion",
+      "opset": 17,
+      "optim": {
+        "attention_fusion": true
+      }
+    },
+    {
+      "id": "h6",
+      "label": "opset 17 + bias_softmax_fusion",
+      "opset": 17,
+      "optim": {
+        "bias_softmax_fusion": true
+      }
+    },
+    {
+      "id": "h7",
+      "label": "opset 17 + layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h8",
+      "label": "opset 17 + skip_layer_norm_fusion",
+      "opset": 17,
+      "optim": {
+        "skip_layer_norm_fusion": true
+      }
+    },
+    {
+      "id": "h9",
+      "label": "opset 21 + matmul_transpose + attention_fusion",
+      "opset": 21,
+      "optim": {
+        "matmul_transpose_fusion": true,
+        "attention_fusion": true
+      }
+    },
+    {
+      "id": "h10",
+      "label": "opset 17 + ln + skip_ln + matmul_transpose",
+      "opset": 17,
+      "optim": {
+        "layer_norm_fusion": true,
+        "skip_layer_norm_fusion": true,
+        "matmul_transpose_fusion": true
+      }
+    },
+    {
+      "id": "h11",
+      "label": "opset 17 + gelu_fusion explicit",
+      "opset": 17,
+      "optim": {
+        "gelu_fusion": true
+      }
+    },
+    {
+      "id": "h12",
+      "label": "opset 17 + transpose_optimizer",
+      "opset": 17,
+      "optim": {
+        "transpose_optimizer": true
+      }
+    },
+    {
+      "id": "h13",
+      "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)",
+      "opset": null,
+      "optim": null,
+      "build_flags": [
+        "--no-analyze"
+      ]
+    }
+  ],
+  "models": [
+    {
+      "id": "microsoft/resnet-18",
+      "task": "image-classification",
+      "model_type": "resnet"
+    },
+    {
+      "id": "google/vit-base-patch16-224",
+      "task": "image-classification",
+      "model_type": "vit"
+    },
+    {
+      "id": "apple/mobilevit-small",
+      "task": "image-classification",
+      "model_type": "mobilevit"
+    },
+    {
+      "id": "facebook/dinov2-small",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "hustvl/yolos-small",
+      "task": "object-detection",
+      "model_type": "yolos"
+    },
+    {
+      "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "task": "text-classification",
+      "model_type": "distilbert"
+    },
+    {
+      "id": "sentence-transformers/all-MiniLM-L6-v2",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    },
+    {
+      "id": "deepset/roberta-base-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "microsoft/rad-dino",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "deepset/tinyroberta-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "BAAI/bge-small-en-v1.5",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    }
+  ],
+  "cross_checks": [
+    {
+      "id": "gpu-006",
+      "type": "opset_bypass",
+      "candidate": "h3",
+      "stress_ref": "h1",
+      "baseline_ref": "h0"
+    }
+  ],
+  "findings": [
+    {
+      "id": "gpu-001",
+      "title": "FP32 baseline is already optimal for ConvNext on QNN GPU — no optimization pass helps",
+      "observation": "Full sweep of 11 passes/combinations on ConvNext QNN GPU: all returned 0% node reduction or worse latency. Baseline p50=17.7ms, p90=19.7ms, std=0.97.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "251/0/0/0 (all ops native on GPU, zero CPU fallback). ConvNext linear layers use Reshape→MatMul→Reshape, not bare MatMul+Add — so MatMulAdd→Conv2D rewrites don't match. autoconf (gelu_fusion + matmul_add_fusion) already applied all applicable transforms.",
+      "action_for_autoconfig": "Skip all graph optimization experiments for QNN GPU on ConvNext-class models. Use FP32 baseline directly.",
+      "confidence": "high — confirmed by 0% node delta on all rewrites + 251/0/0/0 analyze output",
+      "falsified_by": null,
+      "scope": "ConvNext-class models (Reshape→MatMul→Reshape pattern)",
+      "do_not_generalize_to": "Transformer models with bare MatMul+Add (those may benefit from rewrites)"
+    },
+    {
+      "id": "gpu-002",
+      "title": "NHWC transformer hurts QNN GPU on Adreno X1-85 (~10% worse)",
+      "observation": "NHWC transformer: p50=19.5ms (+10%), p90=23.8ms (+21%), std=3.43 (3.5x worse). Consistent across multiple runs.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "Adreno X1-85 + QNN GPU EP does not benefit from explicit NHWC layout transforms. QNN GPU EP handles layout internally; forcing NHWC via ORT creates additional Reshape overhead without the memory alignment benefit.",
+      "action_for_autoconfig": "Do NOT apply nhwc-transformer for QNN GPU EP.",
+      "confidence": "medium — observed consistently; mechanism hypothesis, not confirmed",
+      "falsified_by": null,
+      "scope": "Adreno X1-85 + QNN GPU EP",
+      "do_not_generalize_to": "Non-Adreno GPUs (NVIDIA, Intel Arc) — NHWC may help there"
+    },
+    {
+      "id": "gpu-003",
+      "title": "winml compile appears to hurt QNN GPU (~34% regression) — SINGLE EXPERIMENT, LOW CONFIDENCE",
+      "observation": "FP32 + compile: p50=23.7ms vs baseline 17.7ms (+34%). Single experiment only.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "QNN GPU EP compile (EPContext) is designed for NPU (HTP). On GPU EP, the compilation path may force a different dispatch mode that bypasses the optimized GPU shader path. QNN SDK likely has a GPU-specific compilation flow that winml compile doesn't trigger correctly.",
+      "action_for_autoconfig": "AVOID winml compile for QNN GPU EP. Direction (regression) is consistent with mechanism hypothesis and 34% is a large signal, but this is a single experiment. Until replicated, treat as likely harmful but not confirmed.",
+      "confidence": "low — single experiment. 34% gap is above DVFS noise level (CV ~0.05 → noise ~1ms, gap is 6ms). Direction probably real but magnitude uncertain.",
+      "falsified_by": null,
+      "scope": "QNN GPU EP",
+      "do_not_generalize_to": "QNN NPU EP (compile always helps NPU)"
+    },
+    {
+      "id": "gpu-004",
+      "title": "W8A8 QDQ hangs indefinitely on QNN GPU EP",
+      "observation": "Passing a W8A8 QDQ-annotated ONNX to QNN GPU EP causes infinite hang. winml build's _patch_device() sets quant=null for GPU, preventing this in normal user path.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "QNN SDK's GPU EP does not support QDQ-quantized graphs. This is a known QNN SDK limitation. winml build already protects against this via _patch_device().",
+      "action_for_autoconfig": "Skip ALL quantization experiments for QNN GPU EP. Do not even attempt W8A8 or W8A16.",
+      "confidence": "high — hang confirmed; protection mechanism in _patch_device() confirmed by code inspection",
+      "falsified_by": null,
+      "scope": "QNN GPU EP (QNN SDK limitation)",
+      "tracked_issue": "#868 (fast-fail enhancement)"
+    },
+    {
+      "id": "gpu-005",
+      "title": "gelu_fusion improves latency STABILITY (p90/std) on QNN GPU, not p50",
+      "observation": "Raw export (287 nodes, unfused Gelu): p50=17.4ms, p90=29.2ms, std=5.90. Autoconf (251 nodes, fused Gelu): p50=17.7ms, p90=19.7ms, std=0.97. p50 nearly identical, p90 ~33% lower (29.2ms -> 19.7ms), std ~6x lower (5.90 -> 0.97).",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "5 separate GPU kernel dispatches (Mul→Div→Erf→Mul→Add) for unfused GELU create scheduling jitter. Single Gelu kernel eliminates dispatch overhead → dramatically lower tail latency.",
+      "action_for_autoconfig": "Always apply gelu_fusion for QNN GPU (stability benefit). Do not expect p50 improvement.",
+      "confidence": "high — mechanism is well-understood (GPU kernel dispatch overhead)",
+      "falsified_by": null,
+      "scope": "Any model with GELU activations on QNN GPU"
+    },
+    {
+      "id": "gpu-006",
+      "title": "opset 21 on QNN GPU is neutral-to-negative — CONFIRMED across 7 models",
+      "observation": "catalog_gpu_sweep.py full sweep 2026-06-18 (8 models, 13 hypotheses, 3x300 iters + Phase C confirmation): opset21 gains: DINOv2-small +1.22% (MARGINAL), ResNet-18 +3.27% (MARGINAL), MobileViT -3.42% (DISCARD), roberta-squad2 -1.14% (DISCARD), tinyroberta -2.68% (DISCARD), rad-dino -2.63% (DISCARD), bge-small +0.16% (DISCARD). Range: -5.42% to +3.27%. No model shows meaningful opset21 gain on GPU. Opposite of QNN NPU behavior (DINOv2 +30.6% on NPU).",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "QNN GPU EP does not have architecture-specific optimizations that benefit from opset21 graph differences (unlike NPU which shows DINOv2-specific speedup). GPU shader compilation is independent of ONNX opset semantics.",
+      "action_for_autoconfig": "Do NOT try opset 19 or opset 21 for QNN GPU EP. Default to opset 17. Rule is now confirmed across 7 models.",
+      "confidence": "high — confirmed across 7 diverse architectures",
+      "falsified_by": null,
+      "scope": "QNN GPU EP on Adreno X1-85; the 7 models listed in the observation (2026-06-18 catalog sweep)",
+      "last_updated": "2026-06-18"
+    },
+    {
+      "id": "gpu-007",
+      "title": "transpose_optimizer gives +8-17% on Conv-dominant and ViT models on QNN GPU (DINOv2 KEEP_CONFIRMED; ResNet-18 MARGINAL_UNCONFIRMED)",
+      "confidence": "high",
+      "scope": "Conv-dominant (ResNet) and ViT-class (DINOv2) models on QNN GPU. Likely architecture-general for models with Transpose-heavy graphs.",
+      "observation": "catalog_gpu_sweep.py sweep 2026-06-18: h12 (transpose_optimizer) KEEP_CONFIRMED. DINOv2-small: p50 26.372ms -> 21.977ms = +16.67% (all 5 sessions passed, Phase C confirmed). ResNet-18: p50 6.823ms -> 6.251ms = +8.38% (MARGINAL_UNCONFIRMED — Phase C did not confirm, needs more sessions). NLP models: neutral or BUILD_FAIL. rad-dino: +1.33% (MARGINAL). gelu_fusion explicit (h11) also KEEP_CONFIRMED on DINOv2: +13.86%.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "transpose_optimizer eliminates redundant Transpose(NCHW->NHWC->NCHW) pairs around Conv/pooling in the graph. QNN GPU EP benefits from fewer Transpose ops because each requires a memory layout pass on Adreno. For DINOv2 and ResNet, the optimizer removes enough Transposes to provide meaningful latency reduction.",
+      "affected_models": [
+        "facebook/dinov2-small (+16.67% KEEP_CONFIRMED)",
+        "microsoft/resnet-18 (+8.38% MARGINAL_UNCONFIRMED)"
+      ],
+      "no_benefit_models": [
+        "NLP models — most failed to build with transpose_optimizer; likely due to IR version incompatibility"
+      ],
+      "autoconfig_action": "Apply transpose_optimizer as default for QNN GPU EP on Conv+ViT models. AVOID for NLP models until BUILD_FAIL issue is resolved. Feature gap: diagnose why h12 causes BUILD_FAIL on BERT/RoBERTa models.",
+      "added": "2026-06-18",
+      "source": "catalog_gpu_sweep.py h0-h12 full sweep"
+    },
+    {
+      "id": "gpu-008",
+      "title": "highdimRTR_lowdimRTR causes -6.9% regression on MobileViT QNN GPU — same root cause as npu-010",
+      "confidence": "high",
+      "scope": "Models with Gemm->Reshape->Transpose hybrid unfold patterns (MobileViT). DINOv2 was not tested with highdimRTR on GPU separately.",
+      "observation": "catalog_gpu_sweep.py 2026-06-18: MobileViT h9 (opset21+matmul_transpose+attention_fusion bundle) p50=19.224ms vs baseline 17.985ms = -6.89% (DISCARD). Root cause analysis via ONNX diff on NPU version shows +36 extra Reshape nodes (same issue as npu-010). GPU regression is less severe than NPU (-6.9% vs -19%) due to lower DMA sensitivity on Adreno vs Hexagon HTP.",
+      "mechanism_confirmed": true,
+      "mechanism_detail": "Same as npu-010: highdimRTR inserts spurious Reshape pairs after Gemm in MobileViT hybrid unfold mechanism. Breaks Gemm+Reshape dispatch merging. Less severe on GPU than NPU.",
+      "cross_ep_note": "npu-010 and gpu-008 share the same root cause. Fix is the same: block highdimRTR for Gemm->Reshape->Transpose models.",
+      "autoconfig_action": "Same as npu-010: hard-block highdimRTR for models with Gemm->Reshape->Transpose patterns. analyze_insight.py skip_set hint required.",
+      "added": "2026-06-18",
+      "source": "catalog_gpu_sweep.py h0-h12 full sweep + npu-010 ONNX diff"
+    }
+  ],
+  "search_space_rules": {
+    "opset": {
+      "recommended_order": [
+        17
+      ],
+      "rationale": "gpu-006 CONFIRMED: opset 21 neutral-to-negative across 7 models. Stay at opset 17.",
+      "dialectical_note": "⚠️ Based on 7 models on a single Adreno X1-85 device (gpu-006). Re-validate for new model architectures or Adreno driver versions."
+    },
+    "quantization": {
+      "recommended": "skip",
+      "skip": [
+        "all — QDQ hangs on GPU EP (gpu-004)"
+      ],
+      "dialectical_note": "⚠️ This is a QNN SDK limitation, not winml. May change with future QNN SDK versions that support GPU quantization."
+    },
+    "compile": {
+      "always_run": false,
+      "skip": true,
+      "dialectical_note": "⚠️ gpu-003: compile regresses QNN GPU. Confirmed by single experiment. Re-validate if winml compile behavior changes."
+    },
+    "graph_passes": {
+      "recommended": "autoconf defaults + transpose_optimizer for Conv/ViT models",
+      "skip": [
+        "nhwc-transformer (gpu-002)",
+        "highdimRTR (gpu-008)"
+      ],
+      "dialectical_note": "⚠️ Skip rules are ConvNext-specific. Transformer models may benefit from attention_fusion etc."
+    }
+  }
+}
diff --git a/research/autoconfig/ep_device_knowledge/qnn_npu.json b/research/autoconfig/ep_device_knowledge/qnn_npu.json
new file mode 100644
index 000000000..4d945aed2
--- /dev/null
+++ b/research/autoconfig/ep_device_knowledge/qnn_npu.json
@@ -0,0 +1,729 @@
+{
+  "_meta": {
+    "ep": "qnn",
+    "device": "npu",
+    "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / Hexagon HTP)",
+    "ort_version": "1.24.5 (onnxruntime-windowsml; confirmed kMaxSupportedOpset >= 23)",
+    "qnn_sdk_version": "unknown — check QnnSystem.dll version",
+    "models_tested": [
+      "facebook/convnext-tiny-224",
+      "microsoft/resnet-18",
+      "google/vit-base-patch16-224",
+      "apple/mobilevit-small",
+      "facebook/dinov2-small",
+      "hustvl/yolos-small",
+      "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "sentence-transformers/all-MiniLM-L6-v2",
+      "deepset/roberta-base-squad2",
+      "deepset/tinyroberta-squad2",
+      "facebook/dinov2-base",
+      "microsoft/rad-dino",
+      "facebook/dino-vitb16",
+      "BAAI/bge-small-en-v1.5",
+      "rizvandwiki/gender-classification"
+    ],
+    "last_updated": "2026-06-22",
+    "epistemics_warning": "⚠️ All findings are hypotheses derived from limited models on 1 device (Snapdragon X Elite). Confidence levels reflect how well the mechanism is understood, not how universally applicable the finding is. ALWAYS re-validate on new model architectures before using to prune search space."
+  },
+  "sweep_config": {
+    "results_dir": "catalog-qnn-sweep",
+    "quant": "auto",
+    "compile": false,
+    "screen": {
+      "warmup": 20,
+      "iters": 200,
+      "cv_max": 0.15,
+      "thermal_aware": true
+    },
+    "full": {
+      "warmup": 50,
+      "iters": 500,
+      "sessions": 3,
+      "cool_down_s": 30
+    },
+    "confirm_sessions": 2,
+    "min_improvement_pct": 5.0,
+    "effect_size_gate": true,
+    "effect_size_cv_mult": 2.0,
+    "accuracy_eval": true,
+    "eval_samples": 50,
+    "paired_ab_available": true,
+    "baseline_priority": [
+      "h0",
+      "h1"
+    ],
+    "timeouts": {
+      "config_s": 240,
+      "build_s": 900,
+      "bench_s": 720,
+      "eval_s": 360,
+      "model_s": 14400
+    }
+  },
+  "hypotheses": [
+    {
+      "id": "h0",
+      "label": "baseline (auto-config, W8A16)",
+      "opset": null,
+      "optim": null
+    },
+    {
+      "id": "h1",
+      "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)",
+      "opset": null,
+      "optim": null,
+      "build_flags": [
+        "--no-analyze"
+      ]
+    },
+    {
+      "id": "h2",
+      "label": "opset 19",
+      "opset": 19,
+      "optim": null
+    },
+    {
+      "id": "h3",
+      "label": "opset 21 (tests npu-001 bypass)",
+      "opset": 21,
+      "optim": null
+    },
+    {
+      "id": "h4",
+      "label": "opset 17 + conv fusions",
+      "opset": 17,
+      "optim": {
+        "conv_bn_fusion": true,
+        "conv_add_fusion": true,
+        "conv_activation_fusion": true
+      },
+      "guard": {
+        "type": "conv_pct_regression",
+        "finding": "npu-006",
+        "threshold_pct": 20.0
+      }
+    },
+    {
+      "id": "h5",
+      "label": "opset 21 + conv fusions",
+      "opset": 21,
+      "optim": {
+        "conv_bn_fusion": true,
+        "conv_add_fusion": true,
+        "conv_activation_fusion": true
+      },
+      "guard": {
+        "type": "conv_pct_regression",
+        "finding": "npu-006",
+        "threshold_pct": 20.0
+      }
+    },
+    {
+      "id": "h6",
+      "label": "opset 21 + matmul_transpose_fusion",
+      "opset": 21,
+      "optim": {
+        "matmul_transpose_fusion": true
+      }
+    },
+    {
+      "id": "h7",
+      "label": "opset 21 + bias_softmax_fusion",
+      "opset": 21,
+      "optim": {
+        "bias_softmax_fusion": true
+      }
+    },
+    {
+      "id": "h8",
+      "label": "opset 21 + attention_fusion",
+      "opset": 21,
+      "optim": {
+        "attention_fusion": true
+      }
+    },
+    {
+      "id": "h9",
+      "label": "opset 21 + highdimRTR_lowdimRTR",
+      "opset": 21,
+      "optim": {
+        "highdimRTR_lowdimRTR": true
+      }
+    },
+    {
+      "id": "h10",
+      "label": "opset 17 + conv_add_fusion only",
+      "opset": 17,
+      "optim": {
+        "conv_add_fusion": true
+      }
+    }
+  ],
+  "models": [
+    {
+      "id": "microsoft/resnet-18",
+      "task": "image-classification",
+      "model_type": "resnet"
+    },
+    {
+      "id": "google/vit-base-patch16-224",
+      "task": "image-classification",
+      "model_type": "vit"
+    },
+    {
+      "id": "apple/mobilevit-small",
+      "task": "image-classification",
+      "model_type": "mobilevit"
+    },
+    {
+      "id": "facebook/dinov2-small",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "hustvl/yolos-small",
+      "task": "object-detection",
+      "model_type": "yolos"
+    },
+    {
+      "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
+      "task": "text-classification",
+      "model_type": "distilbert"
+    },
+    {
+      "id": "sentence-transformers/all-MiniLM-L6-v2",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    },
+    {
+      "id": "deepset/roberta-base-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "microsoft/rad-dino",
+      "task": "image-feature-extraction",
+      "model_type": "dinov2"
+    },
+    {
+      "id": "deepset/tinyroberta-squad2",
+      "task": "question-answering",
+      "model_type": "roberta"
+    },
+    {
+      "id": "BAAI/bge-small-en-v1.5",
+      "task": "sentence-similarity",
+      "model_type": "bert"
+    }
+  ],
+  "cross_checks": [
+    {
+      "id": "npu-001",
+      "type": "opset_bypass",
+      "candidate": "h3",
+      "stress_ref": "h1",
+      "baseline_ref": "h0"
+    },
+    {
+      "id": "npu-006",
+      "type": "catastrophic_regression",
+      "hypotheses": [
+        "h4",
+        "h5"
+      ],
+      "ratio_threshold": 5.0
+    }
+  ],
+  "findings": [
+    {
+      "id": "npu-001",
+      "title": "opset 21 export gives +24-31% speedup on DINOv2-family models on QNN NPU — mechanism UNKNOWN, NOT a general ViT property, MobileViT benefit NOT reproduced on clean rerun",
+      "observation": "Catalog sweep 2026-06-13 + validation sweep 2026-06-16 (ORT 1.24.5, W8A16 quantized.onnx, 3×500-iter sessions): DINOv2-small +30.6% (opset17 7.18ms → opset21 4.98ms). DINOv2-base +24.1% (opset17 34.56ms → opset21 26.23ms). CRITICAL CONTROL: dino-vitb16 (plain DINO ViT-B/16) -0.7% — NEUTRAL. rad-dino (ViT-L medical) -0.1% — CPU-bound, no NPU effect. ViT-base: -7.4%. BERT/RoBERTa/DistilBERT: neutral. MobileViT-small: REVISED — the original +26.5% (2026-06-13) was on an inflated ~12ms baseline. Clean from-scratch 11-hypothesis rerun 2026-06-22 (fresh winml config+build, 3×500-iter) gave baseline (h0 opset17) median 5.51ms and h3 opset21 median 5.355ms = +2.81% with FULLY OVERLAPPING session ranges (h0=[4.98,5.51,5.72] vs h3=[5.36,5.26,5.90]) → npu-001 NEUTRAL on MobileViT. h6 (opset21+matmul_transpose), previously cited as +42.1%, was 6.218ms = SLOWER than baseline. The earlier MobileViT speedup was a thermal/DVFS artifact of a slow baseline, not an opset21 effect.",
+      "mechanism_confirmed": false,
+      "mechanism_invalidation": "Original hypothesis: kMaxSupportedOpset < 21 gate causes NHWC bypass on older ORT. INVALIDATED: sweep used onnxruntime-windowsml==1.24.5 where kMaxSupportedOpset >= 23. Both opset 17 and opset 21 go through the same NHWC layout transform path on this ORT version. The bypass mechanism does NOT apply. The observed DINOv2 speedup is real but the cause is unknown.",
+      "mechanism_status": "ORIGINAL_MECHANISM_INVALIDATED — must re-investigate",
+      "mechanism_source": "ORT source code investigation (2026-06-10) for ORT < 1.18. Sweep used onnxruntime-windowsml==1.24.5 where this mechanism no longer applies.",
+      "ort_version_critical_note": "The original mechanism (kMaxSupportedOpset gate in IsSupportedOpset()) requires kMaxSupportedOpset < 21. onnxruntime-windowsml==1.24.5 (ORT 1.24.x) has kMaxSupportedOpset >= 23, so BOTH opset17 and opset21 go through the NHWC layout transform. The bypass mechanism does NOT apply to the ORT version used in the sweep. The observed speedup for DINOv2 has an UNKNOWN root cause; the MobileViT speedup did not reproduce on the clean 2026-06-22 rerun.",
+      "architecture_requirement": [
+        "empirically: DINOv2 family (facebook/dinov2-*) consistently benefits. Plain ViT (dino-vitb16) does NOT. Hybrid Conv+attention (MobileViT) showed an apparent speedup in original data but did NOT reproduce on clean rerun (neutral). Pure Conv (ResNet) insufficient data. NLP: neutral."
+      ],
+      "critical_caveats": [
+        "MECHANISM UNKNOWN: Transpose count is IDENTICAL in opset17 and opset21 (both 49 nodes on dinov2-small). The original Transpose-elimination hypothesis is RULED OUT. The +48 Reshape nodes in opset21 are the most observable structural difference but why this speeds up QNN NPU is not understood.",
+        "RESNET-18 EXCLUDED: apparent +20% is statistical noise — 3 sessions span 4x range at sub-ms latency. Need 3 sessions × 2000 iters for reliable data at this scale.",
+        "DVFS NOISE: always use 3 sessions × 500+ iters with cool-down. Single-session CV is meaningless on QNN NPU.",
+        "SCOPE IS DINOV2-FAMILY NOT GENERAL VIT: dino-vitb16 (same ViT-B size as dinov2-base) shows -0.7% NEUTRAL. The speedup is DINOv2-architecture-specific."
+      ],
+      "validated_models": {
+        "benefits_from_opset21": [
+          "facebook/dinov2-small (+30.6%, original catalog sweep 2026-06-13, 3-session)",
+          "facebook/dinov2-base (+24.1%, validation sweep 2026-06-16, fresh quantized.onnx builds, 3-session h1=[34.56,34.67,33.15]ms h3=[33.00,26.22,26.23]ms)"
+        ],
+        "no_benefit_neutral": [
+          "apple/mobilevit-small: REVISED to NEUTRAL. Original +42.1% (h6) / +26.5% (h3) was measured against an inflated ~12ms baseline. Clean from-scratch rerun 2026-06-22 (3×500-iter): baseline h0 opset17 5.51ms, h3 opset21 5.355ms = +2.81% with overlapping session ranges; h6 (opset21+matmul_transpose) 6.218ms = SLOWER. The earlier 'win' was a DVFS/thermal baseline artifact.",
+          "facebook/dino-vitb16 (-0.7%, validation sweep 2026-06-16, h1=[19.92,19.97,19.90]ms h3=[20.20,20.07,19.99]ms — NEUTRAL, critical control)",
+          "google/vit-base-patch16-224 (-7.4%, original catalog)",
+          "hustvl/yolos-small (timeout, no data)",
+          "rizvandwiki/gender-classification (+3.5% apparent, ranges overlap 13.89/13.92ms, NEUTRAL — plain ViT, CRITICAL: near-identical op counts to DINOv2-small (49 Transpose, 121 Reshape) yet NO benefit)",
+          "distilbert/distilbert-base-uncased-finetuned-sst-2-english (-0.1%, NLP neutral)",
+          "sentence-transformers/all-MiniLM-L6-v2 (-0.7%, NLP neutral)",
+          "deepset/roberta-base-squad2 (+0.1%, NLP neutral)"
+        ],
+        "marginal_inconclusive": [
+          "BAAI/bge-small-en-v1.5 (+7.3%, h0=[10.52,10.32,11.01]ms h3=[10.25,9.33,9.94]ms — ranges barely non-overlapping but CV=0.3; NOT CONFIRMED. Needs 5+ sessions to differentiate from noise. Unusual for BERT architecture; all other NLP models tested at <1%)"
+        ],
+        "not_benchmarked_predicted_neutral": [
+          "openai/clip-vit-base-patch32 — build failed at quantization (feature-extraction task calibration not supported); pure transformer, expected neutral based on all NLP data",
+          "cardiffnlp/twitter-roberta-base-sentiment-latest — not run; RoBERTa architecture, predicted neutral (consistent with roberta-base-squad2 +0.1%)",
+          "distilbert/distilbert-base-cased-distilled-squad — not run; DistilBERT architecture, predicted neutral (consistent with distilbert-base-uncased -0.1%)"
+        ],
+        "cpu_bound_cannot_test": [
+          "microsoft/rad-dino (-0.1% on CPU EP, all hypotheses ~275ms CV<0.022 — model runs on CPU, opset irrelevant; QNN NPU BUILD_FAIL 2026-06-17, see npu-008)"
+        ],
+        "data_unreliable": [
+          "resnet-18 — sub-ms latency, 3-session range spans 4x; no reliable signal (see data_reliability_notes)"
+        ]
+      },
+      "original_mechanism_explanation": {
+        "root_cause_for_old_ort": "kMaxSupportedOpset gate in IsSupportedOpset() (onnxruntime/core/optimizer/layout_transformation/layout_transformation.cc). On ORT where kMaxSupportedOpset < 21, opset 21 models bypass the NCHW→NHWC layout transformer entirely.",
+        "why_bypass_helped_convnext": "NHWC layout transform inserts Transpose(NCHW→NHWC) around Conv. For ConvNext, residual connections prevent Transpose cancellation → opset17 graph has MORE Transposes on HTP than opset21 graph.",
+        "why_cpu_is_opposite": "CPU relies on TransposeOptimizer to REMOVE existing Transposes. Skipping the optimizer (opset > kMaxSupportedOpset) leaves Transposes in place → CPU SLOWER. Same gate, opposite effect.",
+        "ort_kMaxSupportedOpset_by_version": {
+          "v1.14.x": 18,
+          "v1.16.x": 19,
+          "v1.17.x": 20,
+          "v1.18.x": 21,
+          "v1.24.x": ">= 23 (CONFIRMED: ORT 1.24.4 in C:\\tmp\\autoconfig-demo accepts opset 22 and 23 via InferenceSession with CPUExecutionProvider; opset 24 fails with 'No op registered for ...' not 'Unsupported opset')",
+          "main_HEAD": 26
+        },
+        "key_files": [
+          "onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:2724-2746 — MakeOptimizerContext() gate",
+          "onnxruntime/core/optimizer/layout_transformation/layout_transformation.cc — IsSupportedOpset()",
+          "onnxruntime/core/session/inference_session.cc:1589-1626 — transform_layout_fn=nullptr path"
+        ]
+      },
+      "transpose_analysis_2026_06_16": {
+        "method": "onnx.load() on winml-built optimized.onnx and quantized.onnx for h0 (opset17) and h3 (opset21) from catalog_qnn_sweep facebook--dinov2-small. Op counts via collections.Counter on graph.node.",
+        "opset17_optimized": {
+          "total_nodes": 391,
+          "Transpose": 49,
+          "Reshape": 121,
+          "Gemm": 72,
+          "Mul": 48,
+          "Conv": 1
+        },
+        "opset21_optimized": {
+          "total_nodes": 439,
+          "Transpose": 49,
+          "Reshape": 169,
+          "Gemm": 72,
+          "Mul": 48,
+          "Conv": 1
+        },
+        "opset17_quantized": {
+          "total_nodes": 1398,
+          "Transpose": 49,
+          "Reshape": 121,
+          "DequantizeLinear": 615,
+          "QuantizeLinear": 392
+        },
+        "opset21_quantized": {
+          "total_nodes": 1542,
+          "Transpose": 49,
+          "Reshape": 169,
+          "DequantizeLinear": 663,
+          "QuantizeLinear": 440
+        },
+        "key_finding": "Transpose count is IDENTICAL (49 nodes) in both opset17 and opset21. The NHWC Transpose-reduction hypothesis is RULED OUT. opset21 has MORE Reshape nodes (+48), more QDQ pairs (+48 DQ, +48 Q), and more total nodes. Despite more nodes, opset21 runs 30% faster on QNN NPU — mechanism still unknown.",
+        "rules_out": [
+          "NHWC Transpose elimination as speedup cause",
+          "Fewer total ops as explanation"
+        ],
+        "consistent_with": [
+          "Different graph structure at opset21 enabling better QNN NPU internal scheduling or graph partitioning, possibly via the +48 Reshape nodes acting as data-layout hints or memory access pattern changes"
+        ]
+      },
+      "alternative_mechanism_hypotheses": [
+        "QNN EP graph partitioner assigns ops differently when the model has opset21 Reshape semantics — the +48 Reshape nodes may segment the graph into better-aligned HTP subgraphs",
+        "Quantization calibration path differs between opset exports → quantized.onnx has different scale/zero-point distributions at opset21 → better QNN NPU numeric alignment",
+        "PyTorch ONNX exporter produces different intermediate tensor shapes at opset 21 → better memory access locality on the QNN NPU (HTP)",
+        "The +48 Reshape ops in opset21 are 'free' no-ops on QNN NPU (identity reshape with same shape) that happen to trigger a faster QNN internal code path"
+      ],
+      "data_reliability_notes": {
+        "dinov2_small": {
+          "h1_opset17_sessions_ms": [
+            7.176,
+            6.392,
+            9.436
+          ],
+          "h3_opset21_sessions_ms": [
+            4.977,
+            4.876,
+            6.884
+          ],
+          "assessment": "RELIABLE. Ranges overlap only at the extremes (h3 max 6.88ms vs h1 min 6.39ms). h3 sessions 1+2 (4.98/4.88ms) are well below the entire h1 range. Speedup is real.",
+          "tool": "catalog_qnn_sweep.py, optimized.onnx (v1 pipeline)"
+        },
+        "dinov2_base_v3": {
+          "h1_opset17_sessions_ms": [
+            34.556,
+            34.668,
+            33.148
+          ],
+          "h3_opset21_sessions_ms": [
+            33.001,
+            26.224,
+            26.227
+          ],
+          "assessment": "RELIABLE. h1 sessions fully consistent (~34ms). h3 s0 slightly elevated (JIT warmup) but s1+s2 consistent at 26.2ms. Speedup +24.1% is well-separated from noise.",
+          "tool": "validation_sweep.py v3, quantized.onnx W8A16 (fresh builds for both hyps)"
+        },
+        "dino_vitb16": {
+          "h1_opset17_sessions_ms": [
+            19.924,
+            19.975,
+            19.897
+          ],
+          "h3_opset21_sessions_ms": [
+            20.197,
+            20.071,
+            19.988
+          ],
+          "assessment": "RELIABLE CONTROL. Extremely stable. -0.7% (opset21 marginally slower, within noise). Opset21 has NO EFFECT on plain DINO ViT-B/16. Critical discriminant: npu-001 speedup is NOT a general ViT property.",
+          "tool": "validation_sweep.py, quantized.onnx W8A16 (fresh builds)"
+        },
+        "mobilevit_small": {
+          "h1_opset17_sessions_ms_ORIGINAL_2026_06_13": [
+            10.557,
+            11.721,
+            27.436
+          ],
+          "h3_opset21_sessions_ms_ORIGINAL_2026_06_13": [
+            10.814,
+            8.625,
+            8.449
+          ],
+          "clean_rerun_2026_06_22": {
+            "h0_opset17_sessions_ms": [
+              4.98,
+              5.51,
+              5.72
+            ],
+            "h3_opset21_sessions_ms": [
+              5.36,
+              5.26,
+              5.9
+            ],
+            "h6_opset21_matmul_transpose_p50_ms": 6.218,
+            "verdict": "NEUTRAL_WITHIN_NOISE (+2.81%, ranges overlap)"
+          },
+          "assessment": "REVISED to UNRELIABLE/NEUTRAL. The original h1 (opset17) median ~11.7ms was inflated by a 27.4ms DVFS spike, making opset21 look ~20-26% faster. A clean from-scratch 11-hypothesis rerun 2026-06-22 (fresh winml config+build, 3×500-iter) measured a true baseline median of 5.51ms; h3 opset21 = 5.355ms = +2.81% with FULLY OVERLAPPING session ranges → effect-size gate verdict NEUTRAL_WITHIN_NOISE. h6 (opset21+matmul_transpose, previously cited +42.1%) = 6.218ms = SLOWER. The original 'speedup' was a polluted-baseline / DVFS artifact, not an opset21 effect."
+        },
+        "resnet_18": {
+          "h1_opset17_sessions_ms": [
+            0.99,
+            4.003,
+            2.716
+          ],
+          "h3_opset21_sessions_ms": [
+            1.054,
+            2.175,
+            4.107
+          ],
+          "assessment": "UNRELIABLE. Sub-ms model. Session range spans 4x for same config. Reported '+20.2% speedup' (h1 median 2.72ms vs h3 median 2.18ms) is NOT a real signal — the two distributions fully overlap. REMOVED from benefits list."
+        },
+        "gender_classification_vit": {
+          "h0_opset17_sessions_ms": [
+            14.15,
+            14.94,
+            13.89
+          ],
+          "h3_opset21_sessions_ms": [
+            13.7,
+            13.92,
+            13.87
+          ],
+          "assessment": "NEUTRAL. Ranges barely overlap (h0 min=13.89ms, h3 max=13.92ms). +3.5% is within DVFS noise (CV ~0.35). CRITICAL: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count or general ViT architecture.",
+          "tool": "run_one.py 2026-06-17, quantized.onnx W8A16"
+        },
+        "bge_small_en": {
+          "h0_opset17_sessions_ms": [
+            10.52,
+            10.32,
+            11.01
+          ],
+          "h3_opset21_sessions_ms": [
+            10.25,
+            9.33,
+            9.94
+          ],
+          "assessment": "MARGINAL / INCONCLUSIVE. Ranges barely not overlapping but CV ~0.3 means high within-session variance. +7.3% apparent gain — larger than all other NLP models (distilbert -0.1%, MiniLM -0.7%, RoBERTa +0.1%) but may be DVFS noise. Needs 5+ sessions to confirm. Do NOT cite as benefit.",
+          "tool": "run_one.py 2026-06-17, quantized.onnx W8A16, bert model-type"
+        }
+      },
+      "action_for_autoconfig": "Include opset 21 in search for DINOv2-family models (facebook/dinov2-*). Do NOT assume it helps MobileViT-class Conv+attention hybrids — the original MobileViT win did NOT reproduce on a clean rerun (neutral, +2.81% within noise). Do NOT apply to plain ViT (dino-vitb16, gender-classification both neutral), YOLOS, or NLP (BERT-family all neutral at ±0.7%). CRITICAL: gender-classification ViT has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape) but shows NO benefit — the effect is deeper than op counts. For ResNet-class Conv-only: insufficient data. ALWAYS dump optimized graph to compare Transpose counts if speedup is unexpected, and ALWAYS clear the effect-size gate (gain >= 2×session-CV AND ranges separated) before trusting a win.",
+      "confidence": "medium-high on empirical observation (DINOv2-small +30.6% and DINOv2-base +24.1% both confirmed with clean 3-session protocol, fresh builds). Low on mechanism — original Transpose-bypass explanation ruled out (Transpose count identical opset17/21), kMaxSupportedOpset>=23 confirmed. Mechanism unknown. Scope: DINOv2 family only until mechanism is understood. 12 models now tested: 2 benefit (DINOv2-small/base), 8 neutral (incl. MobileViT after clean 2026-06-22 rerun), 1 marginal/inconclusive (BGE-small +7.3% with high CV), 1 CPU-bound.",
+      "falsified_by": null,
+      "scope": "ORT 1.24.5 (onnxruntime-windowsml). DINOv2-small and DINOv2-base confirmed. MobileViT-small REVISED to NEUTRAL (original win was a DVFS baseline artifact; clean rerun 2026-06-22 = +2.81% within noise). Does NOT apply to plain ViT (dino-vitb16 and rizvandwiki/gender-classification both confirmed NEUTRAL despite identical op counts to DINOv2-small), YOLOS-small, BERT-family NLP, CPU-bound models (rad-dino). ResNet-18 data inconclusive. BGE-small-en +7.3% marginal, inconclusive.",
+      "tracked_issue": "#869",
+      "perf_gain_validation_gates": {
+        "gate1_statistical": "PASSED for DINOv2 (3-session, ranges separate). FAILED for MobileViT (clean rerun 2026-06-22: ranges overlap, +2.81% < effect-size noise floor → NEUTRAL). FAILED for ResNet-18.",
+        "gate2_mechanism": "FAILED — original kMaxSupportedOpset bypass mechanism does not apply to ORT 1.24.x. New mechanism uninvestigated.",
+        "gate3_thermal_control": "PARTIALLY — 3×500-iter with 30s cool-down is better than single-session but DVFS spikes still occur and CAN poison the baseline (the MobileViT win was traced to exactly this; see ep_knowledge/README.md promotion checklist)."
+      },
+      "follow_up_required": [
+        "DONE: kMaxSupportedOpset >= 23 confirmed for ORT 1.24.4 (accepts opset 22 and 23 at InferenceSession level)",
+        "DONE: Transpose analysis — opset17 vs opset21 DINOv2-small: IDENTICAL (49 Transpose both). Not the mechanism.",
+        "OPEN: Investigate QNN EP graph partitioning diff for opset17 vs opset21. Why do +48 Reshape nodes help?",
+        "Run 5+ sessions (not 3) on DINOv2 opset17 vs opset21 to reduce DVFS uncertainty",
+        "Test EfficientNet-B0, MobileNet-V3 to determine if benefit is 'Conv+residual' or 'Conv+attention hybrid' specific",
+        "For ResNet-18: run 3 sessions x 2000 iters to get reliable sub-ms measurements"
+      ],
+      "experiments_convnext_early": [
+        {
+          "opset": 17,
+          "p50_ms": 54.2,
+          "p90_ms": 104.5,
+          "min_ms": 9.56,
+          "std_ms": 44.1,
+          "iters": 50,
+          "note": "warm device, DVFS-dominated, NOT reliable"
+        },
+        {
+          "opset": 19,
+          "p50_ms": 12.1,
+          "p90_ms": 77.7,
+          "min_ms": 9.11,
+          "std_ms": 60.0,
+          "iters": 50,
+          "note": "NOT reliable — 50 iters, DVFS"
+        },
+        {
+          "opset": 21,
+          "p50_ms": 12.2,
+          "p90_ms": 38.0,
+          "min_ms": 9.73,
+          "std_ms": 10.1,
+          "iters": 20,
+          "note": "only 20 iters — NOT reliable"
+        }
+      ],
+      "last_updated": "2026-06-22"
+    },
+    {
+      "id": "npu-002",
+      "title": "W8A16 quantization provides ~1.9x speedup over FP32 on QNN NPU (ConvNext only — not yet generalized)",
+      "observation": "ConvNext FP32 baseline: p50=19.4ms. W8A16 quantized (minmax, 128 samples): p50=10.29ms. 1 model, 1 device.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "QNN HTP has native INT8 weight / FP16 activation datapath. W8A16 maps directly to HTP's weight-compressed matmul kernels.",
+      "action_for_autoconfig": "Always quantize for QNN NPU. W8A16 is the starting point. Validate accuracy after quantization.",
+      "confidence": "medium — mechanism is well-understood (HTP architecture), but 1.9x magnitude is from 1 model only. Speedup will vary by architecture.",
+      "falsified_by": null,
+      "scope": "ConvNext only — single model validation. The catalog sweep used W8A16 for all 8 models but did not include FP32 baselines for those models, so the 1.9x figure cannot be generalized. Need FP32 baseline runs on at least 3 diverse models before claiming 'most vision models'.",
+      "do_not_generalize_to": "Models with unusual op types not supported by QNN W8A16 path. Magnitude claim (1.9x) is ConvNext-specific.",
+      "follow_up_required": [
+        "Measure FP32 baseline for MobileViT, DINOv2, ResNet-18 to verify speedup generalizes"
+      ]
+    },
+    {
+      "id": "npu-003",
+      "title": "winml compile adds ~1.7x speedup on top of quantization for QNN NPU (ConvNext only — not yet generalized)",
+      "observation": "ConvNext W8A16 quantized: p50=10.29ms. W8A16 + compiled (EPContext): p50=6.01ms. 1 model, 1 device.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "Compilation pre-builds the QNN binary graph (.bin) and eliminates JIT graph partitioning at session creation time. EPContext model loads the pre-built binary directly.",
+      "action_for_autoconfig": "Always run winml compile after finding best quantized config for QNN NPU.",
+      "confidence": "medium — mechanism is well-understood (EPContext documented by QNN SDK). 1.7x magnitude is ConvNext-specific. Simpler models may see less benefit; complex models may see more.",
+      "falsified_by": null,
+      "scope": "ConvNext only — single model validation. Mechanism generalizes; magnitude (1.7x) does not. The catalog sweep results.json baseline p50 values already include the effects of whatever auto-config winml chose (which may or may not include compile) — not directly comparable.",
+      "follow_up_required": [
+        "Verify compile speedup on MobileViT and DINOv2"
+      ]
+    },
+    {
+      "id": "npu-004",
+      "title": "⚠️ ANECDOTE (NO DATA): W8A8 may cause accuracy collapse on models with LN+GELU — UNVALIDATED",
+      "observation": "W8A8 quantization was attempted on ConvNext. The experiment was aborted early — exact accuracy numbers were NOT recorded. The claim 'top-1 < 15%' is a recalled anecdote from the experimenter, not a measured result.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "ConvNext uses LayerNormalization + GELU in every block. Quantizing both weights AND activations to INT8 in these ops introduces severe numerical error. However, this is a hypothesis — the aborted experiment does not confirm or refute it.",
+      "action_for_autoconfig": "Treat as anecdotal. Do NOT use this to skip W8A8 without running eval first. If W8A8 top-1 drops > 15 points vs W8A16 baseline on first attempt, then skip.",
+      "confidence": "very_low — anecdotal, no preserved data, experiment not reproducible as recorded",
+      "falsified_by": null,
+      "scope": "UNVALIDATED. May apply to models with LN+GELU blocks but this is unconfirmed.",
+      "do_not_generalize_to": "BERT/ResNet models where W8A8 is often fine",
+      "required_experiment": "Run W8A8 quantization on ConvNext-tiny-224, record exact top-1 accuracy (eval on ImageNet-1k, 1000 samples minimum). Compare to W8A16 baseline. If collapse observed, also run with calibration_method=percentile to see if calibration quality is the issue."
+    },
+    {
+      "id": "npu-005",
+      "title": "QNN Hub W8A16 model is slower on ORT QNN EP stack than ORT-quantized W8A16 — but comparison is not fair",
+      "observation": "QNN Hub W8A16 on winml ORT QNN EP: p50=14.82ms, std=8.8ms. ORT-quantized W8A16 (opset 17 QDQ; 6.01ms matches the compiled W8A16 figure in npu-003): p50=6.01ms stable.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "QNN Hub uses opset 21 QDQ format with uint16 input tensor — this format may be incompatible with ORT QNN EP's expected quantization format.",
+      "fairness_caveat": "⚠️ This is NOT a fair comparison. QNN Hub models are compiled for the qairt native stack (Qualcomm AI Runtime), not for ORT QNN EP. Running a qairt-compiled model through ORT QNN EP is an unsupported use case. The comparison only shows that you should use ORT-generated quantization when targeting ORT QNN EP, which is obvious.",
+      "action_for_autoconfig": "Use ORT-generated W8A16 quantization (winml build), NOT QNN Hub pre-quantized models, when targeting ORT QNN EP stack.",
+      "confidence": "low — the finding is trivially true (use the right tool for the right stack) but the experiment doesn't tell us anything useful about relative performance.",
+      "falsified_by": null,
+      "scope": "ORT QNN EP stack only. QNN Hub models on their native qairt stack are likely much faster — that comparison was never made."
+    },
+    {
+      "id": "npu-006",
+      "title": "Conv fusions (conv-bn/add/activation) cause catastrophic QNN NPU CPU fallback on Conv-dominant models",
+      "observation": "ResNet-18 with conv-bn-fusion+conv-add-fusion+conv-activation-fusion: 3-session p50s = [132.3, 134.97, 130.67]ms (CV=0.016, extremely stable) vs baseline [0.99, 4.00, 2.72]ms. ~49x regression on medians (132.3 vs 2.72ms), ~130-136x against the fastest baseline session (0.99ms). MobileViT with same fusions: [11.60, 11.36, 10.52]ms, neutral vs baseline [10.56, 11.72, 27.44]ms. BERT-family: neutral (no Conv ops to fuse). VALIDATION SWEEP 2026-06-16: dinov2-base h4=[26.06,25.92,25.87]ms vs h1=[34.56,34.67,33.15]ms → latency -25% with fusions (FASTER, not a regression). dino-vitb16 h4=[20.12,20.04,20.41]ms vs h1=[19.92,19.97,19.90]ms → latency +1.0% (neutral). Conv fusions are only hazardous for Conv-dominant models.",
+      "session_evidence_note": "The h4 sessions for ResNet-18 (132.3, 134.97, 130.67ms) show near-zero variance (CV=0.016) — in stark contrast to all other hypotheses. This is unusual for QNN NPU and strongly suggests deterministic CPU fallback (not DVFS noise). The regression is 50-136x even comparing best sessions.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "ORT conv fusion pass (ConvAddActivationFusion, ConvBNFusion) produces fused op types (e.g., Conv+BN fused) that QNN EP cannot map to HTP kernels. These ops fall back to CPU execution, adding CPU-NPU round-trip overhead per op for a Conv-heavy graph like ResNet.",
+      "action_for_autoconfig": "⚠️ CRITICAL: Do NOT apply conv-bn-fusion / conv-add-fusion / conv-activation-fusion for QNN NPU on Conv-dominant models (ResNet, EfficientNet, MobileNet). These passes are beneficial for CPU EP but hazardous for QNN NPU. Always run accuracy + latency gate after applying any Conv fusion. If regression > 5x, disable all conv fusions immediately.",
+      "confidence": "high on regression observation (4900%); medium on mechanism (CPU fallback hypothesis not yet confirmed via EP partition dump)",
+      "falsified_by": null,
+      "scope": "Conv-dominant models (ResNet, EfficientNet, MobileNet). MobileViT safe (original data). DINOv2 and plain ViT: fusions are neutral or slightly beneficial (2026-06-16 validation). Not applicable to NLP.",
+      "severity": "critical — can produce 50x regression",
+      "follow_up_required": [
+        "Dump QNN EP partition to confirm fused ops cause CPU fallback",
+        "Test EfficientNet and MobileNet to confirm generalization",
+        "Check if winml analyze linter can detect this pattern pre-build"
+      ],
+      "refinement": "2026-06-17 delta sweep: ResNet-18 h10 (conv-add-fusion ONLY, no conv-bn or conv-activation) = p50 0.955ms vs baseline 0.964ms (0.93% faster), i.e. NEUTRAL. Catastrophic regression ONLY occurs with the full fusion pack (conv-bn-fusion + conv-add-fusion + conv-activation-fusion), which produces FusedConv ops. Individual conv-add-fusion is safe. Leading explanation: the FusedConv op created by the bundle is not dispatchable by QNN EP (CPU fallback still to be confirmed via EP partition dump).",
+      "last_updated": "2026-06-18"
+    },
+    {
+      "id": "npu-007",
+      "title": "DVFS thermal noise on QNN NPU makes CV-based stability gating unreliable — requires session-level averaging",
+      "observation": "Across all 8 catalog models, QNN NPU CV ranges 0.1–2.0+ even on warm device. Original CV<15% gate blocks most candidates. Differences < 10% are within noise floor.",
+      "mechanism_confirmed": true,
+      "mechanism_hypothesis": "Snapdragon X Elite HTP Hexagon core runs DVFS aggressively. Single-session CV is dominated by thermal state, not model performance. The only reliable signal comes from session-level averaging (3+ independent sessions with cool-down).",
+      "action_for_autoconfig": "DISABLE CV gate for QNN NPU. Replace with: (1) minimum 3 independent sessions × 500+ iters with 30s cool-down between sessions. (2) Use median p50 across sessions as the signal. (3) Only trust gains > 10% — anything below is within noise floor. (4) Do NOT compare within-session std to declare stability.",
+      "confidence": "high — consistent across 8 models in catalog sweep",
+      "falsified_by": null,
+      "scope": "General — applies to all models on QNN NPU / Snapdragon X Elite HTP",
+      "bench_protocol_update": {
+        "screen_phase": "SKIP CV gate; run 200 iters as warmup only",
+        "full_phase": "3 sessions × 500 iters, 30s cool-down between sessions",
+        "signal": "median p50 across sessions",
+        "noise_floor": ">10% gain required to declare improvement"
+      }
+    },
+    {
+      "id": "npu-008",
+      "title": "microsoft/rad-dino fails to build on QNN NPU across all opset variants (winml crash rc=0xC0000142)",
+      "observation": "catalog_qnn_sweep run 2026-06-17: all 6 hypotheses for microsoft/rad-dino (opset 17/19/21, with/without conv fusions) returned rc=3221225794 in <2s. That value is 0xC0000142 (STATUS_DLL_INIT_FAILED), not 0xC0000005 (access violation), which would be 3221225477. No stderr captured: the winml process died before producing any output. This is distinct from a build error: it is a hard crash of the winml CLI itself.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "rad-dino is a ViT encoder with a non-standard DINOv2 variant (larger heads, custom CLS token handling). It may contain one or more ONNX operators or graph shapes that crash the QNN EP quantization or compilation path (winml build calls QNN SDK compilation under the hood). Because the exit code indicates a DLL initialization failure rather than a memory fault, a failure while loading a dependency should also be ruled out. Could also be a model size / dynamic axis issue.",
+      "action_for_autoconfig": "Skip QNN NPU for microsoft/rad-dino. If QNN NPU is required, file a bug with the crash dump and test with winml analyze first to identify unsupported ops before attempting build.",
+      "confidence": "high on observation (reproducible across all 6 hypotheses in same run); low on mechanism (no stack trace available)",
+      "falsified_by": null,
+      "scope": "microsoft/rad-dino only (confirmed). DINOv2-family models in general (facebook/dinov2-small, facebook/dinov2-base) are NOT affected — they build and run on QNN NPU successfully.",
+      "severity": "blocker — model is incompatible with QNN NPU build",
+      "follow_up_required": [
+        "Run winml analyze --ep qnn on rad-dino ONNX to check unsupported ops",
+        "Capture crash dump (ProcDump) to get stack trace",
+        "Compare ONNX graph structure of rad-dino vs facebook/dinov2-small to isolate differentiating ops"
+      ],
+      "date_observed": "2026-06-17"
+    },
+    {
+      "id": "npu-009",
+      "title": "bias_softmax_fusion adds an incremental ~14 percentage points on DINOv2 QNN NPU when combined with opset21",
+      "confidence": "medium",
+      "scope": "ViT-class models with attention+bias patterns. Confirmed on DINOv2-small; untested on plain ViT or BERT.",
+      "observation": "Catalog sweep 2026-06-17 delta sweep: DINOv2-small h7 (opset21+bias_softmax_fusion) p50=4.027ms vs h3 (opset21 only) p50=4.977ms. Incremental gain = +14.5 percentage points of baseline latency (19.1% faster than h3 in relative terms). Total gain vs baseline: +38.6% (h7) vs +24.1% (h3). bias_softmax_fusion hypothesis also outperforms attention_fusion (h8=+28.4%) and matmul_transpose (h6=+24.8%) on DINOv2.",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "bias_softmax_fusion folds Add(qk_scores, bias)+Softmax into a single FusedSoftmax op. QNN Hexagon HTP has a native hardware path for fused attention-head softmax. Reduces dispatch overhead between the bias addition and the softmax kernel.",
+      "affected_models": [
+        "facebook/dinov2-small"
+      ],
+      "validated_on": [
+        "facebook/dinov2-small (+38.6% total, h7=4.027ms vs baseline 6.561ms)"
+      ],
+      "not_tested": [
+        "DINOv2-base",
+        "plain ViT",
+        "BERT/RoBERTa (expected neutral based on npu-001 scope)"
+      ],
+      "autoconfig_action": "For DINOv2-family: include bias_softmax_fusion in the opset21 bundle. Prioritize over attention_fusion (h8) since h7 outperformed h8 by 10 percentage points on DINOv2-small.",
+      "added": "2026-06-18",
+      "source": "catalog_qnn_sweep.py h6-h10 delta sweep, DINOv2-small results.json"
+    },
+    {
+      "id": "npu-010",
+      "title": "highdimRTR_lowdimRTR causes -19% regression on MobileViT QNN NPU due to spurious Reshape insertion",
+      "confidence": "high",
+      "scope": "Models with Gemm->Reshape->Transpose hybrid unfold patterns (MobileViT confirmed). DINOv2 (pure ViT) benefits: +38.1%. Architecture-dependent.",
+      "observation": "Catalog sweep 2026-06-17 delta sweep: MobileViT-small h9 (opset21+highdimRTR_lowdimRTR) median_p50=14.363ms vs h0 baseline 12.075ms = -18.9% regression. GPU sweep: same model h9 (GPU) = -6.89%. ONNX diff: h9 NPU graph has +36 extra Reshape nodes (395->431 total; 108->144 Reshape). The 12 original RTR patterns in h0 are UNCHANGED in h9. Instead, optimizer inserted Reshape pairs as intermediaries after Gemm nodes, breaking dispatch merging.",
+      "mechanism_confirmed": true,
+      "mechanism_detail": "highdimRTR misidentifies Gemm->Reshape->Transpose sequences (MobileViT CNN-ViT hybrid patch-unfold mechanism) as reducible RTR patterns. Inserts 36 extra intermediate Reshape nodes. These break Gemm+Reshape dispatch merging on QNN NPU and add DMA traffic. NPU more severely affected (-19%) than GPU (-6.9%) due to higher HTP DMA sensitivity.",
+      "affected_models": [
+        "apple/mobilevit-small (-19% NPU, -6.9% GPU)"
+      ],
+      "validated_safe_models": [
+        "facebook/dinov2-small (+38.1% NPU with h9 — pure ViT benefits from highdimRTR)"
+      ],
+      "architectural_discriminator": "Gemm->Reshape->Transpose hybrid unfold pattern (CNN-ViT). Detect via analyze_insight.py op-sequence scan before applying.",
+      "autoconfig_action": "Hard-block highdimRTR for models with Gemm->Reshape->Transpose hybrid sequences. analyze_insight.py should detect this pattern and add highdimRTR to skip_set.",
+      "added": "2026-06-18",
+      "source": "catalog_qnn_sweep.py h6-h10 delta sweep + ONNX graph diff (MobileViT h0 vs h9 node count comparison)"
+    },
+    {
+      "id": "npu-011",
+      "title": "Fusions that fire (graph topology changes) but yield no perf benefit should be recorded and benefit-gated, not auto-kept",
+      "confidence": "medium",
+      "scope": "Cross-architecture on QNN NPU. Observed strongest on transformer/attention models (BERT, ConvNeXt) where conv/attention fusions fire cleanly but p50 is unchanged. Distinct from npu-006 (fusion fires AND regresses) and npu-001 (change helps).",
+      "observation": "A fusion flag can be confirmed *applicable* — after enabling it the post-optimize op count drops and graph topology changes vs the baseline build — yet the measured p50 delta stays inside the DVFS noise band (|delta| < CV-derived threshold). BERT-base h2/h3/h4 on NPU: graphs change per hypothesis but full-session p50s all land 39-43ms with CV 0.6-1.1, indistinguishable from h0 baseline. ConvNeXt-tiny: every hypothesis within 0.24% of baseline. These are 'applied-but-not-beneficial' fusions: real graph transforms with zero perf return.",
+      "criterion": "Classify a fusion as APPLIED when pre-vs-post-optimize op count and/or graph topology change (the fusion fired); classify as BENEFICIAL only when median p50 improves beyond the noise band. APPLIED && !BENEFICIAL = record here. Use op-count + topology diff (not input-graph pattern match) to prove the fusion fired, and session-averaged p50 with CV-derived threshold to judge benefit. IMPORTANT first cut: many neutral fusions are not 'applied but useless' — they are NO-OPS that never fired. The op-count diff is what separates the two.",
+      "evidence": "BERT-base QNN NPU recipe sweep 2026-06-24 (winml analyze metadata.total_operators per hypothesis): h0 baseline opset17 = 392 ops; h4 opset17+conv_fusions = 392 ops IDENTICAL op breakdown (MatMul 24 / Gelu 13 / Add 38 / LayerNorm 26 unchanged) → the conv-fusion flag was a complete NO-OP (BERT has no Conv ops to fuse). h5 opset21+conv_fusions = 440 ops = same as h3 opset21 ALONE (440) → conv fusion again added zero ops on top of the opset bump. The only thing that changed BERT's graph was the opset export (opset17 = 392 ops; opset19/21 = 440 ops, +48 nodes) — and that topology change did NOT improve p50 (opset21 medians no better than baseline, within thermal noise). So BERT's 'neutral conv fusions' are NOT npu-011 instances at all — they never fired; the genuine npu-011 instance here is opset17->21 (+48 ops, no benefit). Without the op-count diff one would have wrongly logged a 'neutral fusion' when nothing happened.",
+      "perf_reliability_note": "BERT per-session p50s were thermally dominated: h0 and h1 have IDENTICAL build configs (both 392 ops, same resolved optim) yet h1's 3 sessions read 50/42/82ms vs h0's clean 29.1/29.5/29.5ms — pure DVFS throttling because hypotheses run back-to-back and the chip heats up, biasing later hypotheses slower. The current 'median of 3 back-to-back session p50s' is not robust for benefit-gating; interleaved or cooldown-separated sampling is required (reinforces npu-007).",
+      "mechanism_confirmed": false,
+      "mechanism_hypothesis": "Fused ops are dispatchable by the QNN EP (no CPU fallback, unlike npu-006) so correctness/perf is preserved, but the fused kernel is not faster than the unfused sequence on HTP for these shapes — the win the fusion targets (CPU EP op-dispatch overhead) does not exist on NPU. Net effect is neutral, while build time and EP-mapping risk still increase.",
+      "autoconfig_action": "The analyzer should not auto-keep a fusion merely because its pattern matches the input graph. Steps: (1) build with and without the flag, diff op counts + topology to confirm the fusion fired; (2) compare session-averaged p50 against the noise band; (3) if it fired but delta is within noise, drop the fusion from the emitted config (or flag it 'neutral — omitted') rather than retaining it. Retaining neutral fusions costs build time and adds EP-dispatch risk for no return. This benefit gate is feature gap #4 in the README.",
+      "added": "2026-06-24",
+      "source": "catalog_qnn_sweep.py recipe NPU sweep — BERT-base + ConvNeXt-tiny (fusions fire, p50 within noise)"
+    }
+  ],
+  "search_space_rules": {
+    "opset": {
+      "recommended_order_conv_residual": [
+        21,
+        17
+      ],
+      "recommended_order_pure_attention": [
+        17
+      ],
+      "recommended_order_nlp": [
+        17
+      ],
+      "recommended_order_pure_conv": [
+        17,
+        "21 only if time allows — insufficient data"
+      ],
+      "architecture_gate": "DINOv2 family (facebook/dinov2-*) → try opset 21 first (+24-31% confirmed). MobileViT-class Conv+attention hybrid → try opset 21 (+26% original data). Plain ViT (dino-vitb16-class) → opset 17 only (NEUTRAL confirmed 2026-06-16). YOLOS → opset 17 only. NLP (BERT-family) → opset 17 only. Pure Conv (ResNet) → opset 17 (data insufficient for opset21 recommendation).",
+      "rationale": "npu-001 validated 2026-06-13 and 2026-06-16: DINOv2-small +30.6%, DINOv2-base +24.1% (fresh builds, clean protocol). Critical control: dino-vitb16 -0.7% NEUTRAL. This proves the speedup is DINOv2-architecture-specific, not a general ViT property.",
+      "dialectical_note": "⚠️ The original mechanism explanation (kMaxSupportedOpset bypass) does NOT apply to ORT 1.24.x (onnxruntime-windowsml 1.24.5). The speedup for DINOv2/MobileViT is empirically real but mechanistically unexplained. Always validate on the actual ORT version being shipped."
+    },
+    "quantization": {
+      "recommended": "w8a16",
+      "skip": [
+        "w8a8 if initial top1 < 15%"
+      ],
+      "dialectical_note": "⚠️ W8A8 skip rule is ConvNeXt-specific (LN+GELU sensitivity). Try W8A8 for models without LN in every block."
+    },
+    "compile": {
+      "always_run": true,
+      "dialectical_note": "⚠️ Compile benefit is well-understood (EPContext pre-built binary). Low risk of being wrong, but verify compile output loads correctly."
+    },
+    "graph_passes": {
+      "recommended": "autoconf defaults (gelu_fusion, matmul_add_fusion)",
+      "NEVER_apply_for_qnn_npu": [
+        "conv-bn-fusion",
+        "conv-add-fusion",
+        "conv-activation-fusion"
+      ],
+      "hazard_note": "npu-006 CRITICAL: Conv fusions cause 4900% regression on ResNet-18. Do NOT apply conv fusions to Conv-dominant models on QNN NPU.",
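+      "dialectical_note": "⚠️ Conv fusion ban is confirmed for ResNet-18 with the full three-pass bundle; conv-add-fusion alone measured neutral there (npu-006 refinement). MobileViT, DINOv2, and plain ViT were neutral or faster. Always run latency gate after applying any fusion to catch regressions."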
+    },
+    "bench_protocol": {
+      "cv_gate": "DISABLED for QNN NPU (npu-007)",
+      "sessions": 3,
+      "iters_per_session": 500,
+      "cool_down_s": 30,
+      "noise_floor_pct": 10,
+      "signal": "median p50 across sessions"
+    }
+  }
+}
diff --git a/research/autoconfig/lib/gen_model_report.py b/research/autoconfig/lib/gen_model_report.py
new file mode 100644
index 000000000..401e5eaf9
--- /dev/null
+++ b/research/autoconfig/lib/gen_model_report.py
@@ -0,0 +1,841 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""Generate per-model HTML optimization reports from autoconfig sweep results."""
+
+from __future__ import annotations
+
+import argparse
+import html
+import json
+from pathlib import Path
+
+
+BASE_DIR = Path(__file__).parent
+CHART_MIN_GAIN = -200.0
+CHART_MAX_GAIN = 200.0
+
+
+def _resolve_path(path_str: str) -> Path:
+    path = Path(path_str)
+    if path.is_absolute():
+        return path
+    if path.exists():
+        return path.resolve()
+    return (BASE_DIR / path).resolve()
+
+
+def _escape(value: object) -> str:
+    return html.escape("" if value is None else str(value))
+
+
+def _fmt_ms(value: float | None) -> str:
+    return "—" if value is None else f"{value:.2f} ms"
+
+
+def _fmt_pct(value: float | None, signed: bool = True) -> str:
+    if value is None:
+        return "—"
+    return f"{value:+.1f}%" if signed else f"{value:.1f}%"
+
+
+def _status_class(gain_pct: float | None) -> str:
+    if gain_pct is None:
+        return "neutral"
+    if gain_pct > 0:
+        return "good"
+    if gain_pct < 0:
+        return "bad"
+    return "neutral"
+
+
+def _short_label(label: str, max_len: int = 26) -> str:
+    if len(label) <= max_len:
+        return label
+    return label[: max_len - 1] + "…"
+
+
+def _sort_hypothesis_ids(hyp_id: str) -> tuple[int, str]:
+    if hyp_id.startswith("h"):
+        try:
+            return int(hyp_id[1:]), hyp_id
+        except ValueError:
+            pass
+    return 9999, hyp_id
+
+
+def _get_p50(hyp: dict) -> float | None:
+    """Get median p50 from either nested (QNN/CPU) or flat (GPU) schema."""
+    if "full" in hyp:
+        return hyp["full"].get("median_p50_ms")
+    return hyp.get("median_p50_ms") or hyp.get("overall_median_p50_ms")
+
+
+def _get_runs(hyp: dict) -> list[float]:
+    if "full" in hyp:
+        return [float(v) for v in hyp.get("all_p50s_ms") or hyp.get("full", {}).get("p50s_ms", [])]
+    return [float(v) for v in hyp.get("all_p50s_ms") or hyp.get("full_p50s_ms", [])]
+
+
+def _get_gain_pct(hyp_id: str, hyp: dict, baseline_p50_ms: float | None) -> float | None:
+    if hyp_id == "h0" and baseline_p50_ms is not None:
+        return 0.0
+    for key in ("overall_gain_pct", "confirm_overall_gain_pct", "gain_vs_baseline_pct"):
+        value = hyp.get(key)
+        if value is not None:
+            return float(value)
+    p50 = _get_p50(hyp)
+    if baseline_p50_ms and p50:
+        return (baseline_p50_ms - p50) / baseline_p50_ms * 100
+    return None
+
+
+def _format_extra_optim(extra_optim: dict | None) -> str:
+    if not extra_optim:
+        return "autoconf defaults"
+    enabled = [key for key, value in extra_optim.items() if value]
+    return ", ".join(enabled) if enabled else "autoconf defaults"
+
+
+def _format_champion_config(hyp: dict) -> str:
+    opset = hyp.get("opset")
+    flags = _format_extra_optim(hyp.get("extra_optim"))
+    if opset is None:
+        return flags
+    if flags == "autoconf defaults":
+        return f"opset {opset} + autoconf defaults"
+    return f"opset {opset} + {flags}"
+
+
+def _confidence_text(hyp_id: str, hyp: dict, baseline_runs: list[float]) -> str:
+    status = str(hyp.get("status", ""))
+    verdict = str(hyp.get("verdict", ""))
+
+    if status.startswith("BUILD"):
+        return "build failed"
+    if status == "BENCH_FAIL":
+        return "bench failed"
+    if status.startswith("SKIPPED"):
+        return "guarded skip"
+    if hyp.get("confirm_verdict") == "CONFIRMED":
+        return "ranges separated"
+    if hyp.get("confirm_verdict") == "MARGINAL_UNCONFIRMED":
+        return "ranges overlap"
+    if verdict == "KEEP_CONFIRMED":
+        wins = hyp.get("sessions_above_threshold")
+        total = hyp.get("total_sessions")
+        if wins is not None and total is not None:
+            return f"{wins}/{total} sessions confirm"
+        return "confirmation passed"
+    if verdict == "MARGINAL_UNCONFIRMED":
+        wins = hyp.get("sessions_above_threshold")
+        total = hyp.get("total_sessions")
+        if wins is not None and total is not None:
+            return f"{wins}/{total} sessions confirm"
+        return "confirmation incomplete"
+
+    runs = _get_runs(hyp)
+    if baseline_runs and runs:
+        if max(runs) < min(baseline_runs) or min(runs) > max(baseline_runs):
+            return "ranges separated"
+        return "ranges overlap"
+
+    if hyp_id == "h0":
+        return "baseline reference"
+    return "single-point only"
+
+
+def _table_rows(
+    hyps: list[tuple[str, dict]],
+    baseline_p50_ms: float | None,
+    champion_hyp: str | None,
+    predicate,
+) -> list[dict]:
+    rows: list[dict] = []
+    baseline_runs = _get_runs(dict(hyps).get("h0", {}))
+    for hyp_id, hyp in hyps:
+        gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms)
+        status = str(hyp.get("status", ""))
+        verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or status or "—")
+        row = {
+            "hyp_id": hyp_id,
+            "label": hyp.get("label", ""),
+            "gain_pct": gain_pct,
+            "verdict": verdict,
+            "confidence": _confidence_text(hyp_id, hyp, baseline_runs),
+            "status": status,
+            "is_champion": hyp_id == champion_hyp,
+        }
+        if predicate(row, hyp):
+            rows.append(row)
+    return rows
+
+
+def _render_table(title: str, icon: str, rows: list[dict], champion_hyp: str | None) -> str:
+    if not rows:
+        return ""
+
+    table_rows = []
+    for row in rows:
+        champion_class = " champion-row" if row["hyp_id"] == champion_hyp else ""
+        # Leave the cell unstyled when the gain is missing or exactly zero.
+        gain = row["gain_pct"]
+        gain_style = "" if not gain else ("gain-neg" if gain < 0 else "gain-pos")
+        table_rows.append(
+            f"""
+            <tr class="{champion_class.strip()}">
+              <td><span class="hyp-pill">{_escape(row["hyp_id"])}</span></td>
+              <td>{_escape(row["label"])}</td>
+              <td class="{gain_style}">{_fmt_pct(row["gain_pct"])}</td>
+              <td>{_escape(row["verdict"])}</td>
+              <td>{_escape(row["confidence"])}</td>
+            </tr>
+            """
+        )
+
+    return f"""
+    <section class="section-card">
+      <div class="section-title">{icon} {title}</div>
+      <table class="report-table">
+        <thead>
+          <tr>
+            <th>Hypothesis</th>
+            <th>Label</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+          {"".join(table_rows)}
+        </tbody>
+      </table>
+    </section>
+    """
+
+
+def _render_characteristics(results: dict) -> str:
+    rows = [
+        ("Model ID", results.get("model_id")),
+        ("Task", results.get("task")),
+        ("Arch type", results.get("model_type")),
+        ("Baseline opset", results.get("baseline_opset")),
+        ("EP", results.get("ep")),
+        ("Device", results.get("device")),
+    ]
+
+    conv_pct = results.get("conv_pct")
+    if "npu006_risk" in results:
+        conv_text = "N/A" if conv_pct is None else f"{conv_pct:.1f}%"
+        rows.append(("Conv%", conv_text))
+        rows.append(("npu-006 risk", "HIGH" if results.get("npu006_risk") else "LOW"))
+
+    if "npu001_generalized" in results:
+        rows.append(("npu-001 note", results.get("npu001_generalized")))
+
+    cells = "".join(
+        f"<tr><th>{_escape(label)}</th><td>{_escape(value if value is not None else '—')}</td></tr>"
+        for label, value in rows
+    )
+    return f"""
+    <section class="section-card">
+      <div class="section-title">Model Characteristics</div>
+      <table class="characteristics-table">
+        {cells}
+      </table>
+    </section>
+    """
+
+
+def _chart_bar_color(gain_pct: float | None) -> str:
+    if gain_pct is None:
+        return "#90a4ae"
+    if gain_pct > 5:
+        return "#43a047"
+    if gain_pct < -5:
+        return "#e53935"
+    return "#90a4ae"
+
+
+def _render_chart(
+    hyps: list[tuple[str, dict]], baseline_p50_ms: float | None, champion_hyp: str | None
+) -> str:
+    row_h = 40
+    header_h = 48
+    footer_h = 26
+    label_w = 150
+    bar_w = 520
+    value_w = 78
+    total_w = label_w + bar_w + value_w
+    total_h = header_h + footer_h + len(hyps) * row_h
+    center_x = label_w + bar_w / 2
+
+    elements: list[str] = [
+        f'<svg class="chart-svg" viewBox="0 0 {total_w} {total_h}" role="img" '
+        'aria-label="Hypothesis gain chart">',
+        "<defs>",
+        '<pattern id="buildFailPattern" patternUnits="userSpaceOnUse" width="8" height="8" patternTransform="rotate(45)">',
+        '<rect width="8" height="8" fill="#cfd8dc"></rect>',
+        '<rect width="3" height="8" fill="#90a4ae"></rect>',
+        "</pattern>",
+        "</defs>",
+        '<text x="0" y="20" class="axis-label">Hypothesis</text>',
+        f'<text x="{label_w}" y="20" class="axis-label">Gain vs baseline (%)</text>',
+        f'<line x1="{center_x:.1f}" y1="{header_h - 8}" x2="{center_x:.1f}" y2="{total_h - footer_h}" class="center-line" />',
+    ]
+
+    for tick in (-200, -100, 0, 100, 200):
+        x = label_w + ((tick - CHART_MIN_GAIN) / (CHART_MAX_GAIN - CHART_MIN_GAIN)) * bar_w
+        elements.append(
+            f'<line x1="{x:.1f}" y1="{header_h - 4}" x2="{x:.1f}" y2="{total_h - footer_h}" class="tick-line" />'
+        )
+        elements.append(
+            f'<text x="{x:.1f}" y="{total_h - 6}" text-anchor="middle" class="tick-label">{tick}%</text>'
+        )
+
+    for idx, (hyp_id, hyp) in enumerate(hyps):
+        y = header_h + idx * row_h
+        bar_mid = y + row_h / 2
+        bar_top = y + 8
+        bar_height = row_h - 16
+        gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms)
+        clipped_gain = (
+            None if gain_pct is None else max(min(gain_pct, CHART_MAX_GAIN), CHART_MIN_GAIN)
+        )
+        status = str(hyp.get("status", ""))
+        verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or "")
+        p50 = _get_p50(hyp)
+        title = (
+            f"{hyp_id}: {hyp.get('label', '')}\n"
+            f"status={status or '—'}  verdict={verdict or '—'}\n"
+            f"p50={_fmt_ms(p50)}  gain={_fmt_pct(gain_pct)}"
+        )
+
+        elements.append(f"<g><title>{_escape(title)}</title>")
+        elements.append(
+            f'<rect x="0" y="{y:.1f}" width="{total_w}" height="{row_h}" class="row-bg" />'
+        )
+        elements.append(
+            f'<text x="8" y="{y + 16:.1f}" class="hyp-label">{_escape(hyp_id)}</text>'
+            f'<text x="8" y="{y + 29:.1f}" class="hyp-sub">{_escape(_short_label(str(hyp.get("label", ""))))}</text>'
+        )
+
+        if hyp_id == "h0":
+            elements.append(
+                f'<line x1="{center_x:.1f}" y1="{bar_top:.1f}" x2="{center_x:.1f}" '
+                f'y2="{bar_top + bar_height:.1f}" class="baseline-bar" />'
+            )
+            elements.append(
+                f'<text x="{center_x + 8:.1f}" y="{bar_mid + 4:.1f}" text-anchor="start" class="value-text">0.0%</text>'
+            )
+        elif status.startswith("BUILD"):
+            fail_w = 92
+            fail_x = center_x - fail_w / 2
+            stroke = "#1e88e5" if hyp_id == champion_hyp else "#78909c"
+            stroke_w = 4 if hyp_id == champion_hyp else 1.5
+            elements.append(
+                f'<rect x="{fail_x:.1f}" y="{bar_top:.1f}" width="{fail_w}" height="{bar_height}" '
+                f'fill="url(#buildFailPattern)" stroke="{stroke}" stroke-width="{stroke_w}" rx="4" />'
+            )
+            elements.append(
+                f'<text x="{center_x:.1f}" y="{bar_mid + 4:.1f}" text-anchor="middle" class="build-fail-text">'
+                "BUILD_FAIL</text>"
+            )
+        elif clipped_gain is not None:
+            target_x = (
+                label_w
+                + ((clipped_gain - CHART_MIN_GAIN) / (CHART_MAX_GAIN - CHART_MIN_GAIN)) * bar_w
+            )
+            x = min(center_x, target_x)
+            width = max(abs(target_x - center_x), 2.0)
+            stroke = "#1e88e5" if hyp_id == champion_hyp else "none"
+            stroke_w = 4 if hyp_id == champion_hyp else 0
+            value_x = target_x + 8 if clipped_gain >= 0 else target_x - 8
+            anchor = "start" if clipped_gain >= 0 else "end"
+            elements.append(
+                f'<rect x="{x:.1f}" y="{bar_top:.1f}" width="{width:.1f}" height="{bar_height}" '
+                f'fill="{_chart_bar_color(gain_pct)}" stroke="{stroke}" stroke-width="{stroke_w}" rx="4" />'
+            )
+            elements.append(
+                f'<text x="{value_x:.1f}" y="{bar_mid + 4:.1f}" text-anchor="{anchor}" class="value-text">'
+                f"{_escape(_fmt_pct(gain_pct))}</text>"
+            )
+
+        elements.append("</g>")
+
+    elements.append("</svg>")
+    return f"""
+    <section class="section-card">
+      <div class="section-title">Hypothesis Gain Chart</div>
+      <div class="chart-wrap">
+        {"".join(elements)}
+      </div>
+    </section>
+    """
+
+
+def _render_all_hypotheses(
+    hyps: list[tuple[str, dict]],
+    baseline_p50_ms: float | None,
+    champion_hyp: str | None,
+) -> str:
+    """Full hypothesis table with opset, flags, all session p50s, and verdict."""
+    baseline_runs = _get_runs(dict(hyps).get("h0", {}))
+    rows: list[str] = []
+
+    for hyp_id, hyp in hyps:
+        status = str(hyp.get("status", ""))
+        verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or status or "—")
+        label = hyp.get("label", "")
+        opset = hyp.get("opset", "—")
+        extra_optim = hyp.get("extra_optim")
+        gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms)
+        p50 = _get_p50(hyp)
+        all_runs = _get_runs(hyp)
+
+        is_champion = hyp_id == champion_hyp
+        row_class = "champion-row" if is_champion else ""
+
+        # Format extra_optim flags
+        if extra_optim:
+            enabled = [k for k, v in extra_optim.items() if v]
+            flags_str = (
+                ", ".join(f'<span class="flag-pill">{_escape(f)}</span>' for f in enabled)
+                if enabled
+                else '<em style="color:#aaa">none</em>'
+            )
+        else:
+            # extra_optim was not recorded for this hypothesis; show a placeholder
+            flags_str = '<em style="color:#bbb">not stored</em>'
+
+        # Format all session p50s
+        if all_runs:
+            runs_html = " · ".join(f"{r:.2f}" for r in all_runs)
+            runs_cell = f'<span class="runs-val">[{runs_html}]</span>'
+        elif status.startswith("BUILD"):
+            runs_cell = f'<span style="color:#c62828;font-weight:700">{_escape(status)}</span>'
+        else:
+            runs_cell = "—"
+
+        # p50 cell
+        p50_cell = _fmt_ms(p50) if p50 else ("—" if not status.startswith("BUILD") else status)
+
+        # gain cell
+        if gain_pct is not None:
+            gain_class = "gain-pos" if gain_pct > 0 else ("gain-neg" if gain_pct < 0 else "")
+            gain_cell = f'<span class="{gain_class}">{_fmt_pct(gain_pct)}</span>'
+        else:
+            gain_cell = "—"
+
+        # verdict / confidence
+        verdict_class = (
+            "verdict-keep"
+            if "KEEP" in verdict.upper()
+            else "verdict-discard"
+            if (
+                "DISCARD" in verdict.upper()
+                or "BUILD" in verdict.upper()
+                or "FAIL" in verdict.upper()
+            )
+            else ""
+        )
+        conf = _confidence_text(hyp_id, hyp, baseline_runs)
+        champion_star = (
+            ' <span style="color:#1976d2;font-weight:900">★</span>' if is_champion else ""
+        )
+
+        rows.append(f"""
+        <tr class="{row_class}">
+          <td><span class="hyp-pill">{_escape(hyp_id)}</span>{champion_star}</td>
+          <td class="label-cell">{_escape(label)}</td>
+          <td class="opset-cell">{_escape(str(opset))}</td>
+          <td class="flags-cell">{flags_str}</td>
+          <td class="p50-cell">{_escape(p50_cell)}</td>
+          <td class="sessions-cell">{runs_cell}</td>
+          <td>{gain_cell}</td>
+          <td><span class="{verdict_class}">{_escape(verdict)}</span></td>
+          <td class="conf-cell">{_escape(conf)}</td>
+        </tr>""")
+
+    return f"""
+    <section class="section-card">
+      <div class="section-title">🔬 All Hypotheses — Full Detail</div>
+      <div style="overflow-x:auto">
+      <table class="report-table hyp-detail-table">
+        <thead>
+          <tr>
+            <th>ID</th>
+            <th>Config Label</th>
+            <th>Opset</th>
+            <th>Extra Flags</th>
+            <th>Median p50</th>
+            <th>Session p50s (ms)</th>
+            <th>Gain %</th>
+            <th>Verdict</th>
+            <th>Confidence</th>
+          </tr>
+        </thead>
+        <tbody>
+          {"".join(rows)}
+        </tbody>
+      </table>
+      </div>
+      <div style="margin-top:10px;font-size:11px;color:#7b8794">
+        ★ = champion hypothesis &nbsp;·&nbsp; Session p50s are individual bench sessions (median used for comparison)
+      </div>
+    </section>
+    """
+
+
+def _render_feature_gaps(results: dict) -> str:
+    feature_gaps = results.get("feature_gaps") or []
+    if not feature_gaps:
+        return ""
+
+    cards = "".join(f'<div class="gap-card">{_escape(gap)}</div>' for gap in feature_gaps)
+    return f"""
+    <section class="section-card">
+      <div class="section-title">Feature Gaps</div>
+      <div class="gap-grid">{cards}</div>
+    </section>
+    """
+
+
+def generate_model_report(results: dict, output_path: Path) -> None:
+    """Generate a single self-contained HTML report."""
+    hypotheses_map = results.get("hypotheses", {})
+    hyps = sorted(hypotheses_map.items(), key=lambda item: _sort_hypothesis_ids(item[0]))
+    baseline_p50_ms = results.get("baseline_p50_ms")
+    champion_hyp = results.get("best_hypothesis")
+    champion = hypotheses_map.get(champion_hyp or "", {})
+    champion_p50_ms = results.get("best_p50_ms") or _get_p50(champion)
+    best_gain_pct = results.get("best_gain_pct")
+    best_gain_verdict = results.get("best_gain_verdict")
+    gain_reliable = best_gain_verdict == "RELIABLE"
+    # When the best observed gain is not statistically reliable, the recommended
+    # ship config is the auto-config baseline, not the fastest-observed hypothesis.
+    if best_gain_verdict and not gain_reliable:
+        reliability_note = f"⚠ {best_gain_verdict.replace('_', ' ').lower()} — ship baseline"
+    else:
+        reliability_note = ""
+
+    keep_rows = _table_rows(
+        hyps,
+        baseline_p50_ms,
+        champion_hyp,
+        lambda row, _: (row["gain_pct"] is not None and row["gain_pct"] > 5)
+        or row["verdict"] == "KEEP_CONFIRMED",
+    )
+    discard_rows = _table_rows(
+        hyps,
+        baseline_p50_ms,
+        champion_hyp,
+        lambda row, hyp: row["status"].startswith("BUILD")
+        or (row["gain_pct"] is not None and row["gain_pct"] < -2),
+    )
+    neutral_rows = _table_rows(
+        hyps,
+        baseline_p50_ms,
+        champion_hyp,
+        lambda row, hyp: row not in keep_rows and row not in discard_rows,
+    )
+
+    sweep_ts = results.get("timestamp")
+    sweep_date = (
+        sweep_ts.split("T", 1)[0] if isinstance(sweep_ts, str) and "T" in sweep_ts else sweep_ts
+    )
+    header_title = (
+        f"{str(results.get('ep', 'unknown')).upper()} {str(results.get('device', 'unknown')).upper()} "
+        f"Optimization Report — {results.get('model_id', 'unknown')}"
+    )
+    subtitle = (
+        f"{results.get('model_type', 'unknown')} arch · {sweep_date or 'unknown date'} · "
+        f"{len(hyps)} hypotheses tested"
+    )
+    baseline_delta_ms = None
+    if baseline_p50_ms is not None and champion_p50_ms is not None:
+        baseline_delta_ms = baseline_p50_ms - champion_p50_ms
+
+    keep_count = len(keep_rows)
+    discard_count = len(discard_rows)
+    champion_summary = _format_champion_config(champion) if champion else "—"
+
+    html_doc = f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="UTF-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+  <title>{_escape(header_title)}</title>
+  <style>
+    * {{ box-sizing: border-box; margin: 0; padding: 0; }}
+    body {{
+      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+      background: #f4f6f9;
+      color: #1a1a2e;
+      padding: 28px 24px 40px;
+      font-size: 13px;
+      line-height: 1.5;
+    }}
+    h1 {{ font-size: 24px; font-weight: 800; margin-bottom: 6px; color: #102a43; }}
+    .subtitle {{ color: #5f6c80; font-size: 12px; margin-bottom: 24px; }}
+    .section-card {{
+      background: #ffffff;
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 18px 20px;
+      margin-bottom: 20px;
+      box-shadow: 0 1px 3px rgba(16, 42, 67, 0.06);
+    }}
+    .kpi-grid {{
+      display: grid;
+      grid-template-columns: repeat(5, minmax(0, 1fr));
+      gap: 14px;
+      margin-bottom: 20px;
+    }}
+    .kpi-card {{
+      background: linear-gradient(180deg, #ffffff 0%, #f8fbff 100%);
+      border: 1.5px solid #dbe5f0;
+      border-radius: 12px;
+      padding: 16px;
+      min-height: 120px;
+    }}
+    .kpi-label {{
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.8px;
+      color: #6b7c93;
+      font-weight: 700;
+      margin-bottom: 8px;
+    }}
+    .kpi-value {{
+      font-size: 23px;
+      font-weight: 800;
+      color: #102a43;
+      line-height: 1.15;
+      margin-bottom: 6px;
+      word-break: break-word;
+    }}
+    .kpi-card.good .kpi-value {{ color: #2e7d32; }}
+    .kpi-card.bad .kpi-value {{ color: #c62828; }}
+    .kpi-sub {{ color: #6b7c93; font-size: 11px; }}
+    .section-title {{
+      font-size: 14px;
+      font-weight: 800;
+      color: #102a43;
+      margin-bottom: 14px;
+    }}
+    .characteristics-table {{
+      width: 100%;
+      border-collapse: collapse;
+    }}
+    .characteristics-table th,
+    .characteristics-table td {{
+      padding: 9px 10px;
+      border-bottom: 1px solid #ebf1f6;
+      text-align: left;
+      vertical-align: top;
+    }}
+    .characteristics-table th {{
+      width: 180px;
+      color: #5f6c80;
+      font-size: 11px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }}
+    .report-table {{
+      width: 100%;
+      border-collapse: collapse;
+    }}
+    .report-table th {{
+      text-align: left;
+      padding: 9px 10px;
+      background: #eef4fb;
+      color: #486581;
+      border-bottom: 2px solid #d9e2ec;
+      font-size: 10px;
+      text-transform: uppercase;
+      letter-spacing: 0.7px;
+    }}
+    .report-table td {{
+      padding: 10px;
+      border-bottom: 1px solid #ebf1f6;
+      vertical-align: top;
+    }}
+    .report-table tr:hover td {{ background: #f8fbff; }}
+    .champion-row td {{ background: #e8f1fd; }}
+    .hyp-pill {{
+      display: inline-block;
+      background: #102a43;
+      color: white;
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+    }}
+    .gain-pos {{ color: #2e7d32; font-weight: 700; }}
+    .gain-neg {{ color: #c62828; font-weight: 700; }}
+    .chart-wrap {{
+      overflow-x: auto;
+      border: 1px solid #e6edf5;
+      border-radius: 10px;
+      background: #fbfdff;
+      padding: 10px;
+    }}
+    .chart-svg {{ width: 100%; min-width: 760px; display: block; }}
+    .axis-label {{ fill: #486581; font-size: 11px; font-weight: 700; }}
+    .tick-label {{ fill: #7b8794; font-size: 10px; }}
+    .tick-line {{ stroke: #d9e2ec; stroke-width: 1; }}
+    .center-line {{ stroke: #1e88e5; stroke-width: 2; stroke-dasharray: 4 4; }}
+    .row-bg {{ fill: transparent; }}
+    .hyp-label {{ fill: #102a43; font-size: 12px; font-weight: 800; }}
+    .hyp-sub {{ fill: #7b8794; font-size: 10px; }}
+    .baseline-bar {{ stroke: #546e7a; stroke-width: 3; }}
+    .value-text {{ fill: #102a43; font-size: 11px; font-weight: 700; }}
+    .build-fail-text {{ fill: #37474f; font-size: 10px; font-weight: 800; letter-spacing: 0.5px; }}
+    .gap-grid {{
+      display: grid;
+      grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
+      gap: 12px;
+    }}
+    .gap-card {{
+      background: #fff8e1;
+      border: 1.5px solid #ffe082;
+      border-radius: 10px;
+      padding: 12px 14px;
+      color: #7c5b00;
+      font-size: 12px;
+    }}
+    .flag-pill {{
+      display: inline-block;
+      background: #e3f2fd;
+      color: #1565c0;
+      border-radius: 4px;
+      padding: 1px 6px;
+      font-size: 10px;
+      font-weight: 700;
+      margin: 1px 2px 1px 0;
+    }}
+    .runs-val {{
+      font-family: "Cascadia Code", "Consolas", monospace;
+      font-size: 10.5px;
+      color: #546e7a;
+      white-space: nowrap;
+    }}
+    .hyp-detail-table .label-cell {{ font-size: 11.5px; max-width: 220px; }}
+    .hyp-detail-table .opset-cell {{ text-align: center; font-weight: 700; color: #3949ab; font-size: 12px; }}
+    .hyp-detail-table .flags-cell {{ min-width: 140px; }}
+    .hyp-detail-table .p50-cell {{ font-family: "Cascadia Code","Consolas",monospace; font-size: 12px; white-space: nowrap; }}
+    .hyp-detail-table .sessions-cell {{ min-width: 160px; }}
+    .hyp-detail-table .conf-cell {{ font-size: 11px; color: #546e7a; }}
+    .verdict-keep {{ color: #2e7d32; font-weight: 700; }}
+    .verdict-discard {{ color: #c62828; font-weight: 700; }}
+    .footer {{
+      margin-top: 16px;
+      color: #7b8794;
+      font-size: 11px;
+      text-align: center;
+    }}
+    @media (max-width: 1200px) {{
+      .kpi-grid {{ grid-template-columns: repeat(2, minmax(0, 1fr)); }}
+    }}
+    @media (max-width: 720px) {{
+      .kpi-grid {{ grid-template-columns: 1fr; }}
+      body {{ padding: 18px 14px 28px; }}
+    }}
+  </style>
+</head>
+<body>
+  <h1>{_escape(header_title)}</h1>
+  <div class="subtitle">{_escape(subtitle)}</div>
+
+  <section class="kpi-grid">
+    <div class="kpi-card {_status_class(best_gain_pct)}">
+      <div class="kpi-label">Best Gain %</div>
+      <div class="kpi-value">{_fmt_pct(best_gain_pct)}</div>
+      <div class="kpi-sub">Champion: {_escape(champion_hyp or "—")}{(" · " + _escape(reliability_note)) if reliability_note else ""}</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Baseline → Champion ms</div>
+      <div class="kpi-value">{_escape(_fmt_ms(baseline_p50_ms))} → {_escape(_fmt_ms(champion_p50_ms))}</div>
+      <div class="kpi-sub">Latency reduction: {_escape(_fmt_ms(baseline_delta_ms))}</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">EP + Device</div>
+      <div class="kpi-value">{_escape(str(results.get("ep", "unknown")).upper())} / {_escape(str(results.get("device", "unknown")).upper())}</div>
+      <div class="kpi-sub">Baseline opset {_escape(str(results.get("baseline_opset", "—")))}</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Champion Config</div>
+      <div class="kpi-value">{_escape(("h0 (baseline)" if reliability_note else (champion_hyp or "—")))}</div>
+      <div class="kpi-sub">{_escape(reliability_note or champion_summary)}</div>
+    </div>
+    <div class="kpi-card">
+      <div class="kpi-label">Total experiments</div>
+      <div class="kpi-value">{len(hyps)}</div>
+      <div class="kpi-sub">{keep_count} KEEP / {discard_count} DISCARD</div>
+    </div>
+  </section>
+
+  {_render_characteristics(results)}
+  {_render_chart(hyps, baseline_p50_ms, champion_hyp)}
+  {_render_all_hypotheses(hyps, baseline_p50_ms, champion_hyp)}
+  {_render_table("Effective Optimizations", "✅", keep_rows, champion_hyp)}
+  {_render_table("Ineffective or Harmful", "❌", discard_rows, champion_hyp)}
+  {_render_table("Neutral / Build Fail", "⚪", neutral_rows, champion_hyp)}
+  {_render_feature_gaps(results)}
+
+  <div class="footer">Generated by gen_model_report.py · research/autoconfig</div>
+</body>
+</html>
+"""
+
+    output_path.write_text(html_doc, encoding="utf-8")
+
+
+def _load_results(results_path: Path) -> dict:
+    return json.loads(results_path.read_text(encoding="utf-8"))
+
+
+def _generate_for_results_file(results_path: Path) -> Path:
+    results = _load_results(results_path)
+    output_path = results_path.with_name("report.html")
+    generate_model_report(results, output_path)
+    return output_path
+
+
+def _generate_for_sweep_dir(sweep_dir: Path) -> list[Path]:
+    outputs: list[Path] = []
+    for results_path in sorted(sweep_dir.rglob("results.json")):
+        outputs.append(_generate_for_results_file(results_path))
+    return outputs
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Generate per-model autoconfig HTML report(s).")
+    parser.add_argument("results_json", nargs="?", help="Path to a single results.json file")
+    parser.add_argument(
+        "--sweep-dir", help="Sweep directory containing per-model results.json files"
+    )
+    args = parser.parse_args()
+
+    if bool(args.results_json) == bool(args.sweep_dir):
+        parser.error("Provide exactly one of <results_json> or --sweep-dir.")
+
+    if args.sweep_dir:
+        sweep_dir = _resolve_path(args.sweep_dir)
+        outputs = _generate_for_sweep_dir(sweep_dir)
+        for output in outputs:
+            print(output)
+        return 0
+
+    results_path = _resolve_path(args.results_json)
+    output = _generate_for_results_file(results_path)
+    print(output)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
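
The three-way split that `generate_model_report` applies above (KEEP when the gain exceeds 5% or the verdict is `KEEP_CONFIRMED`, DISCARD on a `BUILD*` status or a gain below -2%, neutral otherwise) can be sketched as a single classifier. This is a minimal sketch rather than the patch's code: `classify_row`, the plain-dict row shape, and the sample status string `BUILD_FAIL` are assumptions for illustration, and checking KEEP before DISCARD is a choice of this sketch (the patch filters the two lists independently).

```python
from __future__ import annotations


def classify_row(row: dict) -> str:
    """Return "keep", "discard", or "neutral" for one hypothesis row.

    Thresholds copy the lambdas in generate_model_report. Testing "keep"
    first is an assumption of this sketch, not behaviour of the patch.
    """
    gain = row.get("gain_pct")
    status = str(row.get("status", ""))
    if (gain is not None and gain > 5) or row.get("verdict") == "KEEP_CONFIRMED":
        return "keep"
    if status.startswith("BUILD") or (gain is not None and gain < -2):
        return "discard"
    return "neutral"


# Hypothetical rows; the field names follow the row dicts used by the lambdas.
rows = [
    {"gain_pct": 7.5, "status": "OK", "verdict": "KEEP"},
    {"gain_pct": None, "status": "BUILD_FAIL", "verdict": ""},
    {"gain_pct": 1.0, "status": "OK", "verdict": ""},
]
labels = [classify_row(r) for r in rows]
print(labels)
```

Classifying each row exactly once would also make the neutral bucket a plain "else" case, instead of a membership test against the two other lists.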
diff --git a/research/autoconfig/lib/report_gen.py b/research/autoconfig/lib/report_gen.py
new file mode 100644
index 000000000..0a4769bc5
--- /dev/null
+++ b/research/autoconfig/lib/report_gen.py
@@ -0,0 +1,280 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""report_gen.py — Phase 3 HTML report generator for autoconfig.
+
+Reads results.tsv and generates report.html with:
+  - Summary bar chart (p50 per hypothesis, colour-coded by status)
+  - Experiment table (config / delta_pct / status / CV)
+  - Champion config box
+"""
+
+from __future__ import annotations
+
+import csv
+import html as html_lib
+from datetime import datetime
+from pathlib import Path
+
+
+# ── helpers ───────────────────────────────────────────────────────────────────
+
+
+def _load_tsv(results_tsv: Path) -> list[dict]:
+    if not results_tsv.exists():
+        return []
+    with results_tsv.open(encoding="utf-8") as f:
+        return list(csv.DictReader(f, delimiter="\t"))
+
+
+def _status_color(status: str) -> str:
+    s = status.lower()
+    if "new best" in s or (s.startswith("keep") and "marginal" not in s):
+        return "#2e7d32"  # dark green
+    if "marginal" in s:
+        return "#f57f17"  # amber
+    if "discard" in s:
+        return "#b0bec5"  # grey
+    if "crash" in s or "fail" in s:
+        return "#c62828"  # red
+    return "#78909c"
+
+
+def _status_bg(status: str) -> str:
+    s = status.lower()
+    if "new best" in s or (s.startswith("keep") and "marginal" not in s):
+        return "#e8f5e9"
+    if "marginal" in s:
+        return "#fff8e1"
+    if "crash" in s or "fail" in s:
+        return "#ffebee"
+    return "#f5f5f5"
+
+
+def _p50_float(val: str | None) -> float | None:
+    if not val or val == "N/A" or "UNSTABLE" in str(val):
+        return None
+    try:
+        return float(str(val).replace("ms", "").strip())
+    except ValueError:
+        return None
+
+
+# ── bar chart ─────────────────────────────────────────────────────────────────
+
+
+def _bar_chart_html(rows: list[dict], baseline_p50: float | None) -> str:
+    valid = [(r, _p50_float(r.get("median_p50_ms") or r.get("screen_p50_ms"))) for r in rows]
+    valid = [(r, v) for r, v in valid if v is not None]
+    if not valid:
+        return "<p style='color:#888;font-size:12px'>No benchmark data yet.</p>"
+
+    max_val = max(v for _, v in valid) * 1.1
+    bars = []
+    for r, p50 in valid:
+        label = html_lib.escape(r.get("label", "?"))
+        status = r.get("status", "")
+        color = _status_color(status)
+        width_pct = p50 / max_val * 100
+        delta = r.get("delta_pct", "")
+        baseline_marker = ""
+        if baseline_p50:
+            bx = baseline_p50 / max_val * 100
+            baseline_marker = (
+                f'<div style="position:absolute;left:{bx:.1f}%;top:0;bottom:0;'
+                f'width:2px;background:#3949ab;opacity:0.4;z-index:2"></div>'
+            )
+        bars.append(f"""
+  <div style="margin-bottom:6px">
+    <div style="font-size:10px;color:#556;margin-bottom:2px;white-space:nowrap;overflow:hidden;
+                text-overflow:ellipsis;max-width:600px">{label}</div>
+    <div style="display:flex;align-items:center;gap:8px">
+      <div style="flex:1;background:#eee;border-radius:3px;height:16px;position:relative">
+        <div style="width:{width_pct:.1f}%;background:{color};height:100%;border-radius:3px;
+                    transition:width 0.3s"></div>
+        {baseline_marker}
+      </div>
+      <div style="font-size:11px;color:#334;min-width:60px">{p50:.1f}ms
+        <span style="color:{color};font-size:10px">{html_lib.escape(delta)}</span>
+      </div>
+    </div>
+  </div>""")
+
+    return (
+        '<div style="max-width:700px">\n'
+        '  <div style="font-size:10px;color:#3949ab;margin-bottom:6px">'
+        "&#8212; baseline (blue line)</div>\n" + "".join(bars) + "\n</div>"
+    )
+
+
+# ── experiment table ──────────────────────────────────────────────────────────
+
+
+def _table_html(rows: list[dict]) -> str:
+    cols = [
+        "iter",
+        "label",
+        "dimension",
+        "optim_flags",
+        "opset",
+        "screen_p50_ms",
+        "median_p50_ms",
+        "delta_pct",
+        "cv",
+        "status",
+    ]
+    hdrs = "".join(
+        f'<th style="text-align:left;padding:6px 10px;font-size:10px;'
+        f"text-transform:uppercase;letter-spacing:0.6px;color:#778;"
+        f'border-bottom:2px solid #dde">{c.replace("_", " ")}</th>'
+        for c in cols
+    )
+    trs = []
+    for r in rows:
+        status = r.get("status", "")
+        bg = _status_bg(status)
+        color = _status_color(status)
+        cells = []
+        for c in cols:
+            val = html_lib.escape(str(r.get(c, "")))
+            if c == "status":
+                cells.append(
+                    f'<td style="padding:5px 10px;font-size:11px;'
+                    f'color:{color};font-weight:600">{val}</td>'
+                )
+            else:
+                cells.append(f'<td style="padding:5px 10px;font-size:11px;color:#334">{val}</td>')
+        trs.append(
+            f'<tr style="background:{bg};border-bottom:1px solid #eef">' + "".join(cells) + "</tr>"
+        )
+    return (
+        '<table style="width:100%;border-collapse:collapse">'
+        f"<thead><tr>{hdrs}</tr></thead>"
+        f"<tbody>{''.join(trs)}</tbody>"
+        "</table>"
+    )
+
+
+# ── champion box ─────────────────────────────────────────────────────────────
+
+
+def _champion_html(rows: list[dict], model_id: str, ep: str) -> str:
+    keeps = [r for r in rows if r.get("status", "").lower().startswith("keep")]
+    if not keeps:
+        return (
+            '<div style="background:#fff3e0;border:1.5px solid #ffcc80;border-radius:8px;'
+            'padding:14px 18px;font-size:12px;color:#e65100">'
+            "No KEEP verdict yet — search in progress.</div>"
+        )
+    best = min(keeps, key=lambda r: _p50_float(r.get("median_p50_ms")) or float("inf"))
+    flags = html_lib.escape(best.get("optim_flags") or "(none)")
+    opset = html_lib.escape(str(best.get("opset", 17)))
+    p50 = html_lib.escape(best.get("median_p50_ms") or "N/A")
+    delta = html_lib.escape(best.get("delta_pct") or "N/A")
+    label = html_lib.escape(best.get("label") or "?")
+    return f"""
+<div style="background:#e8f5e9;border:1.5px solid #a5d6a7;border-radius:8px;
+            padding:14px 18px;font-size:12px">
+  <div style="font-weight:700;font-size:13px;color:#1b5e20;margin-bottom:8px">
+    Champion Config</div>
+  <table style="border-collapse:collapse">
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">Model</td>
+        <td style="font-family:monospace;font-size:11px">{html_lib.escape(model_id)}</td></tr>
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">EP</td>
+        <td style="font-family:monospace;font-size:11px">{html_lib.escape(ep.upper())}</td></tr>
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">Hypothesis</td>
+        <td style="font-size:11px">{label}</td></tr>
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">Optim flags</td>
+        <td style="font-family:monospace;font-size:11px">{flags}</td></tr>
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">Opset</td>
+        <td style="font-family:monospace;font-size:11px">{opset}</td></tr>
+    <tr><td style="color:#778;padding:2px 12px 2px 0;font-size:11px">Median p50</td>
+        <td style="font-size:11px;color:#2e7d32;font-weight:600">{p50} ms
+          ({delta})</td></tr>
+  </table>
+</div>"""
+
+
+# ── main entry ────────────────────────────────────────────────────────────────
+
+
+def generate_report(
+    results_tsv: Path,
+    work_dir: Path,
+    model_id: str,
+    ep: str,
+    insight_notes: list[str] | None = None,
+) -> Path:
+    """Generate report.html inside work_dir. Returns the output path."""
+    rows = _load_tsv(results_tsv)
+    out_path = work_dir / "report.html"
+
+    # Find baseline p50 from h0 row
+    baseline_p50: float | None = None
+    for r in rows:
+        if r.get("iter") == "0" or "baseline" in r.get("label", "").lower():
+            baseline_p50 = _p50_float(r.get("median_p50_ms"))
+            if baseline_p50:
+                break
+
+    chart = _bar_chart_html(rows, baseline_p50)
+    table = _table_html(rows)
+    champion = _champion_html(rows, model_id, ep)
+    ts = datetime.now().strftime("%Y-%m-%d %H:%M")
+    n_done = len(rows)
+    n_keep = sum(1 for r in rows if r.get("status", "").lower().startswith("keep"))
+
+    insight_section = ""
+    if insight_notes:
+        items = "".join(f"<li>{html_lib.escape(n)}</li>" for n in insight_notes)
+        insight_section = f"""
+<h3 style="font-size:13px;font-weight:700;margin:24px 0 8px">Phase 1 Insight Engine</h3>
+<ul style="font-size:11px;color:#556;line-height:1.8;padding-left:18px">{items}</ul>"""
+
+    html = f"""<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>autoconfig report — {html_lib.escape(model_id)} ({html_lib.escape(ep.upper())})</title>
+<style>
+* {{ box-sizing: border-box; margin: 0; padding: 0; }}
+body {{ font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+       background: #f4f6f9; padding: 28px 24px; color: #1a1a2e; font-size: 13px; }}
+h2 {{ font-size: 16px; font-weight: 700; margin-bottom: 4px; }}
+h3 {{ font-size: 13px; font-weight: 700; margin: 24px 0 10px; }}
+.meta {{ font-size: 11px; color: #778; margin-bottom: 24px; }}
+.card {{ background: #fff; border-radius: 10px; padding: 18px 20px;
+         border: 1.5px solid #dde; margin-bottom: 20px; }}
+</style>
+</head>
+<body>
+
+<h2>autoconfig — {html_lib.escape(model_id)}</h2>
+<div class="meta">EP: {html_lib.escape(ep.upper())} &nbsp;&middot;&nbsp;
+  {n_done} experiments &nbsp;&middot;&nbsp; {n_keep} KEEP &nbsp;&middot;&nbsp;
+  Generated: {ts}</div>
+
+<div class="card">
+  {champion}
+</div>
+
+<div class="card">
+  <h3 style="margin-top:0">Benchmark Chart (median p50)</h3>
+  {chart}
+</div>
+
+{f'<div class="card">{insight_section}</div>' if insight_section else ""}
+
+<div class="card">
+  <h3 style="margin-top:0">All Experiments</h3>
+  {table}
+</div>
+
+</body>
+</html>"""
+
+    out_path.write_text(html, encoding="utf-8")
+    print(f"  Report written: {out_path}")
+    return out_path
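
The champion box in `report_gen.py` above depends on two steps: parsing a latency cell that may be a number, a value with an `ms` suffix, `N/A`, or an `UNSTABLE` marker, and then picking the fastest row whose status starts with `keep`. A minimal sketch of those two steps follows. `parse_p50`, `pick_champion`, and the inlined sample TSV are hypothetical names and data for illustration; only the column names `label`, `status`, and `median_p50_ms` come from the patch.

```python
from __future__ import annotations

import csv
import io


def parse_p50(val: str | None) -> float | None:
    """Parse a latency cell such as "12.3" or "12.3ms"; None if unusable."""
    if not val or val == "N/A" or "UNSTABLE" in val:
        return None
    try:
        return float(val.replace("ms", "").strip())
    except ValueError:
        return None


def pick_champion(rows: list[dict]) -> dict | None:
    """Return the fastest row with a "keep" status and a parsable p50."""
    best_row, best_p50 = None, float("inf")
    for row in rows:
        if not (row.get("status") or "").lower().startswith("keep"):
            continue
        p50 = parse_p50(row.get("median_p50_ms"))
        # Rows without a usable latency never win, whatever the others measure.
        if p50 is not None and p50 < best_p50:
            best_row, best_p50 = row, p50
    return best_row


# Hypothetical results.tsv content, inlined so the sketch runs without files.
SAMPLE_TSV = (
    "label\tstatus\tmedian_p50_ms\n"
    "baseline\tkeep\t10.0\n"
    "opset19\tkeep\t9.2ms\n"
    "fp16\tdiscard\t8.0\n"
    "int8\tkeep\tN/A\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
champion = pick_champion(rows)
print(champion["label"] if champion else "no champion")
```

Skipping unparsable rows outright, rather than substituting a large sentinel latency, keeps a row with missing data from being selected when every measured row happens to be slow.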
diff --git a/research/autoconfig/skills/explorer/SKILL.md b/research/autoconfig/skills/explorer/SKILL.md
new file mode 100644
index 000000000..659506e87
--- /dev/null
+++ b/research/autoconfig/skills/explorer/SKILL.md
@@ -0,0 +1,75 @@
+---
+name: explorer
+description: >
+  Use this sub-skill (driven by the orchestrator) to decide WHAT to try next
+  in a winml-cli config search. It builds the hypothesis pool, applies confirmed-KB
+  hard-blocks and the Phase 1 Insight Engine skip_set to prune dead-end passes, then
+  ranks the survivors by Insight priority boost into a priority_queue and yields the
+  next hypothesis. It never builds or benchmarks — it only chooses the next experiment.
+---
+
+# explorer
+
+The Explorer is the **"what to try next"** sub-skill of the autoconfig loop
+(Phase 2). It owns search *order* only. It mirrors the `Explorer` class in
+`skills/orchestrator/autoconfig.py` and the Explorer box in
+`research/autoconfig/docs/autoconfig_diagram.html`.
+
+**Implementation in this folder:** `analyze_insight.py` (the Phase 1 Insight Engine
+that produces the `skip_set` + `priority_boosts` this skill ranks by) and
+`analyze_graph.py` (ONNX graph-pattern helper).
+
+## When to use
+
+Invoked by `orchestrator` at the top of each Phase 2 iteration to get
+the next candidate config delta. Not used standalone.
+
+## Inputs
+
+- `hypothesis_pool` — the full OFAT search grid from the orchestrator: from a FP32
+  baseline, one factor varied at a time — opset (17–21), quant precision
+  (fp32/fp16/int8/int16/w8a16), or one single graph pass — as
+  `(label, patch_fn, dimension)` triples (~74 combinations). The Explorer
+  prunes/reorders it; it does not generate it.
+- `kb` — confirmed `ep_device_knowledge/<ep>_<device>.json` rules, especially `skip_passes` hard-blocks.
+- `insight` — Phase 1 output: `skip_set` (passes to prune for this model) + `priority_boosts` (per-label ranking weight).
+
+## Procedure
+
+1. **Build the priority_queue** — stable-sort the hypothesis pool by descending
+   Insight `priority_boosts` (model-aware ranking; ties keep pool order).
+2. **Pop the next hypothesis** from the queue.
+3. **Skip-check before yielding** (`skip_reason`):
+   - KB hard-block: if the candidate's flags match a confirmed `skip_passes` rule, skip with that rule as the reason (e.g. npu-006 conv-fusion block when Conv% > 20%).
+   - Insight skip_set: if the label is in `insight.skip_set`, skip with "Insight Engine: <label>".
+   - Otherwise, yield the hypothesis to the Optimizer.
+
+### Graph-presence pruning (`skip_set` source)
+
+The Insight Engine pre-estimates, from the **baseline graph analysis**, which graph
+passes can actually fire. For each `graph_pass` hypothesis it checks whether the
+pass's required pattern is present (`_pass_can_fire` over the detected
+`fusion_candidates`):
+
+- **present** → the pass is kept and gets a priority boost proportional to the
+  candidate count (try the promising passes first).
+- **confidently absent** (e.g. no Conv→BN subgraph for `conv_bn_fusion`, no
+  Softmax for `attention_fusion`) → the label is added to `skip_set` and **cut** —
+  there is nothing to fuse, so benchmarking it would be wasted.
+- **not statically estimable** → left in the queue for the empirical search
+  (no false cuts).
+
+This is complemented at build time by the orchestrator's runtime no-op check
+(`graph_is_noop`): if a pass that survived pruning still produces a graph
+identical to the baseline, that iteration is discarded before screen/bench.
+
+## Outputs
+
+- The next `(label, config-delta, dimension)` to run, **or**
+- A skip decision with a human-readable reason (logged, not benchmarked).
+
+## Constraints
+
+- Pruning is architecture-driven via Insight/KB, never via hardcoded model names.
+- Explorer must be cheap and deterministic — no winml build/perf calls here.
+- A confirmed KB hard-block always wins over a priority boost (safety before speed).
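
The Procedure section of `skills/explorer/SKILL.md` above (stable-sort the pool by descending Insight boost, pop the next candidate, check KB hard-blocks before the Insight `skip_set`) can be sketched as follows. This is a minimal sketch under stated assumptions, not the `Explorer` class from `autoconfig.py`: `build_queue`, `next_hypothesis`, and the label-keyed `kb_blocks` mapping are hypothetical, and hypotheses are reduced to bare labels, whereas the skill describes `(label, patch_fn, dimension)` triples and KB rules that match on flags.

```python
from __future__ import annotations

from collections import deque


def build_queue(pool: list[str], boosts: dict[str, float]) -> deque[str]:
    """Stable-sort labels by descending boost; ties keep pool order."""
    return deque(sorted(pool, key=lambda label: -boosts.get(label, 0.0)))


def skip_reason(label: str, kb_blocks: dict[str, str], skip_set: set[str]) -> str | None:
    """Return why a label is skipped, or None if it may run."""
    if label in kb_blocks:  # KB hard-block is checked first, so it beats any boost
        return kb_blocks[label]
    if label in skip_set:
        return f"Insight Engine: {label}"
    return None


def next_hypothesis(
    queue: deque[str],
    kb_blocks: dict[str, str],
    skip_set: set[str],
    log: list[tuple[str, str]],
) -> str | None:
    """Pop until a runnable label is found; skipped labels are only logged."""
    while queue:
        label = queue.popleft()
        reason = skip_reason(label, kb_blocks, skip_set)
        if reason is None:
            return label
        log.append((label, reason))
    return None


# Hypothetical pool, boosts, and rules for illustration only.
pool = ["opset_18", "conv_bn_fusion", "attention_fusion", "fp16"]
boosts = {"conv_bn_fusion": 3.0, "fp16": 1.0}
kb_blocks = {"conv_bn_fusion": "npu-006 conv-fusion block"}
skip_set = {"attention_fusion"}

queue = build_queue(pool, boosts)
log: list[tuple[str, str]] = []
first = next_hypothesis(queue, kb_blocks, skip_set, log)
second = next_hypothesis(queue, kb_blocks, skip_set, log)
third = next_hypothesis(queue, kb_blocks, skip_set, log)
print(first, second, third, log)
```

In this run the highest-boost candidate is still skipped because a KB rule blocks it, which illustrates the constraint that a confirmed hard-block wins over a priority boost.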
diff --git a/research/autoconfig/skills/explorer/analyze_graph.py b/research/autoconfig/skills/explorer/analyze_graph.py
new file mode 100644
index 000000000..e57ff1032
--- /dev/null
+++ b/research/autoconfig/skills/explorer/analyze_graph.py
@@ -0,0 +1,172 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+from collections import Counter
+
+import onnx
+
+
+m = onnx.load(r"convnext-search\iter_00\export.onnx")
+g = m.graph
+
+out2node = {}
+for n in g.node:
+    for o in n.output:
+        out2node[o] = n
+
+
+def consumers(node):
+    result = []
+    for o in node.output:
+        for n in g.node:
+            if o in n.input:
+                result.append(n)
+    return result
+
+
+def producer(inp):
+    return out2node.get(inp)
+
+
+# ── 1. Block structure ────────────────────────────────────────
+print("=== ConvNext block structure (trace first DW-Conv forward) ===")
+first_dw = next(
+    (
+        n
+        for n in g.node
+        if n.op_type == "Conv" and next((a.i for a in n.attribute if a.name == "group"), 1) > 1
+    ),
+    None,
+)
+cur = first_dw
+for _ in range(14):
+    if cur is None:
+        break
+    c = consumers(cur)
+    c_types = [n.op_type for n in c]
+    print(f"  {cur.op_type:25s} -> {c_types}")
+    if len(c) == 1:
+        cur = c[0]
+    elif len(c) > 1:
+        non_add = [n for n in c if n.op_type != "Add"]
+        cur = non_add[0] if non_add else c[0]
+    else:
+        break
+
+# ── 2. Transpose patterns ─────────────────────────────────────
+print()
+print("=== Transpose patterns (before -> Transpose -> after) ===")
+trans_patterns = Counter()
+for n in g.node:
+    if n.op_type == "Transpose":
+        c = consumers(n)
+        p = producer(n.input[0])
+        before = p.op_type if p else "INPUT"
+        after = c[0].op_type if c else "OUTPUT"
+        trans_patterns[f"{before} -> Transpose -> {after}"] += 1
+for pat, cnt in trans_patterns.most_common():
+    print(f"  {cnt:3d}x  {pat}")
+
+# ── 3. GELU variants ──────────────────────────────────────────
+print()
+print("=== GELU sub-patterns ===")
+# Standard GELU: Mul -> Div -> Erf -> Add -> Mul -> Mul
+gelu_standard = 0
+for n in g.node:
+    if n.op_type == "Erf":
+        p = producer(n.input[0])
+        if p and p.op_type == "Div":
+            gelu_standard += 1
+print(f"  Div->Erf (Erf-based GELU): {gelu_standard}")
+
+# Check for Sigmoid-based QuickGELU (x * sigmoid(1.702 * x))
+quick_gelu = 0
+for n in g.node:
+    if n.op_type == "Sigmoid":
+        c = consumers(n)
+        if c and c[0].op_type == "Mul":
+            quick_gelu += 1
+print(f"  Sigmoid->Mul (QuickGELU candidate): {quick_gelu}")
+
+# ── 4. Downsampling blocks (stage transitions) ────────────────
+print()
+print("=== Downsampling block pattern (LN->Conv 2x2 stride 2) ===")
+down_blocks = 0
+for n in g.node:
+    if n.op_type == "Conv":
+        stride = next((list(a.ints) for a in n.attribute if a.name == "strides"), [1, 1])
+        kernel = next((list(a.ints) for a in n.attribute if a.name == "kernel_shape"), [])
+        groups = next((a.i for a in n.attribute if a.name == "group"), 1)
+        if stride == [2, 2] and groups == 1:
+            p = producer(n.input[0])
+            print(f"  stride-2 Conv kernel={kernel}  preceded_by={p.op_type if p else 'INPUT'}")
+            down_blocks += 1
+
+# ── 5. Residual branches ──────────────────────────────────────
+print()
+print("=== Add nodes with 2 distinct producer op-types (residual candidates) ===")
+residual_counter = Counter()
+for n in g.node:
+    if n.op_type == "Add" and len(n.input) == 2:
+        p0 = producer(n.input[0])
+        p1 = producer(n.input[1])
+        t0 = p0.op_type if p0 else "INIT"
+        t1 = p1.op_type if p1 else "INIT"
+        if t0 != t1:
+            key = tuple(sorted([t0, t1]))
+            residual_counter[key] += 1
+for pair, cnt in residual_counter.most_common():
+    print(f"  {cnt:3d}x  Add({pair[0]}, {pair[1]})")
+
+# ── 6. Node domain analysis ───────────────────────────────────
+print()
+print("=== Op domains ===")
+domains = Counter()
+for n in g.node:
+    dom = n.domain if n.domain else "ai.onnx"
+    domains[dom] += 1
+for d, c in domains.most_common():
+    print(f"  {d}: {c} nodes")
+
+# ── 7. analyze gaps ───────────────────────────────────────────
+print()
+print("=== Patterns winml analyze may miss ===")
+# 1. Depthwise conv with large kernels (7x7 DW-Conv is ConvNext specific)
+dw7x7 = sum(
+    1
+    for n in g.node
+    if n.op_type == "Conv"
+    and next((a.i for a in n.attribute if a.name == "group"), 1) > 1
+    and next((list(a.ints) for a in n.attribute if a.name == "kernel_shape"), []) == [7, 7]
+)
+print(f"  7x7 DW-Conv (ConvNext pattern): {dw7x7}")
+print("    -> analyze classifies as OP/ai.onnx/Conv (undifferentiated)")
+print("    -> no distinction between DW-Conv and regular Conv EP support")
+
+# 2. Transpose wrapping every layer (NCHW<->NHWC conversion)
+trans_total = sum(1 for n in g.node if n.op_type == "Transpose")
+print(f"  Transpose nodes total: {trans_total}")
+print("    -> analyze reports as single OP/ai.onnx/Transpose")
+print("    -> no detection of Transpose-sandwich (NCHW->NHWC->op->NCHW)")
+print("    -> transpose-optimizer capability not reflected in analyze output")
+
+# 3. MatMul used as dense layer (not Gemm) - different EP kernel path
+matmul_count = sum(1 for n in g.node if n.op_type == "MatMul")
+print(f"  MatMul (not Gemm): {matmul_count}")
+print("    -> ConvNext uses MatMul for MLP (not Gemm), QNN handles differently")
+print("    -> analyze does not distinguish MatMul-as-FC from MatMul-as-attention")
+
+# 4. LayerNormalization as a single op (already fused by PyTorch export)
+ln_count = sum(1 for n in g.node if n.op_type == "LayerNormalization")
+print(f"  LayerNormalization (native op): {ln_count}")
+print("    -> These are already fused (not the ReduceMean->Sub->... subgraph)")
+print("    -> layer-norm-fusion capability targets the decomposed pattern")
+print("    -> analyze should note these are ALREADY fused - no fusion needed")
+
+# 5. Erf-based GELU (not tagged as Gelu op, appears as com.microsoft/Gelu after fusion)
+print(f"  Erf-based GELU subgraphs (unfused): {gelu_standard}")
+print('    -> analyze cannot detect "unfused GELU" as a pattern')
+print("    -> gelu-fusion would convert these to com.microsoft/Gelu")
+print('    -> no analyze rule for "fuseable_pattern: gelu_erf"')
diff --git a/research/autoconfig/skills/explorer/analyze_insight.py b/research/autoconfig/skills/explorer/analyze_insight.py
new file mode 100644
index 000000000..e490a95d1
--- /dev/null
+++ b/research/autoconfig/skills/explorer/analyze_insight.py
@@ -0,0 +1,958 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""analyze_insight.py — Phase 1 Insight Engine for autoconfig.
+
+Fuses three signals to build skip_set and priority_boosts:
+  1. Graph analysis  : op counts, Conv%, GELU variant, dynamic axes
+  2. winml analyze   : partial/unsupported op list per EP (static rule data)
+  3. ep_device_knowledge KB : confirmed empirical findings (skip_passes, priority hints)
+
+Outputs:
+  InsightResult.skip_set         — set of hypothesis labels to prune
+  InsightResult.priority_boosts  — {hypothesis_label: boost_score} for reordering
+  InsightResult.notes            — human-readable explanation of each decision
+"""
+
+from __future__ import annotations
+
+import json
+import re
+import sys
+import tempfile
+from collections import Counter
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+# Optional heavy imports — gracefully degrade if not available
+try:
+    import onnx  # type: ignore[import-untyped]
+
+    _ONNX_OK = True
+except ImportError:
+    _ONNX_OK = False
+
+# Agent package bootstrap: make the autoconfig root importable for sibling packages.
+_AGENT_ROOT = next(
+    p for p in Path(__file__).resolve().parents if (p / "ep_device_knowledge").is_dir()
+)
+if str(_AGENT_ROOT) not in sys.path:
+    sys.path.insert(0, str(_AGENT_ROOT))
+
+from skills.optimizer.bench_utils import run_cmd  # noqa: E402
+
+
+# ── data types ────────────────────────────────────────────────────────────────
+
+
+@dataclass
+class GraphInfo:
+    total_ops: int = 0
+    op_counts: dict[str, int] = field(default_factory=dict)
+    conv_pct: float = 0.0  # Conv / total_ops  (0-100)
+    gemm_pct: float = 0.0  # Gemm / total_ops
+    has_gelu_decomposed: bool = False  # Erf-based GELU sub-pattern
+    has_dynamic_axes: bool = False
+    transpose_count: int = 0
+    available: bool = False  # False when onnx not installed or model not found
+
+
+@dataclass
+class AnalyzeResult:
+    supported: list[str] = field(default_factory=list)
+    partial: list[str] = field(default_factory=list)
+    unsupported: list[str] = field(default_factory=list)
+    unknown: list[str] = field(default_factory=list)
+    available: bool = False  # False when winml analyze failed or ep has no rule data
+
+
+@dataclass
+class InsightResult:
+    skip_set: set[str] = field(default_factory=set)
+    """Labels from HYPOTHESES that should be pruned before the search loop."""
+
+    priority_boosts: dict[str, float] = field(default_factory=dict)
+    """hypothesis_label -> boost (positive = higher priority, negative = deprioritise)."""
+
+    notes: list[str] = field(default_factory=list)
+    """Human-readable explanation for each decision."""
+
+    graph_info: GraphInfo = field(default_factory=GraphInfo)
+    analyze_result: AnalyzeResult = field(default_factory=AnalyzeResult)
+
+
+# ── data types ────────────────────────────────────────────────────────────────
+
+
+@dataclass
+class FusionCandidate:
+    """One detectable pattern that maps to a winml optimize flag."""
+
+    flag: str
+    """winml optimize flag name (e.g. 'gelu_fusion')."""
+
+    count: int
+    """How many candidate instances were found in the graph."""
+
+    evidence: str
+    """Short human-readable description of what was found."""
+
+
+@dataclass
+class GraphInfo:
+    total_ops: int = 0
+    op_counts: dict[str, int] = field(default_factory=dict)
+    conv_pct: float = 0.0  # Conv / total_ops  (0-100)
+    matmul_pct: float = 0.0  # MatMul / total_ops
+    gemm_pct: float = 0.0  # Gemm / total_ops
+    has_gelu_decomposed: bool = False  # any multi-op GELU subgraph detected
+    gelu_types: list[str] = field(default_factory=list)  # 'erf', 'tanh', 'sigmoid/quick'
+    has_dynamic_axes: bool = False
+    transpose_count: int = 0
+    fusion_candidates: list[FusionCandidate] = field(default_factory=list)
+    """Ordered list of detected optimisation opportunities, highest-count first."""
+    available: bool = False  # False when onnx not installed or model not found
+
+
+@dataclass
+class AnalyzeResult:
+    supported: list[str] = field(default_factory=list)
+    partial: list[str] = field(default_factory=list)
+    unsupported: list[str] = field(default_factory=list)
+    unknown: list[str] = field(default_factory=list)
+    available: bool = False  # False when winml analyze failed or ep has no rule data
+
+
+@dataclass
+class InsightResult:
+    skip_set: set[str] = field(default_factory=set)
+    """Labels from HYPOTHESES that should be pruned before the search loop."""
+
+    priority_boosts: dict[str, float] = field(default_factory=dict)
+    """hypothesis_label -> boost (positive = higher priority, negative = deprioritise)."""
+
+    notes: list[str] = field(default_factory=list)
+    """Human-readable explanation for each decision."""
+
+    graph_info: GraphInfo = field(default_factory=GraphInfo)
+    analyze_result: AnalyzeResult = field(default_factory=AnalyzeResult)
+
+
+# ── graph analysis ────────────────────────────────────────────────────────────
+
+
+def _build_consumer_map(graph) -> dict[str, list]:  # type: ignore[type-arg]
+    """Map each tensor name → list of nodes that consume it as an input."""
+    consumers: dict[str, list] = {}
+    for node in graph.node:
+        for inp in node.input:
+            consumers.setdefault(inp, []).append(node)
+    return consumers
+
+
+def _build_producer_map(graph) -> dict[str, Any]:
+    """Map each output name → the node that produces it."""
+    return {out: n for n in graph.node for out in n.output}
+
+
+def _get_attr_float(node, name: str) -> float | None:
+    """Extract a float attribute from an ONNX node."""
+    for a in node.attribute:
+        if a.name == name:
+            return float(a.f)
+    return None
+
+
+def _detect_fusion_candidates(graph) -> list[FusionCandidate]:  # type: ignore[type-arg]
+    """
+    Scan the ONNX graph for subgraph patterns that map to winml optimize flags.
+
+    Returns a list of FusionCandidate, ordered highest-count first.
+
+    Detection strategy
+    ------------------
+    We build two lookup tables (producer_map, consumer_map) and then sweep the
+    graph once per pattern family.  Each check is O(N) in the number of nodes.
+
+    Pattern families
+    ----------------
+    GELU variants
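+        gelu_fusion         : Div → Erf → Add → Mul → Mul  (only Div → Erf is matched; also emits gelu_singlegelu)
+        fast_gelu_fusion    : Tanh-based GELU  (Tanh node with a Pow ancestor within 4 hops)
+        quick_gelu_fusion   : x * sigmoid(1.702*x)  (matched as Mul → Sigmoid; constant not checked)
+        bias_gelu_fusion    : Add/Gemm/MatMul → Div  (bias or projection feeding the Erf-GELU entry)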
+    LayerNorm variants
+        layer_norm_fusion         : ReduceMean → Sub → Pow(2) → … → Add(ε)
+        simplified_layer_norm_fusion : Pow(2) + ReduceMean (no Sub)
+        fuse_rmsnorm              : Pow → ReduceMean → Add → Sqrt → Div → Mul
+        skip_layer_norm_fusion    : Add (residual) feeds directly into LN subgraph
+    Attention
+        attention_fusion    : Softmax node count  (proxy for Q/K/V attention heads)
+        bias_softmax_fusion : Add immediately before Softmax
+    MatMul patterns
+        matmul_add_fusion         : MatMul → Add  (also emits matmuladd_reshapegemm)
+        matmul_activation_fusion  : MatMul → activation  (any op in _ACTIVATIONS)
+        matmul_transpose_fusion   : Transpose → MatMul  OR  MatMul → Transpose
+        matmul_scale_fusion       : MatMul → Mul whose second input is an initializer
+    Conv patterns
+        conv_bn_fusion            : Conv → BatchNormalization
+        conv_add_fusion           : Conv → Add
+        conv_mul_fusion           : Conv → Mul
+        conv_activation_fusion    : Conv → activation  (any op in _ACTIVATIONS)
+        conv_add_activation_fusion: Conv → Add → activation  (3-node chain)
+        pad_fusion                : Pad → Conv
+    Gemm patterns
+        gemm_activation_fusion    : Gemm → activation  (any op in _ACTIVATIONS)
+        gemm_sum_fusion           : Gemm → Add
+        gemm_transpose_fusion     : Transpose → Gemm  OR  Gemm → Transpose
+    Eliminations
+        slice_elimination         : more than 3 Slice ops (potential redundancy)
+        unsqueeze_elimination     : Unsqueeze of initializers
+        concat_slice_elimination  : Concat → Slice (reverse of split)
+        expand_elimination        : more than 2 Expand nodes
+    Layout
+        transpose_optimizer       : Transpose count > 10
+        nhwc_transformer          : Conv-heavy + Transpose → layout transform candidate (not detected by this scan)
+    Rewrite: highdimRTR_lowdimRTR : more than 2 Reshape → Transpose → Reshape chains (rank not checked)
+    """
+    producer = _build_producer_map(graph)
+    consumer = _build_consumer_map(graph)
+
+    # Helper: get the single consumer of a node output (or None)
+    def _single_consumer(node, out_idx: int = 0):
+        if out_idx >= len(node.output):
+            return None
+        consumers = consumer.get(node.output[out_idx], [])
+        return consumers[0] if len(consumers) == 1 else None
+
+    # Helper: check if a node output feeds a specific op type
+    def _consumer_is(node, op: str, out_idx: int = 0) -> bool:
+        c = _single_consumer(node, out_idx)
+        return c is not None and c.op_type == op
+
+    # Helper: check whether a tensor name refers to a graph initializer (weight/constant)
+    init_names = {i.name for i in graph.initializer}
+
+    def _is_initializer_input(inp_name: str) -> bool:
+        return inp_name in init_names
+
+    candidates: dict[str, FusionCandidate] = {}
+
+    def _add(flag: str, evidence: str, n: int = 1) -> None:
+        if flag in candidates:
+            candidates[flag].count += n
+            candidates[flag].evidence = evidence  # update to latest
+        else:
+            candidates[flag] = FusionCandidate(flag=flag, count=n, evidence=evidence)
+
+    # ── GELU patterns ──────────────────────────────────────────────────────────
+    erf_gelu_count = 0
+    tanh_gelu_count = 0
+    quick_gelu_count = 0
+    bias_before_gelu = 0
+
+    for node in graph.node:
+        # Erf-based GELU: Div → Erf → (Add → Mul → Mul)
+        if node.op_type == "Erf" and node.input:
+            pred = producer.get(node.input[0])
+            if pred and pred.op_type == "Div":
+                erf_gelu_count += 1
+                # Check whether an Add/Gemm/MatMul feeds the Erf-GELU entry point (bias_gelu).
+                # The entry to Erf-GELU is typically the Div, so inspect the Div's first input.
+                if pred.input:
+                    div_pred = producer.get(pred.input[0])
+                    if div_pred and div_pred.op_type in ("Add", "Gemm", "MatMul"):
+                        bias_before_gelu += 1
+
+        # Tanh-based GELU heuristic: Tanh with a Pow ancestor (exponent not checked)
+        if node.op_type == "Tanh" and node.input:
+            # Walk up to 4 hops along the first-input chain looking for a Pow
+            cur = producer.get(node.input[0])
+            for _ in range(4):
+                if cur is None:
+                    break
+                if cur.op_type == "Pow":
+                    tanh_gelu_count += 1
+                    break
+                cur = producer.get(cur.input[0]) if cur.input else None
+
+        # Quick GELU heuristic: Sigmoid whose predecessor is a Mul (the ≈1.702 constant is not verified)
+        if node.op_type == "Sigmoid" and node.input:
+            pred = producer.get(node.input[0])
+            if pred and pred.op_type == "Mul":
+                quick_gelu_count += 1
+
+    if erf_gelu_count:
+        _add("gelu_fusion", f"{erf_gelu_count} Erf-based GELU subgraph(s)", erf_gelu_count)
+        _add(
+            "gelu_singlegelu",
+            f"{erf_gelu_count} decomposed GELU → can normalise to single Gelu op",
+            erf_gelu_count,
+        )
+    if tanh_gelu_count:
+        _add(
+            "fast_gelu_fusion",
+            f"{tanh_gelu_count} Tanh-based GELU subgraph(s)",
+            tanh_gelu_count,
+        )
+    if quick_gelu_count:
+        _add(
+            "quick_gelu_fusion",
+            f"{quick_gelu_count} Sigmoid(1.702x) quick-GELU pattern(s)",
+            quick_gelu_count,
+        )
+    if bias_before_gelu:
+        _add(
+            "bias_gelu_fusion",
+            f"{bias_before_gelu} Add/Gemm/MatMul feeding GELU entry",
+            bias_before_gelu,
+        )
+
+    # ── LayerNorm patterns ─────────────────────────────────────────────────────
+    ln_full_count = 0  # ReduceMean + Sub + Pow(2)
+    ln_simplified_count = 0  # Pow(2) + ReduceMean (no Sub)
+    rmsnorm_count = 0  # Pow + ReduceMean (no Sub, no mean-centering)
+    skip_ln_count = 0  # Add → LayerNorm subgraph
+
+    for node in graph.node:
+        if node.op_type == "Pow" and node.input:
+            pred = producer.get(node.input[0])
+            if pred and pred.op_type == "Sub":
+                # Sub → Pow: classic LN  (ReduceMean → Sub → Pow)
+                sub_pred = producer.get(pred.input[0]) if pred.input else None
+                if sub_pred and sub_pred.op_type == "ReduceMean":
+                    ln_full_count += 1
+            elif pred and pred.op_type in ("ReduceMean", "Mul", "Add"):
+                # Simplified / RMSNorm: no Sub predecessor
+                ln_simplified_count += 1
+
+        # RMSNorm: Pow → ReduceMean (direct, without Sub)
+        if node.op_type == "ReduceMean" and node.input:
+            pred = producer.get(node.input[0])
+            if pred and pred.op_type == "Pow":
+                rmsnorm_count += 1
+
+        # skip_layer_norm: Add whose output feeds into the start of an LN subgraph
+        # Heuristic: Add → ReduceMean (the mean-centering step of LN)
+        if node.op_type == "Add" and _consumer_is(node, "ReduceMean"):
+            skip_ln_count += 1
+
+    if ln_full_count:
+        _add(
+            "layer_norm_fusion",
+            f"{ln_full_count} ReduceMean→Sub→Pow LayerNorm subgraph(s)",
+            ln_full_count,
+        )
+    if ln_simplified_count:
+        _add(
+            "simplified_layer_norm_fusion",
+            f"{ln_simplified_count} simplified LayerNorm pattern(s) (no mean-centering)",
+            ln_simplified_count,
+        )
+    if rmsnorm_count:
+        _add("fuse_rmsnorm", f"{rmsnorm_count} RMSNorm Pow→ReduceMean pattern(s)", rmsnorm_count)
+    if skip_ln_count:
+        _add(
+            "skip_layer_norm_fusion",
+            f"{skip_ln_count} Add→ReduceMean (residual+LN) pattern(s)",
+            skip_ln_count,
+        )
+
+    # ── Attention patterns ─────────────────────────────────────────────────────
+    softmax_count = sum(1 for n in graph.node if n.op_type == "Softmax")
+    add_before_softmax = 0
+    for node in graph.node:
+        if node.op_type == "Softmax" and node.input:
+            pred = producer.get(node.input[0])
+            if pred and pred.op_type == "Add":
+                add_before_softmax += 1
+
+    if softmax_count:
+        _add(
+            "attention_fusion",
+            f"{softmax_count} Softmax node(s) — likely attention head(s)",
+            softmax_count,
+        )
+    if add_before_softmax:
+        _add(
+            "bias_softmax_fusion",
+            f"{add_before_softmax} Add→Softmax (bias+attention mask) pattern(s)",
+            add_before_softmax,
+        )
+
+    # ── MatMul patterns ────────────────────────────────────────────────────────
+    _ACTIVATIONS = {"Relu", "LeakyRelu", "Sigmoid", "Tanh", "Clip", "Gelu", "FastGelu"}
+
+    mm_add = mm_act = mm_tp = mm_scale = 0
+    for node in graph.node:
+        if node.op_type != "MatMul":
+            continue
+        c = _single_consumer(node)
+        if c is None:
+            continue
+        if c.op_type == "Add":
+            mm_add += 1
+        elif c.op_type in _ACTIVATIONS:
+            mm_act += 1
+        elif c.op_type == "Transpose":
+            mm_tp += 1
+        elif c.op_type == "Mul":
+            # Mul with a scalar → scale fusion; heuristic: second input is initializer
+            if len(c.input) > 1 and _is_initializer_input(c.input[1]):
+                mm_scale += 1
+
+    # Also check Transpose → MatMul
+    tp_before_mm = sum(
+        1 for node in graph.node if node.op_type == "Transpose" and _consumer_is(node, "MatMul")
+    )
+
+    if mm_add:
+        _add("matmul_add_fusion", f"{mm_add} MatMul→Add pattern(s)", mm_add)
+        _add(
+            "matmuladd_reshapegemm",
+            f"{mm_add} MatMul+Add → Reshape+Gemm rewrite candidate(s)",
+            mm_add,
+        )
+    if mm_act:
+        _add("matmul_activation_fusion", f"{mm_act} MatMul→activation pattern(s)", mm_act)
+    if mm_tp + tp_before_mm:
+        _add(
+            "matmul_transpose_fusion",
+            f"{mm_tp + tp_before_mm} MatMul↔Transpose pattern(s)",
+            mm_tp + tp_before_mm,
+        )
+    if mm_scale:
+        _add("matmul_scale_fusion", f"{mm_scale} MatMul→Mul(scalar) pattern(s)", mm_scale)
+
+    # ── Conv patterns ──────────────────────────────────────────────────────────
+    conv_bn = conv_add = conv_mul = conv_act = conv_add_act = pad_conv = 0
+    for node in graph.node:
+        if node.op_type == "Pad" and _consumer_is(node, "Conv"):
+            pad_conv += 1
+
+        if node.op_type != "Conv":
+            continue
+        c = _single_consumer(node)
+        if c is None:
+            continue
+        if c.op_type == "BatchNormalization":
+            conv_bn += 1
+        elif c.op_type == "Add":
+            conv_add += 1
+            # Check for Conv → Add → activation chain
+            cc = _single_consumer(c)
+            if cc and cc.op_type in _ACTIVATIONS:
+                conv_add_act += 1
+        elif c.op_type == "Mul":
+            conv_mul += 1
+        elif c.op_type in _ACTIVATIONS:
+            conv_act += 1
+
+    if conv_bn:
+        _add("conv_bn_fusion", f"{conv_bn} Conv→BN pattern(s)", conv_bn)
+    if conv_add:
+        _add("conv_add_fusion", f"{conv_add} Conv→Add pattern(s)", conv_add)
+    if conv_mul:
+        _add("conv_mul_fusion", f"{conv_mul} Conv→Mul pattern(s)", conv_mul)
+    if conv_act:
+        _add("conv_activation_fusion", f"{conv_act} Conv→activation pattern(s)", conv_act)
+    if conv_add_act:
+        _add(
+            "conv_add_activation_fusion",
+            f"{conv_add_act} Conv→Add→activation chain(s) (FusedConv)",
+            conv_add_act,
+        )
+    if pad_conv:
+        _add("pad_fusion", f"{pad_conv} Pad→Conv pattern(s)", pad_conv)
+
+    # ── Gemm patterns ──────────────────────────────────────────────────────────
+    gemm_act = gemm_add = gemm_tp = 0
+    for node in graph.node:
+        if node.op_type != "Gemm":
+            continue
+        c = _single_consumer(node)
+        if c is None:
+            continue
+        if c.op_type in _ACTIVATIONS:
+            gemm_act += 1
+        elif c.op_type == "Add":
+            gemm_add += 1
+        elif c.op_type == "Transpose":
+            gemm_tp += 1
+    tp_before_gemm = sum(
+        1 for node in graph.node if node.op_type == "Transpose" and _consumer_is(node, "Gemm")
+    )
+    if gemm_act:
+        _add("gemm_activation_fusion", f"{gemm_act} Gemm→activation pattern(s)", gemm_act)
+    if gemm_add:
+        _add("gemm_sum_fusion", f"{gemm_add} Gemm→Add pattern(s)", gemm_add)
+    if gemm_tp + tp_before_gemm:
+        _add(
+            "gemm_transpose_fusion",
+            f"{gemm_tp + tp_before_gemm} Gemm↔Transpose pattern(s)",
+            gemm_tp + tp_before_gemm,
+        )
+
+    # ── Elimination patterns ───────────────────────────────────────────────────
+    slice_count = sum(1 for n in graph.node if n.op_type == "Slice")
+    expand_count = sum(1 for n in graph.node if n.op_type == "Expand")
+    unsqueeze_init = sum(
+        1
+        for n in graph.node
+        if n.op_type == "Unsqueeze" and n.input and _is_initializer_input(n.input[0])
+    )
+    concat_slice = sum(1 for n in graph.node if n.op_type == "Concat" and _consumer_is(n, "Slice"))
+
+    if slice_count > 3:
+        _add("slice_elimination", f"{slice_count} Slice nodes (potential redundancy)", slice_count)
+    if expand_count > 2:
+        _add("expand_elimination", f"{expand_count} Expand nodes", expand_count)
+    if unsqueeze_init:
+        _add(
+            "unsqueeze_elimination",
+            f"{unsqueeze_init} Unsqueeze(initializer) node(s)",
+            unsqueeze_init,
+        )
+    if concat_slice:
+        _add(
+            "concat_slice_elimination",
+            f"{concat_slice} Concat→Slice pattern(s) (reverse-split)",
+            concat_slice,
+        )
+
+    # ── Layout patterns ────────────────────────────────────────────────────────
+    tp_count = sum(1 for n in graph.node if n.op_type == "Transpose")
+    if tp_count > 10:
+        _add(
+            "transpose_optimizer",
+            f"{tp_count} Transpose nodes — optimizer may collapse chains",
+            tp_count,
+        )
+
+    # Reshape → Transpose → Reshape chains (intended for rank > 4 inputs; rank is not checked here)
+    rtr_highdim = 0
+    for node in graph.node:
+        if node.op_type == "Transpose" and node.input:
+            pred = producer.get(node.input[0])
+            c = _single_consumer(node)
+            if pred and c and pred.op_type == "Reshape" and c.op_type == "Reshape":
+                # Verifying rank > 4 would require shape inference, which is not run here.
+                # Approximation: count every Reshape → Transpose → Reshape chain as a candidate.
+                rtr_highdim += 1
+    if rtr_highdim > 2:
+        _add(
+            "highdimRTR_lowdimRTR",
+            f"{rtr_highdim} Reshape→Transpose→Reshape chain(s) — may reduce to lower rank",
+            rtr_highdim,
+        )
+
+    # Sort by count descending
+    return sorted(candidates.values(), key=lambda c: -c.count)
+
+
+def run_graph_analysis(onnx_path: Path) -> GraphInfo:
+    """Analyse the ONNX proto and return structural statistics."""
+    info = GraphInfo()
+    if not _ONNX_OK:
+        return info
+    if not onnx_path.exists():
+        return info
+
+    try:
+        model = onnx.load(str(onnx_path))
+        g = model.graph
+        counts: Counter = Counter(n.op_type for n in g.node)
+        total = sum(counts.values())
+        info.total_ops = total
+        info.op_counts = dict(counts)
+        info.available = True
+
+        if total > 0:
+            info.conv_pct = counts.get("Conv", 0) / total * 100
+            info.matmul_pct = counts.get("MatMul", 0) / total * 100
+            info.gemm_pct = counts.get("Gemm", 0) / total * 100
+            info.transpose_count = counts.get("Transpose", 0)
+
+        # Detect GELU types (coarse: based on op presence only, not on the full subgraph)
+        if counts.get("Erf", 0):
+            info.has_gelu_decomposed = True
+            info.gelu_types.append("erf")
+        if counts.get("Tanh", 0):
+            info.gelu_types.append("tanh")
+        if counts.get("Sigmoid", 0):
+            info.gelu_types.append("sigmoid/quick")
+
+        # Dynamic axes: any input with dim_param (string dimension)
+        for inp in g.input:
+            for dim in inp.type.tensor_type.shape.dim:
+                if dim.dim_param:
+                    info.has_dynamic_axes = True
+                    break
+
+        # Full fusion candidate scan
+        info.fusion_candidates = _detect_fusion_candidates(g)
+
+    except Exception as e:
+        info.available = False
+        print(f"  [analyze_insight] graph analysis failed: {e}")
+
+    return info
+
+
+# ── winml analyze ─────────────────────────────────────────────────────────────
+
+
+def run_winml_analyze(winml: str, onnx_path: Path, ep: str, device: str) -> AnalyzeResult:
+    """Call `winml analyze -m <path> --ep <ep> --device <device> -o <json>` and parse the JSON output."""
+    result = AnalyzeResult()
+    if not onnx_path.exists():
+        return result
+
+    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
+        out_path = Path(f.name)
+
+    try:
+        rc, out, _ = run_cmd(
+            [
+                winml,
+                "analyze",
+                "-m",
+                str(onnx_path),
+                "--ep",
+                ep,
+                "--device",
+                device,
+                "-o",
+                str(out_path),
+            ],
+            label=f"winml analyze --ep {ep}",
+            timeout=120,
+        )
+        if rc not in (0, 1) or not out_path.exists():
+            return result
+
+        data = json.loads(out_path.read_text(encoding="utf-8"))
+        # Output is a list; take first entry (single-EP mode)
+        entry = data[0] if isinstance(data, list) and data else data
+        ep_results = entry.get("results", [])
+        if not ep_results:
+            return result
+
+        ep_res = ep_results[0]
+        cls = ep_res.get("classification", {})
+
+        def _extract_op_types(lst: list[str]) -> list[str]:
+            """Turn 'OP/ai.onnx/Conv (QDQ)' into 'Conv'."""
+            types = []
+            for s in lst:
+                m = re.search(r"/([A-Za-z][A-Za-z0-9_]*)(?:\s|$|\()", s)
+                if m:
+                    types.append(m.group(1))
+            return list(dict.fromkeys(types))  # dedupe, preserve order
+
+        result.supported = _extract_op_types(cls.get("supported", []))
+        result.partial = _extract_op_types(cls.get("partial", []))
+        result.unsupported = _extract_op_types(cls.get("unsupported", []))
+        result.unknown = _extract_op_types(cls.get("unknown", []))
+        # Consider results available only when there's actual rule data
+        result.available = bool(result.supported or result.partial or result.unsupported)
+
+    except Exception as e:
+        print(f"  [analyze_insight] winml analyze failed: {e}")
+    finally:
+        out_path.unlink(missing_ok=True)
+
+    return result
+
+
+# ── graph-presence pruning ──────────────────────────────────────────────────
+
+# Map each OFAT graph pass to the baseline-graph pattern it needs to fire.
+# For "pattern" passes the detector emits an exact-named FusionCandidate ONLY
+# when the subgraph is actually present (no count threshold), so a count of 0 is
+# a confident "this pass would be a no-op" signal we can prune on.
+_PASS_PATTERN_FLAG: dict[str, str] = {
+    "conv_bn_fusion": "conv_bn_fusion",
+    "conv_add_fusion": "conv_add_fusion",
+    "conv_activation_fusion": "conv_activation_fusion",
+    "gelu_fusion": "gelu_fusion",
+    "layer_norm_fusion": "layer_norm_fusion",
+    "skip_layer_norm_fusion": "skip_layer_norm_fusion",
+    "matmul_add_fusion": "matmul_add_fusion",
+    "matmul_transpose_fusion": "matmul_transpose_fusion",
+    "attention_fusion": "attention_fusion",
+    "bias_softmax_fusion": "bias_softmax_fusion",
+}
+
+
+def _pass_name_of(label: str) -> str:
+    """Extract the single graph-pass name from an 'opset=NN + pass' hypothesis label."""
+    return label.split("+")[-1].strip()
+
+
+def _pass_can_fire(pass_name: str, g: GraphInfo, present: dict[str, int]) -> bool | None:
+    """Pre-estimate, from the baseline graph, whether a single pass can change it.
+
+    Returns True (required pattern present), False (confidently absent → the pass
+    is a guaranteed no-op), or None (pass not statically estimable → leave it to
+    the empirical search rather than risk a false cut).
+    """
+    if pass_name in _PASS_PATTERN_FLAG:
+        return present.get(_PASS_PATTERN_FLAG[pass_name], 0) > 0
+    # layout / rewrite passes: fall back to the primitive op the pass operates on.
+    if pass_name == "transpose_optimizer":
+        return g.transpose_count > 0
+    if pass_name == "highdimRTR_lowdimRTR":
+        return g.transpose_count > 0 and g.op_counts.get("Reshape", 0) > 0
+    if pass_name == "nchwc_transformer":
+        return g.op_counts.get("Conv", 0) > 0
+    return None
+
+
+# ── insight engine ────────────────────────────────────────────────────────────
+
+
+def build_insight(
+    onnx_path: Path,
+    winml: str,
+    ep: str,
+    device: str,
+    hypotheses: list[tuple[str, Any, str]],
+    kb: dict,
+) -> InsightResult:
+    """Fuse graph + analyze + KB signals into skip_set and priority_boosts.
+
+    Args:
+        onnx_path:   Path to baseline ONNX (post-export, pre-optim).
+        winml:       Path to winml executable.
+        ep:          Execution provider string (e.g. "cpu", "qnn").
+        device:      Device string (e.g. "cpu", "npu").
+        hypotheses:  List of (label, patch_fn, dimension) from autoconfig.py.
+        kb:          dict from load_ep_knowledge(ep).
+
+    Returns:
+        InsightResult with skip_set, priority_boosts, notes.
+    """
+    result = InsightResult()
+    notes = result.notes
+
+    print("\n=== Phase 1: Insight Engine ===")
+
+    # ── signal 1: graph analysis ───────────────────────────────
+    print("  [1/3] Graph analysis…")
+    g = run_graph_analysis(onnx_path)
+    result.graph_info = g
+    if g.available:
+        top5 = sorted(g.op_counts.items(), key=lambda x: -x[1])[:5]
+        print(
+            f"       total_ops={g.total_ops}  conv%={g.conv_pct:.1f}  "
+            f"matmul%={g.matmul_pct:.1f}  gemm%={g.gemm_pct:.1f}  "
+            f"transpose={g.transpose_count}  dynamic_axes={g.has_dynamic_axes}"
+        )
+        print(f"       top ops: {dict(top5)}")
+        if g.fusion_candidates:
+            print(f"       fusion candidates ({len(g.fusion_candidates)}):")
+            for fc in g.fusion_candidates[:10]:  # top-10 only
+                print(f"         [{fc.count:3d}×] {fc.flag:40s}  {fc.evidence}")
+            if len(g.fusion_candidates) > 10:
+                print(f"         ... and {len(g.fusion_candidates) - 10} more")
+    else:
+        print("       [skip] onnx not available or model not found")
+
+    # ── signal 2: winml analyze ────────────────────────────────
+    print(f"  [2/3] winml analyze --ep {ep}…")
+    ar = run_winml_analyze(winml, onnx_path, ep, device)
+    result.analyze_result = ar
+    if ar.available:
+        print(
+            f"       supported={len(ar.supported)}  partial={len(ar.partial)}  "
+            f"unsupported={len(ar.unsupported)}  unknown={len(ar.unknown)}"
+        )
+        if ar.partial:
+            print(f"       partial ops: {ar.partial[:5]}")
+        if ar.unsupported:
+            print(f"       unsupported ops: {ar.unsupported[:5]}")
+    else:
+        print("       [skip] no rule data for this EP or analyze failed")
+
+    # ── signal 3: KB confirmed rules ───────────────────────────
+    print("  [3/3] Applying KB confirmed rules…")
+
+    # ── build skip_set ─────────────────────────────────────────
+
+    # KB-derived skips (already applied per confirmed finding)
+    for note in kb.get("notes", []):
+        if "[KB confirmed] Skip pass:" in note:
+            pass_name = note.split("Skip pass:")[-1].strip()
+            # Match against hypothesis labels that use this pass
+            for label, _, _ in hypotheses:
+                if pass_name.replace("_", "-") in label or pass_name in label:
+                    result.skip_set.add(label)
+                    notes.append(f"skip [{label}]: KB confirmed rule — {pass_name}")
+
+    # Graph-derived skips
+    if g.available:
+        # npu-006: Conv% > 20% → hard-block conv fusions on QNN NPU
+        if ep in ("qnn",) and device == "npu" and g.conv_pct > 20.0:
+            for label, _, dim in hypotheses:
+                if dim == "graph_pass" and any(kw in label for kw in ("conv", "bn", "batch")):
+                    result.skip_set.add(label)
+                    notes.append(
+                        f"skip [{label}]: npu-006 — Conv%={g.conv_pct:.1f}%>20% on QNN NPU"
+                        " (FusedConv → CPU fallback)"
+                    )
+
+        # cpu-001: opset 21 regresses on CPU (empirical, non-monotonic, mechanism unknown)
+        if ep == "cpu":
+            for label, _, dim in hypotheses:
+                if dim == "opset" and "21" in label:
+                    notes.append(
+                        f"deprioritise [{label}]: cpu-001 — opset21 regresses on CPU"
+                        " (non-monotonic, mechanism unknown)"
+                    )
+                    result.priority_boosts[label] = result.priority_boosts.get(label, 0) - 5
+
+        # gpu-004: QNN GPU — skip all quantization
+        if ep == "qnn" and device == "gpu":
+            for label, _, dim in hypotheses:
+                if dim in ("quant", "precision"):
+                    result.skip_set.add(label)
+                    notes.append(f"skip [{label}]: gpu-004 — quantization hangs on QNN GPU")
+
+        # nhwc-transformer regresses p90 on DML/QNN GPU transformers
+        if ep in ("dml",) or (ep == "qnn" and device == "gpu"):
+            for label, _, dim in hypotheses:
+                if "nhwc" in label.lower():
+                    result.skip_set.add(label)
+                    notes.append(
+                        f"skip [{label}]: dml-002/gpu-002 — nhwc-transformer increases p90 variance"
+                    )
+
+        # graph-presence pruning (static pre-estimate): cut graph-pass hypotheses
+        # whose required pattern is absent from the baseline graph. With nothing to
+        # fuse the pass is a guaranteed no-op, so there is no point benchmarking it.
+        # Passes we cannot statically estimate (_pass_can_fire → None) are left for
+        # the empirical search rather than risk a false cut.
+        present_flags = {fc.flag: fc.count for fc in g.fusion_candidates}
+        for label, _, dim in hypotheses:
+            if dim != "graph_pass":
+                continue
+            if _pass_can_fire(_pass_name_of(label), g, present_flags) is False:
+                result.skip_set.add(label)
+                notes.append(
+                    f"skip [{label}]: graph analysis — required pattern absent in the"
+                    " baseline graph (pass would be a no-op, nothing to fuse)"
+                )
+
+    # ── build priority_boosts ──────────────────────────────────
+
+    if g.available:
+        # DINOv2-family on QNN NPU: opset21 gets strong positive boost (npu-001)
+        if ep == "qnn" and device == "npu":
+            # Heuristic: DINOv2 has many Reshape and high attention ops
+            if g.op_counts.get("Reshape", 0) > 30 and g.conv_pct < 10:
+                for label, _, dim in hypotheses:
+                    if dim == "opset" and "21" in label:
+                        result.priority_boosts[label] = result.priority_boosts.get(label, 0) + 10
+                        notes.append(
+                            f"boost [{label}]: npu-001 heuristic — high Reshape count"
+                            f" ({g.op_counts.get('Reshape', 0)}) + low Conv% suggests DINOv2-family"
+                        )
+
+        # Fusion-candidate-driven boosts: map detected patterns → hypothesis labels
+        #
+        # Strategy: for each FusionCandidate, find hypotheses whose label or dimension
+        # mentions the relevant flag.  Boost proportional to log(count) so that
+        # "288 MatMul→Add" doesn't overwhelm "12 GELU" by 24×.
+        import math
+
+        _FLAG_KEYWORDS: dict[str, list[str]] = {
+            "gelu_fusion": ["gelu"],
+            "fast_gelu_fusion": ["gelu", "fast"],
+            "bias_gelu_fusion": ["gelu", "bias"],
+            "quick_gelu_fusion": ["gelu", "quick"],
+            "gelu_singlegelu": ["gelu"],
+            "layer_norm_fusion": ["layer_norm", "layernorm", "ln"],
+            "skip_layer_norm_fusion": ["skip_layer_norm", "skip_ln"],
+            "simplified_layer_norm_fusion": ["layer_norm", "simplified"],
+            "fuse_rmsnorm": ["rmsnorm", "rms_norm"],
+            "attention_fusion": ["attention"],
+            "bias_softmax_fusion": ["softmax", "attention"],
+            "matmul_add_fusion": ["matmul_add", "matmul-add"],
+            "matmul_activation_fusion": ["matmul_act", "matmul-act"],
+            "matmul_transpose_fusion": ["matmul_transp", "matmul-transp"],
+            "matmul_scale_fusion": ["matmul_scale", "matmul-scale"],
+            "matmuladd_reshapegemm": ["reshape_gemm", "matmuladd"],
+            "conv_bn_fusion": ["conv_bn", "conv-bn"],
+            "conv_add_fusion": ["conv_add", "conv-add"],
+            "conv_mul_fusion": ["conv_mul", "conv-mul"],
+            "conv_activation_fusion": ["conv_act", "conv-act"],
+            "conv_add_activation_fusion": ["conv_add_act", "fused_conv"],
+            "pad_fusion": ["pad_conv", "pad-conv"],
+            "gemm_activation_fusion": ["gemm_act", "gemm-act"],
+            "gemm_sum_fusion": ["gemm_sum", "gemm-sum"],
+            "gemm_transpose_fusion": ["gemm_transp"],
+            "slice_elimination": ["slice_elim"],
+            "unsqueeze_elimination": ["unsqueeze_elim"],
+            "expand_elimination": ["expand_elim"],
+            "concat_slice_elimination": ["concat_slice"],
+            "transpose_optimizer": ["transpose_opt", "tp_opt"],
+            "highdimRTR_lowdimRTR": ["rtr", "reshape_transpose"],
+        }
+
+        for fc in g.fusion_candidates:
+            keywords = _FLAG_KEYWORDS.get(fc.flag, [fc.flag.replace("_", "-")])
+            boost = round(1 + math.log(max(fc.count, 1)), 1)
+            for label, _, dim in hypotheses:
+                label_lower = label.lower()
+                if any(kw in label_lower for kw in keywords):
+                    result.priority_boosts[label] = result.priority_boosts.get(label, 0) + boost
+                    notes.append(
+                        f"boost [{label}] +{boost:.1f}: graph has {fc.count}× {fc.flag} candidate(s)"
+                    )
+
+        # GELU-decomposed: additional direct boost for gelu hypotheses
+        if g.has_gelu_decomposed:
+            for label, _, dim in hypotheses:
+                if "gelu" in label.lower() and label not in {
+                    n.split("]")[0].removeprefix("boost [") for n in notes if "gelu" in n
+                }:
+                    result.priority_boosts[label] = result.priority_boosts.get(label, 0) + 2
+                    notes.append(
+                        f"boost [{label}]: decomposed GELU detected — fusion likely beneficial"
+                    )
+
+        # Conv-dense → conv fusions more likely to help (any EP except QNN)
+        if g.conv_pct > 40 and ep not in ("qnn",):
+            for label, _, dim in hypotheses:
+                if "conv" in label.lower() and dim == "graph_pass":
+                    result.priority_boosts[label] = result.priority_boosts.get(label, 0) + 2
+                    notes.append(
+                        f"boost [{label}]: high Conv% ({g.conv_pct:.1f}%) — conv fusions promising"
+                    )
+
+    # analyze-derived: if partial ops in model → deprioritise those optims
+    if ar.available and ar.partial:
+        for label, _, dim in hypotheses:
+            for pop in ar.partial:
+                if pop.lower() in label.lower():
+                    result.priority_boosts[label] = result.priority_boosts.get(label, 0) - 2
+                    notes.append(
+                        f"deprioritise [{label}]: op '{pop}' is partial-support on {ep.upper()}"
+                    )
+
+    # ── print summary ──────────────────────────────────────────
+    print("\n  Insight Engine result:")
+    print(f"    skip_set ({len(result.skip_set)}): {result.skip_set or '(none)'}")
+    boosts = {k: v for k, v in result.priority_boosts.items() if v != 0}
+    print(f"    priority_boosts: {boosts or '(none)'}")
+    if notes:
+        print("    notes:")
+        for n in notes:
+            print(f"      - {n}")
+    print()
+
+    return result
diff --git a/research/autoconfig/skills/optimizer/SKILL.md b/research/autoconfig/skills/optimizer/SKILL.md
new file mode 100644
index 000000000..8ca2ddd54
--- /dev/null
+++ b/research/autoconfig/skills/optimizer/SKILL.md
@@ -0,0 +1,54 @@
+---
+name: optimizer
+description: >
+  Use this sub-skill (driven by orchestrator) to RUN one winml-cli config
+  hypothesis and produce raw measurements. It does winml build, a Phase A 200-iter
+  screen with a CV stability gate and early-exit, a Phase B full bench (3 sessions x
+  1000 iters with cool-down), and a winml eval accuracy check. It makes no keep/discard
+  decision — it only returns benchmark + accuracy data for the reviewer to judge.
+---
+
+# optimizer
+
+The Optimizer is the **"run it"** sub-skill of the autoconfig loop (Phase 2). It
+turns one hypothesis into measurements, mirroring the `Optimizer` class in
+`skills/orchestrator/autoconfig.py` and the Optimizer box in
+`research/autoconfig/docs/autoconfig_diagram.html`.
+
+**Implementation in this folder:** `bench_utils.py` (the shared bench primitives —
+`bench_screen`, `bench_full`, `SessionManager`, and the `ThroughputOnly` verdict
+policy the Reviewer consumes).
+
+## When to use
+
+Invoked by `orchestrator` after `explorer` yields a
+hypothesis. Not used standalone.
+
+## Inputs
+
+- The hypothesis config delta from the Explorer (applied to the base config).
+- Build target: model id, EP, device (held on the Optimizer; thresholds are module constants).
+
+## Procedure
+
+1. **Build** — write `config.json`, run `winml build -c ... --ep <ep> --device <device> --no-quant --no-compile`. Abort the hypothesis on non-zero exit.
+2. **Phase A — screen** (`SCREEN_ITERS = 200`):
+   - Run `bench_screen`; reject as unstable if `CV > SCREEN_CV_MAX (0.10)` (thermal/scheduling noise — cool device and retry later).
+   - The orchestrator early-exits (skips Phase B) if screen improvement vs baseline < 1% (`SCREEN_PASS_MIN_IMPROVEMENT_PCT`), saving 25–90 min per dead hypothesis.
+3. **Phase B — full bench** (`FULL_SESSIONS = 3` x `FULL_ITERS = 1000`, `COOL_DOWN_S = 60`):
+   - Returns one p50 per session; the loop uses the median across sessions (DVFS-aware aggregation, npu-007).
+4. **Accuracy** — run `winml eval --samples 50`; parse top-1 / cosine accuracy. Latency comes from the bench, never from eval.
+
+## Outputs
+
+- `screen_p50`, `screen_cv` (Phase A).
+- `full_p50s` list + median p50 (Phase B).
+- `accuracy` (or None when the model/eval is unavailable).
+
+All handed to `reviewer` — the Optimizer never decides KEEP/DISCARD.
+
+## Constraints
+
+- No hardcoded architecture logic; EP/device come from the orchestrator/Insight.
+- Phase B only runs when Phase A is stable and shows promise (cost control).
+- Measurements are session-level; the Optimizer never collapses them to a single point estimate before review.
diff --git a/research/autoconfig/skills/optimizer/bench_utils.py b/research/autoconfig/skills/optimizer/bench_utils.py
new file mode 100644
index 000000000..3a9070ffe
--- /dev/null
+++ b/research/autoconfig/skills/optimizer/bench_utils.py
@@ -0,0 +1,778 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""bench_utils.py — Shared benchmarking helpers for QNN NPU sweeps.
+
+Bench protocol (npu-007):
+  Phase A: 200-iter screen. For QNN NPU, high CV (0.15-1.2) is NORMAL due to
+    DVFS/Hexagon HTP thermal throttling. Phase A result is informational only;
+    it never gates Phase B on NPU. Only use CV gate for CPU/GPU EPs.
+  Phase B: 3 independent sessions x 500 iters with 30s cool-down.
+    KEEP criterion: all p50s below baseline; for NPU, ranges must not overlap.
+
+winml config + build helpers are also centralized here to avoid duplication
+between catalog_sweep.py and validation_sweep.py.
+"""
+
+from __future__ import annotations
+
+import copy
+import json
+import math
+import statistics
+import subprocess
+import time
+from abc import ABC, abstractmethod
+from collections.abc import Callable
+from dataclasses import dataclass
+from pathlib import Path
+
+# ── Protocol constants (overridable by callers via module-level reassignment) ─
+SCREEN_WARMUP: int = 20
+SCREEN_ITERS: int = 200
+SCREEN_CV_MAX_NPU: float = 999.0  # never gate on CV for QNN NPU (npu-007)
+SCREEN_CV_MAX_STD: float = 0.10  # CPU / GPU: reject if CV > 10%
+
+FULL_WARMUP: int = 50
+FULL_ITERS: int = 500
+FULL_SESSIONS: int = 3
+COOL_DOWN_S: int = 30  # seconds between full-bench sessions (NPU)
+
+BUILD_TIMEOUT_S: int = 8 * 60
+BENCH_TIMEOUT_S: int = 8 * 60
+CONFIG_TIMEOUT_S: int = 120
+
+# ── Paired A/B + adaptive sampling (self-evolution-design Fix #1 / Fix #2) ─────
+MIN_PAIRS: int = 3  # never conclude on fewer than this many A/B pairs
+MAX_PAIRS: int = 8  # force-stop (MARGINAL) after this many pairs
+KEEP_GAIN_PCT: float = 5.0  # CI lower bound must exceed this to KEEP_CONFIRMED
+DISCARD_GAIN_PCT: float = -2.0  # CI upper bound below this -> DISCARD
+
+# ── Thermal reference classification (self-evolution-design Fix #5) ────────────
+THERMAL_COOL_MULT: float = 1.05  # <= 1.05x cold reference -> proceed
+THERMAL_HOT_MULT: float = 1.30  # >= 1.30x cold reference -> HOT_RUN
+
+
+# ── subprocess wrapper ────────────────────────────────────────────────────────
+
+
+def run_cmd(cmd: list[str], label: str = "", timeout: int = 600) -> tuple[int, str, float]:
+    """Run a subprocess command. Returns (returncode, combined_output, elapsed_s)."""
+    t0 = time.time()
+    print(f"  >> {label or ' '.join(cmd[:2])}", flush=True)
+    try:
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            encoding="utf-8",
+            errors="replace",
+            timeout=timeout,
+        )
+        elapsed = time.time() - t0
+        tag = "ok" if result.returncode == 0 else f"rc={result.returncode}"
+        print(f"     {elapsed:.0f}s [{tag}]", flush=True)
+        if result.returncode != 0:
+            snippet = (result.stderr or result.stdout or "")[-600:]
+            print(f"     stderr: {snippet}", flush=True)
+        return result.returncode, result.stdout + result.stderr, elapsed
+    except subprocess.TimeoutExpired:
+        elapsed = time.time() - t0
+        print(f"     TIMEOUT after {elapsed:.0f}s", flush=True)
+        return -999, f"TIMEOUT after {timeout}s", elapsed
+
+
+# ── winml wrappers ────────────────────────────────────────────────────────────
+
+
+def get_base_config(
+    winml: str,
+    model_id: str,
+    task: str,
+    model_type: str,
+    ep: str,
+    device: str,
+    out_path: Path,
+) -> dict | None:
+    """Generate a config via `winml config`. Returns parsed dict or None on failure.
+
+    Tries with --model-type first, then falls back without it.
+    """
+
+    def _try(extra_args: list[str]) -> dict | None:
+        cmd = [
+            winml,
+            "config",
+            "-m",
+            model_id,
+            "-t",
+            task,
+            "--device",
+            device,
+            "--ep",
+            ep,
+            "--no-compile",
+            "-o",
+            str(out_path),
+        ] + extra_args
+        rc, _, _ = run_cmd(cmd, label="winml config", timeout=CONFIG_TIMEOUT_S)
+        if rc == 0 and out_path.exists():
+            try:
+                cfg = json.loads(out_path.read_text(encoding="utf-8"))
+                out_path.unlink(missing_ok=True)
+                return cfg
+            except Exception as e:
+                print(f"  [warn] config parse error: {e}", flush=True)
+        out_path.unlink(missing_ok=True)
+        return None
+
+    cfg = _try(["--model-type", model_type])
+    if cfg is None:
+        print("  [warn] config with --model-type failed, retrying without...", flush=True)
+        cfg = _try([])
+    return cfg
+
+
+def run_build(
+    winml: str,
+    model_id: str,
+    cfg_path: Path,
+    out_dir: Path,
+    ep: str,
+    device: str,
+    extra_flags: list[str] | None = None,
+) -> tuple[bool, str]:
+    """Run `winml build`. Returns (success, combined_output)."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    cmd = [
+        winml,
+        "build",
+        "-c",
+        str(cfg_path),
+        "-m",
+        model_id,
+        "-o",
+        str(out_dir),
+        "--ep",
+        ep,
+        "--device",
+        device,
+        "--no-compile",
+        "--rebuild",
+    ]
+    if extra_flags:
+        cmd.extend(extra_flags)
+    rc, out, _ = run_cmd(cmd, label=f"winml build [{out_dir.name}]", timeout=BUILD_TIMEOUT_S)
+    return rc == 0, out
+
+
+def make_hypothesis_config(
+    base: dict, opset_override: int | None, extra_optim: dict | None
+) -> dict:
+    """Return a modified deep copy of base config for one hypothesis."""
+    cfg = copy.deepcopy(base)
+    if opset_override is not None and cfg.get("export"):
+        cfg["export"]["opset_version"] = opset_override
+    if extra_optim is not None:
+        cfg["optim"] = {**(cfg.get("optim") or {}), **extra_optim}
+    return cfg
+
+
+def find_model_onnx(hyp_dir: Path) -> Path | None:
+    """Locate the best ONNX artifact in a build output dir.
+
+    Priority: quantized > optimized > any .onnx.
+    Returns None if no .onnx file exists.
+    """
+    model_files = sorted(hyp_dir.glob("*.onnx"))
+    if not model_files:
+        return None
+    for preference in ("quantized", "optimized"):
+        match = next((f for f in model_files if preference in f.name), None)
+        if match:
+            return match
+    return model_files[0]
+
+
+def is_build_complete(hyp_dir: Path) -> bool:
+    """Return True if the hyp_dir contains a complete build artifact.
+
+    'Complete' means optimized.onnx or quantized.onnx is present.
+    export.onnx alone means the pipeline was truncated before optimization.
+    """
+    return any(
+        "optimized" in f.name or "quantized" in f.name for f in hyp_dir.glob("*.onnx")
+    )
+
+
+# ── benchmark helpers ─────────────────────────────────────────────────────────
+
+
+class ScreenResult:
+    """Result from Phase A quick screen."""
+
+    __slots__ = ("p50_ms", "cv", "rc_failed")
+
+    def __init__(self, p50_ms: float | None, cv: float, rc_failed: bool = False) -> None:
+        self.p50_ms = p50_ms
+        self.cv = cv
+        self.rc_failed = rc_failed  # True only on subprocess failure; never on high CV
+
+    @property
+    def hard_failed(self) -> bool:
+        """True if the bench command itself failed (rc != 0 or no output file)."""
+        return self.rc_failed
+
+    def to_dict(self, ep: str = "cpu") -> dict:
+        note = None
+        if ep in ("qnn", "npu") and self.cv > 0.10:
+            note = "DVFS noise — high CV expected on QNN NPU (npu-007)"
+        return {
+            "p50_ms": round(self.p50_ms, 3) if self.p50_ms is not None else None,
+            "cv": round(self.cv, 4),
+            "note": note,
+        }
+
+
+def bench_screen(
+    winml: str,
+    model_path: Path,
+    ep: str,
+    device: str,
+    out_json: Path | None = None,
+) -> ScreenResult:
+    """Phase A: 200-iter screen.
+
+    For QNN NPU: high CV is NORMAL (npu-007). Never treat high CV as failure.
+    Only hard-fail on subprocess rc != 0 or missing output file.
+    For CPU/GPU: high CV (> SCREEN_CV_MAX_STD) indicates measurement instability.
+    """
+    if out_json is None:
+        out_json = model_path.parent / "screen_perf.json"
+    rc, _, _ = run_cmd(
+        [
+            winml,
+            "perf",
+            "-m",
+            str(model_path),
+            "--ep",
+            ep,
+            "--device",
+            device,
+            "--warmup",
+            str(SCREEN_WARMUP),
+            "--iterations",
+            str(SCREEN_ITERS),
+            "-o",
+            str(out_json),
+        ],
+        label=f"perf screen ({SCREEN_ITERS} iters)",
+        timeout=BENCH_TIMEOUT_S,
+    )
+    if rc != 0 or not out_json.exists():
+        return ScreenResult(None, 999.0, rc_failed=True)
+    try:
+        data = json.loads(out_json.read_text(encoding="utf-8"))
+        lat = data.get("latency_ms", data)
+        p50 = lat.get("p50") if isinstance(lat, dict) else None
+        std = lat.get("std", 0.0) if isinstance(lat, dict) else 0.0
+        if not p50:
+            return ScreenResult(None, 999.0, rc_failed=True)
+        cv = std / p50 if p50 > 0 else 999.0
+        ep_tag = "NPU" if ep in ("qnn",) and device in ("npu",) else ep.upper()
+        print(
+            f"     screen: p50={p50:.2f}ms  cv={cv:.3f}"
+            + (" [DVFS-normal]" if ep_tag == "NPU" and cv > 0.10 else ""),
+            flush=True,
+        )
+        return ScreenResult(p50, cv)
+    except Exception as e:
+        print(f"     [warn] screen parse error: {e}", flush=True)
+        return ScreenResult(None, 999.0, rc_failed=True)
+
+
+def bench_full(
+    winml: str,
+    model_path: Path,
+    ep: str,
+    device: str,
+    out_prefix: str = "full_perf",
+    warmup: int | None = None,
+    iters: int | None = None,
+    cool_down_s: int | None = None,
+) -> list[float]:
+    """Phase B: FULL_SESSIONS sessions of FULL_ITERS iterations each, with cool-down.
+
+    Returns list of per-session p50_ms values. Empty list = all sessions failed.
+    Session files are written as {out_prefix}_s{n}.json in model_path.parent.
+
+    warmup/iters/cool_down_s override module-level defaults when provided.
+    """
+    _warmup = warmup if warmup is not None else FULL_WARMUP
+    _iters = iters if iters is not None else FULL_ITERS
+    _cool_down = cool_down_s if cool_down_s is not None else COOL_DOWN_S
+    p50s: list[float] = []
+    for s in range(1, FULL_SESSIONS + 1):
+        out_json = model_path.parent / f"{out_prefix}_s{s}.json"
+        rc, _, _ = run_cmd(
+            [
+                winml,
+                "perf",
+                "-m",
+                str(model_path),
+                "--ep",
+                ep,
+                "--device",
+                device,
+                "--warmup",
+                str(_warmup),
+                "--iterations",
+                str(_iters),
+                "-o",
+                str(out_json),
+            ],
+            label=f"perf full s{s}/{FULL_SESSIONS} ({_iters} iters)",
+            timeout=BENCH_TIMEOUT_S,
+        )
+        if rc == 0 and out_json.exists():
+            try:
+                data = json.loads(out_json.read_text(encoding="utf-8"))
+                lat = data.get("latency_ms", data)
+                p50 = lat.get("p50") if isinstance(lat, dict) else None
+                std = lat.get("std", 0.0) if isinstance(lat, dict) else 0.0
+                if p50:
+                    cv = std / p50 if p50 > 0 else 999.0
+                    print(
+                        f"     full s{s}: p50={p50:.2f}ms  std={std:.2f}ms  cv={cv:.3f}",
+                        flush=True,
+                    )
+                    p50s.append(round(p50, 3))
+            except Exception as e:
+                print(f"     [warn] full bench s{s} parse error: {e}", flush=True)
+        else:
+            print(f"     [warn] full bench s{s} failed", flush=True)
+        if s < FULL_SESSIONS:
+            print(f"     cool-down {_cool_down}s...", flush=True)
+            time.sleep(_cool_down)
+    return p50s
+
+
+def median_p50(p50s: list[float]) -> float | None:
+    """Return the median of a list of p50 values, or None if empty."""
+    if not p50s:
+        return None
+    return statistics.median(p50s)
+
+
+def ranges_non_overlapping(a: list[float], b: list[float]) -> bool | None:
+    """Return True if max(a) < min(b) (a is strictly faster than b).
+
+    Returns None if either list is empty (can't determine).
+    """
+    if not a or not b:
+        return None
+    return max(a) < min(b)
+
+
+def session_cv(p50s: list[float]) -> float:
+    """Session-to-session coefficient of variation (sample stddev / mean).
+
+    This is the run-to-run noise floor used by the effect-size gate. Unlike the
+    intra-session CV (screen.cv), it captures thermal / DVFS drift *between*
+    sessions — the noise that produces fake cross-config wins. Returns 0.0 for
+    fewer than 2 samples (spread cannot be estimated).
+    """
+    n = len(p50s)
+    if n < 2:
+        return 0.0
+    mean = sum(p50s) / n
+    if mean <= 0:
+        return 0.0
+    var = sum((x - mean) ** 2 for x in p50s) / (n - 1)
+    return (var**0.5) / mean
+
+
+# ── Paired A/B bench protocol (Fix #1) ─────────────────────────────────────────
+
+
+def run_perf_session(
+    winml: str,
+    model_path: Path,
+    ep: str,
+    device: str,
+    iters: int | None = None,
+    warmup: int | None = None,
+    out_json: Path | None = None,
+) -> float | None:
+    """Run a single `winml perf` session. Returns p50_ms, or None on failure.
+
+    This is the atomic measurement primitive shared by full-bench and paired A/B.
+    """
+    _iters = iters if iters is not None else FULL_ITERS
+    _warmup = warmup if warmup is not None else FULL_WARMUP
+    if out_json is None:
+        out_json = model_path.parent / "ab_perf.json"
+    rc, _, _ = run_cmd(
+        [
+            winml,
+            "perf",
+            "-m",
+            str(model_path),
+            "--ep",
+            ep,
+            "--device",
+            device,
+            "--warmup",
+            str(_warmup),
+            "--iterations",
+            str(_iters),
+            "-o",
+            str(out_json),
+        ],
+        label=f"perf session ({_iters} iters)",
+        timeout=BENCH_TIMEOUT_S,
+    )
+    if rc != 0 or not out_json.exists():
+        return None
+    try:
+        data = json.loads(out_json.read_text(encoding="utf-8"))
+        lat = data.get("latency_ms", data)
+        p50 = lat.get("p50") if isinstance(lat, dict) else None
+        return round(float(p50), 3) if p50 else None
+    except Exception as e:
+        print(f"     [warn] perf session parse error: {e}", flush=True)
+        return None
+
+
+def _ci_half_95(values: list[float]) -> float:
+    """Half-width of a normal-approximation 95% CI of the mean (1.96 * SE).
+
+    Optimistic for small n (a t-interval is wider). Returns 999.0 for n < 2 (CI undefined).
+    """
+    if len(values) < 2:
+        return 999.0
+    return 1.96 * statistics.stdev(values) / math.sqrt(len(values))
+
+
+def _verdict_from_gains(gains: list[float]) -> dict:
+    """Summarise within-pair gain percentages into mean / 95% CI / verdict.
+
+    verdict:
+      KEEP_CONFIRMED — CI lower bound > KEEP_GAIN_PCT (real, robust speedup)
+      DISCARD        — CI upper bound < DISCARD_GAIN_PCT (real regression)
+      MARGINAL       — CI straddles the indifference band (need more pairs)
+      BENCH_FAIL     — no usable pairs
+    """
+    if not gains:
+        return {
+            "gains_pct": [],
+            "mean_gain_pct": None,
+            "ci_half_95": None,
+            "n_pairs": 0,
+            "verdict": "BENCH_FAIL",
+        }
+    mean = statistics.mean(gains)
+    ci = _ci_half_95(gains)
+    if mean - ci > KEEP_GAIN_PCT:
+        verdict = "KEEP_CONFIRMED"
+    elif mean + ci < DISCARD_GAIN_PCT:
+        verdict = "DISCARD"
+    else:
+        verdict = "MARGINAL"
+    return {
+        "gains_pct": [round(g, 2) for g in gains],
+        "mean_gain_pct": round(mean, 2),
+        "ci_half_95": round(ci, 2) if ci < 999 else None,
+        "n_pairs": len(gains),
+        "verdict": verdict,
+    }
+
+
+def paired_ab_bench(
+    run_session: Callable[[Path], float | None],
+    baseline_path: Path,
+    hyp_path: Path,
+    n_pairs: int = MIN_PAIRS,
+    cool_down_s: int | None = None,
+) -> dict:
+    """Interleaved A/B bench (Fix #1): baseline then hypothesis in one thermal window.
+
+    Each pair measures ``baseline`` immediately followed by ``hyp`` so DVFS / thermal
+    drift appears in BOTH legs and cancels in the within-pair ratio. The mean of the
+    per-pair gains (with a 95% CI) is far more reliable than comparing a cold baseline
+    against a warm hypothesis across separate sweep phases.
+
+    ``run_session`` is an injectable callable ``(model_path) -> p50_ms | None`` so the
+    statistics can be unit-tested without hardware. Use :func:`run_perf_session` (via a
+    lambda binding winml/ep/device) for the real measurement.
+    """
+    _cool = cool_down_s if cool_down_s is not None else COOL_DOWN_S
+    gains: list[float] = []
+    for i in range(max(1, n_pairs)):
+        b = run_session(baseline_path)
+        h = run_session(hyp_path)
+        if b and h and b > 0:
+            gains.append((b - h) / b * 100)
+        if i < n_pairs - 1:
+            print(f"     cool-down {_cool}s...", flush=True)
+            time.sleep(_cool)
+    return _verdict_from_gains(gains)
+
+
+def adaptive_paired_ab_bench(
+    run_session: Callable[[Path], float | None],
+    baseline_path: Path,
+    hyp_path: Path,
+    min_pairs: int = MIN_PAIRS,
+    max_pairs: int = MAX_PAIRS,
+    cool_down_s: int | None = None,
+) -> dict:
+    """Adaptive paired A/B (Fix #2): keep sampling until the 95% CI is decisive.
+
+    Stops early once the CI clears the KEEP or DISCARD band (after at least
+    ``min_pairs`` pairs); otherwise force-stops at ``max_pairs`` and returns MARGINAL.
+    Stable models finish in ``min_pairs``; noisy ones automatically get more pairs.
+    """
+    _cool = cool_down_s if cool_down_s is not None else COOL_DOWN_S
+    gains: list[float] = []
+    for i in range(max(1, max_pairs)):
+        b = run_session(baseline_path)
+        h = run_session(hyp_path)
+        if b and h and b > 0:
+            gains.append((b - h) / b * 100)
+        if len(gains) >= min_pairs:
+            mean = statistics.mean(gains)
+            ci = _ci_half_95(gains)
+            if mean - ci > KEEP_GAIN_PCT or mean + ci < DISCARD_GAIN_PCT:
+                break
+        if i < max_pairs - 1:
+            print(f"     cool-down {_cool}s...", flush=True)
+            time.sleep(_cool)
+    return _verdict_from_gains(gains)
+
+
+def thermal_classify(
+    ref_p50_ms: float,
+    cold_ref_p50_ms: float,
+    cool_mult: float = THERMAL_COOL_MULT,
+    hot_mult: float = THERMAL_HOT_MULT,
+) -> str:
+    """Classify device thermal state from a reference-model latency (Fix #5).
+
+    ``cold_ref_p50_ms`` is the reference latency captured when the device is cold.
+    Returns ``COOL`` (proceed), ``WARM`` (borderline), ``HOT_RUN`` (throttled —
+    exclude from L2 promotion), or ``UNKNOWN`` if no valid cold reference.
+    """
+    if cold_ref_p50_ms <= 0 or ref_p50_ms <= 0:
+        return "UNKNOWN"
+    ratio = ref_p50_ms / cold_ref_p50_ms
+    if ratio <= cool_mult:
+        return "COOL"
+    if ratio >= hot_mult:
+        return "HOT_RUN"
+    return "WARM"
+
+
+# ── ONNX analysis helpers ─────────────────────────────────────────────────────
+
+
+# ── Verdict policies ─────────────────────────────────────────────────────────
+
+
+@dataclass
+class VerdictInput:
+    """Inputs to a verdict policy.
+
+    improvement_pct: positive = latency improvement
+        = (baseline_p50 - new_p50) / baseline_p50 * 100
+    cv_pct: screen coefficient of variation as percent (e.g., 5.0 for 5%)
+    correctness_pass: True if accuracy/parity check passed
+    build_ok: True if build succeeded
+    """
+
+    improvement_pct: float
+    cv_pct: float
+    correctness_pass: bool
+    build_ok: bool = True
+
+
+@dataclass
+class VerdictOutput:
+    """Output from a verdict policy."""
+
+    verdict: str  # KEEP | MARGINAL_KEEP | DISCARD | ACC_FAIL | BUILD_FAIL
+    reasoning: str
+    marginal: bool = False
+    threshold_pct: float = 0.0
+
+
+class VerdictPolicy(ABC):
+    """Abstract base for verdict policies."""
+
+    def __init__(self, min_improvement_pct: float = 1.0, stat_bar_multiplier: float = 2.0) -> None:
+        self.min_improvement_pct = min_improvement_pct
+        self.stat_bar_multiplier = stat_bar_multiplier
+
+    @abstractmethod
+    def evaluate(self, inp: VerdictInput) -> VerdictOutput: ...
+
+
+class ThroughputOnly(VerdictPolicy):
+    """KEEP iff improvement >= max(min_improvement_pct, stat_bar_multiplier * cv_pct).
+
+    Parameterized statistical significance: forces improvements to exceed
+    measurement noise before being declared real (borrowed from
+    AgenticGPUOptimizer V2). Marks verdicts as 'marginal' when improvement is
+    between 1x and 1.5x the threshold.
+    """
+
+    def evaluate(self, inp: VerdictInput) -> VerdictOutput:
+        if not inp.build_ok:
+            return VerdictOutput("BUILD_FAIL", "Build step failed.")
+        if not inp.correctness_pass:
+            return VerdictOutput("ACC_FAIL", "Accuracy check failed.")
+
+        threshold = max(self.min_improvement_pct, self.stat_bar_multiplier * inp.cv_pct)
+
+        if inp.improvement_pct < threshold:
+            return VerdictOutput(
+                "DISCARD",
+                f"Improvement {inp.improvement_pct:+.1f}% < threshold {threshold:.1f}% "
+                f"(max({self.min_improvement_pct:.0f}% floor, "
+                f"{self.stat_bar_multiplier:.0f}x CV={inp.cv_pct:.1f}%))",
+                threshold_pct=threshold,
+            )
+
+        marginal = inp.improvement_pct < threshold * 1.5
+        return VerdictOutput(
+            "MARGINAL_KEEP" if marginal else "KEEP",
+            f"Improvement {inp.improvement_pct:+.1f}% >= threshold {threshold:.1f}%",
+            marginal=marginal,
+            threshold_pct=threshold,
+        )
+
+
+# ── Session manager ───────────────────────────────────────────────────────────
+
+
+class SessionManager:
+    """Crash-resume state manager backed by session.json.
+
+    Writes session state atomically (temp-file + rename) after each experiment
+    so an interrupted run can be resumed from where it left off.
+
+    Usage::
+        sm = SessionManager(WORK_DIR)
+        if sm.has_state:
+            print(f"Resuming: {len(sm.completed_iters)} completed iters")
+        # In the hypothesis loop:
+        if i in sm.completed_iters:
+            continue
+        # ... run experiment ...
+        sm.save(iter_idx=i, verdict=status, baseline_p50=..., ...)
+    """
+
+    def __init__(self, work_dir: Path) -> None:
+        self.path = work_dir / "session.json"
+        self._state: dict = {}
+        if self.path.exists():
+            try:
+                loaded = json.loads(self.path.read_text(encoding="utf-8"))
+                self._state = loaded if isinstance(loaded, dict) else {}  # ignore non-dict JSON
+                if n := len(self.completed_iters):
+                    print(
+                        f"  [session] Resuming: {n} completed iter(s) loaded from {self.path.name}",
+                        flush=True,
+                    )
+            except Exception as e:
+                print(f"  [session] Warning: could not load {self.path.name}: {e}", flush=True)
+
+    @property
+    def has_state(self) -> bool:
+        return bool(self._state)
+
+    @property
+    def completed_iters(self) -> set[int]:
+        return set(self._state.get("completed_iters", []))
+
+    @property
+    def baseline_p50(self) -> float | None:
+        return self._state.get("baseline_p50")
+
+    @property
+    def best_p50(self) -> float:
+        v = self._state.get("best_p50")
+        return float(v) if v is not None else float("inf")
+
+    @property
+    def best_label(self) -> str:
+        return self._state.get("best_label", "")
+
+    @property
+    def consecutive_discards(self) -> int:
+        return int(self._state.get("consecutive_discards", 0))
+
+    @property
+    def discard_by_dimension(self) -> dict[str, int]:
+        return dict(self._state.get("discard_by_dimension", {}))
+
+    def save(
+        self,
+        *,
+        iter_idx: int,
+        verdict: str,
+        baseline_p50: float | None,
+        best_p50: float,
+        best_label: str,
+        consecutive_discards: int,
+        discard_by_dimension: dict[str, int],
+    ) -> None:
+        """Save current state to session.json atomically."""
+        completed = sorted(self.completed_iters | {iter_idx})  # sorted keeps session.json stable
+        self._state.update(
+            {
+                "completed_iters": completed,
+                "last_verdict": verdict,
+                "baseline_p50": baseline_p50,
+                "best_p50": best_p50 if best_p50 < float("inf") else None,
+                "best_label": best_label,
+                "consecutive_discards": consecutive_discards,
+                "discard_by_dimension": discard_by_dimension,
+                "last_iter": iter_idx,
+            }
+        )
+        tmp = self.path.with_suffix(".tmp")
+        try:
+            tmp.write_text(json.dumps(self._state, indent=2), encoding="utf-8")
+            tmp.replace(self.path)
+        except Exception as e:
+            print(f"  [session] Warning: could not save session state: {e}", flush=True)
+
+
+def count_conv_pct(model_onnx: Path) -> tuple[float, int, int]:
+    """Count Conv ops as a percentage of all graph nodes.
+
+    Returns (conv_pct, conv_count, total_count).
+    Used to assess npu-006 risk: Conv% > 20% means conv fusions will likely
+    produce FusedConv ops that QNN EP cannot dispatch (-> CPU fallback).
+
+    Returns (0.0, 0, 0) if onnx is not installed or file is missing.
+    The caller must treat (0.0, 0, 0) as 'unknown', not as 'safe'.
+    """
+    if not model_onnx.exists():
+        return 0.0, 0, 0
+    try:
+        import onnx  # noqa: PLC0415
+
+        model = onnx.load(str(model_onnx))
+        ops = [n.op_type for n in model.graph.node]
+        total = len(ops)
+        conv_count = sum(1 for o in ops if o == "Conv")
+        pct = conv_count / total * 100 if total > 0 else 0.0
+        return round(pct, 1), conv_count, total
+    except Exception as e:
+        print(f"  [warn] Conv% analysis failed (onnx not installed?): {e}", flush=True)
+        return 0.0, 0, 0
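
Reviewer note on the paired A/B statistics in `bench_utils.py`: the sketch below re-implements the CI half-width and verdict logic in isolation so the decision rule can be checked without hardware. It is a minimal sketch, not the module itself. The threshold values `KEEP_GAIN_PCT = 1.0` and `DISCARD_GAIN_PCT = -1.0` are assumptions chosen for illustration (the real constants are defined outside this excerpt), and `pair_gains` is a hypothetical helper.

```python
import math
import statistics

# Assumed thresholds for illustration only; the real KEEP_GAIN_PCT and
# DISCARD_GAIN_PCT are defined elsewhere in bench_utils.py.
KEEP_GAIN_PCT = 1.0
DISCARD_GAIN_PCT = -1.0


def ci_half_95(values: list[float]) -> float:
    """Normal-approximation 95% CI half-width (mirrors _ci_half_95)."""
    if len(values) < 2:
        return 999.0  # sentinel: CI is undefined for fewer than 2 samples
    return 1.96 * statistics.stdev(values) / math.sqrt(len(values))


def verdict(gains: list[float]) -> str:
    """Classify per-pair gain percentages the way _verdict_from_gains does."""
    if not gains:
        return "BENCH_FAIL"
    mean, ci = statistics.mean(gains), ci_half_95(gains)
    if mean - ci > KEEP_GAIN_PCT:
        return "KEEP_CONFIRMED"
    if mean + ci < DISCARD_GAIN_PCT:
        return "DISCARD"
    return "MARGINAL"


def pair_gains(pairs: list[tuple[float, float]]) -> list[float]:
    """Hypothetical helper: (baseline_ms, hypothesis_ms) pairs -> gain %."""
    return [(b - h) / b * 100 for b, h in pairs if b > 0 and h > 0]


# Thermal drift raises both legs of each pair, so the within-pair ratio
# stays near 10% even though absolute latency climbs from 10 ms to 12 ms.
drifting = [(10.0, 9.0), (11.0, 9.9), (12.0, 10.8)]
print(verdict(pair_gains(drifting)))
```

A single pair always yields `MARGINAL` under this rule because the sentinel half-width swamps the mean, which is why the adaptive variant only evaluates the CI after `min_pairs` samples.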
diff --git a/research/autoconfig/skills/orchestrator/SKILL.md b/research/autoconfig/skills/orchestrator/SKILL.md
new file mode 100644
index 000000000..9e2e80529
--- /dev/null
+++ b/research/autoconfig/skills/orchestrator/SKILL.md
@@ -0,0 +1,101 @@
+---
+name: orchestrator
+description: >
+  Use this skill as the top-level brain for an automated winml-cli build-config
+  search. It runs the full autoconfig lifecycle (Phase 0 Intake, Phase 1 Insight,
+  Phase 2 Opt Loop, Phase 3 Outcome) and coordinates three sub-skills —
+  explorer (what to try), optimizer (run it), and
+  reviewer (judge it) — to find the best EP + opset + graph-optimization
+  config for a given model on the current Windows hardware. Owns session state,
+  crash-resume, champion tracking, and stop conditions; sub-skills own one phase each.
+---
+
+# orchestrator
+
+The Orchestrator is the **main brain** of the autoconfig loop. It does not build,
+benchmark, or judge experiments itself — it sequences the phases and delegates the
+Phase 2 work to three sub-skills, then aggregates their results into a champion
+config plus auditable artifacts.
+
+Reference implementation: `skills/orchestrator/autoconfig.py` (the `main()`
+orchestrator wiring the `Explorer`, `Optimizer`, and `Reviewer` classes).
+Design spec: `research/autoconfig/docs/autoconfig_diagram.html`.
+
+**Implementation in this folder:** `autoconfig.py` (the `main()` orchestrator plus
+the `Explorer` / `Optimizer` / `Reviewer` classes — the runnable reference loop that
+the explorer / optimizer / reviewer sub-skills formalize).
+
+## When to use
+
+- "Find the fastest config for this model on my NPU/GPU/CPU"
+- "Sweep opset 17–21 and graph optimizations and tell me what actually helps"
+- "Run an automated, statistically-honest config search and give me an auditable report"
+- Driving a catalog sweep across many models (see `catalog_sweep.py`)
+
+## Search space — full grid, then prune
+
+The orchestrator owns the **complete, zero-experience search grid**: from a FP32
+baseline it varies **one factor at a time**: opset (17–21), quantization precision
+(fp16/int8/int16/w8a16), or a single graph pass (13 passes, each tried at every
+opset; 74 entries including the baseline), generated by `build_search_space()` in `autoconfig.py` from
+`OPSET_RANGE` / `PRECISIONS` / `OPTIM_PASSES` (the single source of truth for the
+universe). It lists *all* combinations up front; the Explorer then prunes/reorders
+them by experience (e.g. device-invalid precisions, KB hard-blocks). The
+per-`(ep, device)` matrices in `ep_device_knowledge/<ep>_<device>.json`
+(`hypotheses` + `sweep_config.quant`) that `catalog_sweep.py` runs are the
+**experience-pruned and reordered subsets** of this same grid.
+
+## Sub-skills it coordinates
+
+| Phase | Sub-skill | Responsibility |
+| --- | --- | --- |
+| 2 — pick | `explorer` | Take the full OFAT grid, prune with KB hard-blocks + Insight skip_set, rank into a priority_queue, yield the next hypothesis |
+| 2 — run | `optimizer` | `winml build` -> Phase A screen (CV gate) -> Phase B full bench -> `winml eval` accuracy; returns raw measurements only |
+| 2 — judge | `reviewer` | Apply the ThroughputOnly verdict (`threshold = max(1%, 2x CV)`) -> KEEP / MARGINAL / DISCARD, draft KB entries for real wins |
+
+The orchestrator is the only component that holds global state. Sub-skills are
+stateless with respect to each other: Explorer never benchmarks, Optimizer never
+decides, Reviewer never builds.
+
+## Lifecycle (the procedure)
+
+**Phase 0 — Intake**
+- `winml inspect` the model; resolve `model_type` (architecture family — never hardcode arch names).
+- `winml analyze --ep <ep>` for EP compatibility; establish the correctness contract via `winml eval --mode compare` (cosine ~= 1.000 baseline).
+- Build the baseline config and record its p50 as the reference.
+
+**Phase 1 — Insight**
+- Run the static/graph analyzer over the full OFAT grid to produce the
+  `skip_set` + `priority_boosts` tailored to the model
+  (e.g. Conv% drives the npu-006 conv-fusion hard-block).
+- **Graph-presence pruning:** for every `graph_pass` hypothesis, pre-estimate from
+  the baseline graph whether the pass can fire (`_pass_can_fire` over the detected
+  `fusion_candidates`). Patterns present → boost; confidently absent → **cut**
+  (added to `skip_set`); not estimable → kept for the empirical search.
+- Hand the full grid + `skip_set` + `priority_boosts` to the Explorer.
+
+**Phase 2 — Opt Loop** (repeat until a stop condition)
+1. Ask **explorer** for the next hypothesis (it pops from the priority_queue and skips KB/Insight-blocked passes).
+2. Ask **optimizer** to build + benchmark it (screen early-exits if delta < 1%; full bench is 3x1000 with 60 s cool-down).
+   - **Runtime no-op guard:** for a `graph_pass`, if the built graph is identical to the baseline (`graph_is_noop`), the pass matched nothing at build time → discard before screen/bench (catches passes that survived static pruning but still didn't fire).
+3. Ask **reviewer** for the verdict; on KEEP, update the champion.
+4. Persist `session.json` atomically (crash-resume) and append the TSV row + experiment.md.
+
+**Phase 3 — Outcome**
+- Emit the champion config, an HTML/Markdown report, the per-experiment artifacts, and KB draft entries (`status="draft"`).
+- Summarize confirmed findings and any feature requirements surfaced during the run.
+
+## Stop conditions
+
+Stop the Phase 2 loop when **any** holds:
+- Objective met (target improvement reached), or
+- 30 consecutive DISCARDs (architectural levers exhausted), or
+- Priority queue empty, or
+- User stops.
+
+## Constraints
+
+- No hardcoded model/architecture logic — all arch reasoning comes from winml `model_type`.
+- Accuracy gate (`winml eval`) is mandatory before any KEEP.
+- All perf claims use session-level averaging; never report a point estimate as a win.
+- KB writes are drafts only; promotion to `confirmed` is a human gate (>=2 models + mechanism understood).
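
Reviewer note on the SKILL.md stop conditions: the four conditions listed under "Stop conditions" can be expressed as one ordered check. This is a minimal sketch under stated assumptions: `LoopState`, `stop_reason`, and their field names are hypothetical and do not appear in `autoconfig.py`. Only the 30-discard plateau limit is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

STOP_CONSECUTIVE_DISCARDS = 30  # plateau limit quoted in the skill text


@dataclass
class LoopState:
    """Hypothetical snapshot of the Phase 2 loop (names are illustrative)."""

    best_improvement_pct: float
    consecutive_discards: int
    queue_len: int
    user_stopped: bool = False


def stop_reason(state: LoopState, target_pct: Optional[float]) -> Optional[str]:
    """Return the first stop condition that holds, or None to keep looping."""
    if target_pct is not None and state.best_improvement_pct >= target_pct:
        return "objective met"
    if state.consecutive_discards >= STOP_CONSECUTIVE_DISCARDS:
        return "plateau: 30 consecutive DISCARDs"
    if state.queue_len == 0:
        return "priority queue empty"
    if state.user_stopped:
        return "user stop"
    return None


# A run with hypotheses left and no plateau keeps going.
print(stop_reason(LoopState(2.0, 4, 12), target_pct=10.0))
```

Returning the reason rather than a bare boolean lets the Phase 3 report state why the search ended.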
diff --git a/research/autoconfig/skills/orchestrator/autoconfig.py b/research/autoconfig/skills/orchestrator/autoconfig.py
new file mode 100644
index 000000000..f1cfa2a12
--- /dev/null
+++ b/research/autoconfig/skills/orchestrator/autoconfig.py
@@ -0,0 +1,963 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""autoconfig.py — AutoResearch-style optimize-pass search for winml-cli
+Demo: facebook/convnext-tiny-224, CPU EP, FP32
+
+Loop: hypothesize → winml build → quick-screen bench (CV gate) →
+      full bench (3 sessions) → eval → keep/discard → repeat
+
+Key design principles (from GPU Optimizer V2 + ConvNext lessons):
+  1. Two-phase bench: 200-iter CV screen FIRST, full bench only if CV < threshold
+     (CPU/GPU) — or unconditionally for QNN NPU (npu-007: DVFS makes CV unreliable)
+  2. Use winml perf (NOT winml eval) for latency — eval includes HF preprocessing
+  3. Mandatory external-research after 5 consecutive DISCARDs in same dimension
+  4. Load ep_device_knowledge/*.json (only "confirmed" entries) to prune search space
+  5. Per-experiment structured output: hypothesis/impl/parity/perf/analysis/decision
+  6. Stop condition: 30 consecutive DISCARDs (not 5)
+
+Hypothesis design — ISOLATED mode (each hypothesis is independent):
+  Each hypothesis is applied to a fresh copy of BASELINE. The "+" in a label
+  is cosmetic; no state is accumulated across hypotheses. This allows independent
+  attribution: "does gelu-fusion alone help?" rather than "does gelu help on top
+  of conv fusions?". To run a cumulative search, chain patch functions explicitly.
+"""
+
+import copy
+import csv
+import json
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+# Agent package bootstrap: make the autoconfig root (the dir holding ep_device_knowledge/)
+# importable so sibling skills/lib packages resolve when run as a standalone script.
+_AGENT_ROOT = next(
+    p for p in Path(__file__).resolve().parents if (p / "ep_device_knowledge").is_dir()
+)
+if str(_AGENT_ROOT) not in sys.path:
+    sys.path.insert(0, str(_AGENT_ROOT))
+
+from lib.report_gen import generate_report  # noqa: E402
+from skills.explorer.analyze_insight import build_insight, run_graph_analysis  # noqa: E402
+from skills.optimizer.bench_utils import (  # noqa: E402
+    FULL_ITERS,
+    FULL_SESSIONS,
+    SCREEN_CV_MAX_STD,
+    SCREEN_ITERS,
+    SessionManager,
+    ThroughputOnly,
+    VerdictInput,
+    bench_full,
+    bench_screen,
+    median_p50,
+    run_cmd,
+)
+
+sys.stdout.reconfigure(encoding="utf-8", errors="replace")  # type: ignore[attr-defined]
+
+# ── settings ─────────────────────────────────────────────────────────────────
+MODEL_ID = "facebook/convnext-tiny-224"
+TASK = "image-classification"
+EP = "cpu"
+DEVICE = "cpu"
+WINML = str(_AGENT_ROOT / ".venv" / "Scripts" / "winml.exe")
+WORK_DIR = _AGENT_ROOT / "convnext-search"
+RESULTS_TSV = WORK_DIR / "results.tsv"
+KB_DIR = _AGENT_ROOT / "ep_device_knowledge"
+
+EVAL_SAMPLES = 50  # for accuracy gate
+ACCURACY_FLOOR = 0.70  # discard if cosine similarity falls below this
+MIN_IMPROVEMENT = 0.01  # fraction (0.01 == 1%); ThroughputOnly.min_improvement_pct is in percent
+
+# Verdict policy: improvement must exceed max(MIN_IMPROVEMENT, STAT_BAR * screen_cv)
+# Borrowed from AgenticGPUOptimizer V2 (avoids calling noise-level deltas "improvements")
+STAT_BAR_MULTIPLIER = 2.0
+
+# Screen early exit: skip 3x full-bench when screen already shows < this % improvement.
+# Saves ~25-90 min per rejected hypothesis (3 sessions × FULL_ITERS iters).
+SCREEN_PASS_MIN_IMPROVEMENT_PCT = 1.0
+
+# Bench protocol (two-phase, from GPU Optimizer V2). These rebind SCREEN_ITERS / FULL_ITERS / FULL_SESSIONS imported above.
+SCREEN_WARMUP = 20
+SCREEN_ITERS = 200
+SCREEN_CV_MAX = 0.10  # Coefficient of Variation = std/p50; reject if > 10%
+FULL_WARMUP = 50
+FULL_ITERS = 1000
+FULL_SESSIONS = 3
+COOL_DOWN_S = 60  # seconds between full-bench sessions
+
+# Stop conditions
+STOP_CONSECUTIVE_DISCARDS = 30  # plateau stop
+EXTERNAL_RESEARCH_TRIGGER = 5  # trigger after this many DISCARDs in same dimension
+
+# ── load ep_knowledge (confirmed entries only) ────────────────────────────────
+
+
+def load_ep_knowledge(ep: str) -> dict:
+    """Load KB pruning rules for the given EP. Only findings with ``mechanism_confirmed``
+    set can prune; the rest become [DRAFT] notes. ``graph_passes.skip`` rules apply as-is.
+    """
+    kb_path = KB_DIR / f"{ep}_{DEVICE}.json"
+    if not kb_path.exists():
+        return {"skip_passes": [], "skip_quantization": False, "notes": []}
+
+    kb = json.loads(kb_path.read_text(encoding="utf-8"))
+    rules = kb.get("search_space_rules", {})
+    skip_passes = []
+    skip_quant = False
+    notes = []
+
+    # Only apply rules from confirmed findings
+    confirmed_ids = {f["id"] for f in kb.get("findings", []) if f.get("mechanism_confirmed", False)}
+
+    for finding in kb.get("findings", []):
+        if finding["id"] not in confirmed_ids:
+            notes.append(f"[DRAFT] {finding['id']}: {finding['title'][:60]}…")
+            continue
+        action = finding.get("action_for_autoconfig", "")
+        if "skip" in action.lower() and "quantization" in action.lower():
+            skip_quant = True
+            notes.append(f"[KB confirmed] Skip quantization: {finding['id']}")
+        if "skip" in action.lower() and "compile" in action.lower():
+            notes.append(f"[KB confirmed] Skip compile: {finding['id']}")
+
+    # Parse search_space_rules for passes to skip
+    graph_passes = rules.get("graph_passes", {})
+    for p in graph_passes.get("skip", []):
+        skip_passes.append(p)
+        notes.append(f"[KB confirmed] Skip pass: {p}")
+
+    return {"skip_passes": skip_passes, "skip_quantization": skip_quant, "notes": notes}
+
+
+# ── baseline config ───────────────────────────────────────────────────────────
+BASELINE: dict = {
+    "export": {
+        "opset_version": 17,
+        "batch_size": 1,
+        "do_constant_folding": True,
+        "dynamo": False,
+        "input_tensors": [
+            {
+                "name": "pixel_values",
+                "dtype": "float32",
+                "shape": [1, 3, 224, 224],
+                "value_range": [0, 1],
+            }
+        ],
+        "output_tensors": [{"name": "logits"}],
+    },
+    "optim": {},
+    "loader": {
+        "task": TASK,
+        "model_class": "AutoModelForImageClassification",
+        "model_type": "convnext",
+    },
+    "eval": {
+        "task": TASK,
+        "dataset": {"path": "timm/mini-imagenet", "split": "test", "samples": EVAL_SAMPLES},
+    },
+}
+
+
+# ── full search space — the unbiased, zero-experience OFAT grid ───────────────
+# The orchestrator reference loop enumerates the COMPLETE one-factor-at-a-time
+# grid: from a FP32 baseline it varies exactly one factor at a time — opset,
+# quantization precision, or a single graph pass. This is the full set of "all
+# combinations" BEFORE any experience is applied. The Explorer then prunes/reorders
+# it via confirmed-KB hard-blocks + the Insight Engine. The per-(ep, device)
+# catalog_sweep matrices in ep_device_knowledge/<ep>_<device>.json ("hypotheses")
+# are the experience-pruned and reordered subsets of this same grid (single source
+# of truth lives here).
+#
+# Each patch_fn receives a FRESH copy of BASELINE (isolated mode): hypotheses are
+# independent, no state is accumulated across them.
+
+OPSET_RANGE: list[int] = [17, 18, 19, 20, 21]
+
+# The full universe of single graph-optimization passes winml-cli can toggle.
+# catalog_sweep KBs draw their per-EP hypothesis matrices from this same set.
+OPTIM_PASSES: list[str] = [
+    "conv_bn_fusion",
+    "conv_add_fusion",
+    "conv_activation_fusion",
+    "gelu_fusion",
+    "layer_norm_fusion",
+    "skip_layer_norm_fusion",
+    "matmul_add_fusion",
+    "matmul_transpose_fusion",
+    "attention_fusion",
+    "bias_softmax_fusion",
+    "transpose_optimizer",
+    "nchwc_transformer",
+    "highdimRTR_lowdimRTR",
+]
+
+# The quantization precisions winml-cli can target. "fp32" == no quantization
+# (the reference). int8/int16/w8a16/fp16 are device-dependent; the full grid lists
+# them all and the Explorer/experience prunes device-invalid ones per (ep, device).
+PRECISIONS: list[str] = ["fp32", "fp16", "int8", "int16", "w8a16"]
+
+
+def _make_patch(opset: int, pass_name: str | None, precision: str = "fp32"):
+    """Return a patch_fn setting one opset, at most one optim pass, and a precision
+    on a fresh BASELINE copy. pass_name=None => no fusion flags; precision='fp32'
+    => no quantization (FP32 reference)."""
+
+    def patch(cfg: dict) -> dict:
+        cfg["export"]["opset_version"] = opset
+        cfg["optim"] = {pass_name: True} if pass_name else {}
+        cfg["precision"] = precision
+        return cfg
+
+    return patch
+
+
+def build_search_space(
+    opsets: list[int] = OPSET_RANGE,
+    passes: list[str] = OPTIM_PASSES,
+    precisions: list[str] = PRECISIONS,
+) -> list[tuple[str, object, str]]:
+    """Enumerate the full OFAT grid: vary one factor at a time from baseline.
+
+    Axes:
+      * baseline   — lowest opset, FP32, no pass (the global reference)
+      * opset      — each higher opset (FP32, no pass)
+      * quant      — each non-FP32 precision (base opset, no pass)
+      * graph_pass — each single pass x each opset (FP32)
+
+    Returns (label, patch_fn, search_dimension) triples.
+    """
+    base_opset = opsets[0]
+    base_prec = precisions[0]
+    space: list[tuple[str, object, str]] = []
+    # 1. pure-opset axis (FP32, no fusion flags): baseline + opset sweep
+    for op in opsets:
+        if op == base_opset:
+            label, dim = f"baseline (opset {op}, {base_prec}, no fusions)", "baseline"
+        else:
+            label, dim = f"opset={op}", "opset"
+        space.append((label, _make_patch(op, None, base_prec), dim))
+    # 2. quant axis (base opset, no fusion flags): each non-FP32 precision
+    for prec in precisions[1:]:
+        space.append((f"quant={prec}", _make_patch(base_opset, None, prec), "quant"))
+    # 3. single graph-pass axis (FP32), crossed with every opset
+    for op in opsets:
+        for p in passes:
+            space.append((f"opset={op} + {p}", _make_patch(op, p, base_prec), "graph_pass"))
+    return space
+
+
+HYPOTHESES: list[tuple[str, object, str]] = build_search_space()
+
+# ── helpers ───────────────────────────────────────────────────────────────────
+
+
+def graph_is_noop(model_path: Path, baseline_op_counts: dict[str, int]) -> bool:
+    """True when the optimized graph is structurally identical to the baseline.
+
+    A graph pass that matches no pattern leaves the op-type histogram unchanged
+    (fusions only ever *reduce* node counts). When that happens there is nothing
+    to benchmark, so the orchestrator discards the hypothesis early — the runtime
+    counterpart to the Explorer's static graph-presence pruning.
+    """
+    if not baseline_op_counts or not model_path.exists():
+        return False
+    info = run_graph_analysis(model_path)
+    return info.available and info.op_counts == baseline_op_counts
+
+
+# ═══════════════════════════════════════════════════════════════════════════
+#  Phase 2 — Opt Loop subagents (autoconfig_diagram.html)
+#
+#  The experiment loop is split into three explicit subagents that mirror the
+#  architecture diagram:
+#
+#    Explorer   — decides *what to try next*: loads the hypothesis pool, applies
+#                 KB hard-blocks + Insight-Engine skip rules (skip_set), and ranks
+#                 the survivors by Insight priority boost (priority_queue).
+#    Optimizer  — *runs* one hypothesis: winml build -> Phase A screen (CV gate) ->
+#                 Phase B full bench -> accuracy eval. Produces raw measurements
+#                 only; it makes no keep/discard decision.
+#    Reviewer   — *judges* the measurements: applies the ThroughputOnly verdict
+#                 policy (threshold = max(min_improvement, stat_bar x CV)), emits
+#                 KEEP / MARGINAL / DISCARD, and drafts KB entries for real wins.
+#
+#  The orchestrator (main) wires them together: Explorer yields a hypothesis ->
+#  Optimizer benchmarks it -> Reviewer returns a verdict -> repeat.
+# ═══════════════════════════════════════════════════════════════════════════
+
+
+class Explorer:
+    """Phase 2 Explorer — hypothesis pool -> skip_set pruning -> priority_queue.
+
+    Owns search *order* only; it never builds or benchmarks. It fuses two pruning
+    signals (confirmed KB hard-blocks and the Phase 1 Insight Engine skip_set) and
+    ranks the remaining hypotheses by Insight priority boost.
+    """
+
+    def __init__(self, hypotheses: list[tuple], kb: dict, insight) -> None:
+        self.kb = kb
+        self.insight = insight
+        # priority_queue: stable sort, highest Insight priority boost first.
+        self.priority_queue = sorted(
+            hypotheses, key=lambda item: -insight.priority_boosts.get(item[0], 0.0)
+        )
+
+    def __iter__(self):
+        """Iterate hypotheses in priority order (non-destructive; the queue is not consumed)."""
+        return iter(self.priority_queue)
+
+    def skip_reason(self, label: str, flags_preview: str) -> str | None:
+        """Return why this hypothesis is pruned, or None to run it.
+
+        Checks the confirmed-KB hard-block rules first, then the Insight Engine
+        skip_set. Mirrors the diagram's "Apply KB hard blocks -> skip_set" step.
+        """
+        kb_rule = next(
+            (r for r in self.kb["skip_passes"] if any(f in flags_preview for f in r.split()[:2])),
+            None,
+        )
+        if kb_rule is not None:
+            return f"KB confirmed rule: {kb_rule}"
+        if label in self.insight.skip_set:
+            return f"Insight Engine: {label}"
+        return None
+
+
+class Optimizer:
+    """Phase 2 Optimizer — winml build -> Phase A screen -> Phase B full bench -> accuracy.
+
+    Produces raw measurements for one hypothesis. Holds the winml binary path and
+    the build target (model id / EP / device); thresholds stay as module constants.
+    """
+
+    def __init__(self, winml: str, model_id: str, ep: str, device: str) -> None:
+        self.winml = winml
+        self.model_id = model_id
+        self.ep = ep
+        self.device = device
+
+    def build(self, cfg: dict, out_dir: Path) -> tuple[bool, str]:
+        out_dir.mkdir(parents=True, exist_ok=True)
+        cfg = copy.deepcopy(cfg)
+        precision = cfg.pop("precision", "fp32")
+        # fp32 => no quantization (FP32 reference). For a specific precision,
+        # materialize the quant section via `winml config --precision` and build
+        # with --quant; `winml build` alone only resolves device-default quant.
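+        quant_flag = "--no-quant"
+        if precision != "fp32":
+            quant_section = self._resolve_quant_section(precision, out_dir)
+            if quant_section is None:  # do not silently fall back to an FP32 build
+                return False, f"could not resolve quant section for precision={precision}"
+            cfg["quant"], quant_flag = quant_section, "--quant"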
+        cfg_path = out_dir / "config.json"
+        cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
+        rc, out, _ = run_cmd(
+            [
+                self.winml,
+                "build",
+                "-c",
+                str(cfg_path),
+                "-m",
+                self.model_id,
+                "-o",
+                str(out_dir),
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+                quant_flag,
+                "--no-compile",
+            ],
+            label=f"winml build [{precision}]",
+        )
+        return rc == 0, out
+
+    def _resolve_quant_section(self, precision: str, out_dir: Path) -> dict | None:
+        """Generate a throwaway config at the requested precision and lift out its
+        quant block, so a specific precision (fp16/int8/int16/w8a16) can be applied
+        to the hand-built config. Returns None if winml config fails."""
+        tmp = out_dir / "_quant_probe.json"
+        rc, _, _ = run_cmd(
+            [
+                self.winml,
+                "config",
+                "-m",
+                self.model_id,
+                "-t",
+                TASK,
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+                "--precision",
+                precision,
+                "--no-compile",
+                "-o",
+                str(tmp),
+            ],
+            label=f"winml config --precision {precision}",
+        )
+        if rc != 0 or not tmp.exists():
+            return None
+        try:
+            probe = json.loads(tmp.read_text(encoding="utf-8"))
+        except Exception:
+            return None
+        finally:
+            tmp.unlink(missing_ok=True)
+        return probe.get("quant")
+
+    def screen(self, model_path: Path) -> tuple[float | None, float]:
+        """Phase A: 200-iter screen with CV gate.
+
+        For CPU EP, high CV means thermal/scheduling noise — reject and retry later.
+        Returns (p50_ms, cv). p50_ms=None means unstable or command failed.
+        """
+        sr = bench_screen(winml=self.winml, model_path=model_path, ep=self.ep, device=self.device)
+        if sr.hard_failed:
+            return None, 999.0
+        if sr.cv is not None and sr.cv > SCREEN_CV_MAX:
+            print(
+                f"     Phase A rejected: CV={sr.cv:.2f} > {SCREEN_CV_MAX}"
+                f" (thermal/scheduling noise on {self.ep.upper()} — cool device and retry)"
+            )
+            return None, sr.cv
+        return sr.p50_ms, sr.cv or 0.0
+
+    def full_bench(self, model_path: Path) -> list[float]:
+        """Phase B: 3 sessions × FULL_ITERS with cool-down. Returns p50 per session."""
+        return bench_full(
+            winml=self.winml,
+            model_path=model_path,
+            ep=self.ep,
+            device=self.device,
+            out_prefix="full",
+            iters=FULL_ITERS,
+            cool_down_s=COOL_DOWN_S,
+        )
+
+    def eval_accuracy(self, out_dir: Path) -> float | None:
+        """Run winml eval; return accuracy (top-1 or cosine). For latency: use bench_*."""
+        model_path = out_dir / "model.onnx"
+        if not model_path.exists():
+            return None
+        result_json = out_dir / "eval_result.json"
+        rc, _, _ = run_cmd(
+            [
+                self.winml,
+                "eval",
+                "-m",
+                str(model_path),
+                "--model-id",
+                self.model_id,
+                "--task",
+                TASK,
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+                "--samples",
+                str(EVAL_SAMPLES),
+                "-o",
+                str(result_json),
+            ],
+            label="winml eval (accuracy gate)",
+        )
+        if rc != 0 or not result_json.exists():
+            return None
+        try:
+            data = json.loads(result_json.read_text(encoding="utf-8"))
+            metrics = data.get("metrics", data)
+            acc = metrics.get("accuracy")
+            return float(acc) if acc is not None else None
+        except Exception as e:
+            print(f"     [warn] parse error: {e}")
+            return None
+
+
+class Reviewer:
+    """Phase 2 Reviewer — ThroughputOnly verdict -> KEEP / MARGINAL / DISCARD.
+
+    Turns Optimizer measurements (full-bench p50s + accuracy) into a verdict via
+    the ThroughputOnly policy, promotes the first successful bench to baseline,
+    and drafts a KB entry for notable confirmed wins.
+    """
+
+    def __init__(
+        self, policy: ThroughputOnly, ep: str, model_id: str, accuracy_floor: float
+    ) -> None:
+        self.policy = policy
+        self.ep = ep
+        self.model_id = model_id
+        self.accuracy_floor = accuracy_floor
+
+    def review(
+        self,
+        label: str,
+        exp_info: dict,
+        screen_cv: float,
+        baseline_p50: float | None,
+        full_p50s: list[float],
+        accuracy: float | None,
+    ) -> tuple[str, dict]:
+        """Judge one hypothesis from its measurements.
+
+        Returns (status_str, updated exp_info). Does not update best_p50/best_label —
+        the orchestrator owns champion tracking so it stays in one place.
+        """
+        med_p50 = median_p50(full_p50s)
+        assert med_p50 is not None
+        exp_info["full_p50s"] = [f"{p:.1f}" for p in full_p50s]
+        exp_info["median_p50"] = f"{med_p50:.1f}"
+
+        # Promote baseline from first successful full bench
+        if baseline_p50 is None:
+            baseline_p50 = med_p50
+            exp_info["baseline_p50"] = f"{baseline_p50:.1f}"
+
+        exp_info["accuracy"] = f"{accuracy:.4f}" if accuracy is not None else "N/A"
+
+        improvement_pct = (baseline_p50 - med_p50) / baseline_p50 * 100
+        delta_pct = -improvement_pct
+        exp_info["delta_pct"] = f"{delta_pct:+.1f}%"
+
+        correctness_pass = accuracy is None or accuracy >= self.accuracy_floor
+        verdict = self.policy.evaluate(
+            VerdictInput(
+                improvement_pct=improvement_pct,
+                cv_pct=screen_cv * 100.0,
+                correctness_pass=correctness_pass,
+            )
+        )
+
+        exp_info["analysis"] = verdict.reasoning
+        if verdict.verdict in ("KEEP", "MARGINAL_KEEP"):
+            status = "keep" + (" (marginal)" if verdict.marginal else "")
+            exp_info["analysis"] = (
+                f"Improvement confirmed: p50 {baseline_p50:.1f}ms -> {med_p50:.1f}ms "
+                f"({delta_pct:+.1f}%). {verdict.reasoning}"
+            )
+            # Auto-write KB draft entry for notable improvements
+            if not verdict.marginal:
+                write_kb_draft(
+                    ep=self.ep,
+                    label=label,
+                    improvement_pct=improvement_pct,
+                    cv=screen_cv,
+                    model_id=self.model_id,
+                    dimension=exp_info.get("dimension", "unknown"),
+                )
+        elif verdict.verdict == "ACC_FAIL":
+            status = f"discard (accuracy {accuracy:.4f} < floor {self.accuracy_floor})"
+        else:
+            status = f"discard ({verdict.reasoning})"
+
+        return status, exp_info
+
+
+def write_experiment_doc(exp_dir: Path, info: dict) -> None:
+    """Write per-experiment structured artifact (V2 pattern):
+    Hypothesis / Implementation / Parity / Perf / Analysis / Decision
+    """
+    exp_dir.mkdir(parents=True, exist_ok=True)
+    doc = f"""# Experiment {info["iter"]:02d}: {info["label"]}
+
+## Hypothesis
+{info.get("hypothesis", "(not recorded)")}
+
+## Implementation
+- Config flags: `{info.get("optim_flags", "")}`
+- Opset: `{info.get("opset", 17)}`
+- Search dimension: `{info.get("dimension", "")}`
+
+## Parity (accuracy gate)
+- Accuracy: `{info.get("accuracy", "N/A")}`
+- Floor: `{ACCURACY_FLOOR}`
+- Result: `{"N/A" if info.get("accuracy", "N/A") == "N/A" else "PASS" if float(info["accuracy"]) >= ACCURACY_FLOOR else "FAIL"}`
+
+## Performance
+### Phase A (quick screen, {SCREEN_ITERS} iters)
+- p50: `{info.get("screen_p50", "N/A")}ms`
+- CV: `{info.get("screen_cv", "N/A")}` (threshold: {SCREEN_CV_MAX})
+
+### Phase B (full bench, {FULL_ITERS}×{FULL_SESSIONS} sessions)
+- p50 per session: `{info.get("full_p50s", [])}`
+- Median p50: `{info.get("median_p50", "N/A")}ms`
+- Baseline p50: `{info.get("baseline_p50", "N/A")}ms`
+- Delta: `{info.get("delta_pct", "N/A")}`
+
+## Analysis
+{info.get("analysis", "(auto-generated: no significant analysis)")}
+
+## Decision
+**{info.get("status", "UNKNOWN").upper()}**
+
+Timestamp: {datetime.now().isoformat(timespec="seconds")}
+"""
+    (exp_dir / "experiment.md").write_text(doc, encoding="utf-8")
+
+
+def write_kb_draft(
+    ep: str, label: str, improvement_pct: float, cv: float, model_id: str, dimension: str
+) -> None:
+    """Append a draft finding to ep_device_knowledge/<ep>_<device>.json when improvement >= 10%.
+
+    The entry gets status='draft' — a human must review and promote to 'confirmed'
+    after Gate 2 validation (>=2 independent models, mechanism understood).
+    """
+    if improvement_pct < 10.0:
+        return
+    kb_path = KB_DIR / f"{ep}_{DEVICE}.json"
+    if not kb_path.exists():
+        return
+    try:
+        kb = json.loads(kb_path.read_text(encoding="utf-8"))
+    except Exception:
+        return
+
+    findings = kb.setdefault("findings", [])
+    # Auto-generate a draft ID: ep-draft-<timestamp>
+    draft_id = f"{ep}-draft-{datetime.now().strftime('%Y%m%d%H%M%S')}"
+
+    # Don't duplicate if same label+model already drafted
+    for f in findings:
+        if (
+            f.get("status") == "draft"
+            and f.get("model_id") == model_id
+            and f.get("title", "").startswith(f"[DRAFT] {label[:30]}")
+        ):
+            return
+
+    draft = {
+        "id": draft_id,
+        "status": "draft",
+        "title": f"[DRAFT] {label} — {improvement_pct:+.1f}% on {model_id}",
+        "model_id": model_id,
+        "dimension": dimension,
+        "improvement_pct": round(improvement_pct, 2),
+        "cv": round(cv, 3),
+        "mechanism_confirmed": False,
+        "note": "Auto-generated draft. Requires Gate 2: >=2 models, mechanism understood.",
+        "action_for_autoconfig": "investigate",
+        "timestamp": datetime.now().isoformat(timespec="seconds"),
+    }
+    findings.append(draft)
+    kb_path.write_text(json.dumps(kb, indent=2), encoding="utf-8")
+    print(f"  [KB draft] Wrote draft entry {draft_id}: {label} ({improvement_pct:+.1f}%)")
+
+
+def log(row: dict) -> None:
+    fields = [
+        "iter",
+        "label",
+        "dimension",
+        "optim_flags",
+        "opset",
+        "accuracy",
+        "screen_p50_ms",
+        "median_p50_ms",
+        "baseline_p50_ms",
+        "delta_pct",
+        "cv",
+        "status",
+        "elapsed_s",
+        "timestamp",
+    ]
+    is_new = not RESULTS_TSV.exists()
+    with RESULTS_TSV.open("a", newline="", encoding="utf-8") as f:
+        w = csv.DictWriter(f, fieldnames=fields, delimiter="\t", extrasaction="ignore")
+        if is_new:
+            w.writeheader()
+        w.writerow(row)
+
+
+def optim_flags(cfg: dict) -> str:
+    flags = [k for k, v in cfg.get("optim", {}).items() if v is True]
+    return ",".join(flags) if flags else "(none)"
+
+
+# ── main loop ─────────────────────────────────────────────────────────────────
+
+
+def main() -> None:
+    WORK_DIR.mkdir(parents=True, exist_ok=True)
+
+    # Load EP knowledge (confirmed entries only)
+    kb = load_ep_knowledge(EP)
+    print(f"\n=== KB loaded for EP={EP} ===")
+    for note in kb["notes"]:
+        print(f"  {note}")
+
+    # Resume from prior session if interrupted
+    session = SessionManager(WORK_DIR)
+
+    sep = "=" * 64
+    print(f"\n{sep}")
+    print(f"  autoconfig search  --  {MODEL_ID}")
+    print(f"  EP: {EP}   eval_samples: {EVAL_SAMPLES}   hypotheses: {len(HYPOTHESES)}")
+    print(
+        f"  Bench: screen={SCREEN_ITERS} iters (CV<{SCREEN_CV_MAX}) -> full={FULL_ITERS}x{FULL_SESSIONS}"
+    )
+    print(f"  Stop: {STOP_CONSECUTIVE_DISCARDS} consecutive DISCARDs OR budget")
+    print(f"  External research trigger: after {EXTERNAL_RESEARCH_TRIGGER} DISCARDs same dimension")
+    print(
+        f"  Verdict: improvement must exceed max({MIN_IMPROVEMENT * 100:.0f}%, {STAT_BAR_MULTIPLIER:.0f}x screen-CV)"
+    )
+    print(
+        f"  Screen early exit: skip full bench if screen improvement < {SCREEN_PASS_MIN_IMPROVEMENT_PCT:.0f}%"
+    )
+    print(f"{sep}\n")
+
+    # Restore state from prior session (if resuming)
+    baseline_p50: float | None = session.baseline_p50
+    best_p50 = session.best_p50
+    best_label = session.best_label
+    consecutive_discards = session.consecutive_discards
+    discard_by_dimension: dict[str, int] = session.discard_by_dimension
+
+    policy = ThroughputOnly(
+        min_improvement_pct=MIN_IMPROVEMENT * 100,
+        stat_bar_multiplier=STAT_BAR_MULTIPLIER,
+    )
+
+    # Phase 2 subagents: Optimizer runs hypotheses, Reviewer judges them.
+    # Explorer is constructed after Phase 1 (it needs the Insight Engine output).
+    optimizer = Optimizer(WINML, MODEL_ID, EP, DEVICE)
+    reviewer = Reviewer(policy, EP, MODEL_ID, ACCURACY_FLOOR)
+
+    # ── Phase 1: Insight Engine ────────────────────────────────────────────────
+    # Run AFTER baseline build so we have a real ONNX to analyse.
+    # The baseline ONNX is expected at WORK_DIR/iter_00/model.onnx once h0 has run.
+    # On first run the baseline may not exist yet — insight falls back gracefully.
+    baseline_onnx = WORK_DIR / "iter_00" / "model.onnx"
+    insight = build_insight(
+        onnx_path=baseline_onnx,
+        winml=WINML,
+        ep=EP,
+        device=DEVICE,
+        hypotheses=HYPOTHESES,
+        kb=kb,
+    )
+
+    # Explorer (Phase 2 "what to try next"): owns the priority_queue + skip rules.
+    explorer = Explorer(HYPOTHESES, kb, insight)
+
+    # Baseline op-type histogram (from Phase 1 graph analysis) — used to detect
+    # graph passes that turn out to be no-ops at build time (optimized graph
+    # identical to baseline).
+    baseline_op_counts: dict[str, int] = dict(insight.graph_info.op_counts)
+
+    for i, (label, patch_fn, dimension) in enumerate(explorer):
+        # Skip iters completed in a prior run
+        if i in session.completed_iters:
+            print(f"  [resume] skipping iter {i} ({label}) — already done")
+            continue
+
+        iter_start = time.time()
+        print(f"\n{'--' * 32}")
+        print(f"  iter {i}  |  {label}  [{dimension}]")
+        print(f"{'--' * 32}")
+
+        # Explorer decides whether to prune this hypothesis (KB hard-block or Insight skip_set)
+        flags_preview = optim_flags(patch_fn(copy.deepcopy(BASELINE)))  # type: ignore[operator]
+        skip_reason = explorer.skip_reason(label, flags_preview)
+        if skip_reason:
+            print(f"  skipped by {skip_reason}")
+            continue
+
+        cfg = patch_fn(copy.deepcopy(BASELINE))  # type: ignore[operator]
+        flags = optim_flags(cfg)
+        opset = cfg["export"]["opset_version"]
+        precision = cfg.get("precision", "fp32")
+        print(f"  optim: {flags}")
+        print(f"  opset: {opset}   precision: {precision}")
+
+        out_dir = WORK_DIR / f"iter_{i:02d}"
+        exp_dir = WORK_DIR / "experiments" / f"{i:02d}_{dimension}"
+        ok, _ = optimizer.build(cfg, out_dir)
+
+        exp_info: dict = {
+            "iter": i,
+            "label": label,
+            "dimension": dimension,
+            "optim_flags": flags,
+            "opset": opset,
+            "precision": precision,
+            "hypothesis": label,
+            "baseline_p50": f"{baseline_p50:.1f}" if baseline_p50 else "N/A",
+        }
+
+        if not ok:
+            status = "crash"
+            exp_info["analysis"] = "winml build failed — check build log"
+        elif dimension == "graph_pass" and graph_is_noop(
+            out_dir / "model.onnx", baseline_op_counts
+        ):
+            # Optimized graph is identical to baseline — the pass matched nothing.
+            status = "discard (no-op: optimized graph identical to baseline — pass did not fire)"
+            exp_info["analysis"] = (
+                "Post-build graph analysis: the optimized model has the same op-type "
+                "counts as the baseline, so this graph pass matched no pattern and was "
+                "a no-op. Screen + full bench skipped."
+            )
+            exp_info["graph_delta"] = "none (0 nodes changed)"
+        else:
+            # Optimizer Phase A: quick screen
+            screen_p50, screen_cv = optimizer.screen(out_dir / "model.onnx")
+            exp_info["screen_p50"] = f"{screen_p50:.1f}" if screen_p50 else "UNSTABLE"
+            exp_info["screen_cv"] = f"{screen_cv:.3f}"
+
+            screen_improvement_pct = (
+                (baseline_p50 - screen_p50) / baseline_p50 * 100
+                if (screen_p50 is not None and baseline_p50 is not None)
+                else None
+            )
+
+            if screen_p50 is None:
+                status = "discard (screen failed or unstable: CV too high)"
+                exp_info["analysis"] = (
+                    f"Phase A rejected: screen command failed or CV={screen_cv:.2f} > {SCREEN_CV_MAX}. "
+                    f"Likely thermal or scheduling noise on {EP.upper()} EP. Cool device and retry."
+                )
+            elif (
+                screen_improvement_pct is not None
+                and screen_improvement_pct < SCREEN_PASS_MIN_IMPROVEMENT_PCT
+            ):
+                # Screen early exit: skip full bench when screen shows negligible gain.
+                # Saves 3x full-bench time for clearly non-improving configs.
+                status = (
+                    f"discard (screen early exit: improvement {screen_improvement_pct:+.1f}%"
+                    f" < {SCREEN_PASS_MIN_IMPROVEMENT_PCT:.0f}% — full bench skipped)"
+                )
+                exp_info["analysis"] = (
+                    f"Phase A early exit: screen p50={screen_p50:.1f}ms vs baseline "
+                    f"{baseline_p50:.1f}ms ({screen_improvement_pct:+.1f}% improvement) is "
+                    f"below {SCREEN_PASS_MIN_IMPROVEMENT_PCT:.0f}% threshold. "
+                    f"Full bench skipped — not worth 3x{FULL_ITERS} iters."
+                )
+                exp_info["delta_pct"] = f"{-screen_improvement_pct:+.1f}% (screen estimate)"
+            else:
+                # Optimizer Phase B: full bench + accuracy, then Reviewer verdict.
+                full_p50s = optimizer.full_bench(out_dir / "model.onnx")
+                if not full_p50s:
+                    status = "crash (full bench failed)"
+                    exp_info["analysis"] = "Phase B winml perf returned no data"
+                else:
+                    accuracy = optimizer.eval_accuracy(out_dir)
+                    status, exp_info = reviewer.review(
+                        label=label,
+                        exp_info=exp_info,
+                        screen_cv=screen_cv,
+                        baseline_p50=baseline_p50,
+                        full_p50s=full_p50s,
+                        accuracy=accuracy,
+                    )
+                    if status.startswith("keep"):
+                        # Orchestrator owns champion tracking
+                        new_p50 = float(exp_info.get("median_p50", best_p50))
+                        if new_p50 < best_p50:
+                            best_p50 = new_p50
+                            best_label = label
+                            status = "keep *** NEW BEST ***"
+
+        # Extract baseline from first successful full bench
+        if baseline_p50 is None and "median_p50" in exp_info:
+            try:
+                baseline_p50 = float(exp_info["median_p50"])
+                exp_info["baseline_p50"] = f"{baseline_p50:.1f}"
+            except (ValueError, TypeError):
+                pass
+
+        # Write per-experiment doc (V2 pattern)
+        exp_info["status"] = status
+        write_experiment_doc(exp_dir, exp_info)
+
+        # Track consecutive discards + external research trigger
+        if "discard" in status or "crash" in status:
+            consecutive_discards += 1
+            discard_by_dimension[dimension] = discard_by_dimension.get(dimension, 0) + 1
+            if discard_by_dimension[dimension] == EXTERNAL_RESEARCH_TRIGGER:
+                print(
+                    f"\n  EXTERNAL RESEARCH TRIGGER: {EXTERNAL_RESEARCH_TRIGGER} consecutive DISCARDs in [{dimension}]"
+                )
+                print("     -> Search ORT/QNN source code for mechanism before continuing")
+                print(
+                    "     -> Check kMaxSupportedOpset for opset dimension, EP-specific rules for others"
+                )
+                print(
+                    f"     -> File findings in ep_device_knowledge/{EP}_{DEVICE}.json as 'draft' entry"
+                )
+        else:
+            consecutive_discards = 0
+            discard_by_dimension[dimension] = 0
+
+        # Log to TSV
+        log(
+            {
+                "iter": i,
+                "label": label,
+                "dimension": dimension,
+                "optim_flags": flags,
+                "opset": opset,
+                "accuracy": exp_info.get("accuracy", "N/A"),
+                "screen_p50_ms": exp_info.get("screen_p50", "N/A"),
+                "median_p50_ms": exp_info.get("median_p50", "N/A"),
+                "baseline_p50_ms": exp_info.get("baseline_p50", "N/A"),
+                "delta_pct": exp_info.get("delta_pct", "N/A"),
+                "cv": exp_info.get("screen_cv", "N/A"),
+                "status": status,
+                "elapsed_s": f"{time.time() - iter_start:.0f}",
+                "timestamp": datetime.now().isoformat(timespec="seconds"),
+            }
+        )
+
+        print(f"  -> {status}")
+
+        # Persist state for crash-resume
+        session.save(
+            iter_idx=i,
+            verdict=status,
+            baseline_p50=baseline_p50,
+            best_p50=best_p50,
+            best_label=best_label,
+            consecutive_discards=consecutive_discards,
+            discard_by_dimension=discard_by_dimension,
+        )
+
+        # Stop condition
+        if consecutive_discards >= STOP_CONSECUTIVE_DISCARDS:
+            print(f"\n  STOP: {STOP_CONSECUTIVE_DISCARDS} consecutive DISCARDs — plateau reached")
+            break
+
+    print(f"\n{sep}")
+    print("  SEARCH COMPLETE")
+    print(f"  Best config: {best_label}")
+    print(f"  Best p50: {best_p50:.1f}ms" if best_p50 < float("inf") else "  No improvement found")
+    print(f"  Results: {RESULTS_TSV}")
+    print(f"  Experiments: {WORK_DIR / 'experiments'}")
+
+    # ── Phase 3: Generate HTML report ─────────────────────────────────────────
+    try:
+        report_path = generate_report(
+            results_tsv=RESULTS_TSV,
+            work_dir=WORK_DIR,
+            model_id=MODEL_ID,
+            ep=EP,
+            insight_notes=insight.notes,
+        )
+        print(f"  Report:    {report_path}")
+    except Exception as e:
+        print(f"  [warn] Report generation failed: {e}")
+
+    print(f"{sep}\n")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/research/autoconfig/skills/reviewer/SKILL.md b/research/autoconfig/skills/reviewer/SKILL.md
new file mode 100644
index 000000000..76e799306
--- /dev/null
+++ b/research/autoconfig/skills/reviewer/SKILL.md
@@ -0,0 +1,56 @@
+---
+name: reviewer
+description: >
+  Use this sub-skill (driven by orchestrator) to JUDGE the measurements an
+  optimizer produced for one hypothesis. It applies the ThroughputOnly
+  verdict policy with a noise-aware threshold (max(1% floor, 2x screen-CV)), enforces
+  the accuracy floor, and returns KEEP / MARGINAL_KEEP / DISCARD / ACC_FAIL with a
+  human-readable rationale. On a real, non-marginal win it drafts a KB entry
+  (status="draft") for later human promotion. It never builds or benchmarks.
+---
+
+# reviewer
+
+The Reviewer is the **"judge it"** sub-skill of the autoconfig loop (Phase 2). It
+turns Optimizer measurements into a verdict. Mirrors the `Reviewer` class in
+`skills/orchestrator/autoconfig.py` (wrapping `ThroughputOnly` from
+`bench_utils.py`) and the Reviewer box in
+`research/autoconfig/docs/autoconfig_diagram.html`.
+
+**Implementation in this folder:** `promote_findings.py` (confidence-gated KB
+promotion, L1→L4 — the offline counterpart to the in-loop KB drafts this skill writes).
+## When to use
+
+Invoked by `orchestrator` after `optimizer` returns
+benchmark + accuracy data. Not used standalone.
+
+## Inputs
+
+- `full_p50s` (per-session p50s) and `accuracy` from the Optimizer.
+- `screen_cv` (drives the statistical threshold) and the current `baseline_p50`.
+
+## Procedure
+
+1. **Baseline promotion** — if no baseline yet, the first successful full bench median becomes the baseline.
+2. **Compute improvement** — `improvement_pct = (baseline - median_p50) / baseline x 100`.
+3. **Accuracy gate** — pass requires `accuracy is None or accuracy >= ACCURACY_FLOOR (0.70)`; else `ACC_FAIL`.
+4. **Verdict (ThroughputOnly)** — statistically honest threshold `max(MIN_IMPROVEMENT 1%, STAT_BAR 2.0 x screen_CV)`:
+   - `KEEP` — improvement > 1.5x threshold.
+   - `MARGINAL_KEEP` — improvement between 1x and 1.5x threshold.
+   - `DISCARD` — improvement below threshold (noise-level).
+   - `ACC_FAIL` — accuracy below floor.
+5. **KB draft** — on a non-marginal KEEP with improvement >= 10%, append a `status="draft"` finding to `ep_device_knowledge/<ep>_<device>.json` (de-duplicated per label+model).
+
+## Outputs
+
+- A status string (`keep` / `keep (marginal)` / `discard (...)`) plus the verdict reasoning written into the experiment record.
+- Optionally, a KB draft entry for human review.
+
+The Reviewer reports the verdict; the orchestrator owns champion tracking and
+applies the verdict to the loop's stop-condition counters.
+
+## Constraints
+
+- Threshold is noise-aware — a delta inside `2x CV` is never reported as a win.
+- KB writes are drafts only; promotion to `confirmed` is a human gate (Gate 2: >=2 independent models + mechanism understood).
+- No hardcoded architecture logic in the verdict; the policy is model-agnostic.
diff --git a/research/autoconfig/skills/reviewer/promote_findings.py b/research/autoconfig/skills/reviewer/promote_findings.py
new file mode 100644
index 000000000..267d60e02
--- /dev/null
+++ b/research/autoconfig/skills/reviewer/promote_findings.py
@@ -0,0 +1,307 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""promote_findings.py — Confidence-gated KB promotion (self-evolution-design Fix #4).
+
+Reads every ``results.json`` produced by the catalog sweeps and applies the
+L1 -> L4 confidence ladder from ``docs/self-evolution-design.html`` §2:
+
+    L1  Observed   — one model, one run: median gain >= L1_GAIN_PCT.
+    L2  Confirmed  — statistically robust on a single model: the hypothesis p50
+                     range is strictly below the baseline range AND the gain
+                     clears the effect-size floor (gain% >= EFFECT_SIZE_CV_MULT x
+                     between-session CV). This is the same anti-DVFS gate the
+                     sweep uses for ``best_gain_reliable``.
+    L3  Generalized — the SAME (ep, flags) signature reaches L2 on >= 2 distinct
+                      models of ONE architecture class (winml ``model_type``).
+    L4  Cross-cutting — the same (ep, flags) signature reaches L2 across >= 3
+                        architecture classes; scope broadens to EP-wide.
+
+Output is written to ``ep_device_knowledge/_auto_promoted.json`` as a DRAFT sink — it
+never clobbers the human-curated ``ep_device_knowledge/<ep>_<device>.json`` files. A human applies
+the promotion checklist in ``ep_device_knowledge/README.md`` before merging anything into
+the curated KB. This keeps "KB holds L3+ only" while protecting curated findings.
+
+Architecture class == winml ``model_type`` (an architecture family such as
+``vit`` / ``resnet`` / ``bert``), never a specific checkpoint — so grouping stays
+universal and contains no hard-coded model logic.
+
+Usage:
+    uv run python promote_findings.py [--root .] [--out ep_device_knowledge/_auto_promoted.json]
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from collections import defaultdict
+from pathlib import Path
+
+# Agent package bootstrap: make the autoconfig root importable for sibling packages.
+_AGENT_ROOT = next(
+    p for p in Path(__file__).resolve().parents if (p / "ep_device_knowledge").is_dir()
+)
+if str(_AGENT_ROOT) not in sys.path:
+    sys.path.insert(0, str(_AGENT_ROOT))
+
+from skills.optimizer.bench_utils import session_cv  # noqa: E402
+
+# Effect-size multiplier — must match the `effect_size_cv_mult` in the sweep config
+# (ep_device_knowledge/<ep>_<device>.json) used by catalog_sweep._effect_size.
+EFFECT_SIZE_CV_MULT = 2.0
+L1_GAIN_PCT = 5.0  # L1 (Observed) minimum median gain
+L3_MIN_MODELS = 2  # distinct models of one arch class for L3
+L4_MIN_ARCH_CLASSES = 3  # distinct arch classes for L4
+
+OK_STATUSES = ("OK", "OK_HIGH_CV")
+
+
+def _hyp_p50s(hyp: dict) -> list[float]:
+    """Per-session p50 list for a hypothesis, tolerant of both sweep schemas.
+
+    QNN sweep nests them under ``full.p50s_ms``; the GPU/CPU sweeps store a flat
+    ``full_p50s_ms``. Returns an empty list when neither is present.
+    """
+    nested = (hyp.get("full") or {}).get("p50s_ms")
+    if nested:
+        return list(nested)
+    flat = hyp.get("full_p50s_ms")
+    return list(flat) if flat else []
+
+
+def _flags_signature(hyp: dict) -> tuple[tuple[str, object], ...]:
+    """Canonical, hashable signature of a hypothesis's config delta.
+
+    Combines opset with the extra_optim flags so that, e.g., ``opset21`` and
+    ``opset21 + bias_softmax_fusion`` are distinct signatures.
+    """
+    sig: dict[str, object] = {}
+    opset = hyp.get("opset")
+    if opset is not None:
+        sig["opset"] = opset
+    for k, v in (hyp.get("extra_optim") or {}).items():
+        sig[k] = v
+    return tuple(sorted(sig.items()))
+
+
+def _flags_label(sig: tuple[tuple[str, object], ...]) -> str:
+    if not sig:
+        return "(baseline)"
+    return " + ".join(f"{k}={v}" for k, v in sig)
+
+
+def _baseline_p50s(hyps: dict) -> list[float]:
+    """Per-session p50 list of the baseline hypothesis (prefer h0, fall back to h1)."""
+    for h_id in ("h0", "h1"):
+        h = hyps.get(h_id, {})
+        if h.get("status") in OK_STATUSES:
+            p50s = _hyp_p50s(h)
+            if p50s:
+                return p50s
+    return []
+
+
+def classify_hypothesis(hyp: dict, base_p50s: list[float]) -> dict | None:
+    """Return per-hypothesis L1/L2 classification vs a baseline, or None if not OK.
+
+    Mirrors the effect-size gate in catalog_sweep._effect_size so promotion
+    and the sweep agree on what "reliable" means.
+    """
+    if hyp.get("status") not in OK_STATUSES:
+        return None
+    p50s = _hyp_p50s(hyp)
+    base_med = sorted(base_p50s)[len(base_p50s) // 2] if base_p50s else None
+    hyp_med = sorted(p50s)[len(p50s) // 2] if p50s else None
+    if not base_p50s or not p50s or not base_med or not hyp_med:
+        return None
+
+    gain_pct = (base_med - hyp_med) / base_med * 100
+    noise_cv = max(session_cv(base_p50s), session_cv(p50s))
+    noise_floor_pct = EFFECT_SIZE_CV_MULT * noise_cv * 100
+    ranges_separated = max(p50s) < min(base_p50s)
+    effect_size_ok = gain_pct >= noise_floor_pct
+    reliable = bool(effect_size_ok and ranges_separated and gain_pct > 0)
+
+    level = 0
+    if reliable:
+        level = 2
+    elif gain_pct >= L1_GAIN_PCT:
+        level = 1
+    return {
+        "gain_pct": round(gain_pct, 2),
+        "noise_floor_pct": round(noise_floor_pct, 2),
+        "ranges_separated": ranges_separated,
+        "level": level,
+    }
+
+
+def collect(root: Path) -> list[dict]:
+    """Walk every catalog-*-sweep/*/results.json and emit per-hypothesis records."""
+    records: list[dict] = []
+    for results_path in sorted(root.glob("catalog-*-sweep/*/results.json")):
+        try:
+            r = json.loads(results_path.read_text(encoding="utf-8"))
+        except Exception as e:
+            print(f"  [warn] skipping {results_path}: {e}")
+            continue
+        hyps = r.get("hypotheses") or {}
+        base_p50s = _baseline_p50s(hyps)
+        if not base_p50s:
+            continue
+        model_id = r.get("model_id", results_path.parent.name)
+        arch_class = r.get("model_type") or "unknown"
+        ep = r.get("ep", "unknown")
+        device = r.get("device", "unknown")
+        for h_id, hyp in hyps.items():
+            if h_id in ("h0", "h1"):  # baselines are not candidates
+                continue
+            cls = classify_hypothesis(hyp, base_p50s)
+            if not cls or cls["level"] < 1:
+                continue
+            sig = _flags_signature(hyp)
+            if not sig:
+                continue
+            records.append(
+                {
+                    "model_id": model_id,
+                    "arch_class": arch_class,
+                    "ep": ep,
+                    "device": device,
+                    "hyp_id": h_id,
+                    "label": hyp.get("label", h_id),
+                    "flags_sig": sig,
+                    "flags": _flags_label(sig),
+                    **cls,
+                }
+            )
+    return records
+
+
+def promote(records: list[dict]) -> dict:
+    """Apply the L1->L4 ladder to per-hypothesis records."""
+    l1 = [r for r in records if r["level"] >= 1]
+    l2 = [r for r in records if r["level"] >= 2]
+
+    # L3: same (ep, device, flags_sig, arch_class) reaching L2 on >= N distinct models.
+    by_arch: dict[tuple, list[dict]] = defaultdict(list)
+    for r in l2:
+        by_arch[(r["ep"], r["device"], r["flags_sig"], r["arch_class"])].append(r)
+    l3 = []
+    for (ep, device, sig, arch), evs in by_arch.items():
+        models = sorted({e["model_id"] for e in evs})
+        if len(models) >= L3_MIN_MODELS:
+            l3.append(
+                {
+                    "ep": ep,
+                    "device": device,
+                    "arch_class": arch,
+                    "flags": _flags_label(sig),
+                    "models": models,
+                    "mean_gain_pct": round(sum(e["gain_pct"] for e in evs) / len(evs), 2),
+                    "evidence": [
+                        {
+                            "model_id": e["model_id"],
+                            "hyp_id": e["hyp_id"],
+                            "gain_pct": e["gain_pct"],
+                        }
+                        for e in evs
+                    ],
+                }
+            )
+
+    # L4: same (ep, device, flags_sig) reaching L2 across >= M distinct arch classes.
+    by_flags: dict[tuple, list[dict]] = defaultdict(list)
+    for r in l2:
+        by_flags[(r["ep"], r["device"], r["flags_sig"])].append(r)
+    l4 = []
+    for (ep, device, sig), evs in by_flags.items():
+        arches = sorted({e["arch_class"] for e in evs})
+        if len(arches) >= L4_MIN_ARCH_CLASSES:
+            l4.append(
+                {
+                    "ep": ep,
+                    "device": device,
+                    "flags": _flags_label(sig),
+                    "arch_classes": arches,
+                    "models": sorted({e["model_id"] for e in evs}),
+                    "mean_gain_pct": round(sum(e["gain_pct"] for e in evs) / len(evs), 2),
+                }
+            )
+
+    def _public(r: dict) -> dict:
+        return {k: v for k, v in r.items() if k != "flags_sig"}
+
+    return {
+        "L1_observed": [_public(r) for r in l1],
+        "L2_confirmed_single_model": [_public(r) for r in l2],
+        "L3_generalized_arch_rule": l3,
+        "L4_cross_cutting_rule": l4,
+    }
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Confidence-gated KB promotion (L1->L4).")
+    parser.add_argument(
+        "--root",
+        type=Path,
+        default=_AGENT_ROOT,
+        help="autoconfig root containing catalog-*-sweep/ dirs (default: agent root)",
+    )
+    parser.add_argument(
+        "--out",
+        type=Path,
+        default=None,
+        help="output draft file (default: <root>/ep_device_knowledge/_auto_promoted.json)",
+    )
+    args = parser.parse_args()
+    root: Path = args.root
+    out: Path = args.out or (root / "ep_device_knowledge" / "_auto_promoted.json")
+
+    records = collect(root)
+    ladder = promote(records)
+    payload = {
+        "_meta": {
+            "generated_by": "promote_findings.py",
+            "status": "draft",
+            "note": (
+                "Auto-generated promotion candidates. NOT curated KB. Apply the "
+                "promotion checklist in ep_device_knowledge/README.md (paired A/B, clean "
+                "baseline, effect-size > noise floor, independent reruns, "
+                "baseline-drift check) before merging into <ep>_<device>.json."
+            ),
+            "gates": {
+                "L1_gain_pct": L1_GAIN_PCT,
+                "L2_effect_size_cv_mult": EFFECT_SIZE_CV_MULT,
+                "L3_min_models": L3_MIN_MODELS,
+                "L4_min_arch_classes": L4_MIN_ARCH_CLASSES,
+            },
+        },
+        **ladder,
+    }
+    out.parent.mkdir(parents=True, exist_ok=True)
+    out.write_text(json.dumps(payload, indent=2), encoding="utf-8")
+
+    print(f"promote_findings: scanned {len(records)} qualifying hypothesis record(s)")
+    print(f"  L1 observed              : {len(ladder['L1_observed'])}")
+    print(f"  L2 confirmed (1 model)   : {len(ladder['L2_confirmed_single_model'])}")
+    print(f"  L3 generalized (arch)    : {len(ladder['L3_generalized_arch_rule'])}")
+    print(f"  L4 cross-cutting         : {len(ladder['L4_cross_cutting_rule'])}")
+    for r in ladder["L3_generalized_arch_rule"]:
+        print(
+            f"  [L3] {r['ep']}/{r['device']} {r['arch_class']}: {r['flags']} "
+            f"on {len(r['models'])} models (+{r['mean_gain_pct']}%)"
+        )
+    for r in ladder["L4_cross_cutting_rule"]:
+        print(
+            f"  [L4] {r['ep']}/{r['device']}: {r['flags']} "
+            f"across {len(r['arch_classes'])} arch classes (+{r['mean_gain_pct']}%)"
+        )
+    print(f"  draft written: {out}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/research/autoconfig/tools/catalog_sweep.py b/research/autoconfig/tools/catalog_sweep.py
new file mode 100644
index 000000000..a4a4a5834
--- /dev/null
+++ b/research/autoconfig/tools/catalog_sweep.py
@@ -0,0 +1,1347 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""catalog_sweep.py — unified, JSON-driven EP/device optimization sweep.
+
+Single driver that replaces the per-EP ``catalog_{cpu,gpu,qnn}_sweep.py`` scripts.
+Everything EP/device-specific is read from
+``ep_device_knowledge/<ep>_<device>.json``:
+
+  - ``sweep_config``  : quant/compile policy, screen/full bench protocol,
+                        confirmation, effect-size gate, thermal-awareness,
+                        accuracy eval, paired-A/B availability, timeouts.
+  - ``hypotheses``    : the (id, label, opset, optim, guard) matrix.
+  - ``models``        : the model catalog (id, task, model_type).
+  - ``cross_checks``  : cross-hypothesis finding probes (opset_bypass /
+                        catastrophic_regression / regression_probe).
+
+Per-hypothesis guards:
+  - ``skip_if_gemm``        : skip the hypothesis if the built model already has
+                              Gemm nodes (cpu-002 — matmul_add_fusion is harmful).
+  - ``conv_pct_regression`` : annotate the hypothesis as an expected regression
+                              when Conv% of the baseline build exceeds a threshold
+                              (npu-006 — FusedConv falls back to CPU on QNN NPU).
+
+Bench protocol (config-driven):
+  Phase A : screen (``screen.iters``); on thermal_aware EPs high CV is logged but
+            never blocks Phase B (the multi-session cool-down is the thermal control).
+  Phase B : ``full.sessions`` x ``full.iters`` with cool-down.
+  Phase C : ``confirm_sessions`` extra sessions on the best hypothesis; CONFIRMED
+            only when all session p50s fall strictly below the baseline range.
+
+Usage:
+    python tools/catalog_sweep.py --ep qnn --device npu
+    python tools/catalog_sweep.py --ep cpu --device cpu --model microsoft/resnet-18
+    python tools/catalog_sweep.py --ep qnn --device npu --only-hypotheses h6,h7 --paired-ab
+    python tools/catalog_sweep.py --ep qnn --device gpu --list
+
+Results: <results_dir>/<model_slug>/{results.json, report.html, champion_<ep>_<device>.json}, SUMMARY.md.
+"""
+
+from __future__ import annotations
+
+import argparse
+import copy
+import json
+import subprocess
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+# Agent package bootstrap: make the autoconfig root importable for sibling packages.
+_AGENT_ROOT = next(
+    p for p in Path(__file__).resolve().parents if (p / "ep_device_knowledge").is_dir()
+)
+if str(_AGENT_ROOT) not in sys.path:
+    sys.path.insert(0, str(_AGENT_ROOT))
+
+try:
+    from lib.gen_model_report import generate_model_report  # noqa: E402
+except Exception:
+    generate_model_report = None
+
+try:
+    from skills.optimizer.bench_utils import (  # noqa: E402
+        adaptive_paired_ab_bench,
+        run_perf_session,
+    )
+except Exception:
+    adaptive_paired_ab_bench = None
+    run_perf_session = None
+
+
+sys.stdout.reconfigure(encoding="utf-8", errors="replace")  # type: ignore[attr-defined]
+
+KB_DIR = _AGENT_ROOT / "ep_device_knowledge"
+WINML = str(_AGENT_ROOT / ".venv" / "Scripts" / "winml.exe")
+
+_OK = ("OK", "OK_HIGH_CV")
+
+
+# ── small perf-json helpers ────────────────────────────────────────────────────
+
+
+def _latency(perf_json: Path) -> tuple[float | None, float | None]:
+    """Return (p50, cv) from a winml perf JSON, or (None, None) if unreadable or p50 <= 0."""
+    try:
+        d = json.loads(perf_json.read_text(encoding="utf-8"))
+        lat = d.get("latency_ms", d)
+        p50 = float(lat.get("p50") or 0)
+        std = float(lat.get("std") or 0)
+        if p50 <= 0:
+            return None, None
+        return p50, std / p50
+    except Exception:
+        return None, None
+
+
+def _median(values: list[float]) -> float:
+    return float(sorted(values)[len(values) // 2])
+
+
+def _session_cv(p50s: list[float]) -> float:
+    """Session-to-session CV (sample std / mean), used as the run-to-run noise floor."""
+    n = len(p50s)
+    if n < 2:
+        return 0.0
+    mean = sum(p50s) / n
+    if mean <= 0:
+        return 0.0
+    var = sum((x - mean) ** 2 for x in p50s) / (n - 1)
+    return (var**0.5) / mean
+
+
+class CatalogSweep:
+    """JSON-driven sweep driver for one (ep, device) combination."""
+
+    def __init__(
+        self, ep: str, device: str, paired_ab: bool = False, prune_artifacts: bool = False
+    ) -> None:
+        kb_path = KB_DIR / f"{ep}_{device}.json"
+        if not kb_path.exists():
+            raise SystemExit(f"ERROR: knowledge base not found: {kb_path}")
+        self.ep = ep
+        self.device = device
+        self.kb = json.loads(kb_path.read_text(encoding="utf-8"))
+        self.cfg = self.kb["sweep_config"]
+        self.hyps: list[dict] = self.kb["hypotheses"]
+        self.models: list[dict] = self.kb["models"]
+        self.cross_checks: list[dict] = self.kb.get("cross_checks", [])
+
+        self.results_dir = _AGENT_ROOT / self.cfg["results_dir"]
+        self.screen = self.cfg["screen"]
+        self.full = self.cfg["full"]
+        self.timeouts = self.cfg["timeouts"]
+        self.baseline_id = (self.cfg.get("baseline_priority") or ["h0"])[0]
+        self.prune_artifacts = prune_artifacts
+        self.paired_ab = paired_ab and self.cfg.get("paired_ab_available", False)
+        if paired_ab and adaptive_paired_ab_bench is None:
+            print(
+                "  [warn] --paired-ab requested but bench_utils unavailable — disabled", flush=True
+            )
+            self.paired_ab = False
+
+    # ── subprocess ─────────────────────────────────────────────────────────────
+
+    def run_cmd(self, cmd: list[str], label: str = "", timeout: int = 300) -> tuple[int, str]:
+        t0 = time.monotonic()
+        print(f"  >> {label or ' '.join(cmd[:3])}", flush=True)
+        try:
+            r = subprocess.run(
+                cmd,
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+                encoding="utf-8",
+                errors="replace",
+            )
+            elapsed = time.monotonic() - t0
+            tag = "ok" if r.returncode == 0 else f"rc={r.returncode}"
+            print(f"     {elapsed:.0f}s [{tag}]", flush=True)
+            if r.returncode != 0 and r.stderr.strip():
+                print(f"     stderr: {r.stderr.strip()[:200]}", flush=True)
+            return r.returncode, r.stdout + r.stderr
+        except subprocess.TimeoutExpired:
+            print(f"     TIMEOUT after {timeout}s", flush=True)
+            return -1, "TIMEOUT"
+
+    # ── config / build ─────────────────────────────────────────────────────────
+
+    def _patch_config(self, cfg: dict) -> dict:
+        """Apply quant/compile policy from sweep_config to a base config."""
+        cfg = copy.deepcopy(cfg)
+        if not self.cfg.get("quant"):  # False/None => strip; "auto" => keep
+            cfg["quant"] = None
+        if not self.cfg.get("compile"):
+            cfg["compile"] = None
+        return cfg
+
+    def get_base_config(self, model_id: str, task: str, model_type: str) -> dict | None:
+        tmp = self.results_dir / "_tmp_base_config.json"
+        tmp.parent.mkdir(parents=True, exist_ok=True)
+
+        def _try(extra: list[str]) -> dict | None:
+            cmd = [
+                WINML,
+                "config",
+                "-m",
+                model_id,
+                "-t",
+                task,
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+            ]
+            if not self.cfg.get("compile"):
+                cmd += ["--no-compile"]
+            cmd += ["-o", str(tmp)] + extra
+            rc, out = self.run_cmd(
+                cmd, label=f"winml config --ep {self.ep}", timeout=self.timeouts["config_s"]
+            )
+            if rc == 0 and tmp.exists():
+                try:
+                    cfg = json.loads(tmp.read_text(encoding="utf-8"))
+                    tmp.unlink(missing_ok=True)
+                    return cfg
+                except Exception as e:
+                    print(f"  [warn] config parse error: {e}", flush=True)
+            # Fallback: some builds print the config as a JSON line on stdout.
+            for line in out.splitlines():
+                line = line.strip()
+                if line.startswith("{"):
+                    try:
+                        return json.loads(line)
+                    except Exception:
+                        pass
+            tmp.unlink(missing_ok=True)
+            return None
+
+        cfg = _try(["--model-type", model_type])
+        if cfg is None:
+            print("  [warn] config with --model-type failed, retrying without…", flush=True)
+            cfg = _try([])
+        return self._patch_config(cfg) if cfg is not None else None
+
+    @staticmethod
+    def make_hypothesis_config(base: dict, opset: int | None, optim: dict | None) -> dict:
+        cfg = copy.deepcopy(base)
+        if opset is not None and cfg.get("export"):
+            cfg["export"]["opset_version"] = opset
+        if optim:
+            cfg["optim"] = {**(cfg.get("optim") or {}), **optim}
+        return cfg
+
+    def run_build(
+        self,
+        model_id: str,
+        cfg_path: Path,
+        out_dir: Path,
+        build_flags: list[str] | None = None,
+    ) -> tuple[bool, str]:
+        out_dir.mkdir(parents=True, exist_ok=True)
+        cmd = [
+            WINML,
+            "build",
+            "-c",
+            str(cfg_path),
+            "-m",
+            model_id,
+            "-o",
+            str(out_dir),
+            "--ep",
+            self.ep,
+            "--device",
+            self.device,
+        ]
+        if not self.cfg.get("quant"):
+            cmd += ["--no-quant"]
+        if not self.cfg.get("compile"):
+            cmd += ["--no-compile"]
+        if build_flags:
+            cmd += list(build_flags)
+        cmd += ["--rebuild"]
+        rc, out = self.run_cmd(
+            cmd, label=f"winml build [{out_dir.name}]", timeout=self.timeouts["build_s"]
+        )
+        return rc == 0, out
+
+    # ── bench / eval ───────────────────────────────────────────────────────────
+
+    def bench_screen(self, onnx: Path) -> tuple[float | None, float, bool]:
+        out_json = onnx.parent / "screen_perf.json"
+        rc, _ = self.run_cmd(
+            [
+                WINML,
+                "perf",
+                "-m",
+                str(onnx),
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+                "--warmup",
+                str(self.screen["warmup"]),
+                "--iterations",
+                str(self.screen["iters"]),
+                "-o",
+                str(out_json),
+            ],
+            label=f"perf screen ({self.screen['iters']} iters)",
+            timeout=self.timeouts["bench_s"],
+        )
+        if rc != 0 or not out_json.exists():
+            return None, 999.0, False
+        p50, cv = _latency(out_json)
+        if p50 is None:
+            return None, 999.0, False
+        stable = cv <= self.screen["cv_max"]
+        if self.screen.get("thermal_aware") and not stable:
+            tag = "HIGH-CV (DVFS noise — proceeding to Phase B)"
+        else:
+            tag = "stable" if stable else "high-CV"
+        print(f"     screen: p50={p50:.2f}ms  CV={cv:.3f}  [{tag}]", flush=True)
+        return p50, cv, stable
+
+    def bench_full(self, onnx: Path) -> list[float]:
+        p50s: list[float] = []
+        n, cd = self.full["sessions"], self.full["cool_down_s"]
+        for s in range(1, n + 1):
+            out_json = onnx.parent / f"full_perf_s{s}.json"
+            rc, _ = self.run_cmd(
+                [
+                    WINML,
+                    "perf",
+                    "-m",
+                    str(onnx),
+                    "--ep",
+                    self.ep,
+                    "--device",
+                    self.device,
+                    "--warmup",
+                    str(self.full["warmup"]),
+                    "--iterations",
+                    str(self.full["iters"]),
+                    "-o",
+                    str(out_json),
+                ],
+                label=f"perf full s{s}/{n} ({self.full['iters']} iters)",
+                timeout=self.timeouts["bench_s"],
+            )
+            p50, cv = _latency(out_json) if rc == 0 and out_json.exists() else (None, None)
+            if p50 is not None:
+                print(f"     full s{s}: p50={p50:.2f}ms  CV={cv:.3f}", flush=True)
+                p50s.append(p50)
+            else:
+                print(f"     [warn] full bench s{s} failed", flush=True)
+            if s < n:
+                print(f"     cool-down {cd}s…", flush=True)
+                time.sleep(cd)
+        return p50s
+
+    def run_eval(self, onnx: Path, model_id: str, task: str) -> float | None:
+        out_json = onnx.parent / "eval_result.json"
+        rc, _ = self.run_cmd(
+            [
+                WINML,
+                "eval",
+                "-m",
+                str(onnx),
+                "--model-id",
+                model_id,
+                "--task",
+                task,
+                "--ep",
+                self.ep,
+                "--device",
+                self.device,
+                "--samples",
+                str(self.cfg["eval_samples"]),
+                "-o",
+                str(out_json),
+            ],
+            label="winml eval (accuracy gate)",
+            timeout=self.timeouts["eval_s"],
+        )
+        if rc != 0 or not out_json.exists():
+            return None
+        try:
+            data = json.loads(out_json.read_text(encoding="utf-8"))
+            acc = data.get("metrics", data).get("accuracy")
+            if acc is not None:
+                print(f"     eval accuracy: {float(acc):.4f}", flush=True)
+            return float(acc) if acc is not None else None
+        except Exception:
+            return None
+
+    # ── onnx introspection (guards) ─────────────────────────────────────────────
+
+    @staticmethod
+    def _model_has_gemm(onnx_path: Path) -> bool:
+        try:
+            import onnx  # noqa: PLC0415
+
+            m = onnx.load(str(onnx_path))
+            return any(n.op_type == "Gemm" for n in m.graph.node)
+        except Exception:
+            return False
+
+    @staticmethod
+    def _conv_pct(onnx_path: Path) -> tuple[float, int, int]:
+        """Return (conv_pct, conv_count, total). (0.0, 0, 0) means UNKNOWN, not SAFE."""
+        if not onnx_path.exists():
+            return 0.0, 0, 0
+        try:
+            import onnx  # noqa: PLC0415
+
+            ops = [n.op_type for n in onnx.load(str(onnx_path)).graph.node]
+            total = len(ops)
+            conv = sum(1 for o in ops if o == "Conv")
+            return (round(conv / total * 100, 1) if total else 0.0), conv, total
+        except Exception:
+            return 0.0, 0, 0
+
+    @staticmethod
+    def _find_onnx(hyp_dir: Path) -> Path | None:
+        for name in ("model.onnx", "quantized.onnx", "optimized.onnx"):
+            if (hyp_dir / name).exists():
+                return hyp_dir / name
+        ctx = list(hyp_dir.glob("*_ctx*.onnx")) + list(hyp_dir.glob("model_npu*.onnx"))
+        return ctx[0] if ctx else None
+
+    @staticmethod
+    def _op_signature(hyp_dir: Path) -> dict | None:
+        """Read the post-optimize op inventory (total_operators + operator_counts +
+        opset) that ``winml build`` emits to ``analyze_result.json``. This is the
+        ground truth for whether an optim/fusion flag actually changed the graph, so
+        a hypothesis can be diffed against the baseline build. Returns None when the
+        analyze artifact is missing or malformed."""
+        f = hyp_dir / "analyze_result.json"
+        if not f.exists():
+            return None
+        try:
+            md = (json.loads(f.read_text(encoding="utf-8")) or {}).get("metadata") or {}
+        except Exception:
+            return None
+        total = md.get("total_operators")
+        counts = md.get("operator_counts")
+        if total is None or counts is None:
+            return None
+        return {
+            "total_operators": total,
+            "operator_counts": dict(counts),
+            "opset": md.get("opset_version"),
+        }
+
+    @staticmethod
+    def _same_graph(a: dict | None, b: dict | None) -> bool:
+        """True when two op signatures are identical (same opset, total op count and
+        per-op-type counts) — i.e. the flag under test was a NO-OP versus baseline."""
+        if not a or not b:
+            return False
+        return (
+            a.get("opset") == b.get("opset")
+            and a.get("total_operators") == b.get("total_operators")
+            and a.get("operator_counts") == b.get("operator_counts")
+        )
+
+    # ── per-model sweep ─────────────────────────────────────────────────────────
+
+    def sweep_model(
+        self,
+        model_id: str,
+        task: str,
+        model_type: str,
+        only_hyp_ids: set[str] | None = None,
+        reuse_baseline_config: bool = False,
+    ) -> dict:
+        model_slug = model_id.replace("/", "--")
+        model_dir = self.results_dir / model_slug
+        model_dir.mkdir(parents=True, exist_ok=True)
+
+        results_path = model_dir / "results.json"
+        if only_hyp_ids and results_path.exists():
+            try:
+                results = json.loads(results_path.read_text(encoding="utf-8"))
+                print("  [resume] loaded existing results", flush=True)
+            except Exception:
+                results = {}
+        else:
+            results = {}
+
+        results.update(
+            {
+                "model_id": model_id,
+                "task": task,
+                "model_type": model_type,
+                "timestamp": datetime.now().isoformat(timespec="seconds"),
+                "ep": self.ep,
+                "device": self.device,
+            }
+        )
+        # Keep hypotheses loaded on resume; otherwise start from an empty map.
+        results.setdefault("hypotheses", {})
+        for k in (
+            "baseline_opset",
+            "baseline_p50_ms",
+            "best_hypothesis",
+            "best_p50_ms",
+            "best_gain_pct",
+            "conv_pct",
+            "best_gain_verdict",
+        ):
+            results.setdefault(k, None)
+        results.setdefault("errors", [])
+        results.setdefault("feature_gaps", [])
+
+        print(f"\n{'=' * 64}\n  SWEEP [{self.ep}/{self.device}]: {model_id}  [{task}]", flush=True)
+        if only_hyp_ids:
+            print(f"  (delta — only: {sorted(only_hyp_ids)})", flush=True)
+        print("=" * 64, flush=True)
+
+        model_start = time.time()
+
+        # Step 1: base config
+        print("\n[1/3] Generating base config…", flush=True)
+        base_config = None
+        if reuse_baseline_config:
+            bc = model_dir / self.baseline_id / "build_config.json"
+            if bc.exists():
+                try:
+                    base_config = json.loads(bc.read_text(encoding="utf-8"))
+                    print(f"  [reuse] loaded {self.baseline_id} config", flush=True)
+                except Exception:
+                    base_config = None
+        if base_config is None:
+            base_config = self.get_base_config(model_id, task, model_type)
+        if base_config is None:
+            results["errors"].append("base config generation failed")
+            self._finalize(results, model_dir)
+            return results
+
+        results["baseline_opset"] = (base_config.get("export") or {}).get("opset_version", "?")
+        base_quant = "kept" if self.cfg.get("quant") else "NONE"
+        print(f"  baseline opset={results['baseline_opset']}  quant={base_quant}", flush=True)
+
+        # Step 2: hypothesis loop
+        print(f"\n[2/3] Running {len(self.hyps)} hypotheses…", flush=True)
+        conv_pct = 0.0
+        conv_risk = False
+        has_conv_guard = any(
+            (h.get("guard") or {}).get("type") == "conv_pct_regression" for h in self.hyps
+        )
+        gemm_known: bool | None = None
+        baseline_onnx: Path | None = None
+        baseline_sig: dict | None = None
+
+        for hyp in self.hyps:
+            hyp_id = hyp["id"]
+            if only_hyp_ids is not None and hyp_id not in only_hyp_ids:
+                continue
+            model_to = self.timeouts.get("model_s")
+            if model_to and time.time() - model_start > model_to:
+                results["hypotheses"][hyp_id] = {"status": "TIMEOUT", "label": hyp["label"]}
+                results["errors"].append(f"{hyp_id}: model timeout")
+                continue
+
+            label = hyp["label"]
+            guard = hyp.get("guard") or {}
+            sep = "─" * 56
+            print(f"\n{sep}\n  {hyp_id}: {label}\n{sep}", flush=True)
+
+            hyp_config = self.make_hypothesis_config(
+                base_config, hyp.get("opset"), hyp.get("optim")
+            )
+            opset_used = (hyp_config.get("export") or {}).get("opset_version", "?")
+            build_flags = hyp.get("build_flags") or []
+            extra = f"  flags={build_flags}" if build_flags else ""
+            print(f"  opset={opset_used}  optim={hyp.get('optim')}{extra}", flush=True)
+
+            hyp_dir = model_dir / hyp_id
+            hyp_dir.mkdir(parents=True, exist_ok=True)
+            cfg_path = hyp_dir / "build_config.json"
+            cfg_path.write_text(json.dumps(hyp_config, indent=2), encoding="utf-8")
+
+            build_ok, build_out = self.run_build(model_id, cfg_path, hyp_dir, build_flags)
+            if not build_ok:
+                is_to = "TIMEOUT" in build_out
+                results["hypotheses"][hyp_id] = {
+                    "status": "BUILD_TIMEOUT" if is_to else "BUILD_FAIL",
+                    "label": label,
+                    "opset": opset_used,
+                    "build_error": ("build timed out" if is_to else build_out[-400:]),
+                }
+                results["errors"].append(f"{hyp_id}: build failed")
+                if any(
+                    k in build_out.lower() for k in ("unsupported", "not supported", "no handler")
+                ):
+                    results["feature_gaps"].append(f"{hyp_id} ({label}): EP/op unsupported")
+                continue
+
+            onnx_path = self._find_onnx(hyp_dir)
+            if onnx_path is None:
+                results["hypotheses"][hyp_id] = {"status": "NO_ONNX", "label": label}
+                results["errors"].append(f"{hyp_id}: build OK but no ONNX produced")
+                continue
+
+            # Op-count / topology signature of the built graph (npu-011): the first
+            # cut for benefit-gating is "did the flag actually change the graph?".
+            op_sig = self._op_signature(hyp_dir)
+            if hyp_id == self.baseline_id:
+                baseline_sig = op_sig
+
+            # NO-OP short-circuit: a non-baseline hypothesis whose op signature
+            # matches the baseline (same opset, total op count and per-op-type counts)
+            # is assumed not to differ in perf: the flag never fired. Skip the
+            # expensive screen+full bench and reuse the baseline numbers. This is what
+            # separates "applied-but-no-benefit" from "never-applied" (see npu-011) and
+            # it saves one screen + N full sessions per dead hypothesis.
+            if (
+                hyp_id != self.baseline_id
+                and baseline_sig is not None
+                and self._same_graph(op_sig, baseline_sig)
+            ):
+                base_bench = results["hypotheses"].get(self.baseline_id, {})
+                noop = {
+                    "status": "NOOP_SKIPPED",
+                    "verdict": "NO_OP",
+                    "label": label,
+                    "opset": opset_used,
+                    "optim": hyp.get("optim") or {},
+                    "build_flags": build_flags,
+                    "op_signature": op_sig,
+                    "graph_changed": False,
+                    "noop_note": (
+                        "graph identical to baseline (opset + op counts unchanged) — "
+                        "flag did not fire; bench skipped, baseline perf assumed"
+                    ),
+                }
+                base_full = base_bench.get("full")
+                if base_full:
+                    noop["full_ref"] = self.baseline_id
+                    noop["assumed_median_p50_ms"] = base_full.get("median_p50_ms")
+                results["hypotheses"][hyp_id] = noop
+                print(
+                    f"  [no-op] {hyp_id} graph == baseline "
+                    f"(ops={op_sig.get('total_operators')}, opset={op_sig.get('opset')}); "
+                    "bench skipped",
+                    flush=True,
+                )
+                if self.prune_artifacts:
+                    self._prune_hyp_artifacts(hyp_dir)
+                    self._prune_runnable_except_best(results, model_dir)
+                continue
+
+            # Guard: skip_if_gemm (cpu-002)
+            if guard.get("type") == "skip_if_gemm":
+                if gemm_known is None:
+                    opt = hyp_dir / "optimized.onnx"
+                    gemm_known = self._model_has_gemm(opt) if opt.exists() else False
+                if gemm_known:
+                    print(f"  [{guard['finding']}] SKIP {hyp_id}: model has Gemm nodes", flush=True)
+                    results["hypotheses"][hyp_id] = {
+                        "status": "SKIPPED_GUARD",
+                        "label": label,
+                        "opset": opset_used,
+                        "guard": guard["finding"],
+                    }
+                    continue
+
+            # After baseline build: compute Conv% for conv_pct_regression guards (npu-006)
+            if hyp_id == self.baseline_id:
+                baseline_onnx = onnx_path
+                if has_conv_guard:
+                    conv_pct, conv_cnt, conv_total = self._conv_pct(onnx_path)
+                    unknown = conv_pct == 0.0 and conv_total == 0
+                    threshold = next(
+                        (
+                            h["guard"]["threshold_pct"]
+                            for h in self.hyps
+                            if (h.get("guard") or {}).get("type") == "conv_pct_regression"
+                        ),
+                        20.0,
+                    )
+                    conv_risk = unknown or conv_pct > threshold
+                    results["conv_pct"] = None if unknown else conv_pct
+                    print(
+                        f"  [conv-guard] Conv%={'UNKNOWN' if unknown else conv_pct}"
+                        f" ({conv_cnt}/{conv_total}) risk={conv_risk}",
+                        flush=True,
+                    )
+
+            # Guard: conv_pct_regression annotation (npu-006)
+            expected_regression = False
+            if guard.get("type") == "conv_pct_regression" and conv_risk:
+                expected_regression = True
+                print(
+                    f"  [{guard['finding']}] WARNING: {hyp_id} conv fusions on Conv-dense model"
+                    f" (Conv%={conv_pct}) — expect catastrophic regression",
+                    flush=True,
+                )
+
+            # Bench: Phase A screen + Phase B full
+            p50_screen, cv_screen, stable = self.bench_screen(onnx_path)
+            bench: dict = {
+                "status": "PENDING",
+                "label": label,
+                "opset": opset_used,
+                "optim": hyp.get("optim") or {},
+                "op_signature": op_sig,
+                "graph_changed": (
+                    None
+                    if (op_sig is None or baseline_sig is None or hyp_id == self.baseline_id)
+                    else not self._same_graph(op_sig, baseline_sig)
+                ),
+                "screen": {"p50_ms": p50_screen, "cv": round(cv_screen, 4), "stable": stable},
+            }
+            if expected_regression:
+                bench["expected_regression"] = True
+                bench["regression_finding"] = guard["finding"]
+
+            if p50_screen is None:
+                bench["status"] = "SCREEN_FAIL"
+                results["hypotheses"][hyp_id] = bench
+                results["errors"].append(f"{hyp_id}: screen failed")
+                continue
+
+            full_p50s = self.bench_full(onnx_path)
+            if not full_p50s:
+                bench["status"] = "BENCH_FAIL"
+                results["hypotheses"][hyp_id] = bench
+                results["errors"].append(f"{hyp_id}: full bench failed")
+                continue
+
+            median = _median(full_p50s)
+            bench["full"] = {
+                "p50s_ms": [round(p, 3) for p in full_p50s],
+                "median_p50_ms": round(median, 3),
+            }
+            bench["status"] = "OK" if stable else "OK_HIGH_CV"
+
+            # Accuracy eval on the baseline build for image-classification models
+            if (
+                self.cfg.get("accuracy_eval")
+                and hyp_id == self.baseline_id
+                and task == "image-classification"
+            ):
+                bench["accuracy"] = self.run_eval(onnx_path, model_id, task)
+
+            # Opt-in paired A/B (DVFS-cancelling) for non-baseline hypotheses
+            if (
+                self.paired_ab
+                and hyp_id != self.baseline_id
+                and baseline_onnx is not None
+                and adaptive_paired_ab_bench is not None
+                and run_perf_session is not None
+            ):
+                print("  [paired-A/B] interleaving baseline vs hypothesis…", flush=True)
+
+                def _session(p: Path) -> float | None:
+                    return run_perf_session(
+                        WINML,
+                        p,
+                        self.ep,
+                        self.device,
+                        iters=self.full["iters"],
+                        warmup=self.full["warmup"],
+                    )
+
+                bench["paired_ab"] = adaptive_paired_ab_bench(
+                    _session,
+                    baseline_onnx,
+                    onnx_path,
+                    cool_down_s=self.full["cool_down_s"],
+                )
+                pa = bench["paired_ab"]
+                print(f"  [paired-A/B] {pa['verdict']} mean={pa['mean_gain_pct']}%", flush=True)
+
+            results["hypotheses"][hyp_id] = bench
+            if self.prune_artifacts:
+                self._prune_hyp_artifacts(hyp_dir)
+                self._prune_runnable_except_best(results, model_dir)
+
+        # Step 3: verdicts, cross-checks, confirmation
+        # npu-011 roll-up: which flags fired vs were dead no-ops on this model.
+        noop_ids = [
+            hid for hid, h in results["hypotheses"].items() if h.get("status") == "NOOP_SKIPPED"
+        ]
+        if noop_ids:
+            results["noop_hypotheses"] = noop_ids
+            print(
+                f"  [npu-011] {len(noop_ids)} no-op hypotheses (graph == baseline, bench skipped): "
+                f"{', '.join(noop_ids)}",
+                flush=True,
+            )
+        print("\n[3/3] Computing verdicts…", flush=True)
+        self._compute_verdicts(results)
+        self._run_cross_checks(results)
+        self._confirm_pass(results, model_dir)
+        self._finalize(results, model_dir)
+        return results
+
+    # ── verdicts ────────────────────────────────────────────────────────────────
+
+    def _compute_verdicts(self, results: dict) -> None:
+        hyps = results["hypotheses"]
+        min_gain = self.cfg["min_improvement_pct"]
+
+        # baseline: first OK hypothesis in baseline_priority
+        baseline_p50: float | None = None
+        baseline_h: dict = {}
+        for hid in self.cfg.get("baseline_priority", ["h0"]):
+            h = hyps.get(hid, {})
+            if h.get("status") in _OK:
+                baseline_p50 = h.get("full", {}).get("median_p50_ms")
+                if baseline_p50:
+                    baseline_h = h
+                    h["verdict"] = "BASELINE"
+                    break
+        results["baseline_p50_ms"] = baseline_p50
+
+        # regression-probe membership (per-hypothesis verdict override)
+        probe_map: dict[str, dict] = {}
+        for c in self.cross_checks:
+            if c["type"] == "regression_probe":
+                for hid in c["hypotheses"]:
+                    probe_map[hid] = c
+
+        best_p50: float | None = None
+        best_h: str | None = None
+        best_hyp: dict = {}
+        for hid, h in hyps.items():
+            if h.get("status") not in _OK:
+                continue
+            median = h.get("full", {}).get("median_p50_ms")
+            if median is None:
+                continue
+            if baseline_p50 and h.get("verdict") != "BASELINE":
+                gain = (baseline_p50 - median) / baseline_p50 * 100
+                h["gain_vs_baseline_pct"] = round(gain, 2)
+                if hid in probe_map and gain <= probe_map[hid]["gain_threshold_pct"]:
+                    h["verdict"] = "REGRESSION"
+                    h["regression_finding"] = probe_map[hid]["id"]
+                elif gain >= min_gain:
+                    h["verdict"] = "KEEP"
+                elif gain > 0:
+                    h["verdict"] = "MARGINAL"
+                else:
+                    h["verdict"] = "DISCARD"
+            if best_p50 is None or median < best_p50:
+                best_p50, best_h, best_hyp = median, hid, h
+
+        results["best_hypothesis"] = best_h
+        results["best_p50_ms"] = best_p50
+        if baseline_p50 and best_p50 is not None:
+            gain = (baseline_p50 - best_p50) / baseline_p50 * 100
+            results["best_gain_pct"] = round(gain, 2)
+            if self.cfg.get("effect_size_gate"):
+                self._effect_size(results, baseline_h, best_hyp, best_h, gain)
+            elif best_h and best_h != self.baseline_id and gain >= min_gain:
+                results["best_gain_verdict"] = "KEEP"
+            else:
+                results["best_gain_verdict"] = "BASELINE_IS_BEST"
+
+    def _effect_size(
+        self, results: dict, baseline_h: dict, best_hyp: dict, best_h: str | None, gain: float
+    ) -> None:
+        mult = self.cfg.get("effect_size_cv_mult", 2.0)
+        base_p50s = baseline_h.get("full", {}).get("p50s_ms", [])
+        best_p50s = best_hyp.get("full", {}).get("p50s_ms", [])
+        noise = max(_session_cv(base_p50s), _session_cv(best_p50s))
+        noise_floor = round(mult * noise * 100, 2)
+        ranges_sep = bool(best_p50s and base_p50s and max(best_p50s) < min(base_p50s))
+        effect_ok = gain >= noise_floor
+        reliable = bool(effect_ok and ranges_sep and best_h != self.baseline_id)
+        results["best_gain_noise_floor_pct"] = noise_floor
+        results["best_gain_ranges_separated"] = ranges_sep
+        results["best_gain_reliable"] = reliable
+        if best_h == self.baseline_id:
+            results["best_gain_verdict"] = "BASELINE_IS_BEST"
+        elif reliable:
+            results["best_gain_verdict"] = "RELIABLE"
+        elif not effect_ok:
+            results["best_gain_verdict"] = "NEUTRAL_WITHIN_NOISE"
+        else:
+            results["best_gain_verdict"] = "UNRELIABLE_RANGES_OVERLAP"
+        print(
+            f"  [effect-size] best={best_h} gain={gain:+.1f}% noise_floor={noise_floor:.1f}%"
+            f" ranges_sep={ranges_sep} -> {results['best_gain_verdict']}",
+            flush=True,
+        )
+
+    # ── cross-model finding checks ──────────────────────────────────────────────
+
+    def _run_cross_checks(self, results: dict) -> None:
+        for c in self.cross_checks:
+            if c["type"] == "opset_bypass":
+                self._check_opset_bypass(results, c)
+            elif c["type"] == "catastrophic_regression":
+                self._check_catastrophic(results, c)
+            # regression_probe is applied per-hypothesis in _compute_verdicts
+
+    def _check_opset_bypass(self, results: dict, c: dict) -> None:
+        """Generalized npu-001: candidate (opset21) must beat the explicit-opset stress
+        reference by >5% on median with non-overlapping session ranges, AND clear the
+        effect-size gate against the auto-config baseline. Rejects DVFS-inflated references.
+        """
+        cid = c["id"]
+        hyps = results["hypotheses"]
+        cand = hyps.get(c["candidate"], {})
+        stress = hyps.get(c["stress_ref"], {})
+        base = hyps.get(c.get("baseline_ref", ""), {})
+        key = f"{cid}_generalized"
+        rkey = f"{cid}_ranges_non_overlapping"
+        results.setdefault(key, None)
+        results.setdefault(rkey, None)
+
+        if stress.get("status") not in _OK or cand.get("status") not in _OK:
+            missing = [
+                r
+                for r, d in ((c["stress_ref"], stress), (c["candidate"], cand))
+                if d.get("status") not in _OK
+            ]
+            results[key] = f"N/A ({', '.join(missing)} not OK)"
+            return
+
+        cand_p50 = cand["full"].get("median_p50_ms", float("inf"))
+        stress_p50 = stress["full"].get("median_p50_ms", float("inf"))
+        cand_p50s = cand["full"].get("p50s_ms", [cand_p50])
+        stress_p50s = stress["full"].get("p50s_ms", [stress_p50])
+        median_gain = cand_p50 < stress_p50 * 0.95
+        median_loss = stress_p50 < cand_p50 * 0.95
+        ranges_sep = max(cand_p50s) < min(stress_p50s) if cand_p50s and stress_p50s else None
+        results[rkey] = ranges_sep
+
+        # Guard 1: stress reference must be reliable (not high-CV / DVFS-thrashing).
+        if stress.get("status") == "OK_HIGH_CV":
+            results[key] = "N/A (high-CV stress reference)"
+            print(f"  [{cid}] N/A: explicit-opset reference is HIGH-CV", flush=True)
+            return
+
+        # Guard 2: candidate must also beat the auto-config baseline by effect size.
+        mult = self.cfg.get("effect_size_cv_mult", 2.0)
+        beats_baseline: bool | None = None
+        if base.get("status") in _OK:
+            base_p50s = base["full"].get("p50s_ms", [])
+            base_p50 = base["full"].get("median_p50_ms")
+            if base_p50s and base_p50 and cand_p50s:
+                gvb = (base_p50 - cand_p50) / base_p50 * 100
+                floor = mult * max(_session_cv(base_p50s), _session_cv(cand_p50s)) * 100
+                beats_baseline = gvb >= floor and max(cand_p50s) < min(base_p50s)
+
+        # Guard 3: a decisive paired-A/B verdict overrides the sequential medians.
+        pab = (cand.get("paired_ab") or {}).get("verdict")
+        pab_rejects = pab in ("MARGINAL", "DISCARD")
+
+        if beats_baseline is False or pab_rejects:
+            results[key] = "neutral"
+            print(f"  [{cid}] NEUTRAL vs auto-config baseline", flush=True)
+        elif median_gain and ranges_sep:
+            results[key] = True
+            print(
+                f"  [{cid}] CONFIRMED: {c['candidate']} beats {c['stress_ref']} + baseline",
+                flush=True,
+            )
+        elif median_gain and not ranges_sep:
+            results[key] = "median_only"
+            print(f"  [{cid}] MARGINAL: median faster but ranges overlap (DVFS noise)", flush=True)
+        elif median_loss:
+            results[key] = False
+            print(f"  [{cid}] NEGATIVE: {c['stress_ref']} faster than {c['candidate']}", flush=True)
+        else:
+            results[key] = "neutral"
+            print(f"  [{cid}] NEUTRAL", flush=True)
+
+    def _check_catastrophic(self, results: dict, c: dict) -> None:
+        """npu-006: flag conv-fusion hypotheses with median p50 >= ratio_threshold x baseline."""
+        cid = c["id"]
+        baseline_p50 = results.get("baseline_p50_ms")
+        ratio = c.get("ratio_threshold", 5.0)
+        hit = False
+        for hid in c["hypotheses"]:
+            h = results["hypotheses"].get(hid, {})
+            if h.get("status") in _OK and baseline_p50:
+                p50 = h.get("full", {}).get("median_p50_ms")
+                if p50 and p50 >= baseline_p50 * ratio:
+                    hit = True
+                    print(
+                        f"  [{cid}] CATASTROPHIC REGRESSION on {hid}:"
+                        f" {p50:.1f}ms vs baseline {baseline_p50:.1f}ms"
+                        f" ({p50 / baseline_p50:.0f}x)",
+                        flush=True,
+                    )
+        results[f"{cid}_regression"] = hit
+
+    # ── confirmation pass (Phase C) ─────────────────────────────────────────────
+
+    def _confirm_pass(self, results: dict, model_dir: Path) -> None:
+        best_h = results.get("best_hypothesis")
+        baseline_p50 = results.get("baseline_p50_ms")
+        if not best_h or best_h == self.baseline_id or not baseline_p50:
+            return
+        if (results.get("best_gain_pct") or 0) < self.cfg["min_improvement_pct"]:
+            return
+        n = self.cfg["confirm_sessions"]
+        if n <= 0:
+            return
+
+        hyp_dir = model_dir / best_h
+        onnx_path = self._find_onnx(hyp_dir)
+        if onnx_path is None:
+            return
+        best_hyp = results["hypotheses"].get(best_h, {})
+        print(f"\n  ── Phase C: confirming {best_h} ({n} extra sessions) ──", flush=True)
+
+        cd = self.full["cool_down_s"]
+        confirm: list[float] = []
+        for s in range(1, n + 1):
+            out_json = hyp_dir / f"confirm_s{s}.json"
+            rc, _ = self.run_cmd(
+                [
+                    WINML,
+                    "perf",
+                    "-m",
+                    str(onnx_path),
+                    "--ep",
+                    self.ep,
+                    "--device",
+                    self.device,
+                    "--warmup",
+                    str(self.full["warmup"]),
+                    "--iterations",
+                    str(self.full["iters"]),
+                    "-o",
+                    str(out_json),
+                ],
+                label=f"confirm s{s}/{n}",
+                timeout=self.timeouts["bench_s"],
+            )
+            p50, _ = _latency(out_json) if rc == 0 and out_json.exists() else (None, None)
+            if p50 is not None:
+                print(f"     confirm s{s}: p50={p50:.2f}ms", flush=True)
+                confirm.append(p50)
+            if s < n:
+                time.sleep(cd)
+        if not confirm:
+            return
+
+        # Baseline session range for the non-overlap test
+        base_h: dict = {}
+        for hid in self.cfg.get("baseline_priority", ["h0"]):
+            if results["hypotheses"].get(hid, {}).get("status") in _OK:
+                base_h = results["hypotheses"][hid]
+                break
+        base_p50s = base_h.get("full", {}).get("p50s_ms", [baseline_p50])
+        all_p50s = best_hyp.get("full", {}).get("p50s_ms", []) + confirm
+        overall_median = _median(all_p50s)
+        overall_gain = (baseline_p50 - overall_median) / baseline_p50 * 100
+        confirmed = max(all_p50s) < min(base_p50s) if base_p50s else False
+
+        best_hyp["confirm_p50s_ms"] = [round(p, 3) for p in confirm]
+        best_hyp["all_p50s_ms"] = [round(p, 3) for p in all_p50s]
+        best_hyp["overall_median_p50_ms"] = round(overall_median, 3)
+        best_hyp["overall_gain_pct"] = round(overall_gain, 2)
+        if confirmed:
+            best_hyp["confirm_verdict"] = "CONFIRMED"
+            results["best_gain_pct"] = round(overall_gain, 2)
+            print(
+                f"  [CONFIRMED] {best_h}: gain={overall_gain:+.1f}% (ranges non-overlapping)",
+                flush=True,
+            )
+        else:
+            best_hyp["confirm_verdict"] = "MARGINAL_UNCONFIRMED"
+            print(f"  [MARGINAL_UNCONFIRMED] {best_h}: ranges overlap — DVFS noise", flush=True)
+
+    # ── outputs ─────────────────────────────────────────────────────────────────
+
+    def _finalize(self, results: dict, model_dir: Path) -> None:
+        out = model_dir / "results.json"
+        out.write_text(json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8")
+        print(f"  Results: {out}", flush=True)
+        self._emit_champion(results, model_dir)
+        try:
+            if generate_model_report is None:
+                raise RuntimeError("gen_model_report unavailable")
+            generate_model_report(results, model_dir / "report.html")
+        except Exception as e:
+            print(f"  [warn] report generation failed: {e}", flush=True)
+
+    def _emit_champion(self, results: dict, model_dir: Path) -> None:
+        """Copy the optimal build's winml_build_config.json into the model folder.
+
+        The champion is the best hypothesis when its gain is reliable, otherwise the
+        baseline (auto) config. The emitted file *is* the winml build config of that
+        hypothesis — the fully-resolved ``winml_build_config.json`` winml writes into
+        the build output dir — so it can be fed straight back to ``winml build -c``.
+        Falls back to the input ``build_config.json`` if the build output config is
+        unavailable (e.g. a results-only checkout). Lives in the per-model folder so
+        all tuning products for a model stay together.
+        """
+        baseline_id = self.baseline_id
+        best_h = results.get("best_hypothesis")
+        reliable = bool(results.get("best_gain_reliable")) and best_h not in (None, baseline_id)
+        champion_h = best_h if reliable else baseline_id
+        if not champion_h:
+            return
+
+        build_config = self._load_winml_build_config(model_dir / champion_h)
+        out_path = model_dir / f"champion_{self.ep}_{self.device}.json"
+        if build_config is None:
+            print(
+                f"  [warn] champion config missing — no winml_build_config.json in"
+                f" {model_dir / champion_h} (run a full sweep to materialize it)",
+                flush=True,
+            )
+            return
+
+        out_path.write_text(
+            json.dumps(build_config, indent=2, ensure_ascii=False), encoding="utf-8"
+        )
+        champ = results.get("hypotheses", {}).get(champion_h, {})
+        champ_p50 = results.get("best_p50_ms") if reliable else results.get("baseline_p50_ms")
+        print(
+            f"  Champion config: {out_path}  "
+            f"[{champion_h} {champ.get('label', '')!r}  p50={champ_p50}ms"
+            f"  reliable_gain={reliable}]",
+            flush=True,
+        )
+
+    @staticmethod
+    def _load_winml_build_config(build_dir: Path) -> dict | None:
+        """Return the build config from a hypothesis' build output dir.
+
+        Prefers the fully-resolved ``winml_build_config.json`` winml persists after a
+        build (also matches the ``<cache_key>_winml_build_config.json`` variant); falls
+        back to the ``build_config.json`` the sweep passed as build input.
+        """
+        candidates = [build_dir / "winml_build_config.json"]
+        candidates += sorted(build_dir.glob("*_winml_build_config.json"))
+        candidates.append(build_dir / "build_config.json")
+        for path in candidates:
+            if path.exists():
+                try:
+                    return json.loads(path.read_text(encoding="utf-8"))
+                except (OSError, ValueError):
+                    continue
+        return None
+
+    @staticmethod
+    def _prune_hyp_artifacts(hyp_dir: Path) -> None:
+        """Delete bulky intermediate ONNX artifacts after a hypothesis is benched.
+
+        Keeps the runnable ``model.onnx``/``quantized.onnx`` (+ ``.data``) — needed
+        by the Phase-C confirm re-bench — and all small JSON/metadata. Removes the
+        large export/optimized graphs (often hundreds of MB each) so long multi-model
+        sweeps don't exhaust disk. Best-effort; failures are ignored.
+        """
+        freed = 0
+        for pat in ("export.onnx", "export.onnx.data", "optimized.onnx", "optimized.onnx.data"):
+            p = hyp_dir / pat
+            if p.exists():
+                try:
+                    freed += p.stat().st_size
+                    p.unlink()
+                except OSError:
+                    pass
+        if freed:
+            print(
+                f"     [prune] freed {freed / 1024 / 1024:.0f} MB of build intermediates",
+                flush=True,
+            )
+
+    def _prune_runnable_except_best(self, results: dict, model_dir: Path) -> None:
+        """Keep only the running-best (and baseline) hypothesis' runnable ONNX.
+
+        The Phase-C confirm only re-benches the single best non-baseline hypothesis,
+        so retaining every hypothesis' ``model.onnx``/``quantized.onnx`` is wasteful —
+        for large models (~1-2 GB each) that accumulation exhausts disk mid-sweep.
+        After each hypothesis benches we therefore delete the runnable graphs of every
+        hypothesis except the current lowest-median one and the baseline.
+        """
+        best_id: str | None = None
+        best_med: float | None = None
+        for hid, h in results["hypotheses"].items():
+            if h.get("status") in _OK:
+                med = h.get("full", {}).get("median_p50_ms")
+                if med is not None and (best_med is None or med < best_med):
+                    best_id, best_med = hid, med
+        if best_id is None:
+            return
+        keep = {best_id, self.baseline_id}
+        freed = 0
+        for hyp_dir in model_dir.glob("h*"):
+            if not hyp_dir.is_dir() or hyp_dir.name in keep:
+                continue
+            for pat in (
+                "model.onnx",
+                "model.onnx.data",
+                "quantized.onnx",
+                "quantized.onnx.data",
+            ):
+                p = hyp_dir / pat
+                if p.exists():
+                    try:
+                        freed += p.stat().st_size
+                        p.unlink()
+                    except OSError:
+                        pass
+        if freed:
+            print(
+                f"     [prune] freed {freed / 1024 / 1024:.0f} MB runnable onnx "
+                f"(keeping best={best_id}, baseline={self.baseline_id})",
+                flush=True,
+            )
+
+    def write_summary(self, all_results: list[dict]) -> None:
+        lines = [
+            f"# {self.ep.upper()} / {self.device.upper()} EP Optimization Sweep — Catalog Models",
+            "",
+            f"Generated: {datetime.now().isoformat(timespec='seconds')}  ",
+            f"EP: `{self.ep}` / device: `{self.device}`  ",
+            f"Protocol: screen {self.screen['iters']} iters, full {self.full['iters']}"
+            f"×{self.full['sessions']} sessions + {self.cfg['confirm_sessions']} confirm  ",
+            "",
+            "## Per-Model Results",
+            "",
+            "| Model | Baseline p50 | Best p50 | Best config | Gain% | Verdict | Notes |",
+            "|-------|-------------|----------|-------------|-------|---------|-------|",
+        ]
+        for r in all_results:
+            mid = r.get("model_id", "?")
+            base = f"{r['baseline_p50_ms']:.1f} ms" if r.get("baseline_p50_ms") else "N/A"
+            best = f"{r['best_p50_ms']:.1f} ms" if r.get("best_p50_ms") else "N/A"
+            best_h = r.get("best_hypothesis") or "N/A"
+            label = (
+                r.get("hypotheses", {}).get(best_h, {}).get("label", "") if best_h != "N/A" else ""
+            )
+            gain = f"{r['best_gain_pct']:.1f}%" if r.get("best_gain_pct") is not None else "N/A"
+            verdict = r.get("best_gain_verdict") or "—"
+            notes = "; ".join(r.get("errors", []))[:60] or "none"
+            lines.append(
+                f"| `{mid}` | {base} | {best} | {best_h} ({label}) | {gain} | {verdict} | {notes} |"
+            )
+
+        # Cross-check section (data-driven from results keys the checks emit)
+        if self.cross_checks:
+            lines += ["", "## Cross-Model Finding Checks", ""]
+            headers = ["Model"]
+            check_keys: list[tuple[str, str]] = []
+            for c in self.cross_checks:
+                if c["type"] == "opset_bypass":
+                    check_keys.append((c["id"], f"{c['id']}_generalized"))
+                elif c["type"] == "catastrophic_regression":
+                    check_keys.append((c["id"], f"{c['id']}_regression"))
+                elif c["type"] == "regression_probe":
+                    check_keys.append((c["id"], None))  # per-hypothesis; summarised below
+            headers += [cid for cid, _ in check_keys]
+            lines.append("| " + " | ".join(headers) + " |")
+            lines.append("|" + "|".join(["---"] * len(headers)) + "|")
+            for r in all_results:
+                row = [f"`{r.get('model_id', '?')}`"]
+                for cid, key in check_keys:
+                    if key is None:
+                        probe = [
+                            h
+                            for h, d in r.get("hypotheses", {}).items()
+                            if d.get("regression_finding") == cid
+                        ]
+                        row.append(", ".join(probe) if probe else "no")
+                    else:
+                        row.append(str(r.get(key, "—")))
+                lines.append("| " + " | ".join(row) + " |")
+
+        self.results_dir.mkdir(parents=True, exist_ok=True)
+        summary_path = self.results_dir / "SUMMARY.md"
+        summary_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+        print(f"\n📄 Summary: {summary_path}", flush=True)
+
+
+# ── CLI ────────────────────────────────────────────────────────────────────────
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Unified JSON-driven EP/device sweep")
+    parser.add_argument("--ep", required=True, help="execution provider (e.g. cpu, qnn, dml)")
+    parser.add_argument("--device", required=True, help="device (e.g. cpu, npu, gpu)")
+    parser.add_argument("--model", help="run a single model (HuggingFace model ID)")
+    parser.add_argument("--task", help="task for single-model run")
+    parser.add_argument("--model-type", dest="model_type", help="model type for single-model run")
+    parser.add_argument(
+        "--only-hypotheses", dest="only_hyp", help="comma-separated hypothesis IDs (e.g. h4,h5,h10)"
+    )
+    parser.add_argument(
+        "--reuse-baseline-config",
+        dest="reuse_base",
+        action="store_true",
+        help="reuse the baseline build_config.json instead of re-running winml config",
+    )
+    parser.add_argument(
+        "--paired-ab",
+        dest="paired_ab",
+        action="store_true",
+        help="opt-in paired A/B: interleave baseline and hypothesis sessions to cancel DVFS",
+    )
+    parser.add_argument("--list", action="store_true", help="list catalog models and exit")
+    parser.add_argument(
+        "--prune-artifacts",
+        dest="prune_artifacts",
+        action="store_true",
+        help="delete bulky export/optimized ONNX intermediates after each hypothesis"
+        " is benched; keeps JSONs and the baseline/best runnable ONNX; for disk-bound sweeps",
+    )
+    args = parser.parse_args()
+
+    sweep = CatalogSweep(
+        args.ep, args.device, paired_ab=args.paired_ab, prune_artifacts=args.prune_artifacts
+    )
+
+    if args.list:
+        print(f"Catalog models for {args.ep}/{args.device}:")
+        for m in sweep.models:
+            print(f"  {m['id']:55s} {m['task']:24s} {m['model_type']}")
+        return
+
+    only_hyp_ids = {h.strip() for h in args.only_hyp.split(",")} if args.only_hyp else None
+    all_results: list[dict] = []
+
+    if args.model:
+        # task/model_type fall back to the catalog entry if present
+        entry = next((m for m in sweep.models if m["id"] == args.model), None)
+        task = args.task or (entry or {}).get("task")
+        mtype = args.model_type or (entry or {}).get("model_type")
+        if not task or not mtype:
+            print("ERROR: --task and --model-type required (model not in catalog)", file=sys.stderr)
+            sys.exit(1)
+        all_results.append(
+            sweep.sweep_model(
+                args.model,
+                task,
+                mtype,
+                only_hyp_ids=only_hyp_ids,
+                reuse_baseline_config=args.reuse_base,
+            )
+        )
+    else:
+        for m in sweep.models:
+            all_results.append(
+                sweep.sweep_model(
+                    m["id"],
+                    m["task"],
+                    m["model_type"],
+                    only_hyp_ids=only_hyp_ids,
+                    reuse_baseline_config=args.reuse_base,
+                )
+            )
+
+    sweep.write_summary(all_results)
+    print(
+        f"\n{'=' * 64}\n  {args.ep.upper()}/{args.device.upper()} SWEEP COMPLETE\n{'=' * 64}",
+        flush=True,
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/research/autoconfig/tools/gen_report_v3.py b/research/autoconfig/tools/gen_report_v3.py
new file mode 100644
index 000000000..806bdddc0
--- /dev/null
+++ b/research/autoconfig/tools/gen_report_v3.py
@@ -0,0 +1,338 @@
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+import datetime
+import json
+from pathlib import Path
+
+results = json.loads(Path(r"ablation-search\results.json").read_text(encoding="utf-8"))
+
+clean_base = [r for r in results if r["name"] in ["base_0", "base_1"]]
+clean_runs = [v for r in clean_base for v in r["p50_runs"]]
+clean_mean = round(sum(clean_runs) / len(clean_runs), 1)
+
+
+def verdict(name, mean):
+    if name in ["base_0", "base_1", "base_2", "base_mid", "base_end"]:
+        return "outlier run" if name == "base_2" else "baseline"
+    if name == "matmul_add":
+        return "CONFIRMED REGRESSION"
+    if name == "matmul_scale":
+        return "probable mild regression"
+    if name.startswith("opset_"):
+        opset = int(name.split("_")[1])
+        if opset >= 19:
+            return "SEVERE REGRESSION (kMaxSupportedOpset bug)"
+        return "neutral"
+    delta = mean - clean_mean
+    if abs(delta) < 5:
+        return "neutral"
+    if delta > 0:  # |delta| >= 5 here, so delta == 5 is also a mild regression
+        return "mild regression"
+    return "possible improvement"
+
+
+def row_class(name):
+    if name in ["base_0", "base_1", "base_mid", "base_end"]:
+        return "row-base"
+    if name == "base_2":
+        return "row-outlier"
+    if name == "matmul_add":
+        return "row-bad"
+    if name.startswith("opset_") and int(name.split("_")[1]) >= 19:
+        return "row-bad"
+    if name in ["matmul_scale"]:
+        return "row-warn"
+    return "row-neutral"
+
+
+rows_html = ""
+for r in results:
+    runs = r["p50_runs"]
+    delta = r["p50_mean"] - clean_mean
+    v = verdict(r["name"], r["p50_mean"])
+    rc = row_class(r["name"])
+    runs_str = " / ".join("%.1f" % x for x in runs)
+    sign = "+" if delta >= 0 else ""
+    rows_html += (
+        '<tr class="%s"><td>%s</td><td>%.1f</td><td>%s%.1f</td>'
+        "<td>%.1f</td><td>%.1f</td><td>%s</td><td>%s</td></tr>\n"
+        % (rc, r["name"], r["p50_mean"], sign, delta, min(runs), max(runs), runs_str, v)
+    )
+
+bar_labels = [
+    r["name"]
+    for r in results
+    if r["name"] not in ["base_0", "base_1", "base_2", "base_mid", "base_end"]
+]
+bar_values = [
+    round(r["p50_mean"], 1)
+    for r in results
+    if r["name"] not in ["base_0", "base_1", "base_2", "base_mid", "base_end"]
+]
+bar_colors = []
+for r in results:
+    if r["name"] in ["base_0", "base_1", "base_2", "base_mid", "base_end"]:
+        continue
+    if r["name"] == "matmul_add" or (
+        r["name"].startswith("opset_") and int(r["name"].split("_")[1]) >= 19
+    ):
+        bar_colors.append("'#dc3545'")
+    elif r["name"] in ["matmul_scale"]:
+        bar_colors.append("'#fd7e14'")
+    elif abs(r["p50_mean"] - clean_mean) < 5:
+        bar_colors.append("'#198754'")
+    else:
+        bar_colors.append("'#ffc107'")
+
+bar_labels_js = json.dumps(bar_labels)
+bar_values_js = json.dumps(bar_values)
+bar_colors_js = ",".join(bar_colors)
+n_bars = len(bar_labels)
+baseline_line = clean_mean
+now_str = datetime.datetime.now().strftime("%Y-%m-%d")
+n_results = len(results)
+
+html = """<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<title>ConvNext CPU Ablation Report</title>
+<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
+<style>
+body{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;margin:0;background:#f8f9fa;color:#212529}
+.container{max-width:1150px;margin:0 auto;padding:24px}
+h1{font-size:1.6rem;border-bottom:2px solid #dee2e6;padding-bottom:8px}
+h2{font-size:1.2rem;color:#495057;margin-top:32px}
+h3{font-size:1rem;color:#495057;margin-top:20px}
+.grid{display:grid;grid-template-columns:repeat(4,1fr);gap:16px;margin:20px 0}
+.card{background:white;border-radius:8px;padding:16px;box-shadow:0 1px 3px rgba(0,0,0,.1)}
+.card-title{font-size:.75rem;color:#6c757d;text-transform:uppercase;letter-spacing:.5px}
+.card-value{font-size:1.8rem;font-weight:700;margin:4px 0}
+.card-sub{font-size:.8rem;color:#6c757d}
+.green{color:#198754}.red{color:#dc3545}.grey{color:#6c757d}
+.banner{border-radius:6px;padding:12px 16px;margin:16px 0;font-size:.88rem}
+.banner-info{background:#d1ecf1;border:1px solid #bee5eb}
+.banner-warn{background:#fff3cd;border:1px solid #ffc107}
+.banner-danger{background:#f8d7da;border:1px solid #f5c6cb}
+table{width:100%;border-collapse:collapse;background:white;border-radius:8px;overflow:hidden;box-shadow:0 1px 3px rgba(0,0,0,.1);font-size:.875rem}
+th{background:#495057;color:white;padding:10px 12px;text-align:left}
+td{padding:8px 12px;border-bottom:1px solid #dee2e6}
+tr.row-base td{background:#f8f9fa;color:#6c757d}
+tr.row-outlier td{background:#fff3cd}
+tr.row-bad td{background:#f8d7da;font-weight:600}
+tr.row-warn td{background:#fff3cd}
+tr.row-neutral td{background:white}
+.chart-box{background:white;border-radius:8px;padding:20px;margin:16px 0;box-shadow:0 1px 3px rgba(0,0,0,.1)}
+.finding{border-left:4px solid #0d6efd;border-radius:0 6px 6px 0;padding:10px 14px;margin:8px 0;background:white;box-shadow:0 1px 2px rgba(0,0,0,.05)}
+.finding.confirmed{border-color:#dc3545}
+.finding.neutral{border-color:#198754}
+.finding.correction{border-color:#6f42c1}
+.finding.weak{border-color:#fd7e14}
+.finding.rootcause{border-color:#dc3545;background:#fff5f5}
+code{background:#f1f3f4;padding:1px 5px;border-radius:3px;font-size:.85em}
+pre{background:#1e1e1e;color:#d4d4d4;border-radius:8px;padding:16px;font-size:.82rem;overflow-x:auto}
+.opt-table{width:100%;border-collapse:collapse;font-size:.875rem;margin:12px 0}
+.opt-table th{background:#6c757d;color:white;padding:7px 10px;text-align:left}
+.opt-table td{padding:6px 10px;border-bottom:1px solid #dee2e6}
+.opt-table tr:nth-child(even) td{background:#f8f9fa}
+</style>
+</head>
+<body>
+<div class="container">
+<h1>&#x1F4CA; ConvNext CPU Ablation &#x2014; Autoconfig POC + Opset Cliff RCA</h1>
+<p style="color:#6c757d;font-size:.9rem">Model: <strong>facebook/convnext-tiny-224</strong> &nbsp;|&nbsp; EP: <strong>CPU</strong> &nbsp;|&nbsp; DATE_PLACEHOLDER &nbsp;|&nbsp; N_RESULTS_PLACEHOLDER experiments &nbsp;|&nbsp; ORT ORTVER_PLACEHOLDER</p>
+
+<div class="banner banner-info">
+<strong>Measurement methodology:</strong> <code>winml perf --ep cpu --warmup 10 --iterations 50</code> &mdash; pure inference latency, no preprocessing.
+3 independent perf runs per config. Metric: p50 (median) latency. Promotion threshold: max(3%, 2&times;&sigma;_baseline).
+</div>
+
+<div class="grid">
+  <div class="card"><div class="card-title">Clean Baseline p50</div><div class="card-value">CLEAN_MEAN_PLACEHOLDERms</div><div class="card-sub">base_0 + base_1, opset=17</div></div>
+  <div class="card"><div class="card-title">Best Config Found</div><div class="card-value green">Baseline</div><div class="card-sub">opset=17, no extra flags</div></div>
+  <div class="card"><div class="card-title">Worst Finding</div><div class="card-value red">+38ms</div><div class="card-sub">matmul-add-fusion</div></div>
+  <div class="card"><div class="card-title">Root Cause Found</div><div class="card-value red" style="font-size:1rem">kMaxSupportedOpset</div><div class="card-sub">Transpose Optimizer gate</div></div>
+</div>
+
+<!-- ==================== ROOT CAUSE SECTION ==================== -->
+<h2>&#x1F50D; Root Cause Analysis: ORT Opset Performance Cliff</h2>
+
+<div class="finding rootcause">
+<strong>&#x274C; ROOT CAUSE IDENTIFIED: ORT <code>kMaxSupportedOpset</code> gates the entire Transpose Optimizer</strong><br><br>
+In <code>onnxruntime/core/optimizer/transpose_optimization/optimizer_api.h</code>:
+<pre>constexpr int64_t kMaxSupportedOpset = 18;  // in ORT v1.14.x
+// Current ORT (v1.24.5): value not yet verified, believed to be 21 or 22
+
+// In onnx_transpose_optimization.cc:
+if (*opset &gt; kMaxSupportedOpset) {
+    return std::nullopt;  // ← ENTIRE Transpose Optimizer skipped silently
+}</pre>
+ConvNext has <strong>42 Transpose nodes</strong> forming a NCHW&harr;NHWC "transpose sandwich" in every block.
+The Transpose Optimizer normally eliminates/merges these (pushing through Add&times;18, Mul&times;18, canceling adjacent inverses).
+When it is bypassed, all 42 Transpose nodes execute as raw memory-layout copy operations &rarr; systemic slowdown.
+</div>
+
+<h3>&#x1F4CA; ORT Optimization Level Experiment (confirms root cause)</h3>
+<table class="opt-table">
+<tr><th>Session Optimization Level</th><th>opset=17</th><th>opset=19</th><th>Ratio</th><th>Explanation</th></tr>
+<tr><td><code>DISABLE_ALL</code></td><td>47.5ms</td><td style="background:#f8d7da;font-weight:600">355ms</td><td style="background:#f8d7da">7.5&times;</td><td>No Transpose Optimizer &rarr; all 42 Transposes execute. v17 model.onnx has pre-fused ops; v19 export has more raw ops.</td></tr>
+<tr><td><code>ENABLE_BASIC</code></td><td>289ms</td><td>315ms</td><td>1.1&times;</td><td>Basic opts run on already-fused model, some interference. Near-parity: Transpose Optimizer not yet active at this level.</td></tr>
+<tr><td><code>ENABLE_EXTENDED</code></td><td>209ms</td><td>241ms</td><td>1.2&times;</td><td>Extended optimizations help both but some overhead from re-optimizing pre-fused model.</td></tr>
+<tr><td><code>ENABLE_ALL</code> (default)</td><td style="background:#d1e7dd">216ms</td><td style="background:#d1e7dd">215ms</td><td style="background:#d1e7dd">1.0&times;</td><td>Transpose Optimizer runs on both. Full parity achieved &mdash; confirms optimizer gap is the entire cause.</td></tr>
+</table>
+
+<div class="banner banner-warn">
+<strong>Why does winml perf show opset=19 as 160ms vs 44ms?</strong>
+winml build pre-applies <code>ORT_ENABLE_ALL</code> and saves <code>model.onnx</code>. winml perf then loads <em>that</em> pre-optimized model.
+For opset=17, <code>kMaxSupportedOpset</code> is satisfied &rarr; Transpose Optimizer ran during build &rarr; model.onnx has fewer effective Transposes.
+For opset=19, <code>kMaxSupportedOpset</code> may have been exceeded in the ORT version used during build &rarr; Transpose Optimizer skipped &rarr; model.onnx retains 42 raw Transposes.
+When winml perf loads model.onnx (with <code>ENABLE_ALL</code> again at runtime), if the runtime ORT version's <code>kMaxSupportedOpset</code> covers 19, the gap partially closes. The residual difference depends on which ORT version winml-cli ships.
+</div>
+
+<h3>&#x1F4CB; <code>kMaxSupportedOpset</code> Version History (verified from ORT git tags)</h3>
+<table class="opt-table">
+<tr><th>ORT Release</th><th>kMaxSupportedOpset</th><th>Effect</th></tr>
+<tr><td>v1.14.x</td><td style="background:#f8d7da">18</td><td>opset &ge; 19 &rarr; Transpose Optimizer DISABLED</td></tr>
+<tr><td>v1.16.x</td><td style="background:#fff3cd">19</td><td>opset &ge; 20 &rarr; disabled</td></tr>
+<tr><td>v1.17.x</td><td style="background:#fff3cd">20</td><td>opset &ge; 21 &rarr; disabled</td></tr>
+<tr><td>v1.18.x</td><td style="background:#fff3cd">21</td><td>opset &ge; 22 &rarr; disabled</td></tr>
+<tr><td>main/HEAD</td><td style="background:#d1e7dd">26</td><td>Fully covered for all current ONNX opsets</td></tr>
+</table>
+
+<h3>&#x1F4DC; ORT Source (exact call chain)</h3>
+<pre>InferenceSession::Initialize()
+  &rarr; graph_transformer_mgr_.ApplyTransformers(graph, Level1)
+      &rarr; TransposeOptimizer::ApplyImpl()           [transpose_optimizer.cc:18]
+          &rarr; onnx_transpose_optimization::Optimize() [onnx_transpose_optimization.cc:3344]
+              &rarr; MakeOptimizerContext(graph, ...)
+                  &rarr; graph.Opset("ai.onnx")         // reads DomainToVersionMap()
+                  &rarr; if opset &gt; kMaxSupportedOpset: return nullopt  // &larr; THE GATE
+              &rarr; if ctx == nullopt: return early    // no optimization performed</pre>
+
+<h3>Why ConvNext is especially sensitive</h3>
+<p style="font-size:.9rem">The Transpose Optimizer can push Transposes through <code>Add</code>, <code>Mul</code>, and simple unary ops. ConvNext has 18&times;(Add + Mul) layer-scale and residual connections between blocks, meaning a single Transpose can cascade through many nodes. With the optimizer enabled, adjacent inverse pairs cancel; without it, every NCHW&harr;NHWC conversion is a full memory copy of the activation tensor.</p>
+
+<!-- ==================== ABLATION RESULTS ==================== -->
+<h2>&#x1F4A1; Ablation Key Findings</h2>
+
+<div class="finding confirmed">
+<strong>&#x274C; CONFIRMED REGRESSION: <code>matmul-add-fusion</code> +38ms</strong><br>
+All 3 independent runs: 63.0 / 70.8 / 111.2ms vs clean baseline ~43.7ms.
+The minimum observed (63ms) is 20ms above the highest clean-baseline run. Not attributable to noise.
+Hypothesis: baseline already converts MatMul+Add&rarr;Gemm (37 Gemm in model.onnx); applying matmul-add-fusion creates redundant or conflicting dispatch. Unconfirmed &mdash; requires op-level profiling.
+</div>
+
+<div class="finding correction">
+<strong>&#x1F4DD; MEASUREMENT CORRECTION: <code>transpose-optimizer</code> is NEUTRAL on inference latency</strong><br>
+An earlier 8-iteration search using <code>winml eval</code> reported +270ms. That measurement included the HF preprocessing pipeline and had no warmup, so it measured <em>application latency</em>, not <em>model inference</em>.
+With <code>winml perf</code> (warmup=10, iter=50): 42.3 / 52.3 / 41.8ms &mdash; indistinguishable from baseline.
+The +270ms was entirely a measurement artifact. Do not cite in user-facing reports.
+</div>
+
+<div class="finding confirmed">
+<strong>&#x274C; CONFIRMED: opset=19&ndash;22 causes 1.9&ndash;3.9&times; regression on this ORT build</strong><br>
+Mechanism confirmed: <code>kMaxSupportedOpset</code> gate in ORT's Transpose Optimizer. All 3 runs per opset are consistent.
+Fix: use opset&le;18 (winml-cli currently defaults to 17) OR upgrade ORT to a version where <code>kMaxSupportedOpset &ge; 22</code> (main branch).
+</div>
+
+<div class="finding neutral">
+<strong>&#x2705; NEUTRAL: <code>nchwc-transformer</code>, <code>transpose-optimizer</code>, opset=18</strong> &mdash; all within noise of baseline (~43.7ms).
+</div>
+
+<div class="finding weak">
+<strong>&#x26A0; PROBABLE MILD REGRESSION: <code>matmul-scale-fusion</code></strong> &mdash; all 3 runs elevated (51.5 / 58.1 / 61.2ms). Weak signal due to baseline drift during experiment.
+</div>
+
+<h2>&#x1F4CA; Per-Config p50 Latency vs Baseline</h2>
+<div class="chart-box"><canvas id="barChart" height="110"></canvas></div>
+
+<h2>&#x1F4CB; Full Results Table</h2>
+<div class="banner banner-warn">
+Phase 0 = baseline &times;3 &nbsp;|&nbsp; Phase 1 = opset/CF flags &nbsp;|&nbsp; Phase 2 = single-flag ablation &nbsp;|&nbsp; Phase 3 = stepwise (no candidates) &nbsp;|&nbsp; Phase 4 = base_end recheck &nbsp;|&nbsp; Phase 5 = opset 19&ndash;22
+</div>
+<table>
+<tr><th>Config</th><th>p50 mean (ms)</th><th>&#x0394; vs baseline</th><th>min</th><th>max</th><th>Runs (ms)</th><th>Verdict</th></tr>
+ROWS_PLACEHOLDER
+</table>
+
+<h2>&#x1F527; Optimal Config</h2>
+<pre># Optimal config: baseline (opset=17, constant_folding=True, no extra flags)
+winml build --model-id facebook/convnext-tiny-224 -o out_cpu/
+winml perf -m out_cpu/model.onnx --ep cpu --warmup 10 --iterations 50
+# Expected: p50 ~43-44ms
+
+# AVOID:
+#   --optimize matmul-add-fusion     (confirmed +38ms regression)
+#   opset_version: 19-22             (kMaxSupportedOpset bug: 1.9-3.9x regression on affected ORT builds)</pre>
+
+<h2>&#x1F9E0; Open Questions</h2>
+<ul style="font-size:.9rem">
+<li><strong>Exact ORT version boundary:</strong> winml-cli ships ORT 1.24.5 (internal versioning). The exact <code>kMaxSupportedOpset</code> value in that build determines whether opset 19-22 is safe. Needs verification against ORT source at that specific commit.</li>
+<li><strong>Why does <code>matmul-add-fusion</code> regress?</strong> 37 Gemm nodes already exist; applying this fusion may create double-fusion or suboptimal kernel selection. Requires <code>--profile</code> to confirm.</li>
+<li><strong>GELU fusion mystery:</strong> baseline model.onnx has <code>com.microsoft/Gelu</code>&times;18 despite <code>GeluFusion</code> being in <code>disabled_optimizers</code>. Source unclear &mdash; likely HF Optimum pre-fuses GELU before ORT.</li>
+</ul>
+
+</div>
+<script>
+const ctx = document.getElementById('barChart').getContext('2d');
+new Chart(ctx, {
+  type: 'bar',
+  data: {
+    labels: BAR_LABELS_JS,
+    datasets: [{
+      label: 'p50 latency (ms)',
+      data: BAR_VALUES_JS,
+      backgroundColor: [BAR_COLORS_JS],
+      borderRadius: 4
+    },{
+      type: 'line',
+      label: 'Clean baseline (BASELINE_LINE_PLACEHOLDERms)',
+      data: Array(N_BARS_PLACEHOLDER).fill(BASELINE_LINE_PLACEHOLDER),
+      borderColor: '#0d6efd', borderDash: [6,3], pointRadius: 0, borderWidth: 2
+    }]
+  },
+  options: {
+    responsive: true,
+    scales: { y: { beginAtZero: false, min: 30, title: { display: true, text: 'p50 latency (ms)' } } },
+    plugins: {
+      legend: { position: 'top' },
+      tooltip: { callbacks: { label: c => c.dataset.label + ': ' + c.raw + 'ms' } }
+    }
+  }
+});
+</script>
+</body>
+</html>"""
+
+import os
+import subprocess
+
+result = subprocess.run(
+    ["python", "-c", "import onnxruntime as ort; print(ort.__version__)"],
+    capture_output=True,
+    encoding="utf-8",
+    cwd=r"C:\tmp\autoconfig-demo",
+    env={
+        **os.environ,
+        "PATH": r"C:\tmp\autoconfig-demo\.venv\Scripts;" + os.environ.get("PATH", ""),
+    },
+)
+ort_ver = result.stdout.strip() or "1.24.5"
+
+html = html.replace("DATE_PLACEHOLDER", now_str)
+html = html.replace("N_RESULTS_PLACEHOLDER", str(n_results))
+html = html.replace("ORTVER_PLACEHOLDER", ort_ver)
+html = html.replace("CLEAN_MEAN_PLACEHOLDER", str(clean_mean))
+html = html.replace("ROWS_PLACEHOLDER", rows_html)
+html = html.replace("BAR_LABELS_JS", bar_labels_js)
+html = html.replace("BAR_VALUES_JS", bar_values_js)
+html = html.replace("BAR_COLORS_JS", bar_colors_js)
+html = html.replace("N_BARS_PLACEHOLDER", str(n_bars))
+html = html.replace("BASELINE_LINE_PLACEHOLDER", str(baseline_line))
+
+with open(r"report.html", "w", encoding="utf-8") as f:
+    f.write(html)
+print("report.html written: %d bytes, %d experiments" % (len(html), n_results))
diff --git a/research/autoconfig/tools/validation_sweep.py b/research/autoconfig/tools/validation_sweep.py
new file mode 100644
index 000000000..e7b328215
--- /dev/null
+++ b/research/autoconfig/tools/validation_sweep.py
@@ -0,0 +1,468 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------------------
+# Copyright (c) Microsoft Corporation. All rights reserved.
+# Licensed under the MIT License.
+# --------------------------------------------------------------------------
+
+"""
+validation_sweep.py — Focused validation sweep for npu-001 and npu-006.
+
+Tests:
+  npu-001: opset17 vs opset21 speedup on Conv+attention hybrid vs pure ViT
+  npu-006: conv fusions regression — confirm MobileViT/DINOv2 are unaffected
+
+Hypotheses (subset of catalog_sweep.py):
+  h0: baseline (auto-config, W8A16)
+  h1: opset 17 explicit
+  h3: opset 21  ← npu-001 test
+  h4: opset 17 + conv fusions  ← npu-006 test
+
+Models:
+  facebook/dinov2-base      → expect npu-001 speedup (larger DINOv2)
+  microsoft/rad-dino        → expect npu-001 speedup (DINOv2 variant)
+  facebook/dino-vitb16      → expect NEUTRAL (pure DINO ViT, no Conv+residual)
+  Intel/dpt-hybrid-midas    → expect npu-001 speedup; npu-006 regression (ResNet backbone)
+
+Output: research/autoconfig/catalog-qnn-sweep/<model-slug>/results_v2.json
+"""
+
+import argparse
+import copy
+import json
+import subprocess
+import sys
+import time
+from datetime import datetime
+from pathlib import Path
+
+sys.stdout.reconfigure(encoding="utf-8", errors="replace")  # type: ignore[attr-defined]
+
+# Autoconfig root = the dir holding ep_device_knowledge/; repo root is two levels above it.
+_AGENT_ROOT = next(
+    p for p in Path(__file__).resolve().parents if (p / "ep_device_knowledge").is_dir()
+)
+BASE_DIR = _AGENT_ROOT
+REPO_ROOT = _AGENT_ROOT.parent.parent  # research/autoconfig/ → research/ → repo root
+WINML = str(REPO_ROOT / ".venv" / "Scripts" / "winml.exe")
+EP = "qnn"
+DEVICE = "npu"
+RESULTS_DIR = BASE_DIR / "catalog-qnn-sweep"
+
+SCREEN_WARMUP = 20
+SCREEN_ITERS = 200
+
+FULL_WARMUP = 50
+FULL_ITERS = 500
+FULL_SESSIONS = 3
+COOL_DOWN_S = 30
+
+# 120 min per model
+# (rad-dino/large models: 450s per bench session × 3 × 3)
+MODEL_TIMEOUT_S = 120 * 60
+BUILD_TIMEOUT_S = 15 * 60
+BENCH_TIMEOUT_S = 15 * 60
+EVAL_TIMEOUT_S = 6 * 60
+
+# Focused hypothesis matrix
+HYPOTHESES = [
+    ("h0", "baseline (auto-config, W8A16)", None, None),
+    ("h1", "opset 17 explicit", 17, None),
+    ("h3", "opset 21 (tests npu-001)", 21, None),
+    (
+        "h4",
+        "opset 17 + conv fusions",
+        17,
+        {
+            "conv_bn_fusion": True,
+            "conv_add_fusion": True,
+            "conv_activation_fusion": True,
+        },
+    ),
+]
+
+# (model_id, task, model_type, run_h4_fusion_test)
+VALIDATION_MODELS = [
+    ("facebook/dinov2-base", "image-feature-extraction", "dinov2", True),
+    ("microsoft/rad-dino", "image-feature-extraction", "dinov2", False),
+    ("facebook/dino-vitb16", "image-feature-extraction", "vit", True),
+    ("Intel/dpt-hybrid-midas", "depth-estimation", "dpt", True),
+]
+
+
+def run_cmd(cmd, label="", timeout=600):
+    t0 = time.time()
+    print(f"  >> {label or cmd[1]}", flush=True)
+    try:
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            encoding="utf-8",
+            errors="replace",
+            timeout=timeout,
+        )
+        elapsed = time.time() - t0
+        tag = "ok" if result.returncode == 0 else f"rc={result.returncode}"
+        print(f"     {elapsed:.0f}s [{tag}]", flush=True)
+        if result.returncode != 0:
+            print(f"     stderr: {(result.stderr or result.stdout or '')[-400:]}", flush=True)
+        return result.returncode, result.stdout + result.stderr, elapsed
+    except subprocess.TimeoutExpired:
+        elapsed = time.time() - t0
+        print(f"     TIMEOUT after {elapsed:.0f}s", flush=True)
+        return -999, f"TIMEOUT after {timeout}s", elapsed
+
+
+def get_base_config(model_id, task, model_type):
+    tmp = RESULTS_DIR / "_tmp_val_cfg.json"
+    tmp.parent.mkdir(parents=True, exist_ok=True)
+
+    def _try(extra):
+        cmd = [
+            WINML,
+            "config",
+            "-m",
+            model_id,
+            "-t",
+            task,
+            "--device",
+            DEVICE,
+            "--ep",
+            EP,
+            "--no-compile",
+            "-o",
+            str(tmp),
+        ] + extra
+        rc, _, _ = run_cmd(cmd, "winml config", 600)
+        if rc == 0 and tmp.exists():
+            try:
+                cfg = json.loads(tmp.read_text(encoding="utf-8"))
+                tmp.unlink(missing_ok=True)
+                return cfg
+            except Exception:
+                pass
+        tmp.unlink(missing_ok=True)
+        return None
+
+    cfg = _try(["--model-type", model_type])
+    if cfg is None:
+        print("  [warn] retrying without --model-type", flush=True)
+        cfg = _try([])
+    return cfg
+
+
+def make_hyp_config(base, opset_override, extra_optim):
+    cfg = copy.deepcopy(base)
+    if opset_override is not None and cfg.get("export"):
+        cfg["export"]["opset_version"] = opset_override
+    if extra_optim is not None:
+        cfg["optim"] = {**(cfg.get("optim") or {}), **extra_optim}
+    return cfg
+
+
+def run_build(model_id, cfg_path, out_dir):
+    out_dir.mkdir(parents=True, exist_ok=True)
+    cmd = [
+        WINML,
+        "build",
+        "-c",
+        str(cfg_path),
+        "-m",
+        model_id,
+        "-o",
+        str(out_dir),
+        "--ep",
+        EP,
+        "--device",
+        DEVICE,
+        "--no-compile",
+        "--rebuild",
+    ]
+    rc, out, _ = run_cmd(cmd, f"winml build [{out_dir.name}]", BUILD_TIMEOUT_S)
+    return rc == 0, out
+
+
+def bench_screen(model_path):
+    out_json = model_path.parent / "val_screen.json"
+    rc, _, _ = run_cmd(
+        [
+            WINML,
+            "perf",
+            "-m",
+            str(model_path),
+            "--ep",
+            EP,
+            "--device",
+            DEVICE,
+            "--warmup",
+            str(SCREEN_WARMUP),
+            "--iterations",
+            str(SCREEN_ITERS),
+            "-o",
+            str(out_json),
+        ],
+        f"perf screen ({SCREEN_ITERS} iters)",
+        BENCH_TIMEOUT_S,
+    )
+    if rc != 0 or not out_json.exists():
+        return None, None, False  # no CV available; callers check `cv is None`
+    try:
+        d = json.loads(out_json.read_text(encoding="utf-8"))
+        lat = d.get("latency_ms", {})
+        p50 = lat.get("p50") if isinstance(lat, dict) else None
+        std = lat.get("std", 0) if isinstance(lat, dict) else 0
+        if not p50:
+            return None, None, False  # no CV available; callers check `cv is None`
+        cv = std / p50
+        stable = cv < 0.15
+        return p50, cv, stable
+    except Exception:
+        return None, None, False  # no CV available; callers check `cv is None`
+
+
+def bench_full(model_path):
+    p50s = []
+    for s in range(FULL_SESSIONS):
+        if s > 0:
+            print(f"  [cool-down {COOL_DOWN_S}s]", flush=True)
+            time.sleep(COOL_DOWN_S)
+        out_json = model_path.parent / f"val_full_s{s}.json"
+        rc, _, _ = run_cmd(
+            [
+                WINML,
+                "perf",
+                "-m",
+                str(model_path),
+                "--ep",
+                EP,
+                "--device",
+                DEVICE,
+                "--warmup",
+                str(FULL_WARMUP),
+                "--iterations",
+                str(FULL_ITERS),
+                "-o",
+                str(out_json),
+            ],
+            f"perf full s{s} ({FULL_ITERS} iters)",
+            BENCH_TIMEOUT_S,
+        )
+        if rc != 0 or not out_json.exists():
+            continue
+        try:
+            d = json.loads(out_json.read_text(encoding="utf-8"))
+            lat = d.get("latency_ms", {})
+            p50 = lat.get("p50") if isinstance(lat, dict) else None
+            if p50:
+                p50s.append(round(p50, 3))
+        except Exception:
+            pass
+    if not p50s:
+        return None, None
+    median = sorted(p50s)[len(p50s) // 2]  # upper median if a session failed (even count)
+    return p50s, round(median, 3)
+
+
+def run_model(model_id, task, model_type, run_h4):
+    slug = model_id.replace("/", "--")
+    print(f"\n{'=' * 60}", flush=True)
+    print(f"  Model: {model_id}", flush=True)
+    print("  Hypotheses: h0, h1, h3" + (", h4" if run_h4 else ""), flush=True)
+    print(f"{'=' * 60}", flush=True)
+
+    out_dir = RESULTS_DIR / slug
+    out_dir.mkdir(parents=True, exist_ok=True)
+    result = {
+        "model_id": model_id,
+        "task": task,
+        "model_type": model_type,
+        "timestamp": datetime.now().isoformat(timespec="seconds"),
+        "ep": EP,
+        "device": DEVICE,
+        "validation_sweep": True,
+        "hypotheses": {},
+        "errors": [],
+    }
+
+    base_cfg = get_base_config(model_id, task, model_type)
+    if base_cfg is None:
+        result["errors"].append("FAILED: could not generate base config")
+        (out_dir / "results_v2.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
+        return result
+
+    t0_model = time.time()
+
+    active_hyps = [
+        (hid, lbl, opset, optim)
+        for hid, lbl, opset, optim in HYPOTHESES
+        if hid in ("h0", "h1", "h3") or (run_h4 and hid == "h4")
+    ]
+
+    for hid, label, opset_override, extra_optim in active_hyps:
+        elapsed_model = time.time() - t0_model
+        if elapsed_model > MODEL_TIMEOUT_S:
+            result["errors"].append(f"Model timed out at {elapsed_model:.0f}s (before {hid})")
+            result["hypotheses"][hid] = {"status": "TIMEOUT", "label": label}
+            continue
+
+        print(f"\n  --- {hid}: {label} ---", flush=True)
+        hyp_dir = out_dir / f"val_{hid}"
+        hyp_dir.mkdir(parents=True, exist_ok=True)
+
+        cfg = make_hyp_config(base_cfg, opset_override, extra_optim)
+        cfg_path = hyp_dir / "config.json"
+        cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
+
+        # Reuse existing build output if already present (avoids re-downloading)
+        # Require optimized.onnx or quantized.onnx as completion signal — export.onnx alone
+        # means the build was truncated before optimization/quantization finished.
+        complete_models = [
+            f for f in hyp_dir.glob("*.onnx") if "optimized" in f.name or "quantized" in f.name
+        ]
+        if complete_models:
+            print(f"  [reuse] existing build in {hyp_dir.name}", flush=True)
+            ok = True
+            build_out = "(reused)"
+        else:
+            ok, build_out = run_build(model_id, cfg_path, hyp_dir)
+        if not ok:
+            result["hypotheses"][hid] = {
+                "status": "BUILD_FAIL",
+                "label": label,
+                "build_error": build_out[-300:],
+            }
+            result["errors"].append(f"{hid}: BUILD_FAIL")
+            continue
+
+        # find model file — prefer quantized > optimized > any
+        model_files = list(hyp_dir.glob("*.onnx"))
+        model_path = next((f for f in model_files if "quantized" in f.name), None)
+        if model_path is None:
+            model_path = next((f for f in model_files if "optimized" in f.name), None)
+        if model_path is None and model_files:
+            model_path = model_files[0]
+        if model_path is None:
+            result["hypotheses"][hid] = {
+                "status": "BUILD_FAIL",
+                "label": label,
+                "build_error": "no .onnx found",
+            }
+            continue
+
+        p50_screen, cv, stable = bench_screen(model_path)
+        # npu-007: For QNN NPU, screen failure (rc!=0, empty output) must NOT gate Phase B.
+        # DVFS thermal noise can cause transient subprocess failures on first inference.
+        # Only skip Phase B if screen hard-failed AND the EP is not QNN NPU.
+        is_npu = EP == "qnn" and DEVICE == "npu"
+        if p50_screen is None and not is_npu:
+            result["hypotheses"][hid] = {
+                "status": "BENCH_FAIL",
+                "label": label,
+                "opset": opset_override or "auto",
+            }
+            continue
+
+        p50s, median = bench_full(model_path)
+        status = "OK" if (cv is None or cv < 0.15) else "OK_HIGH_CV"
+        if not p50s:
+            status = "BENCH_FAIL"
+        result["hypotheses"][hid] = {
+            "status": status,
+            "screen": {
+                "p50_ms": round(p50_screen, 3) if p50_screen is not None else None,
+                "cv": round(cv, 4) if cv is not None else None,
+                "stable": stable,
+                "note": "DVFS noise — high CV expected on QNN NPU" if not stable else None,
+            },
+            "full": {"p50s_ms": p50s, "median_p50_ms": median},
+            "label": label,
+            "opset": opset_override or "auto",
+        }
+        screen_str = f"{p50_screen:.2f}ms" if p50_screen is not None else "N/A"
+        cv_str = f"{cv:.3f}" if cv is not None else "N/A"
+        print(
+            f"  [RESULT {hid}] screen p50={screen_str} CV={cv_str}  full_median={median}ms  sessions={p50s}",
+            flush=True,
+        )
+
+    # Compute npu-001 signal
+    h1 = result["hypotheses"].get("h1", {})
+    h3 = result["hypotheses"].get("h3", {})
+    if h1.get("full") and h3.get("full"):
+        m1 = h1["full"]["median_p50_ms"]
+        m3 = h3["full"]["median_p50_ms"]
+        if m1 and m3:
+            gain = round((m1 - m3) / m1 * 100, 1)
+            result["npu001_opset21_vs_17_gain_pct"] = gain
+            result["npu001_note"] = f"opset21 median {m3}ms vs opset17 {m1}ms = {gain:+.1f}%"
+
+    # Compute npu-006 signal
+    h4 = result["hypotheses"].get("h4", {})
+    if h1.get("full") and h4.get("full"):
+        m1 = h1["full"]["median_p50_ms"]
+        m4 = h4["full"]["median_p50_ms"]
+        if m1 and m4:
+            regression = round((m4 - m1) / m1 * 100, 1)
+            result["npu006_conv_fusion_regression_pct"] = regression
+            result["npu006_note"] = (
+                f"conv fusions median {m4}ms vs no-fusion {m1}ms = {regression:+.1f}%"
+            )
+
+    out_path = out_dir / "results_v2.json"
+    out_path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
+    print(f"\n  [SAVED] {out_path}", flush=True)
+    return result
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Focused npu-001/npu-006 validation sweep")
+    parser.add_argument("--model", help="Run single model by ID")
+    parser.add_argument(
+        "--no-h4", action="store_true", help="Skip h4 (conv fusions) for all models"
+    )
+    args = parser.parse_args()
+
+    models = VALIDATION_MODELS
+    if args.model:
+        models = [
+            (m, t, tp, h4)
+            for m, t, tp, h4 in VALIDATION_MODELS
+            if m == args.model or m.split("/")[-1] == args.model
+        ]
+        if not models:
+            print(f"Model '{args.model}' not in validation list. Available:")
+            for m, t, tp, h4 in VALIDATION_MODELS:
+                print(f"  {m}  ({t}, {tp})")
+            sys.exit(1)
+
+    print(f"\nValidation sweep — {len(models)} model(s)", flush=True)
+    print(
+        f"EP: {EP} / {DEVICE}  Proto: {FULL_SESSIONS}×{FULL_ITERS} iters, {COOL_DOWN_S}s cool-down\n",
+        flush=True,
+    )
+
+    all_results = []
+    for model_id, task, model_type, run_h4 in models:
+        if args.no_h4:
+            run_h4 = False
+        res = run_model(model_id, task, model_type, run_h4)
+        all_results.append(res)
+
+    print("\n" + "=" * 60)
+    print("VALIDATION SUMMARY")
+    print("=" * 60)
+    for r in all_results:
+        mid = r["model_id"]
+        npu001 = r.get("npu001_note", "n/a")
+        npu006 = r.get("npu006_note", "")
+        print(f"  {mid}")
+        print(f"    npu-001: {npu001}")
+        if npu006:
+            print(f"    npu-006: {npu006}")
+        if r.get("errors"):
+            print(f"    errors: {r['errors']}")
+    print("=" * 60)
+
+
+if __name__ == "__main__":
+    main()
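The npu-001 and npu-006 signal arithmetic and the status rule in `run_model` above can be checked in isolation. The following is a minimal sketch, assuming medians are plain floats in milliseconds; the helper names `gain_pct`, `regression_pct`, and `classify_status` are hypothetical and do not appear in the patch. The 0.15 CV threshold and the status strings are taken from the script.

```python
from typing import Optional, Sequence


def gain_pct(baseline_ms: Optional[float], candidate_ms: Optional[float]) -> Optional[float]:
    """npu-001 style signal: positive when the candidate is faster than the baseline."""
    if not baseline_ms or not candidate_ms:
        return None  # missing or zero median: no signal, mirroring `if m1 and m3`
    return round((baseline_ms - candidate_ms) / baseline_ms * 100, 1)


def regression_pct(baseline_ms: Optional[float], candidate_ms: Optional[float]) -> Optional[float]:
    """npu-006 style signal: positive when the candidate is slower than the baseline."""
    if not baseline_ms or not candidate_ms:
        return None
    return round((candidate_ms - baseline_ms) / baseline_ms * 100, 1)


def classify_status(cv: Optional[float], p50s: Sequence[float], cv_threshold: float = 0.15) -> str:
    """Hypothetical helper reproducing the script's status rule."""
    if not p50s:
        return "BENCH_FAIL"  # the full benchmark produced no sessions
    return "OK" if (cv is None or cv < cv_threshold) else "OK_HIGH_CV"


if __name__ == "__main__":
    print(gain_pct(10.0, 8.0))           # opset21 at 8 ms vs opset17 at 10 ms
    print(regression_pct(10.0, 12.5))    # conv fusions at 12.5 ms vs 10 ms
    print(classify_status(0.2, [9.8, 10.1]))
```

The two signals use opposite sign conventions on purpose: a positive gain means the candidate is faster, while a positive regression means it is slower, which matches the `{gain:+.1f}%` and `{regression:+.1f}%` notes written to `results_v2.json`.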
diff --git a/research/skill-plan/skills-design.html b/research/skill-plan/skills-design.html
new file mode 100644
index 000000000..dbf22b359
--- /dev/null
+++ b/research/skill-plan/skills-design.html
@@ -0,0 +1,3614 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<meta name="viewport" content="width=device-width, initial-scale=1">
+<title>WinML CLI Skills Design Doc</title>
+<style>
+:root {
+  --bg: #ffffff;
+  --fg: #1f2328;
+  --muted: #59636e;
+  --border: #d1d9e0;
+  --accent: #0969da;
+  --code-bg: #f6f8fa;
+  --table-stripe: #f6f8fa;
+  --sidebar-bg: #f6f8fa;
+}
+@media (prefers-color-scheme: dark) {
+  :root {
+    --bg: #0d1117; --fg: #e6edf3; --muted: #9198a1; --border: #30363d;
+    --accent: #4493f8; --code-bg: #161b22; --table-stripe: #161b22; --sidebar-bg: #161b22;
+  }
+}
+* { box-sizing: border-box; }
+html { scroll-behavior: smooth; }
+body {
+  margin: 0; background: var(--bg); color: var(--fg);
+  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", "Noto Sans", Helvetica, Arial, sans-serif;
+  font-size: 16px; line-height: 1.65;
+}
+.layout { display: flex; max-width: 1500px; margin: 0 auto; }
+nav.toc {
+  width: 320px; flex-shrink: 0; position: sticky; top: 0; align-self: flex-start;
+  height: 100vh; overflow-y: auto; padding: 24px 16px 60px; border-right: 1px solid var(--border);
+  background: var(--sidebar-bg); font-size: 13.5px;
+}
+nav.toc .toc-title { font-weight: 700; font-size: 14px; text-transform: uppercase; letter-spacing: .04em; color: var(--muted); margin: 0 0 12px; }
+nav.toc ul { list-style: none; margin: 0; padding-left: 14px; }
+nav.toc > ul { padding-left: 0; }
+nav.toc li { margin: 2px 0; }
+nav.toc a { color: var(--fg); text-decoration: none; display: block; padding: 3px 8px; border-radius: 6px; }
+nav.toc a:hover { background: rgba(99,110,123,.15); color: var(--accent); }
+main { flex: 1; min-width: 0; padding: 40px 56px 120px; }
+@media (max-width: 1000px) {
+  nav.toc { display: none; }
+  main { padding: 24px; }
+}
+h1, h2, h3, h4 { line-height: 1.3; margin-top: 1.8em; margin-bottom: .6em; font-weight: 650; scroll-margin-top: 16px; }
+h1 { font-size: 2em; border-bottom: 2px solid var(--border); padding-bottom: .3em; margin-top: 0; }
+h2 { font-size: 1.5em; border-bottom: 1px solid var(--border); padding-bottom: .25em; }
+h3 { font-size: 1.2em; }
+h4 { font-size: 1.02em; }
+a { color: var(--accent); }
+p, li { overflow-wrap: break-word; }
+code {
+  font-family: "SF Mono", "Cascadia Code", Consolas, "Liberation Mono", Menlo, monospace;
+  font-size: 85%; background: var(--code-bg); padding: .2em .4em; border-radius: 6px;
+}
+pre {
+  background: var(--code-bg); border: 1px solid var(--border); border-radius: 8px;
+  padding: 14px 16px; overflow-x: auto; line-height: 1.45;
+}
+pre code { background: none; padding: 0; font-size: 84%; }
+table { border-collapse: collapse; width: 100%; margin: 1em 0; display: block; overflow-x: auto; font-size: 14px; }
+th, td { border: 1px solid var(--border); padding: 7px 12px; text-align: left; vertical-align: top; }
+th { background: var(--sidebar-bg); font-weight: 650; }
+tr:nth-child(even) td { background: var(--table-stripe); }
+blockquote { margin: 1em 0; padding: .2em 1em; color: var(--muted); border-left: 4px solid var(--border); }
+hr { border: none; border-top: 1px solid var(--border); margin: 2.5em 0; }
+.headerlink { text-decoration: none; opacity: 0; margin-left: .4em; font-weight: 400; }
+h1:hover .headerlink, h2:hover .headerlink, h3:hover .headerlink, h4:hover .headerlink { opacity: .5; }
+/* codehilite */
+.codehilite .k, .codehilite .kd, .codehilite .kn { color: #cf222e; }
+.codehilite .s, .codehilite .s1, .codehilite .s2 { color: #0a3069; }
+.codehilite .c, .codehilite .c1, .codehilite .cm { color: var(--muted); font-style: italic; }
+.codehilite .nb, .codehilite .nf { color: #8250df; }
+.codehilite .mi, .codehilite .mf { color: #0550ae; }
+@media (prefers-color-scheme: dark) {
+  .codehilite .k, .codehilite .kd, .codehilite .kn { color: #ff7b72; }
+  .codehilite .s, .codehilite .s1, .codehilite .s2 { color: #a5d6ff; }
+  .codehilite .nb, .codehilite .nf { color: #d2a8ff; }
+  .codehilite .mi, .codehilite .mf { color: #79c0ff; }
+}
+</style>
+</head>
+<body>
+<div class="layout">
+<nav class="toc">
+<p class="toc-title">Contents</p>
+<div class="toc">
+<ul>
+<li><a href="#problem-statement">Problem statement</a></li>
+<li><a href="#overview">Overview</a><ul>
+<li><a href="#user-skills-ranked-by-importance">User skills — ranked by importance</a></li>
+<li><a href="#contributor-skills-ranked-by-importance">Contributor skills — ranked by importance</a></li>
+<li><a href="#user-skill-dependency-graph">User skill dependency graph</a></li>
+<li><a href="#contributor-research-skill">Contributor research skill</a></li>
+<li><a href="#contributor-skill-dependency-graph">Contributor skill dependency graph</a></li>
+</ul>
+</li>
+<li><a href="#design-principle-skills-as-agentic-workflows">Design principle: Skills as agentic workflows</a><ul>
+<li><a href="#the-shift-documentation-automation">The shift: documentation → automation</a></li>
+<li><a href="#structured-output-current-state-and-gaps">Structured output: current state and gaps</a></li>
+<li><a href="#the-gather-analyze-decide-act-skill-structure">The GATHER → ANALYZE → DECIDE → ACT skill structure</a></li>
+<li><a href="#example-debug-model-as-an-agentic-workflow">Example: debug-model as an agentic workflow</a></li>
+</ul>
+</li>
+<li><a href="#validation-confidence-levels-l1l5">Validation confidence levels (L1–L5)</a></li>
+<li><a href="#skill-evaluation">Skill evaluation</a><ul>
+<li><a href="#skill-trigger-eval">Skill trigger eval</a></li>
+<li><a href="#skill-execution-eval">Skill execution eval</a></li>
+<li><a href="#skill-result-eval">Skill result eval</a></li>
+<li><a href="#intermediate-step-eval">Intermediate-step eval</a></li>
+<li><a href="#robust-eval">Robust eval</a></li>
+</ul>
+</li>
+<li><a href="#competitive-analysis">Competitive Analysis</a><ul>
+<li><a href="#summary">Summary</a></li>
+<li><a href="#competitor-feature-matrix">Competitor Feature Matrix</a></li>
+<li><a href="#competitor-deep-dives">Competitor Deep Dives</a></li>
+<li><a href="#top-5-high-impact-gaps-for-winml-cli">Top 5 High-Impact Gaps for winml-cli</a></li>
+<li><a href="#patterns-in-great-toolchain-dx">Patterns in Great Toolchain DX</a></li>
+<li><a href="#whitespace-opportunities-no-competitor-covers">Whitespace Opportunities (No Competitor Covers)</a></li>
+</ul>
+</li>
+<li><a href="#skill-use-winml-cli-existing-extend">Skill: use-winml-cli (existing — extend)</a></li>
+<li><a href="#skill-debug-model">Skill: debug-model</a><ul>
+<li><a href="#frontmatter">Frontmatter</a></li>
+<li><a href="#when-to-use">When to use</a></li>
+<li><a href="#sections">Sections</a></li>
+</ul>
+</li>
+<li><a href="#skill-ship-to-winapp-merge-of-validate-before-ship-prepare-for-winapp">Skill: ship-to-winapp (merge of validate-before-ship + prepare-for-winapp)</a><ul>
+<li><a href="#frontmatter_1">Frontmatter</a></li>
+<li><a href="#when-to-use_1">When to use</a></li>
+<li><a href="#part-a-validate-definition-of-done-gates">Part A — Validate (Definition-of-Done gates)</a></li>
+<li><a href="#part-b-package-integrate-multi-ep">Part B — Package &amp; integrate (multi-EP)</a></li>
+</ul>
+</li>
+<li><a href="#skill-check-model-feasibility-merge-of-find-a-model-ep-compatibility-check">Skill: check-model-feasibility (merge of find-a-model + ep-compatibility-check)</a><ul>
+<li><a href="#frontmatter_2">Frontmatter</a></li>
+<li><a href="#when-to-use_2">When to use</a></li>
+<li><a href="#what-this-skill-does-not-do">What this skill does NOT do</a></li>
+<li><a href="#sections_1">Sections</a></li>
+</ul>
+</li>
+<li><a href="#skill-adding-model-support-contributor">Skill: adding-model-support (contributor)</a><ul>
+<li><a href="#frontmatter_3">Frontmatter</a></li>
+<li><a href="#when-to-use_3">When to use</a></li>
+<li><a href="#sections_2">Sections</a></li>
+</ul>
+</li>
+<li><a href="#skill-adding-ep-support-contributor">Skill: adding-ep-support (contributor)</a><ul>
+<li><a href="#frontmatter_4">Frontmatter</a></li>
+<li><a href="#when-to-use_4">When to use</a></li>
+<li><a href="#sections_3">Sections</a></li>
+</ul>
+</li>
+<li><a href="#skill-contributing-a-skill-contributor">Skill: contributing-a-skill (contributor)</a><ul>
+<li><a href="#frontmatter_5">Frontmatter</a></li>
+<li><a href="#when-to-use_5">When to use</a></li>
+<li><a href="#sections_4">Sections</a></li>
+</ul>
+</li>
+<li><a href="#skill-autoconfig-user-optimize-the-model-automated-loop-manual-framework">Skill: autoconfig (user — optimize the model: automated loop + manual framework)</a><ul>
+<li><a href="#frontmatter_6">Frontmatter</a></li>
+<li><a href="#when-to-use_6">When to use</a></li>
+<li><a href="#what-this-skill-does-not-do_1">What this skill does NOT do</a></li>
+<li><a href="#manual-mode-the-decision-framework-folded-in-from-optimize-for-device">Manual mode — the decision framework (folded in from optimize-for-device)</a></li>
+<li><a href="#epistemic-standard-for-autoconfig-findings">Epistemic standard for autoconfig findings</a></li>
+<li><a href="#design-comparison-gpu-optimizer-v2-vs-winml-autoconfig">Design Comparison: GPU Optimizer V2 vs WinML Autoconfig</a></li>
+<li><a href="#design-the-autoresearch-loop">Design: The Autoresearch Loop</a></li>
+<li><a href="#profiler-enhanced-agent-architecture-redesigned">Profiler-Enhanced Agent Architecture (redesigned)</a></li>
+<li><a href="#sections_5">Sections</a></li>
+<li><a href="#key-commands-used">Key commands used</a></li>
+<li><a href="#cross-references">Cross-references</a></li>
+</ul>
+</li>
+<li><a href="#skill-optimization-research-contributor-internal-deep-gap-analysis">Skill: optimization-research (contributor — internal, deep gap analysis)</a><ul>
+<li><a href="#frontmatter_7">Frontmatter</a></li>
+<li><a href="#when-to-use_7">When to use</a></li>
+<li><a href="#what-this-skill-produces">What this skill produces</a></li>
+<li><a href="#design-deep-search-process">Design: Deep Search Process</a></li>
+<li><a href="#key-external-tools-to-invoke">Key external tools to invoke</a></li>
+<li><a href="#gap-report-format-gap_analysismd">Gap report format (gap_analysis.md)</a></li>
+<li><a href="#github-issue-template">GitHub issue template</a></li>
+</ul>
+</li>
+<li><a href="#expected-vs-actual">Expected vs actual</a></li>
+<li><a href="#root-cause">Root cause</a></li>
+<li><a href="#ort-source-reference">ORT source reference</a></li>
+<li><a href="#proposed-fix-direction">Proposed fix direction</a></li>
+<li><a href="#complexity-estimate">Complexity estimate</a><ul>
+<li><a href="#inconclusive-do-not-report">Inconclusive / do not report</a></li>
+<li><a href="#measurement-methodology-correction-winml-eval-vs-winml-perf">Measurement methodology correction (winml eval vs winml perf)</a></li>
+<li><a href="#key-insight-for-autoconfig-skill">Key insight for autoconfig skill</a></li>
+<li><a href="#winml-analyze-gaps-discovered">winml analyze gaps discovered</a></li>
+</ul>
+</li>
+<li><a href="#implementation-notes">Implementation notes</a><ul>
+<li><a href="#directory-structure">Directory structure</a></li>
+<li><a href="#priority-order-for-implementation">Priority order for implementation</a></li>
+<li><a href="#required-code-changes-for-agentic-skill-execution">Required code changes for agentic skill execution</a></li>
+<li><a href="#sizing-estimate-per-skill">Sizing estimate (per skill)</a></li>
+<li><a href="#relationship-to-existing-use-winml-cli-skill">Relationship to existing use-winml-cli skill</a></li>
+</ul>
+</li>
+<li><a href="#qnn-npu-catalog-sweep-findings-feature-gaps-2026-06-13">QNN NPU Catalog Sweep — Findings &amp; Feature Gaps (2026-06-13)</a><ul>
+<li><a href="#cross-model-results">Cross-model results</a></li>
+<li><a href="#validated-kb-findings">Validated KB findings</a></li>
+<li><a href="#feature-gaps-winml-cli-backlog-items">Feature gaps (winml-cli backlog items)</a></li>
+</ul>
+</li>
+</ul>
+</div>
+
+</nav>
+<main>
+<h1 id="winml-cli-skills-design-doc">WinML CLI Skills Design Doc<a class="headerlink" href="#winml-cli-skills-design-doc" title="Permanent link">&para;</a></h1>
+<h2 id="problem-statement">Problem statement<a class="headerlink" href="#problem-statement" title="Permanent link">&para;</a></h2>
+<p>Getting a model to run <strong>well</strong> on Windows is expert work that doesn't scale. A developer or ISV who
+wants to ship a model has to answer a chain of hard questions by hand: <em>Will this model even run on my
+hardware? Which execution provider (CPU / DirectML / QNN NPU) should I target? What precision —
+FP32, FP16, W8A16, W8A8 — hits my latency budget without breaking accuracy? How should quality
+regressions after optimization/quantization be handled? Is this artifact actually safe to ship?</em> Each answer requires
+deep knowledge of ONNX, the EP landscape, quantization, and hardware-specific quirks (opset ceilings,
+conv-fusion fallbacks, EPs that hang on certain configs). Today that knowledge is tribal and scattered;
+the workflow is manual, slow, and easy to get wrong, which is exactly what blocks the long-tail model
+coverage the team needs (scenarios <strong>S1–S5</strong>: LLM fast-support, ISV non-LLM onboarding, cross-EP parity,
+"customer ONNX can't run", and PyTorch/HF Hub coverage).</p>
+<p><code>winml-cli</code> already exposes the right primitives (<code>inspect</code>, <code>analyze</code>, <code>build</code>, <code>eval</code>, <code>perf</code>,
+<code>compile</code>, <code>package</code>), but a CLI alone doesn't encode <em>judgment</em> — knowing which command to run next,
+how to read its output, and what to conclude. <strong>Skills close that gap.</strong> Each skill packages the
+decision logic for one recurring task as an agentic workflow on top of those commands, so the
+non-expert path produces the same artifact an expert would — a <code>config.json</code>, <code>manifest.json</code>, or
+report — without the developer needing to hold the whole EP/precision/quantization model in their head.
+The goal: turn "weeks of expert tuning" into a repeatable, reproducible, CLI-driven flow.</p>
+<p>These skills are not a loose toolbox — <strong>each one is anchored to a stage of the model lifecycle</strong>, from
+choosing a model that will run, through build/optimize, to shipping it inside a WinApp. <code>winml-cli</code> today
+owns the build pipeline up to <code>artifact.onnx</code> and stops there; the skills both encode the judgment
+<em>within</em> that pipeline and extend past the current break into deployment, so the whole path from a source
+model to a working AI feature on the user's device is covered.</p>
+<style>
+.mlc-wrap { border:1px solid var(--border); border-radius:10px; padding:16px 16px 12px; margin:18px 0; overflow-x:auto; }
+.mlc-note { margin:10px 0 0; font-size:13px; color:var(--muted); }
+.mlc-note b { color:#2563eb; }
+</style>
+<div class="mlc-wrap">
+  <svg width="100%" height="190" viewBox="0 0 980 190" preserveAspectRatio="none" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;width:100%;height:190px;">
+    <defs>
+      <marker id="mlc-arr" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+        <polygon points="0 0,8 3,0 6" fill="#94a3b8"/>
+      </marker>
+      <linearGradient id="mlc-spectrum" x1="0%" y1="0%" x2="100%" y2="0%">
+        <stop offset="0%" stop-color="#93c5fd"/>
+        <stop offset="45%" stop-color="#c4b5fd"/>
+        <stop offset="100%" stop-color="#fca5a5"/>
+      </linearGradient>
+    </defs>
+    <rect x="8" y="14" width="968" height="42" rx="10" fill="#f8fafc" stroke="#e2e8f0"/>
+    <rect x="14" y="18" width="72" height="34" rx="7" fill="#f1f5f9" stroke="#cbd5e1"/><text x="50" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">1</text><text x="50" y="44" text-anchor="middle" font-size="9" fill="#475569">Discover</text>
+    <line x1="86" y1="35" x2="94" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="94" y="18" width="72" height="34" rx="7" fill="#cfe0ff" stroke="#9bbcff"/><text x="130" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">2</text><text x="130" y="44" text-anchor="middle" font-size="9" fill="#475569">Build</text>
+    <line x1="166" y1="35" x2="174" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="174" y="18" width="72" height="34" rx="7" fill="#bcd5ff" stroke="#8fb0ff"/><text x="210" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">3</text><text x="210" y="44" text-anchor="middle" font-size="9" fill="#475569">Optimize</text>
+    <line x1="246" y1="35" x2="254" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="254" y="18" width="72" height="34" rx="7" fill="#a6c8ff" stroke="#7ea1f5"/><text x="290" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">4</text><text x="290" y="44" text-anchor="middle" font-size="9" fill="#475569">Evaluate</text>
+    <line x1="326" y1="35" x2="334" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="334" y="18" width="72" height="34" rx="7" fill="#ddd6fe" stroke="#b8a8ff"/><text x="370" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">5</text><text x="370" y="44" text-anchor="middle" font-size="9" fill="#475569">Package</text>
+    <line x1="406" y1="35" x2="414" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="414" y="18" width="72" height="34" rx="7" fill="#f5d0fe" stroke="#e9a8f0"/><text x="450" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">6</text><text x="450" y="44" text-anchor="middle" font-size="9" fill="#475569">Distribute</text>
+    <line x1="486" y1="35" x2="494" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="494" y="18" width="72" height="34" rx="7" fill="#fecdd3" stroke="#f9a8b8"/><text x="530" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">7</text><text x="530" y="44" text-anchor="middle" font-size="9" fill="#475569">Deploy</text>
+    <line x1="566" y1="35" x2="574" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="574" y="18" width="72" height="34" rx="7" fill="#dc2626" stroke="#b91c1c"/><text x="610" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#ffffff">8</text><text x="610" y="44" text-anchor="middle" font-size="9" fill="#ffe4e6">Runtime+Infer</text>
+    <line x1="646" y1="35" x2="654" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="654" y="18" width="72" height="34" rx="7" fill="#fb7185" stroke="#f43f5e"/><text x="690" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#ffffff">9</text><text x="690" y="44" text-anchor="middle" font-size="9" fill="#ffe4e6">Monitor</text>
+    <line x1="726" y1="35" x2="734" y2="35" stroke="#94a3b8" stroke-width="1.4" marker-end="url(#mlc-arr)"/>
+    <rect x="734" y="18" width="72" height="34" rx="7" fill="#fda4af" stroke="#fb7185"/><text x="770" y="32" text-anchor="middle" font-size="9.5" font-weight="700" fill="#334155">10</text><text x="770" y="44" text-anchor="middle" font-size="9" fill="#475569">Feedback</text>
+
+    <rect x="14" y="76" width="792" height="18" rx="9" fill="url(#mlc-spectrum)" opacity="0.28"/>
+
+    <path d="M 290 52 L 290 120 L 210 120 L 210 52" stroke="#7c3aed" stroke-width="1.6" stroke-dasharray="5,3" fill="none" marker-end="url(#mlc-arr)"/>
+    <text x="240" y="132" font-size="9.5" fill="#6d28d9" font-style="italic">iteration loop: evaluate → optimize</text>
+    <path d="M 770 52 L 770 150 L 50 150 L 50 52" stroke="#6d28d9" stroke-width="1.8" stroke-dasharray="5,3" fill="none" marker-end="url(#mlc-arr)"/>
+    <text x="370" y="164" font-size="10" fill="#5b21b6" font-style="italic">feedback loop: feedback → next discovery cycle</text>
+  </svg>
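+  <p class="mlc-note"><b>Lifecycle view:</b> cooler (blue) tones mark model-preparation stages and warmer (red) tones mark runtime and operational stages; the dashed arrows show the two loops (the evaluate → optimize iteration loop and the feedback loop into the next discovery cycle).</p>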
+</div>
+
+<h2 id="overview">Overview<a class="headerlink" href="#overview" title="Permanent link">&para;</a></h2>
+<p>This document defines the design for 10 skills to be added to <code>skills/</code> in winml-cli.
+Skills are split into <strong>two categories by the single question: does the task require editing repo code?</strong></p>
+<ul>
+<li><strong>User skills (6)</strong> — the user reaches their goal purely by specifying conditions and letting
+  winml-cli produce or modify a <code>config.json</code> / <code>manifest.json</code> / report. <strong>No source code is touched.</strong>
+  Audience: WinApp developers and ISVs deploying models.</li>
+<li><strong>Contributor skills (4)</strong> — the task requires a winml-cli source-code change (a new exporter, a new
+  EP backend, a new skill), or exists specifically to produce code-change backlog. Audience: winml-cli engineers.</li>
+</ul>
+<blockquote>
+<p>Discriminator: if the deliverable is a config/manifest/report, it is a <strong>User</strong> skill. If completing it
+requires editing code in the repo (or its whole purpose is to drive such edits), it is a <strong>Contributor</strong> skill.</p>
+</blockquote>
+<p>Each skill follows the SKILL.md frontmatter convention (<code>name:</code>, <code>description:</code>) established
+by Mobius, NVIDIA Model-Optimizer, and Google LiteRT-CLI as the de facto standard.</p>
+<h3 id="user-skills-ranked-by-importance">User skills — ranked by importance<a class="headerlink" href="#user-skills-ranked-by-importance" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Rank</th>
+<th>Skill</th>
+<th>Priority</th>
+<th>Why it ranks here</th>
+<th>Output (no code)</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>1</td>
+<td><code>use-winml-cli</code></td>
+<td>P0</td>
+<td>General tool-scoped onboarding reference (existing). The foundational command layer that underpins every task-scoped skill: every other skill runs the <code>winml</code> commands it documents.</td>
+<td>command reference</td>
+</tr>
+<tr>
+<td>2</td>
+<td><code>check-model-feasibility</code></td>
+<td>P0</td>
+<td>Pre-build front door, merging model discovery + EP/device compatibility: "find me a <em>supported</em> model from my constraints, then confirm it runs on my hardware." The single "what do I run, and will it run?" gate (<code>inspect</code> → <code>sys</code> → <code>analyze</code>). Highest frequency — every user hits it before building.</td>
+<td>model shortlist + go/no-go + fallback EP</td>
+</tr>
+<tr>
+<td>3</td>
+<td><code>debug-model</code></td>
+<td>P0</td>
+<td>Lightweight, read-only diagnosis skill: quickly synthesizes <code>inspect</code>, <code>analyze</code>, profiling, and op tracing signals into an explanation users can act on.</td>
+<td>diagnostic report + prioritized actions</td>
+</tr>
+<tr>
+<td>4</td>
+<td><code>autoconfig</code></td>
+<td>P0</td>
+<td>Flagship. Autonomously searches the config space and delivers the optimal <code>config.json</code> per EP. Also hosts the <strong>manual optimize path</strong> (precision-ladder + latency/accuracy-budget decision framework + hardware table) for users who want to choose by hand or have no target hardware. Maps to all five user scenarios (S1–S5).</td>
+<td><code>config_&lt;ep&gt;_optimal.json</code> + <code>report.html</code> (includes feasible-options comparison table) + <code>feature_requests.md</code></td>
+</tr>
+<tr>
+<td>5</td>
+<td><code>ship-to-winapp</code></td>
+<td>P1</td>
+<td>Ship-time skill, merging validation + packaging: L1–L5 Definition-of-Done gates <strong>plus</strong> multi-EP artifact layout, <code>manifest.json</code>, and runtime EP selection. Everything between "the model is good" and "it's running in the app."</td>
+<td>pass/fail report + <code>manifest.json</code></td>
+</tr>
+<tr>
+<td>6</td>
+<td><code>debug-accuracy-drop</code></td>
+<td>P2</td>
+<td>Narrow, exploratory diagnostic: isolates <em>where</em> accuracy regresses across the optimize → quantize → compile stages, using <code>winml eval --mode compare</code> + <code>winml analyze</code> to attribute the drop to a stage and suggest next steps. Ranks last because its scope overlaps <code>debug-model</code> (which owns primary diagnostics) and it is focused on quantization/accuracy regressions rather than general failures.</td>
+<td>accuracy-regression report + likely-cause stage + suggested fixes</td>
+</tr>
+</tbody>
+</table>
+<h3 id="contributor-skills-ranked-by-importance">Contributor skills — ranked by importance<a class="headerlink" href="#contributor-skills-ranked-by-importance" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Rank</th>
+<th>Skill</th>
+<th>Priority</th>
+<th>Why it ranks here</th>
+<th>Code touched</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>1</td>
+<td><code>adding-model-support</code></td>
+<td>P0</td>
+<td>Directly grows model coverage — the core long-tail business problem (ISV onboarding, S2/S5). Highest contribution frequency.</td>
+<td>new exporter + recipe</td>
+</tr>
+<tr>
+<td>2</td>
+<td><code>adding-ep-support</code></td>
+<td>P1</td>
+<td>Onboards a new execution-provider backend. Infrequent, but high value the moment a new NPU vendor lands.</td>
+<td>compile backend + EP registry</td>
+</tr>
+<tr>
+<td>3</td>
+<td><code>optimization-research</code></td>
+<td>P1</td>
+<td>High leverage: deep-searches ORT/Olive/ecosystem to find gaps and file the backlog that drives every other contributor skill. Internal, but sets the roadmap.</td>
+<td>files issues + repro (drives code changes)</td>
+</tr>
+<tr>
+<td>4</td>
+<td><code>contributing-a-skill</code></td>
+<td>P2</td>
+<td>Meta-tooling: how to author, lint, and eval a SKILL.md. Sustains the ecosystem but is supporting infrastructure, not a direct model/EP/perf deliverable.</td>
+<td><code>SKILL.md</code> + evals</td>
+</tr>
+</tbody>
+</table>
+<blockquote>
+<p>The detailed <code>## Skill:</code> sections below appear in document order, not priority order. Importance is
+defined by the two ranked tables above; implementation sequencing (risk/dependency-driven) is in
+<a href="#priority-order-for-implementation">Priority order for implementation</a>.</p>
+</blockquote>
+<h3 id="user-skill-dependency-graph">User skill dependency graph<a class="headerlink" href="#user-skill-dependency-graph" title="Permanent link">&para;</a></h3>
+<svg width="800" height="268" viewBox="0 0 800 268" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;max-width:100%;margin:12px 0;">
+  <defs>
+    <marker id="sk-arr" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+  <!-- check-model-feasibility -->
+  <rect x="10" y="15" width="192" height="88" rx="8" fill="#e8eaf6" stroke="#7986cb" stroke-width="1.5"/>
+  <text x="106" y="37" text-anchor="middle" font-size="10" font-weight="700" fill="#283593">check-model-feasibility</text>
+  <text x="106" y="52" text-anchor="middle" font-size="9.5" fill="#3949ab">find a supported model</text>
+  <text x="106" y="66" text-anchor="middle" font-size="9.5" fill="#3949ab">confirm EP/device runs</text>
+  <line x1="202" y1="59" x2="290" y2="59" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr)"/>
+  <!-- autoconfig -->
+  <rect x="292" y="15" width="170" height="88" rx="8" fill="#e8f5e9" stroke="#81c784" stroke-width="1.5"/>
+  <text x="377" y="37" text-anchor="middle" font-size="10.5" font-weight="700" fill="#1b5e20">autoconfig</text>
+  <text x="377" y="52" text-anchor="middle" font-size="9.5" fill="#2e7d32">optimize the model</text>
+  <text x="377" y="66" text-anchor="middle" font-size="9.5" fill="#2e7d32">automated sweep</text>
+  <line x1="462" y1="59" x2="588" y2="59" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr)"/>
+  <!-- ship-to-winapp -->
+  <rect x="590" y="15" width="196" height="88" rx="8" fill="#fff3e0" stroke="#ffb74d" stroke-width="1.5"/>
+  <text x="688" y="36" text-anchor="middle" font-size="10.5" font-weight="700" fill="#e65100">ship-to-winapp</text>
+  <text x="688" y="51" text-anchor="middle" font-size="9.5" fill="#bf360c">validate (L1–L5 gates)</text>
+  <text x="688" y="65" text-anchor="middle" font-size="9.5" fill="#bf360c">package multi-EP artifacts</text>
+  <text x="688" y="79" text-anchor="middle" font-size="9.5" fill="#bf360c">manifest + runtime EP select</text>
+  <!-- CMF → debug (elbow) -->
+  <polyline points="202,72 247,72 247,183 292,183" stroke="#90a4ae" stroke-width="1.5" fill="none" stroke-dasharray="4,2" marker-end="url(#sk-arr)"/>
+  <!-- autoconfig → debug (down) -->
+  <line x1="377" y1="103" x2="377" y2="148" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr)"/>
+  <!-- debug-model -->
+  <rect x="292" y="150" width="170" height="65" rx="8" fill="#fce4ec" stroke="#f48fb1" stroke-width="1.5"/>
+  <text x="377" y="172" text-anchor="middle" font-size="10.5" font-weight="700" fill="#880e4f">debug-model</text>
+  <text x="377" y="188" text-anchor="middle" font-size="9.5" fill="#ad1457">read-only explain + actions</text>
+  <!-- debug → ship (elbow up) -->
+  <polyline points="462,183 542,183 542,72 588,72" stroke="#90a4ae" stroke-width="1.5" fill="none" stroke-dasharray="4,2" marker-end="url(#sk-arr)"/>
+  <!-- use-winml-cli bar -->
+  <rect x="10" y="234" width="776" height="26" rx="6" fill="#e8eaf6" stroke="#9fa8da" stroke-width="1.5"/>
+  <text x="398" y="251" text-anchor="middle" font-size="10.5" font-weight="600" fill="#283593">use-winml-cli — general command reference · underpins every step above</text>
+</svg>
+
+<h3 id="contributor-research-skill">Contributor research skill<a class="headerlink" href="#contributor-research-skill" title="Permanent link">&para;</a></h3>
+<svg width="600" height="96" viewBox="0 0 600 96" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;max-width:100%;margin:12px 0;">
+  <defs>
+    <marker id="sk-arr2" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+  <rect x="10" y="10" width="215" height="76" rx="8" fill="#ede7f6" stroke="#b39ddb" stroke-width="1.5"/>
+  <text x="117" y="34" text-anchor="middle" font-size="10.5" font-weight="700" fill="#4a148c">optimization-research</text>
+  <text x="117" y="50" text-anchor="middle" font-size="9.5" fill="#6a1b9a">deep search: ORT · Olive</text>
+  <text x="117" y="64" text-anchor="middle" font-size="9.5" fill="#6a1b9a">ONNX ecosystem + native models</text>
+  <line x1="225" y1="48" x2="296" y2="48" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr2)"/>
+  <rect x="298" y="10" width="210" height="76" rx="8" fill="#e3f2fd" stroke="#90caf9" stroke-width="1.5"/>
+  <text x="403" y="34" text-anchor="middle" font-size="10.5" font-weight="700" fill="#0d47a1">GitHub issues / winml backlog</text>
+  <text x="403" y="50" text-anchor="middle" font-size="9.5" fill="#1565c0">find better solutions</text>
+  <text x="403" y="64" text-anchor="middle" font-size="9.5" fill="#1565c0">diagnose gaps → work items</text>
+</svg>
+
+<h3 id="contributor-skill-dependency-graph">Contributor skill dependency graph<a class="headerlink" href="#contributor-skill-dependency-graph" title="Permanent link">&para;</a></h3>
+<svg width="540" height="112" viewBox="0 0 540 112" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;max-width:100%;margin:12px 0;">
+  <defs>
+    <marker id="sk-arr3" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+  <rect x="10" y="10" width="195" height="40" rx="7" fill="#e8f5e9" stroke="#81c784" stroke-width="1.5"/>
+  <text x="107" y="35" text-anchor="middle" font-size="10.5" font-weight="700" fill="#1b5e20">adding-model-support</text>
+  <rect x="10" y="62" width="195" height="40" rx="7" fill="#e3f2fd" stroke="#90caf9" stroke-width="1.5"/>
+  <text x="107" y="87" text-anchor="middle" font-size="10.5" font-weight="700" fill="#0d47a1">adding-ep-support</text>
+  <line x1="205" y1="30" x2="308" y2="50" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr3)"/>
+  <line x1="205" y1="82" x2="308" y2="62" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-arr3)"/>
+  <rect x="310" y="36" width="200" height="40" rx="7" fill="#fff3e0" stroke="#ffb74d" stroke-width="1.5"/>
+  <text x="410" y="61" text-anchor="middle" font-size="10.5" font-weight="700" fill="#e65100">contributing-a-skill</text>
+</svg>
+
+<hr />
+<h2 id="design-principle-skills-as-agentic-workflows">Design principle: Skills as agentic workflows<a class="headerlink" href="#design-principle-skills-as-agentic-workflows" title="Permanent link">&para;</a></h2>
+<h3 id="the-shift-documentation-automation">The shift: documentation → automation<a class="headerlink" href="#the-shift-documentation-automation" title="Permanent link">&para;</a></h3>
+<p>Current state (most skills in the ecosystem):</p>
+<blockquote>
+<p>Skill tells the user what commands to run → user runs them → user interprets output</p>
+</blockquote>
+<p>Target state for winml-cli:</p>
+<blockquote>
+<p>Skill tells the <strong>agent</strong> what commands to run → <strong>agent runs them</strong> → agent interprets output → agent gives a specific answer</p>
+</blockquote>
+<p>The difference:</p>
+<table>
+<thead>
+<tr>
+<th></th>
+<th>Documentation skill</th>
+<th>Agentic skill</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Agent sees low cosine</td>
+<td>"Run <code>winml eval --mode compare</code>"</td>
+<td>Runs it, reads cosine=0.87, says "drop at quantize stage, Attention layers"</td>
+</tr>
+<tr>
+<td>EP compatibility</td>
+<td>"Run <code>winml sys</code> then <code>winml analyze</code>"</td>
+<td>Runs both, parses JSON, says "QNN available but LayerNorm is partial"</td>
+</tr>
+<tr>
+<td>Optimize precision</td>
+<td>"Use the decision framework"</td>
+<td>Runs fp16/w8a16/w8a8 sweep, builds actual tradeoff table, recommends W8A16</td>
+</tr>
+<tr>
+<td>Validate before ship</td>
+<td>"Check these 6 gates"</td>
+<td>Runs all 6 gates, generates a pass/fail report with actual numbers</td>
+</tr>
+</tbody>
+</table>
+<p>This is only possible if skills describe a <strong>GATHER → ANALYZE → DECIDE → ACT</strong> workflow,
+and winml-cli commands emit <strong>machine-readable structured output</strong> that the agent can parse.</p>
+<h3 id="structured-output-current-state-and-gaps">Structured output: current state and gaps<a class="headerlink" href="#structured-output-current-state-and-gaps" title="Permanent link">&para;</a></h3>
+<p>Copilot agents have shell tool access and can run <code>winml</code> commands directly.
+The key requirement is <code>--format json</code> on stdout so the agent can parse results
+without screen-scraping Rich/ANSI terminal output.</p>
+<table>
+<thead>
+<tr>
+<th>Command</th>
+<th>Structured output today</th>
+<th>Gap</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>winml inspect</code></td>
+<td>✓ <code>--format json</code> (stdout)</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>winml sys</code></td>
+<td>✓ <code>--format json</code> (stdout)</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>winml run</code></td>
+<td>✓ <code>--format json</code> (stdout)</td>
+<td>None</td>
+</tr>
+<tr>
+<td><code>winml analyze</code></td>
+<td>⚠ <code>--output file.json</code> (file only)</td>
+<td>Add <code>--format json</code> stdout</td>
+</tr>
+<tr>
+<td><code>winml perf</code></td>
+<td>⚠ <code>--output file.json</code> (file only)</td>
+<td>Add <code>--format json</code> stdout</td>
+</tr>
+<tr>
+<td><code>winml eval</code></td>
+<td>✗ No structured output</td>
+<td>Add <code>--format json</code> stdout</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Required code changes</strong> (enables agentic skill execution):</p>
+<ol>
+<li><code>winml eval --format json</code>: outputs <code>{cosine, sqnr, psnr, task_metric}</code> to stdout</li>
+<li><code>winml analyze --format json</code>: outputs <code>{supported: [...], partial: [...], unsupported: [...]}</code> to stdout</li>
+<li><code>winml perf --format json</code>: outputs <code>{p50_ms, p90_ms, p99_ms, mean_ms}</code> to stdout</li>
+</ol>
+<h3 id="the-gather-analyze-decide-act-skill-structure">The GATHER → ANALYZE → DECIDE → ACT skill structure<a class="headerlink" href="#the-gather-analyze-decide-act-skill-structure" title="Permanent link">&para;</a></h3>
+<p>Each skill section should be written with agent execution in mind:</p>
+<div class="codehilite"><pre><span></span><code>## GATHER: what to run
+Commands the agent runs first (with --format json) to collect facts.
+
+## ANALYZE: what to look for
+How to interpret the JSON output. What values matter. What thresholds to apply.
+
+## DECIDE: what to recommend
+Decision logic. If X → recommend Y. If A and B → recommend C.
+
+## ACT: what to tell the user
+What to surface to the user: specific diagnosis + specific next step.
+</code></pre></div>
+
+<p>In practice this maps onto the existing "Sections" structure — the key is ensuring
+each section has <strong>concrete commands to run</strong> and <strong>concrete interpretation rules</strong>,
+not just prose description.</p>
+<h3 id="example-debug-model-as-an-agentic-workflow">Example: <code>debug-model</code> as an agentic workflow<a class="headerlink" href="#example-debug-model-as-an-agentic-workflow" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code>User: &quot;My model runs, but results are unstable and slower than expected&quot;
+
+GATHER:
+  agent runs: winml inspect -m model.onnx --format json
+  agent runs: winml analyze -m model.onnx --ep qnn --format json
+  agent runs: winml perf -m model.onnx --ep qnn --device npu --format json
+  agent runs: winml optrace -m model.onnx --ep qnn --format json (if available)
+
+ANALYZE:
+  inspect: dynamic axes + large activation tensors found
+  analyze: partial ops on QNN indicate fallback risk
+  perf/optrace: hotspots align with fallback boundary
+
+DECIDE:
+  classify as EP-coverage + graph-shape issue
+  prioritize actions by expected impact and effort
+
+ACT:
+  Agent report:
+    1) likely root causes with evidence
+    2) immediate actions (config/model-level)
+    3) next verification commands
+</code></pre></div>
+
+<p>Without structured output (<code>--format json</code>), the agent would have to tell the user to run
+each step manually and paste the results back. With structured output, the agent runs the
+full diagnostic in one turn.</p>
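+<p>As a rough illustration of the ANALYZE and DECIDE steps, the sketch below reads the proposed <code>winml analyze --format json</code> payload (<code>supported</code> / <code>partial</code> / <code>unsupported</code> lists) and turns it into a fallback-risk label. The function name <code>fallback_risk</code> and the high/medium/low rule are assumptions made for this example; the JSON output itself is still a proposed change, not current behavior.</p>

```python
import json


def fallback_risk(analyze_json: str) -> str:
    """Classify EP fallback risk from the proposed analyze JSON payload.

    Assumed rule (illustrative only): any unsupported op is "high" risk,
    partial-only coverage is "medium", full coverage is "low".
    """
    result = json.loads(analyze_json)
    unsupported = result.get("unsupported", [])
    partial = result.get("partial", [])
    if unsupported:
        return "high"
    if partial:
        return "medium"
    return "low"


# Made-up payload shaped like the proposed schema.
sample = '{"supported": ["MatMul", "Add"], "partial": ["LayerNorm"], "unsupported": []}'
print(fallback_risk(sample))
```

+<p>A real skill would combine this signal with the perf and optrace evidence before recommending an action.</p>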
+<hr />
+<h2 id="validation-confidence-levels-l1l5">Validation confidence levels (L1–L5)<a class="headerlink" href="#validation-confidence-levels-l1l5" title="Permanent link">&para;</a></h2>
+<p>Inspired by Mobius <code>writing-tests</code>. Applied in <code>ship-to-winapp</code> as the Definition-of-Done backbone.
+Each level is checked <strong>independently</strong> — a model can pass L3 without passing L2.</p>
+<table>
+<thead>
+<tr>
+<th>Level</th>
+<th>Name</th>
+<th>What it verifies</th>
+<th>Key command</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><strong>L1</strong></td>
+<td>Loadable</td>
+<td>Artifact is valid ONNX, loads without error</td>
+<td><code>winml inspect -m &lt;artifact&gt;</code></td>
+</tr>
+<tr>
+<td><strong>L2</strong></td>
+<td>Shape correct</td>
+<td>Output shape matches expected spec</td>
+<td><code>winml eval -m &lt;artifact&gt; --model-id &lt;model&gt;</code> (check shape in output)</td>
+</tr>
+<tr>
+<td><strong>L3</strong></td>
+<td>Numerical parity</td>
+<td>Output matches FP32 baseline (cosine ≥ 0.99 FP16, ≥ 0.95 W8A16, ≥ 0.90 W8A8)</td>
+<td><code>winml eval --mode compare -m &lt;artifact&gt; --model-id &lt;model&gt;</code></td>
+</tr>
+<tr>
+<td><strong>L4</strong></td>
+<td>Task accuracy</td>
+<td>Task metric (Top-1/F1/mAP) within acceptable drop from FP32 reference</td>
+<td><code>winml eval -m &lt;artifact&gt; --model-id &lt;model&gt;</code> (task metric)</td>
+</tr>
+<tr>
+<td><strong>L5</strong></td>
+<td>Production ready</td>
+<td>Perf SLA met on target device + cross-EP consistency verified</td>
+<td><code>winml perf --iterations 100 --monitor</code></td>
+</tr>
+</tbody>
+</table>
+<p><strong>Quick pass criteria:</strong></p>
+<table>
+<thead>
+<tr>
+<th>Precision</th>
+<th>L3 threshold</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>FP16</td>
+<td>cosine_similarity ≥ 0.99</td>
+</tr>
+<tr>
+<td>W8A16</td>
+<td>cosine_similarity ≥ 0.95</td>
+</tr>
+<tr>
+<td>W8A8</td>
+<td>cosine_similarity ≥ 0.90 (or task-specific)</td>
+</tr>
+</tbody>
+</table>
+<p>Waivers: any level that cannot be verified must be documented with a reason and a tracking issue.
+The <code>ship-to-winapp</code> skill maps each of its 6 validation gates to an L-level.</p>
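+<p>The L3 thresholds above can be written as a small lookup. This is a minimal sketch: the name <code>passes_l3</code> and the lowercase precision keys are illustrative assumptions, not part of winml-cli, and the cosine value is assumed to come from a comparison run such as <code>winml eval --mode compare</code>.</p>

```python
# Hypothetical helper: names are illustrative, not part of winml-cli.
# Thresholds are copied from the "Quick pass criteria" table.
L3_COSINE_THRESHOLDS = {
    "fp16": 0.99,
    "w8a16": 0.95,
    "w8a8": 0.90,
}


def passes_l3(precision: str, cosine_similarity: float) -> bool:
    """Return True if the cosine similarity meets the L3 bar for a precision."""
    key = precision.lower()
    if key not in L3_COSINE_THRESHOLDS:
        raise ValueError(f"no L3 threshold defined for precision {precision!r}")
    return cosine_similarity >= L3_COSINE_THRESHOLDS[key]
```

+<p>As the table notes, a W8A8 artifact may need a task-specific bar instead of the generic 0.90 cut-off.</p>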
+<hr />
+<h2 id="skill-evaluation">Skill evaluation<a class="headerlink" href="#skill-evaluation" title="Permanent link">&para;</a></h2>
+<p>This section defines a high-level evaluation framework for skills: pick the right skill, then execute it
+with reliable and decision-useful outcomes.</p>
+<h3 id="skill-trigger-eval">Skill trigger eval<a class="headerlink" href="#skill-trigger-eval" title="Permanent link">&para;</a></h3>
+<p>Focuses on routing quality — whether user intent is consistently mapped to the correct skill boundary.</p>
+<h3 id="skill-execution-eval">Skill execution eval<a class="headerlink" href="#skill-execution-eval" title="Permanent link">&para;</a></h3>
+<p>Focuses on run quality after routing — whether the workflow produces correct, actionable, and stable outputs.</p>
+<h4 id="skill-result-eval">Skill result eval<a class="headerlink" href="#skill-result-eval" title="Permanent link">&para;</a></h4>
+<p>Evaluates final outputs at outcome level: correctness, completeness, and practical usefulness for decisions.</p>
+<h4 id="intermediate-step-eval">Intermediate-step eval<a class="headerlink" href="#intermediate-step-eval" title="Permanent link">&para;</a></h4>
+<p>Evaluates whether major workflow stages are coherent and evidence-backed, not just final-answer quality.</p>
+<h4 id="robust-eval">Robust eval<a class="headerlink" href="#robust-eval" title="Permanent link">&para;</a></h4>
+<p>Evaluates resilience under ambiguity/noise so behavior remains safe, stable, and useful across varied contexts.</p>
+<hr />
+<h2 id="competitive-analysis">Competitive Analysis<a class="headerlink" href="#competitive-analysis" title="Permanent link">&para;</a></h2>
+<h3 id="summary">Summary<a class="headerlink" href="#summary" title="Permanent link">&para;</a></h3>
+<p>winml-cli has a solid optimization pipeline (export→quantize→compile→benchmark) but lacks the <strong>debugging/diagnostic loop</strong>, <strong>accuracy recovery tooling</strong>, and <strong>developer observability</strong> that distinguish great toolchains from adequate ones.</p>
+<hr />
+<h3 id="competitor-feature-matrix">Competitor Feature Matrix<a class="headerlink" href="#competitor-feature-matrix" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Feature</th>
+<th>Apple</th>
+<th>ExecuTorch</th>
+<th>AI Hub</th>
+<th>NVIDIA</th>
+<th>OpenVINO</th>
+<th>Optimum</th>
+<th>Olive</th>
+<th>winml-cli</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Per-layer accuracy debugging</td>
+<td>❌</td>
+<td>✅ SVG graph</td>
+<td>✅ cloud</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Compute unit utilization report</td>
+<td>❌</td>
+<td>✅</td>
+<td>✅</td>
+<td>❌</td>
+<td>Partial</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Accuracy-Aware PTQ (auto layer rollback)</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>✅ NNCF</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Standard NLP benchmark (MMLU/PPL)</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>✅</td>
+<td>✅</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Cross-EP side-by-side compare</td>
+<td>❌</td>
+<td>❌</td>
+<td>Partial</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Zero-deploy validation (model.predict)</td>
+<td>✅ macOS</td>
+<td>✅</td>
+<td>✅ cloud</td>
+<td>❌</td>
+<td>✅</td>
+<td>✅</td>
+<td>❌</td>
+<td>Partial</td>
+</tr>
+<tr>
+<td>Pre-quantized model zoo</td>
+<td>❌</td>
+<td>❌</td>
+<td>✅ 500+</td>
+<td>✅ HF org</td>
+<td>✅</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>One-line optimize command</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>✅</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Multi-EP artifact packaging</td>
+<td>✅ .mlpackage</td>
+<td>✅ .pte</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>QAT / accuracy recovery fine-tuning</td>
+<td>✅</td>
+<td>❌</td>
+<td>✅ AIMET</td>
+<td>✅</td>
+<td>✅</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Advanced quant (AWQ/SmoothQuant)</td>
+<td>❌</td>
+<td>❌</td>
+<td>✅</td>
+<td>✅</td>
+<td>✅ NNCF</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+<tr>
+<td>Thermal/sustained-load profiling</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+<td>❌</td>
+</tr>
+</tbody>
+</table>
+<hr />
+<h3 id="competitor-deep-dives">Competitor Deep Dives<a class="headerlink" href="#competitor-deep-dives" title="Permanent link">&para;</a></h3>
+<h4 id="apple-coremltools">Apple coremltools<a class="headerlink" href="#apple-coremltools" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: zero-deploy validation + compute_units API + palettization</p>
+<ul>
+<li><code>model.predict({'input': np_array})</code> — validates converted model in one Python call without any device deploy. Can force <code>ComputeUnit.CPU_ONLY</code> for numerical comparison vs <code>CPU_AND_NE</code>.</li>
+<li><code>compute_units</code> is switchable <strong>at prediction time</strong> (not just compile time) — enables A/B testing EP performance without re-converting.</li>
+<li><strong>Palettization</strong>: LUT-based weight compression at 1–8 bits (k-means clustering, not linear quant). Matches Neural Engine hardware kernels better than INT4 linear quantization for many models.</li>
+<li>Three compression workflows: data-free / calibration-based / fine-tuning-based (QAT).</li>
+<li><code>.mlpackage</code> separates architecture from weights → streaming-friendly, supports on-device compilation after download.</li>
+</ul>
+<h4 id="executorch-meta">ExecuTorch (Meta)<a class="headerlink" href="#executorch-meta" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: per-layer QNN accuracy debugging (best in class among the competitors surveyed)</p>
+<ul>
+<li><code>QNNIntermediateDebugger</code>: dumps intermediate tensor outputs at every QNN op, computes cosine similarity per layer vs CPU reference, generates <strong>color-coded SVG computation graph</strong> (green ≥ 0.9, red &lt; 0.9).</li>
+<li><code>get_delegation_info()</code>: table of ops showing delegated-to-NPU count vs CPU-fallback count per op type.</li>
+<li><code>ETDump</code> + <code>Inspector</code> API: per-op timing table with avg (ms), op type, is_delegated. Returns pandas DataFrame.</li>
+<li>QAIRT Visualizer: <code>pip install qairt-visualizer</code> — interactive GUI overlaying op trace + QHAS (QNN HTP Analysis Summary) on model graph.</li>
+<li><strong>Missing</strong>: no cloud device testing, no automated accuracy-latency sweep, build process is complex.</li>
+</ul>
+<h4 id="qualcomm-ai-hub">Qualcomm AI Hub<a class="headerlink" href="#qualcomm-ai-hub" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: cloud profiling with physical hardware, per-step memory breakdown</p>
+<ul>
+<li>Compile + Profile + Inference on real physical devices (Snapdragon X Elite laptops, Galaxy S24) in the cloud — no local hardware needed.</li>
+<li>Per-step memory profiling: compilation time/memory, first-load time/memory (NE optimization), subsequent-load (cached), inference latency.</li>
+<li>500+ pre-optimized models in model zoo.</li>
+<li><code>--clone j1glw6y8p</code> — clone any previous job with modified params.</li>
+<li>Cloud AIMET quantization: sophisticated PTQ as a service (<code>submit_quantize_job()</code>).</li>
+</ul>
+<h4 id="nvidia-modelopt">NVIDIA ModelOpt<a class="headerlink" href="#nvidia-modelopt" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: 16 compression techniques + MMLU benchmark scripts + pre-quantized HF checkpoints</p>
+<ul>
+<li>Compression techniques beyond PTQ: AWQ, SmoothQuant, QAT, pruning (Minitron 33% smaller, 50% faster), distillation, speculative decoding, sparsity, NAS (Puzzletron).</li>
+<li>Windows accuracy benchmark: <code>mmlu_benchmark.py</code> (57 subjects, DirectML/ORT/TensorRT-LLM/CPU), perplexity on WikiText-2, KL-divergence metrics.</li>
+<li>Pre-quantized HF checkpoints: <code>nvidia/DeepSeek-R1-FP4</code>, <code>nvidia/Llama-3.3-70B-FP4</code> etc. — pull validated optimized models without running pipeline.</li>
+</ul>
+<h4 id="intel-openvino-nncf">Intel OpenVINO + NNCF<a class="headerlink" href="#intel-openvino-nncf" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: Accuracy-Aware PTQ (auto layer rollback)</p>
+<ul>
+<li>NNCF <code>AccuracyAwareQuantization</code>: automatically identifies sensitivity of each layer to quantization, rolls back sensitive layers to float when accuracy drop exceeds threshold. Fully automated accuracy-performance tradeoff solver.</li>
+<li><code>benchmark_app -hint latency</code> vs <code>-hint throughput</code>: auto-configures streams, batch, inference requests for each mode. <code>-d AUTO</code>: automatic device selection with fallback.</li>
+<li>100+ Jupyter notebooks on Binder/Colab — zero setup barrier.</li>
+<li><code>OpenVINO GenAI</code>: high-level <code>LLMPipeline</code>, <code>WhisperPipeline</code> — deploy-ready LLM inference in 5 lines.</li>
+</ul>
+<h4 id="huggingface-optimum">HuggingFace Optimum<a class="headerlink" href="#huggingface-optimum" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: drop-in Transformers replacement + multi-backend hub</p>
+<ul>
+<li>Replace <code>AutoModelForSequenceClassification.from_pretrained()</code> with <code>ORTModelForSequenceClassification.from_pretrained()</code> → ONNX Runtime inference with zero code change.</li>
+<li>8 hardware backends: ONNX Runtime, OpenVINO, NVIDIA TensorRT-LLM, AMD Ryzen AI, AWS Inferentia, ExecuTorch, Intel Gaudi, FuriosaAI.</li>
+<li>Task-aware export: <code>--task text-generation</code> auto-configures dynamic axes and model wrapping.</li>
+</ul>
+<h4 id="microsoft-olive-direct-competitor">Microsoft Olive (direct competitor)<a class="headerlink" href="#microsoft-olive-direct-competitor" title="Permanent link">&para;</a></h4>
+<p><strong>Most relevant</strong>: one-line optimize command + VS Code AI Toolkit</p>
+<ul>
+<li><code>olive optimize --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct --precision int4 --output_path models/qwen</code> — one command, no per-step config.</li>
+<li>JSON-based pipeline config for full declarative multi-step control.</li>
+<li>VS Code AI Toolkit extension: GUI for model optimization, fine-tuning, and inference testing — no CLI knowledge needed.</li>
+<li>MultiLoRA serving support.</li>
+</ul>
+<hr />
+<h3 id="top-5-high-impact-gaps-for-winml-cli">Top 5 High-Impact Gaps for winml-cli<a class="headerlink" href="#top-5-high-impact-gaps-for-winml-cli" title="Permanent link">&para;</a></h3>
+<h4 id="gap-1-per-layer-accuracy-debugging">🔴 Gap 1: Per-Layer Accuracy Debugging<a class="headerlink" href="#gap-1-per-layer-accuracy-debugging" title="Permanent link">&para;</a></h4>
+<p><strong>Pain</strong>: When accuracy degrades after QNN compilation or quantization, the user has no way to tell which layer caused it. Diagnosing this currently requires expert knowledge of the QNN SDK.</p>
+<p><strong>Solution</strong>: <code>winml debug --model model.onnx --ep qnn --inputs calibration_data/</code></p>
+<ol>
+<li>Runs model on CPU and QNN, captures intermediate tensor outputs at each op</li>
+<li>Computes cosine similarity per layer</li>
+<li>Outputs HTML/SVG graph with color-coded accuracy (green/red per layer)</li>
+</ol>
+<p><strong>Reference</strong>: ExecuTorch <code>QNNIntermediateDebugger</code> → <code>OutputFormat.SVG_GRAPH</code> + <code>QcomCosineSimilarityComparator</code></p>
+<p><strong>Impact</strong>: Turns multi-day debugging into a 30-minute diagnosis. Currently no Windows-on-NPU tool does this.</p>
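+<p>The per-layer comparison step can be sketched in plain Python. Everything here is illustrative: the layer names and tensor values are made up, outputs are assumed to be already flattened to lists of floats, and the 0.9 green/red cut-off is borrowed from the ExecuTorch description earlier in this document instead of any winml-cli setting.</p>

```python
import math

# Hypothetical sketch: layer names and tensor values below are made up.
# The 0.9 cut-off mirrors the green/red split described for ExecuTorch.
GREEN_THRESHOLD = 0.9


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero tensors as identical, anything else as a mismatch.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def per_layer_report(reference, candidate):
    """Compare flattened layer outputs and label each layer green or red."""
    report = []
    for name, ref_values in reference.items():
        score = cosine_similarity(ref_values, candidate[name])
        color = "green" if score >= GREEN_THRESHOLD else "red"
        report.append((name, score, color))
    return report


cpu_outputs = {"attn_0": [1.0, 2.0, 3.0], "mlp_0": [1.0, 0.0, 0.0]}
qnn_outputs = {"attn_0": [1.0, 2.0, 3.1], "mlp_0": [0.0, 1.0, 0.0]}
for name, score, color in per_layer_report(cpu_outputs, qnn_outputs):
    print(f"{name}: cosine={score:.4f} ({color})")
```

+<p>The first red layer in graph order is usually the most useful starting point, since errors introduced there propagate to every layer downstream.</p>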
+<hr />
+<h4 id="gap-2-compute-unit-utilization-report">🔴 Gap 2: Compute Unit Utilization Report<a class="headerlink" href="#gap-2-compute-unit-utilization-report" title="Permanent link">&para;</a></h4>
+<p><strong>Pain</strong>: <code>winml perf</code> shows slower-than-expected latency with no explanation. The user cannot tell what percentage of ops ran on the NPU and what percentage fell back to CPU.</p>
+<p><strong>Solution</strong>: Extend <code>winml analyze</code> to output a delegation table:</p>
+<div class="codehilite"><pre><span></span><code>Op Type         | NPU Delegated | CPU Fallback | Reason
+----------------|---------------|--------------|------------------
+MatMul (INT8)   | 47 / 47       | 0            | -
+LayerNorm       |  0 / 12       | 12           | Unsupported dtype
+Softmax (FP32)  |  0 /  6       |  6           | Requires INT8 input
+</code></pre></div>
+
+<p><strong>Reference</strong>: ExecuTorch <code>get_delegation_info().get_operator_delegation_dataframe()</code> / AI Hub per-layer compute unit mapping</p>
+<p><strong>Impact</strong>: Directly actionable. If the user sees "60% of ops on CPU due to unsupported dtype," they know to switch to W8A8.</p>
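+<p>To show how such a table becomes a headline number, the sketch below aggregates rows into an overall NPU share and a per-reason fallback count. The row tuples simply restate the example table above; the tuple layout and the function name <code>summarize_delegation</code> are assumptions for illustration, not an existing winml-cli output format.</p>

```python
# Hypothetical rows mirroring the example delegation table above:
# (op type, ops delegated to NPU, ops that fell back to CPU, reason).
ROWS = [
    ("MatMul (INT8)", 47, 0, "-"),
    ("LayerNorm", 0, 12, "Unsupported dtype"),
    ("Softmax (FP32)", 0, 6, "Requires INT8 input"),
]


def summarize_delegation(rows):
    """Return overall NPU share and CPU-fallback op counts grouped by reason."""
    npu_total = sum(npu for _, npu, _, _ in rows)
    cpu_total = sum(cpu for _, _, cpu, _ in rows)
    total = npu_total + cpu_total
    fallback_by_reason = {}
    for _, _, cpu, reason in rows:
        if cpu > 0:
            fallback_by_reason[reason] = fallback_by_reason.get(reason, 0) + cpu
    npu_share = npu_total / total if total else 0.0
    return npu_share, fallback_by_reason


share, reasons = summarize_delegation(ROWS)
print(f"NPU share: {share:.1%}")
for reason, count in sorted(reasons.items(), key=lambda kv: -kv[1]):
    print(f"{count} ops on CPU: {reason}")
```

+<p>Grouping by reason matters more than the raw percentage, because the reason is what points at a fix such as changing precision.</p>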
+<hr />
+<h4 id="gap-3-quantization-sensitivity-analysis">🟠 Gap 3: Quantization Sensitivity Analysis<a class="headerlink" href="#gap-3-quantization-sensitivity-analysis" title="Permanent link">&para;</a></h4>
+<p><strong>Pain</strong>: <code>winml quantize --algo w8a8</code> produces a model with unacceptable accuracy. The user doesn't know whether the cause is a specific layer, the algorithm, or the calibration data.</p>
+<p><strong>Solution</strong>: <code>winml analyze-quant --model model.onnx --calibration data/ --eval-dataset eval/</code></p>
+<ol>
+<li>Run full W8A8 quantization</li>
+<li>For each block/layer, measure accuracy impact of reverting to FP16</li>
+<li>Rank layers by sensitivity</li>
+<li>Report: "reverting 3 attention layers to FP16 recovers X% accuracy at Y% latency cost"</li>
+</ol>
+<p><strong>Reference</strong>: Intel NNCF <code>AccuracyAwareQuantization</code> (automatic per-layer rollback)</p>
+<p><strong>Impact</strong>: Replaces multi-day trial-and-error with a 10-minute automated report.</p>
+<hr />
+<h4 id="gap-4-standard-benchmark-integration-mmlu-perplexity">🟠 Gap 4: Standard Benchmark Integration (MMLU / Perplexity)<a class="headerlink" href="#gap-4-standard-benchmark-integration-mmlu-perplexity" title="Permanent link">&para;</a></h4>
+<p><strong>Pain</strong>: <code>winml eval</code> supports custom scripts but no out-of-box standard benchmarks. Users have no reference point for whether their quantized model's accuracy is "expected."</p>
+<p><strong>Solution</strong>: <code>winml eval --model model.onnx --benchmark mmlu --ep qnn</code></p>
+<ul>
+<li>Built-in MMLU (57 subjects), WikiText-2 perplexity, KL-divergence scripts</li>
+<li>Reference numbers from FP32 baseline shown alongside quantized result</li>
+<li><code>FP16 baseline: 78.2% → W8A8 QNN: 77.9% (−0.3%, expected range: −0.1% to −0.5%)</code></li>
+</ul>
+<p><strong>Reference</strong>: NVIDIA ModelOpt <code>examples/windows/accuracy_benchmark/mmlu_benchmark.py</code> supports DirectML/ORT/CPU</p>
+<p><strong>Impact</strong>: Removes ambiguity and creates trust. Critical for LLM users.</p>
+<hr />
+<h4 id="gap-5-cross-ep-side-by-side-comparison">🟡 Gap 5: Cross-EP Side-by-Side Comparison<a class="headerlink" href="#gap-5-cross-ep-side-by-side-comparison" title="Permanent link">&para;</a></h4>
+<p><strong>Pain</strong>: Choosing between QNN/DirectML/CPU/OpenVINO requires running each EP manually and aggregating results. No tool does this automatically.</p>
+<p><strong>Solution</strong>: <code>winml sweep --model model.onnx --precision w8a16,fp16 --ep qnn,dml,cpu</code></p>
+<ul>
+<li>Runs build+eval+perf for each (precision × EP) combination</li>
+<li>Outputs a single comparison table: accuracy / latency / op coverage %</li>
+<li>Agent-driven: skill reads JSON output and recommends the optimal combination</li>
+</ul>
+<p><strong>Reference</strong>: Truly unique — no competitor does this for Windows multi-EP. Closest is AI Hub's multi-device fleet testing (Android only).</p>
+<p><strong>Impact</strong>: The single most-requested decision for Windows AI developers. Unique to winml-cli.</p>
+<hr />
+<h3 id="patterns-in-great-toolchain-dx">Patterns in Great Toolchain DX<a class="headerlink" href="#patterns-in-great-toolchain-dx" title="Permanent link">&para;</a></h3>
+<p><strong>Pattern 1: The "Why" Feedback Loop</strong>
+Great toolchains explain <em>why</em> results are the way they are. ExecuTorch's delegation table, AI Hub's compute unit mapping, and NNCF's layer sensitivity analysis all answer "why?", whereas winml-cli currently stops at "here's the result."</p>
+<p><strong>Pattern 2: Progressive Disclosure of Complexity</strong></p>
+<ul>
+<li>Olive: <code>olive optimize --precision int4</code> (one line) → full JSON config pipeline</li>
+<li>coremltools: <code>ct.convert(model)</code> → MIL IR manipulation</li>
+<li>AI Hub: web dashboard → Python SDK → CLI → AIMET configs</li>
+</ul>
+<p>winml-cli is currently too close to the expert path: each step requires understanding EP-specific options.</p>
+<p><strong>Pattern 3: Zero-Deploy Validation</strong>
+Every strong toolchain lets you test model output before deploying to hardware: coremltools <code>model.predict()</code>, ExecuTorch Python pybind, AI Hub <code>submit_inference_job()</code>. winml-cli is strong for CPU but lacks the quick "compare CPU vs QNN output" path.</p>
+<p><strong>Pattern 4: Pre-Validated Model Artifacts</strong>
+ModelOpt (HF nvidia/ org), AI Hub (500+ models), NNCF (Model Zoo with accuracy tables) all reduce the cold-start problem. Users don't need the full pipeline for popular models.</p>
+<hr />
+<h3 id="whitespace-opportunities-no-competitor-covers">Whitespace Opportunities (No Competitor Covers)<a class="headerlink" href="#whitespace-opportunities-no-competitor-covers" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Opportunity</th>
+<th>Why it's winml-cli territory</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><strong>Cross-EP regression table</strong> (one command, all EPs)</td>
+<td>Multi-EP is the unique Windows AI challenge; no Android/iOS tool does this</td>
+</tr>
+<tr>
+<td><strong>Quantization config recommender</strong> (<code>winml recommend --target qnn --constraint latency=20ms</code>)</td>
+<td>Rule-based recommendation from hardware+model arch analysis</td>
+</tr>
+<tr>
+<td><strong>EP-aware ONNX graph visualizer</strong> (Netron + green/yellow/red per EP)</td>
+<td>Netron exists but has no EP coverage overlay</td>
+</tr>
+<tr>
+<td><strong>Thermal/sustained-load profiling</strong> (latency curve over 100 runs, detect throttling)</td>
+<td>AI Hub hides variance; no tool surfaces thermal behavior</td>
+</tr>
+<tr>
+<td><strong>Windows AI Model Package</strong> (.mlpackage equivalent with multi-EP manifest)</td>
+<td>Apple has .mlpackage; Windows has nothing equivalent</td>
+</tr>
+</tbody>
+</table>
+<hr />
+<h2 id="skill-use-winml-cli-existing-extend">Skill: <code>use-winml-cli</code> (existing — extend)<a class="headerlink" href="#skill-use-winml-cli-existing-extend" title="Permanent link">&para;</a></h2>
+<p><strong>Status:</strong> Exists at <code>skills/use-winml-cli/SKILL.md</code>. Needs two additions:</p>
+<ul>
+<li>Add <code>winml run</code> and <code>winml serve</code> usage (currently missing)</li>
+<li>Add "first-time onboarding" path for users who don't know where to start</li>
+</ul>
+<p>No structural changes needed; the existing skill is the general entry point.</p>
+<hr />
+<h2 id="skill-debug-model">Skill: <code>debug-model</code><a class="headerlink" href="#skill-debug-model" title="Permanent link">&para;</a></h2>
+<h3 id="frontmatter">Frontmatter<a class="headerlink" href="#frontmatter" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">debug-model</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this lightweight, read-only skill when a user wants fast explanation of why a model is</span>
+<span class="w">  </span><span class="no">behaving poorly (accuracy, latency, fallback, instability) without launching a full search loop.</span>
+<span class="w">  </span><span class="no">The skill gathers signals from winml inspect, winml analyze, profiling, and op tracing, then</span>
+<span class="w">  </span><span class="no">returns a concise report with evidence-backed root causes and prioritized actions.</span>
+</code></pre></div>
+
+<h3 id="when-to-use">When to use<a class="headerlink" href="#when-to-use" title="Permanent link">&para;</a></h3>
+<ul>
+<li>"My model runs but latency is much worse than expected"</li>
+<li>"NPU and CPU results or speed are inconsistent"</li>
+<li>"I need a quick diagnosis before trying autoconfig"</li>
+<li>"Explain what winml inspect/analyze/perf data means and what I should do next"</li>
+</ul>
+<h3 id="sections">Sections<a class="headerlink" href="#sections" title="Permanent link">&para;</a></h3>
+<p><strong>1. Gather signals (read-only)</strong></p>
+<div class="codehilite"><pre><span></span><code>winml inspect -m &lt;model&gt; --format json
+winml analyze -m &lt;model&gt; --ep &lt;ep&gt; --format json
+winml perf -m &lt;model&gt; --ep &lt;ep&gt; --device &lt;target&gt; --format json
+winml optrace -m &lt;model&gt; --ep &lt;ep&gt; --format json   # if available
+</code></pre></div>
+<p><strong>2. Explainable diagnosis rules</strong></p>
+<table>
+<thead>
+<tr>
+<th>Symptom</th>
+<th>Likely cause (evidence)</th>
+<th>Suggested action</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>High p99 + high variance in perf</td>
+<td>DVFS / thermal throttling likely</td>
+<td>Use multi-session averaging protocol before concluding regression</td>
+</tr>
+<tr>
+<td>Many partial/unsupported ops in analyze</td>
+<td>EP fallback dominates</td>
+<td>Change EP target or alter graph/precision to reduce fallback ops</td>
+</tr>
+<tr>
+<td>optrace hotspots align with fallback boundary</td>
+<td>Cross-EP boundary overhead</td>
+<td>Prioritize fixes around top fallback ops reported by optrace</td>
+</tr>
+<tr>
+<td>inspect shows dynamic shapes / unstable input profile</td>
+<td>Shape/path mismatch between expected and actual traffic</td>
+<td>Pin representative shapes and rerun perf/eval on production-like data</td>
+</tr>
+<tr>
+<td>accuracy drop only after precision lowering</td>
+<td>precision sensitivity in key subgraph</td>
+<td>Escalate precision for sensitive nodes, then re-check objective</td>
+</tr>
+</tbody>
+</table>
+<p><strong>3. Output contract (fast, explainable)</strong></p>
+<ul>
+<li>Input: model (+ optional EP/device target)</li>
+<li>Output: short report + prioritized action list</li>
+<li>Mode: read-only (no model mutation, no automatic rebuild loop)</li>
+</ul>
+<p><strong>Cross-references:</strong></p>
+<ul>
+<li>To compare precision options systematically → <code>autoconfig</code> (manual or automated optimize path)</li>
+<li>If an op is listed as unsupported → <code>check-model-feasibility</code></li>
+</ul>
+<hr />
+<h2 id="skill-ship-to-winapp-merge-of-validate-before-ship-prepare-for-winapp">Skill: <code>ship-to-winapp</code> (merge of <code>validate-before-ship</code> + <code>prepare-for-winapp</code>)<a class="headerlink" href="#skill-ship-to-winapp-merge-of-validate-before-ship-prepare-for-winapp" title="Permanent link">&para;</a></h2>
+<p>Covers the whole ship-time phase: <strong>first validate</strong> the model meets the Definition-of-Done,
+<strong>then package</strong> the multi-EP artifacts and manifest for the WinApp to load at runtime.</p>
+<h3 id="frontmatter_1">Frontmatter<a class="headerlink" href="#frontmatter_1" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ship-to-winapp</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when taking a winml-cli model artifact the last mile into a Windows</span>
+<span class="w">  </span><span class="no">application — both validating it is good enough to ship and packaging it for the app.</span>
+<span class="w">  </span><span class="no">Validation half: a Definition-of-Done checklist covering artifact completeness, accuracy</span>
+<span class="w">  </span><span class="no">vs FP32 baseline, performance SLA, output correctness on real inputs, cross-EP consistency,</span>
+<span class="w">  </span><span class="no">and fallback chain (every item checked or explicitly waived). Packaging half: how to organize</span>
+<span class="w">  </span><span class="no">multi-EP artifacts (QNN/NPU, OpenVINO, VitisAI, DirectML/GPU, CPU fallback), the recommended</span>
+<span class="w">  </span><span class="no">directory layout and manifest.json for runtime EP selection, and the runtime EP detection /</span>
+<span class="w">  </span><span class="no">fallback pattern. Use when the user says &quot;I&#39;m ready to ship&quot;, &quot;what should I test before</span>
+<span class="w">  </span><span class="no">release&quot;, &quot;how do I know the model is good enough&quot;, &quot;how do I use this in my app&quot;,</span>
+<span class="w">  </span><span class="no">&quot;how do I package the model&quot;, or &quot;what file do I load at runtime&quot;.</span>
+</code></pre></div>
+
+<h3 id="when-to-use_1">When to use<a class="headerlink" href="#when-to-use_1" title="Permanent link">&para;</a></h3>
+<ul>
+<li>About to ship a WinApp with on-device inference; final QA gate before production</li>
+<li>After any build config change (new quantization, new EP, new model version)</li>
+<li>"I built the model, how do I ship it in my app?"</li>
+<li>"How do I load different models for different hardware / what happens with no NPU?"</li>
+<li>"How do I package QNN + DML + CPU variants together?"</li>
+</ul>
+<hr />
+<h3 id="part-a-validate-definition-of-done-gates">Part A — Validate (Definition-of-Done gates)<a class="headerlink" href="#part-a-validate-definition-of-done-gates" title="Permanent link">&para;</a></h3>
+<p><strong>The checklist</strong></p>
+<p><strong>Gate 1 — Artifact completeness</strong></p>
+<ul>
+<li>[ ] All target EP artifacts exist and are loadable</li>
+<li>[ ] CPU fallback artifact exists</li>
+<li>[ ] manifest.json (if using multi-EP layout) is valid and references existing files</li>
+<li>[ ] Artifact was built with <code>winml build</code> (not opaque cache artifact)</li>
+</ul>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>inspect<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w">  </span><span class="c1"># verify each artifact loads</span>
+</code></pre></div>
+
+<p><strong>Gate 2 — Accuracy vs FP32 baseline</strong></p>
+<ul>
+<li>[ ] cosine_similarity ≥ 0.99 for FP16 artifacts</li>
+<li>[ ] cosine_similarity ≥ 0.95 for W8A16 artifacts</li>
+<li>[ ] cosine_similarity ≥ 0.90 for W8A8 artifacts (or task-specific threshold)</li>
+<li>[ ] Task accuracy metric (Top-1, F1, mAP) within acceptable drop from FP32</li>
+</ul>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>--mode<span class="w"> </span>compare<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model&gt;
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model&gt;<span class="w">  </span><span class="c1"># task accuracy</span>
+</code></pre></div>
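+<p>The Gate 2 floors can be written down as a small lookup so that a script applies them consistently. This is only a sketch: the function name and the <code>task_floor</code> override are hypothetical, and the similarity value is assumed to have been read from the eval output by hand or by other tooling.</p>

```python
# Cosine-similarity floors from the Gate 2 checklist, keyed by precision.
COSINE_FLOORS = {"fp16": 0.99, "w8a16": 0.95, "w8a8": 0.90}


def passes_accuracy_gate(precision, cosine_similarity, task_floor=None):
    """Return True when the similarity meets the floor for this precision.

    task_floor (hypothetical parameter) replaces the default floor when a
    task-specific threshold has been agreed on.
    """
    if task_floor is not None:
        floor = task_floor
    else:
        floor = COSINE_FLOORS[precision.lower()]
    return cosine_similarity >= floor
```

+<p>For example, <code>passes_accuracy_gate("w8a16", 0.94)</code> is false because 0.94 is below the 0.95 floor. The task accuracy metric still has to be checked separately.</p>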
+
+<p><strong>Gate 3 — Performance SLA</strong></p>
+<ul>
+<li>[ ] p50 latency meets application target on target device</li>
+<li>[ ] p99 latency within 2x p50 (no outlier spikes)</li>
+<li>[ ] Benchmark run on actual target hardware (not developer machine)</li>
+</ul>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--device<span class="w"> </span>&lt;target&gt;<span class="w"> </span>--iterations<span class="w"> </span><span class="m">100</span><span class="w"> </span>--monitor
+</code></pre></div>
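+<p>A minimal sketch of the two numeric Gate 3 checks, assuming the per-iteration latencies have already been extracted into a list of milliseconds. The layout of the perf JSON is not described here, so parsing is left out, and the function and key names are hypothetical.</p>

```python
import statistics


def check_latency_gate(latencies_ms, p50_target_ms, max_p99_ratio=2.0):
    """Evaluate the p50 target and the p99 <= 2x p50 rule on raw samples."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    p50, p99 = cuts[49], cuts[98]
    return {
        "p50_ms": p50,
        "p99_ms": p99,
        "meets_target": p50 <= p50_target_ms,
        "p99_within_ratio": p99 <= max_p99_ratio * p50,
    }
```

+<p>With only 100 iterations the p99 estimate rests on one or two samples, so a failing ratio is best confirmed with a longer run before it is treated as a regression.</p>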
+
+<p><strong>Gate 4 — Output correctness on real inputs</strong></p>
+<ul>
+<li>[ ] Model produces correct output on ≥3 representative real-world inputs</li>
+<li>[ ] No NaN or Inf in outputs</li>
+<li>[ ] Output shape matches expected shape</li>
+</ul>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--file<span class="w"> </span>&lt;real_input&gt;<span class="w">  </span><span class="c1"># visual/manual check</span>
+</code></pre></div>
+
+<p><strong>Gate 5 — Cross-EP consistency (if shipping multiple EP variants)</strong></p>
+<ul>
+<li>[ ] QNN and DML outputs agree within tolerance on same input</li>
+<li>[ ] CPU fallback output agrees with primary EP within tolerance</li>
+</ul>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>model_qnn.onnx<span class="w"> </span>--file<span class="w"> </span>sample.jpg<span class="w"> </span>--format<span class="w"> </span>json<span class="w"> </span>-o<span class="w"> </span>qnn_out.json
+winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>model_dml.onnx<span class="w"> </span>--file<span class="w"> </span>sample.jpg<span class="w"> </span>--format<span class="w"> </span>json<span class="w"> </span>-o<span class="w"> </span>dml_out.json
+winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>model_cpu.onnx<span class="w"> </span>--file<span class="w"> </span>sample.jpg<span class="w"> </span>--format<span class="w"> </span>json<span class="w"> </span>-o<span class="w"> </span>cpu_out.json
+<span class="c1"># compare qnn_out.json vs dml_out.json vs cpu_out.json manually</span>
+</code></pre></div>
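+<p>The manual comparison could be scripted roughly as follows. The JSON layout written by the run command is not specified here, so this sketch makes no assumption about field names and simply collects every numeric leaf in order. The 0.99 default threshold is a placeholder, not a documented tolerance.</p>

```python
import json
import math


def flatten_numbers(value):
    """Recursively collect every numeric leaf of a parsed JSON value, in order."""
    if isinstance(value, bool):  # bool is a subclass of int; skip it
        return []
    if isinstance(value, (int, float)):
        return [float(value)]
    if isinstance(value, list):
        return [x for item in value for x in flatten_numbers(item)]
    if isinstance(value, dict):
        return [x for key in sorted(value) for x in flatten_numbers(value[key])]
    return []


def cosine_similarity(a, b):
    if len(a) != len(b) or not a:
        raise ValueError("outputs must be non-empty and the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare an all-zero output")
    return dot / (norm_a * norm_b)


def compare_outputs(reference_json, candidate_json, threshold=0.99):
    """Return (score, passed) for two JSON documents given as strings."""
    ref = flatten_numbers(json.loads(reference_json))
    cand = flatten_numbers(json.loads(candidate_json))
    if any(math.isnan(x) or math.isinf(x) for x in ref + cand):
        raise ValueError("NaN or Inf in outputs")
    score = cosine_similarity(ref, cand)
    return score, score >= threshold
```

+<p>If the files also carry numeric metadata such as timings, those values would need to be removed first, since this sketch treats every number as model output.</p>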
+
+<p><strong>Gate 6 — Fallback chain</strong></p>
+<ul>
+<li>[ ] CPU fallback artifact verified independently (not just assumed to work)</li>
+<li>[ ] App runtime selects correct artifact when target EP is absent (simulate by removing EP)</li>
+</ul>
+<p><strong>Waiver policy</strong></p>
+<p>Any item that cannot be completed must be waived explicitly:</p>
+<div class="codehilite"><pre><span></span><code>Waivers:
+- Cross-EP consistency: VitisAI not available on developer machine.
+  Verified on target hardware by QA team. Issue #NNN.
+- Performance SLA: Target hardware (Snapdragon X Elite) in procurement.
+  Benchmark deferred to post-merge, tracked in issue #NNN.
+</code></pre></div>
+
+<p>Unchecked items without waiver → do not ship.</p>
+<p><strong>L-level mapping</strong> — the 6 gates map directly to the L1–L5 confidence system (see Overview):</p>
+<table>
+<thead>
+<tr>
+<th>Gate</th>
+<th>L-level</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Gate 1 — Artifact completeness</td>
+<td>L1</td>
+</tr>
+<tr>
+<td>Gate 2 — Accuracy vs FP32 baseline</td>
+<td>L3 + L4</td>
+</tr>
+<tr>
+<td>Gate 3 — Performance SLA</td>
+<td>L5</td>
+</tr>
+<tr>
+<td>Gate 4 — Output correctness on real inputs</td>
+<td>L4</td>
+</tr>
+<tr>
+<td>Gate 5 — Cross-EP consistency</td>
+<td>L5</td>
+</tr>
+<tr>
+<td>Gate 6 — Fallback chain</td>
+<td>L1 (CPU artifact)</td>
+</tr>
+</tbody>
+</table>
+<p>Minimum to ship: all L1 and L3 checks pass. A production release also requires L4 and L5.</p>
+<p><strong>Quick command reference</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Gate 1: inspect all artifacts</span>
+<span class="k">for</span><span class="w"> </span>f<span class="w"> </span><span class="k">in</span><span class="w"> </span>model_qnn.onnx<span class="w"> </span>model_dml.onnx<span class="w"> </span>model_cpu.onnx<span class="p">;</span><span class="w"> </span><span class="k">do</span><span class="w"> </span>winml<span class="w"> </span>inspect<span class="w"> </span>-m<span class="w"> </span><span class="nv">$f</span><span class="p">;</span><span class="w"> </span><span class="k">done</span>
+<span class="c1"># Gate 2: accuracy</span>
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>--mode<span class="w"> </span>compare<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model&gt;
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model&gt;
+<span class="c1"># Gate 3: perf</span>
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--device<span class="w"> </span>&lt;target&gt;<span class="w"> </span>--iterations<span class="w"> </span><span class="m">100</span><span class="w"> </span>--monitor
+<span class="c1"># Gate 4: real input</span>
+winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>&lt;artifact&gt;.onnx<span class="w"> </span>--file<span class="w"> </span>&lt;sample&gt;
+<span class="c1"># Gate 5: cross-EP (run individually, compare outputs)</span>
+winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>model_qnn.onnx<span class="w"> </span>--file<span class="w"> </span>&lt;sample&gt;<span class="w"> </span>--format<span class="w"> </span>json
+winml<span class="w"> </span>run<span class="w"> </span>-m<span class="w"> </span>model_dml.onnx<span class="w"> </span>--file<span class="w"> </span>&lt;sample&gt;<span class="w"> </span>--format<span class="w"> </span>json
+</code></pre></div>
+
+<hr />
+<h3 id="part-b-package-integrate-multi-ep">Part B — Package &amp; integrate (multi-EP)<a class="headerlink" href="#part-b-package-integrate-multi-ep" title="Permanent link">&para;</a></h3>
+<p><strong>1. The multi-EP artifact problem</strong>
+<code>winml compile</code> produces EP-locked files (not portable), so a WinApp needs a strategy to
+select the right file per device.</p>
+<p><strong>2. Recommended artifact layout</strong></p>
+<div class="codehilite"><pre><span></span><code>my_model/
+  manifest.json          ← EP → file mapping + version
+  model_qnn.onnx         ← QNN NPU (compiled, Snapdragon X)
+  model_openvino.onnx    ← OpenVINO NPU/GPU (Intel Core Ultra)
+  model_vitisai.onnx     ← VitisAI NPU (AMD Ryzen AI)
+  model_dml.onnx         ← DirectML GPU (any GPU, non-NPU machines)
+  model_cpu.onnx         ← CPU fallback (universal)
+</code></pre></div>
+
+<p><strong>3. manifest.json schema</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;model_id&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;facebook/convnext-tiny-224&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;task&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;image-classification&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;version&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;1.0.0&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;variants&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
+<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;npu&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_qnn.onnx&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a16&quot;</span><span class="w"> </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;openvino&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;npu&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_openvino.onnx&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a8&quot;</span><span class="w">  </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;vitisai&quot;</span><span class="p">,</span><span class="w">   </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;npu&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_vitisai.onnx&quot;</span><span class="p">,</span><span class="w">   </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a8&quot;</span><span class="w">  </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;dml&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;gpu&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_dml.onnx&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;fp16&quot;</span><span class="w">  </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">,</span><span class="w">  </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_cpu.onnx&quot;</span><span class="p">,</span><span class="w">       </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a8&quot;</span><span class="w">  </span><span class="p">}</span>
+<span class="w">  </span><span class="p">],</span>
+<span class="w">  </span><span class="nt">&quot;selection_order&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;openvino&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;vitisai&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;dml&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">]</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p>(For multi-EP artifacts, <code>autoconfig</code> emits this <code>manifest.json</code> directly with experiment provenance.)</p>
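+<p>The Gate 1 requirement that manifest.json "is valid and references existing files" can be checked mechanically. The sketch below assumes the schema shown above and nothing more. <code>validate_manifest</code> is a hypothetical helper, not part of winml-cli, and treating all four variant keys as required is an assumption.</p>

```python
import json
from pathlib import Path

# Keys every variant carries in the example schema (treated as required here).
REQUIRED_VARIANT_KEYS = {"ep", "device", "file", "precision"}


def validate_manifest(manifest_path):
    """Return a list of problems; an empty list means the manifest passed."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    by_ep = {}
    for index, variant in enumerate(manifest.get("variants", [])):
        missing = REQUIRED_VARIANT_KEYS - variant.keys()
        if missing:
            problems.append(f"variant {index} is missing keys: {sorted(missing)}")
            continue
        by_ep[variant["ep"]] = variant
        # File paths are resolved relative to the manifest's own directory.
        if not (manifest_path.parent / variant["file"]).is_file():
            problems.append(f"file not found: {variant['file']}")
    for ep in manifest.get("selection_order", []):
        if ep not in by_ep:
            problems.append(f"selection_order names unknown ep: {ep}")
    if "cpu" not in by_ep:
        problems.append("no CPU fallback variant")
    return problems
```

+<p>This only confirms that files exist. Whether each artifact actually loads is still covered by the inspect step in Gate 1.</p>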
+<p><strong>4. Building all variants with winml-cli</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Generate configs per EP</span>
+winml<span class="w"> </span>config<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>--device<span class="w"> </span>npu<span class="w"> </span>--ep<span class="w"> </span>qnn<span class="w"> </span>-o<span class="w"> </span>config_qnn.json
+winml<span class="w"> </span>config<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>--device<span class="w"> </span>npu<span class="w"> </span>--ep<span class="w"> </span>openvino<span class="w"> </span>-o<span class="w"> </span>config_ov.json
+winml<span class="w"> </span>config<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>--device<span class="w"> </span>gpu<span class="w"> </span>--ep<span class="w"> </span>dml<span class="w"> </span>-o<span class="w"> </span>config_dml.json
+winml<span class="w"> </span>config<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>--device<span class="w"> </span>cpu<span class="w"> </span>-o<span class="w"> </span>config_cpu.json
+
+<span class="c1"># Build all</span>
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_qnn.json<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>-o<span class="w"> </span>out_qnn/
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_ov.json<span class="w">  </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>-o<span class="w"> </span>out_ov/
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_dml.json<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>-o<span class="w"> </span>out_dml/
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_cpu.json<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>-o<span class="w"> </span>out_cpu/
+</code></pre></div>
+
+<p><strong>5. Runtime EP selection pattern (C++ / ORT)</strong></p>
+<p>Pseudocode for app-side logic:</p>
+<ul>
+<li>Read manifest.json</li>
+<li>Query available EPs on device (<code>GetAvailableProviders()</code> or <code>winml sys</code> equivalent)</li>
+<li>Walk <code>selection_order</code>, pick first EP available on this device</li>
+<li>Load the corresponding file</li>
+<li>If no accelerated EP is available or loading fails → fall back to the CPU variant, which is always available</li>
+</ul>
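+<p>The selection walk can be sketched in Python (rather than C++) to keep it short. The mapping from manifest <code>ep</code> names to ONNX Runtime provider names is an assumption of this sketch, and <code>select_variant</code> is a hypothetical helper. The list of available providers is passed in, so the logic can be exercised without the runtime installed.</p>

```python
# Assumed mapping from manifest "ep" values to ONNX Runtime provider names.
EP_TO_PROVIDER = {
    "qnn": "QNNExecutionProvider",
    "openvino": "OpenVINOExecutionProvider",
    "vitisai": "VitisAIExecutionProvider",
    "dml": "DmlExecutionProvider",
    "cpu": "CPUExecutionProvider",
}


def select_variant(manifest, available_providers):
    """Walk selection_order and return (ep, provider, file) for the first usable EP."""
    files = {variant["ep"]: variant["file"] for variant in manifest["variants"]}
    for ep in manifest["selection_order"]:
        provider = EP_TO_PROVIDER.get(ep)
        # Skip EPs this sketch does not know, EPs missing on the device,
        # and EPs that have no artifact in the manifest.
        if provider in available_providers and ep in files:
            return ep, provider, files[ep]
    raise RuntimeError("no variant in selection_order is usable on this device")
```

+<p>In an app using the ONNX Runtime Python API, <code>available_providers</code> would typically come from <code>onnxruntime.get_available_providers()</code>, and the selected file would be opened with <code>onnxruntime.InferenceSession(path, providers=[provider])</code>. If session creation fails, retrying with the next entry in the order keeps the fallback chain intact.</p>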
+<p><strong>6. What NOT to do</strong></p>
+<ul>
+<li>Don't load a QNN-compiled model with CPU EP → will fail or produce wrong results</li>
+<li>Don't hardcode EP names → check availability at runtime</li>
+<li>Don't ship only the compiled artifact without a CPU fallback</li>
+</ul>
+<p><strong>Cross-references:</strong></p>
+<ul>
+<li>If diagnosis is needed before rerun → <code>debug-model</code></li>
+<li>If performance gate fails → <code>autoconfig</code> (manual or automated optimize path)</li>
+<li>If EP not available for testing, or to pick the right EP → <code>check-model-feasibility</code></li>
+<li>To build the artifacts → <code>use-winml-cli</code></li>
+</ul>
+<hr />
+<h2 id="skill-check-model-feasibility-merge-of-find-a-model-ep-compatibility-check">Skill: <code>check-model-feasibility</code> (merge of <code>find-a-model</code> + <code>ep-compatibility-check</code>)<a class="headerlink" href="#skill-check-model-feasibility-merge-of-find-a-model-ep-compatibility-check" title="Permanent link">&para;</a></h2>
+<p>The pre-build front door. Two entry points, one shared engine (<code>inspect</code> → <code>sys</code> → <code>analyze</code>):
+<strong>(A)</strong> the user has no model yet → recommend a <em>supported</em> one from their constraints;
+<strong>(B)</strong> the user has a model → confirm it runs on their target EP/device. Both converge on the
+same three-layer check, so they are one skill.</p>
+<h3 id="frontmatter_2">Frontmatter<a class="headerlink" href="#frontmatter_2" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-model-feasibility</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill before a full build, to answer two linked questions: &quot;which model should I</span>
+<span class="w">  </span><span class="no">use?&quot; and &quot;will it run on my hardware?&quot;. Model discovery: when the user knows the task</span>
+<span class="w">  </span><span class="no">(image classification, text embedding, object detection, summarization, …) but has no model</span>
+<span class="w">  </span><span class="no">yet, gather their constraints, generate Hugging Face candidates, and screen each one for</span>
+<span class="w">  </span><span class="no">winml-cli support. Compatibility: for a chosen (or candidate) model, run the three-layer check</span>
+<span class="w">  </span><span class="no">— winml inspect (model support), winml sys (EP availability on this machine), winml analyze</span>
+<span class="w">  </span><span class="no">(operator-level EP coverage) — plus the EP-to-hardware mapping and fallback chain for Windows</span>
+<span class="w">  </span><span class="no">AI PCs. Use when the user says &quot;what model should I use for X&quot;, &quot;find me a model that runs</span>
+<span class="w">  </span><span class="no">under 20ms on the NPU&quot;, &quot;recommend a small image classifier&quot;, &quot;I don&#39;t have a model yet&quot;,</span>
+<span class="w">  </span><span class="no">&quot;will this work on my device&quot;, &quot;is QNN supported here&quot;, &quot;what hardware do I need for NPU&quot;,</span>
+<span class="w">  </span><span class="no">or when they hit an unsupported-operator error.</span>
+
+<span class="nt">audience</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">external (WinApp developers)</span>
+</code></pre></div>
+
+<h3 id="when-to-use_2">When to use<a class="headerlink" href="#when-to-use_2" title="Permanent link">&para;</a></h3>
+<ul>
+<li>"What model should I use for background blur / OCR / summarization?"</li>
+<li>"Find a text-embedding model under 100MB that runs on the Intel NPU"</li>
+<li>"Will this model work on my Snapdragon X Elite laptop? Is QNN supported here?"</li>
+<li>"The compile step failed with an unsupported op"</li>
+<li>Starting a new project: pick a model and verify feasibility before investing build time</li>
+</ul>
+<h3 id="what-this-skill-does-not-do">What this skill does NOT do<a class="headerlink" href="#what-this-skill-does-not-do" title="Permanent link">&para;</a></h3>
+<ul>
+<li>It does not train, fine-tune, or optimize a model — optimization hands off to <code>autoconfig</code>.</li>
+<li>It only recommends models whose architecture winml-cli can actually export/run (verified via
+  <code>winml inspect</code>), never an arbitrary HF model it cannot load.</li>
+</ul>
+<h3 id="sections_1">Sections<a class="headerlink" href="#sections_1" title="Permanent link">&para;</a></h3>
+<p><strong>1. Two entry points</strong></p>
+<ul>
+<li>(A) <strong>No model yet</strong> → run Section 2 (discovery) to produce candidates, then Section 3 on each.</li>
+<li>(B) <strong>Have a model</strong> → skip to Section 3 (three-layer check) directly.</li>
+</ul>
+<p><strong>2. Discovery — find candidate models (entry point A)</strong>
+Capture and lock the selection constraints first:</p>
+<table>
+<thead>
+<tr>
+<th>Condition</th>
+<th>Example</th>
+<th>Drives</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Task</td>
+<td>image-classification, feature-extraction, text-generation</td>
+<td>HF Hub filter</td>
+</tr>
+<tr>
+<td>Target device / EP</td>
+<td>Snapdragon X NPU (QNN), Intel NPU (OpenVINO), any GPU (DML)</td>
+<td>feasibility + latency class</td>
+</tr>
+<tr>
+<td>Latency budget</td>
+<td>p50 ≤ 20 ms</td>
+<td>size / architecture shortlist</td>
+</tr>
+<tr>
+<td>Accuracy need</td>
+<td>"≥ ResNet-50 top-1" or a benchmark floor</td>
+<td>candidate quality bar</td>
+</tr>
+<tr>
+<td>Size limit</td>
+<td>≤ 100 MB on disk</td>
+<td>excludes large variants</td>
+</tr>
+<tr>
+<td>License</td>
+<td>permissive (Apache-2.0 / MIT)</td>
+<td>excludes restricted models</td>
+</tr>
+</tbody>
+</table>
+<p>The agent queries the HF Hub by task, sorted by downloads/likes, restricted to architecture
+families winml-cli is known to support → a 5–10 model shortlist. Each candidate then goes
+through the three-layer check below; drop any that fail Layer 1 or have heavy unsupported ops.</p>
+<p><strong>3. The three-layer feasibility check (entry points A and B)</strong></p>
+<p>Layer 1 — Model support · Layer 2 — EP availability · Layer 3 — Operator coverage.
+Run the layers in order and stop at the first hard failure.</p>
+<p><em>Layer 1 — Model support</em></p>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>inspect<span class="w"> </span>-m<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>--format<span class="w"> </span>json
+</code></pre></div>
+
+<p>Check that <code>loader</code>, <code>exporter</code>, and <code>winml_inference_class</code> are populated. If inspect fails or shows
+"unsupported" → model is out of scope for winml-cli (drop the candidate; do not recommend it).</p>
+<p><em>Layer 2 — EP availability</em></p>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>sys<span class="w"> </span>--list-ep<span class="w"> </span>--list-device
+</code></pre></div>
+
+<table>
+<thead>
+<tr>
+<th>EP</th>
+<th>Hardware requirement</th>
+<th>Check for</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>QNN</td>
+<td>Qualcomm Snapdragon X Elite / X Plus</td>
+<td>QNNExecutionProvider in list</td>
+</tr>
+<tr>
+<td>OpenVINO</td>
+<td>Intel Core Ultra (Meteor Lake / Lunar Lake+)</td>
+<td>OpenVINOExecutionProvider</td>
+</tr>
+<tr>
+<td>VitisAI</td>
+<td>AMD Ryzen AI (Phoenix / Hawk Point / Strix)</td>
+<td>VitisAIExecutionProvider</td>
+</tr>
+<tr>
+<td>NvTensorRTRTX</td>
+<td>NVIDIA discrete GPU (RTX series)</td>
+<td>NvTensorRTRTXExecutionProvider</td>
+</tr>
+<tr>
+<td>DML</td>
+<td>Any DirectX 12 GPU</td>
+<td>DmlExecutionProvider</td>
+</tr>
+<tr>
+<td>CPU</td>
+<td>Any</td>
+<td>Always available</td>
+</tr>
+</tbody>
+</table>
+<p>If the desired EP is not listed → recommend next best EP from the fallback chain.</p>
+<p><em>Layer 3 — Operator coverage</em></p>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>analyze<span class="w"> </span>-m<span class="w"> </span>&lt;exported_model&gt;.onnx<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span>--format<span class="w"> </span>json
+<span class="c1"># or for all EPs at once:</span>
+winml<span class="w"> </span>analyze<span class="w"> </span>-m<span class="w"> </span>&lt;exported_model&gt;.onnx<span class="w"> </span>--device<span class="w"> </span>all
+</code></pre></div>
+
+<ul>
+<li><code>supported</code> (green): op runs natively on EP</li>
+<li><code>partial</code> (yellow): op may fall back to CPU for some configurations</li>
+<li><code>unsupported</code> (red): op cannot run on this EP</li>
+</ul>
+<p>Decision rule: any <code>unsupported</code> → either change EP or accept CPU fallback for those ops
+(which may impact accuracy and latency).</p>
+<p><strong>4. Fallback chain recommendation</strong></p>
+<p>If the target EP is not available or has unsupported ops:</p>
+<div class="codehilite"><pre><span></span><code>QNN not available → OpenVINO (if Intel) or VitisAI (if AMD) → DML → CPU
+</code></pre></div>
+
+<p><strong>5. Rank and recommend (entry point A) / fast-fail before compile (entry point B)</strong></p>
+<ul>
+<li>Discovery: rank surviving candidates by fit against the locked conditions (size, latency
+  class, accuracy reference, op coverage, downloads as a popularity prior). Output a short
+  ranked table + one recommended pick + rationale.</li>
+<li><code>winml compile</code> is expensive (minutes). Always run <code>analyze</code> first; if it shows &gt;20%
+  unsupported ops → likely not worth compiling for that EP.</li>
+</ul>
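+<p>The fast-fail rule can be sketched as below. This assumes the per-op statuses have already been extracted from the <code>analyze</code> output into a plain list of strings; that input shape and the function names are assumptions for illustration, since the JSON schema is not shown here.</p>

```python
def unsupported_fraction(op_statuses: list[str]) -> float:
    """Fraction of ops whose status is 'unsupported' (0.0 for an empty model)."""
    if not op_statuses:
        return 0.0
    unsupported = sum(1 for status in op_statuses if status == "unsupported")
    return unsupported / len(op_statuses)


def worth_compiling(op_statuses: list[str], max_unsupported: float = 0.20) -> bool:
    """Apply the >20% rule: skip the expensive compile when too many ops are unsupported."""
    return unsupported_fraction(op_statuses) <= max_unsupported
```

+<p>The 0.20 default mirrors the threshold stated above; it is a heuristic cut-off, so treat a result near the boundary as a prompt to inspect which ops are unsupported before deciding.</p>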
+<p><strong>Cross-references:</strong></p>
+<ul>
+<li>After picking a model + confirming feasibility → <code>autoconfig</code> (find the optimal config)</li>
+<li>To build the chosen artifacts → <code>use-winml-cli</code></li>
+<li>If <strong>no</strong> supported model meets the constraints, or all EPs show unsupported ops → the gap
+  feeds <code>optimization-research</code> (long-tail coverage) and <code>adding-model-support</code></li>
+</ul>
+<blockquote>
+<p>Addresses the <strong>Pre-quantized model zoo / cold-start</strong> whitespace from the Competitive Analysis:
+NVIDIA (<code>nvidia/</code> HF org) and AI Hub (500+ models) reduce cold-start with curated zoos; winml-cli
+has none, so this skill substitutes a constraints-driven recommender that only returns <em>supported</em> models.</p>
+</blockquote>
+<hr />
+<h2 id="skill-adding-model-support-contributor">Skill: <code>adding-model-support</code> (contributor)<a class="headerlink" href="#skill-adding-model-support-contributor" title="Permanent link">&para;</a></h2>
+<h3 id="frontmatter_3">Frontmatter<a class="headerlink" href="#frontmatter_3" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">adding-model-support</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when contributing support for a new Hugging Face model to</span>
+<span class="w">  </span><span class="no">winml-cli. Covers finding the correct exporter, writing a recipe config,</span>
+<span class="w">  </span><span class="no">verifying at each pipeline stage (export → optimize → quantize → compile),</span>
+<span class="w">  </span><span class="no">and passing the L1–L5 validation gates before submitting a PR. Use when</span>
+<span class="w">  </span><span class="no">a contributor says &quot;I want to add support for model X&quot;, &quot;this model type</span>
+<span class="w">  </span><span class="no">is not supported&quot;, or &quot;how do I write a recipe for a new architecture&quot;.</span>
+</code></pre></div>
+
+<h3 id="when-to-use_3">When to use<a class="headerlink" href="#when-to-use_3" title="Permanent link">&para;</a></h3>
+<ul>
+<li>"I want to add support for Qwen3 / Phi-4 / [new model]"</li>
+<li>"winml-cli says this model is unsupported"</li>
+<li>"How do I write a recipe config for a new model family?"</li>
+</ul>
+<h3 id="sections_2">Sections<a class="headerlink" href="#sections_2" title="Permanent link">&para;</a></h3>
+<p><strong>1. Find the right exporter</strong></p>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>inspect<span class="w"> </span>-m<span class="w"> </span>&lt;hf_model_id&gt;<span class="w">  </span><span class="c1"># check if auto-detected</span>
+</code></pre></div>
+
+<p>If inspect fails → the model needs a new exporter or recipe.
+Look in <code>src/winml/modelkit/export/</code> for existing exporters as reference.</p>
+<p><strong>2. Find a reference model of the same family</strong></p>
+<ul>
+<li>Same architecture class (e.g., LlamaForCausalLM, BertModel)?</li>
+<li>Check <code>recipes/</code> for an existing <code>.json</code> config for that class</li>
+<li>Prefer copying the closest recipe and adjusting rather than writing from scratch</li>
+</ul>
+<p><strong>3. Write the recipe config</strong>
+Minimal recipe template:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;model_id&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;org/model-name&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;task&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;text-generation&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;export&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;opset&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">17</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;optimize&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;passes&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;MatMulAddFusion&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;LayerNormFusion&quot;</span><span class="p">]</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;quantize&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;mode&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a16&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;calibration_dataset&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;wikitext2&quot;</span><span class="w"> </span><span class="p">}</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p><strong>4. Validate at each stage (L1 → L5)</strong></p>
+<table>
+<thead>
+<tr>
+<th>Stage</th>
+<th>Command</th>
+<th>Pass criterion</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>L1: Export loads</td>
+<td><code>winml inspect -m &lt;exported&gt;.onnx</code></td>
+<td>No error</td>
+</tr>
+<tr>
+<td>L2: Shape correct</td>
+<td><code>winml eval -m &lt;exported&gt;.onnx --model-id &lt;id&gt;</code></td>
+<td>Output shape matches</td>
+</tr>
+<tr>
+<td>L3: Numerical parity</td>
+<td><code>winml eval --mode compare -m &lt;quantized&gt;.onnx --model-id &lt;id&gt;</code></td>
+<td>cosine ≥ threshold</td>
+</tr>
+<tr>
+<td>L4: Task accuracy</td>
+<td><code>winml eval -m &lt;quantized&gt;.onnx --model-id &lt;id&gt;</code></td>
+<td>Task metric in spec</td>
+</tr>
+<tr>
+<td>L5: Perf on target EP</td>
+<td><code>winml perf -m &lt;compiled&gt;.onnx --device &lt;target&gt;</code></td>
+<td>Meets latency target</td>
+</tr>
+</tbody>
+</table>
+<p><strong>5. Common pitfalls for new models</strong></p>
+<ul>
+<li>New op types not in operator coverage → run <code>winml analyze</code> early</li>
+<li>Attention variant (GQA, MQA, MLA) → check quantization mode compatibility</li>
+<li>Dynamic shapes → add explicit shape hints in export config</li>
+<li>Non-standard tokenizer → verify <code>winml run</code> input preprocessing</li>
+</ul>
+<p><strong>Cross-references:</strong></p>
+<ul>
+<li>If EP shows unsupported ops → <code>check-model-feasibility</code></li>
+<li>After L1–L5 all pass → <code>ship-to-winapp</code> for PR gate</li>
+</ul>
+<hr />
+<h2 id="skill-adding-ep-support-contributor">Skill: <code>adding-ep-support</code> (contributor)<a class="headerlink" href="#skill-adding-ep-support-contributor" title="Permanent link">&para;</a></h2>
+<h3 id="frontmatter_4">Frontmatter<a class="headerlink" href="#frontmatter_4" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">adding-ep-support</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when adding a new execution provider (EP) backend to</span>
+<span class="w">  </span><span class="no">winml-cli. Covers implementing the compile backend interface, adding</span>
+<span class="w">  </span><span class="no">EP-specific optimize passes, wiring the new EP into winml sys and</span>
+<span class="w">  </span><span class="no">winml analyze, and verifying coverage with the L1–L5 test gates.</span>
+<span class="w">  </span><span class="no">Use when a contributor says &quot;I want to add support for a new EP&quot;,</span>
+<span class="w">  </span><span class="no">&quot;how does the QNN compile backend work&quot;, or &quot;can we support EP X&quot;.</span>
+</code></pre></div>
+
+<h3 id="when-to-use_4">When to use<a class="headerlink" href="#when-to-use_4" title="Permanent link">&para;</a></h3>
+<ul>
+<li>Adding a new EP compile backend (e.g., a new NPU vendor)</li>
+<li>Extending an existing EP with new optimization passes</li>
+<li>Understanding how the existing QNN / OpenVINO / VitisAI backends are structured</li>
+</ul>
+<h3 id="sections_3">Sections<a class="headerlink" href="#sections_3" title="Permanent link">&para;</a></h3>
+<p><strong>1. EP backend interface</strong>
+Reference implementation: <code>src/winml/modelkit/compile/qnn_backend.py</code>
+Three methods to implement:</p>
+<div class="codehilite"><pre><span></span><code><span class="k">class</span><span class="w"> </span><span class="nc">MyEPBackend</span><span class="p">(</span><span class="n">CompileBackend</span><span class="p">):</span>
+    <span class="k">def</span><span class="w"> </span><span class="nf">is_available</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span> <span class="o">...</span>      <span class="c1"># detect EP on current machine</span>
+    <span class="k">def</span><span class="w"> </span><span class="nf">optimize</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span> <span class="o">...</span>   <span class="c1"># EP-specific graph transforms</span>
+    <span class="k">def</span><span class="w"> </span><span class="nf">compile</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">config</span><span class="p">):</span> <span class="o">...</span>    <span class="c1"># produce EP-locked artifact</span>
+</code></pre></div>
+
+<p><strong>2. Wire into EP registry</strong>
+Register in <code>src/winml/modelkit/ep_registry.py</code>:</p>
+<div class="codehilite"><pre><span></span><code><span class="n">EP_REGISTRY</span><span class="p">[</span><span class="s2">&quot;myep&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">MyEPBackend</span>
+</code></pre></div>
+
+<p>This makes <code>--ep myep</code> work in <code>winml config</code>, <code>winml compile</code>, <code>winml analyze</code>.</p>
+<p><strong>3. Add operator coverage data</strong>
+Add a coverage JSON to <code>src/winml/modelkit/analyze/coverage/myep_ops.json</code>:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span><span class="w"> </span><span class="nt">&quot;Add&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;supported&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;LayerNorm&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;partial&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;CustomOp&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;unsupported&quot;</span><span class="w"> </span><span class="p">}</span>
+</code></pre></div>
+
+<p>This is what <code>winml analyze --ep myep</code> reads.</p>
+<p><strong>4. Add to <code>winml sys</code> output</strong>
+Add EP availability check to <code>src/winml/commands/sys.py</code> so it appears
+in <code>winml sys --list-ep</code>.</p>
+<p><strong>5. L1–L5 validation for the new EP</strong></p>
+<p>Minimum before merging:</p>
+<ul>
+<li>L1: A known-good model compiles without crash</li>
+<li>L3: Compiled artifact passes <code>winml eval --mode compare</code> (cosine threshold)</li>
+<li>L5: <code>winml perf</code> produces valid latency output on target hardware</li>
+</ul>
+<p><strong>Cross-references:</strong></p>
+<ul>
+<li>Operator coverage analysis → <code>check-model-feasibility</code></li>
+<li>After adding: document the EP in the <code>check-model-feasibility</code> hardware table</li>
+</ul>
+<hr />
+<h2 id="skill-contributing-a-skill-contributor">Skill: <code>contributing-a-skill</code> (contributor)<a class="headerlink" href="#skill-contributing-a-skill-contributor" title="Permanent link">&para;</a></h2>
+<h3 id="frontmatter_5">Frontmatter<a class="headerlink" href="#frontmatter_5" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">contributing-a-skill</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when writing a new SKILL.md for winml-cli or improving</span>
+<span class="w">  </span><span class="no">an existing one. Covers frontmatter requirements, description writing</span>
+<span class="w">  </span><span class="no">(the description is the agent trigger, not a human summary), section</span>
+<span class="w">  </span><span class="no">structure conventions, cross-reference format, command accuracy</span>
+<span class="w">  </span><span class="no">requirements, and the review checklist before submitting. Use when a</span>
+<span class="w">  </span><span class="no">contributor says &quot;I want to add a new skill&quot;, &quot;how should I write</span>
+<span class="w">  </span><span class="no">SKILL.md&quot;, or &quot;what are the skill authoring rules&quot;.</span>
+</code></pre></div>
+
+<h3 id="when-to-use_5">When to use<a class="headerlink" href="#when-to-use_5" title="Permanent link">&para;</a></h3>
+<ul>
+<li>Writing a new skill for a gap not covered by existing skills</li>
+<li>Improving an existing skill with new commands or sections</li>
+<li>Reviewing a skill PR</li>
+</ul>
+<h3 id="sections_4">Sections<a class="headerlink" href="#sections_4" title="Permanent link">&para;</a></h3>
+<p><strong>1. Frontmatter rules</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">kebab-case-skill-name</span><span class="w">   </span><span class="c1"># matches directory name under skills/</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when &lt;trigger phrase describing user&#39;s problem&gt;.</span>
+<span class="w">  </span><span class="no">Covers &lt;what the skill teaches&gt;.</span>
+<span class="w">  </span><span class="no">Use when the user says &quot;&lt;example trigger phrase 1&gt;&quot;, &quot;&lt;example 2&gt;&quot;, or &lt;condition&gt;.</span>
+</code></pre></div>
+
+<p><strong>Critical:</strong> The <code>description</code> field is what the Copilot agent reads to decide
+whether to activate this skill. Write it as a trigger specification, not a
+documentation summary. Include representative user phrases in quotes.</p>
+<p><strong>2. Required sections (in order)</strong></p>
+<ol>
+<li><code>## When to use</code> — 3–5 bullet points with user-facing symptoms/questions</li>
+<li>Diagnostic or decision section — symptom → cause → fix structure</li>
+<li>Command examples — runnable <code>winml</code> commands with real flags</li>
+<li>Reference tables — hardware, thresholds, EP names as concrete data</li>
+<li><code>## Cross-references</code> — links to related skills using relative paths</li>
+</ol>
+<p><strong>3. Cross-reference format</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="k">-</span><span class="w"> </span>If model behavior is unclear → see <span class="sb">`.agents/skills/debug-model/SKILL.md`</span>
+<span class="k">-</span><span class="w"> </span>After validating → see <span class="sb">`.agents/skills/validate-before-ship/SKILL.md`</span>
+</code></pre></div>
+
+<p><strong>4. Content rules</strong></p>
+<ul>
+<li>All commands must be runnable exactly as written (no pseudocode flags)</li>
+<li>Include concrete numbers: thresholds (cosine ≥ 0.99), speedup (3–5×), latency (&lt;50ms)</li>
+<li>Target ~200 lines prose + tables; move deep content to <code>references/</code> subdirectory</li>
+<li>Do not duplicate content from another skill — cross-reference instead</li>
+</ul>
+<p><strong>5. Review checklist before PR</strong></p>
+<ul>
+<li>[ ] <code>description</code> contains ≥3 quoted user trigger phrases</li>
+<li>[ ] All commands are tested and produce the described output</li>
+<li>[ ] Cross-references use relative paths and the linked skill exists</li>
+<li>[ ] No commands reference flags that don't exist in current <code>winml --help</code></li>
+<li>[ ] Hardware names and EP names match the canonical list in <code>check-model-feasibility</code></li>
+<li>[ ] <code>evals/eval.yaml</code> exists with ≥2 test cases (including at least one negative assertion)</li>
+</ul>
+<hr />
+<h2 id="skill-autoconfig-user-optimize-the-model-automated-loop-manual-framework">Skill: <code>autoconfig</code> (user — optimize the model: automated loop + manual framework)<a class="headerlink" href="#skill-autoconfig-user-optimize-the-model-automated-loop-manual-framework" title="Permanent link">&para;</a></h2>
+<p>The optimize skill. Two modes: <strong>automated</strong> (the autoresearch loop — the bulk of this section) for
+"figure it out for me / run overnight", and <strong>manual</strong> (the decision framework folded in from
+<code>optimize-for-device</code>) for "I'll choose by hand" or when there is no target hardware to benchmark on.</p>
+<h3 id="frontmatter_6">Frontmatter<a class="headerlink" href="#frontmatter_6" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">autoconfig</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when a **WinApp developer** wants the best performance for their model on one or</span>
+<span class="w">  </span><span class="no">more Windows EP/device targets — either by letting winml-cli search automatically, or by working</span>
+<span class="w">  </span><span class="no">through the precision/EP tradeoffs by hand. Automated mode: an autonomous experiment loop that</span>
+<span class="w">  </span><span class="no">proposes config.json hypotheses, runs winml build + eval + perf, evaluates against user-defined</span>
+<span class="w">  </span><span class="no">objectives (accuracy floor, latency budget, or Pareto frontier), and iterates — keeping</span>
+<span class="w">  </span><span class="no">improvements, discarding regressions; covers single-EP optimization, multi-EP parallel search,</span>
+<span class="w">  </span><span class="no">mixed-precision (nodes_to_exclude) exploration, calibration tuning, and manifest.json output.</span>
+<span class="w">  </span><span class="no">Manual mode: the latency-budget vs accuracy-floor decision framework, the FP32→FP16→W8A16→W8A8</span>
+<span class="w">  </span><span class="no">precision ladder, a per-device hardware guidance table, and how to read tradeoff results.</span>
+<span class="w">  </span><span class="no">Use when the user says &quot;find the best config for my model on QNN&quot;, &quot;automate the config search&quot;,</span>
+<span class="w">  </span><span class="no">&quot;generate configs for all EPs&quot;, &quot;I want to leave this running overnight&quot;, &quot;make it faster&quot;,</span>
+<span class="w">  </span><span class="no">&quot;which precision should I use&quot;, &quot;is NPU worth it&quot;, or &quot;compare QNN vs DirectML vs CPU&quot;.</span>
+<span class="w">  </span><span class="no">The report output includes a feasible-options comparison table (top candidates with tradeoffs)</span>
+<span class="w">  </span><span class="no">so the user can choose confidently instead of seeing only one &quot;winner&quot; config.</span>
+
+<span class="nt">audience</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">external (WinApp developers)</span>
+</code></pre></div>
+
+<h3 id="when-to-use_6">When to use<a class="headerlink" href="#when-to-use_6" title="Permanent link">&para;</a></h3>
+<ul>
+<li>"Find the best W8A8 config that keeps accuracy &gt; 0.95 on QNN"</li>
+<li>"Generate optimized configs for QNN + DirectML + CPU and build a manifest"</li>
+<li>"I don't know which quantization settings to use, figure it out for me" / "run overnight"</li>
+<li>"Make it faster" / "which precision should I use" / "is NPU worth it" (→ manual mode)</li>
+<li>"Compare QNN vs DirectML vs CPU for my model"</li>
+<li>User has a latency SLA or accuracy floor but doesn't know how to achieve it</li>
+</ul>
+<h3 id="what-this-skill-does-not-do_1">What this skill does NOT do<a class="headerlink" href="#what-this-skill-does-not-do_1" title="Permanent link">&para;</a></h3>
+<ul>
+<li>It only searches within what <code>winml build</code> currently supports (existing capabilities)</li>
+<li>It does not look for optimization techniques outside winml's current feature set</li>
+<li>It does not suggest that winml needs new features or file bugs</li>
+<li>For finding what winml is <em>missing</em>, use <code>optimization-research</code> instead</li>
+</ul>
+<hr />
+<h3 id="manual-mode-the-decision-framework-folded-in-from-optimize-for-device">Manual mode — the decision framework (folded in from <code>optimize-for-device</code>)<a class="headerlink" href="#manual-mode-the-decision-framework-folded-in-from-optimize-for-device" title="Permanent link">&para;</a></h3>
+<p>Use this lightweight path when the user wants to decide by hand, or has no target hardware to
+benchmark on (so the automated loop's perf gate can't run). It is the conceptual model the
+automated loop below mechanizes.</p>
+<p><strong>1. The decision framework</strong> — start from one of two inputs: a latency budget or an accuracy floor.</p>
+<ul>
+<li>Have a latency SLA (e.g. &lt;50ms)? → find the highest accuracy within that budget</li>
+<li>Have an accuracy floor (e.g. &lt;2% drop)? → find the fastest config within that floor</li>
+</ul>
+<p><strong>2. The precision ladder</strong> — FP32 → FP16 → W8A16 → W8A8, with typical speedup and accuracy-drop
+ranges per model family (Encoder/BERT-like, Vision/ConvNet, Transformer/ViT).</p>
+<p><strong>3. The sweep workflow</strong> — run <code>winml build</code> + <code>winml eval</code> + <code>winml perf</code> for each precision,
+collect into a tradeoff table, apply the decision framework.</p>
+<div class="codehilite"><pre><span></span><code>winml<span class="w"> </span>config<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>--device<span class="w"> </span>&lt;device&gt;<span class="w"> </span>--precision<span class="w"> </span>fp16<span class="w"> </span>-o<span class="w"> </span>config_fp16.json
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_fp16.json<span class="w"> </span>-m<span class="w"> </span>&lt;model&gt;<span class="w"> </span>-o<span class="w"> </span>out_fp16/
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>-m<span class="w"> </span>out_fp16/&lt;artifact&gt;.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model&gt;
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>out_fp16/&lt;artifact&gt;.onnx<span class="w"> </span>--device<span class="w"> </span>&lt;device&gt;<span class="w"> </span>--iterations<span class="w"> </span><span class="m">50</span>
+<span class="c1"># repeat for w8a16, w8a8</span>
+</code></pre></div>
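+<p>Once the sweep results are collected, applying the decision framework is a filter followed by a min or max. The sketch below assumes the tradeoff table has been loaded into simple records by hand; the <code>Candidate</code> type and both function names are hypothetical and not part of winml-cli.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    precision: str   # e.g. "fp16", "w8a16", "w8a8"
    p50_ms: float    # median latency from `winml perf`
    accuracy: float  # task metric or cosine similarity from `winml eval`


def pick_by_latency_budget(rows: list[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Latency SLA given: highest accuracy among rows strictly under the budget."""
    within = [r for r in rows if r.p50_ms < budget_ms]
    return max(within, key=lambda r: r.accuracy, default=None)


def pick_by_accuracy_floor(rows: list[Candidate], floor: float) -> Optional[Candidate]:
    """Accuracy floor given: fastest row whose accuracy is at or above the floor."""
    within = [r for r in rows if r.accuracy >= floor]
    return min(within, key=lambda r: r.p50_ms, default=None)
```

+<p>Both functions return <code>None</code> when no row satisfies the constraint, which is the signal to relax the budget or to try another EP rather than to pick the least-bad row silently.</p>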
+
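+<p><strong>4. Hardware-specific guidance table</strong></p>
+<table>
+<thead>
+<tr>
+<th>Device</th>
+<th>Best EP</th>
+<th>Sweet-spot precision</th>
+<th>Notes</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Snapdragon X Elite NPU</td>
+<td>QNN</td>
+<td>W8A16</td>
+<td>HTP native for W8A16; W8A8 risky for Attention</td>
+</tr>
+<tr>
+<td>Intel Core Ultra NPU</td>
+<td>OpenVINO</td>
+<td>W8A8</td>
+<td>OpenVINO PTQ handles INT8 well</td>
+</tr>
+<tr>
+<td>AMD Ryzen AI NPU</td>
+<td>VitisAI</td>
+<td>W8A8</td>
+<td>Phoenix/Hawk Point prefer INT8</td>
+</tr>
+<tr>
+<td>Any GPU</td>
+<td>DirectML</td>
+<td>FP16</td>
+<td>FP16 sufficient; quantization rarely helps on GPU</td>
+</tr>
+<tr>
+<td>CPU fallback</td>
+<td>CPU</td>
+<td>W8A8</td>
+<td>Size + latency both benefit</td>
+</tr>
+</tbody>
+</table>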
+<p><strong>5. Reading the output</strong> — how to interpret <code>winml eval</code> cosine_similarity / SQNR and
+<code>winml perf</code> p50/p90/p99; what values indicate "acceptable" vs "needs investigation".</p>
+<p>When the user wants this automated instead of done by hand, continue to the autoresearch loop below.</p>
+<hr />
+<h3 id="epistemic-standard-for-autoconfig-findings">Epistemic standard for autoconfig findings<a class="headerlink" href="#epistemic-standard-for-autoconfig-findings" title="Permanent link">&para;</a></h3>
+<p><strong>Any conclusion this skill writes into a report or recommends to a user must meet this bar:</strong></p>
+<table>
+<thead>
+<tr>
+<th>Requirement</th>
+<th>What it means</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><strong>Observation vs explanation</strong></td>
+<td>State what was measured separately from why it happened. "latency increased 270ms" is fact. "because NHWC causes cache thrashing" is a hypothesis — label it as such unless confirmed by profiling.</td>
+</tr>
+<tr>
+<td><strong>Statistical validity</strong></td>
+<td>A latency claim requires ≥ 3 independent runs with warmup. A single <code>winml eval</code> run (no warmup, includes preprocessing) is insufficient to quote as a latency number. It can guide search decisions but not final reports.</td>
+</tr>
+<tr>
+<td><strong>Mechanism confirmation</strong></td>
+<td>Do not explain a regression unless the mechanism is confirmed (e.g., by profiler, by op-level timing, or by <strong>source code inspection of ORT/QNN SDK</strong>). If unknown, write "cause unconfirmed; further profiling needed."</td>
+</tr>
+<tr>
+<td><strong>Scope boundary</strong></td>
+<td>Results measured on one model/EP are never generalized to other models/EPs without explicit qualification. "On ConvNext-tiny CPU" is allowed. "CPU dislikes fusion" is not — it's an overgeneralization.</td>
+</tr>
+<tr>
+<td><strong>Unresolved uncertainty</strong></td>
+<td>If an observation contradicts the expected behavior (e.g., a "disabled" fusion still appears in the output), the report must flag this as an open question, not silently adopt an explanation.</td>
+</tr>
+<tr>
+<td><strong>EP isolation</strong></td>
+<td>A finding on one EP (positive or negative) MUST NOT be applied to prune the search space of a different EP without independent validation. CPU opset regression ≠ QNN NPU opset regression. Always validate per EP independently.</td>
+</tr>
+</tbody>
+</table>
+<p>The skill MUST NOT write confident root-cause explanations in the HTML report or chat summary for regressions where only the measurement is available. Use hedged language: "this likely relates to…", "one hypothesis is…", or simply omit the explanation and recommend profiling.</p>
+<h4 id="perf-gain-validation-protocol">Perf gain validation protocol<a class="headerlink" href="#perf-gain-validation-protocol" title="Permanent link">&para;</a></h4>
+<p>Before <strong>any</strong> perf gain is written into a report, config recommendation, or knowledge base as a confirmed finding, it must pass ALL three gates:</p>
+<p><strong>Gate 1 — Statistical: two-phase bench protocol (from GPU Optimizer V2)</strong></p>
+<div class="codehilite"><pre><span></span><code>Phase A — Quick screen (fast, ~2 min):
+  winml perf -m &lt;model&gt; --ep &lt;ep&gt; --device &lt;device&gt; --warmup 20 --iterations 200 -o screen.json
+  CV = screen.json.std / screen.json.p50
+  IF CV &gt; 0.10 (10%): REJECT — high DVFS variance, measurement unreliable
+                       → cool down 120s, retry once
+                       → if still CV &gt; 0.10: flag as [UNSTABLE], skip candidate
+
+Phase B — Full bench (only if Phase A passes, ~15 min):
+  # 3 independent sessions with 60s cool-down between each
+  winml perf ... --warmup 50 --iterations 1000 -o run1.json
+  sleep 60
+  winml perf ... --warmup 50 --iterations 1000 -o run2.json
+  sleep 60
+  winml perf ... --warmup 50 --iterations 1000 -o run3.json
+
+  # KEEP if ALL of:
+  #   1. p50(run1,2,3) are all faster than baseline p50 × (1 - min_improvement)
+  #   2. CV of each run &lt; 0.10
+  #   3. cosine_similarity ≥ accuracy_floor
+  KEEP_threshold = baseline_p50 × 0.99   # ≥1% improvement required
+</code></pre></div>
+
+<p>Rationale: DVFS on mobile NPUs can cause 2-10x run-to-run variance. The CV check catches this before 15 min is spent on a full bench.</p>
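+<p>The Gate 1 decision logic can be sketched in Python as follows. It assumes the <code>std</code> and <code>p50</code> values have already been read from each perf output file; the function names and parameter defaults are illustrative, with the thresholds copied from the protocol above.</p>

```python
def coefficient_of_variation(std_ms: float, p50_ms: float) -> float:
    """CV as defined in the protocol: std divided by p50."""
    return std_ms / p50_ms


def screen_passes(std_ms: float, p50_ms: float, max_cv: float = 0.10) -> bool:
    """Phase A: reject the measurement only when CV exceeds max_cv."""
    return coefficient_of_variation(std_ms, p50_ms) <= max_cv


def keep_candidate(
    run_p50s: list[float],
    run_cvs: list[float],
    baseline_p50: float,
    cosine: float,
    accuracy_floor: float,
    min_improvement: float = 0.01,
    max_cv: float = 0.10,
) -> bool:
    """Phase B: KEEP only if every run beats the threshold, is stable, and is accurate."""
    threshold = baseline_p50 * (1 - min_improvement)
    all_faster = all(p50 < threshold for p50 in run_p50s)
    all_stable = all(cv < max_cv for cv in run_cvs)
    return all_faster and all_stable and cosine >= accuracy_floor
```

+<p>Requiring every run to clear the threshold, rather than their average, means one slow session is enough to DISCARD, which is the conservative choice when thermal state is not controlled.</p>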
+<p><strong>Gate 2 — Mechanism: read ORT/QNN source code before explaining why</strong></p>
+<ul>
+<li>For QNN EP gains: check <code>onnxruntime/core/providers/qnn/builder/</code> for opset-conditional dispatch</li>
+<li>For CPU EP gains: check <code>onnxruntime/core/optimizer/</code> for pass applicability conditions</li>
+<li>For DML EP gains: check DML operator mapping tables</li>
+<li><strong>Do not publish "opset 21 = 2.3x faster on QNN NPU" without confirming the mechanism in source code.</strong> It may be DVFS bias, not a real architectural difference.</li>
+</ul>
+<p><strong>Gate 3 — Reproducibility: baseline and candidate measured in same thermal state</strong></p>
+<ul>
+<li>Run baseline and candidate back-to-back in the same session, OR</li>
+<li>Use a device-level tool to lock NPU clock frequency</li>
+<li>If you cannot control thermal state, report min_ms (peak-performance ceiling) alongside p50 (typical performance), and flag the variance explicitly.</li>
+</ul>
+<p><strong>Lesson from ConvNext opset sweep (2026-06-10):</strong>
+The initial opset 21 measurement (8.45ms, 50 iters) vs opset 17 (19.4ms) appeared to show a 2.3x gain. A full 17-22 sweep with 50 iters each showed:</p>
+<ul>
+<li>All opsets min ~9-10ms (same peak capability)</li>
+<li>opset 17 p50=54ms, opset 19-22 p50=12ms — but opset 18 p50=43ms (bimodal)</li>
+<li>opset 21 std varied from 10ms (cool device) to 37ms (warm device)</li>
+</ul>
+<p><strong>Conclusion: the data is inconclusive. The gain may be real OR may be a thermal artifact. Gates 1+2 not yet passed.</strong></p>
+<hr />
+<h3 id="design-comparison-gpu-optimizer-v2-vs-winml-autoconfig">Design Comparison: GPU Optimizer V2 vs WinML Autoconfig<a class="headerlink" href="#design-comparison-gpu-optimizer-v2-vs-winml-autoconfig" title="Permanent link">&para;</a></h3>
+<p><strong>Reference</strong>: "Agentic GPU Model Optimization" doc (cheye@, 2026-03-20). GPU Optimizer V2 is a 6-role multi-agent system for cloud GPU inference optimization (ONER-1B KNN service, H100). Autoconfig is a local edge inference optimizer (winml-cli, Snapdragon X). Most of their infrastructure (machine pool, SSH fleet, Triton serving, custom CUDA kernels, SM occupancy tuning) does not apply here. But the agent loop design has several directly adoptable ideas.</p>
+<h4 id="adoptable-insights-from-gpu-optimizer-v2">Adoptable insights from GPU Optimizer V2<a class="headerlink" href="#adoptable-insights-from-gpu-optimizer-v2" title="Permanent link">&para;</a></h4>
+<table>
+<thead>
+<tr>
+<th>V2 design decision</th>
+<th>V2 rationale</th>
+<th>Adopt into autoconfig?</th>
+<th>Notes</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><strong>Two-phase bench: 200-iter quick screen → 3×1000-iter full bench</strong></td>
+<td>"CV&lt;2% gates full bench — avoid wasting time on high-variance results"</td>
+<td>✅ <strong>YES — highest priority gap</strong></td>
+<td>We've been doing single 50-iter runs and calling them facts. CV check would have caught the DVFS noise immediately.</td>
+</tr>
+<tr>
+<td><strong>Verdict policy names (ThroughputOnly, ThroughputOrLatency…)</strong></td>
+<td>"Named policies prevent Reviewer from ad-hoc criteria drift"</td>
+<td>✅ YES (simplified)</td>
+<td>Autoconfig should have explicit KEEP criteria: <code>p50_ms &lt; baseline × (1 - threshold)</code> AND <code>cosine ≥ floor</code></td>
+</tr>
+<tr>
+<td><strong>Append-only experiment_log.md + results.tsv written only by Reviewer</strong></td>
+<td>"Single writer = no drift, full audit trail"</td>
+<td>✅ YES</td>
+<td>Our results.tsv exists, but there is no "single writer" discipline</td>
+</tr>
+<tr>
+<td><strong>Explorer mandatory external-research triggers</strong></td>
+<td>"After 15 consecutive DISCARDs → external research sweep"</td>
+<td>✅ YES — this is the exact gap that caused the opset 21 miss</td>
+<td>If we had this rule, we would have searched ORT source after N DISCARDs and found kMaxSupportedOpset earlier</td>
+</tr>
+<tr>
+<td><strong>Knowledge agent with review gate before KB save</strong></td>
+<td>"Learnings reviewed before they prune future search"</td>
+<td>✅ YES</td>
+<td>ep_knowledge/*.json entries should be marked draft until Gate 2 (mechanism) is confirmed</td>
+</tr>
+<tr>
+<td><strong>Correctness contract locked after Phase 0, never modified</strong></td>
+<td>"Prevents accuracy goal-post moving"</td>
+<td>✅ YES</td>
+<td>We have an accuracy gate but no locked contract file</td>
+</tr>
+<tr>
+<td><strong>30-consecutive-DISCARD stop condition</strong></td>
+<td>"Prevents endless search in exhausted space"</td>
+<td>✅ YES</td>
+<td>autoconfig has no stop condition today</td>
+</tr>
+<tr>
+<td><strong>Per-experiment structured output: Hypothesis → Implementation → Parity → Perf → Analysis → Decision</strong></td>
+<td>"Enables post-analysis and knowledge extraction"</td>
+<td>✅ YES</td>
+<td>autoconfig report is currently holistic, not per-experiment</td>
+</tr>
+<tr>
+<td><strong>Role separation: Profiler / Explorer / Optimizer / Reviewer are separate agents</strong></td>
+<td>"Prevents context drift; each agent stays focused"</td>
+<td>⚠️ Partial</td>
+<td>Full 6-agent split is overkill for a CLI tool, but the Explorer / Reviewer distinction is valuable</td>
+</tr>
+<tr>
+<td><strong>Resource lock: only one GPU job at a time</strong></td>
+<td>"Prevents benchmark interference"</td>
+<td>✅ YES (trivially)</td>
+<td>Already serial, but should be explicitly enforced if autoconfig ever parallelizes</td>
+</tr>
+<tr>
+<td><strong>Machine pool + SSH fleet + Model Registry</strong></td>
+<td>Cloud GPU fleet management</td>
+<td>❌ N/A</td>
+<td>Local device only</td>
+</tr>
+<tr>
+<td><strong>Custom CUDA kernel writing</strong></td>
+<td>"Extreme asymmetry benefits from custom kernels"</td>
+<td>❌ N/A</td>
+<td>CLI-only constraint; no kernel modification</td>
+</tr>
+<tr>
+<td><strong>SM occupancy / GEMM tile count tuning</strong></td>
+<td>"H100 has 132 SMs; 48 output tiles = 36% occupancy"</td>
+<td>❌ N/A</td>
+<td>Edge NPU/GPU, not H100 multi-SM</td>
+</tr>
+<tr>
+<td><strong>FlashAttention / fused QKV</strong></td>
+<td>"Eliminate HBM traffic for attention score matrix"</td>
+<td>❌ N/A</td>
+<td>Model is already trained; deployment-time optimization only</td>
+</tr>
+</tbody>
+</table>
+<h4 id="key-gaps-in-current-autoconfig-design-from-v2-comparison">Key gaps in current autoconfig design (from V2 comparison)<a class="headerlink" href="#key-gaps-in-current-autoconfig-design-from-v2-comparison" title="Permanent link">&para;</a></h4>
+<p><strong>Gap 1 (critical): No two-phase bench protocol</strong>
+The current design runs <code>--iterations 50</code> and accepts the result. V2 runs:</p>
+<ol>
+<li>Quick screen: 200 iters, check CV &lt; 2% (coefficient of variation = std/mean)</li>
+<li>Only if CV &lt; 2%: full bench of 3×1000 iters with a 60s cool-down between sessions</li>
+<li>KEEP only if Δp50 &gt; threshold AND CV(candidate) &lt; 2%</li>
+</ol>
+<p>This directly matches the "iter ≥ 1000" rule we just added. Formalize it as two phases.</p>
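+<p>The CV gate and the KEEP rule above can be sketched in a few lines of Python. This is a minimal illustration, not the V2 implementation: the function names, the use of the sample standard deviation, and the 5% default improvement threshold are assumptions made for the example; only the 2% CV limit comes from the protocol described above.</p>

```python
import statistics


def coefficient_of_variation(latencies_ms):
    """CV = std / mean (sample std is an assumption; the source only says std/mean)."""
    mean = statistics.fmean(latencies_ms)
    if mean <= 0:
        raise ValueError("mean latency must be positive")
    return statistics.stdev(latencies_ms) / mean


def quick_screen_passes(latencies_ms, cv_limit=0.02):
    """Phase 1: only low-variance runs are allowed to proceed to the full bench."""
    return coefficient_of_variation(latencies_ms) < cv_limit


def keep_candidate(baseline_p50, candidate_p50, candidate_cv,
                   threshold=0.05, cv_limit=0.02):
    """KEEP only if the relative p50 gain exceeds the threshold AND CV is low.

    threshold=0.05 is a hypothetical default, not a value from the design doc.
    """
    improvement = (baseline_p50 - candidate_p50) / baseline_p50
    return improvement > threshold and candidate_cv < cv_limit


# Illustrative latency samples (made up for the example, not measurements).
stable_run = [10.0, 10.1, 9.9, 10.0, 10.05]
noisy_run = [9.0, 54.0, 12.0, 43.0, 10.0]

print(quick_screen_passes(stable_run))  # low variance: proceed to full bench
print(quick_screen_passes(noisy_run))   # high variance: re-run after cool-down
```

+<p>Gating on CV before comparing medians means a thermally noisy run is rejected as a measurement rather than recorded as a result.</p>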
+<p><strong>Gap 2 (critical): No mandatory external-research trigger in Explorer</strong>
+The V2 Explorer triggers external research (web search, papers, source code):</p>
+<ul>
+<li>After 15 consecutive DISCARDs</li>
+<li>After every KEEP that changes model/precision</li>
+<li>Before declaring backlog_empty</li>
+</ul>
+<p>We discovered kMaxSupportedOpset only by accident (downloading QNN Hub models). A mandatory "read ORT source after 5 DISCARDs in opset dimension" rule would have found it in Phase 2.</p>
+<p><strong>Gap 3 (important): ep_knowledge/*.json has no draft/confirmed state</strong>
+The V2 Knowledge agent requires a review gate before KB entries are used to prune the search space. Our ep_knowledge findings should carry one of:</p>
+<ul>
+<li><code>status: "draft"</code>: observed, mechanism unconfirmed (Gate 2 not passed)</li>
+<li><code>status: "confirmed"</code>: mechanism confirmed via source code (Gate 2 passed)</li>
+<li><code>status: "deprecated"</code>: finding invalidated by a new experiment or an ORT version change</li>
+</ul>
+<p>Only <code>"confirmed"</code> entries should prune the search space. <code>"draft"</code> entries inform hypothesis priority but do not prune.</p>
+<p><strong>Gap 4 (nice-to-have): No per-experiment structured artifact</strong>
+V2 produces a per-experiment record: Hypothesis / Implementation / Parity / Perf / Analysis / Decision.
+autoconfig produces a single aggregate report.html. It should produce both.</p>
+<h3 id="design-the-autoresearch-loop">Design: The Autoresearch Loop<a class="headerlink" href="#design-the-autoresearch-loop" title="Permanent link">&para;</a></h3>
+<p>Inspired by <a href="https://github.com/karpathy/autoresearch">karpathy/autoresearch</a>:
+the agent modifies a config file, runs a fixed-cost experiment, checks whether the objective improved, keeps or discards the change, and repeats autonomously until it is manually stopped or the convergence criteria are met.</p>
+<div class="codehilite"><pre><span></span><code>OBJECTIVE (user-defined, one of):
+  A. Accuracy-primary:  maximize cosine_similarity  subject to  p50_ms ≤ &lt;budget&gt;
+  B. Latency-primary:   minimize p50_ms             subject to  cosine ≥ &lt;floor&gt;
+  C. Pareto search:     find the full accuracy-latency frontier
+
+SEARCH SPACE — config.json has three sections the agent can modify:
+
+  [export]
+    opset_version          : int   — 17, 18, 19, 20  (higher = newer ops, EP may not support)
+    do_constant_folding    : bool  — may affect graph structure visible to EP
+    dynamic_axes           : dict  — static vs dynamic shapes (QNN prefers static batch=1)
+
+  [optimize]  — full capability list (from winml optimize --list-capabilities)
+
+    GraphPipe (run via ORT SessionOptions):
+      GELU:
+        gelu-fusion            : bool  — fuse tanh-GELU subgraph → Gelu op
+        fast-gelu-fusion       : bool  — fuse fast-GELU (tanh-approx) → FastGelu
+        bias-gelu-fusion       : bool  — fuse Bias+GELU (requires gelu-fusion)
+        quick-gelu-fusion      : bool  — fuse x*sigmoid(1.702x) → FastGelu
+        gelu-approximation     : bool  — convert exact Gelu → FastGelu (requires gelu-fusion)
+      Activation:
+        bias-softmax-fusion    : bool  — fuse Bias+Softmax
+        bias-dropout-fusion    : bool  — fuse Bias+Dropout
+      Convolution:
+        conv-add-fusion        : bool  — fuse Conv+Add (bias)
+        conv-bn-fusion         : bool  — fuse Conv+BatchNorm into weights
+        conv-mul-fusion        : bool  — fuse Conv+Multiply
+        conv-activation-fusion : bool  — fuse Conv+activation (ReLU, Sigmoid, etc.)
+      Elimination:
+        slice-elimination      : bool  — remove redundant Slice ops
+        expand-elimination     : bool  — remove no-op Expand
+        unsqueeze-elimination  : bool  — fold Unsqueeze into initializers
+      GEMM:
+        gemm-activation-fusion : bool  — fuse GEMM+activation
+        gemm-sum-fusion        : bool  — fuse GEMM+Sum
+        gemm-transpose-fusion  : bool  — fuse GEMM+Transpose
+      Graph:
+        concat-slice-elimination   : bool  — remove Concat+Slice that restore originals
+        double-qdq-pairs-remover   : bool  — remove consecutive QDQ pairs
+        constant-folding           : bool  — pre-compute constant exprs (default=True; disable to reduce size)
+      LayerNorm:
+        layer-norm-fusion          : bool  — fuse ReduceMean→Sub→Pow→Sqrt→Div→Mul→Add
+        skip-layer-norm-fusion     : bool  — fuse Add(residual)+LayerNorm → SkipLayerNorm (requires layer-norm-fusion)
+        simplified-layer-norm-fusion : bool — fuse simplified LayerNorm (no mean-centering)
+      Layout:
+        transpose-optimizer        : bool  — eliminate redundant transpose chains
+        nhwc-transformer           : bool  — NCHW→NHWC (GPU memory layout)
+        nchwc-transformer          : bool  — NCHW→NCHWc (CPU SIMD layout)
+        conv-add-activation-fusion : bool  — fuse Conv+Add+Activation → FusedConv
+      MatMul:
+        matmul-add-fusion          : bool  — fuse MatMul+Add → single kernel
+        matmul-activation-fusion   : bool  — fuse MatMul+activation (DML-only, requires matmul-transpose-fusion)
+        matmul-transpose-fusion    : bool  — fuse MatMul+Transpose → FusedMatMul
+        matmul-scale-fusion        : bool  — fuse MatMul+Scale
+        matmul-bn-fusion           : bool  — fuse MatMul+BatchNorm
+        dynamic-quantize-matmul-fusion : bool — dynamic quant for MatMul
+      Misc:
+        gather-slice-to-split-fusion : bool — fuse Gather+Slice → Split
+        gather-to-slice-fusion       : bool — convert Gather to Slice (contiguous idx)
+        pad-fusion                   : bool — fuse Pad with Conv/Pool
+        not-where-fusion             : bool — fuse Not+Where
+
+    FusionPipe (ORT transformer fusions, via FusionOptions):
+      attention-fusion              : bool  — fuse MHA pattern → Attention/MultiHeadAttention
+      layer-norm-fusion             : bool  — (FusionPipe variant, same flag)
+      skip-layer-norm-fusion        : bool  — (FusionPipe variant)
+      simplified-layer-norm-fusion  : bool  — (FusionPipe variant)
+      embed-layer-norm-fusion       : bool  — fuse Embedding+Position+LayerNorm (requires layer-norm-fusion)
+      bias-skip-layer-norm-fusion   : bool  — fuse Bias+SkipLayerNorm (requires skip-layer-norm-fusion)
+      fuse-rmsnorm                  : bool  — fuse RMSNorm → LpNormalization(p=2) [custom, QNN-compatible]
+      packed-qkv-fusion             : bool  — (SD only)
+      packed-kv-fusion              : bool  — (SD only)
+      skip-group-norm-fusion        : bool  — (SD only)
+      bias-add-fusion               : bool  — fuse BiasAdd
+      qordered-matmul               : bool  — (SD only)
+
+    SurgeryPipe (pre-EP graph fixes):
+      clamp-constant-values         : bool  — clamp -inf/+inf constants → [-1e3, 1e3] (prevents QNN quant issues)
+      remove-isnan-in-attention-mask: bool  — remove Softmax→IsNaN→Where guards (use after clamp)
+
+    RewritePipe (pattern-based subgraph rewriting):
+      --enable-{source-slug}-{target-slug}  (run winml optimize --list-rewrites for full list)
+      Examples: --enable-gelu-singlegelu, --enable-matmuladdpattern-reshapegemmreshapepattern
+
+  [quant]
+    precision              : fp16 | w8a16 | w8a8
+    calibration_method     : minmax | entropy | percentile
+    samples                : 64 | 128 | 256 | 512
+    per_channel            : bool
+    symmetric              : bool
+    op_types_to_quantize   : list[str]  — restrict which op types get quantized
+    nodes_to_exclude       : list[str]  — exclude specific named nodes
+
+FIXED:  winml build + winml eval + winml perf  (the experiment harness)
+METRIC: cosine_similarity  (from winml eval --format json)
+        p50_ms             (from winml perf --format json)
+RECORD: results.tsv
+</code></pre></div>
+
+<hr />
+<h3 id="profiler-enhanced-agent-architecture-redesigned">Profiler-Enhanced Agent Architecture (redesigned)<a class="headerlink" href="#profiler-enhanced-agent-architecture-redesigned" title="Permanent link">&para;</a></h3>
+<p><strong>Insight from GPU Optimizer v2 analysis and ConvNext POC:</strong>
+Running the profiler <em>before</em> the search loop would have shown Gemm=57.7% on ConvNext,
+immediately ruling out layout-pass experiments (Transpose is only 2.6%, and Gelu is already fused
+into its canonical op). Profile-first makes the Explorer smarter and the search shorter.</p>
+<p><strong>New 4-phase structure:</strong></p>
+<svg width="700" height="600" viewBox="0 0 700 600" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;max-width:100%;margin:12px 0;">
+  <defs>
+    <marker id="sk-ph" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+    <marker id="sk-ph-up" markerWidth="8" markerHeight="6" refX="1" refY="3" orient="auto">
+      <polygon points="8 0,0 3,8 6" fill="#90caf9"/>
+    </marker>
+  </defs>
+  <!-- PHASE 0 — INTAKE -->
+  <rect x="10" y="0" width="680" height="105" rx="8" fill="#eef1fc" stroke="#7986cb" stroke-width="1.5"/>
+  <rect x="10" y="0" width="680" height="28" rx="8" fill="#3949ab"/>
+  <rect x="10" y="18" width="680" height="10" fill="#3949ab"/>
+  <text x="350" y="20" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">PHASE 0 — INTAKE</text>
+  <text x="28" y="48" font-size="10.5" fill="#1a237e">●  winml inspect  →  validate model is supported</text>
+  <text x="28" y="63" font-size="10.5" fill="#1a237e">●  winml build (baseline config)  →  get model.onnx</text>
+  <text x="28" y="78" font-size="10.5" fill="#1a237e">●  winml eval --mode compare  →  lock FP32 correctness baseline</text>
+  <text x="28" y="93" font-size="10.5" fill="#1a237e">●  winml perf (baseline)  →  establish latency floor</text>
+  <!-- Arrow 0→1 -->
+  <line x1="350" y1="105" x2="350" y2="128" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-ph)"/>
+  <!-- PHASE 1 — PROFILE -->
+  <rect x="10" y="130" width="680" height="125" rx="8" fill="#e3f2fd" stroke="#1976d2" stroke-width="1.5"/>
+  <rect x="10" y="130" width="680" height="28" rx="8" fill="#1976d2"/>
+  <rect x="10" y="148" width="680" height="10" fill="#1976d2"/>
+  <text x="350" y="150" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">PHASE 1 — PROFILE  (runs once, before search)</text>
+  <text x="28" y="176" font-size="10.5" fill="#0d47a1">●  winml perf -m baseline/model.onnx --ep &lt;ep&gt; --profile  →  bottleneck.json</text>
+  <text x="28" y="191" font-size="10.5" fill="#0d47a1">●  Classify bottleneck:  compute (Gemm/Conv/Attention)  vs  layout (Transpose/Reshape)</text>
+  <text x="28" y="206" font-size="10.5" fill="#0d47a1">●  "already_canonical" (fused op type)  →  skip corresponding fusion flag</text>
+  <text x="28" y="221" font-size="10.5" fill="#0d47a1">●  Output: prioritized_hypothesis_queue (ordered by profile evidence)</text>
+  <text x="28" y="242" font-size="10" fill="#1565c0" font-style="italic">headroom_hints: actionable pass recommendations from profile data</text>
+  <!-- Arrow 1→2 -->
+  <line x1="350" y1="255" x2="350" y2="278" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-ph)"/>
+  <!-- PHASE 2 — OPTIMIZATION LOOP -->
+  <rect x="10" y="280" width="680" height="220" rx="8" fill="#e8f5e9" stroke="#388e3c" stroke-width="1.5"/>
+  <rect x="10" y="280" width="680" height="28" rx="8" fill="#388e3c"/>
+  <rect x="10" y="298" width="680" height="10" fill="#388e3c"/>
+  <text x="350" y="300" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">PHASE 2 — PROFILE-GUIDED OPTIMIZATION LOOP</text>
+  <!-- Sub: EXPLORER / OPTIMIZER / REVIEWER -->
+  <rect x="28" y="318" width="170" height="80" rx="6" fill="#fff" stroke="#a5d6a7" stroke-width="1.5"/>
+  <text x="113" y="337" text-anchor="middle" font-size="10.5" font-weight="700" fill="#1b5e20">EXPLORER</text>
+  <text x="113" y="352" text-anchor="middle" font-size="9.5" fill="#2e7d32">Pop next hypothesis</text>
+  <text x="113" y="366" text-anchor="middle" font-size="9.5" fill="#2e7d32">from queue; prune via</text>
+  <text x="113" y="380" text-anchor="middle" font-size="9.5" fill="#2e7d32">bottleneck.json KB rules</text>
+  <line x1="198" y1="358" x2="256" y2="358" stroke="#388e3c" stroke-width="1.5" marker-end="url(#sk-ph)"/>
+  <rect x="258" y="318" width="170" height="80" rx="6" fill="#fff" stroke="#a5d6a7" stroke-width="1.5"/>
+  <text x="343" y="337" text-anchor="middle" font-size="10.5" font-weight="700" fill="#1b5e20">OPTIMIZER</text>
+  <text x="343" y="352" text-anchor="middle" font-size="9.5" fill="#2e7d32">winml build + eval</text>
+  <text x="343" y="366" text-anchor="middle" font-size="9.5" fill="#2e7d32">quick-screen → full bench</text>
+  <text x="343" y="380" text-anchor="middle" font-size="9.5" fill="#2e7d32">→ perf measurement</text>
+  <line x1="428" y1="358" x2="486" y2="358" stroke="#388e3c" stroke-width="1.5" marker-end="url(#sk-ph)"/>
+  <rect x="488" y="318" width="175" height="80" rx="6" fill="#fff" stroke="#a5d6a7" stroke-width="1.5"/>
+  <text x="575" y="337" text-anchor="middle" font-size="10.5" font-weight="700" fill="#1b5e20">REVIEWER</text>
+  <text x="575" y="352" text-anchor="middle" font-size="9.5" fill="#2e7d32">Cross-exp verdict:</text>
+  <text x="575" y="366" text-anchor="middle" font-size="9.5" fill="#2e7d32">keep / discard / plateau</text>
+  <text x="575" y="380" text-anchor="middle" font-size="9.5" fill="#2e7d32">write KB draft entry</text>
+  <!-- Loop arrow -->
+  <path d="M 575,398 L 575,422 L 113,422 L 113,398" stroke="#90caf9" stroke-width="1.5" stroke-dasharray="5,3" fill="none" marker-end="url(#sk-ph-up)"/>
+  <text x="344" y="436" text-anchor="middle" font-size="9.5" fill="#1565c0" font-style="italic">↺  loop until convergence / budget / search_space_exhausted</text>
+  <text x="28" y="460" font-size="9.5" fill="#33691e">Pruning: Gemm&gt;50% → skip layout passes · Transpose&gt;10% → check opset gate · Conv&gt;20% → try nchwc/conv-activation-fusion</text>
+  <text x="28" y="475" font-size="9.5" fill="#33691e">Triggers: 5× DISCARD → ORT source sweep · every KEEP+EP-change → re-read ep_knowledge</text>
+  <!-- Arrow 2→3 -->
+  <line x1="350" y1="500" x2="350" y2="523" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-ph)"/>
+  <!-- PHASE 3 — REPORT -->
+  <rect x="10" y="525" width="680" height="70" rx="8" fill="#fff3e0" stroke="#f57c00" stroke-width="1.5"/>
+  <rect x="10" y="525" width="680" height="28" rx="8" fill="#f57c00"/>
+  <rect x="10" y="543" width="680" height="10" fill="#f57c00"/>
+  <text x="350" y="545" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">PHASE 3 — REPORT</text>
+  <text x="28" y="568" font-size="10.5" fill="#bf360c">config_&lt;ep&gt;_optimal.json  ·  report.html  ·  experiments/&lt;n&gt;/  ·  kb_entry.json (status="draft")</text>
+  <text x="28" y="583" font-size="10" fill="#bf360c" font-style="italic">KB "draft" → "confirmed" only after mechanism confirmed via ORT/QNN source (Gate 2)</text>
+</svg>
+
+<p><strong>ep_knowledge draft/confirmed lifecycle (Gap 3 fix):</strong></p>
+<div class="codehilite"><pre><span></span><code>KB entry states:
+  &quot;draft&quot;     — observed perf delta, mechanism unconfirmed (Gate 2 not passed)
+                Can influence hypothesis PRIORITY but NOT prune search space
+  &quot;confirmed&quot; — mechanism confirmed via ORT/QNN source code (Gate 2 passed)
+                Can prune search space for future runs
+  &quot;deprecated&quot;— finding invalidated by new experiment or stack version change
+                Must NOT influence search space; kept for history only
+
+Transition rules:
+  draft → confirmed:   requires mechanism_confirmed=true + source_citation
+  confirmed → deprecated: requires contradicting experiment OR stack version bump
+  deprecated entries:  kept in JSON with status field, never deleted
+</code></pre></div>
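+<p>The lifecycle above is small enough to express as a guarded state machine. The sketch below is one possible encoding, not the actual ep_knowledge loader: the <code>KBEntry</code> class, the <code>finding</code> field and the helper names are hypothetical, while <code>status</code>, <code>mechanism_confirmed</code> and <code>source_citation</code> follow the transition rules listed above. A direct draft → deprecated transition is not specified in the rules, so this sketch rejects it.</p>

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KBEntry:
    finding: str                           # hypothetical free-text description
    status: str = "draft"                  # "draft" | "confirmed" | "deprecated"
    mechanism_confirmed: bool = False
    source_citation: Optional[str] = None


def can_prune(entry):
    """Only confirmed entries may remove hypotheses from the search space."""
    return entry.status == "confirmed"


def can_prioritize(entry):
    """Draft and confirmed entries may reorder hypotheses; deprecated may not."""
    return entry.status in ("draft", "confirmed")


def confirm(entry, source_citation):
    """draft -> confirmed: requires a source citation for the mechanism."""
    if entry.status != "draft":
        raise ValueError(f"cannot confirm an entry in state {entry.status!r}")
    if not source_citation:
        raise ValueError("confirmation requires a source_citation")
    return replace(entry, status="confirmed", mechanism_confirmed=True,
                   source_citation=source_citation)


def deprecate(entry, *, contradicting_experiment=False, stack_version_bumped=False):
    """confirmed -> deprecated: needs a contradicting experiment or a version bump."""
    if entry.status != "confirmed":
        raise ValueError(f"cannot deprecate an entry in state {entry.status!r}")
    if not (contradicting_experiment or stack_version_bumped):
        raise ValueError("deprecation requires a stated reason")
    # The entry is returned, not deleted: deprecated findings are kept for history.
    return replace(entry, status="deprecated")


draft = KBEntry(finding="example finding (placeholder text)")
confirmed = confirm(draft, source_citation="placeholder citation")
deprecated = deprecate(confirmed, stack_version_bumped=True)
print(draft.status, confirmed.status, deprecated.status)
```

+<p>Returning a new frozen entry on each transition keeps earlier states available for the audit trail instead of mutating them in place.</p>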
+
+<p><strong>Profiler output → Explorer mapping table:</strong></p>
+<table>
+<thead>
+<tr>
+<th>Profile finding</th>
+<th>Explorer action</th>
+<th>Hypothesis skipped</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Gemm &gt; 50%</td>
+<td>Prioritize quant/calib experiments</td>
+<td>All layout-transform passes</td>
+</tr>
+<tr>
+<td>Transpose &lt; 5% (opset=17)</td>
+<td>Transpose Optimizer already working</td>
+<td>transpose-optimizer trials</td>
+</tr>
+<tr>
+<td>op_type "Gelu" present</td>
+<td>Already fused</td>
+<td>gelu-fusion, fast-gelu-fusion</td>
+</tr>
+<tr>
+<td>op_type "LayerNormalization" present</td>
+<td>Already fused</td>
+<td>layer-norm-fusion trials</td>
+</tr>
+<tr>
+<td>Reorder{Input,Output} present (&gt;4%)</td>
+<td>NCHWc already active</td>
+<td>nchwc-transformer trials</td>
+</tr>
+<tr>
+<td>op_type "Attention" present</td>
+<td>MHA already fused</td>
+<td>attention-fusion trials</td>
+</tr>
+<tr>
+<td>QDQ ops &gt; 15%</td>
+<td>Quant overhead high</td>
+<td>Focus on op_types_to_quantize exclusions</td>
+</tr>
+<tr>
+<td>Transpose &gt; 10% + opset ≥ 19</td>
+<td>kMaxSupportedOpset issue</td>
+<td>Flag as [KNOWN_TRADEOFF], lower opset</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Why profile-first matters (validated on ConvNext):</strong></p>
+<p>The ablation study ran 22 experiments over multiple days. Had the profiler run first:</p>
+<ul>
+<li>The profile shows: Gemm=57.7%, Conv=12.6%, Transpose=2.6%, Gelu=8% (already a fused "Gelu" op)</li>
+<li>The Explorer would have immediately skipped <code>gelu-fusion</code>, <code>layer-norm-fusion</code>, <code>transpose-optimizer</code>, and
+  <code>nchwc-transformer</code> (already active via ReorderInput/Output)</li>
+<li>The only candidates left by the profile: <code>matmul-add-fusion</code> (Gemm bottleneck), <code>conv-activation-fusion</code></li>
+<li>This would have reduced 22 experiments to ~6, with the same conclusions</li>
+</ul>
+<p><strong>POC profiler:</strong> <code>C:\tmp\autoconfig-demo\winml_profile.py</code></p>
+<ul>
+<li>Uses ORT <code>enable_profiling=True</code> + <code>end_profiling()</code> (same pattern as AI Studio's profile_file.py)</li>
+<li>CPU EP: parses <code>_kernel_time</code> events from the ORT JSON trace</li>
+<li>Output: <code>bottleneck.json</code> (structured) + <code>bottleneck.txt</code> (human-readable) + raw ORT trace</li>
+<li>ConvNext result: Gemm 57.7%, Conv 12.6%, Transpose 2.6% → confirms baseline is optimal for CPU</li>
+</ul>
+<hr />
+<h3 id="sections_5">Sections<a class="headerlink" href="#sections_5" title="Permanent link">&para;</a></h3>
+<p><strong>1. Phase 0 — Intake + Baseline</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Step 1: verify the model is supported</span>
+winml<span class="w"> </span>inspect<span class="w"> </span>-m<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>--format<span class="w"> </span>json
+
+<span class="c1"># Step 2: baseline build (default config, opset=17)</span>
+winml<span class="w"> </span><span class="nb">export</span><span class="w"> </span>-m<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>-o<span class="w"> </span>baseline/
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_baseline.json<span class="w"> </span>-m<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>-o<span class="w"> </span>baseline_built/
+
+<span class="c1"># Step 3: correctness contract</span>
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>--mode<span class="w"> </span>compare<span class="w"> </span>-m<span class="w"> </span>baseline_built/model.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>--format<span class="w"> </span>json
+<span class="c1"># Expected: cosine=1.0 (FP32 self-comparison)</span>
+
+<span class="c1"># Step 4: baseline perf</span>
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>baseline_built/model.onnx<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span>--warmup<span class="w"> </span><span class="m">10</span><span class="w"> </span>--iterations<span class="w"> </span><span class="m">50</span><span class="w"> </span>--format<span class="w"> </span>json
+<span class="c1"># Record: baseline_p50_ms</span>
+</code></pre></div>
+
+<p>Initialize <code>results.tsv</code> (TSV rather than CSV, because commas in the notes field would break CSV parsing):</p>
+<div class="codehilite"><pre><span></span><code>commit  precision   nodes_excluded  cosine  top1_acc    p50_ms  calibration_samples status  notes
+</code></pre></div>
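+<p>As an illustration of the append-only log, the sketch below writes rows with Python's standard <code>csv</code> module using a tab delimiter. The column list follows the LOG line in step 8 of the optimization loop (which includes <code>top1_acc</code>); the <code>append_result</code> helper and the sample row values are hypothetical and are not taken from <code>autoconfig.py</code>.</p>

```python
import csv
import os
import tempfile

# Column order mirrors the step-8 log line; adjust if the real schema differs.
COLUMNS = ["commit", "precision", "nodes_excluded", "cosine", "top1_acc",
           "p50_ms", "calibration_samples", "status", "notes"]


def append_result(path, row):
    """Append one experiment row; write the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                                delimiter="\t", lineterminator="\n")
        if is_new:
            writer.writeheader()
        writer.writerow(row)  # raises ValueError on unknown column names


# Demo against a temporary file; all values are placeholders.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "results.tsv")
append_result(demo_path, {
    "commit": "abc1234", "precision": "w8a16", "nodes_excluded": "",
    "cosine": 0.99, "top1_acc": 0.80, "p50_ms": 12.0,
    "calibration_samples": 128, "status": "keep",
    "notes": "baseline, minmax calibration",
})
append_result(demo_path, {
    "commit": "abc1235", "precision": "w8a8", "nodes_excluded": "Softmax_3",
    "cosine": 0.90, "top1_acc": 0.70, "p50_ms": 9.5,
    "calibration_samples": 128, "status": "discard",
    "notes": "accuracy below floor",
})

with open(demo_path, encoding="utf-8") as handle:
    demo_lines = handle.read().splitlines()
print(demo_lines[0])
```

+<p>Opening the file in append mode and never rewriting earlier rows is what keeps the log usable as an audit trail.</p>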
+
+<hr />
+<p><strong>2. Phase 1 — Profile (runs once, BEFORE any search experiments)</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Run profiler on baseline model (--profile flag added to winml perf)</span>
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>baseline_built/model.onnx<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span><span class="se">\</span>
+<span class="w">  </span>--warmup<span class="w"> </span><span class="m">5</span><span class="w"> </span>--iterations<span class="w"> </span><span class="m">20</span><span class="w"> </span>--profile<span class="w"> </span>--out<span class="w"> </span>profile_out/<span class="w"> </span>--format<span class="w"> </span>json
+<span class="c1"># Reads: profile_out/bottleneck.json</span>
+<span class="c1"># POC (before --profile ships): python winml_profile.py --model ... --ep ...</span>
+</code></pre></div>
+
+<p>Profiler output drives Explorer hypothesis initialization:</p>
+<div class="codehilite"><pre><span></span><code>READ bottleneck.json:
+  top_bottleneck: &lt;op_type&gt;
+  op_summary: [{op_type, pct}, ...]  (sorted by descending pct)
+  headroom_hints: [...]
+
+BUILD skip_set (passes not worth trying):
+  FOR each op_type in op_summary:
+    IF op_type == &quot;Gelu&quot;:          skip_set.add(gelu-fusion, fast-gelu-fusion)
+    IF op_type == &quot;LayerNormalization&quot;: skip_set.add(layer-norm-fusion)
+    IF op_type == &quot;Attention&quot;:     skip_set.add(attention-fusion)
+    IF &quot;ReorderInput&quot; in op_summary AND pct &gt; 2%:
+                                   skip_set.add(nchwc-transformer)  # already active
+  IF Transpose pct &lt; 5% AND opset=17:
+                                   skip_set.add(transpose-optimizer)  # already working, no gain
+  IF Transpose pct &gt; 10% AND opset &gt;= 19:
+                                   flag as [KNOWN_TRADEOFF]; add to report
+
+BUILD priority_queue (hypotheses in evidence-based order):
+  IF top_bottleneck == &quot;Gemm&quot; OR &quot;MatMul&quot;:
+    queue: [quant_precision, calib_method, calib_samples, matmul_fusions, per_channel]
+  IF top_bottleneck == &quot;Conv&quot;:
+    queue: [nchwc (if not in skip_set), conv_fusions, quant_precision]
+  IF top_bottleneck == &quot;Attention&quot;:
+    queue: [quant_precision, nodes_to_exclude (Attention), calib_method]
+  DEFAULT:
+    queue: [quant_precision, calib_method, calib_samples]
+</code></pre></div>
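+<p>The pseudocode above translates fairly directly into Python. The following is a sketch under stated assumptions: it takes the <code>bottleneck.json</code> fields named above (<code>top_bottleneck</code>, <code>op_summary</code> with <code>op_type</code> and <code>pct</code>), the function names are hypothetical, and the ReorderInput threshold is left as a parameter because the pseudocode uses 2% while the mapping table uses 4%. The sample profile reuses the ConvNext percentages quoted in this document.</p>

```python
def build_skip_set(op_summary, opset, reorder_pct_threshold=2.0):
    """Return (skip_set, flags) derived from the profiler's op_summary."""
    pct = {entry["op_type"]: entry["pct"] for entry in op_summary}
    skip, flags = set(), []

    # A fused op type already in the graph means its fusion pass has nothing to do.
    if "Gelu" in pct:
        skip.update({"gelu-fusion", "fast-gelu-fusion"})
    if "LayerNormalization" in pct:
        skip.add("layer-norm-fusion")
    if "Attention" in pct:
        skip.add("attention-fusion")
    if pct.get("ReorderInput", 0.0) > reorder_pct_threshold:
        skip.add("nchwc-transformer")  # NCHWc layout already active

    transpose_pct = pct.get("Transpose", 0.0)
    if transpose_pct < 5.0 and opset == 17:
        skip.add("transpose-optimizer")  # already working, no gain expected
    if transpose_pct > 10.0 and opset >= 19:
        flags.append("KNOWN_TRADEOFF")
    return skip, flags


def build_priority_queue(top_bottleneck, skip):
    """Order hypothesis families by profile evidence, as in the pseudocode."""
    if top_bottleneck in ("Gemm", "MatMul"):
        return ["quant_precision", "calib_method", "calib_samples",
                "matmul_fusions", "per_channel"]
    if top_bottleneck == "Conv":
        queue = ["nchwc", "conv_fusions", "quant_precision"]
        if "nchwc-transformer" in skip:
            queue.remove("nchwc")
        return queue
    if top_bottleneck == "Attention":
        return ["quant_precision", "nodes_to_exclude", "calib_method"]
    return ["quant_precision", "calib_method", "calib_samples"]


# Sample profile using the ConvNext percentages quoted in this document.
profile = {
    "top_bottleneck": "Gemm",
    "op_summary": [
        {"op_type": "Gemm", "pct": 57.7},
        {"op_type": "Conv", "pct": 12.6},
        {"op_type": "Gelu", "pct": 8.0},
        {"op_type": "Transpose", "pct": 2.6},
    ],
}
skip_set, flags = build_skip_set(profile["op_summary"], opset=17)
queue = build_priority_queue(profile["top_bottleneck"], skip_set)
print(sorted(skip_set), flags, queue)
```

+<p>Keeping the skip set and the queue as two separate pure functions makes it easy to re-run both when the Reviewer requests a re-profile.</p>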
+
+<hr />
+<p><strong>3. Phase 2 — Profile-Guided Optimization Loop (single EP)</strong></p>
+<div class="codehilite"><pre><span></span><code>LOOP FOREVER (until user stops or convergence):
+
+1. EXPLORER: pop next hypothesis from priority_queue
+   - Skip if in skip_set (pruned by profile)
+   - If queue empty → enter Phase 4 (generalization) or stop
+
+2. HYPOTHESIZE: build config.json delta based on hypothesis
+   Hypothesis rules (profile-informed, in priority order):
+   a. If first loop: start with full W8A8/W8A16, all ops quantized
+   b. If cosine &lt; floor: add worst partial_op to nodes_to_exclude (one at a time)
+   c. If cosine ≥ floor but latency &gt; budget: try W8A8 instead of W8A16,
+      or reduce calibration_samples, or add per_channel=true
+   d. If stuck (3 iterations no improvement): try calibration_method change
+      (minmax → entropy → percentile)
+   e. If still stuck: try precision escalation (W8A8 → W8A16 → FP16)
+
+3. MODIFY: write updated config.json
+   Key fields in quant section:
+   {
+     &quot;precision&quot;: &quot;w8a8&quot;,
+     &quot;samples&quot;: 128,
+     &quot;calibration_method&quot;: &quot;minmax&quot;,
+     &quot;nodes_to_exclude&quot;: [&quot;LayerNorm_0&quot;, &quot;Softmax_3&quot;],
+     &quot;per_channel&quot;: false
+   }
+
+4. OPTIMIZER: winml build -c config.json -m &lt;model-id&gt; -o out_&lt;iteration&gt;/
+   If build crashes: log as &quot;crash&quot;, revert config, try different hypothesis
+
+5a. EVAL — quick sanity (cosine proxy, cheap):
+    winml eval --mode compare -m out_&lt;iteration&gt;/artifact.onnx \
+               --model-id &lt;model-id&gt; --format json
+    → cosine_similarity, sqnr_db
+    If cosine &lt; hard_floor (e.g. 0.85): fail-fast, skip step 5b + 6, log as discard
+
+5b. EVAL — task accuracy (real quality gate):
+    winml eval -m out_&lt;iteration&gt;/artifact.onnx \
+               --model-id &lt;model-id&gt; \
+               --task &lt;task&gt;  --device &lt;target&gt; --ep &lt;ep&gt; \
+               --samples 100 --format json
+    → top1_accuracy (image-classification), f1 (text), mAP (detection), etc.
+    This is the authoritative accuracy metric for Reviewer verdict.
+
+    Why cosine alone is not sufficient:
+    - High cosine (0.97) but top-1 drops 5%: logit magnitudes preserved but relative ranking shifted
+    - Low cosine (0.92) but same top-1: relative ranking unchanged despite numeric difference
+    → Only task accuracy tells you whether the model still does its job
+
+6. PERF: winml perf -m out_&lt;iteration&gt;/artifact.onnx \
+         --device &lt;target&gt; --ep &lt;ep&gt; --warmup 10 --iterations 50 --format json
+   → p50_ms, p90_ms
+
+7. REVIEWER: cross-experiment verdict
+   keep    if task_accuracy ≥ accuracy_floor  AND  p50_ms ≤ latency_budget
+   discard if task_accuracy &lt; accuracy_floor  OR   p50_ms &gt; latency_budget
+   crash   if build/eval failed
+
+   Reviewer also checks:
+   - Plateau: 3+ keeps with Δlatency &lt; 2% → likely at local optimum
+   - Profile divergence: if new op_type appears after build, re-profile
+   - Skip_set update: if experiment proves a pass is a no-op, add to skip_set
+   - Accuracy cliff: if task_accuracy drops &gt; 3% in one step → flag, do not cascade
+
+8. LOG to results.tsv:
+   &lt;git-short-hash&gt;  &lt;precision&gt;  &lt;nodes_excluded&gt;  &lt;cosine&gt;  &lt;top1_acc&gt;  &lt;p50_ms&gt;  &lt;samples&gt;  keep/discard/crash  &lt;notes&gt;
+
+9. If keep: advance to next iteration from this config
+   If discard: revert to last kept config, try different hypothesis
+</code></pre></div>
+
+<p><strong>Convergence criteria</strong> (stop the loop when any one holds):</p>
+<ul>
+<li>cosine ≥ target floor AND p50_ms ≤ latency budget: objective achieved</li>
+<li>5 consecutive discards with no improvement: report best so far</li>
+<li>User manually stops the agent</li>
+</ul>
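+<p>The Reviewer verdict from step 7 and the discard-based stop rule can be sketched as plain functions. This is an illustrative reading of the rules, not the code in <code>autoconfig.py</code>: the function names are hypothetical, and the plateau check interprets "3+ keeps with Δlatency &lt; 2%" as the last three kept p50 values lying within 2% of each other, which is only one possible interpretation.</p>

```python
def verdict(task_accuracy, p50_ms, accuracy_floor, latency_budget_ms, build_ok=True):
    """keep / discard / crash, following the step-7 Reviewer rules."""
    if not build_ok:
        return "crash"
    if task_accuracy >= accuracy_floor and p50_ms <= latency_budget_ms:
        return "keep"
    return "discard"


def is_plateau(kept_p50_ms, window=3, max_relative_spread=0.02):
    """True when the last `window` kept results differ by less than 2% (assumed reading)."""
    if len(kept_p50_ms) < window:
        return False
    recent = kept_p50_ms[-window:]
    return (max(recent) - min(recent)) / max(recent) < max_relative_spread


def trailing_discards(statuses):
    """Count consecutive 'discard' verdicts at the end of the history."""
    count = 0
    for status in reversed(statuses):
        if status != "discard":
            break
        count += 1
    return count


def should_stop(statuses, max_consecutive_discards=5):
    """Stop rule from the convergence criteria: 5 consecutive discards."""
    return trailing_discards(statuses) >= max_consecutive_discards


# Placeholder numbers for illustration only.
history = ["keep", "discard", "discard", "discard", "discard", "discard"]
print(verdict(0.80, 10.0, accuracy_floor=0.78, latency_budget_ms=12.0))
print(should_stop(history))
```

+<p>Treating a failed build or eval as its own <code>crash</code> verdict keeps it out of the discard streak, so infrastructure failures do not end the search early.</p>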
+<hr />
+<p><strong>3. Hypothesis generation rules (the intelligence layer)</strong></p>
+<p>The agent generates hypotheses by traversing the search space in priority order.
+Each hypothesis is motivated by diagnostic data from the previous experiment, not random search.</p>
+<p><strong>Priority ordering across the three config sections:</strong></p>
+<div class="codehilite"><pre><span></span><code>Phase 1 — establish baseline (iteration 0)
+  Start with: opset_version=17, all fusions enabled, precision=w8a16, minmax, 128 samples
+
+Phase 2 — precision first (fastest to try, most impact)
+  If cosine &lt; floor:
+    w8a16 → try w8a8 with selective exclusions, or w8a16 first
+  If latency &gt; budget:
+    w8a16 → try w8a8 (smaller model, faster inference)
+    fp16  → try w8a16 (if currently at fp16)
+
+Phase 3 — calibration tuning (if precision is right but cosine still low)
+  Try in order: minmax → entropy → percentile
+  Try increasing samples: 128 → 256 → 512
+  Try per_channel=true (better accuracy, slightly slower build)
+  Try symmetric=false if currently true
+
+Phase 4 — optimize pass tuning (independent of quant, affects graph structure)
+  Hypothesis: some fusion patterns create op shapes QNN handles poorly
+  Transformer models (try in order):
+    attention-fusion → skip-layer-norm-fusion → layer-norm-fusion → fuse-rmsnorm
+  Vision models (try in order):
+    conv-bn-fusion → conv-add-fusion → conv-activation-fusion
+  Shared (try if cosine drops or build crashes):
+    constant-folding=false  (prevents size bloat; sometimes exposes EP-incompatible shape)
+    clamp-constant-values=true  (fixes -inf attention mask → quantization issues)
+    remove-isnan-in-attention-mask=true  (use after clamp; cleans dead IsNaN guards)
+  Try opset_version: 17 → 18 → 19
+    (Higher opsets expose newer op types that may have better EP support)
+
+Phase 5 — selective node exclusion (when analyze shows partial ops)
+  Read winml analyze --format json → partial_ops list
+  Exclude one partial_op at a time (greedy: exclude highest-impact first)
+  Also try excluding op_types_to_quantize selectively
+    e.g., remove &quot;LayerNorm&quot; from op_types_to_quantize list
+
+Phase 6 — combined search (if single-dimension changes are stuck)
+  Try combinations of best Phase 3 + Phase 4 + Phase 5 changes together
+</code></pre></div>
+
+<p><strong>Diagnosis table — what to try given what you see:</strong></p>
+<table>
+<thead>
+<tr>
+<th>Symptom</th>
+<th>Likely cause</th>
+<th>Phase to try next</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>cosine drops a lot at quant stage, all ops supported</td>
+<td>Calibration data mismatch</td>
+<td>Phase 3: entropy calib, more samples</td>
+</tr>
+<tr>
+<td>cosine drops at quant, Attention ops partial</td>
+<td>Attention activation quant on QNN</td>
+<td>Phase 5: exclude Attention nodes</td>
+</tr>
+<tr>
+<td>cosine OK but latency worse than CPU</td>
+<td>Fusion pattern creating unoptimized subgraph</td>
+<td>Phase 4: disable attention-fusion, try different opset</td>
+</tr>
+<tr>
+<td>cosine OK but model larger than expected</td>
+<td>Constant folding inlining large weights</td>
+<td>Phase 4: constant-folding=false</td>
+</tr>
+<tr>
+<td>Both cosine and latency good at w8a8 but build crashes</td>
+<td>opset op not supported by quant pipeline</td>
+<td>Phase 4: opset_version 17 → 16</td>
+</tr>
+<tr>
+<td>cosine highly variable across seeds</td>
+<td>Calibration with too few samples</td>
+<td>Phase 3: 128 → 256 samples</td>
+</tr>
+<tr>
+<td>All ops supported, cosine still drops after fusions</td>
+<td>Fusion creates non-quantizable shape</td>
+<td>Phase 4: disable skip-layer-norm-fusion</td>
+</tr>
+<tr>
+<td>QNN build fails with "invalid scale"</td>
+<td>-inf in attention mask initializer</td>
+<td>Phase 4: clamp-constant-values=true</td>
+</tr>
+<tr>
+<td>Vision model: accuracy drops unexpectedly</td>
+<td>Conv+BN fusion slightly changes weight values</td>
+<td>Phase 4: disable conv-bn-fusion</td>
+</tr>
+<tr>
+<td>MatMul-heavy model: latency not improving</td>
+<td>MatMul not being fused</td>
+<td>Phase 4: matmul-add-fusion, matmul-transpose-fusion</td>
+</tr>
+<tr>
+<td>RMSNorm model (Llama etc.) poor QNN perf</td>
+<td>ORT not recognizing RMSNorm pattern</td>
+<td>Phase 4: fuse-rmsnorm=true</td>
+</tr>
+</tbody>
+</table>
+<p>This is the key difference from grid search: <strong>each hypothesis is motivated by diagnostic data from <code>winml analyze</code> and the previous experiment result</strong>.</p>
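+<p>One way to make the diagnosis table machine-readable is a plain lookup from a symptom key to its likely cause and next step. The symptom keys and the <code>next_step</code> helper below are hypothetical; the cause and action strings are copied from a few rows of the table.</p>

```python
from typing import Optional, Tuple

# Hypothetical symptom keys mapped to (likely cause, phase to try next) from the table.
DIAGNOSIS = {
    "cosine_drop_all_ops_supported": (
        "Calibration data mismatch", "Phase 3: entropy calib, more samples"),
    "cosine_drop_attention_partial": (
        "Attention activation quant on QNN", "Phase 5: exclude Attention nodes"),
    "latency_worse_than_cpu": (
        "Fusion pattern creating unoptimized subgraph",
        "Phase 4: disable attention-fusion, try different opset"),
    "model_larger_than_expected": (
        "Constant folding inlining large weights", "Phase 4: constant-folding=false"),
    "qnn_invalid_scale": (
        "-inf in attention mask initializer", "Phase 4: clamp-constant-values=true"),
}


def next_step(symptom: str) -> Optional[Tuple[str, str]]:
    """Return (likely cause, next phase) for a known symptom, else None."""
    return DIAGNOSIS.get(symptom)


print(next_step("qnn_invalid_scale"))
```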
+<hr />
+<p><strong>4. Multi-EP config generation</strong></p>
+<p>Run parallel loops for each target EP, then aggregate into <code>manifest.json</code>:</p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Agent runs loops for each EP (can be sequential or parallel):</span>
+<span class="c1"># Loop 1: ep=qnn,   target_device=npu</span>
+<span class="c1"># Loop 2: ep=dml,   target_device=gpu</span>
+<span class="c1"># Loop 3: ep=cpu,   target_device=cpu</span>
+
+<span class="c1"># After all loops complete, agent generates:</span>
+<span class="c1"># - config_qnn_optimal.json   (best config found for QNN)</span>
+<span class="c1"># - config_dml_optimal.json   (best config found for DirectML)</span>
+<span class="c1"># - config_cpu_optimal.json   (best config found for CPU)</span>
+
+<span class="c1"># Then builds final artifacts and assembles manifest.json</span>
+</code></pre></div>
+
+<p>Generated <code>manifest.json</code> includes experiment provenance:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;model_id&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;microsoft/resnet-50&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;generated_by&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;autoconfig&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;experiments_run&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">34</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;variants&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
+<span class="w">    </span><span class="p">{</span>
+<span class="w">      </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;npu&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_qnn.onnx&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a16&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;nodes_excluded&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;MultiHeadAttention&quot;</span><span class="p">],</span>
+<span class="w">      </span><span class="nt">&quot;cosine_similarity&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.972</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;p50_ms&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">18.3</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;config&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;config_qnn_optimal.json&quot;</span>
+<span class="w">    </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span>
+<span class="w">      </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;dml&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;gpu&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_dml.onnx&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;fp16&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;nodes_excluded&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[],</span>
+<span class="w">      </span><span class="nt">&quot;cosine_similarity&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.999</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;p50_ms&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">22.1</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;config&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;config_dml_optimal.json&quot;</span>
+<span class="w">    </span><span class="p">},</span>
+<span class="w">    </span><span class="p">{</span>
+<span class="w">      </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;file&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model_cpu.onnx&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a8&quot;</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;nodes_excluded&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;LayerNorm&quot;</span><span class="p">],</span>
+<span class="w">      </span><span class="nt">&quot;cosine_similarity&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.931</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;p50_ms&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">84.7</span><span class="p">,</span>
+<span class="w">      </span><span class="nt">&quot;config&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;config_cpu_optimal.json&quot;</span>
+<span class="w">    </span><span class="p">}</span>
+<span class="w">  </span><span class="p">],</span>
+<span class="w">  </span><span class="nt">&quot;selection_order&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;dml&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;cpu&quot;</span><span class="p">]</span>
+<span class="p">}</span>
+</code></pre></div>
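+<p>A hedged sketch of how the per-EP results might be aggregated into the manifest shown above. <code>build_manifest</code> and <code>pick_variant</code> are hypothetical helpers, and reading <code>selection_order</code> as "use the first listed EP that is available on the machine" is an assumption; the field names are taken from the example manifest.</p>

```python
import json


def build_manifest(model_id, variants, selection_order, experiments_run):
    """Assemble a manifest dict with variants sorted by selection_order."""
    by_ep = {v["ep"]: v for v in variants}
    missing = [ep for ep in selection_order if ep not in by_ep]
    if missing:
        raise ValueError(f"no variant for EPs: {missing}")
    return {
        "model_id": model_id,
        "generated_by": "autoconfig",
        "experiments_run": experiments_run,
        "variants": [by_ep[ep] for ep in selection_order],
        "selection_order": list(selection_order),
    }


def pick_variant(manifest, available_eps):
    """Return the first variant in selection_order whose EP is available (assumed semantics)."""
    by_ep = {v["ep"]: v for v in manifest["variants"]}
    for ep in manifest["selection_order"]:
        if ep in available_eps:
            return by_ep[ep]
    return None


variants = [
    {"ep": "cpu", "device": "cpu", "file": "model_cpu.onnx", "p50_ms": 84.7},
    {"ep": "qnn", "device": "npu", "file": "model_qnn.onnx", "p50_ms": 18.3},
    {"ep": "dml", "device": "gpu", "file": "model_dml.onnx", "p50_ms": 22.1},
]
manifest = build_manifest("microsoft/resnet-50", variants, ["qnn", "dml", "cpu"], 34)
print(json.dumps(manifest["selection_order"]))
```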
+
+<hr />
+<p><strong>5. results.tsv format</strong></p>
+<p>Track all three config sections per experiment (TSV, not CSV):</p>
+<div class="codehilite"><pre><span></span><code>commit  opset   fusions_disabled    precision   nodes_excluded  cosine  p50_ms  calib_samples   calib_method    status  notes
+baseline    17  []  fp32    []  1.000   —   —   —   keep    FP32 reference
+a1b2c3d 17  []  w8a8    []  0.871   16.2    128 minmax  discard full W8A8 too aggressive
+b2c3d4e 17  []  w8a16   []  0.967   19.8    128 minmax  keep    W8A16 baseline meets floor
+c3d4e5f 17  []  w8a16   []  0.969   19.1    256 entropy keep    entropy calib improvement
+d4e5f6g 17  [attention-fusion]  w8a16   []  0.971   18.4    256 entropy keep    disabling attn-fusion helps latency
+e5f6g7h 18  [attention-fusion]  w8a16   []  0.973   17.9    256 entropy keep    opset18 best so far
+f6g7h8i 18  [attention-fusion]  w8a8    [MultiHeadAttention]    0.961   14.2    256 entropy keep    mixed prec: meet latency budget
+</code></pre></div>
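+<p>Because the log is tab-separated, it can be read with the standard <code>csv</code> module. The sketch below picks the fastest kept row that meets a cosine floor; <code>best_kept_row</code> is a hypothetical helper, the sample rows are a subset of the table above, and skipping rows whose numeric fields are <code>—</code> is an assumption about how the baseline row should be handled.</p>

```python
import csv
import io

HEADER = ["commit", "opset", "fusions_disabled", "precision", "nodes_excluded",
          "cosine", "p50_ms", "calib_samples", "calib_method", "status", "notes"]
ROWS = [
    ["baseline", "17", "[]", "fp32", "[]", "1.000", "—", "—", "—", "keep", "FP32 reference"],
    ["a1b2c3d", "17", "[]", "w8a8", "[]", "0.871", "16.2", "128", "minmax", "discard",
     "full W8A8 too aggressive"],
    ["b2c3d4e", "17", "[]", "w8a16", "[]", "0.967", "19.8", "128", "minmax", "keep",
     "W8A16 baseline meets floor"],
    ["f6g7h8i", "18", "[attention-fusion]", "w8a8", "[MultiHeadAttention]", "0.961", "14.2",
     "256", "entropy", "keep", "mixed prec: meet latency budget"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [HEADER] + ROWS)


def best_kept_row(tsv_text, cosine_floor):
    """Return the kept row with the lowest p50_ms whose cosine meets the floor."""
    best = None
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["status"] != "keep":
            continue
        try:
            cosine, p50 = float(row["cosine"]), float(row["p50_ms"])
        except ValueError:
            continue  # e.g. the FP32 baseline row has no latency measurement
        if cosine >= cosine_floor and (best is None or p50 < float(best["p50_ms"])):
            best = row
    return best


print(best_kept_row(SAMPLE_TSV, cosine_floor=0.95)["commit"])
```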
+
+<hr />
+<p><strong>6. Skill outputs</strong></p>
+<p>autoconfig produces <strong>two primary outputs</strong> after convergence or user stop:</p>
+<h4 id="output-a-best-config-file">Output A: Best config file<a class="headerlink" href="#output-a-best-config-file" title="Permanent link">&para;</a></h4>
+<p><code>config_&lt;ep&gt;_optimal.json</code> — the winning config.json, ready to pass to <code>winml build</code>. Contains provenance metadata so it's reproducible:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;_autoconfig_meta&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
+<span class="w">    </span><span class="nt">&quot;model_id&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;facebook/convnext-tiny-224&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;objective&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;latency-primary&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;latency_budget_ms&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">20</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;accuracy_floor&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.95</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;experiments_run&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">23</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;best_iter&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;iter_17&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;timestamp&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;2026-06-10T11:55:05+08:00&quot;</span>
+<span class="w">  </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;export&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;opset_version&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">18</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;optimize&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;attention-fusion&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;quantize&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
+<span class="w">    </span><span class="nt">&quot;precision&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;w8a16&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;calibration_method&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;entropy&quot;</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;calibration_samples&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">256</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;nodes_to_exclude&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;MultiHeadAttention_0&quot;</span><span class="p">]</span>
+<span class="w">  </span><span class="p">}</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<h4 id="output-b-html-benchmark-report">Output B: HTML benchmark report<a class="headerlink" href="#output-b-html-benchmark-report" title="Permanent link">&para;</a></h4>
+<p><code>report.html</code> — self-contained single-file report (no external dependencies), viewable in any browser. Contains:</p>
+<p><strong>Section 1 — Summary card</strong></p>
+<div class="codehilite"><pre><span></span><code>Model:    facebook/convnext-tiny-224     EP: QNN (NPU)
+Objective: latency-primary ≤ 20ms       Accuracy floor: 0.95
+Result:   ✅ FOUND                       Experiments: 23  Time: 41 min
+
+Best config:  W8A16, entropy calib, 256 samples
+  Accuracy:   0.953  (floor 0.95 ✓)
+  p50 latency: 15.8ms  (budget 20ms ✓)
+</code></pre></div>
+
+<p><strong>Section 2 — Search progress chart</strong>
+Scatter plot: all 23 experiments, x=p50_latency_ms, y=accuracy.</p>
+<ul>
+<li>Green dot = kept (improvement)</li>
+<li>Red dot = discarded (regression)</li>
+<li>Star = best found</li>
+<li>Hover tooltip: iter ID, config diff vs previous</li>
+</ul>
+<p><strong>Section 3 — Iteration table</strong>
+Full results.tsv rendered as sortable HTML table with columns:</p>
+<div class="codehilite"><pre><span></span><code>iter | opset | precision | nodes_excluded | calib | accuracy | p50_ms | Δacc | Δlatency | status | hypothesis
+</code></pre></div>
+
+<p>Color-coded rows: green = keep, red = discard, gold = best.</p>
+<p><strong>Section 4 — Config diff timeline</strong>
+Visual diff showing what changed between each kept iteration (config deltas as <code>+</code>/<code>-</code> lines).</p>
+<p><strong>Section 5 — Model graph analysis</strong> (from pre-search <code>winml analyze</code>)</p>
+<ul>
+<li>Op distribution pie chart (ONNX vs com.microsoft)</li>
+<li>EP compatibility table: ops supported/unsupported on target EP</li>
+<li>Detected patterns (GELU variant, attention structure, Transpose-sandwich)</li>
+</ul>
+<p><strong>Section 6 — Benchmark details</strong>
+For the best config, full <code>winml perf</code> output:</p>
+<ul>
+<li>p10/p50/p90/p99 latency histogram</li>
+<li>Throughput (samples/sec)</li>
+<li>Warmup vs steady-state comparison</li>
+<li>(If multi-EP: side-by-side EP comparison bar chart)</li>
+</ul>
+<p><strong>Section 7 — Reproduction instructions</strong></p>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Reproduce the winning config:</span>
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config_qnn_optimal.json<span class="w"> </span>-m<span class="w"> </span>facebook/convnext-tiny-224<span class="w"> </span>-o<span class="w"> </span>out/
+<span class="c1"># For NPU: always compile after build (empirically ~1.7× speedup on ConvNext W8A16)</span>
+winml<span class="w"> </span>compile<span class="w"> </span>-m<span class="w"> </span>out/model.onnx<span class="w"> </span>--device<span class="w"> </span>npu<span class="w"> </span>--ep<span class="w"> </span>qnn<span class="w"> </span>-o<span class="w"> </span>out_compiled/
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>out_compiled/model_npu_ctx.onnx<span class="w"> </span>--ep<span class="w"> </span>qnn<span class="w"> </span>--iterations<span class="w"> </span><span class="m">100</span><span class="w"> </span>--warmup<span class="w"> </span><span class="m">10</span>
+</code></pre></div>
+
+<p><strong>Section 8 — Feasible options comparison table</strong>
+The report includes a compact decision table of top feasible candidates (not just the winner), with columns:
+<code>config</code>, <code>accuracy</code>, <code>p50</code>, <code>headroom_to_budget</code>, <code>tradeoff_note</code>, and <code>recommended_when</code>. This helps users choose a robust fallback plan when constraints or hardware change.</p>
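+<p><strong>Report generation approach</strong>: The agent generates report.html using inline Python with Jinja2-style string templating and embedded Chart.js. Chart.js must be inlined rather than loaded from a CDN, because a CDN reference is an external dependency and would break offline viewing. The result is a single file with no external dependencies that opens offline.</p>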
+<hr />
+<p><strong>7. What the agent says in chat</strong></p>
+<p>After convergence or user stop, the agent prints a short terminal summary; the report is the real deliverable:</p>
+<div class="codehilite"><pre><span></span><code>autoconfig completed. 23 experiments run (41 min).
+
+Best config (QNN NPU):
+  W8A16, entropy calib, 256 samples, MultiHeadAttention excluded
+  accuracy 0.953 ✓ (floor 0.95)   p50 15.8ms ✓ (budget 20ms)
+
+Outputs:
+  config_qnn_optimal.json   ← drop into winml build -c
+  report.html               ← open in browser for full benchmark breakdown
+
+Next: winml validate-before-ship for production gate.
+</code></pre></div>
+
+<hr />
+<p><strong>8. Constraints and failure handling</strong></p>
+<ul>
+<li><strong>Build timeout</strong>: If <code>winml build</code> exceeds 15 minutes, kill and log as crash</li>
+<li><strong>OOM</strong>: If build fails with out-of-memory, reduce <code>calibration_samples</code> by half</li>
+<li><strong>All hypotheses exhausted</strong>: Report best config found, note convergence limit</li>
+<li><strong>Latency not measurable</strong> (target EP not on machine): run eval only, skip perf gate</li>
+</ul>
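+<p>The timeout and OOM rules can be sketched with the standard library. <code>run_with_timeout</code> and <code>halve_calibration_samples</code> are hypothetical helpers, and the status labels are assumptions. In real use the command list would be the <code>winml build</code> invocation with a 900-second limit; the demo below uses short Python child processes so that it runs anywhere.</p>

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'crash' on timeout, 'ok' on exit code 0, else 'failed'."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "crash"
    return "ok" if proc.returncode == 0 else "failed"


def halve_calibration_samples(samples, minimum=1):
    """OOM rule: retry with half the calibration samples, never below `minimum`."""
    return max(minimum, samples // 2)


# Stand-in commands for the demo; not real winml invocations.
print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5))
print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
print(halve_calibration_samples(256))
```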
+<p><strong>9. CLI-only constraint (critical)</strong></p>
+<p>The agent MUST use only official <code>winml</code> CLI commands as its tool surface. No Python scripting, no direct ONNX manipulation, no third-party tools (onnxconverter-common, onnxsim, Olive, etc.) except where explicitly documented as a known workaround.</p>
+<p><strong>Rationale</strong>: autoconfig's output is a <code>config.json</code> + <code>report.html</code> that a user can reproduce with <code>winml build -c config.json</code>. If the agent used a Python hack to produce a model artifact, the config is not reproducible and the report is misleading.</p>
+<p><strong>Known workarounds (allowed, must be flagged in report):</strong></p>
+<table>
+<thead>
+<tr>
+<th>Workaround</th>
+<th>Replaces</th>
+<th>Tracking issue</th>
+<th>Required flag in report</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>python winml_profile.py</code></td>
+<td><code>winml perf --profile</code> (not yet shipped)</td>
+<td>pending</td>
+<td>⚠️ "Profile data via POC script, not official API"</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Gap reporting rule</strong>: If a hypothesis cannot be tested because the required <code>winml</code> CLI capability does not exist, the agent MUST:</p>
+<ol>
+<li>Record the hypothesis as <code>SKIPPED — CLI gap</code> in the experiment table</li>
+<li>Add an entry to <strong>Section 6 "Gaps &amp; Issues"</strong> block in <code>report.html</code>:
+   <code>GAP: &lt;hypothesis&gt; requires &lt;missing capability&gt;
+   Impact: &lt;what speedup/accuracy improvement was not measurable&gt;
+   Filed: &lt;issue URL or "not yet filed"&gt;</code></li>
+<li>NOT silently substitute a Python workaround that produces unverifiable artifacts</li>
+</ol>
+<p><strong>Example gaps encountered during ConvNext QNN GPU validation:</strong></p>
+<ul>
+<li><code>winml build --precision fp16</code> flag not available (#867) → FP16 native export untested → <code>SKIPPED — CLI gap</code></li>
+<li><code>winml perf --ep-option</code> not available (#865) → runtime flag sweep untested → <code>SKIPPED — CLI gap</code></li>
+<li><code>winml perf --profile</code> for QNN EP not available → profiling via POC script (allowed workaround)</li>
+<li>W8A8 QDQ ONNX on QNN GPU EP hangs indefinitely — root cause is QNN SDK behavior; <code>winml build</code> already prevents this via <code>_patch_device()</code>; fast-fail enhancement filed as #868 (low priority)</li>
+</ul>
+<hr />
+<h3 id="key-commands-used">Key commands used<a class="headerlink" href="#key-commands-used" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="c1"># Phase 1: profiling (--profile flag on winml perf, before search)</span>
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>baseline_built/model.onnx<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span>--warmup<span class="w"> </span><span class="m">5</span><span class="w"> </span>--iterations<span class="w"> </span><span class="m">20</span><span class="w"> </span><span class="se">\</span>
+<span class="w">  </span>--profile<span class="w"> </span>--out<span class="w"> </span>profile_out/<span class="w"> </span>--format<span class="w"> </span>json
+<span class="c1"># → profile_out/bottleneck.json  (machine-readable for Explorer)</span>
+<span class="c1"># → profile_out/bottleneck.txt   (human-readable summary)</span>
+<span class="c1"># POC: python winml_profile.py --model ... --ep ... (until --profile ships)</span>
+
+<span class="c1"># Phase 2: analysis (informs nodes_to_exclude hypotheses)</span>
+winml<span class="w"> </span>analyze<span class="w"> </span>-m<span class="w"> </span>&lt;exported&gt;.onnx<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span>--format<span class="w"> </span>json
+
+<span class="c1"># Phase 2: experiment</span>
+winml<span class="w"> </span>build<span class="w"> </span>-c<span class="w"> </span>config.json<span class="w"> </span>-m<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>-o<span class="w"> </span>out_&lt;n&gt;/
+
+<span class="c1"># Phase 2: metrics</span>
+winml<span class="w"> </span><span class="nb">eval</span><span class="w"> </span>--mode<span class="w"> </span>compare<span class="w"> </span>-m<span class="w"> </span>out_&lt;n&gt;/artifact.onnx<span class="w"> </span>--model-id<span class="w"> </span>&lt;model-id&gt;<span class="w"> </span>--format<span class="w"> </span>json
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>out_&lt;n&gt;/artifact.onnx<span class="w"> </span>--device<span class="w"> </span>&lt;target&gt;<span class="w"> </span>--ep<span class="w"> </span>&lt;ep&gt;<span class="w"> </span>--iterations<span class="w"> </span><span class="m">50</span><span class="w"> </span>--format<span class="w"> </span>json
+
+<span class="c1"># Phase 3: compile best candidate to QNN EPContext (NPU only)</span>
+<span class="c1"># Eliminates JIT overhead; empirically ~1.7× further speedup on ConvNext W8A16</span>
+winml<span class="w"> </span>compile<span class="w"> </span>-m<span class="w"> </span>best_candidate/model.onnx<span class="w"> </span>--device<span class="w"> </span>npu<span class="w"> </span>--ep<span class="w"> </span>qnn<span class="w"> </span>-o<span class="w"> </span>best_compiled/
+<span class="c1"># → best_compiled/model_npu_ctx.onnx  (loads context binary at runtime)</span>
+<span class="c1"># → best_compiled/model_npu_ctx_qnn.bin  (QNN hardware-compiled graph)</span>
+
+<span class="c1"># Phase 3: re-benchmark compiled model</span>
+winml<span class="w"> </span>perf<span class="w"> </span>-m<span class="w"> </span>best_compiled/model_npu_ctx.onnx<span class="w"> </span>--device<span class="w"> </span>npu<span class="w"> </span>--ep<span class="w"> </span>qnn<span class="w"> </span>--warmup<span class="w"> </span><span class="m">10</span><span class="w"> </span>--iterations<span class="w"> </span><span class="m">50</span>
+</code></pre></div>
+
+<p><strong>Empirical data: ConvNext QNN NPU compile impact</strong></p>
+<table>
+<thead>
+<tr>
+<th>Version</th>
+<th>p50</th>
+<th>vs FP32 NPU</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>FP32 baseline</td>
+<td>19.39ms</td>
+<td>—</td>
+</tr>
+<tr>
+<td>W8A16 quantized</td>
+<td>10.29ms</td>
+<td>1.9×</td>
+</tr>
+<tr>
+<td><strong>W8A16 + compile</strong></td>
+<td><strong>6.01ms</strong></td>
+<td><strong>3.2×</strong></td>
+</tr>
+</tbody>
+</table>
+<p>→ <code>winml compile</code> alone adds ~1.7× on top of quantization. Always compile for NPU deployment.</p>
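+<p>The ratios in the compile-impact data can be rechecked directly from the three p50 values. The snippet is only arithmetic on the numbers reported above; the <code>speedup</code> helper is a name chosen for this example.</p>

```python
# p50 latencies in ms, as reported for ConvNext on the QNN NPU.
P50_MS = {"fp32": 19.39, "w8a16": 10.29, "w8a16_compiled": 6.01}


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


quant_speedup = speedup(P50_MS["fp32"], P50_MS["w8a16"])
total_speedup = speedup(P50_MS["fp32"], P50_MS["w8a16_compiled"])
compile_only = speedup(P50_MS["w8a16"], P50_MS["w8a16_compiled"])

print(f"quantization: {quant_speedup:.1f}x")   # quantization: 1.9x
print(f"quant + compile: {total_speedup:.1f}x")  # quant + compile: 3.2x
print(f"compile alone: {compile_only:.1f}x")   # compile alone: 1.7x
```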
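+<p><strong>Empirical data: ConvNext QNN GPU optimization sweep (Adreno X1-85) — full search</strong></p>
+<table>
+<thead>
+<tr>
+<th>Experiment</th>
+<th>p50</th>
+<th>p90</th>
+<th>std</th>
+<th>vs FP32</th>
+<th>Notes</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>FP32 baseline (autoconf)</td>
+<td><strong>17.7ms</strong></td>
+<td>19.7ms</td>
+<td>0.97</td>
+<td>—</td>
+<td>✅ <strong>OPTIMAL with current CLI</strong></td>
+</tr>
+<tr>
+<td>NHWC transformer</td>
+<td>19.5ms</td>
+<td>23.8ms</td>
+<td>3.43</td>
+<td>❌ −10%</td>
+<td>Hurts Adreno+QNN EP</td>
+</tr>
+<tr>
+<td>NHWC + all GPU fusions</td>
+<td>18.1ms</td>
+<td>23.9ms</td>
+<td>2.71</td>
+<td>❌ −2%</td>
+<td>Still worse</td>
+</tr>
+<tr>
+<td>Conv/norm fusions (no NHWC)</td>
+<td>17.6ms</td>
+<td>22.6ms</td>
+<td>5.51</td>
+<td>≈0%</td>
+<td>Variance ↑, no gain</td>
+</tr>
+<tr>
+<td>LayerNorm rewrite</td>
+<td>18.4ms</td>
+<td>21.4ms</td>
+<td>2.04</td>
+<td>❌ −4%</td>
+<td>Pattern mismatch anyway</td>
+</tr>
+<tr>
+<td>Transpose optimizer</td>
+<td>0% node Δ</td>
+<td>—</td>
+<td>—</td>
+<td>no-op</td>
+<td>Already optimal positions</td>
+</tr>
+<tr>
+<td>HiDimRTR→LowDimRTR</td>
+<td>0% node Δ</td>
+<td>—</td>
+<td>—</td>
+<td>no-op</td>
+<td>ConvNext RTR doesn't match pattern</td>
+</tr>
+<tr>
+<td>MatMulAdd→Conv2D (2d/3d/4d)</td>
+<td>0% node Δ</td>
+<td>—</td>
+<td>—</td>
+<td>no-op</td>
+<td>ConvNext uses Reshape→MatMul, not bare MatMul+Add</td>
+</tr>
+<tr>
+<td>FP32 + compile</td>
+<td>23.7ms</td>
+<td>—</td>
+<td>—</td>
+<td>❌ −34%</td>
+<td>Compile hurts GPU (opposite of NPU)</td>
+</tr>
+<tr>
+<td>W8A8 QDQ quantized</td>
+<td>hangs</td>
+<td>—</td>
+<td>—</td>
+<td>❌ blocked</td>
+<td>#868 enhancement (fast-fail)</td>
+</tr>
+<tr>
+<td>FP16 (invalid CLI path)</td>
+<td>8.8ms</td>
+<td>~32ms</td>
+<td>bimodal</td>
+<td>⚠️ 2× p50</td>
+<td>BLOCKED — need #867</td>
+</tr>
+</tbody>
+</table>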
+<p><strong>Root cause: why no pass matches ConvNext on QNN GPU</strong></p>
+<ul>
+<li>All 251 ops run natively on GPU (251/0/0/0) — no CPU fallback to eliminate</li>
+<li>ConvNext linear layers: <code>Reshape → MatMul → Reshape</code> pattern, not bare <code>MatMul+Add</code> → Conv2D rewrites don't match</li>
+<li>72 Reshape + 42 Transpose are already at minimum / optimal topology from PyTorch export</li>
+<li><code>winml build</code> autoconf (gelu_fusion + matmul_add_fusion) already applied all relevant transforms</li>
+<li>The bottleneck is compute throughput + memory bandwidth — only FP16 (smaller tensors) can improve this</li>
+</ul>
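+<p><strong>Key insight: gelu_fusion matters for variance, not p50</strong></p>
+<table>
+<thead>
+<tr>
+<th>Version</th>
+<th>p50</th>
+<th>p90</th>
+<th>std</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Raw export (287 nodes, unfused Gelu)</td>
+<td>17.4ms</td>
+<td>29.2ms</td>
+<td>5.90</td>
+</tr>
+<tr>
+<td>Autoconf (251 nodes, fused Gelu+Gemm)</td>
+<td>17.7ms</td>
+<td>19.7ms</td>
+<td>0.97</td>
+</tr>
+</tbody>
+</table>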
+<p>Unfused Gelu = 5 separate GPU kernel launches (Mul→Div→Erf→Mul→Add) with scheduling jitter.
+A single <code>Gelu</code> kernel eliminates dispatch overhead → p90 −33% (29.2ms → 19.7ms), std reduced ~6× (5.90 → 0.97).
+→ autoconf's role on GPU is <strong>stability</strong>, not speedup. Critical for real-time / latency-SLA deployments.</p>
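+<p>Recomputing the changes from the two rows of the gelu_fusion table: the fused graph's p90 is about 32.5% lower than the raw export's (equivalently, the raw export's p90 is about 48% higher than the fused one), while p50 moves by under 2%. The snippet is plain arithmetic on the reported values; <code>relative_change</code> is a name chosen for this example.</p>

```python
def relative_change(before, after):
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


raw = {"p50": 17.4, "p90": 29.2, "std": 5.90}    # 287 nodes, unfused Gelu
fused = {"p50": 17.7, "p90": 19.7, "std": 0.97}  # 251 nodes, fused Gelu+Gemm

p50_change = relative_change(raw["p50"], fused["p50"])
p90_change = relative_change(raw["p90"], fused["p90"])
std_ratio = raw["std"] / fused["std"]

print(f"p50 {p50_change:+.1%}, p90 {p90_change:+.1%}, std ratio {std_ratio:.1f}x")
# p50 +1.7%, p90 -32.5%, std ratio 6.1x
```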
+<p>→ <strong>QNN GPU search space exhausted.</strong> FP16 is the only remaining lever, blocked by #867.</p>
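+<p><strong>Empirical data: ConvNext DML optimization sweep (Adreno X1-85, DirectML)</strong></p>
+<table>
+<thead>
+<tr>
+<th>Experiment</th>
+<th>p50</th>
+<th>p90</th>
+<th>std</th>
+<th>vs FP32</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>FP32 baseline (autoconf, 251 nodes)</td>
+<td><strong>16.9ms</strong></td>
+<td>17.7ms</td>
+<td>0.52</td>
+<td>— ← OPTIMAL with current CLI</td>
+</tr>
+<tr>
+<td>NHWC transformer</td>
+<td>16.5ms</td>
+<td>21.0ms</td>
+<td>1.89</td>
+<td>❌ p90 worse</td>
+</tr>
+<tr>
+<td>Raw unfused export (287 nodes)</td>
+<td>16.5ms</td>
+<td>18.4ms</td>
+<td>2.74</td>
+<td>❌ p99=35ms, worse tail</td>
+</tr>
+<tr>
+<td>FP16 (Python hack ⚠️)</td>
+<td><strong>11.8ms</strong></td>
+<td>12.8ms</td>
+<td>0.66</td>
+<td>✅ <strong>1.4× faster, clean dist</strong> — BLOCKED #867</td>
+</tr>
+</tbody>
+</table>
+<p><strong>DML vs QNN GPU comparison (same Adreno X1-85):</strong></p>
+<table>
+<thead>
+<tr>
+<th>Metric</th>
+<th>QNN GPU FP32</th>
+<th>DML FP32</th>
+<th>DML FP16 (invalid)</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>p50</td>
+<td>17.7ms</td>
+<td><strong>16.9ms</strong></td>
+<td><strong>11.8ms</strong></td>
+</tr>
+<tr>
+<td>p90</td>
+<td>19.7ms</td>
+<td><strong>17.7ms</strong></td>
+<td><strong>12.8ms</strong></td>
+</tr>
+<tr>
+<td>std</td>
+<td>0.97</td>
+<td><strong>0.52</strong></td>
+<td><strong>0.66</strong></td>
+</tr>
+</tbody>
+</table>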
+<p>→ DML is consistently faster and more stable than QNN GPU at FP32. Root cause: DML JIT-compiles HLSL shaders at model load time; QNN GPU EP does graph partitioning at each session creation.
+→ DML FP16: no DVFS bimodal (unlike QNN GPU FP16) — DML's shader compilation locks in FP16 compute paths.
+→ NHWC hurts DML too (same reason as QNN GPU: Adreno X1-85 + D3D12 doesn't benefit from explicit NHWC transforms).
+→ Note: <code>winml analyze</code> returns 0/0/0/251 (all Unknown) for DML — no rule data. DML supports all standard ONNX ops by design.</p>
+<p><strong>QNN Hub benchmark comparison (Snapdragon X Elite CRD) — WITH cross-stack test</strong></p>
+<table>
+<thead>
+<tr>
+<th>Model</th>
+<th>Stack</th>
+<th>NPU p50</th>
+<th>GPU p50</th>
+<th>Notes</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>QNN Hub Float (opset 21, 222 nodes, MatMul)</td>
+<td>qairt cloud</td>
+<td><strong>2.687ms</strong></td>
+<td>—</td>
+<td>Reference</td>
+</tr>
+<tr>
+<td>QNN Hub Float (same model)</td>
+<td>winml ORT QNN EP</td>
+<td><strong>8.78ms</strong></td>
+<td>23.9ms</td>
+<td>Direct test on this device</td>
+</tr>
+<tr>
+<td>Our Float (opset 17, 251 nodes, Gemm)</td>
+<td>winml ORT QNN EP</td>
+<td>19.4ms</td>
+<td>17.7ms</td>
+<td>winml build output</td>
+</tr>
+<tr>
+<td>QNN Hub W8A16 (opset 21, 798 QDQ, uint16 input)</td>
+<td>qairt cloud</td>
+<td><strong>2.612ms</strong></td>
+<td>—</td>
+<td>Reference</td>
+</tr>
+<tr>
+<td>QNN Hub W8A16 (same model)</td>
+<td>winml ORT QNN EP</td>
+<td>14.82ms (std=8.8!)</td>
+<td>—</td>
+<td>ORT-QNN mismatch</td>
+</tr>
+<tr>
+<td>Our W8A16 + compile (opset 17, ORT quant)</td>
+<td>winml ORT QNN EP</td>
+<td><strong>6.01ms</strong></td>
+<td>—</td>
+<td>Best we can do</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Gap decomposition (three independent sources):</strong></p>
+<div class="codehilite"><pre><span></span><code>QNN Hub cloud:   2.7ms
+                  ↑ 3.3× Runtime gap  (qairt native vs ORT QNN EP adapter overhead)
+QNN Hub on winml: 8.78ms
+                  ↑ 2.2× Model graph gap (opset 21/MatMul/222 nodes vs opset 17/Gemm/251 nodes)
+Our model on winml: 19.4ms (FP32)
+</code></pre></div>
+
+<p><strong>Actionable findings (updated 2026-06-10 — mechanism confirmed via ORT source):</strong>
+1. <strong>opset 21 NPU speedup mechanism CONFIRMED — but ORT-version-dependent</strong> (#869)
+   - <strong>Root cause</strong>: <code>kMaxSupportedOpset</code> gate in <code>IsSupportedOpset()</code> (layout_transformation.cc). On older ORT where <code>kMaxSupportedOpset</code> &lt; 21, opset 21 models bypass the NHWC layout transform entirely (<code>transform_layout_fn = nullptr</code>).
+   - <strong>Why bypass helps ConvNext</strong>: NHWC transform inserts <code>Transpose(NCHW→NHWC/NHWC→NCHW)</code> around Conv. ConvNext residual connections <strong>block</strong> full transpose cancellation → extra Transpose ops on HTP → slower. Bypassing = cleaner graph = faster.
+   - <strong>Critical caveat</strong>: Current ORT main has <code>kMaxSupportedOpset = 26</code> → BOTH opset 17 and 21 get NHWC transform. <strong>Must verify ORT version</strong> before assuming the speedup exists.
+   - <strong>Does NOT generalize</strong> to: MobileNet/EfficientNet (no residual Transpose blocks), ViT (no Conv).
+   - <strong>Perf claim validation status</strong>: Gate 1 (iter≥1000×3) and Gate 3 (thermal control) still FAILED. Perf numbers are DVFS-dominated.
+2. <strong>Runtime stack gap (3.3×) is structural</strong>: qairt native will always be faster. Correct baseline = "QNN Hub ONNX on winml" (8.78ms).
+3. <strong>QNN Hub W8A16 is WORSE on our stack</strong> (14.82ms, std=8.8ms): opset 21 QDQ + uint16 input incompatible with ORT QNN EP format.
+4. <strong>Opset is a search dimension</strong> — but the correct action is a FULL SWEEP (17–22), not "try 21 first". The optimal opset depends on ORT version.</p>
+<p><strong>EP-specific search space rules</strong></p>
+<table>
+<thead>
+<tr>
+<th>EP</th>
+<th>Quantization</th>
+<th>Opset</th>
+<th>Graph passes</th>
+<th>Compile</th>
+<th>Key insight</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>QNN NPU</td>
+<td>✅ W8A16</td>
+<td>Full sweep 17-22 (mechanism ORT-version-dependent)</td>
+<td>autoconf (gelu+matmul_add)</td>
+<td>✅ Always</td>
+<td>W8A8 catastrophic on LN+GELU; opset effect depends on ORT kMaxSupportedOpset</td>
+</tr>
+<tr>
+<td>QNN GPU</td>
+<td>❌ Skip</td>
+<td>17 (opset 21 not validated)</td>
+<td>autoconf only</td>
+<td>❌ Skip</td>
+<td>Compile regresses; FP16 only lever (#867)</td>
+</tr>
+<tr>
+<td>DML</td>
+<td>❌ Skip</td>
+<td>17 (opset 21 not validated)</td>
+<td>autoconf only</td>
+<td>N/A</td>
+<td>FP16 primary lever (#867); faster+stabler than QNN GPU</td>
+</tr>
+<tr>
+<td>CPU</td>
+<td>❌ Skip</td>
+<td>17 only (kMaxSupportedOpset causes 3-4× regression on 19+)</td>
+<td>nchwc, matmul-add, gelu</td>
+<td>N/A</td>
+<td>kMaxSupportedOpset gate hurts CPU for same reason it helps QNN</td>
+</tr>
+</tbody>
+</table>
+<p>Rule: autoconfig must use EP-specific search space. Do NOT run quantization experiments for GPU/DML/CPU.
+Rule: for QNN NPU opset sweep, verify ORT <code>kMaxSupportedOpset</code> first — if ≥ 22, all opsets get NHWC transform and the opset-based speedup may not apply.
+Rule: for NPU, if W8A8 top-1 ≤ 15% on first attempt → skip all W8A8 variants, go directly to W8A16.
+Rule: always run <code>winml compile</code> after finding best quantized config for QNN NPU. NEVER compile for GPU (regresses).
+Rule: for GPU/DML, skip ALL graph optimization passes beyond what <code>winml build</code> autoconf applies (NHWC and additional fusions hurt).
+Rule: W8A8 QDQ on GPU EP hangs — skip quantization immediately for GPU targets without testing.</p>
+<p><strong>User scenario mapping</strong></p>
+<table>
+<thead>
+<tr>
+<th>Scenario</th>
+<th>How autoconfig addresses it</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>S1: LLM fast support (7-30d)</td>
+<td>autoconfig replaces manual per-EP tuning; outputs <code>config_optimal.json + report.html</code> deployable in hours not days</td>
+</tr>
+<tr>
+<td>S2: ISV non-LLM model support</td>
+<td>Exact use case: ISV brings model → autoconfig finds config → report is deliverable with SOP turnaround</td>
+</tr>
+<tr>
+<td>S3: Cross-EP parity</td>
+<td>Multi-EP parallel run: same model, EP-specific search spaces in parallel → output config matrix per EP</td>
+</tr>
+<tr>
+<td>S4: Customer ONNX can't run</td>
+<td>Phase 0 intake diagnoses "can't run" (partial ops → block reason); Phase 1+2 finds "escape config" for "runs poorly"</td>
+</tr>
+<tr>
+<td>S5: PyTorch HF Hub coverage</td>
+<td>Phase 0 IS the "can WinML run it?" gate; failed Phase 0 → structured block reason feeds long-tail gap tracking</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Dependencies on code changes</strong>:
+- <code>winml perf --profile</code> (new flag) — adds per-op bottleneck output alongside existing latency metrics; POC script <code>winml_profile.py</code> exists to unblock
+- <code>--format json</code> on <code>winml eval</code> (#847), <code>winml analyze</code> (#848), <code>winml perf</code> (#849)</p>
+<h3 id="cross-references">Cross-references<a class="headerlink" href="#cross-references" title="Permanent link">&para;</a></h3>
+<ul>
+<li>Run <code>check-model-feasibility</code> before starting to pick a model and verify the EP is available</li>
+<li>After autoconfig completes → <code>ship-to-winapp</code> for final validation gates + packaging</li>
+<li>If autoconfig cannot meet objective → <code>debug-model</code> for deeper diagnosis</li>
+<li>Multi-EP output feeds directly into <code>ship-to-winapp</code>'s manifest layout</li>
+<li>If the best config found is still not good enough → escalate to <code>optimization-research</code></li>
+</ul>
+<hr />
+<h2 id="skill-optimization-research-contributor-internal-deep-gap-analysis">Skill: <code>optimization-research</code> (contributor — internal, deep gap analysis)<a class="headerlink" href="#skill-optimization-research-contributor-internal-deep-gap-analysis" title="Permanent link">&para;</a></h2>
+<h3 id="frontmatter_7">Frontmatter<a class="headerlink" href="#frontmatter_7" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">optimization-research</span>
+<span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">&gt;</span>
+<span class="w">  </span><span class="no">Use this skill when a winml-cli engineer wants to find out whether a model can</span>
+<span class="w">  </span><span class="no">be optimized better than what winml-cli currently achieves, identify what is</span>
+<span class="w">  </span><span class="no">blocking that optimization, and produce concrete backlog work items.</span>
+<span class="w">  </span><span class="no">The agent performs a deep search across: ORT source code and its optimizer</span>
+<span class="w">  </span><span class="no">passes, Olive recipes and benchmarks, other ONNX ecosystem tools (onnxsim,</span>
+<span class="w">  </span><span class="no">onnxoptimizer, neural-compressor, etc.), and native stack reference models</span>
+<span class="w">  </span><span class="no">and datasets. It compares the best achievable result (using all available tools)</span>
+<span class="w">  </span><span class="no">against what winml produces today, diagnoses the gap, and files GitHub issues</span>
+<span class="w">  </span><span class="no">with reproduction steps. Use when an internal engineer says &quot;why is this model</span>
+<span class="w">  </span><span class="no">slower than it should be&quot;, &quot;what optimization techniques are we missing&quot;,</span>
+<span class="w">  </span><span class="no">or &quot;what would it take to match Olive&#39;s results&quot;.</span>
+
+<span class="nt">audience</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">internal (winml-cli team engineers)</span>
+</code></pre></div>
+
+<h3 id="when-to-use_7">When to use<a class="headerlink" href="#when-to-use_7" title="Permanent link">&para;</a></h3>
+<ul>
+<li>"ConvNext on QNN is 3× slower than what Qualcomm's SDK achieves — why?"</li>
+<li>"Olive gets 15ms on this model; winml gets 28ms — what's the gap?"</li>
+<li>"We're seeing quantization accuracy drop on LLaMA; are there better calibration methods we're not supporting?"</li>
+<li>"What would it take to match ORT's best-known config for this architecture?"</li>
+<li>After <code>autoconfig</code> hits a ceiling: best config found is still not meeting the objective</li>
+</ul>
+<h3 id="what-this-skill-produces">What this skill produces<a class="headerlink" href="#what-this-skill-produces" title="Permanent link">&para;</a></h3>
+<p><strong>Primary outputs:</strong>
+1. <strong><code>gap_analysis.md</code></strong> — structured report of what the best achievable result is and what's missing
+2. <strong><code>repro/</code></strong> — scripts to reproduce the better result using external tools
+3. <strong>GitHub issues</strong> — one per identified gap, filed against winml-cli with: repro steps, expected vs actual, what ORT/Olive/ecosystem already does, proposed fix direction</p>
+<hr />
+<h3 id="design-deep-search-process">Design: Deep Search Process<a class="headerlink" href="#design-deep-search-process" title="Permanent link">&para;</a></h3>
+<svg width="700" height="530" viewBox="0 0 700 530" font-family="-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif" style="display:block;max-width:100%;margin:12px 0;">
+  <defs>
+    <marker id="sk-res" markerWidth="8" markerHeight="6" refX="7" refY="3" orient="auto">
+      <polygon points="0 0,8 3,0 6" fill="#7986cb"/>
+    </marker>
+  </defs>
+  <!-- PHASE 1 — BASELINE -->
+  <rect x="10" y="0" width="680" height="70" rx="8" fill="#eef1fc" stroke="#7986cb" stroke-width="1.5"/>
+  <rect x="10" y="0" width="680" height="26" rx="8" fill="#3949ab"/>
+  <rect x="10" y="16" width="680" height="10" fill="#3949ab"/>
+  <text x="350" y="18" text-anchor="middle" font-size="11.5" font-weight="700" fill="#fff">PHASE 1 — BASELINE</text>
+  <text x="28" y="44" font-size="10.5" fill="#1a237e">●  winml autoconfig best result for this model / EP</text>
+  <text x="28" y="59" font-size="10.5" fill="#1a237e">●  (or provided by user if already run)</text>
+  <line x1="350" y1="70" x2="350" y2="92" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-res)"/>
+  <!-- PHASE 2 — EXTERNAL BENCHMARK -->
+  <rect x="10" y="94" width="680" height="98" rx="8" fill="#e3f2fd" stroke="#1976d2" stroke-width="1.5"/>
+  <rect x="10" y="94" width="680" height="26" rx="8" fill="#1976d2"/>
+  <rect x="10" y="110" width="680" height="10" fill="#1976d2"/>
+  <text x="350" y="112" text-anchor="middle" font-size="11.5" font-weight="700" fill="#fff">PHASE 2 — EXTERNAL BENCHMARK</text>
+  <text x="28" y="138" font-size="10.5" fill="#0d47a1">A.  ORT optimizer directly (onnxruntime.transformers)</text>
+  <text x="28" y="153" font-size="10.5" fill="#0d47a1">B.  Olive (olive-ai) with EP-specific recipe</text>
+  <text x="28" y="168" font-size="10.5" fill="#0d47a1">C.  onnxsim + onnxoptimizer (static graph simplification)</text>
+  <text x="28" y="183" font-size="10.5" fill="#0d47a1">D.  neural-compressor (Intel) for quantization comparison  →  record best latency, accuracy, config</text>
+  <line x1="350" y1="192" x2="350" y2="214" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-res)"/>
+  <!-- PHASE 3 — GAP DIAGNOSIS -->
+  <rect x="10" y="216" width="680" height="115" rx="8" fill="#e8f5e9" stroke="#388e3c" stroke-width="1.5"/>
+  <rect x="10" y="216" width="680" height="26" rx="8" fill="#388e3c"/>
+  <rect x="10" y="232" width="680" height="10" fill="#388e3c"/>
+  <text x="350" y="234" text-anchor="middle" font-size="11.5" font-weight="700" fill="#fff">PHASE 3 — GAP DIAGNOSIS</text>
+  <text x="28" y="260" font-size="10.5" fill="#1b5e20">For each gap (external better than winml): diff ONNX graphs · read ORT source · check capability registry · check Olive recipe</text>
+  <text x="28" y="278" font-size="10" fill="#2e7d32">Classify as one of:</text>
+  <rect x="28" y="286" width="130" height="18" rx="4" fill="#ffebee"/><text x="93" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#c62828">MISSING_CAPABILITY</text>
+  <rect x="168" y="286" width="110" height="18" rx="4" fill="#fff3e0"/><text x="223" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#e65100">WRONG_DEFAULT</text>
+  <rect x="288" y="286" width="50" height="18" rx="4" fill="#fce4ec"/><text x="313" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#880e4f">BUG</text>
+  <rect x="348" y="286" width="120" height="18" rx="4" fill="#e8f5e9"/><text x="408" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#1b5e20">CALIBRATION_DATA</text>
+  <rect x="478" y="286" width="105" height="18" rx="4" fill="#e8eaf6"/><text x="530" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#283593">EP_LIMITATION</text>
+  <rect x="593" y="286" width="95" height="18" rx="4" fill="#f3e5f5"/><text x="640" y="299" text-anchor="middle" font-size="9" font-weight="700" fill="#4a148c">KNOWN_TRADEOFF</text>
+  <text x="28" y="322" font-size="10" fill="#1b5e20" font-style="italic">a. Diff graphs  b. ORT source  c. winml capability registry (missing? disabled? wired wrong?)  d. Olive recipe flags/params</text>
+  <line x1="350" y1="331" x2="350" y2="353" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-res)"/>
+  <!-- PHASE 4 — NATIVE STACK VALIDATION -->
+  <rect x="10" y="355" width="680" height="90" rx="8" fill="#fff3e0" stroke="#f57c00" stroke-width="1.5"/>
+  <rect x="10" y="355" width="680" height="26" rx="8" fill="#f57c00"/>
+  <rect x="10" y="371" width="680" height="10" fill="#f57c00"/>
+  <text x="350" y="373" text-anchor="middle" font-size="11.5" font-weight="700" fill="#fff">PHASE 4 — NATIVE STACK VALIDATION</text>
+  <text x="28" y="399" font-size="10.5" fill="#bf360c">●  winml-cli test suite: any reference models of this arch in tests/models/?</text>
+  <text x="28" y="414" font-size="10.5" fill="#bf360c">●  Windows AI Studio / WinML model zoo: listed? at what performance?</text>
+  <text x="28" y="429" font-size="10.5" fill="#bf360c">●  QNN SDK reference benchmarks (if QNN EP): does vendor claim better numbers?</text>
+  <line x1="350" y1="445" x2="350" y2="467" stroke="#7986cb" stroke-width="1.5" marker-end="url(#sk-res)"/>
+  <!-- PHASE 5 — WORK ITEMS -->
+  <rect x="10" y="469" width="680" height="56" rx="8" fill="#ede7f6" stroke="#7b1fa2" stroke-width="1.5"/>
+  <rect x="10" y="469" width="680" height="26" rx="8" fill="#7b1fa2"/>
+  <rect x="10" y="485" width="680" height="10" fill="#7b1fa2"/>
+  <text x="350" y="487" text-anchor="middle" font-size="11.5" font-weight="700" fill="#fff">PHASE 5 — WORK ITEMS</text>
+  <text x="28" y="511" font-size="10.5" fill="#4a148c">MISSING_CAPABILITY / WRONG_DEFAULT → GitHub issue (title · repro · fix · ORT pointer · complexity S/M/L/XL) · BUG → repro script · EP_LIMITATION → SDK reference</text>
+</svg>
+
+<hr />
+<h3 id="key-external-tools-to-invoke">Key external tools to invoke<a class="headerlink" href="#key-external-tools-to-invoke" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="c1"># A. ORT transformer optimizer (the &quot;gold standard&quot; for transformer models)</span>
+python<span class="w"> </span>-c<span class="w"> </span><span class="s2">&quot;</span>
+<span class="s2">from onnxruntime.transformers import optimizer</span>
+<span class="s2">from onnxruntime.transformers.fusion_options import FusionOptions</span>
+<span class="s2">opts = FusionOptions(&#39;bert&#39;)   # or &#39;gpt2&#39;, &#39;clip&#39;, etc.</span>
+<span class="s2">opts.enable_attention = True</span>
+<span class="s2">opts.enable_gelu = True</span>
+<span class="s2">model = optimizer.optimize_model(</span>
+<span class="s2">    &#39;export.onnx&#39;, model_type=&#39;bert&#39;,</span>
+<span class="s2">    num_heads=12, hidden_size=768,</span>
+<span class="s2">    optimization_options=opts</span>
+<span class="s2">)</span>
+<span class="s2">model.save_model_to_file(&#39;ort_optimized.onnx&#39;)</span>
+<span class="s2">&quot;</span>
+
+<span class="c1"># B. Olive (end-to-end, EP-aware)</span>
+olive<span class="w"> </span>run<span class="w"> </span>--config<span class="w"> </span>olive_recipe.json
+<span class="c1"># olive recipe template: see skills/optimization-research/templates/olive_qnn.json</span>
+
+<span class="c1"># C. onnxsim (structural simplification)</span>
+python<span class="w"> </span>-m<span class="w"> </span>onnxsim<span class="w"> </span>export.onnx<span class="w"> </span>simplified.onnx
+
+<span class="c1"># D. onnxoptimizer</span>
+python<span class="w"> </span>-c<span class="w"> </span><span class="s2">&quot;</span>
+<span class="s2">import onnxoptimizer, onnx</span>
+<span class="s2">m = onnx.load(&#39;export.onnx&#39;)</span>
+<span class="s2">passes = onnxoptimizer.get_fuse_and_elimination_passes()  # get_available_passes() also returns split_init/split_predict</span>
+<span class="s2">m2 = onnxoptimizer.optimize(m, passes)</span>
+<span class="s2">onnx.save(m2, &#39;onnxopt.onnx&#39;)</span>
+<span class="s2">&quot;</span>
+</code></pre></div>
+
+<hr />
+<h3 id="gap-report-format-gap_analysismd">Gap report format (<code>gap_analysis.md</code>)<a class="headerlink" href="#gap-report-format-gap_analysismd" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code><span class="gh"># Optimization Gap Analysis: &lt;model_id&gt; on &lt;ep&gt;</span>
+
+Date: &lt;timestamp&gt;
+winml-cli version: &lt;version&gt;
+ORT version: &lt;version&gt;
+
+<span class="gu">## Summary</span>
+| Tool | Latency p50 | Accuracy | Config notes |
+|---|---|---|---|
+| winml best (autoconfig) | 28.3ms | 0.953 | W8A16, entropy, 256 samples |
+| ORT transformer optimizer | 19.1ms | 0.951 | model_type=bert, all fusions |
+| Olive QNN recipe | 17.8ms | 0.948 | W8A8 + attention fusion |
+| <span class="gs">**Gap**</span> | <span class="gs">**10.5ms (37%)**</span> | — | — |
+
+<span class="gu">## Gap 1: [MISSING_CAPABILITY] FusedMatMul with rotary embedding</span>
+<span class="gs">**What external tool does:**</span> ...
+<span class="gs">**What winml does:**</span> ...
+<span class="gs">**ORT source:**</span> <span class="sb">`onnxruntime/python/tools/transformers/fusion_rotary_attention.py`</span>
+<span class="gs">**Proposed fix:**</span> Add RotaryAttentionFusion to FusionPipe capability registry
+<span class="gs">**Estimated effort:**</span> M
+
+<span class="gu">## Gap 2: [WRONG_DEFAULT] attention-fusion disabled by default</span>
+...
+</code></pre></div>
+
+<hr />
+<h3 id="github-issue-template">GitHub issue template<a class="headerlink" href="#github-issue-template" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code>title: [optimization-gap] &lt;model_arch&gt;/&lt;ep&gt;: &lt;gap description&gt;
+
+body:
+<span class="gu">## Summary</span>
+&lt;one-sentence description of what&#39;s missing&gt;
+
+<span class="gu">## Reproduction</span>
+```bash
+<span class="gh"># Install</span>
+uv pip install winml-cli
+
+<span class="gh"># Baseline (winml current)</span>
+winml build -c config.json -m &lt;model-id&gt; -o winml_out/
+winml perf -m winml_out/model.onnx --ep &lt;ep&gt; --warmup 10 --iterations 50
+
+<span class="gh"># Better result (external)</span>
+&lt;commands to reproduce the external result&gt;
+</code></pre></div>
+
+<h2 id="expected-vs-actual">Expected vs actual<a class="headerlink" href="#expected-vs-actual" title="Permanent link">&para;</a></h2>
+<ul>
+<li>External tool achieves: &lt;latency&gt;ms at &lt;accuracy&gt;</li>
+<li>winml achieves: &lt;latency&gt;ms at &lt;accuracy&gt;</li>
+<li>Gap: &lt;delta&gt;ms (&lt;pct&gt;%)</li>
+</ul>
+<h2 id="root-cause">Root cause<a class="headerlink" href="#root-cause" title="Permanent link">&para;</a></h2>
+<p>&lt;what the external tool does that winml doesn't&gt;</p>
+<h2 id="ort-source-reference">ORT source reference<a class="headerlink" href="#ort-source-reference" title="Permanent link">&para;</a></h2>
+<p>&lt;link to relevant ORT optimizer code&gt;</p>
+<h2 id="proposed-fix-direction">Proposed fix direction<a class="headerlink" href="#proposed-fix-direction" title="Permanent link">&para;</a></h2>
+<p>&lt;what capability / default change / bug fix would close this gap&gt;</p>
+<h2 id="complexity-estimate">Complexity estimate<a class="headerlink" href="#complexity-estimate" title="Permanent link">&para;</a></h2>
+<p>S / M / L / XL</p>
+<div class="codehilite"><pre><span></span><code>---
+
+### What this skill does NOT do
+- Does not make code changes to winml-cli itself (files issues only)
+- Does not run production benchmarks (uses quick screening methodology)
+- Does not replace formal performance testing with validated hardware
+
+### Cross-references
+- `autoconfig` provides the winml baseline to compare against
+- Issues filed here feed `adding-ep-support` and `contributing-a-skill` workflows
+- Use `check-model-feasibility` to confirm EP availability before running external benchmarks
+
+---
+
+
+---
+
+## ConvNext Autoconfig POC — Rigorous Ablation Results
+
+**Source:** `C:\tmp\autoconfig-demo\ablation.py` — 4-phase rigorous ablation experiment
+**Measurement:** `winml perf --ep cpu --warmup 10 --iterations 50` — pure inference latency, no preprocessing
+**Design:** 3 independent runs per config; promotion threshold = max(3%, 2×σ_baseline); correctness gate (`winml eval --samples 20`) per config
+**Report:** `C:\tmp\autoconfig-demo\report.html` | **Config:** `C:\tmp\autoconfig-demo\config_cpu_optimal.json`
+
+### Graph structure (facebook/convnext-tiny-224, opset 17)
+
+**Op counts (raw export):** 287 nodes total
+</code></pre></div>
+
+<p>Add×72  Mul×54  Transpose×42  MatMul×36  LayerNormalization×23
+Conv×22  Div×18  Erf×18  ReduceMean×1  Gemm×1</p>
+<div class="codehilite"><pre><span></span><code>**ConvNext block structure** (traced from first DW-Conv):
+</code></pre></div>
+
+<p>DW-Conv(7x7, g=96)  → Transpose
+→ LayerNormalization (native, already fused at export)
+→ MatMul(C→4C)      → Add(bias)
+→ [GELU: Div → Erf → Add(1) → Mul → Mul(0.5)]   ← 18 unfused in export
+→ MatMul(4C→C)      → Add(bias)   [Gemm after ORT L2]
+→ Mul (layer scale) → Add (residual)
+→ Transpose (back to NCHW)</p>
+<div class="codehilite"><pre><span></span><code>**Conv breakdown:** 4 regular (1×stem 4x4, 3×downsample 2x2 stride-2), 18×DW-Conv 7x7
+
+**Transpose patterns:**
+</code></pre></div>
+
+<p>19× Conv → Transpose → LayerNormalization     (NCHW→NHWC for LN)
+15× Mul  → Transpose → Add                   (NHWC→NCHW for residual)
+ 4× LayerNormalization → Transpose → Conv    (NHWC→NCHW for next DW-Conv)
+ 2× Add  → Transpose → Conv
+ 2× Add  → Transpose → LayerNormalization</p>
+<div class="codehilite"><pre><span></span><code>→ ConvNext is a **Transpose-sandwich** model: alternates NCHW (Conv) and NHWC (LN) layout
+
+**Observed graph transformation (export.onnx → model.onnx after winml build, baseline config):**
+| Op | export.onnx | model.onnx (baseline) | Change |
+|---|---|---|---|
+| `com.microsoft/Gelu` | 0 | 18 | +18 |
+| `Gemm` | 1 | 37 | +36 |
+| `MatMul` | 36 | 0 | −36 |
+| `Add` | 72 | 18 | −54 |
+| `Mul` | 54 | 18 | −36 |
+| `Div`, `Erf` | 18 each | 0 | −18 each |
+| `Reshape` | 0 | 72 | +72 |
+
+**Observation (confirmed):** The baseline `model.onnx` (no user fusion flags) already differs substantially from `export.onnx`. GELU and MatMul+Add are fused before any user capability flag is applied.
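+
+The op-count changes in the table can be re-derived mechanically. The following is a minimal sketch, not part of `autoconfig.py`: the helper name `op_delta` is hypothetical, and the counts are copied from the table rather than read from the actual model files.
+
```python
from collections import Counter


def op_delta(before, after):
    """Return {op: after - before} for every op whose count changed."""
    ops = sorted(set(before) | set(after))
    return {op: after[op] - before[op] for op in ops if after[op] != before[op]}


# Counts copied from the table above. With the onnx package installed, the same
# Counter could be built as Counter(n.op_type for n in onnx.load(path).graph.node).
export_ops = Counter({"Gemm": 1, "MatMul": 36, "Add": 72, "Mul": 54, "Div": 18, "Erf": 18})
built_ops = Counter({"Gelu": 18, "Gemm": 37, "Add": 18, "Mul": 18, "Reshape": 72})

delta = op_delta(export_ops, built_ops)
for op, change in delta.items():
    print(f"{op:8s} {change:+d}")
```
+
+The deltas are consistent with the block structure traced earlier: 36 bias Adds (one per MatMul) plus the 18 `Add(1)` nodes inside the GELU pattern account for the −54 Add, and two Muls per GELU account for the −36 Mul. This arithmetic does not settle which component performs the fusion.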
+
+**Open question (unresolved):** The `ORTGraphPipe` design (graph.py) is supposed to disable `GeluFusion`/`GeluFusionL2`/`LayerNormFusion` in the baseline via `optimization.disable_specified_optimizers`. Yet the baseline output clearly contains `com.microsoft/Gelu`. This contradiction is unresolved — possible explanations include: ORT name mismatch in disabled list, a different code path fusing GELU, or the export step (via HF Optimum) applying fusion before winml. **This must be investigated before any mechanistic claims about &quot;ORT L2 already does X&quot; are written in user-facing reports.**
+
+---
+
+### Ablation results (rigorous, Phase 0–4)
+
+**Clean baseline:** 43.7ms p50 (base_0 + base_1, 6 runs, all within 42.5–45.4ms)
+
+| config | p50 mean | Δ vs baseline | runs (ms) | verdict |
+|---|---|---|---|---|
+| base_0 | 43.0ms | −0.6ms | 43.8 / 42.7 / 42.5 | baseline |
+| base_1 | 44.3ms | +0.6ms | 43.2 / 44.3 / 45.4 | baseline |
+| base_2 | 73.5ms | +29.8ms | 47.2 / **127.1** / 46.2 | outlier run (system spike) |
+| opset_18 | 48.0ms | +4.3ms | 50.2 / 44.0 / 49.7 | neutral |
+| **opset_19** | **160.3ms** | **+116ms** | **147.6 / 145.8 / 187.4** | **⚠️ SEVERE REGRESSION** |
+| **opset_20** | **131.0ms** | **+87ms** | **135.7 / 129.8 / 127.5** | **⚠️ SEVERE REGRESSION** |
+| **opset_21** | **170.3ms** | **+126ms** | **190.1 / 164.9 / 155.8** | **⚠️ SEVERE REGRESSION** |
+| **opset_22** | **85.0ms** | **+41ms** | **70.9 / 93.9 / 90.2** | **confirmed regression** |
+| no_cf_17 | 51.8ms | +8.1ms | 56.4 / 49.0 / 49.9 | mild regression |
+| base_mid | 49.4ms | +5.8ms | 51.3 / 51.1 / 45.9 | baseline (mid-exp drift) |
+| gelu_only | 52.5ms | +8.9ms | 53.0 / 55.6 / 49.1 | mild regression |
+| ln_only | 57.2ms | +13.6ms | **79.3** / 47.9 / 44.5 | inconclusive (outlier) |
+| conv_add | 50.2ms | +6.5ms | 47.3 / 55.9 / 47.4 | inconclusive |
+| conv_act | 51.2ms | +7.5ms | 45.2 / 41.9 / **66.4** | inconclusive (outlier) |
+| **matmul_add** | **81.7ms** | **+38.0ms** | **63.0 / 70.8 / 111.2** | **CONFIRMED REGRESSION** |
+| transpose_opt | 45.5ms | +1.8ms | 42.3 / 52.3 / 41.8 | neutral |
+| nchwc | 45.4ms | +1.7ms | 43.4 / 48.0 / 44.7 | neutral |
+| matmul_scale | 56.9ms | +13.3ms | 51.5 / 58.1 / 61.2 | probable mild regression |
+| base_end | 48.3ms | +4.7ms | 45.3 / 56.7 / 43.1 | baseline (end-of-exp drift) |
+
+**Phase 3 outcome:** No candidates met promotion threshold (29.4ms needed). Baseline is optimal.
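+
+The promotion rule from the Design line (`max(3%, 2×σ_baseline)`) can be sketched as follows. This is a minimal sketch under assumptions the source does not state: the 3% is taken relative to the baseline mean, σ is the sample standard deviation of the clean baseline runs, and the names `promotion_threshold_ms` and `is_promoted` are hypothetical. Because of those assumptions, its threshold need not match the 29.4ms figure quoted above.
+
```python
import statistics


def promotion_threshold_ms(baseline_runs, min_rel=0.03, sigma_mult=2.0):
    """Minimum improvement (ms) a candidate must show: max(3% of mean, 2*sigma)."""
    mean = statistics.mean(baseline_runs)
    sigma = statistics.stdev(baseline_runs)  # assumption: sample standard deviation
    return max(min_rel * mean, sigma_mult * sigma)


def is_promoted(candidate_runs, baseline_runs):
    """True when the candidate mean beats the baseline mean by the threshold."""
    gain = statistics.mean(baseline_runs) - statistics.mean(candidate_runs)
    return gain >= promotion_threshold_ms(baseline_runs)


clean_baseline = [43.8, 42.7, 42.5, 43.2, 44.3, 45.4]  # base_0 + base_1 runs
candidates = {
    "transpose_opt": [42.3, 52.3, 41.8],
    "nchwc": [43.4, 48.0, 44.7],
    "matmul_add": [63.0, 70.8, 111.2],
}
promoted = [name for name, runs in candidates.items() if is_promoted(runs, clean_baseline)]
print(promoted)
```
+
+None of the three sample candidates has a mean below the clean baseline mean, so the list is empty, in line with the Phase 3 outcome.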
+
+---
+
+### Confirmed findings (statistically defensible)
+
+**1. `matmul-add-fusion` is a confirmed regression on ConvNext CPU (+38ms)**
+- All 3 independent runs: 63.0 / 70.8 / 111.2ms — each far above the highest clean baseline run (45.4ms)
+- Not attributable to system noise (no run-to-run overlap with baseline distribution)
+- Mechanism hypothesis: baseline already converts MatMul+Add→Gemm (37 Gemm in model.onnx); applying matmul-add-fusion on top may create redundant or conflicting kernel dispatch. Unconfirmed — requires profiling.
+
+**2. `transpose-optimizer` is NEUTRAL on pure inference latency**
+- Runs: 42.3 / 52.3 / 41.8ms — overlapping with clean baseline (42.5–45.4ms)
+- ⚠️ **CORRECTION OF EARLIER FINDING:** A previous 8-iteration search (using `winml eval`) reported +270ms. That was a measurement artifact — `winml eval` includes HF preprocessing pipeline overhead and has no warmup. It measures *application startup + preprocessing + inference*, not *inference alone*. With `winml perf` (warmup=10, iter=50, pure inference): transpose_opt = baseline. Do not cite the +270ms in any report.
+
+**3. `nchwc-transformer` is neutral on this model**
+- NCHWc SIMD layout: 43.4 / 48.0 / 44.7ms — no benefit for ConvNext CPU inference.
+
+**4. opset=18 is neutral**
+- Same node count (251) as opset=17, so there are no graph structure changes. The mean (48.0ms) is slightly above baseline but within machine variance.
+
+**5. No flag improved latency beyond noise. Baseline is the optimal config.**
+
+---
+
+### ⚠️ Critical finding: ORT performance cliff at opset 19 (ConvNext CPU)
+
+**Experiment:** tested opset 17–22, all with identical graph structure (251 nodes, same op counts)
+
+| opset | mean p50 | slowdown |
+|---|---|---|
+| 17 | 43.7ms | — (baseline) |
+| 18 | 48.0ms | 1.1× |
+| **19** | **160.3ms** | **3.7×** |
+| **20** | **131.0ms** | **3.0×** |
+| **21** | **170.3ms** | **3.9×** |
+| **22** | **85.0ms** | **1.9×** |
+
+**Key facts:**
+- Every run at opset 19–22 is far above the slowest clean baseline run (45.4ms), so this is real, not noise
+- Graph structure is **identical in op counts**: Reshape×72, Transpose×42, Gemm×37, LN×23, Conv×22 for ALL opsets (only the declared opset version differs)
+- The performance difference is entirely in ORT&#39;s runtime execution path, not the graph
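+
+The slowdown column can be recomputed from the per-run p50 values in the ablation table. A minimal sketch (the helper name `slowdowns` is hypothetical; the run values are copied from the table):
+
```python
import statistics

BASELINE_RUNS = [43.8, 42.7, 42.5, 43.2, 44.3, 45.4]  # opset 17: base_0 + base_1
OPSET_RUNS = {
    18: [50.2, 44.0, 49.7],
    19: [147.6, 145.8, 187.4],
    20: [135.7, 129.8, 127.5],
    21: [190.1, 164.9, 155.8],
    22: [70.9, 93.9, 90.2],
}


def slowdowns(opset_runs, baseline_runs):
    """Map each opset to mean(p50) / mean(baseline p50), rounded to one decimal."""
    base = statistics.mean(baseline_runs)
    return {o: round(statistics.mean(r) / base, 1) for o, r in opset_runs.items()}


ratios = slowdowns(OPSET_RUNS, BASELINE_RUNS)
for opset, ratio in ratios.items():
    print(f"opset {opset}: {ratio}x")
```
+
+With these inputs the ratios round to 1.1, 3.7, 3.0, 3.9 and 1.9, matching the table.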
+
+**Mechanism: CONFIRMED ROOT CAUSE — ORT `kMaxSupportedOpset` gates Transpose Optimizer**
+
+Source: `onnxruntime/core/optimizer/transpose_optimization/optimizer_api.h`
+```cpp
+constexpr int64_t kMaxSupportedOpset = 18;  // ORT v1.14.x — bumped each ORT release
+</code></pre></div>
+
+<p>Entry point <code>onnx_transpose_optimization::Optimize()</code> → <code>MakeOptimizerContext()</code>:</p>
+<div class="codehilite"><pre><span></span><code><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">*</span><span class="n">opset</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="n">kMaxSupportedOpset</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
+<span class="w">    </span><span class="k">return</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">nullopt</span><span class="p">;</span><span class="w">  </span><span class="c1">// entire Transpose Optimizer skipped silently</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p>ConvNext has 42 Transpose nodes (NCHW↔NHWC sandwich in every block). The Transpose Optimizer normally:
+- Pushes Transposes through Add×18, Mul×18 (layer-scale + residual) across block boundaries
+- Cancels adjacent inverse pairs</p>
+<p>When bypassed (opset &gt; kMaxSupportedOpset), all 42 Transposes execute as full memory-layout copies → 3–4× systemic slowdown.</p>
+<p><strong>ORT optimization level experiment (definitive proof):</strong></p>
+<table>
+<thead>
+<tr>
+<th>Session opt level</th>
+<th>opset=17</th>
+<th>opset=19</th>
+<th>ratio</th>
+<th>explanation</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>DISABLE_ALL</td>
+<td>47.5ms</td>
+<td><strong>355ms</strong></td>
+<td><strong>7.5×</strong></td>
+<td>No Transpose Optimizer → all 42 Transposes raw</td>
+</tr>
+<tr>
+<td>ENABLE_BASIC</td>
+<td>289ms</td>
+<td>315ms</td>
+<td>1.1×</td>
+<td>Both slow (re-optimizing pre-fused graph)</td>
+</tr>
+<tr>
+<td>ENABLE_EXTENDED</td>
+<td>209ms</td>
+<td>241ms</td>
+<td>1.2×</td>
+<td>Better but no layout transform</td>
+</tr>
+<tr>
+<td><strong>ENABLE_ALL</strong></td>
+<td>216ms</td>
+<td><strong>215ms</strong></td>
+<td><strong>1.0×</strong></td>
+<td>Transpose Optimizer runs on both → full parity</td>
+</tr>
+</tbody>
+</table>
+<p><strong><code>kMaxSupportedOpset</code> version history:</strong></p>
+<table>
+<thead>
+<tr>
+<th>ORT version</th>
+<th>kMaxSupportedOpset</th>
+<th>Opsets for which the Transpose Optimizer is skipped</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>v1.14.x</td>
+<td><strong>18</strong></td>
+<td>≥ 19</td>
+</tr>
+<tr>
+<td>v1.16.x</td>
+<td>19</td>
+<td>≥ 20</td>
+</tr>
+<tr>
+<td>v1.17.x</td>
+<td>20</td>
+<td>≥ 21</td>
+</tr>
+<tr>
+<td>v1.18.x</td>
+<td>21</td>
+<td>≥ 22</td>
+</tr>
+<tr>
+<td>main/HEAD</td>
+<td><strong>26</strong></td>
+<td>fully covered</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Classification for optimization-research skill:</strong> <code>[KNOWN_TRADEOFF]</code> (intentional design: ORT bumps the ceiling with each ONNX opset release)
+- winml-cli ships a specific ORT build → its <code>kMaxSupportedOpset</code> is fixed
+- winml-cli's <strong>default opset=17 is correct and essential</strong> — it is the safe zone for all current ORT builds
+- Raising opset requires ensuring the shipping ORT version has <code>kMaxSupportedOpset ≥ target_opset</code>
+- Do NOT raise default opset without verifying <code>kMaxSupportedOpset</code> in the shipped ORT</p>
+<p><strong>Call chain:</strong></p>
+<div class="codehilite"><pre><span></span><code>InferenceSession::Initialize()
+  → TransposeOptimizer::ApplyImpl()         [transpose_optimizer.cc:18]
+      → onnx_transpose_optimization::Optimize()
+          → MakeOptimizerContext()
+              → if opset &gt; kMaxSupportedOpset: return nullopt  ← THE GATE
+</code></pre></div>
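+<p>The gate in the call chain above can be sketched as a small lookup. This is a minimal illustration, not ORT code: the dictionary copies the <code>kMaxSupportedOpset</code> values from the version-history table, and the function name <code>transpose_optimizer_runs</code> is hypothetical.</p>

```python
# kMaxSupportedOpset per ORT release line, copied from the version-history table.
# main/HEAD is omitted because it is not a pinned release.
K_MAX_SUPPORTED_OPSET = {
    "1.14": 18,
    "1.16": 19,
    "1.17": 20,
    "1.18": 21,
}


def transpose_optimizer_runs(ort_version: str, model_opset: int) -> bool:
    """Return True when the opset gate would let the Transpose Optimizer run.

    Mirrors the documented check: the optimizer bails out when
    opset > kMaxSupportedOpset. (Hypothetical helper, not an ORT API.)
    """
    # Reduce "v1.14.x" or "1.14.1" to the release line "1.14".
    release_line = ".".join(ort_version.lstrip("v").split(".")[:2])
    if release_line not in K_MAX_SUPPORTED_OPSET:
        raise KeyError(f"no kMaxSupportedOpset recorded for ORT {ort_version}")
    return model_opset <= K_MAX_SUPPORTED_OPSET[release_line]


if __name__ == "__main__":
    for opset in (17, 19, 21):
        print(opset, transpose_optimizer_runs("v1.14.x", opset))
```

+<p>Under these assumptions, opset 17 passes the gate on every listed release line, which is consistent with keeping opset=17 as the default.</p>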
+
+<hr />
+<h3 id="inconclusive-do-not-report">Inconclusive / do not report<a class="headerlink" href="#inconclusive-do-not-report" title="Permanent link">&para;</a></h3>
+<p>These show elevated means but cannot be confirmed as regressions given machine variance (p90 = 2–3× p50 throughout):</p>
+<ul>
+<li><code>ln_only</code>, <code>conv_add</code>, <code>conv_act</code>: each has ≥1 extreme outlier run; other runs are baseline-level</li>
+<li><code>gelu_only</code>: consistently 49–56ms, possibly a mild regression but no outlier; 3 runs insufficient to separate from drift</li>
+<li><code>matmul_scale</code>: all 3 runs elevated (51–61ms), but concurrent baseline also drifted (+5ms); net delta ~+8ms, weak signal</li>
+</ul>
+<p>Do not write these as confirmed regressions in user-facing reports. Label as "inconclusive" or omit.</p>
+<hr />
+<h3 id="measurement-methodology-correction-winml-eval-vs-winml-perf">Measurement methodology correction (winml eval vs winml perf)<a class="headerlink" href="#measurement-methodology-correction-winml-eval-vs-winml-perf" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Tool</th>
+<th>What it measures</th>
+<th>Latency for ConvNext CPU</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><code>winml eval</code> (no warmup, includes preprocessing)</td>
+<td>Application-level: model load + HF preprocessing + inference × N</td>
+<td>~67ms/sample</td>
+</tr>
+<tr>
+<td><code>winml perf --warmup 10 --iterations 50</code></td>
+<td>Pure inference: steady-state kernel execution only</td>
+<td>~43.7ms p50</td>
+</tr>
+<tr>
+<td>Difference</td>
+<td>HF preprocessing + JIT warmup overhead</td>
+<td>~23ms</td>
+</tr>
+</tbody>
+</table>
+<p><strong>Rule for autoconfig skill:</strong> Always use <code>winml perf</code> with <code>--warmup 10 --iterations 50</code> for latency measurements in experiments. Never use <code>winml eval</code> latency to compare configs.</p>
+<hr />
+<h3 id="key-insight-for-autoconfig-skill">Key insight for autoconfig skill<a class="headerlink" href="#key-insight-for-autoconfig-skill" title="Permanent link">&para;</a></h3>
+<ul>
+<li>CPU EP on ConvNext: no extra flag tested improved latency. Baseline (no fusions beyond what ORT L2 applies unconditionally) is optimal.</li>
+<li>The only actionable finding is: <strong>do not add <code>matmul-add-fusion</code> for ConvNext on CPU</strong> (or any model where baseline already uses Gemm).</li>
+<li>QNN/DML: not yet tested. Guidance on those EPs requires separate validated experiments.</li>
+</ul>
+<hr />
+<h3 id="winml-analyze-gaps-discovered"><code>winml analyze</code> gaps discovered<a class="headerlink" href="#winml-analyze-gaps-discovered" title="Permanent link">&para;</a></h3>
+<p>These are cases where analyzing the graph <em>before</em> running autoconfig would have prevented wasted search iterations:</p>
+<p><strong>Gap 1: "Already fused" vs "fuseable" not distinguished</strong></p>
+<ul>
+<li>ConvNext has <code>LayerNormalization</code> as a native op (already fused at PyTorch export)</li>
+<li><code>layer-norm-fusion</code> targets the <em>decomposed</em> ReduceMean→Sub→... pattern</li>
+<li><code>winml analyze</code> reports <code>OP/ai.onnx/LayerNormalization</code> without indicating it's already in canonical form</li>
+<li><strong>Impact:</strong> user enables <code>layer-norm-fusion</code> thinking it will help; it does nothing (but builds take longer)</li>
+<li><strong>Fix:</strong> analyze should tag ops as <code>already_canonical</code> vs <code>fuseable_subgraph</code></li>
+</ul>
+<p><strong>Gap 2: DW-Conv not distinguished from regular Conv</strong></p>
+<ul>
+<li>ConvNext has 18 depthwise 7×7 Conv ops (group=C) and 4 regular Conv ops (group=1)</li>
+<li><code>winml analyze</code> reports all as <code>OP/ai.onnx/Conv</code> (undifferentiated)</li>
+<li>QNN EP supports DW-Conv natively (important for NPU efficiency), but EP support classification is per op type, not per <code>group</code> value</li>
+<li><strong>Impact:</strong> user cannot tell whether Conv ops are the DW or regular variant; EP support may differ</li>
+<li><strong>Fix:</strong> analyze should emit <code>OP/ai.onnx/Conv[depthwise]</code> vs <code>OP/ai.onnx/Conv[regular]</code></li>
+</ul>
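+<p>The fix proposed for Gap 2 could look roughly like the sketch below. It assumes the analyzer already knows each Conv node's <code>group</code> attribute and input channel count; the function name <code>classify_conv</code> and the third <code>[grouped]</code> label are illustrative assumptions, not existing <code>winml analyze</code> output.</p>

```python
def classify_conv(group: int, in_channels: int) -> str:
    """Label a Conv node by its group attribute (hypothetical helper).

    depthwise: group == input channels (the "group=C" case)
    regular:   group == 1
    grouped:   anything in between (label invented for this sketch)
    """
    if group < 1 or in_channels < 1:
        raise ValueError("group and in_channels must be positive")
    if group == 1:
        return "OP/ai.onnx/Conv[regular]"
    if group == in_channels:
        return "OP/ai.onnx/Conv[depthwise]"
    return "OP/ai.onnx/Conv[grouped]"


if __name__ == "__main__":
    print(classify_conv(group=96, in_channels=96))
    print(classify_conv(group=1, in_channels=96))
```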
+<p><strong>Gap 3: Transpose-sandwich pattern not detected</strong></p>
+<ul>
+<li>42 Transpose nodes in ConvNext form a clear <code>Conv→Transpose→LN→...→Transpose</code> repeating pattern</li>
+<li><code>transpose-optimizer</code> turns this into NHWC chains (good for GPU/NPU, bad for CPU)</li>
+<li><code>winml analyze</code> reports Transpose as just <code>OP/ai.onnx/Transpose</code> with no structural context</li>
+<li><strong>Impact:</strong> user cannot predict whether <code>transpose-optimizer</code> will help or hurt without running it</li>
+<li><strong>Fix:</strong> analyze should detect <code>transpose_sandwich_depth: N</code> and emit a warning for CPU EP</li>
+</ul>
+<p><strong>Gap 4: ORT L2 baseline fusions not surfaced</strong></p>
+<ul>
+<li>After ORT Level 2 optimization (which runs unconditionally), the graph already contains fused Gelu and Gemm ops</li>
+<li>The analyze command runs on the <em>pre-optimize</em> export.onnx, not the actual optimized model</li>
+<li><code>winml analyze</code> sees 36×MatMul in export.onnx but the real model at inference has 37×Gemm</li>
+<li><strong>Impact:</strong> analyze output doesn't reflect what the model actually looks like when running</li>
+<li><strong>Fix:</strong> analyze should optionally run on <code>optimized.onnx</code> (post-ORT-L2), not just <code>export.onnx</code></li>
+</ul>
+<p><strong>Gap 5: MatMul semantic not classified</strong></p>
+<ul>
+<li>36 MatMul ops are all MLP dense layers (4C→C or C→4C expansion)</li>
+<li>No attention MatMuls present (ConvNext has no self-attention)</li>
+<li>QNN handles dense-layer MatMul differently from attention-context MatMul</li>
+<li><code>winml analyze</code> reports <code>OP/ai.onnx/MatMul</code> without semantic classification</li>
+<li><strong>Fix:</strong> analyze could detect MatMul role heuristically (shapes: attention = square-ish, MLP = wide fan-out)</li>
+</ul>
+<hr />
+<h2 id="implementation-notes">Implementation notes<a class="headerlink" href="#implementation-notes" title="Permanent link">&para;</a></h2>
+<h3 id="directory-structure">Directory structure<a class="headerlink" href="#directory-structure" title="Permanent link">&para;</a></h3>
+<div class="codehilite"><pre><span></span><code>skills/
+  use-winml-cli/              ← existing, extend (user)
+    SKILL.md
+    evals/eval.yaml
+  check-model-feasibility/    ← new (user — model discovery + EP/device compatibility)
+    SKILL.md
+    evals/eval.yaml
+  debug-model/                ← new (user)
+    SKILL.md
+    evals/eval.yaml
+  autoconfig/                 ← new (user — optimize: autoresearch loop + manual framework)
+    SKILL.md
+    evals/eval.yaml
+  ship-to-winapp/             ← new (user — validation gates + multi-EP packaging; partial dep on winml package feature)
+    SKILL.md
+    evals/eval.yaml
+  adding-model-support/       ← new (contributor)
+    SKILL.md
+    evals/eval.yaml
+  adding-ep-support/          ← new (contributor)
+    SKILL.md
+    evals/eval.yaml
+  contributing-a-skill/       ← new (contributor)
+    SKILL.md
+    evals/eval.yaml
+  optimization-research/      ← new (contributor — internal deep gap analysis for winml-cli team)
+    SKILL.md
+    templates/olive_qnn.json
+    templates/olive_dml.json
+    evals/eval.yaml
+</code></pre></div>
+
+<h3 id="priority-order-for-implementation">Priority order for implementation<a class="headerlink" href="#priority-order-for-implementation" title="Permanent link">&para;</a></h3>
+<p>This is <strong>implementation sequencing</strong> (risk- and dependency-driven), which intentionally differs from
+the <strong>importance</strong> ranking in the Overview. Importance answers "which skill matters most to users";
+this answers "which is safest to build first." Example: <code>autoconfig</code> remains a high-importance user skill
+but ships <em>last</em> because it depends on the <code>--format json</code> changes and is the most complex.</p>
+<p><strong>Code changes first (unblocks agentic skill execution):</strong>
+0. <code>winml eval --format json</code> — critical: enables all accuracy-related agentic flows
+0. <code>winml analyze --format json</code> — enables EP compatibility agentic flows
+0. <code>winml perf --format json</code> — enables performance SLA agentic flows</p>
+<p><strong>User skills:</strong>
+1. <code>check-model-feasibility</code> — lowest risk, pure existing commands (<code>inspect</code>/<code>sys</code>/<code>analyze</code>); front door for new users (model discovery half needs <code>analyze --format json</code>)
+2. <code>debug-model</code> — lightweight read-only explainer that shortens time-to-diagnosis
+3. <code>ship-to-winapp</code> — validation checklist + packaging; build it once the gate commands exist (partial dep on <code>winml package</code> feature)
+4. <code>autoconfig</code> — depends on #847/#848/#849 + most complex skill to implement (manual mode can ship first as the lightweight framework)</p>
+<p><strong>Contributor skills:</strong>
+5. <code>contributing-a-skill</code> — enables community contributions to the skill ecosystem
+6. <code>adding-model-support</code> — most impactful for model coverage growth
+7. <code>adding-ep-support</code> — lower frequency, but needed for new EP onboarding
+8. <code>optimization-research</code> — internal gap-finder; depends on a working <code>autoconfig</code> baseline to compare against</p>
+<h3 id="required-code-changes-for-agentic-skill-execution">Required code changes for agentic skill execution<a class="headerlink" href="#required-code-changes-for-agentic-skill-execution" title="Permanent link">&para;</a></h3>
+<p>The three changes that turn skills from documentation into agentic programs:</p>
+<p><strong>1. <code>winml eval --format json</code></strong></p>
+<p>File: <code>src/winml/modelkit/commands/eval.py</code></p>
+<p>Add <code>--format</code> option and emit structured JSON to stdout:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;mode&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;compare&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;path/to/quantized.onnx&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;model_id&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;microsoft/resnet-50&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;metrics&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
+<span class="w">    </span><span class="nt">&quot;cosine_similarity&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.87</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;sqnr_db&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">28.3</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;psnr_db&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">31.1</span><span class="p">,</span>
+<span class="w">    </span><span class="nt">&quot;max_abs_diff&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.042</span>
+<span class="w">  </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;task_metric&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;top1_accuracy&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">0.741</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;threshold_pass&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p><strong>2. <code>winml analyze --format json</code></strong></p>
+<p>File: <code>src/winml/modelkit/commands/analyze.py</code></p>
+<p>Already supports <code>--output file.json</code>. Add <code>--format json</code> to also print to stdout
+(mirrors pattern from <code>winml inspect</code> and <code>winml sys</code>):</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;path/to/model.onnx&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;summary&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;supported&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">142</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;partial&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;unsupported&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;partial_ops&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;MultiHeadAttention&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;LayerNorm&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;Softmax&quot;</span><span class="p">],</span>
+<span class="w">  </span><span class="nt">&quot;unsupported_ops&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;CustomRotaryEmbedding&quot;</span><span class="p">]</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p><strong>3. <code>winml perf --format json</code></strong></p>
+<p>File: <code>src/winml/modelkit/commands/perf.py</code></p>
+<p>Already writes JSON to file via <code>-o</code>. Add <code>--format json</code> stdout output:</p>
+<div class="codehilite"><pre><span></span><code><span class="p">{</span>
+<span class="w">  </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;path/to/model.onnx&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;ep&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;qnn&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;device&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;npu&quot;</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;iterations&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span>
+<span class="w">  </span><span class="nt">&quot;latency_ms&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">&quot;p50&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">18.3</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;p90&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">21.7</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;p99&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">28.4</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;mean&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">18.9</span><span class="w"> </span><span class="p">},</span>
+<span class="w">  </span><span class="nt">&quot;throughput_rps&quot;</span><span class="p">:</span><span class="w"> </span><span class="mf">54.6</span>
+<span class="p">}</span>
+</code></pre></div>
+
+<p>These three changes are ~50 lines of code each, follow the existing pattern from
+<code>winml inspect --format json</code> and <code>winml sys --format json</code>, and unlock the full
+agentic execution model for all consumer skills.</p>
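+<p>As a rough illustration of why stdout JSON matters for agentic flows, the sketch below parses the proposed <code>winml eval</code> and <code>winml perf</code> payloads shown above. The field names come from those example payloads, which are themselves proposals; the helper names <code>eval_passed</code> and <code>perf_p50_ms</code> are hypothetical.</p>

```python
import json


def eval_passed(stdout_text: str) -> bool:
    """Read the proposed `threshold_pass` field from eval JSON output."""
    return bool(json.loads(stdout_text)["threshold_pass"])


def perf_p50_ms(stdout_text: str) -> float:
    """Read the proposed `latency_ms.p50` field from perf JSON output."""
    return float(json.loads(stdout_text)["latency_ms"]["p50"])


if __name__ == "__main__":
    # Trimmed copies of the example payloads from this section.
    eval_out = '{"mode": "compare", "metrics": {"cosine_similarity": 0.87}, "threshold_pass": false}'
    perf_out = '{"ep": "qnn", "iterations": 100, "latency_ms": {"p50": 18.3, "p90": 21.7}}'
    print("accuracy gate passed:", eval_passed(eval_out))
    print("p50 latency (ms):", perf_p50_ms(perf_out))
```

+<p>With structured output, a skill can branch on these values directly instead of scraping human-readable text.</p>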
+<h3 id="sizing-estimate-per-skill">Sizing estimate (per skill)<a class="headerlink" href="#sizing-estimate-per-skill" title="Permanent link">&para;</a></h3>
+<p>Each SKILL.md based on Mobius patterns (~8–14KB):</p>
+<ul>
+<li>~200 lines prose + decision tables</li>
+<li>~50 lines code examples</li>
+<li>Cross-reference section</li>
+</ul>
+<h3 id="relationship-to-existing-use-winml-cli-skill">Relationship to existing <code>use-winml-cli</code> skill<a class="headerlink" href="#relationship-to-existing-use-winml-cli-skill" title="Permanent link">&para;</a></h3>
+<p>The new skills are <strong>task-scoped</strong> (problem → solution) vs the existing skill which is
+<strong>tool-scoped</strong> (here's what each command does). They complement rather than replace each other.
+The existing skill should add cross-references to the new skills in its "Common patterns" section.</p>
+<hr />
+<h2 id="qnn-npu-catalog-sweep-findings-feature-gaps-2026-06-13">QNN NPU Catalog Sweep — Findings &amp; Feature Gaps (2026-06-13)<a class="headerlink" href="#qnn-npu-catalog-sweep-findings-feature-gaps-2026-06-13" title="Permanent link">&para;</a></h2>
+<p>Source: 8-model catalog sweep via autoconfig POC (<code>C:\tmp\autoconfig-demo\catalog_qnn_sweep.py</code>)</p>
+<h3 id="cross-model-results">Cross-model results<a class="headerlink" href="#cross-model-results" title="Permanent link">&para;</a></h3>
+<table>
+<thead>
+<tr>
+<th>Model</th>
+<th>Arch</th>
+<th>Baseline p50</th>
+<th>Best p50</th>
+<th>Gain</th>
+<th>Best config</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>microsoft/resnet-18</td>
+<td>resnet</td>
+<td>0.96ms</td>
+<td>0.96ms</td>
+<td>—</td>
+<td>baseline (opset17)</td>
+</tr>
+<tr>
+<td>google/vit-base-patch16-224</td>
+<td>vit</td>
+<td>9.04ms</td>
+<td>9.04ms</td>
+<td>—</td>
+<td>baseline (opset17)</td>
+</tr>
+<tr>
+<td>apple/mobilevit-small</td>
+<td>mobilevit</td>
+<td>12.07ms</td>
+<td><strong>8.62ms</strong></td>
+<td>+29%</td>
+<td>opset21+conv_fusions</td>
+</tr>
+<tr>
+<td>facebook/dinov2-small</td>
+<td>dinov2</td>
+<td>6.56ms</td>
+<td><strong>4.98ms</strong></td>
+<td>+24%</td>
+<td>opset21</td>
+</tr>
+<tr>
+<td>hustvl/yolos-small</td>
+<td>yolos</td>
+<td>78.69ms</td>
+<td>—</td>
+<td>timeout</td>
+<td>—</td>
+</tr>
+<tr>
+<td>distilbert SST-2</td>
+<td>distilbert</td>
+<td>19.48ms</td>
+<td>19.48ms</td>
+<td>—</td>
+<td>baseline</td>
+</tr>
+<tr>
+<td>all-MiniLM-L6-v2</td>
+<td>bert</td>
+<td>5.81ms</td>
+<td>5.81ms</td>
+<td>—</td>
+<td>baseline</td>
+</tr>
+<tr>
+<td>deepset/roberta-base-squad2</td>
+<td>roberta</td>
+<td>14.94ms</td>
+<td>14.72ms</td>
+<td>+1.5%</td>
+<td>opset21</td>
+</tr>
+</tbody>
+</table>
+<h3 id="validated-kb-findings">Validated KB findings<a class="headerlink" href="#validated-kb-findings" title="Permanent link">&para;</a></h3>
+<p><strong>npu-001 refined</strong>: opset21 benefit is architecture-gated:</p>
+<ul>
+<li>✅ Conv + residual connections: +25–31% (mobilevit, dinov2, convnext)</li>
+<li>❌ Pure transformer (ViT, YOLOS): -7% or neutral</li>
+<li>⚪ NLP BERT-family: neutral</li>
+</ul>
+<p><strong>npu-006 NEW (CRITICAL)</strong>: Conv fusions (conv-bn/add/activation) cause catastrophic QNN NPU CPU fallback</p>
+<ul>
+<li>ResNet-18 with conv fusions: 0.96ms → 132ms (about 137× slower, roughly +13,650%)</li>
+<li>MobileViT: safe (no regression)</li>
+<li>Severity: critical; can produce a 50x+ regression silently</li>
+</ul>
+<p><strong>npu-007 NEW</strong>: DVFS thermal noise makes CV gate unreliable on QNN NPU</p>
+<ul>
+<li>New bench protocol: 3 sessions × 500 iters + 30s cool-down + median p50 + &gt;10% noise floor</li>
+</ul>
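+<p>The decision half of the npu-007 protocol (median of per-session p50s, then a 10% noise floor) can be sketched as follows. The gain formula, the verdict labels, and the function names are assumptions made for illustration; only the median and the 10% threshold come from the protocol above.</p>

```python
from statistics import median

NOISE_FLOOR_PCT = 10.0  # from the protocol: changes within 10% count as noise


def gain_pct(baseline_p50s, candidate_p50s):
    """Percent latency gain of candidate vs baseline, using median session p50.

    Positive means the candidate is faster. (Assumed formula for this sketch.)
    """
    base = median(baseline_p50s)
    cand = median(candidate_p50s)
    return (base - cand) / base * 100.0


def verdict(baseline_p50s, candidate_p50s, noise_floor_pct=NOISE_FLOOR_PCT):
    """Classify a candidate config; labels are hypothetical."""
    gain = gain_pct(baseline_p50s, candidate_p50s)
    if gain > noise_floor_pct:
        return "improvement"
    if gain < -noise_floor_pct:
        return "regression"
    return "within_noise"


if __name__ == "__main__":
    # Illustrative session values, not measured data.
    print(verdict([12.0, 10.0, 11.0], [5.5, 6.0, 5.0]))
    print(verdict([14.94, 14.94, 14.94], [14.72, 14.72, 14.72]))
```

+<p>Under this rule a change of about 1.5%, like the roberta row in the cross-model table, would be treated as noise rather than a gain.</p>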
+<h3 id="feature-gaps-winml-cli-backlog-items">Feature gaps (winml-cli backlog items)<a class="headerlink" href="#feature-gaps-winml-cli-backlog-items" title="Permanent link">&para;</a></h3>
+<p><strong>Gap A: winml analyze: Conv fusion QNN safety check</strong>
+winml analyze should detect Conv-dominant topologies and warn when conv-bn/add/activation
+fusions are configured for QNN NPU target. Currently no pre-build detection of this hazard.</p>
+<ul>
+<li>Command to add: warning in analyze output when ep=qnn AND conv_fusion_pass is enabled AND model has &gt;N Conv ops</li>
+<li>Priority: HIGH (silent 50x regression risk)</li>
+</ul>
+<p><strong>Gap B: budget-aware sweep in autoconfig</strong>
+Large models (YOLOS, ~78ms/inf) cause sweep timeout with current fixed budget.
+Need: per-hypothesis time estimation → auto-skip models that exceed budget, log as "timeout" not failure.</p>
+<ul>
+<li>Affects: autoconfig POC and any future winml sweep command</li>
+</ul>
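+<p>A per-hypothesis time estimate for Gap B might be sketched like this. It assumes the npu-007 bench protocol (3 sessions × 500 iterations with a 30s cool-down) and ignores build time, which a real estimate would need to include; all names and the budget value are hypothetical.</p>

```python
def estimated_bench_seconds(baseline_p50_ms, sessions=3, iters=500, cool_down_s=30.0):
    """Rough benchmark time for one hypothesis (build time not included)."""
    per_session = iters * baseline_p50_ms / 1000.0 + cool_down_s
    return sessions * per_session


def plan_sweep(baseline_p50_ms, n_hypotheses, budget_s):
    """Return "run" if the whole sweep fits the budget, else "timeout"."""
    total = estimated_bench_seconds(baseline_p50_ms) * n_hypotheses
    return "run" if total <= budget_s else "timeout"


if __name__ == "__main__":
    # Baseline p50 values from the cross-model table; the budget is made up.
    for name, p50 in (("microsoft/resnet-18", 0.96), ("hustvl/yolos-small", 78.69)):
        print(name, plan_sweep(p50, n_hypotheses=5, budget_s=600.0))
```

+<p>Logging the skipped model as "timeout" up front keeps it distinct from a build or bench failure.</p>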
+<p><strong>Gap C: winml perf DVFS-aware session averaging</strong>
+winml perf should natively support session-level median aggregation for QNN NPU.
+Current single-session variance is dominated by DVFS thermal state, not model performance.</p>
+<ul>
+<li>Flag proposal: <code>--sessions 3 --cool-down 30 --signal median-p50</code></li>
+<li>This would make winml perf output trustworthy for optimization decisions on Snapdragon X Elite</li>
+</ul>
+</main>
+</div>
+</body>
+</html>

Model ID	apple/mobilevit-small
Task	image-classification
Arch type	mobilevit
Baseline opset	17
EP	cpu
Device	cpu
ID	Config Label	Opset	Extra Flags	Median p50	Session p50s (ms)	Gain %	Verdict	Confidence
h0	baseline (opset 17, autoconf defaults)	17	not stored	73.17 ms	[73.17 · 72.10 · 80.23]	+0.0%	BASELINE	ranges overlap
h1	opset 17 explicit	17	not stored	87.48 ms	[87.48 · 89.86 · 57.04]	-19.6%	DISCARD	ranges overlap
h2	opset 19 (cpu-001 risk — transformer test)	19	not stored	79.83 ms	[74.50 · 86.26 · 79.83]	-9.1%	DISCARD	ranges overlap
h3	opset 21 (cpu-001 risk — transformer test)	21	not stored	78.59 ms	[67.43 · 84.27 · 78.59]	-7.4%	DISCARD	ranges overlap
h4	opset 17 + attention_fusion	17	attention_fusion	77.77 ms	[83.44 · 70.19 · 77.77]	-6.3%	DISCARD	ranges overlap
h5	opset 17 + skip_layer_norm_fusion	17	skip_layer_norm_fusion	80.51 ms	[80.51 · 60.10 · 785.99]	-10.0%	DISCARD	ranges overlap
h6	opset 17 + layer_norm_fusion	17	layer_norm_fusion	803.22 ms	[817.62 · 803.22 · 184.94]	-997.8%	DISCARD	ranges separated
h7 ★	opset 17 + bias_softmax_fusion	17	bias_softmax_fusion	64.14 ms	[60.64 · 64.14 · 119.52 · 239.25 · 279.32]	-63.4%	MARGINAL_UNCONFIRMED	2/5 sessions confirm
h8	opset 17 + matmul_add_fusion (cpu-002 guarded)	17	not stored	—	—	—	SKIPPED_CPU002	guarded skip
h9	opset 17 + matmul_transpose_fusion	17	matmul_transpose_fusion	194.19 ms	[194.19 · 175.97 · 203.41]	-165.4%	DISCARD	ranges separated
h10	opset 17 + attention + skip_layer_norm + layer_norm	17	attention_fusion, skip_layer_norm_fusion, layer_norm_fusion	193.64 ms	[261.24 · 189.74 · 193.64]	-164.7%	DISCARD	ranges separated
h11	opset 17 + nchwc_transformer (Conv-heavy models)	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
h12	opset 17 + transpose_optimizer	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
h13	opset 17 + gelu_fusion explicit	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
Model ID	facebook/dinov2-small
Task	image-feature-extraction
Arch type	dinov2
Baseline opset	17
EP	cpu
Device	cpu
ID	Config Label	Opset	Extra Flags	Median p50	Session p50s (ms)	Gain %	Verdict	Confidence
h0	baseline (opset 17, autoconf defaults)	17	not stored	112.60 ms	[142.03 · 105.56 · 112.60]	+0.0%	BASELINE	ranges overlap
h1	opset 17 explicit	17	not stored	762.81 ms	[150.63 · 1123.34 · 762.81]	-577.5%	DISCARD	ranges separated
h2	opset 19 (cpu-001 risk — transformer test)	19	not stored	1106.11 ms	[1106.11 · 1104.49 · 1164.20]	-882.4%	CPU001_REGRESSION	ranges separated
h3	opset 21 (cpu-001 risk — transformer test)	21	not stored	1095.19 ms	[1057.56 · 1095.19 · 1128.22]	-872.6%	CPU001_REGRESSION	ranges separated
h4	opset 17 + attention_fusion	17	attention_fusion	1083.83 ms	[1086.54 · 1068.75 · 1083.83]	-862.6%	DISCARD	ranges separated
h5	opset 17 + skip_layer_norm_fusion	17	skip_layer_norm_fusion	1103.07 ms	[1119.95 · 1103.07 · 161.83]	-879.6%	DISCARD	ranges separated
h6	opset 17 + layer_norm_fusion	17	layer_norm_fusion	148.70 ms	[142.60 · 155.01 · 148.70]	-32.1%	DISCARD	ranges separated
h7	opset 17 + bias_softmax_fusion	17	bias_softmax_fusion	1121.98 ms	[899.91 · 1145.34 · 1121.98]	-896.4%	DISCARD	ranges separated
h8	opset 17 + matmul_add_fusion (cpu-002 guarded)	17	not stored	—	—	—	SKIPPED_CPU002	guarded skip
h9	opset 17 + matmul_transpose_fusion	17	matmul_transpose_fusion	186.48 ms	[161.47 · 186.48 · 334.34]	-65.6%	DISCARD	ranges separated
h10	opset 17 + attention + skip_layer_norm + layer_norm	17	attention_fusion, skip_layer_norm_fusion, layer_norm_fusion	136.57 ms	[121.38 · 167.90 · 136.57]	-21.3%	DISCARD	ranges overlap
h11	opset 17 + nchwc_transformer (Conv-heavy models)	17	nchwc_transformer	157.51 ms	[157.51 · 192.39 · 157.25]	-39.9%	DISCARD	ranges separated
h12	opset 17 + transpose_optimizer	17	transpose_optimizer	154.59 ms	[175.29 · 143.11 · 154.59]	-37.3%	DISCARD	ranges separated
h13	opset 17 + gelu_fusion explicit	17	gelu_fusion	154.10 ms	[146.72 · 163.78 · 154.10]	-36.9%	DISCARD	ranges separated
Model ID	microsoft/rad-dino
Task	image-feature-extraction
Arch type	dinov2
Baseline opset	—
EP	cpu
Device	cpu
Model ID	microsoft/resnet-18
Task	image-classification
Arch type	resnet
Baseline opset	17
EP	cpu
Device	cpu
ID	Config Label	Opset	Extra Flags	Median p50	Session p50s (ms)	Gain %	Verdict	Confidence
h0	baseline (opset 17, autoconf defaults)	17	not stored	237.47 ms	[237.47 · 230.21 · 238.44]	+0.0%	BASELINE	ranges overlap
h1	opset 17 explicit	17	not stored	244.96 ms	[221.85 · 252.22 · 244.96]	-3.1%	DISCARD	ranges overlap
h2	opset 19 (cpu-001 risk — transformer test)	19	not stored	231.69 ms	[209.29 · 231.69 · 238.07]	+2.4%	MARGINAL	ranges overlap
h3	opset 21 (cpu-001 risk — transformer test)	21	not stored	226.69 ms	[218.89 · 226.69 · 230.42]	+4.5%	MARGINAL	ranges overlap
h4	opset 17 + attention_fusion	17	attention_fusion	231.07 ms	[209.43 · 231.07 · 235.40]	+2.7%	MARGINAL	ranges overlap
h5	opset 17 + skip_layer_norm_fusion	17	skip_layer_norm_fusion	226.59 ms	[207.97 · 226.59 · 227.74]	+4.6%	MARGINAL	ranges separated
h6	opset 17 + layer_norm_fusion	17	layer_norm_fusion	212.70 ms	[200.31 · 212.70 · 215.09 · 40.37 · 24.96]	+15.7%	KEEP_CONFIRMED	5/5 sessions confirm
h7	opset 17 + bias_softmax_fusion	17	bias_softmax_fusion	227.78 ms	[222.57 · 245.29 · 227.78]	+4.1%	MARGINAL	ranges overlap
h8	opset 17 + matmul_add_fusion (cpu-002 guarded)	17	not stored	—	—	—	SKIPPED_CPU002	guarded skip
h9 ★	opset 17 + matmul_transpose_fusion	17	matmul_transpose_fusion	17.80 ms	[24.22 · 11.52 · 17.80 · 186.46 · 197.36]	+89.8%	KEEP_CONFIRMED	5/5 sessions confirm
h10	opset 17 + attention + skip_layer_norm + layer_norm	17	attention_fusion, skip_layer_norm_fusion, layer_norm_fusion	20.09 ms	[20.09 · 14.91 · 43.27 · 18.86 · 39.40]	+91.5%	KEEP_CONFIRMED	5/5 sessions confirm
h11	opset 17 + nchwc_transformer (Conv-heavy models)	17	nchwc_transformer	40.87 ms	[30.78 · 40.87 · 230.91 · 36.88 · 27.17]	+84.5%	MARGINAL_UNCONFIRMED	4/5 sessions confirm
h12	opset 17 + transpose_optimizer	17	transpose_optimizer	36.91 ms	[36.91 · 21.65 · 40.88 · 26.59 · 38.94]	+84.5%	KEEP_CONFIRMED	5/5 sessions confirm
h13	opset 17 + gelu_fusion explicit	17	gelu_fusion	26.39 ms	[219.22 · 26.39 · 20.94 · 18.75 · 215.34]	+88.9%	KEEP_CONFIRMED	5/5 sessions confirm
Model ID	sentence-transformers/all-MiniLM-L6-v2
Task	sentence-similarity
Arch type	bert
Baseline opset	—
EP	cpu
Device	cpu
Model ID	BAAI/bge-small-en-v1.5
Task	sentence-similarity
Arch type	bert
Baseline opset	17
EP	qnn
Device	gpu
ID	Config Label	Opset	Extra Flags	Median p50	Session p50s (ms)	Gain %	Verdict	Confidence
h0	baseline FP32 (no quant, no compile)	17	not stored	52.63 ms	[57.21 · 52.63 · 51.96]	+0.0%	BASELINE	ranges overlap
h1	opset 17 explicit	17	not stored	53.34 ms	[53.69 · 52.32 · 53.34]	-1.4%	DISCARD	ranges overlap
h2	opset 19	19	not stored	53.36 ms	[52.84 · 53.36 · 53.40]	-1.4%	DISCARD	ranges overlap
h3	opset 21 (tests gpu-006)	21	not stored	52.54 ms	[52.19 · 53.58 · 52.54]	+0.2%	MARGINAL	ranges overlap
h4	opset 17 + matmul_transpose_fusion	17	matmul_transpose_fusion	52.81 ms	[52.81 · 53.63 · 52.17]	-0.3%	DISCARD	ranges overlap
h5	opset 17 + attention_fusion	17	attention_fusion	52.57 ms	[52.99 · 52.27 · 52.57]	+0.1%	MARGINAL	ranges overlap
h6	opset 17 + bias_softmax_fusion	17	bias_softmax_fusion	52.70 ms	[51.83 · 53.41 · 52.70]	-0.1%	DISCARD	ranges overlap
h7	opset 17 + layer_norm_fusion	17	layer_norm_fusion	52.62 ms	[54.30 · 52.52 · 52.62]	+0.0%	MARGINAL	ranges overlap
h8	opset 17 + skip_layer_norm_fusion	—	not stored	—	—	—	BENCH_FAIL	bench failed
h9	opset 21 + matmul_transpose + attention_fusion	21	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
h10	opset 17 + ln + skip_ln + matmul_transpose	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
h11	opset 17 + gelu_fusion explicit	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed
h12	opset 17 + transpose_optimizer	17	not stored	BUILD_FAIL	BUILD_FAIL	—	BUILD_FAIL	build failed