diff --git a/pyproject.toml b/pyproject.toml
index 20aada0df..597f1de0a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -298,6 +298,8 @@ lint.per-file-ignores."tests/**" = [ "ANN", "D", "PLR2004", "PT", "S101", "T20"
 lint.per-file-ignores."tests/**/generate_patterns.py" = [ "PERF401" ]
 # Generated opset code: Allow long lines
 lint.per-file-ignores."src/winml/modelkit/analyze/onnx_opset/**" = [ "D", "E501", "N802", "N803", "N806", "TC001", "TC002", "TC003" ]
+# Research scripts: POC code, not production — exempt from all style/type/security rules
+lint.per-file-ignores."research/**" = [ "ANN", "D", "E", "N", "S", "T20", "UP", "W", "B", "C4", "FA", "I", "PERF", "PIE", "PT", "PTH", "RET", "RSE", "RUF", "SIM", "TCH", "TID", "TRY", "G", "ICN", "E402", "E501", "F401", "F403", "F811" ]
 # === Import Conventions ===
 lint.flake8-bandit.check-typed-exception = true
 lint.flake8-bandit.hardcoded-tmp-directory = [ "/tmp", "/var/tmp", "C:\\Temp" ]
diff --git a/research/autoconfig/README.md b/research/autoconfig/README.md
new file mode 100644
index 000000000..2769b7fd1
--- /dev/null
+++ b/research/autoconfig/README.md
@@ -0,0 +1,350 @@
+# autoconfig — Automated Config Search POC
+
+**Status: Research POC — not production code.**
+
+This directory contains an experimental automated search system that finds the optimal
+`winml-cli` build configuration (execution provider, opset version, graph optimizations)
+for a given model on Windows hardware — without requiring the user to understand the
+underlying ORT/EP optimizer mechanics.
+
+---
+
+## What This Is
+
+`autoconfig.py` implements an Explorer/Optimizer/Reviewer loop as three explicit
+classes wired by a thin orchestrator (`main()`):
+
+1. **`Explorer`** — selects the next hypothesis from the **full OFAT search grid**
+   the orchestrator enumerates (from an FP32 baseline, one factor varied at a time —
+   opset 17–21, quant precision fp32/fp16/int8/int16/w8a16, or one single graph
+   pass; ~74 combinations via `build_search_space()`): it builds the `priority_queue`
+   and prunes refuted/no-op configs via KB hard-blocks + the Insight Engine
+   `skip_set`. Pruning uses the **baseline graph analysis** — a graph pass whose
+   pattern is absent (e.g. no Conv→BN subgraph) is cut, while passes whose pattern
+   is present are boosted to the front. Owns search *order* only — the grid itself
+   is generated up front, zero-experience.
+2. **`Optimizer`** — runs `winml build` + `winml perf` (two-phase: 200-iter CV screen → 3×500-iter full bench) +
+   `winml eval` accuracy. Produces raw measurements only. A graph pass that
+   builds to a graph identical to the baseline (`graph_is_noop`) is discarded
+   before benchmarking — it matched nothing.
+3. **`Reviewer`** — applies the `ThroughputOnly` verdict (`threshold = max(1%, 2×CV)`),
+   decides keep/discard, and drafts KB entries.
+
+The loop terminates after 30 consecutive discards (plateau detection) or when the time budget is exhausted.
+
+The same four-role architecture is also captured as composable **skill definitions**
+under `skills/` — an `autoconfig-orchestrator` (the brain) that delegates to three
+sub-skills `autoconfig-explorer`, `autoconfig-optimizer`, and `autoconfig-reviewer`.
+Each `SKILL.md` mirrors the corresponding class and the diagram phase.
+
+`catalog_sweep.py` is a single, JSON-driven multi-model sweep. It reads the hypothesis
+matrix, model catalog, and per-EP bench protocol from `ep_device_knowledge/_.json`
+and runs them for any `--ep/--device` combination (qnn/npu, qnn/gpu, dml/gpu, cpu/cpu),
+collecting structured results in `catalog--sweep//results.json`.
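
The Reviewer threshold and the plateau stop described above can be sketched as follows. This is a minimal illustration, not the code in `autoconfig.py`: it assumes gain and CV are expressed as fractions (0.01 = 1%), that a config is kept only when its gain strictly exceeds the threshold, and the function names are hypothetical.

```python
def noise_threshold(cv: float) -> float:
    """Assumed reading of `threshold = max(1%, 2×CV)`, with CV as a fraction."""
    return max(0.01, 2.0 * cv)


def review(gain: float, cv: float) -> str:
    """Keep a config only when its gain clears the noise threshold (assumption: strict >)."""
    return "keep" if gain > noise_threshold(cv) else "discard"


def experiments_until_plateau(verdicts, plateau: int = 30) -> int:
    """Count experiments run before `plateau` consecutive discards stop the loop."""
    consecutive = 0
    count = 0
    for verdict in verdicts:
        count += 1
        # A single keep resets the streak of discards.
        consecutive = consecutive + 1 if verdict == "discard" else 0
        if consecutive >= plateau:
            break
    return count
```

With a CV of 1% the threshold is 2%, so a 1.5% gain is discarded while a 5% gain is kept; the time-budget exit is left out of the sketch.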
+
+`analyze_graph.py` is an ONNX graph analysis helper that identifies architectural
+patterns relevant to EP optimization (Transpose sandwiches, residual branches, GELU
+variants, depthwise Conv) and surfaces gaps in `winml analyze` output.
+
+`gen_report_v3.py` generates an HTML sweep report from `results.json` files.
+
+`autoconfig_diagram.html` is an interactive architecture diagram of the Explorer/Optimizer/
+Reviewer loop.
+
+---
+
+## Key Findings — 8-Model QNN NPU Catalog Sweep (2026-06-13)
+
+### npu-001: opset 21 NHWC bypass is real — but architecture-specific
+
+Opset ≥ 21 bypasses ORT's NHWC layout transformer for QNN EP, giving a large speedup
+on **Conv + residual** models but no benefit (or slight regression) on pure transformers:
+
+| Architecture | Models | opset 21 vs opset 17 |
+|---|---|---|
+| Conv + residual | MobileViT-small, DINOv2-small | **+26–31% speedup** |
+| Pure transformer | ViT-base, YOLOS-small | neutral / slight regression |
+| BERT-family NLP | DistilBERT, MiniLM, RoBERTa | neutral (within DVFS noise) |
+| Plain Conv (ResNet) | ResNet-18 | ~+20% (h1→h3), but DVFS-dominated |
+
+Root cause: ORT's `IsSupportedOpset()` gate in `layout_transformation.cc` causes the
+NHWC layout transform to insert Transpose nodes around Conv ops. For Conv+residual
+models these Transposes cannot be cancelled, so bypassing the transform (opset 21) gives
+a cleaner HTP graph. Pure attention models have no Conv→NHWC transposes, so the bypass
+has no effect.
+
+### npu-006: Conv fusions cause ~4800% regression on QNN NPU for Conv-dominant models
+
+`conv_bn_fusion`, `conv_add_fusion`, `conv_activation_fusion` produce fused op nodes
+that QNN EP cannot execute natively — falling back to CPU for every fused Conv:
+
+| Model | h4 (conv fusions) vs h1 (baseline) |
+|---|---|
+| ResNet-18 | **132.3 ms vs 2.72 ms (+4764% regression)** |
+| MobileViT-small | 11.36 ms vs 11.72 ms (neutral) |
+| DistilBERT | 19.59 ms vs 19.5 ms (neutral — no Conv to fuse) |
+
+This is a critical correctness/performance hazard. `winml` should detect when the target
+EP would CPU-fallback fused Conv ops and suppress incompatible fusions automatically
+(see [Feature Gaps](#feature-gaps-identified)).
+
+### npu-007: DVFS thermal noise requires session-level averaging for reliable results
+
+QNN NPU exhibits extreme DVFS thermal throttling. CV is consistently 0.10–2.0+ across
+all models. Practical implications:
+
+- The CV < 15% Phase-A gate must be **disabled** for QNN NPU (blocks all models)
+- Differences < 10% between configs are **unreliable** without ≥ 1500 total iterations
+- Recommended protocol: **3 × 500-iter sessions** with 30 s cool-down; report median of
+  session p50 values
+- 30 s cool-down reduces but does not eliminate DVFS spikes
+
+---
+
+## How to Run
+
+### Prerequisites
+
+- `winml` CLI installed and on PATH
+- Python 3.11+ with `onnx` package (`pip install onnx`)
+- For QNN experiments: Snapdragon X Elite device with QNN SDK (Hexagon HTP driver)
+
+### autoconfig.py — single-model adaptive search
+
+Configured at the top of the file (edit `MODEL_ID`, `TASK`, `EP`, `DEVICE`, `WORK_DIR`):
+
+```bash
+# Default: facebook/convnext-tiny-224 on CPU
+python skills/orchestrator/autoconfig.py
+```
+
+Results are written to `WORK_DIR/results.tsv` and per-hypothesis subdirectories.
+The script reads `ep_device_knowledge/_.json` to prune already-refuted configurations.
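
The figures in the npu-006 table and the session protocol recommended under npu-007 reduce to simple arithmetic, sketched below. The sign convention (positive gain means the candidate is faster than the baseline) is an assumption, the helper names are hypothetical, and the three session values in the usage lines are made up for illustration; only the ResNet-18 latencies come from the table above.

```python
from statistics import median


def session_median_p50(session_p50s_ms):
    """Median of per-session p50 latencies in ms (e.g. 3 sessions of 500 iterations)."""
    return median(session_p50s_ms)


def gain_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Relative gain vs baseline in percent; negative values are regressions."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# ResNet-18 latencies quoted in the npu-006 table: 2.72 ms baseline, 132.3 ms with conv fusions.
resnet_regression = gain_pct(2.72, 132.3)

# Hypothetical session p50s (ms): one throttled session does not move the median.
typical_p50 = session_median_p50([11.4, 14.8, 11.9])
```

Taking the median across sessions means a single DVFS spike changes the reported value only if it affects at least two of the three sessions.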
+
+### catalog_sweep.py — JSON-driven multi-model sweep
+
+One driver covers every EP/device. The hypothesis matrix, model catalog, and bench
+protocol (screen/full iterations, thermal handling, effect-size gate, paired A/B,
+accuracy eval) all come from `ep_device_knowledge/_.json`:
+
+```bash
+# Full QNN NPU catalog sweep (all models, ~6-8 hours on X Elite)
+python tools/catalog_sweep.py --ep qnn --device npu
+
+# CPU EP sweep, single model
+python tools/catalog_sweep.py --ep cpu --device cpu --model microsoft/resnet-18
+
+# QNN GPU sweep
+python tools/catalog_sweep.py --ep qnn --device gpu
+
+# Show the models/hypotheses configured for an EP/device
+python tools/catalog_sweep.py --ep qnn --device npu --list
+```
+
+Results land in `catalog--sweep//` — `results.json`, an HTML
+report, and `champion__.json` — the recommended build config itself: a copy
+of the optimal hypothesis' `winml_build_config.json`, so it can be fed straight back to
+`winml build -c`. A `SUMMARY.md` is regenerated at the end of each sweep.
+
+### analyze_graph.py — ONNX graph analysis
+
+```bash
+# Edit the onnx path at the top of the file, then:
+python skills/explorer/analyze_graph.py
+```
+
+Prints Transpose patterns, residual branch structure, GELU variants, and op domain
+breakdown to stdout.
+
+---
+
+## ep_device_knowledge/ — Empirical Knowledge Base
+
+Each JSON file stores empirical findings **and** the sweep configuration for one
+EP/device combination, named `_.json`:
+
+| File | EP/device |
+|---|---|
+| `cpu_cpu.json` | CPU EP (Snapdragon X Elite Oryon) |
+| `dml_gpu.json` | DirectML EP (GPU) |
+| `qnn_gpu.json` | QNN Adreno GPU |
+| `qnn_npu.json` | QNN HTP (Hexagon NPU) — most findings here |
+
+### Schema overview
+
+Each file has a `findings` array. Each finding has:
+
+```json
+{
+  "id": "npu-001",
+  "title": "...",
+  "mechanism_confirmed": true,
+  "architecture_requirement": ["has_conv_ops", "has_residual_connections"],
+  "status": "confirmed",
+  "confidence": "high"
+}
+```
+
+Each file also carries the data-driven sweep contract consumed by `catalog_sweep.py`:
+`sweep_config` (bench protocol), `hypotheses` (the h0–hN matrix with opset/optim/guards),
+`models` (the catalog), and `cross_checks` (npu-001 opset-bypass, npu-006 catastrophic
+regression, cpu-001 regression probe).
+
+Finally, each file has a `search_space_rules` object that `autoconfig.py` reads to prune configurations
+(only findings with `"mechanism_confirmed": true` are applied as pruning rules).
+
+### Adding a new finding
+
+1. Run the experiment and collect bench data
+2. Add an entry to the appropriate `ep_device_knowledge/_.json` under `findings`
+3. Set `"mechanism_confirmed": false` and `"confidence": "draft"` until the mechanism
+   is understood from ORT/EP source code
+4. If the finding prunes a search dimension, add a rule under `search_space_rules`
+5. Set `"mechanism_confirmed": true` only after source code investigation confirms
+   the root cause — do NOT promote to confirmed based on benchmark numbers alone
+6. See `ep_device_knowledge/README.md` for the epistemics guidelines
+
+---
+
+## Self-Evolution Tooling
+
+Implements the loop from [`docs/self-evolution-design.html`](docs/self-evolution-design.html) —
+how sweeps stabilize their own conclusions and promote findings to draft candidates without a human in the loop.
+
+### skills/optimizer/bench_utils.py — paired A/B + adaptive sampling
+
+Shared bench primitives used across sweeps:
+
+- **`paired_ab_bench(run_session, baseline, hyp, n_pairs)`** (Fix #1) — interleaves the
+  baseline and hypothesis perf sessions in one thermal window so DVFS/thermal drift appears
+  in both legs and **cancels** in the within-pair ratio. Returns mean gain, 95% CI, and a
+  verdict (`KEEP_CONFIRMED` / `MARGINAL` / `DISCARD`). This is the unbiased fix for the
+  npu-001/MobileViT failure, where a cold baseline vs warm hypothesis manufactured a fake win.
+- **`adaptive_paired_ab_bench(...)`** (Fix #2) — keeps adding pairs until the 95% CI is
+  decisive (clears the KEEP or DISCARD band) or `MAX_PAIRS` is reached. Stable models finish
+  in `MIN_PAIRS=3`; noisy ones automatically get more samples.
+- **`thermal_classify(ref_p50, cold_ref_p50)`** (Fix #5) — classifies device thermal state
+  (`COOL`/`WARM`/`HOT_RUN`) from a reference-model latency, for excluding throttled runs.
+- **`session_cv(p50s)`** — between-session coefficient of variation (the effect-size noise floor).
+
+The QNN sweep opts into paired A/B with `--paired-ab` (default off; the validated default is
+the sequential Phase B):
+
+```bash
+python tools/catalog_sweep.py --ep qnn --device npu --model apple/mobilevit-small --task image-classification --paired-ab
+```
+
+### skills/reviewer/promote_findings.py — confidence-gated KB promotion (L1 → L4)
+
+Post-processing script (Fix #4) that reads every `catalog-*-sweep/*/results.json` and applies
+the confidence ladder, writing a **draft** to `ep_device_knowledge/_auto_promoted.json` (it never
+clobbers the curated `_.json` files):
+
+| Level | Gate |
+|---|---|
+| **L1** Observed | median gain ≥ 5% on one model, one run |
+| **L2** Confirmed | hypothesis p50 range strictly below baseline range **and** gain ≥ 2×(session CV) — the same effect-size gate the sweep uses |
+| **L3** Generalized | same `(ep, flags)` reaches L2 on ≥2 distinct models of one architecture class (`model_type`) |
+| **L4** Cross-cutting | same `(ep, flags)` reaches L2 across ≥3 architecture classes |
+
+```bash
+python skills/reviewer/promote_findings.py # writes ep_device_knowledge/_auto_promoted.json
+```
+
+A human applies the promotion checklist in [`ep_device_knowledge/README.md`](ep_device_knowledge/README.md)
+(paired A/B, clean baseline, effect-size > noise floor, independent reruns, baseline-drift
+check) before merging any auto-promoted candidate into the curated KB.
+
+### skills/explorer/analyze_insight.py — architecture-based pruning (Fix #3)
+
+`build_insight()` fuses graph fingerprint + `winml analyze` + KB rules into a `skip_set`
+(hypotheses to prune) and `priority_boosts` (reordering), cutting the 14-hypothesis matrix
+to the few that matter per architecture.
+
+---
+
+## Feature Gaps Identified
+
+Four actionable gaps in `winml-cli` surfaced by this research:
+
+1. **FusedConv detection in `winml analyze`** — `analyze` should detect Conv ops that
+   would CPU-fallback on QNN NPU after fusion (npu-006), and either warn or suppress
+   incompatible fusions in the generated build config.
+
+2. **DVFS-aware perf** — `winml perf` should support `--thermal-stabilization` mode
+   that waits for device temperature to stabilize before measurements, and should report
+   confidence intervals rather than a single p50.
+
+3. **Budget-aware sweep** — `tools/catalog_sweep.py` exhausts the 20-min budget on models with
+   a baseline > 50 ms after just 2 hypotheses (YOLOS: 78 ms × 3×500 iters = 207 s/hypothesis).
+   A `--quick` flag that reduces to 1×200-iter for large models is needed.
+
+4. **Benefit-gated fusion in `winml analyze`** — the analyzer currently auto-applies a fusion
+   whenever the graph pattern matches, but a fusion *firing* (op count drops / graph topology
+   changes after the flag) does **not** imply a perf win. Many fusions fire cleanly yet land
+   within measurement noise (e.g. BERT/ConvNeXt on QNN NPU — graph changes, p50 unchanged, see
+   npu-011). The analyzer should: (a) confirm a fusion actually fired by diffing pre/post-optimize
+   op counts and graph topology (not just pattern-match the input graph), and (b) gate retention
+   of that fusion on a measured perf delta beyond the noise band — applied-but-not-beneficial
+   fusions should be dropped (or flagged) rather than kept, since they add build cost and EP risk
+   for no return. This research records such cases so they can train that benefit gate.
+
+---
+
+## Directory Layout
+
+```
+research/autoconfig/
+├── README.md  ← this file
+│
+├── skills/  ← the agent loop, one folder per role (each has SKILL.md + its scripts)
+│   ├── orchestrator/  ← the brain: Phase 0–3 lifecycle
+│   │   ├── SKILL.md
+│   │   └── autoconfig.py  ← adaptive single-model search loop (Explorer/Optimizer/Reviewer classes)
+│   ├── explorer/  ← "what to try next": priority_queue + skip_set
+│   │   ├── SKILL.md
+│   │   ├── analyze_insight.py  ← graph + analyze + KB → skip_set / priority_boosts
+│   │   └── analyze_graph.py  ← ONNX graph pattern analysis helper
+│   ├── optimizer/  ← "run it": build → screen → full bench → eval
+│   │   ├── SKILL.md
+│   │   └── bench_utils.py  ← shared bench primitives (paired A/B, adaptive, thermal, verdict)
+│   └── reviewer/  ← "judge it": ThroughputOnly verdict + KB draft
+│       ├── SKILL.md
+│       └── promote_findings.py  ← L1→L4 confidence-gated KB promotion (draft sink)
+│
+├── lib/  ← shared, role-agnostic helpers
+│   ├── report_gen.py  ← HTML/markdown report rendering
+│   └── gen_model_report.py  ← per-model report builder used by the sweeps
+│
+├── tools/  ← batch drivers and one-off utilities
+│   ├── catalog_sweep.py  ← JSON-driven multi-model sweep (--ep/--device, --paired-ab)
+│   ├── validation_sweep.py  ← re-runs to validate KB findings
+│   └── gen_report_v3.py  ← legacy HTML report generator
+│
+├── docs/  ← design docs (self-evolution, agent, skills, cross-device)
+│   └── autoconfig_diagram.html  ← Explorer/Optimizer/Reviewer architecture diagram
+│
+├── ep_device_knowledge/
+│   ├── README.md  ← epistemics guidelines + promotion checklist
+│   ├── _auto_promoted.json  ← promote_findings.py output (auto-generated draft)
+│   ├── cpu_cpu.json  ← CPU EP findings + sweep config (ConvNext, 6 findings)
+│   ├── dml_gpu.json  ← DirectML EP findings + sweep config
+│   ├── qnn_gpu.json  ← QNN Adreno GPU findings + sweep config
+│   └── qnn_npu.json  ← QNN HTP NPU findings + sweep config (npu-001 … npu-007)
+│
+├── catalog-qnn-sweep/  ← QNN NPU sweep results (also catalog-cpu-sweep/, catalog-gpu-sweep/)
+│   ├── SUMMARY.md  ← 8-model sweep results and cross-model analysis
+│   ├── apple--mobilevit-small/  ← per-model tuning products live together:
+│   │   ├── results.json  ← benchmark results + verdicts
+│   │   ├── report.html  ← per-model HTML report
+│   │   └── champion_qnn_npu.json  ← recommended build config (raw winml_build_config.json)
+│   ├── facebook--dinov2-small/
+│   ├── microsoft--resnet-18/
+│   ├── google--vit-base-patch16-224/
+│   ├── deepset--roberta-base-squad2/
+│   ├── distilbert--distilbert-base-uncased-finetuned-sst-2-english/
+│   ├── sentence-transformers--all-MiniLM-L6-v2/
+│   └── hustvl--yolos-small/
+│
+└── catalog-cpu-sweep/, catalog-gpu-sweep/  ← analogous per-model results for CPU / QNN GPU
+```
diff --git a/research/autoconfig/catalog-cpu-sweep/.gitignore b/research/autoconfig/catalog-cpu-sweep/.gitignore
new file mode 100644
index 000000000..b3b91d38b
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/.gitignore
@@ -0,0 +1,10 @@
+# Hypothesis build artifacts (large binary files)
+h*/
+_tmp_config/
+# Raw perf session files
+full_perf_s*.json
+screen_perf.json
+confirm_s*.json
+# Model weight files
+*.data
+*.onnx
diff --git a/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html
new file mode 100644
index 000000000..96988c64b
--- /dev/null
+++ b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/report.html
@@ -0,0 +1,598 @@
+
+
+
+
+
+CPU CPU Optimization Report — apple/mobilevit-small
+
+
+
+

CPU CPU Optimization Report — apple/mobilevit-small

+
mobilevit arch · 2026-06-18 · 14 hypotheses tested
+ +
+Best Gain %: +12.3% · Champion: h7
+Baseline → Champion ms: 73.17 ms → 64.14 ms · Latency reduction: 9.03 ms
+EP + Device: CPU / CPU · Baseline opset 17
+Champion Config: h7 · opset 17 + bias_softmax_fusion
+Total experiments: 14 · 0 KEEP / 12 DISCARD
+ + +
+
Model Characteristics
+Model ID: apple/mobilevit-small
+Task: image-classification
+Arch type: mobilevit
+Baseline opset: 17
+EP: cpu
+Device: cpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) per hypothesis, h0–h13, axis from -200% to 200%; the per-hypothesis p50, gain and verdict values are listed in the All Hypotheses table below]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
+h0 | baseline (opset 17, autoconf defaults) | 17 | not stored | 73.17 ms | [73.17 · 72.10 · 80.23] | +0.0% | BASELINE | ranges overlap
+h1 | opset 17 explicit | 17 | not stored | 87.48 ms | [87.48 · 89.86 · 57.04] | -19.6% | DISCARD | ranges overlap
+h2 | opset 19 (cpu-001 risk — transformer test) | 19 | not stored | 79.83 ms | [74.50 · 86.26 · 79.83] | -9.1% | DISCARD | ranges overlap
+h3 | opset 21 (cpu-001 risk — transformer test) | 21 | not stored | 78.59 ms | [67.43 · 84.27 · 78.59] | -7.4% | DISCARD | ranges overlap
+h4 | opset 17 + attention_fusion | 17 | attention_fusion | 77.77 ms | [83.44 · 70.19 · 77.77] | -6.3% | DISCARD | ranges overlap
+h5 | opset 17 + skip_layer_norm_fusion | 17 | skip_layer_norm_fusion | 80.51 ms | [80.51 · 60.10 · 785.99] | -10.0% | DISCARD | ranges overlap
+h6 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 803.22 ms | [817.62 · 803.22 · 184.94] | -997.8% | DISCARD | ranges separated
+h7 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 64.14 ms | [60.64 · 64.14 · 119.52 · 239.25 · 279.32] | -63.4% | MARGINAL_UNCONFIRMED | 2/5 sessions confirm
+h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | 17 | not stored | SKIPPED_CPU002 | guarded skip
+h9 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 194.19 ms | [194.19 · 175.97 · 203.41] | -165.4% | DISCARD | ranges separated
+h10 | opset 17 + attention + skip_layer_norm + layer_norm | 17 | attention_fusion, skip_layer_norm_fusion, layer_norm_fusion | 193.64 ms | [261.24 · 189.74 · 193.64] | -164.7% | DISCARD | ranges separated
+h11 | opset 17 + nchwc_transformer (Conv-heavy models) | 17 | not stored | BUILD_FAIL | BUILD_FAIL | BUILD_FAIL | build failed
+h12 | opset 17 + transpose_optimizer | 17 | not stored | BUILD_FAIL | BUILD_FAIL | BUILD_FAIL | build failed
+h13 | opset 17 + gelu_fusion explicit | 17 | not stored | BUILD_FAIL | BUILD_FAIL | BUILD_FAIL | build failed
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + +
+
❌ Ineffective or Harmful
+Hypothesis | Label | Gain % | Verdict | Confidence
+h1 | opset 17 explicit | -19.6% | DISCARD | ranges overlap
+h2 | opset 19 (cpu-001 risk — transformer test) | -9.1% | DISCARD | ranges overlap
+h3 | opset 21 (cpu-001 risk — transformer test) | -7.4% | DISCARD | ranges overlap
+h4 | opset 17 + attention_fusion | -6.3% | DISCARD | ranges overlap
+h5 | opset 17 + skip_layer_norm_fusion | -10.0% | DISCARD | ranges overlap
+h6 | opset 17 + layer_norm_fusion | -997.8% | DISCARD | ranges separated
+h7 | opset 17 + bias_softmax_fusion | -63.4% | MARGINAL_UNCONFIRMED | 2/5 sessions confirm
+h9 | opset 17 + matmul_transpose_fusion | -165.4% | DISCARD | ranges separated
+h10 | opset 17 + attention + skip_layer_norm + layer_norm | -164.7% | DISCARD | ranges separated
+h11 | opset 17 + nchwc_transformer (Conv-heavy models) | BUILD_FAIL | build failed
+h12 | opset 17 + transpose_optimizer | BUILD_FAIL | build failed
+h13 | opset 17 + gelu_fusion explicit | BUILD_FAIL | build failed
+
+ + +
+
⚪ Neutral / Build Fail
+Hypothesis | Label | Gain % | Verdict | Confidence
+h0 | baseline (opset 17, autoconf defaults) | +0.0% | BASELINE | ranges overlap
+h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | SKIPPED_CPU002 | guarded skip
+
+ + + + + + diff --git a/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json new file mode 100644 index 000000000..173f889f9 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/apple--mobilevit-small/results.json @@ -0,0 +1,232 @@ +{ + "model_id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit", + "timestamp": "2026-06-18T15:29:58", + "ep": "cpu", + "device": "cpu", + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline (opset 17, autoconf defaults)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 66.804, + "screen_cv": 3.1693311777737856, + "full_p50s_ms": [ + 73.166, + 72.1, + 80.234 + ], + "median_p50_ms": 73.166, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 69.007, + "screen_cv": 3.623472981001927, + "full_p50s_ms": [ + 87.48, + 89.858, + 57.036 + ], + "median_p50_ms": 87.48, + "gain_vs_baseline_pct": -19.56, + "verdict": "DISCARD" + }, + "h2": { + "status": "OK", + "label": "opset 19 (cpu-001 risk — transformer test)", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 78.369, + "screen_cv": 3.1204047518789317, + "full_p50s_ms": [ + 74.505, + 86.262, + 79.826 + ], + "median_p50_ms": 79.826, + "gain_vs_baseline_pct": -9.1, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (cpu-001 risk — transformer test)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 41.225, + "screen_cv": 5.67767131594906, + "full_p50s_ms": [ + 67.43, + 84.267, + 78.586 + ], + "median_p50_ms": 78.586, + "gain_vs_baseline_pct": -7.41, + "verdict": "DISCARD" + }, + "h4": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 57.061, + "screen_cv": 4.881863269133033, + "full_p50s_ms": [ + 83.444, + 
70.192, + 77.772 + ], + "median_p50_ms": 77.772, + "gain_vs_baseline_pct": -6.3, + "verdict": "DISCARD" + }, + "h5": { + "status": "OK", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "extra_optim": { + "skip_layer_norm_fusion": true + }, + "screen_p50_ms": 72.701, + "screen_cv": 3.3349472496939523, + "full_p50s_ms": [ + 80.514, + 60.097, + 785.991 + ], + "median_p50_ms": 80.514, + "gain_vs_baseline_pct": -10.04, + "verdict": "DISCARD" + }, + "h6": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 759.837, + "screen_cv": 0.4795699603994014, + "full_p50s_ms": [ + 817.624, + 803.217, + 184.944 + ], + "median_p50_ms": 803.217, + "gain_vs_baseline_pct": -997.8, + "verdict": "DISCARD" + }, + "h7": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 52.703, + "screen_cv": 0.4295580896723146, + "full_p50s_ms": [ + 60.637, + 64.137, + 119.521 + ], + "median_p50_ms": 64.137, + "gain_vs_baseline_pct": 12.34, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 239.249, + 279.325 + ], + "all_p50s_ms": [ + 60.637, + 64.137, + 119.521, + 239.249, + 279.325 + ], + "overall_median_p50_ms": 119.521, + "overall_gain_pct": -63.36, + "sessions_above_threshold": 2, + "total_sessions": 5 + }, + "h8": { + "status": "SKIPPED_CPU002", + "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)", + "opset": 17, + "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped" + }, + "h9": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 153.131, + "screen_cv": 1.1719312222868001, + "full_p50s_ms": [ + 194.194, + 175.965, + 203.405 + ], + "median_p50_ms": 194.194, + "gain_vs_baseline_pct": -165.42, + "verdict": "DISCARD" + }, + "h10": { + "status": 
"OK", + "label": "opset 17 + attention + skip_layer_norm + layer_norm", + "opset": 17, + "extra_optim": { + "attention_fusion": true, + "skip_layer_norm_fusion": true, + "layer_norm_fusion": true + }, + "screen_p50_ms": 202.155, + "screen_cv": 1.1776211322994732, + "full_p50s_ms": [ + 261.236, + 189.739, + 193.641 + ], + "median_p50_ms": 193.641, + "gain_vs_baseline_pct": -164.66, + "verdict": "DISCARD" + }, + "h11": { + "status": "BUILD_FAIL", + "label": "opset 17 + nchwc_transformer (Conv-heavy models)", + "opset": 17, + "build_error": " device \n⏳ Optimize Optimizing ONNX graph...\n Analyzing 395 nodes (iter 1/3)\n Patterns\n Matmul Add → matmul_add_fusion\n Optimizing (applying autoconf)\n {'matmul_add_fusion': True}Error: Build failed: [Errno 28] No space left on device\n" + }, + "h12": { + "status": "BUILD_FAIL", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "build_error": "pple--mobilevit-small\\h12\\export.onnx\n(21.6 MB)\n[06/18/26 16:34:49] ERROR ✗ ort_graph failed: [Errno 28] No space left on \n device \n⏳ Optimize Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h13": { + "status": "BUILD_FAIL", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "build_error": "📂 Output: \nC:\\tmp\\autoconfig-demo\\catalog-cpu-sweep\\apple--mobilevit-small\\h13\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + } + }, + "baseline_p50_ms": 73.166, + "best_p50_ms": 64.137, + "best_hypothesis": "h7", + "best_gain_pct": 12.34, + "errors": [ + "h11: BUILD_FAIL", + "h12: BUILD_FAIL", + "h13: BUILD_FAIL" + ], + "baseline_opset": 17 +} diff --git a/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html new file mode 100644 index 
000000000..df542812b --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/report.html @@ -0,0 +1,598 @@ + + + + + + CPU CPU Optimization Report — facebook/dinov2-small + + + +

CPU CPU Optimization Report — facebook/dinov2-small

+
dinov2 arch · 2026-06-18 · 14 hypotheses tested
+ +
+Best Gain % · Champion: —
+Baseline → Champion ms: 112.60 ms → — · Latency reduction: —
+EP + Device: CPU / CPU · Baseline opset 17
+Champion Config
+Total experiments: 14 · 0 KEEP / 12 DISCARD
+
+ + +
+
Model Characteristics
+Model ID: facebook/dinov2-small
+Task: image-feature-extraction
+Arch type: dinov2
+Baseline opset: 17
+EP: cpu
+Device: cpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) per hypothesis, h0–h13, axis from -200% to 200%; the per-hypothesis p50, gain and verdict values are listed in the All Hypotheses table below]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| h0 | baseline (opset 17, autoconf defaults) | 17 | not stored | 112.60 ms | [142.03 · 105.56 · 112.60] | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 762.81 ms | [150.63 · 1123.34 · 762.81] | -577.5% | DISCARD | ranges separated |
| h2 | opset 19 (cpu-001 risk — transformer test) | 19 | not stored | 1106.11 ms | [1106.11 · 1104.49 · 1164.20] | -882.4% | CPU001_REGRESSION | ranges separated |
| h3 | opset 21 (cpu-001 risk — transformer test) | 21 | not stored | 1095.19 ms | [1057.56 · 1095.19 · 1128.22] | -872.6% | CPU001_REGRESSION | ranges separated |
| h4 | opset 17 + attention_fusion | 17 | attention_fusion | 1083.83 ms | [1086.54 · 1068.75 · 1083.83] | -862.6% | DISCARD | ranges separated |
| h5 | opset 17 + skip_layer_norm_fusion | 17 | skip_layer_norm_fusion | 1103.07 ms | [1119.95 · 1103.07 · 161.83] | -879.6% | DISCARD | ranges separated |
| h6 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 148.70 ms | [142.60 · 155.01 · 148.70] | -32.1% | DISCARD | ranges separated |
| h7 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 1121.98 ms | [899.91 · 1145.34 · 1121.98] | -896.4% | DISCARD | ranges separated |
| h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | 17 | not stored | | | | SKIPPED_CPU002 | guarded skip |
| h9 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 186.48 ms | [161.47 · 186.48 · 334.34] | -65.6% | DISCARD | ranges separated |
| h10 | opset 17 + attention + skip_layer_norm + layer_norm | 17 | attention_fusion, skip_layer_norm_fusion, layer_norm_fusion | 136.57 ms | [121.38 · 167.90 · 136.57] | -21.3% | DISCARD | ranges overlap |
| h11 | opset 17 + nchwc_transformer (Conv-heavy models) | 17 | nchwc_transformer | 157.51 ms | [157.51 · 192.39 · 157.25] | -39.9% | DISCARD | ranges separated |
| h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 154.59 ms | [175.29 · 143.11 · 154.59] | -37.3% | DISCARD | ranges separated |
| h13 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 154.10 ms | [146.72 · 163.78 · 154.10] | -36.9% | DISCARD | ranges separated |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
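The legend states that the median of the session p50s is the value used for comparison. A minimal Python sketch of that arithmetic follows. It assumes gain is computed as (baseline - candidate) / baseline × 100, which is inferred from the numbers in this report rather than taken from `catalog_sweep.py`; the function names are hypothetical.

```python
from statistics import median


def median_p50(session_p50s_ms):
    """Median of the per-session p50 latencies, in milliseconds."""
    return median(session_p50s_ms)


def gain_vs_baseline_pct(baseline_ms, candidate_ms):
    """Assumed formula: positive means faster than baseline, negative means slower."""
    return round((baseline_ms - candidate_ms) / baseline_ms * 100, 2)


# Session p50s for h0 and h1 as recorded for facebook/dinov2-small.
baseline = median_p50([142.033, 105.561, 112.599])
h1 = median_p50([150.633, 1123.338, 762.812])
print(baseline, h1, gain_vs_baseline_pct(baseline, h1))
```

Under this assumed formula the h1 row reproduces the recorded -577.46%, which suggests the large negative gains here are driven by a few very slow sessions rather than by a display error.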
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h1 | opset 17 explicit | -577.5% | DISCARD | ranges separated |
| h2 | opset 19 (cpu-001 risk — transformer test) | -882.4% | CPU001_REGRESSION | ranges separated |
| h3 | opset 21 (cpu-001 risk — transformer test) | -872.6% | CPU001_REGRESSION | ranges separated |
| h4 | opset 17 + attention_fusion | -862.6% | DISCARD | ranges separated |
| h5 | opset 17 + skip_layer_norm_fusion | -879.6% | DISCARD | ranges separated |
| h6 | opset 17 + layer_norm_fusion | -32.1% | DISCARD | ranges separated |
| h7 | opset 17 + bias_softmax_fusion | -896.4% | DISCARD | ranges separated |
| h9 | opset 17 + matmul_transpose_fusion | -65.6% | DISCARD | ranges separated |
| h10 | opset 17 + attention + skip_layer_norm + layer_norm | -21.3% | DISCARD | ranges overlap |
| h11 | opset 17 + nchwc_transformer (Conv-heavy models) | -39.9% | DISCARD | ranges separated |
| h12 | opset 17 + transpose_optimizer | -37.3% | DISCARD | ranges separated |
| h13 | opset 17 + gelu_fusion explicit | -36.9% | DISCARD | ranges separated |
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h0 | baseline (opset 17, autoconf defaults) | +0.0% | BASELINE | ranges overlap |
| h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | | SKIPPED_CPU002 | guarded skip |
+
+ + + + + + diff --git a/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json new file mode 100644 index 000000000..88067a09f --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/facebook--dinov2-small/results.json @@ -0,0 +1,249 @@ +{ + "model_id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-18T12:25:19", + "ep": "cpu", + "device": "cpu", + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline (opset 17, autoconf defaults)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 1108.058, + "screen_cv": 0.303283763124313, + "full_p50s_ms": [ + 142.033, + 105.561, + 112.599 + ], + "median_p50_ms": 112.599, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 114.372, + "screen_cv": 3.0164201028223694, + "full_p50s_ms": [ + 150.633, + 1123.338, + 762.812 + ], + "median_p50_ms": 762.812, + "gain_vs_baseline_pct": -577.46, + "verdict": "DISCARD" + }, + "h2": { + "status": "OK", + "label": "opset 19 (cpu-001 risk — transformer test)", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 918.187, + "screen_cv": 0.6378613506834665, + "full_p50s_ms": [ + 1106.113, + 1104.489, + 1164.205 + ], + "median_p50_ms": 1106.113, + "gain_vs_baseline_pct": -882.35, + "verdict": "CPU001_REGRESSION" + }, + "h3": { + "status": "OK", + "label": "opset 21 (cpu-001 risk — transformer test)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 1139.678, + "screen_cv": 0.23544106317749397, + "full_p50s_ms": [ + 1057.558, + 1095.186, + 1128.223 + ], + "median_p50_ms": 1095.186, + "gain_vs_baseline_pct": -872.64, + "verdict": "CPU001_REGRESSION" + }, + "h4": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 1093.504, 
+ "screen_cv": 0.2851768260564205, + "full_p50s_ms": [ + 1086.54, + 1068.752, + 1083.83 + ], + "median_p50_ms": 1083.83, + "gain_vs_baseline_pct": -862.56, + "verdict": "DISCARD" + }, + "h5": { + "status": "OK", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "extra_optim": { + "skip_layer_norm_fusion": true + }, + "screen_p50_ms": 1099.529, + "screen_cv": 0.3173004077200328, + "full_p50s_ms": [ + 1119.951, + 1103.065, + 161.832 + ], + "median_p50_ms": 1103.065, + "gain_vs_baseline_pct": -879.64, + "verdict": "DISCARD" + }, + "h6": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 881.27, + "screen_cv": 0.6731739421516676, + "full_p50s_ms": [ + 142.596, + 155.014, + 148.704 + ], + "median_p50_ms": 148.704, + "gain_vs_baseline_pct": -32.07, + "verdict": "DISCARD" + }, + "h7": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 107.327, + "screen_cv": 0.367950282780661, + "full_p50s_ms": [ + 899.911, + 1145.338, + 1121.982 + ], + "median_p50_ms": 1121.982, + "gain_vs_baseline_pct": -896.44, + "verdict": "DISCARD" + }, + "h8": { + "status": "SKIPPED_CPU002", + "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)", + "opset": 17, + "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped" + }, + "h9": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 102.861, + "screen_cv": 0.5473794732697523, + "full_p50s_ms": [ + 161.473, + 186.476, + 334.336 + ], + "median_p50_ms": 186.476, + "gain_vs_baseline_pct": -65.61, + "verdict": "DISCARD" + }, + "h10": { + "status": "OK", + "label": "opset 17 + attention + skip_layer_norm + layer_norm", + "opset": 17, + "extra_optim": { + "attention_fusion": true, + "skip_layer_norm_fusion": true, + 
"layer_norm_fusion": true + }, + "screen_p50_ms": 168.419, + "screen_cv": 3.5594440057238192, + "full_p50s_ms": [ + 121.378, + 167.902, + 136.572 + ], + "median_p50_ms": 136.572, + "gain_vs_baseline_pct": -21.29, + "verdict": "DISCARD" + }, + "h11": { + "status": "OK", + "label": "opset 17 + nchwc_transformer (Conv-heavy models)", + "opset": 17, + "extra_optim": { + "nchwc_transformer": true + }, + "screen_p50_ms": 156.796, + "screen_cv": 2.250503839383658, + "full_p50s_ms": [ + 157.508, + 192.392, + 157.246 + ], + "median_p50_ms": 157.508, + "gain_vs_baseline_pct": -39.88, + "verdict": "DISCARD" + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 159.442, + "screen_cv": 3.556904705159243, + "full_p50s_ms": [ + 175.292, + 143.108, + 154.593 + ], + "median_p50_ms": 154.593, + "gain_vs_baseline_pct": -37.3, + "verdict": "DISCARD" + }, + "h13": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 175.256, + "screen_cv": 2.8835132606016343, + "full_p50s_ms": [ + 146.716, + 163.783, + 154.105 + ], + "median_p50_ms": 154.105, + "gain_vs_baseline_pct": -36.86, + "verdict": "DISCARD" + } + }, + "baseline_p50_ms": 112.599, + "best_p50_ms": null, + "best_hypothesis": null, + "best_gain_pct": null, + "errors": [], + "baseline_opset": 17 +} diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html new file mode 100644 index 000000000..0d6b9c240 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/report.html @@ -0,0 +1,268 @@ + + + + + + CPU CPU Optimization Report — microsoft/rad-dino + + + +

CPU CPU Optimization Report — microsoft/rad-dino

+
dinov2 arch · 2026-06-18 · 0 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
— → —
+
Latency reduction: —
+
+
+
EP + Device
+
CPU / CPU
+
Baseline opset —
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
0
+
0 KEEP / 0 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: microsoft/rad-dino
Task: image-feature-extraction
Arch type: dinov2
Baseline opset:
EP: cpu
Device: cpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Chart: gain vs baseline (%) per hypothesis; axis ticks from -200% to 200%. No hypotheses plotted.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + + + + + + + diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json new file mode 100644 index 000000000..d4a47523c --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/microsoft--rad-dino/results.json @@ -0,0 +1,16 @@ +{ + "model_id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-18T16:53:15", + "ep": "cpu", + "device": "cpu", + "hypotheses": {}, + "baseline_p50_ms": null, + "best_p50_ms": null, + "best_hypothesis": null, + "best_gain_pct": null, + "errors": [ + "base config generation failed" + ] +} diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html new file mode 100644 index 000000000..54658bb64 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/report.html @@ -0,0 +1,616 @@ + + + + + + CPU CPU Optimization Report — microsoft/resnet-18 + + + +

CPU CPU Optimization Report — microsoft/resnet-18

+
resnet arch · 2026-06-18 · 14 hypotheses tested
+ +
+
+
Best Gain %
+
+92.5%
+
Champion: h9
+
+
+
Baseline → Champion ms
+
237.47 ms → 17.80 ms
+
Latency reduction: 219.68 ms
+
+
+
EP + Device
+
CPU / CPU
+
Baseline opset 17
+
+
+
Champion Config
+
h9
+
opset 17 + matmul_transpose_fusion
+
+
+
Total experiments
+
14
+
6 KEEP / 1 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: microsoft/resnet-18
Task: image-classification
Arch type: resnet
Baseline opset: 17
EP: cpu
Device: cpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Chart: gain vs baseline (%) per hypothesis, h0 to h13; axis ticks from -200% to 200%. Status, verdict, p50, and gain for each bar are listed in the "All Hypotheses" table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| h0 | baseline (opset 17, autoconf defaults) | 17 | not stored | 237.47 ms | [237.47 · 230.21 · 238.44] | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 244.96 ms | [221.85 · 252.22 · 244.96] | -3.1% | DISCARD | ranges overlap |
| h2 | opset 19 (cpu-001 risk — transformer test) | 19 | not stored | 231.69 ms | [209.29 · 231.69 · 238.07] | +2.4% | MARGINAL | ranges overlap |
| h3 | opset 21 (cpu-001 risk — transformer test) | 21 | not stored | 226.69 ms | [218.89 · 226.69 · 230.42] | +4.5% | MARGINAL | ranges overlap |
| h4 | opset 17 + attention_fusion | 17 | attention_fusion | 231.07 ms | [209.43 · 231.07 · 235.40] | +2.7% | MARGINAL | ranges overlap |
| h5 | opset 17 + skip_layer_norm_fusion | 17 | skip_layer_norm_fusion | 226.59 ms | [207.97 · 226.59 · 227.74] | +4.6% | MARGINAL | ranges separated |
| h6 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 212.70 ms | [200.31 · 212.70 · 215.09 · 40.37 · 24.96] | +15.7% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h7 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 227.78 ms | [222.57 · 245.29 · 227.78] | +4.1% | MARGINAL | ranges overlap |
| h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | 17 | not stored | | | | SKIPPED_CPU002 | guarded skip |
| h9 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 17.80 ms | [24.22 · 11.52 · 17.80 · 186.46 · 197.36] | +89.8% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h10 | opset 17 + attention + skip_layer_norm + layer_norm | 17 | attention_fusion, skip_layer_norm_fusion, layer_norm_fusion | 20.09 ms | [20.09 · 14.91 · 43.27 · 18.86 · 39.40] | +91.5% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h11 | opset 17 + nchwc_transformer (Conv-heavy models) | 17 | nchwc_transformer | 40.87 ms | [30.78 · 40.87 · 230.91 · 36.88 · 27.17] | +84.5% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm |
| h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 36.91 ms | [36.91 · 21.65 · 40.88 · 26.59 · 38.94] | +84.5% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h13 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 26.39 ms | [219.22 · 26.39 · 20.94 · 18.75 · 215.34] | +88.9% | KEEP_CONFIRMED | 5/5 sessions confirm |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
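Rows with five session p50s carry an overall median, an overall gain, and an "N/5 sessions confirm" count. The sketch below shows one way those three figures could be derived. The 5% per-session threshold is a hypothetical value chosen only because it is consistent with the counts in this report; the threshold actually used by the sweep is not shown here, and `summarize` is an illustrative name, not a function from the repository.

```python
from statistics import median

BASELINE_P50_MS = 237.472       # h0 median for microsoft/resnet-18
ASSUMED_THRESHOLD_PCT = 5.0     # hypothetical; the real threshold is not shown in this report


def session_gain_pct(baseline_ms, session_ms):
    """Gain of a single session relative to the baseline median, in percent."""
    return (baseline_ms - session_ms) / baseline_ms * 100


def summarize(all_p50s_ms, baseline_ms=BASELINE_P50_MS,
              threshold_pct=ASSUMED_THRESHOLD_PCT):
    """Overall median, overall gain, and count of sessions beating the threshold."""
    overall_median = median(all_p50s_ms)
    above = sum(
        1 for s in all_p50s_ms
        if session_gain_pct(baseline_ms, s) > threshold_pct
    )
    return {
        "overall_median_p50_ms": overall_median,
        "overall_gain_pct": round(session_gain_pct(baseline_ms, overall_median), 2),
        "sessions_above_threshold": above,
        "total_sessions": len(all_p50s_ms),
    }


# h11 (nchwc_transformer): three full sessions plus two confirmation sessions.
h11 = summarize([30.776, 40.872, 230.911, 36.88, 27.171])
# h9 (matmul_transpose_fusion): the two confirmation sessions are much slower.
h9 = summarize([24.223, 11.524, 17.797, 186.462, 197.357])
print(h11)
print(h9)
```

With these inputs the sketch gives 4 of 5 sessions for h11 and 5 of 5 for h9, matching the table. It also makes the bimodal session times visible: h9 counts as fully confirmed even though two of its sessions are roughly ten times slower than the other three.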
+
+ + +
+
✅ Effective Optimizations
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h6 | opset 17 + layer_norm_fusion | +15.7% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h9 | opset 17 + matmul_transpose_fusion | +89.8% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h10 | opset 17 + attention + skip_layer_norm + layer_norm | +91.5% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h11 | opset 17 + nchwc_transformer (Conv-heavy models) | +84.5% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm |
| h12 | opset 17 + transpose_optimizer | +84.5% | KEEP_CONFIRMED | 5/5 sessions confirm |
| h13 | opset 17 + gelu_fusion explicit | +88.9% | KEEP_CONFIRMED | 5/5 sessions confirm |
+
+ + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h1 | opset 17 explicit | -3.1% | DISCARD | ranges overlap |
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h0 | baseline (opset 17, autoconf defaults) | +0.0% | BASELINE | ranges overlap |
| h2 | opset 19 (cpu-001 risk — transformer test) | +2.4% | MARGINAL | ranges overlap |
| h3 | opset 21 (cpu-001 risk — transformer test) | +4.5% | MARGINAL | ranges overlap |
| h4 | opset 17 + attention_fusion | +2.7% | MARGINAL | ranges overlap |
| h5 | opset 17 + skip_layer_norm_fusion | +4.6% | MARGINAL | ranges separated |
| h7 | opset 17 + bias_softmax_fusion | +4.1% | MARGINAL | ranges overlap |
| h8 | opset 17 + matmul_add_fusion (cpu-002 guarded) | | SKIPPED_CPU002 | guarded skip |
+
+ + + + + + diff --git a/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json new file mode 100644 index 000000000..9be730ae4 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/microsoft--resnet-18/results.json @@ -0,0 +1,339 @@ +{ + "model_id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet", + "timestamp": "2026-06-18T11:14:10", + "ep": "cpu", + "device": "cpu", + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline (opset 17, autoconf defaults)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 231.091, + "screen_cv": 0.634823511084378, + "full_p50s_ms": [ + 237.472, + 230.213, + 238.44 + ], + "median_p50_ms": 237.472, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 236.135, + "screen_cv": 0.6215427615558896, + "full_p50s_ms": [ + 221.852, + 252.225, + 244.959 + ], + "median_p50_ms": 244.959, + "gain_vs_baseline_pct": -3.15, + "verdict": "DISCARD" + }, + "h2": { + "status": "OK", + "label": "opset 19 (cpu-001 risk — transformer test)", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 228.935, + "screen_cv": 0.6700941315220477, + "full_p50s_ms": [ + 209.29, + 231.693, + 238.073 + ], + "median_p50_ms": 231.693, + "gain_vs_baseline_pct": 2.43, + "verdict": "MARGINAL" + }, + "h3": { + "status": "OK", + "label": "opset 21 (cpu-001 risk — transformer test)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 222.347, + "screen_cv": 0.6050137847598573, + "full_p50s_ms": [ + 218.891, + 226.688, + 230.417 + ], + "median_p50_ms": 226.688, + "gain_vs_baseline_pct": 4.54, + "verdict": "MARGINAL" + }, + "h4": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 229.793, + "screen_cv": 0.6810172633631137, + "full_p50s_ms": 
[ + 209.431, + 231.069, + 235.402 + ], + "median_p50_ms": 231.069, + "gain_vs_baseline_pct": 2.7, + "verdict": "MARGINAL" + }, + "h5": { + "status": "OK", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "extra_optim": { + "skip_layer_norm_fusion": true + }, + "screen_p50_ms": 188.605, + "screen_cv": 0.8141088518331964, + "full_p50s_ms": [ + 207.967, + 226.586, + 227.739 + ], + "median_p50_ms": 226.586, + "gain_vs_baseline_pct": 4.58, + "verdict": "MARGINAL" + }, + "h6": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 206.291, + "screen_cv": 0.6984017722537581, + "full_p50s_ms": [ + 200.308, + 212.704, + 215.094 + ], + "median_p50_ms": 212.704, + "gain_vs_baseline_pct": 10.43, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 40.366, + 24.962 + ], + "all_p50s_ms": [ + 200.308, + 212.704, + 215.094, + 40.366, + 24.962 + ], + "overall_median_p50_ms": 200.308, + "overall_gain_pct": 15.65, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h7": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 176.804, + "screen_cv": 0.8944367774484739, + "full_p50s_ms": [ + 222.575, + 245.29, + 227.782 + ], + "median_p50_ms": 227.782, + "gain_vs_baseline_pct": 4.08, + "verdict": "MARGINAL" + }, + "h8": { + "status": "SKIPPED_CPU002", + "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)", + "opset": 17, + "reason": "cpu-002: model already has Gemm — matmul_add_fusion skipped" + }, + "h9": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 15.5, + "screen_cv": 0.6570967741935484, + "full_p50s_ms": [ + 24.223, + 11.524, + 17.797 + ], + "median_p50_ms": 17.797, + "gain_vs_baseline_pct": 92.51, + "verdict": "KEEP_CONFIRMED", + 
"confirm_p50s_ms": [ + 186.462, + 197.357 + ], + "all_p50s_ms": [ + 24.223, + 11.524, + 17.797, + 186.462, + 197.357 + ], + "overall_median_p50_ms": 24.223, + "overall_gain_pct": 89.8, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h10": { + "status": "OK", + "label": "opset 17 + attention + skip_layer_norm + layer_norm", + "opset": 17, + "extra_optim": { + "attention_fusion": true, + "skip_layer_norm_fusion": true, + "layer_norm_fusion": true + }, + "screen_p50_ms": 24.828, + "screen_cv": 0.5822458514580312, + "full_p50s_ms": [ + 20.086, + 14.909, + 43.266 + ], + "median_p50_ms": 20.086, + "gain_vs_baseline_pct": 91.54, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 18.859, + 39.401 + ], + "all_p50s_ms": [ + 20.086, + 14.909, + 43.266, + 18.859, + 39.401 + ], + "overall_median_p50_ms": 20.086, + "overall_gain_pct": 91.54, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h11": { + "status": "OK", + "label": "opset 17 + nchwc_transformer (Conv-heavy models)", + "opset": 17, + "extra_optim": { + "nchwc_transformer": true + }, + "screen_p50_ms": 14.073, + "screen_cv": 0.8100618205073545, + "full_p50s_ms": [ + 30.776, + 40.872, + 230.911 + ], + "median_p50_ms": 40.872, + "gain_vs_baseline_pct": 82.79, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 36.88, + 27.171 + ], + "all_p50s_ms": [ + 30.776, + 40.872, + 230.911, + 36.88, + 27.171 + ], + "overall_median_p50_ms": 36.88, + "overall_gain_pct": 84.47, + "sessions_above_threshold": 4, + "total_sessions": 5 + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 10.858, + "screen_cv": 0.7146804199668446, + "full_p50s_ms": [ + 36.911, + 21.651, + 40.879 + ], + "median_p50_ms": 36.911, + "gain_vs_baseline_pct": 84.46, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 26.592, + 38.939 + ], + "all_p50s_ms": [ + 36.911, + 21.651, + 40.879, + 26.592, + 38.939 
+ ], + "overall_median_p50_ms": 36.911, + "overall_gain_pct": 84.46, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h13": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 183.865, + "screen_cv": 0.9105920104424441, + "full_p50s_ms": [ + 219.217, + 26.395, + 20.936 + ], + "median_p50_ms": 26.395, + "gain_vs_baseline_pct": 88.89, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 18.747, + 215.344 + ], + "all_p50s_ms": [ + 219.217, + 26.395, + 20.936, + 18.747, + 215.344 + ], + "overall_median_p50_ms": 26.395, + "overall_gain_pct": 88.89, + "sessions_above_threshold": 5, + "total_sessions": 5 + } + }, + "baseline_p50_ms": 237.472, + "best_p50_ms": 17.797, + "best_hypothesis": "h9", + "best_gain_pct": 92.51, + "errors": [], + "baseline_opset": 17 +} diff --git a/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html new file mode 100644 index 000000000..801d03710 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html @@ -0,0 +1,268 @@ + + + + + + CPU CPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2 + + + +

CPU CPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2

+
bert arch · 2026-06-18 · 0 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
— → —
+
Latency reduction: —
+
+
+
EP + Device
+
CPU / CPU
+
Baseline opset —
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
0
+
0 KEEP / 0 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: sentence-transformers/all-MiniLM-L6-v2
Task: sentence-similarity
Arch type: bert
Baseline opset:
EP: cpu
Device: cpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Chart: gain vs baseline (%) per hypothesis; axis ticks from -200% to 200%. No hypotheses plotted.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + + + + + + + diff --git a/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json new file mode 100644 index 000000000..a174931d9 --- /dev/null +++ b/research/autoconfig/catalog-cpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json @@ -0,0 +1,16 @@ +{ + "model_id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert", + "timestamp": "2026-06-18T16:52:49", + "ep": "cpu", + "device": "cpu", + "hypotheses": {}, + "baseline_p50_ms": null, + "best_p50_ms": null, + "best_hypothesis": null, + "best_gain_pct": null, + "errors": [ + "base config generation failed" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/.gitignore b/research/autoconfig/catalog-gpu-sweep/.gitignore new file mode 100644 index 000000000..b3b91d38b --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/.gitignore @@ -0,0 +1,10 @@ +# Hypothesis build artifacts (large binary files) +h*/ +_tmp_config/ +# Raw perf session files +full_perf_s*.json +screen_perf.json +confirm_s*.json +# Model weight files +*.data +*.onnx diff --git a/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html new file mode 100644 index 000000000..c17fca4e0 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — BAAI/bge-small-en-v1.5 + + + +

QNN GPU Optimization Report — BAAI/bge-small-en-v1.5

+
bert arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
52.63 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 4 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: BAAI/bge-small-en-v1.5
Task: sentence-similarity
Arch type: bert
Baseline opset: 17
EP: qnn
Device: gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Chart: gain vs baseline (%) per hypothesis, h0 to h12; axis ticks from -200% to 200%. Status, verdict, p50, and gain for each bar are listed in the "All Hypotheses" table; h9 to h12 are labeled BUILD_FAIL.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
IDConfig LabelOpsetExtra FlagsMedian p50Session p50s (ms)Gain %VerdictConfidence
h0baseline FP32 (no quant, no compile)17not stored52.63 ms[57.21 · 52.63 · 51.96]+0.0%BASELINEranges overlap
h1opset 17 explicit17not stored53.34 ms[53.69 · 52.32 · 53.34]-1.4%DISCARDranges overlap
h2opset 1919not stored53.36 ms[52.84 · 53.36 · 53.40]-1.4%DISCARDranges overlap
h3opset 21 (tests gpu-006)21not stored52.54 ms[52.19 · 53.58 · 52.54]+0.2%MARGINALranges overlap
h4opset 17 + matmul_transpose_fusion17matmul_transpose_fusion52.81 ms[52.81 · 53.63 · 52.17]-0.3%DISCARDranges overlap
h5opset 17 + attention_fusion17attention_fusion52.57 ms[52.99 · 52.27 · 52.57]+0.1%MARGINALranges overlap
h6opset 17 + bias_softmax_fusion17bias_softmax_fusion52.70 ms[51.83 · 53.41 · 52.70]-0.1%DISCARDranges overlap
h7opset 17 + layer_norm_fusion17layer_norm_fusion52.62 ms[54.30 · 52.52 · 52.62]+0.0%MARGINALranges overlap
h8opset 17 + skip_layer_norm_fusionnot storedBENCH_FAILbench failed
h9opset 21 + matmul_transpose + attention_fusion21not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h10opset 17 + ln + skip_ln + matmul_transpose17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h11opset 17 + gelu_fusion explicit17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h12opset 17 + transpose_optimizer17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
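The Confidence column reports "ranges overlap" or "ranges separated" for each three-session hypothesis. The sketch below shows a min/max interval test that agrees with every such label in these reports. The rule is inferred from the data; the report generator's actual logic is not shown in this diff, and both function names are hypothetical.

```python
def ranges_overlap(baseline_p50s_ms, candidate_p50s_ms):
    """True when the [min, max] spans of the two session sets intersect."""
    return (
        min(candidate_p50s_ms) <= max(baseline_p50s_ms)
        and min(baseline_p50s_ms) <= max(candidate_p50s_ms)
    )


def confidence_label(baseline_p50s_ms, candidate_p50s_ms):
    """Assumed mapping from the interval test to the report's label text."""
    if ranges_overlap(baseline_p50s_ms, candidate_p50s_ms):
        return "ranges overlap"
    return "ranges separated"


# QNN/GPU, BAAI/bge-small-en-v1.5: h0 (baseline) vs h1.
gpu_label = confidence_label([57.207, 52.628, 51.964], [53.686, 52.321, 53.336])

# CPU, microsoft/resnet-18: h0 (baseline) vs h5 (skip_layer_norm_fusion).
cpu_label = confidence_label([237.472, 230.213, 238.44], [207.967, 226.586, 227.739])

print(gpu_label)
print(cpu_label)
```

Every measured hypothesis in this QNN/GPU run sits inside the baseline's 51.96 to 57.21 ms span, so under this reading none of the sub-2% differences in the table can be distinguished from session-to-session noise.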
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
| --- | --- | --- | --- | --- |
| h9 | opset 21 + matmul_transpose + attention_fusion | | BUILD_FAIL | build failed |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | BUILD_FAIL | build failed |
| h11 | opset 17 + gelu_fusion explicit | | BUILD_FAIL | build failed |
| h12 | opset 17 + transpose_optimizer | | BUILD_FAIL | build failed |
+
+ + +
+
⚪ Neutral / Build Fail
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | -1.4% | DISCARD | ranges overlap |
| h2 | opset 19 | -1.4% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | +0.2% | MARGINAL | ranges overlap |
| h4 | opset 17 + matmul_transpose_fusion | -0.3% | DISCARD | ranges overlap |
| h5 | opset 17 + attention_fusion | +0.1% | MARGINAL | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | -0.1% | DISCARD | ranges overlap |
| h7 | opset 17 + layer_norm_fusion | +0.0% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | | BENCH_FAIL | bench failed |
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json new file mode 100644 index 000000000..ad8809324 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/BAAI--bge-small-en-v1.5/results.json @@ -0,0 +1,187 @@ +{ + "model_id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "model_type": "bert", + "timestamp": "2026-06-18T00:17:47", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 54.32, + "screen_cv": 0.7272091310751105, + "full_p50s_ms": [ + 57.207, + 52.628, + 51.964 + ], + "median_p50_ms": 52.628, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 54.548, + "screen_cv": 0.2691757717973161, + "full_p50s_ms": [ + 53.686, + 52.321, + 53.336 + ], + "median_p50_ms": 53.336, + "gain_vs_baseline_pct": -1.35, + "verdict": "DISCARD" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 53.712, + "screen_cv": 0.11630548108430146, + "full_p50s_ms": [ + 52.844, + 53.359, + 53.4 + ], + "median_p50_ms": 53.359, + "gain_vs_baseline_pct": -1.39, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 53.406, + "screen_cv": 0.1399842714301764, + "full_p50s_ms": [ + 52.192, + 53.582, + 52.542 + ], + "median_p50_ms": 52.542, + "gain_vs_baseline_pct": 0.16, + "verdict": "MARGINAL" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 52.792, + "screen_cv": 0.18718745264434003, + "full_p50s_ms": [ + 52.812, + 53.63, + 
52.173 + ], + "median_p50_ms": 52.812, + "gain_vs_baseline_pct": -0.35, + "verdict": "DISCARD" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 52.42, + "screen_cv": 0.1541205646699733, + "full_p50s_ms": [ + 52.991, + 52.271, + 52.571 + ], + "median_p50_ms": 52.571, + "gain_vs_baseline_pct": 0.11, + "verdict": "MARGINAL" + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 52.712, + "screen_cv": 0.15228031567764452, + "full_p50s_ms": [ + 51.826, + 53.412, + 52.698 + ], + "median_p50_ms": 52.698, + "gain_vs_baseline_pct": -0.13, + "verdict": "DISCARD" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 58.252, + "screen_cv": 0.19027672869601042, + "full_p50s_ms": [ + 54.301, + 52.525, + 52.622 + ], + "median_p50_ms": 52.622, + "gain_vs_baseline_pct": 0.01, + "verdict": "MARGINAL" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "BUILD_FAIL", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "build_error": "AI--bge-small-en-v1.5\\h9\\export.onnx\n(127.5 MB)\n[06/18/26 00:59:02] ERROR ✗ ort_graph failed: [Errno 28] No space left on \n device \n⏳ Optimize Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h10": { + "status": "BUILD_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose", + "opset": 17, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. 
\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h11": { + "status": "BUILD_FAIL", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. \n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h12": { + "status": "BUILD_FAIL", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. \n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 52.628, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": 0.16, + "feature_gaps": [], + "errors": [ + "h8: screen bench failed", + "h9: BUILD_FAIL", + "h10: BUILD_FAIL", + "h11: BUILD_FAIL", + "h12: BUILD_FAIL" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html new file mode 100644 index 000000000..6e4b98e27 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — apple/mobilevit-small + + + +

QNN GPU Optimization Report — apple/mobilevit-small

+
mobilevit arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
17.98 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 3 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: apple/mobilevit-small
Task: image-classification
Arch type: mobilevit
Baseline opset: 17
EP: qnn
Device: gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h12, axis from -200% to +200%. Per-hypothesis status, verdict, p50, and gain are listed in the full-detail table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|---|---|---|---|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 17.98 ms | [18.20 · 17.98 · 17.77] | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 17.73 ms | [18.66 · 17.73 · 17.56] | +1.4% | MARGINAL | ranges overlap |
| h2 | opset 19 | 19 | not stored | 18.28 ms | [18.16 · 18.38 · 18.28] | -1.6% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | 21 | not stored | 18.60 ms | [18.19 · 18.85 · 18.60] | -3.4% | DISCARD | ranges overlap |
| h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 17.74 ms | [17.74 · 17.61 · 18.28] | +1.4% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | 17 | attention_fusion | 18.14 ms | [20.60 · 18.14 · 17.86] | -0.9% | DISCARD | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 17.67 ms | [18.07 · 17.67 · 17.66] | +1.8% | MARGINAL | ranges overlap |
| h7 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 17.83 ms | [17.66 · 17.83 · 20.32] | +0.9% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | | not stored | | | | BENCH_FAIL | bench failed |
| h9 | opset 21 + matmul_transpose + attention_fusion | 21 | matmul_transpose_fusion, attention_fusion | 19.22 ms | [18.16 · 19.24 · 19.22] | -6.9% | DISCARD | ranges overlap |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | not stored | | | | BENCH_FAIL | bench failed |
| h11 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 17.79 ms | [17.68 · 18.41 · 17.79] | +1.1% | MARGINAL | ranges overlap |
| h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 18.57 ms | [17.71 · 18.88 · 18.57] | -3.2% | DISCARD | ranges overlap |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + +
+
❌ Ineffective or Harmful
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h3 | opset 21 (tests gpu-006) | -3.4% | DISCARD | ranges overlap |
| h9 | opset 21 + matmul_transpose + attention_fusion | -6.9% | DISCARD | ranges overlap |
| h12 | opset 17 + transpose_optimizer | -3.2% | DISCARD | ranges overlap |
+
+ + +
+
⚪ Neutral / Build Fail
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | +1.4% | MARGINAL | ranges overlap |
| h2 | opset 19 | -1.6% | DISCARD | ranges overlap |
| h4 | opset 17 + matmul_transpose_fusion | +1.4% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | -0.9% | DISCARD | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | +1.8% | MARGINAL | ranges overlap |
| h7 | opset 17 + layer_norm_fusion | +0.9% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | | BENCH_FAIL | bench failed |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | BENCH_FAIL | bench failed |
| h11 | opset 17 + gelu_fusion explicit | +1.1% | MARGINAL | ranges overlap |
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json new file mode 100644 index 000000000..0d75e8b9d --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/apple--mobilevit-small/results.json @@ -0,0 +1,219 @@ +{ + "model_id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit", + "timestamp": "2026-06-18T01:40:29", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 21.759, + "screen_cv": 0.17119352911438945, + "full_p50s_ms": [ + 18.204, + 17.985, + 17.773 + ], + "median_p50_ms": 17.985, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 18.003, + "screen_cv": 0.17624840304393713, + "full_p50s_ms": [ + 18.657, + 17.727, + 17.557 + ], + "median_p50_ms": 17.727, + "gain_vs_baseline_pct": 1.43, + "verdict": "MARGINAL" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 18.53, + "screen_cv": 0.15267134376686453, + "full_p50s_ms": [ + 18.162, + 18.381, + 18.281 + ], + "median_p50_ms": 18.281, + "gain_vs_baseline_pct": -1.65, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 18.209, + "screen_cv": 0.2452633313196771, + "full_p50s_ms": [ + 18.188, + 18.851, + 18.6 + ], + "median_p50_ms": 18.6, + "gain_vs_baseline_pct": -3.42, + "verdict": "DISCARD" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 17.775, + "screen_cv": 0.15651195499296766, + "full_p50s_ms": [ + 17.74, + 17.609, + 
18.28 + ], + "median_p50_ms": 17.74, + "gain_vs_baseline_pct": 1.36, + "verdict": "MARGINAL" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 17.942, + "screen_cv": 0.3691896109686768, + "full_p50s_ms": [ + 20.597, + 18.141, + 17.859 + ], + "median_p50_ms": 18.141, + "gain_vs_baseline_pct": -0.87, + "verdict": "DISCARD" + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 20.757, + "screen_cv": 0.15112973936503346, + "full_p50s_ms": [ + 18.068, + 17.671, + 17.662 + ], + "median_p50_ms": 17.671, + "gain_vs_baseline_pct": 1.75, + "verdict": "MARGINAL" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 17.683, + "screen_cv": 0.21947633320137985, + "full_p50s_ms": [ + 17.655, + 17.827, + 20.316 + ], + "median_p50_ms": 17.827, + "gain_vs_baseline_pct": 0.88, + "verdict": "MARGINAL" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 18.238, + "screen_cv": 0.10938699418795922, + "full_p50s_ms": [ + 18.161, + 19.242, + 19.224 + ], + "median_p50_ms": 19.224, + "gain_vs_baseline_pct": -6.89, + "verdict": "DISCARD" + }, + "h10": { + "status": "BENCH_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 17.604, + "screen_cv": 0.1246875710065894, + "full_p50s_ms": [ + 17.678, + 18.414, + 17.788 + ], + "median_p50_ms": 17.788, + 
"gain_vs_baseline_pct": 1.1, + "verdict": "MARGINAL" + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 17.827, + "screen_cv": 0.200201940876199, + "full_p50s_ms": [ + 17.706, + 18.881, + 18.57 + ], + "median_p50_ms": 18.57, + "gain_vs_baseline_pct": -3.25, + "verdict": "DISCARD" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 17.985, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": -3.42, + "feature_gaps": [], + "errors": [ + "h8: screen bench failed", + "h10: screen bench failed" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html new file mode 100644 index 000000000..abb7d9fcd --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — deepset/roberta-base-squad2 + + + +

QNN GPU Optimization Report — deepset/roberta-base-squad2

+
roberta arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
99.53 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 7 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: deepset/roberta-base-squad2
Task: question-answering
Arch type: roberta
Baseline opset: 17
EP: qnn
Device: gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h12, axis from -200% to +200%; h6 to h12 are labelled BUILD_FAIL. Per-hypothesis status, verdict, p50, and gain are listed in the full-detail table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|---|---|---|---|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 99.53 ms | [99.95 · 97.75 · 99.53] | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 98.59 ms | [99.11 · 98.16 · 98.59] | +0.9% | MARGINAL | ranges overlap |
| h2 | opset 19 | 19 | not stored | 100.33 ms | [100.33 · 99.66 · 101.42] | -0.8% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | 21 | not stored | 100.67 ms | [100.42 · 100.67 · 100.98] | -1.1% | DISCARD | ranges separated |
| h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 98.44 ms | [98.04 · 99.49 · 98.44] | +1.1% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | 17 | attention_fusion | 98.59 ms | [98.59 · 98.91 · 98.56] | +0.9% | MARGINAL | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h7 | opset 17 + layer_norm_fusion | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h8 | opset 17 + skip_layer_norm_fusion | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h9 | opset 21 + matmul_transpose + attention_fusion | 21 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h11 | opset 17 + gelu_fusion explicit | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
| h12 | opset 17 + transpose_optimizer | 17 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
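The Confidence column above distinguishes "ranges overlap" from "ranges separated". A minimal sketch of one rule that reproduces those labels, assuming the check compares the min-to-max span of each hypothesis's session p50s against the baseline's span. This rule is inferred from the reported values and is not confirmed by the report generator; both function names are hypothetical.

```python
def ranges_overlap(baseline_sessions: list[float], candidate_sessions: list[float]) -> bool:
    """True when the [min, max] spans of the two session sets intersect."""
    b_lo, b_hi = min(baseline_sessions), max(baseline_sessions)
    c_lo, c_hi = min(candidate_sessions), max(candidate_sessions)
    return c_lo <= b_hi and b_lo <= c_hi


def confidence_label(baseline_sessions: list[float], candidate_sessions: list[float]) -> str:
    """Map the interval check onto the labels used in the report (assumed mapping)."""
    if ranges_overlap(baseline_sessions, candidate_sessions):
        return "ranges overlap"
    return "ranges separated"


# Session p50s (ms) copied from the deepset/roberta-base-squad2 results.json.
h0 = [99.948, 97.755, 99.535]     # baseline
h2 = [100.327, 99.658, 101.422]   # opset 19
h3 = [100.42, 100.667, 100.984]   # opset 21

print("h2:", confidence_label(h0, h2))
print("h3:", confidence_label(h0, h3))
```

With only three sessions per hypothesis, a span check like this is a coarse noise guard rather than a statistical test, which fits the report's habit of treating small gains as MARGINAL.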
+ + + +
+
❌ Ineffective or Harmful
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h6 | opset 17 + bias_softmax_fusion | | BUILD_FAIL | build failed |
| h7 | opset 17 + layer_norm_fusion | | BUILD_FAIL | build failed |
| h8 | opset 17 + skip_layer_norm_fusion | | BUILD_FAIL | build failed |
| h9 | opset 21 + matmul_transpose + attention_fusion | | BUILD_FAIL | build failed |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | BUILD_FAIL | build failed |
| h11 | opset 17 + gelu_fusion explicit | | BUILD_FAIL | build failed |
| h12 | opset 17 + transpose_optimizer | | BUILD_FAIL | build failed |
+
+ + +
+
⚪ Neutral / Build Fail
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | +0.9% | MARGINAL | ranges overlap |
| h2 | opset 19 | -0.8% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | -1.1% | DISCARD | ranges separated |
| h4 | opset 17 + matmul_transpose_fusion | +1.1% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | +0.9% | MARGINAL | ranges overlap |
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json new file mode 100644 index 000000000..773322212 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/deepset--roberta-base-squad2/results.json @@ -0,0 +1,167 @@ +{ + "model_id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta", + "timestamp": "2026-06-18T02:23:50", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 98.117, + "screen_cv": 0.15088109094244626, + "full_p50s_ms": [ + 99.948, + 97.755, + 99.535 + ], + "median_p50_ms": 99.535, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 98.107, + "screen_cv": 0.14872537127829819, + "full_p50s_ms": [ + 99.112, + 98.16, + 98.593 + ], + "median_p50_ms": 98.593, + "gain_vs_baseline_pct": 0.95, + "verdict": "MARGINAL" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 107.597, + "screen_cv": 0.21958790672602396, + "full_p50s_ms": [ + 100.327, + 99.658, + 101.422 + ], + "median_p50_ms": 100.327, + "gain_vs_baseline_pct": -0.8, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 100.15, + "screen_cv": 0.16429355966050924, + "full_p50s_ms": [ + 100.42, + 100.667, + 100.984 + ], + "median_p50_ms": 100.667, + "gain_vs_baseline_pct": -1.14, + "verdict": "DISCARD" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 97.954, + "screen_cv": 0.14972333952671663, + 
"full_p50s_ms": [ + 98.044, + 99.494, + 98.442 + ], + "median_p50_ms": 98.442, + "gain_vs_baseline_pct": 1.1, + "verdict": "MARGINAL" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 102.402, + "screen_cv": 0.22213433331380247, + "full_p50s_ms": [ + 98.593, + 98.912, + 98.564 + ], + "median_p50_ms": 98.593, + "gain_vs_baseline_pct": 0.95, + "verdict": "MARGINAL" + }, + "h6": { + "status": "BUILD_FAIL", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "build_error": "ion_mask [1, 512] int32\n Output: start_logits\n end_logits\n 📦 Artifact: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h6\\export\n.onnx (474.9 MB)\n⏳ Optimize Optimizing ONNX graph...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h7": { + "status": "BUILD_FAIL", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "build_error": "put: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h7\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h8": { + "status": "BUILD_FAIL", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "build_error": "put: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h8\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h9": { + "status": "BUILD_FAIL", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "build_error": "put: 
\nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h9\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h10": { + "status": "BUILD_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose", + "opset": 17, + "build_error": "ut: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h10\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h11": { + "status": "BUILD_FAIL", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "build_error": "ut: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h11\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h12": { + "status": "BUILD_FAIL", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "build_error": "ut: \nC:\\tmp\\autoconfig-demo\\catalog-gpu-sweep\\deepset--roberta-base-squad2\\h12\n\n════════════════════════════════════════════════════════════\n🎯 Stages\n════════════════════════════════════════════════════════════\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 99.535, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": -1.14, + "feature_gaps": [], + "errors": [ + "h6: BUILD_FAIL", + "h7: BUILD_FAIL", + "h8: BUILD_FAIL", + "h9: BUILD_FAIL", + "h10: BUILD_FAIL", + "h11: BUILD_FAIL", + "h12: BUILD_FAIL" + ] +} diff --git 
a/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html new file mode 100644 index 000000000..bf073f126 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — deepset/tinyroberta-squad2 + + + +

QNN GPU Optimization Report — deepset/tinyroberta-squad2

+
roberta arch · 2026-06-17 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
51.17 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 3 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: deepset/tinyroberta-squad2
Task: question-answering
Arch type: roberta
Baseline opset: 17
EP: qnn
Device: gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h12, axis from -200% to +200%. Per-hypothesis status, verdict, p50, and gain are listed in the full-detail table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|---|---|---|---|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 51.17 ms | [51.17 · 51.24 · 50.41] | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 51.14 ms | [50.52 · 51.14 · 51.37] | +0.1% | MARGINAL | ranges overlap |
| h2 | opset 19 | 19 | not stored | 52.25 ms | [51.06 · 53.10 · 52.25] | -2.1% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | 21 | not stored | 52.54 ms | [52.54 · 55.41 · 51.71] | -2.7% | DISCARD | ranges separated |
| h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 50.67 ms | [50.39 · 50.67 · 51.56] | +1.0% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | 17 | attention_fusion | 51.58 ms | [51.86 · 51.58 · 50.47] | -0.8% | DISCARD | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 51.06 ms | [51.06 · 50.97 · 52.08] | +0.2% | MARGINAL | ranges overlap |
| h7 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 50.63 ms | [50.63 · 50.47 · 51.42] | +1.1% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | | not stored | | | | BENCH_FAIL | bench failed |
| h9 | opset 21 + matmul_transpose + attention_fusion | 21 | matmul_transpose_fusion, attention_fusion | 52.58 ms | [52.58 · 51.76 · 57.06] | -2.8% | DISCARD | ranges separated |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | not stored | | | | BENCH_FAIL | bench failed |
| h11 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 50.50 ms | [50.34 · 50.50 · 51.30] | +1.3% | MARGINAL | ranges overlap |
| h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 51.02 ms | [51.02 · 52.29 · 50.83] | +0.3% | MARGINAL | ranges overlap |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + +
+
❌ Ineffective or Harmful
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h2 | opset 19 | -2.1% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | -2.7% | DISCARD | ranges separated |
| h9 | opset 21 + matmul_transpose + attention_fusion | -2.8% | DISCARD | ranges separated |
+
+ + +
+
⚪ Neutral / Build Fail
| Hypothesis | Label | Gain % | Verdict | Confidence |
|---|---|---|---|---|
| h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | +0.1% | MARGINAL | ranges overlap |
| h4 | opset 17 + matmul_transpose_fusion | +1.0% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | -0.8% | DISCARD | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | +0.2% | MARGINAL | ranges overlap |
| h7 | opset 17 + layer_norm_fusion | +1.1% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | | BENCH_FAIL | bench failed |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | | BENCH_FAIL | bench failed |
| h11 | opset 17 + gelu_fusion explicit | +1.3% | MARGINAL | ranges overlap |
| h12 | opset 17 + transpose_optimizer | +0.3% | MARGINAL | ranges overlap |
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json new file mode 100644 index 000000000..dfa47962b --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/deepset--tinyroberta-squad2/results.json @@ -0,0 +1,219 @@ +{ + "model_id": "deepset/tinyroberta-squad2", + "task": "question-answering", + "model_type": "roberta", + "timestamp": "2026-06-17T23:13:59", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 51.003, + "screen_cv": 0.1682450051957728, + "full_p50s_ms": [ + 51.171, + 51.243, + 50.412 + ], + "median_p50_ms": 51.171, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 53.124, + "screen_cv": 0.19249303516301483, + "full_p50s_ms": [ + 50.523, + 51.142, + 51.373 + ], + "median_p50_ms": 51.142, + "gain_vs_baseline_pct": 0.06, + "verdict": "MARGINAL" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 52.106, + "screen_cv": 0.15598971327678193, + "full_p50s_ms": [ + 51.063, + 53.096, + 52.254 + ], + "median_p50_ms": 52.254, + "gain_vs_baseline_pct": -2.12, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 52.129, + "screen_cv": 0.20215235281705002, + "full_p50s_ms": [ + 52.541, + 55.415, + 51.708 + ], + "median_p50_ms": 52.541, + "gain_vs_baseline_pct": -2.68, + "verdict": "DISCARD" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 50.249, + "screen_cv": 0.1313060956436944, + "full_p50s_ms": [ + 
50.388, + 50.669, + 51.56 + ], + "median_p50_ms": 50.669, + "gain_vs_baseline_pct": 0.98, + "verdict": "MARGINAL" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 52.692, + "screen_cv": 0.18913687087223865, + "full_p50s_ms": [ + 51.86, + 51.58, + 50.474 + ], + "median_p50_ms": 51.58, + "gain_vs_baseline_pct": -0.8, + "verdict": "DISCARD" + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 51.657, + "screen_cv": 0.15649379561337282, + "full_p50s_ms": [ + 51.058, + 50.966, + 52.08 + ], + "median_p50_ms": 51.058, + "gain_vs_baseline_pct": 0.22, + "verdict": "MARGINAL" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 50.528, + "screen_cv": 0.14144632678910704, + "full_p50s_ms": [ + 50.635, + 50.467, + 51.424 + ], + "median_p50_ms": 50.635, + "gain_vs_baseline_pct": 1.05, + "verdict": "MARGINAL" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 51.784, + "screen_cv": 0.1952340491271435, + "full_p50s_ms": [ + 52.576, + 51.761, + 57.061 + ], + "median_p50_ms": 52.576, + "gain_vs_baseline_pct": -2.75, + "verdict": "DISCARD" + }, + "h10": { + "status": "BENCH_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 50.985, + "screen_cv": 0.13755025988035696, + "full_p50s_ms": [ + 50.344, + 50.501, + 51.304 + ], + "median_p50_ms": 
50.501, + "gain_vs_baseline_pct": 1.31, + "verdict": "MARGINAL" + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 50.361, + "screen_cv": 0.18784376799507557, + "full_p50s_ms": [ + 51.016, + 52.289, + 50.832 + ], + "median_p50_ms": 51.016, + "gain_vs_baseline_pct": 0.3, + "verdict": "MARGINAL" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 51.171, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": -2.68, + "feature_gaps": [], + "errors": [ + "h8: screen bench failed", + "h10: screen bench failed" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html new file mode 100644 index 000000000..99b67bcf8 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — facebook/dinov2-small + + + +

QNN GPU Optimization Report — facebook/dinov2-small

+
dinov2 arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+16.7%
+
Champion: h12
+
+
+
Baseline → Champion ms
+
26.37 ms → 21.98 ms
+
Latency reduction: 4.39 ms
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
h12
+
opset 17 + transpose_optimizer
+
+
+
Total experiments
+
13
+
7 KEEP / 0 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID | facebook/dinov2-small
Task | image-feature-extraction
Arch type | dinov2
Baseline opset | 17
EP | qnn
Device | gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) per hypothesis, h0 to h12; axis from -200% to 200%. Hover text gives status, verdict, p50, and gain for each hypothesis, matching the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 26.37 ms | [23.28 · 26.70 · 26.37] | +0.0% | BASELINE | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 24.91 ms | [24.91 · 25.30 · 24.55 · 22.38 · 21.70] | +6.9% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm
h2 | opset 19 | 19 | not stored | 25.42 ms | [26.47 · 24.46 · 25.42] | +3.6% | MARGINAL | ranges overlap
h3 | opset 21 (tests gpu-006) | 21 | not stored | 26.05 ms | [24.24 · 26.17 · 26.05] | +1.2% | MARGINAL | ranges overlap
h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 24.14 ms | [24.14 · 24.63 · 23.45 · 21.29 · 23.90] | +9.4% | KEEP_CONFIRMED | 5/5 sessions confirm
h5 | opset 17 + attention_fusion | 17 | attention_fusion | 23.59 ms | [23.59 · 23.39 · 25.55 · 22.78 · 23.18] | +11.3% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm
h6 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 24.69 ms | [24.69 · 24.81 · 24.67 · 21.98 · 22.55] | +6.5% | KEEP_CONFIRMED | 5/5 sessions confirm
h7 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 25.30 ms | [26.77 · 25.30 · 23.94] | +4.1% | MARGINAL | ranges overlap
h8 | opset 17 + skip_layer_norm_fusion | | not stored | | | | BENCH_FAIL | bench failed
h9 | opset 21 + matmul_transpose + attention_fusion | 21 | matmul_transpose_fusion, attention_fusion | 22.98 ms | [22.98 · 22.44 · 23.15 · 23.13 · 23.92] | +12.3% | KEEP_CONFIRMED | 5/5 sessions confirm
h10 | opset 17 + ln + skip_ln + matmul_transpose | | not stored | | | | BENCH_FAIL | bench failed
h11 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 22.72 ms | [23.02 · 21.82 · 22.72 · 21.74 · 22.10] | +16.2% | KEEP_CONFIRMED | 5/5 sessions confirm
h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 21.98 ms | [22.32 · 21.71 · 21.98 · 21.67 · 23.43] | +16.7% | KEEP_CONFIRMED | 5/5 sessions confirm
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
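The Gain % column can be reproduced from the session p50s. The following is a minimal sketch, assuming the gain is the relative latency reduction of the median session p50 against the baseline median; this assumption matches the `overall_gain_pct` values recorded for facebook/dinov2-small, but the function name `gain_pct` is hypothetical and not taken from `catalog_sweep.py`.

```python
from statistics import median


def gain_pct(baseline_p50_ms: float, session_p50s_ms: list[float]) -> float:
    """Relative latency reduction (%) of the median session p50 vs the baseline.

    Positive means the hypothesis is faster than the baseline.
    Assumed formula: (baseline - median) / baseline * 100, rounded to 2 places.
    """
    med = median(session_p50s_ms)
    return round((baseline_p50_ms - med) / baseline_p50_ms * 100, 2)


# Values copied from the facebook/dinov2-small results.json in this diff.
baseline_ms = 26.372
h12_sessions = [22.323, 21.709, 21.977, 21.667, 23.431]
h1_sessions = [24.915, 25.298, 24.546, 22.377, 21.697]

print(gain_pct(baseline_ms, h12_sessions))  # 16.67, as reported for h12
print(gain_pct(baseline_ms, h1_sessions))   # 6.92, as reported for h1
```

Using the median rather than the mean keeps a single slow session, such as the 23.43 ms outlier for h12, from shifting the reported gain.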
+ + +
+
✅ Effective Optimizations
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | +6.9% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm
h4 | opset 17 + matmul_transpose_fusion | +9.4% | KEEP_CONFIRMED | 5/5 sessions confirm
h5 | opset 17 + attention_fusion | +11.3% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm
h6 | opset 17 + bias_softmax_fusion | +6.5% | KEEP_CONFIRMED | 5/5 sessions confirm
h9 | opset 21 + matmul_transpose + attention_fusion | +12.3% | KEEP_CONFIRMED | 5/5 sessions confirm
h11 | opset 17 + gelu_fusion explicit | +16.2% | KEEP_CONFIRMED | 5/5 sessions confirm
h12 | opset 17 + transpose_optimizer | +16.7% | KEEP_CONFIRMED | 5/5 sessions confirm
+
+ + + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap
h2 | opset 19 | +3.6% | MARGINAL | ranges overlap
h3 | opset 21 (tests gpu-006) | +1.2% | MARGINAL | ranges overlap
h7 | opset 17 + layer_norm_fusion | +4.1% | MARGINAL | ranges overlap
h8 | opset 17 + skip_layer_norm_fusion | | BENCH_FAIL | bench failed
h10 | opset 17 + ln + skip_ln + matmul_transpose | | BENCH_FAIL | bench failed
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json new file mode 100644 index 000000000..b3ada263f --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/facebook--dinov2-small/results.json @@ -0,0 +1,324 @@ +{ + "model_id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-18T09:31:21", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 23.84, + "screen_cv": 0.22185402684563757, + "full_p50s_ms": [ + 23.282, + 26.705, + 26.372 + ], + "median_p50_ms": 26.372, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 23.814, + "screen_cv": 0.2982279331485681, + "full_p50s_ms": [ + 24.915, + 25.298, + 24.546 + ], + "median_p50_ms": 24.915, + "gain_vs_baseline_pct": 5.52, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 22.377, + 21.697 + ], + "all_p50s_ms": [ + 24.915, + 25.298, + 24.546, + 22.377, + 21.697 + ], + "overall_median_p50_ms": 24.546, + "overall_gain_pct": 6.92, + "sessions_above_threshold": 4, + "total_sessions": 5 + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 26.353, + "screen_cv": 0.22968921944370657, + "full_p50s_ms": [ + 26.467, + 24.459, + 25.421 + ], + "median_p50_ms": 25.421, + "gain_vs_baseline_pct": 3.61, + "verdict": "MARGINAL" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 25.534, + "screen_cv": 0.25432756324900135, + "full_p50s_ms": [ + 24.236, + 26.174, + 26.051 + ], + "median_p50_ms": 26.051, + "gain_vs_baseline_pct": 1.22, + "verdict": "MARGINAL" + 
}, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 23.241, + "screen_cv": 0.19310700916483803, + "full_p50s_ms": [ + 24.144, + 24.633, + 23.453 + ], + "median_p50_ms": 24.144, + "gain_vs_baseline_pct": 8.45, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 21.288, + 23.896 + ], + "all_p50s_ms": [ + 24.144, + 24.633, + 23.453, + 21.288, + 23.896 + ], + "overall_median_p50_ms": 23.896, + "overall_gain_pct": 9.39, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 23.289, + "screen_cv": 0.17308600626905404, + "full_p50s_ms": [ + 23.589, + 23.385, + 25.548 + ], + "median_p50_ms": 23.589, + "gain_vs_baseline_pct": 10.55, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 22.777, + 23.185 + ], + "all_p50s_ms": [ + 23.589, + 23.385, + 25.548, + 22.777, + 23.185 + ], + "overall_median_p50_ms": 23.385, + "overall_gain_pct": 11.33, + "sessions_above_threshold": 4, + "total_sessions": 5 + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 23.287, + "screen_cv": 0.22261347532958303, + "full_p50s_ms": [ + 24.686, + 24.808, + 24.666 + ], + "median_p50_ms": 24.686, + "gain_vs_baseline_pct": 6.39, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 21.979, + 22.546 + ], + "all_p50s_ms": [ + 24.686, + 24.808, + 24.666, + 21.979, + 22.546 + ], + "overall_median_p50_ms": 24.666, + "overall_gain_pct": 6.47, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 43.267, + "screen_cv": 0.5303580095684932, + 
"full_p50s_ms": [ + 26.767, + 25.295, + 23.936 + ], + "median_p50_ms": 25.295, + "gain_vs_baseline_pct": 4.08, + "verdict": "MARGINAL" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 23.101, + "screen_cv": 0.1016839097874551, + "full_p50s_ms": [ + 22.982, + 22.438, + 23.149 + ], + "median_p50_ms": 22.982, + "gain_vs_baseline_pct": 12.85, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 23.132, + 23.917 + ], + "all_p50s_ms": [ + 22.982, + 22.438, + 23.149, + 23.132, + 23.917 + ], + "overall_median_p50_ms": 23.132, + "overall_gain_pct": 12.29, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h10": { + "status": "BENCH_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 22.655, + "screen_cv": 0.15378503641580224, + "full_p50s_ms": [ + 23.022, + 21.821, + 22.718 + ], + "median_p50_ms": 22.718, + "gain_vs_baseline_pct": 13.86, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 21.742, + 22.096 + ], + "all_p50s_ms": [ + 23.022, + 21.821, + 22.718, + 21.742, + 22.096 + ], + "overall_median_p50_ms": 22.096, + "overall_gain_pct": 16.21, + "sessions_above_threshold": 5, + "total_sessions": 5 + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 22.066, + "screen_cv": 0.2796156983594671, + "full_p50s_ms": [ + 22.323, + 21.709, + 21.977 + ], + "median_p50_ms": 21.977, + "gain_vs_baseline_pct": 16.67, + "verdict": "KEEP_CONFIRMED", + "confirm_p50s_ms": [ + 21.667, + 23.431 + ], + "all_p50s_ms": [ + 22.323, + 21.709, + 
21.977, + 21.667, + 23.431 + ], + "overall_median_p50_ms": 21.977, + "overall_gain_pct": 16.67, + "sessions_above_threshold": 5, + "total_sessions": 5 + } + }, + "best_hypothesis": "h12", + "baseline_p50_ms": 26.372, + "best_p50_ms": 21.977, + "best_gain_pct": 16.67, + "opset21_gain_pct": 1.22, + "feature_gaps": [], + "errors": [ + "h8: screen bench failed", + "h10: screen bench failed" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html new file mode 100644 index 000000000..ca69b133a --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — microsoft/rad-dino + + + +

QNN GPU Optimization Report — microsoft/rad-dino

+
dinov2 arch · 2026-06-17 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
321.26 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 5 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID | microsoft/rad-dino
Task | image-feature-extraction
Arch type | dinov2
Baseline opset | 17
EP | qnn
Device | gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) per hypothesis, h0 to h12; axis from -200% to 200%. Hover text gives status, verdict, p50, and gain for each hypothesis, matching the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 321.26 ms | [318.51 · 324.03 · 321.26] | +0.0% | BASELINE | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 338.66 ms | [420.76 · 338.66 · 331.40] | -5.4% | DISCARD | ranges separated
h2 | opset 19 | 19 | not stored | 331.58 ms | [337.96 · 331.58 · 328.05] | -3.2% | DISCARD | ranges separated
h3 | opset 21 (tests gpu-006) | 21 | not stored | 329.70 ms | [326.33 · 334.93 · 329.70] | -2.6% | DISCARD | ranges separated
h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 324.80 ms | [324.80 · 324.38 · 326.37] | -1.1% | DISCARD | ranges separated
h5 | opset 17 + attention_fusion | 17 | attention_fusion | 329.01 ms | [321.67 · 332.71 · 329.01] | -2.4% | DISCARD | ranges overlap
h6 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 329.61 ms | [331.67 · 327.97 · 329.61] | -2.6% | DISCARD | ranges separated
h7 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 327.27 ms | [327.27 · 324.65 · 327.86] | -1.9% | DISCARD | ranges separated
h8 | opset 17 + skip_layer_norm_fusion | | not stored | | | | BENCH_FAIL | bench failed
h9 | opset 21 + matmul_transpose + attention_fusion | 21 | matmul_transpose_fusion, attention_fusion | 324.74 ms | [319.64 · 324.74 · 328.56] | -1.1% | DISCARD | ranges overlap
h10 | opset 17 + ln + skip_ln + matmul_transpose | | not stored | | | | BENCH_FAIL | bench failed
h11 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 314.84 ms | [314.87 · 314.84 · 313.88] | +2.0% | MARGINAL | ranges separated
h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 316.97 ms | [320.64 · 316.97 · 311.98] | +1.3% | MARGINAL | ranges overlap
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
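The Confidence column distinguishes "ranges overlap" from "ranges separated". The sketch below shows one reading of that label, assuming it compares the min-max interval of a hypothesis's session p50s with the baseline's interval. That reading is consistent with every row in the microsoft/rad-dino table, but the helper name `ranges_overlap` is hypothetical and the report generator may compute the label differently.

```python
def ranges_overlap(baseline_p50s_ms: list[float], candidate_p50s_ms: list[float]) -> bool:
    """Return True if the [min, max] intervals of the two session sets intersect.

    Two closed intervals intersect exactly when each one starts
    no later than the other one ends.
    """
    return (
        min(candidate_p50s_ms) <= max(baseline_p50s_ms)
        and min(baseline_p50s_ms) <= max(candidate_p50s_ms)
    )


# Session p50s (ms) copied from the microsoft/rad-dino table above.
h0_baseline = [318.51, 324.03, 321.26]
h4 = [324.80, 324.38, 326.37]  # reported as "ranges separated"
h5 = [321.67, 332.71, 329.01]  # reported as "ranges overlap"

for name, sessions in (("h4", h4), ("h5", h5)):
    label = "ranges overlap" if ranges_overlap(h0_baseline, sessions) else "ranges separated"
    print(name, label)
```

Under this reading, a separated range only says that no session of the hypothesis fell inside the baseline's spread; with three sessions per side it is a coarse signal, not a statistical test.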
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | -5.4% | DISCARD | ranges separated
h2 | opset 19 | -3.2% | DISCARD | ranges separated
h3 | opset 21 (tests gpu-006) | -2.6% | DISCARD | ranges separated
h5 | opset 17 + attention_fusion | -2.4% | DISCARD | ranges overlap
h6 | opset 17 + bias_softmax_fusion | -2.6% | DISCARD | ranges separated
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap
h4 | opset 17 + matmul_transpose_fusion | -1.1% | DISCARD | ranges separated
h7 | opset 17 + layer_norm_fusion | -1.9% | DISCARD | ranges separated
h8 | opset 17 + skip_layer_norm_fusion | | BENCH_FAIL | bench failed
h9 | opset 21 + matmul_transpose + attention_fusion | -1.1% | DISCARD | ranges overlap
h10 | opset 17 + ln + skip_ln + matmul_transpose | | BENCH_FAIL | bench failed
h11 | opset 17 + gelu_fusion explicit | +2.0% | MARGINAL | ranges separated
h12 | opset 17 + transpose_optimizer | +1.3% | MARGINAL | ranges overlap
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json new file mode 100644 index 000000000..13721bfac --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/microsoft--rad-dino/results.json @@ -0,0 +1,219 @@ +{ + "model_id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-17T21:21:07", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 311.746, + "screen_cv": 0.07637307295041477, + "full_p50s_ms": [ + 318.508, + 324.031, + 321.256 + ], + "median_p50_ms": 321.256, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 321.096, + "screen_cv": 0.13311595286144953, + "full_p50s_ms": [ + 420.756, + 338.659, + 331.4 + ], + "median_p50_ms": 338.659, + "gain_vs_baseline_pct": -5.42, + "verdict": "DISCARD" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 334.966, + "screen_cv": 0.11764776126532245, + "full_p50s_ms": [ + 337.958, + 331.58, + 328.045 + ], + "median_p50_ms": 331.58, + "gain_vs_baseline_pct": -3.21, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 328.867, + "screen_cv": 0.09647060969936173, + "full_p50s_ms": [ + 326.329, + 334.932, + 329.704 + ], + "median_p50_ms": 329.704, + "gain_vs_baseline_pct": -2.63, + "verdict": "DISCARD" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 321.486, + "screen_cv": 0.1220675239357235, + "full_p50s_ms": [ + 324.795, 
+ 324.376, + 326.365 + ], + "median_p50_ms": 324.795, + "gain_vs_baseline_pct": -1.1, + "verdict": "DISCARD" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 327.99, + "screen_cv": 0.11450958870697277, + "full_p50s_ms": [ + 321.67, + 332.714, + 329.006 + ], + "median_p50_ms": 329.006, + "gain_vs_baseline_pct": -2.41, + "verdict": "DISCARD" + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 326.056, + "screen_cv": 0.11823429104202961, + "full_p50s_ms": [ + 331.665, + 327.97, + 329.607 + ], + "median_p50_ms": 329.607, + "gain_vs_baseline_pct": -2.6, + "verdict": "DISCARD" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 320.193, + "screen_cv": 0.12970302286433497, + "full_p50s_ms": [ + 327.267, + 324.65, + 327.859 + ], + "median_p50_ms": 327.267, + "gain_vs_baseline_pct": -1.87, + "verdict": "DISCARD" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 320.802, + "screen_cv": 0.12593125978017594, + "full_p50s_ms": [ + 319.641, + 324.735, + 328.564 + ], + "median_p50_ms": 324.735, + "gain_vs_baseline_pct": -1.08, + "verdict": "DISCARD" + }, + "h10": { + "status": "BENCH_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 312.178, + "screen_cv": 0.12257750385997732, + "full_p50s_ms": [ + 314.865, + 314.838, + 313.876 + ], + 
"median_p50_ms": 314.838, + "gain_vs_baseline_pct": 2.0, + "verdict": "MARGINAL" + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 314.241, + "screen_cv": 0.12399082233063159, + "full_p50s_ms": [ + 320.636, + 316.974, + 311.984 + ], + "median_p50_ms": 316.974, + "gain_vs_baseline_pct": 1.33, + "verdict": "MARGINAL" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 321.256, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": -2.63, + "feature_gaps": [], + "errors": [ + "h8: screen bench failed", + "h10: screen bench failed" + ] +} diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html new file mode 100644 index 000000000..80f9a441c --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/report.html @@ -0,0 +1,595 @@ + + + + + + QNN GPU Optimization Report — microsoft/resnet-18 + + + +

QNN GPU Optimization Report — microsoft/resnet-18

+
resnet arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+8.4%
+
Champion: h12
+
+
+
Baseline → Champion ms
+
6.82 ms → 6.25 ms
+
Latency reduction: 0.57 ms
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
h12
+
opset 17 + transpose_optimizer
+
+
+
Total experiments
+
13
+
2 KEEP / 2 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID | microsoft/resnet-18
Task | image-classification
Arch type | resnet
Baseline opset | 17
EP | qnn
Device | gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) per hypothesis, h0 to h12; axis from -200% to 200%. Hover text gives status, verdict, p50, and gain for each hypothesis, matching the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 | baseline FP32 (no quant, no compile) | 17 | not stored | 6.82 ms | [6.82 · 6.92 · 6.04] | +0.0% | BASELINE | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 6.64 ms | [6.64 · 6.74 · 6.01] | +2.7% | MARGINAL | ranges overlap
h2 | opset 19 | 19 | not stored | 6.88 ms | [7.39 · 6.56 · 6.88] | -0.9% | DISCARD | ranges overlap
h3 | opset 21 (tests gpu-006) | 21 | not stored | 6.60 ms | [6.60 · 6.78 · 6.41] | +3.3% | MARGINAL | ranges overlap
h4 | opset 17 + matmul_transpose_fusion | 17 | matmul_transpose_fusion | 6.56 ms | [6.56 · 6.53 · 7.55] | +3.8% | MARGINAL | ranges overlap
h5 | opset 17 + attention_fusion | 17 | attention_fusion | 6.59 ms | [6.57 · 7.50 · 6.59] | +3.4% | MARGINAL | ranges overlap
h6 | opset 17 + bias_softmax_fusion | 17 | bias_softmax_fusion | 6.52 ms | [6.11 · 6.52 · 6.68] | +4.5% | MARGINAL | ranges overlap
h7 | opset 17 + layer_norm_fusion | 17 | layer_norm_fusion | 7.11 ms | [7.11 · 6.88 · 7.23] | -4.2% | DISCARD | ranges overlap
h8 | opset 17 + skip_layer_norm_fusion | 17 | skip_layer_norm_fusion | 6.78 ms | [6.34 · 6.78 · 7.06] | +0.7% | MARGINAL | ranges overlap
h9 | opset 21 + matmul_transpose + attention_fusion | 21 | matmul_transpose_fusion, attention_fusion | 7.37 ms | [7.45 · 6.43 · 7.37] | -8.0% | DISCARD | ranges overlap
h10 | opset 17 + ln + skip_ln + matmul_transpose | 17 | layer_norm_fusion, skip_layer_norm_fusion, matmul_transpose_fusion | 6.76 ms | [6.76 · 8.16 · 6.38] | +1.0% | MARGINAL | ranges overlap
h11 | opset 17 + gelu_fusion explicit | 17 | gelu_fusion | 6.39 ms | [6.48 · 6.17 · 6.39 · 6.48 · 6.35] | +6.4% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm
h12 | opset 17 + transpose_optimizer | 17 | transpose_optimizer | 6.25 ms | [6.44 · 6.25 · 6.22 · 8.72 · 6.87] | +5.7% | MARGINAL_UNCONFIRMED | 3/5 sessions confirm
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
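The footnote says the median session p50 is the value used for comparison. The sketch below recomputes the h12 gain under the assumption that gain is defined as (baseline − candidate) / baseline × 100; it is not taken from the report generator, and the helper name `gain_pct` is hypothetical. It uses the three-decimal session values from the accompanying `results.json` (three full sessions followed by two confirmation sessions). Under that assumption, the 6.25 ms median in the h12 row corresponds to the first three sessions only, while the +5.7% gain corresponds to the median over all five.

```python
from statistics import median


def gain_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Latency gain in percent; positive means the candidate is faster.

    Assumed definition, inferred from the numbers in the report.
    """
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Session p50s in ms, copied from results.json for microsoft/resnet-18.
baseline_sessions = [6.823, 6.916, 6.04]                # h0 full sessions
h12_sessions = [6.437, 6.251, 6.224, 8.718, 6.869]      # h12: 3 full + 2 confirm

baseline_p50 = median(baseline_sessions)
median_first_three = median(h12_sessions[:3])   # full sessions only
median_all_five = median(h12_sessions)          # full + confirm sessions

gain_first_three = gain_pct(baseline_p50, median_first_three)
gain_all_five = gain_pct(baseline_p50, median_all_five)

print(f"first three: {median_first_three:.3f} ms, gain {gain_first_three:+.2f}%")
print(f"all five:    {median_all_five:.3f} ms, gain {gain_all_five:+.2f}%")
```

When reading the h12 row, it may help to keep in mind that the latency and the gain shown there appear to come from different session sets.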
+
+ + +
+
✅ Effective Optimizations
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h11 | opset 17 + gelu_fusion explicit | +6.4% | MARGINAL_UNCONFIRMED | 4/5 sessions confirm |
| h12 | opset 17 + transpose_optimizer | +5.7% | MARGINAL_UNCONFIRMED | 3/5 sessions confirm |
+
+ + +
+
❌ Ineffective or Harmful
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h7 | opset 17 + layer_norm_fusion | -4.2% | DISCARD | ranges overlap |
| h9 | opset 21 + matmul_transpose + attention_fusion | -8.0% | DISCARD | ranges overlap |
+
+ + +
+
⚪ Neutral / Build Fail
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h0 | baseline FP32 (no quant, no compile) | +0.0% | BASELINE | ranges overlap |
| h1 | opset 17 explicit | +2.7% | MARGINAL | ranges overlap |
| h2 | opset 19 | -0.9% | DISCARD | ranges overlap |
| h3 | opset 21 (tests gpu-006) | +3.3% | MARGINAL | ranges overlap |
| h4 | opset 17 + matmul_transpose_fusion | +3.8% | MARGINAL | ranges overlap |
| h5 | opset 17 + attention_fusion | +3.4% | MARGINAL | ranges overlap |
| h6 | opset 17 + bias_softmax_fusion | +4.5% | MARGINAL | ranges overlap |
| h8 | opset 17 + skip_layer_norm_fusion | +0.7% | MARGINAL | ranges overlap |
| h10 | opset 17 + ln + skip_ln + matmul_transpose | +1.0% | MARGINAL | ranges overlap |
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json new file mode 100644 index 000000000..400cf7e3a --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/microsoft--resnet-18/results.json @@ -0,0 +1,276 @@ +{ + "model_id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet", + "timestamp": "2026-06-18T09:05:04", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 7.266, + "screen_cv": 0.713047068538398, + "full_p50s_ms": [ + 6.823, + 6.916, + 6.04 + ], + "median_p50_ms": 6.823, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 7.584, + "screen_cv": 0.2866561181434599, + "full_p50s_ms": [ + 6.638, + 6.738, + 6.012 + ], + "median_p50_ms": 6.638, + "gain_vs_baseline_pct": 2.71, + "verdict": "MARGINAL" + }, + "h2": { + "status": "OK", + "label": "opset 19", + "opset": 19, + "extra_optim": null, + "screen_p50_ms": 7.013, + "screen_cv": 0.348353058605447, + "full_p50s_ms": [ + 7.392, + 6.557, + 6.884 + ], + "median_p50_ms": 6.884, + "gain_vs_baseline_pct": -0.89, + "verdict": "DISCARD" + }, + "h3": { + "status": "OK", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "extra_optim": null, + "screen_p50_ms": 7.382, + "screen_cv": 0.38715795177458684, + "full_p50s_ms": [ + 6.6, + 6.775, + 6.409 + ], + "median_p50_ms": 6.6, + "gain_vs_baseline_pct": 3.27, + "verdict": "MARGINAL" + }, + "h4": { + "status": "OK", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "extra_optim": { + "matmul_transpose_fusion": true + }, + "screen_p50_ms": 7.024, + "screen_cv": 0.3361332574031891, + "full_p50s_ms": [ + 6.562, + 6.53, + 7.551 + ], + "median_p50_ms": 6.562, + 
"gain_vs_baseline_pct": 3.83, + "verdict": "MARGINAL" + }, + "h5": { + "status": "OK", + "label": "opset 17 + attention_fusion", + "opset": 17, + "extra_optim": { + "attention_fusion": true + }, + "screen_p50_ms": 7.344, + "screen_cv": 0.6030773420479303, + "full_p50s_ms": [ + 6.574, + 7.504, + 6.594 + ], + "median_p50_ms": 6.594, + "gain_vs_baseline_pct": 3.36, + "verdict": "MARGINAL" + }, + "h6": { + "status": "OK", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "extra_optim": { + "bias_softmax_fusion": true + }, + "screen_p50_ms": 7.052, + "screen_cv": 0.34160521837776514, + "full_p50s_ms": [ + 6.114, + 6.516, + 6.682 + ], + "median_p50_ms": 6.516, + "gain_vs_baseline_pct": 4.5, + "verdict": "MARGINAL" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 6.483, + "screen_cv": 0.9310504396112911, + "full_p50s_ms": [ + 7.109, + 6.881, + 7.234 + ], + "median_p50_ms": 7.109, + "gain_vs_baseline_pct": -4.19, + "verdict": "DISCARD" + }, + "h8": { + "status": "OK", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "extra_optim": { + "skip_layer_norm_fusion": true + }, + "screen_p50_ms": 6.841, + "screen_cv": 0.2964478877357111, + "full_p50s_ms": [ + 6.339, + 6.777, + 7.058 + ], + "median_p50_ms": 6.777, + "gain_vs_baseline_pct": 0.67, + "verdict": "MARGINAL" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 6.98, + "screen_cv": 0.6378223495702006, + "full_p50s_ms": [ + 7.448, + 6.432, + 7.368 + ], + "median_p50_ms": 7.368, + "gain_vs_baseline_pct": -7.99, + "verdict": "DISCARD" + }, + "h10": { + "status": "OK", + "label": "opset 17 + ln + skip_ln + matmul_transpose", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true, + "skip_layer_norm_fusion": true, + 
"matmul_transpose_fusion": true + }, + "screen_p50_ms": 5.897, + "screen_cv": 0.9113108360183143, + "full_p50s_ms": [ + 6.756, + 8.163, + 6.381 + ], + "median_p50_ms": 6.756, + "gain_vs_baseline_pct": 0.98, + "verdict": "MARGINAL" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 6.974, + "screen_cv": 0.8368224835101807, + "full_p50s_ms": [ + 6.482, + 6.175, + 6.386 + ], + "median_p50_ms": 6.386, + "gain_vs_baseline_pct": 6.4, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 6.48, + 6.348 + ], + "all_p50s_ms": [ + 6.482, + 6.175, + 6.386, + 6.48, + 6.348 + ], + "overall_median_p50_ms": 6.386, + "overall_gain_pct": 6.4, + "sessions_above_threshold": 4, + "total_sessions": 5 + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 5.992, + "screen_cv": 0.3384512683578104, + "full_p50s_ms": [ + 6.437, + 6.251, + 6.224 + ], + "median_p50_ms": 6.251, + "gain_vs_baseline_pct": 8.38, + "verdict": "MARGINAL_UNCONFIRMED", + "confirm_p50s_ms": [ + 8.718, + 6.869 + ], + "all_p50s_ms": [ + 6.437, + 6.251, + 6.224, + 8.718, + 6.869 + ], + "overall_median_p50_ms": 6.437, + "overall_gain_pct": 5.66, + "sessions_above_threshold": 3, + "total_sessions": 5 + } + }, + "best_hypothesis": "h12", + "baseline_p50_ms": 6.823, + "best_p50_ms": 6.251, + "best_gain_pct": 8.38, + "opset21_gain_pct": 3.27, + "feature_gaps": [], + "errors": [] +} diff --git a/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html new file mode 100644 index 000000000..b7cfc5441 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html @@ -0,0 +1,577 @@ + + + + + + QNN GPU Optimization Report — 
sentence-transformers/all-MiniLM-L6-v2 + + + +

QNN GPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2

+
bert arch · 2026-06-18 · 13 hypotheses tested
+ +
+
+
Best Gain %
+
+
Champion: —
+
+
+
Baseline → Champion ms
+
27.93 ms → —
+
Latency reduction: —
+
+
+
EP + Device
+
QNN / GPU
+
Baseline opset 17
+
+
+
Champion Config
+
+
+
+
+
Total experiments
+
13
+
0 KEEP / 9 DISCARD
+
+
+ + +
+
Model Characteristics
Model ID: sentence-transformers/all-MiniLM-L6-v2
Task: sentence-similarity
Arch type: bert
Baseline opset: 17
EP: qnn
Device: gpu
+
+ + +
+
Hypothesis Gain Chart
+
+ [Chart: gain vs baseline (%) per hypothesis, h0 to h12; x-axis from -200% to +200%. h2 to h6 report status=BUILD_FAIL and h8, h10 report status=BENCH_FAIL, so they have no p50 or gain. Values for the remaining hypotheses are listed in the "All Hypotheses" section.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
IDConfig LabelOpsetExtra FlagsMedian p50Session p50s (ms)Gain %VerdictConfidence
h0baseline FP32 (no quant, no compile)17not stored27.93 ms[27.93 · 27.93 · 28.94]+0.0%BASELINEranges overlap
h1opset 17 explicit17not stored32.66 ms[32.66 · 44.52 · 31.94]-16.9%DISCARDranges separated
h2opset 1919not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h3opset 21 (tests gpu-006)21not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h4opset 17 + matmul_transpose_fusion17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h5opset 17 + attention_fusion17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h6opset 17 + bias_softmax_fusion17not storedBUILD_FAILBUILD_FAILBUILD_FAILbuild failed
h7opset 17 + layer_norm_fusion17layer_norm_fusion28.55 ms[27.27 · 28.55 · 28.84]-2.2%DISCARDranges overlap
h8opset 17 + skip_layer_norm_fusionnot storedBENCH_FAILbench failed
h9opset 21 + matmul_transpose + attention_fusion21matmul_transpose_fusion, attention_fusion29.08 ms[29.98 · 27.45 · 29.08]-4.1%DISCARDranges overlap
h10opset 17 + ln + skip_ln + matmul_transposenot storedBENCH_FAILbench failed
h11opset 17 + gelu_fusion explicit17gelu_fusion27.38 ms[27.38 · 26.66 · 27.49]+2.0%MARGINALranges separated
h12opset 17 + transpose_optimizer17transpose_optimizer28.86 ms[28.86 · 29.80 · 28.35]-3.3%DISCARDranges overlap
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
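The Confidence column reads either "ranges overlap" or "ranges separated". A minimal sketch of one rule that reproduces those labels for the session values in this table is shown below: compare the [min, max] interval of a hypothesis's session p50s with that of the baseline. This rule is inferred from the reported data, not taken from the report generator, and the function names are hypothetical.

```python
def ranges_overlap(a: list[float], b: list[float]) -> bool:
    """True if the [min, max] intervals of two session lists intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


def confidence_label(baseline: list[float], candidate: list[float]) -> str:
    """Map the interval test onto the two labels used in the report."""
    if ranges_overlap(baseline, candidate):
        return "ranges overlap"
    return "ranges separated"


# Session p50s in ms, copied from the table for all-MiniLM-L6-v2.
baseline = [27.93, 27.93, 28.94]          # h0
sessions = {
    "h1": [32.66, 44.52, 31.94],
    "h7": [27.27, 28.55, 28.84],
    "h11": [27.38, 26.66, 27.49],
    "h12": [28.86, 29.80, 28.35],
}

labels = {name: confidence_label(baseline, s) for name, s in sessions.items()}
for name, label in labels.items():
    print(name, label)
```

Note that separation alone says nothing about direction: h1 is separated because it is consistently slower than the baseline, h11 because it is consistently faster.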
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
HypothesisLabelGain %VerdictConfidence
h1opset 17 explicit-16.9%DISCARDranges separated
h2opset 19BUILD_FAILbuild failed
h3opset 21 (tests gpu-006)BUILD_FAILbuild failed
h4opset 17 + matmul_transpose_fusionBUILD_FAILbuild failed
h5opset 17 + attention_fusionBUILD_FAILbuild failed
h6opset 17 + bias_softmax_fusionBUILD_FAILbuild failed
h7opset 17 + layer_norm_fusion-2.2%DISCARDranges overlap
h9opset 21 + matmul_transpose + attention_fusion-4.1%DISCARDranges overlap
h12opset 17 + transpose_optimizer-3.3%DISCARDranges overlap
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
HypothesisLabelGain %VerdictConfidence
h0baseline FP32 (no quant, no compile)+0.0%BASELINEranges overlap
h8opset 17 + skip_layer_norm_fusionBENCH_FAILbench failed
h10opset 17 + ln + skip_ln + matmul_transposeBENCH_FAILbench failed
h11opset 17 + gelu_fusion explicit+2.0%MARGINALranges separated
+
+ + + + + + diff --git a/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json new file mode 100644 index 000000000..e7ecb0f24 --- /dev/null +++ b/research/autoconfig/catalog-gpu-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json @@ -0,0 +1,168 @@ +{ + "model_id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert", + "timestamp": "2026-06-18T10:30:56", + "ep": "qnn", + "device": "gpu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "label": "baseline FP32 (no quant, no compile)", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 28.907, + "screen_cv": 0.3101670875566472, + "full_p50s_ms": [ + 27.929, + 27.929, + 28.942 + ], + "median_p50_ms": 27.929, + "verdict": "BASELINE" + }, + "h1": { + "status": "OK", + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": null, + "screen_p50_ms": 33.164, + "screen_cv": 0.5540344952357978, + "full_p50s_ms": [ + 32.658, + 44.515, + 31.943 + ], + "median_p50_ms": 32.658, + "gain_vs_baseline_pct": -16.93, + "verdict": "DISCARD" + }, + "h2": { + "status": "BUILD_FAIL", + "label": "opset 19", + "opset": 19, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. \n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h3": { + "status": "BUILD_FAIL", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. 
\n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h4": { + "status": "BUILD_FAIL", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "build_error": " Supported tasks are: feature-extraction, \n fill-mask, multiple-choice, question-answering, \n text-classification, token-classification. \n⏳ Export Exporting to ONNX...Error: Build failed: [Errno 28] No space left on device\n" + }, + "h5": { + "status": "BUILD_FAIL", + "label": "opset 17 + attention_fusion", + "opset": 17, + "build_error": "..Error: Build failed: ONNX Runtime optimization failed: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed. | Pipe: ort_graph | Model info: {'optimization_level': 2, 'disabled_count': 40} | Caused by: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed.\n" + }, + "h6": { + "status": "BUILD_FAIL", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "build_error": "e}Error: Build failed: ONNX Runtime optimization failed: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed. 
| Pipe: ort_graph | Model info: {'optimization_level': 2, 'disabled_count': 37} | Caused by: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Protobuf serialization failed.\n" + }, + "h7": { + "status": "OK", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "extra_optim": { + "layer_norm_fusion": true + }, + "screen_p50_ms": 27.998, + "screen_cv": 0.7119437102650189, + "full_p50s_ms": [ + 27.269, + 28.545, + 28.837 + ], + "median_p50_ms": 28.545, + "gain_vs_baseline_pct": -2.21, + "verdict": "DISCARD" + }, + "h8": { + "status": "BENCH_FAIL", + "label": "opset 17 + skip_layer_norm_fusion" + }, + "h9": { + "status": "OK", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + }, + "screen_p50_ms": 28.12, + "screen_cv": 0.3956258890469417, + "full_p50s_ms": [ + 29.983, + 27.454, + 29.083 + ], + "median_p50_ms": 29.083, + "gain_vs_baseline_pct": -4.13, + "verdict": "DISCARD" + }, + "h10": { + "status": "BENCH_FAIL", + "label": "opset 17 + ln + skip_ln + matmul_transpose" + }, + "h11": { + "status": "OK", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "extra_optim": { + "gelu_fusion": true + }, + "screen_p50_ms": 26.486, + "screen_cv": 0.2676508344030809, + "full_p50s_ms": [ + 27.382, + 26.663, + 27.486 + ], + "median_p50_ms": 27.382, + "gain_vs_baseline_pct": 1.96, + "verdict": "MARGINAL" + }, + "h12": { + "status": "OK", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "extra_optim": { + "transpose_optimizer": true + }, + "screen_p50_ms": 31.432, + "screen_cv": 0.36580554848561975, + "full_p50s_ms": [ + 28.86, + 29.805, + 28.349 + ], + "median_p50_ms": 28.86, + "gain_vs_baseline_pct": -3.33, + "verdict": "DISCARD" + } + }, + "best_hypothesis": null, + "baseline_p50_ms": 27.929, + "best_p50_ms": null, + "best_gain_pct": null, + "opset21_gain_pct": null, + "feature_gaps": [], + "errors": [ + "h2: BUILD_FAIL", + "h3: BUILD_FAIL", + 
"h4: BUILD_FAIL", + "h5: BUILD_FAIL", + "h6: BUILD_FAIL", + "h8: screen bench failed", + "h10: screen bench failed" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/.gitignore b/research/autoconfig/catalog-qnn-sweep/.gitignore new file mode 100644 index 000000000..29bb809b7 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/.gitignore @@ -0,0 +1,3 @@ +# Ignore per-hypothesis build artifacts from validation_sweep.py +# (ONNX model files, calibration data, perf session JSONs) +val_h*/ diff --git a/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json b/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json new file mode 100644 index 000000000..fed23f364 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/BAAI--bge-small-en-v1.5/results_new.json @@ -0,0 +1,31 @@ +{ + "model_id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "hypotheses": { + "h0": { + "description": "opset17 no opts", + "model_file": "quantized.onnx", + "screen_p50_ms": 9.208, + "screen_cv": 0.3059, + "full_p50s_ms": [ + 10.516, + 10.323, + 11.01 + ], + "avg_p50_ms": 10.616 + }, + "h3": { + "description": "opset21 no opts", + "model_file": "quantized.onnx", + "screen_p50_ms": 9.562, + "screen_cv": 0.2575, + "full_p50s_ms": [ + 10.253, + 9.331, + 9.937 + ], + "avg_p50_ms": 9.84 + } + }, + "opset21_gain_pct": 7.31 +} diff --git a/research/autoconfig/catalog-qnn-sweep/SUMMARY.md b/research/autoconfig/catalog-qnn-sweep/SUMMARY.md new file mode 100644 index 000000000..fca9f0439 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/SUMMARY.md @@ -0,0 +1,175 @@ +# QNN NPU Optimization Sweep — Catalog Models + +Generated: 2026-06-22T12:29:01 +EP: `qnn` / device: `npu` +Bench protocol: Phase-A 200 iters (high CV expected on QNN NPU — DVFS), Phase-B 500x3 sessions, 30s cool-down +npu-001 criterion: median >=5% gain AND ranges non-overlapping +npu-006 criterion: Conv% of ops; h4/h5 marked catastrophic if >=5x baseline 
+Effect-size gate: gain reliable only if gain% >= 2×(session-CV) AND ranges separated + +--- + +## Per-Model Results + +| Model | Conv% | Baseline p50 | Best p50 | Best config | Gain% | Reliable? | npu-001? | npu-006 regression? | Notes | +|-------|-------|-------------|----------|-------------|-------|-----------|----------|---------------------|-------| +| `apple/mobilevit-small` | 2% | 5.5 ms | 5.4 ms | h3 (opset 21 (tests npu-001 bypass)) | 2.8% | ⚠️ within noise | neutral | no | none | +| `deepset/roberta-base-squad2` | N/A | 14.9 ms | 14.7 ms | h1 (opset 17 explicit) | 1.5% | N/A | neutral | no | Model timed out at 1466s (before h4); Model timed out at 1466s (before h5) | +| `distilbert/distilbert-base-uncased-finetuned-sst-2-english` | N/A | 19.5 ms | 19.5 ms | h2 (opset 19) | 0.0% | N/A | neutral | no | Model timed out at 1385s (before h5) | +| `facebook/dinov2-small` | N/A | 6.6 ms | 5.0 ms | h3 (opset 21 (tests npu-001 bypass)) | 24.1% | N/A | YES (median) | no | Model timed out at 1333s (before h4); Model timed out at 1333s (before h5) | +| `google/vit-base-patch16-224` | N/A | 9.0 ms | 9.0 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | NO | no | h2: BUILD_FAIL; Model timed out at 1204s (before h4); Model timed out at 1204s ( | +| `hustvl/yolos-small` | 0% | 49.6 ms | 48.6 ms | h3 (opset 21 (tests npu-001 bypass)) | 2.0% | ⚠️ within noise | N/A | no | h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos i | +| `microsoft/resnet-18` | N/A | 1.0 ms | 1.0 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | YES (median) | no | Model timed out at 1560s (before h5) | +| `sentence-transformers/all-MiniLM-L6-v2` | N/A | 5.8 ms | 5.8 ms | h0 (baseline (auto-config, W8A16)) | 0.0% | N/A | neutral | no | Model timed out at 1346s (before h5) | + +## Hypothesis Breakdown per Model + +### apple/mobilevit-small + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | 
+|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 5.0 | 5.5 | 0.093 | OK | — | +| h1 (opset 17 explicit) | 17 | 5.8 | 5.6 | 0.304 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | 19 | 5.8 | 6.6 | 0.120 | OK | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 5.2 | 5.4 | 0.163 | OK_HIGH_CV ⚡DVFS | — | +| h4 (opset 17 + conv fusions) | 17 | 6.7 | 6.5 | 0.181 | OK_HIGH_CV ⚡DVFS | — | +| h5 (opset 21 + conv fusions) | 21 | 6.2 | 6.7 | 0.153 | OK_HIGH_CV ⚡DVFS | — | +| h6 (opset 21 + matmul_transpose_fusion) | 21 | 5.9 | 6.2 | 0.229 | OK_HIGH_CV ⚡DVFS | — | +| h7 (opset 21 + bias_softmax_fusion) | 21 | 4.6 | 6.4 | 0.043 | OK | — | +| h8 (opset 21 + attention_fusion) | 21 | 6.5 | 5.8 | 0.455 | OK_HIGH_CV ⚡DVFS | — | +| h9 (opset 21 + highdimRTR_lowdimRTR) | 21 | 5.7 | 6.5 | 0.190 | OK_HIGH_CV ⚡DVFS | — | +| h10 (opset 17 + conv_add_fusion only) | 17 | 6.7 | 5.9 | 0.188 | OK_HIGH_CV ⚡DVFS | — | + +### deepset/roberta-base-squad2 + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 14.9 | 14.9 | 0.119 | OK | — | +| h1 (opset 17 explicit) | 17 | 14.7 | 14.7 | 0.129 | OK | — | +| h2 (opset 19) | 19 | 15.3 | 14.9 | 0.234 | OK_HIGH_CV ⚡DVFS | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 14.8 | 14.9 | 0.116 | OK | — | +| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? 
| TIMEOUT | — | + +### distilbert/distilbert-base-uncased-finetuned-sst-2-english + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 19.5 | 19.5 | 0.156 | OK_HIGH_CV ⚡DVFS | — | +| h1 (opset 17 explicit) | 17 | 19.7 | 19.5 | 0.272 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | 19 | 19.4 | 19.5 | 0.195 | OK_HIGH_CV ⚡DVFS | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 19.4 | 19.5 | 0.290 | OK_HIGH_CV ⚡DVFS | — | +| h4 (opset 17 + conv fusions) | 17 | 19.4 | 19.6 | 0.237 | OK_HIGH_CV ⚡DVFS | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — | + +### facebook/dinov2-small + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 7.2 | 6.6 | 0.344 | OK_HIGH_CV ⚡DVFS | — | +| h1 (opset 17 explicit) | 17 | 4.9 | 7.2 | 0.457 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | 19 | 7.0 | 7.2 | 1.805 | OK_HIGH_CV ⚡DVFS | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 9.4 | 5.0 | 0.936 | OK_HIGH_CV ⚡DVFS | — | +| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — | + +### google/vit-base-patch16-224 + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 9.2 | 9.0 | 1.289 | OK_HIGH_CV ⚡DVFS | 0.740 | +| h1 (opset 17 explicit) | 17 | 9.7 | 9.3 | 0.743 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | 19 | — | — | ? | BUILD_FAIL | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 11.6 | 10.0 | 2.159 | OK_HIGH_CV ⚡DVFS | — | +| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? 
| TIMEOUT | — | + +### hustvl/yolos-small + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 48.7 | 49.6 | 0.067 | OK | — | +| h1 (opset 17 explicit) | 17 | 66.4 | 65.9 | 0.226 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | ? | — | — | ? | TIMEOUT | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 48.8 | 48.6 | 0.050 | OK | — | +| h4 (opset 17 + conv fusions) | ? | — | — | ? | TIMEOUT | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — | +| h6 (opset 21 + matmul_transpose_fusion) | 21 | 49.0 | 50.0 | 0.048 | OK | — | +| h7 (opset 21 + bias_softmax_fusion) | 21 | 49.0 | 51.6 | 0.062 | OK | — | +| h8 (opset 21 + attention_fusion) | 21 | 51.3 | 49.5 | 0.078 | OK | — | + +### microsoft/resnet-18 + +| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy | +|------------|-------|-----------|-------------------|-----|--------|---------| +| h0 (baseline (auto-config, W8A16)) | 17 | 4.0 | 1.0 | 1.690 | OK_HIGH_CV ⚡DVFS | 0.660 | +| h1 (opset 17 explicit) | 17 | 3.1 | 2.7 | 2.036 | OK_HIGH_CV ⚡DVFS | — | +| h2 (opset 19) | 19 | 4.0 | 1.1 | 1.517 | OK_HIGH_CV ⚡DVFS | — | +| h3 (opset 21 (tests npu-001 bypass)) | 21 | 3.0 | 2.2 | 1.176 | OK_HIGH_CV ⚡DVFS | — | +| h4 (opset 17 + conv fusions) | 17 | 128.1 | 132.3 | 1.405 | OK_HIGH_CV ⚡DVFS | — | +| h5 (opset 21 + conv fusions) | ? | — | — | ? 
| TIMEOUT | — |
+
+### sentence-transformers/all-MiniLM-L6-v2
+
+| Hypothesis | Opset | Screen p50 | Full p50 (median) | CV | Status | Accuracy |
+|------------|-------|-----------|-------------------|-----|--------|---------|
+| h0 (baseline (auto-config, W8A16)) | 17 | 5.9 | 5.8 | 0.222 | OK_HIGH_CV ⚡DVFS | — |
+| h1 (opset 17 explicit) | 17 | 5.9 | 5.9 | 0.999 | OK_HIGH_CV ⚡DVFS | — |
+| h2 (opset 19) | 19 | 5.3 | 6.0 | 0.205 | OK_HIGH_CV ⚡DVFS | — |
+| h3 (opset 21 (tests npu-001 bypass)) | 21 | 6.0 | 5.9 | 1.127 | OK_HIGH_CV ⚡DVFS | — |
+| h4 (opset 17 + conv fusions) | 17 | 5.5 | 6.0 | 0.134 | OK | — |
+| h5 (opset 21 + conv fusions) | ? | — | — | ? | TIMEOUT | — |
+
+---
+
+## Cross-Model Patterns
+
+### npu-001: Does opset 21 bypass help broadly?
+
+- **Helps (2 models):** `facebook/dinov2-small`, `microsoft/resnet-18`
+- **Hurts (1 model):** `google/vit-base-patch16-224`
+- **Neutral (4 models):** `apple/mobilevit-small`, `deepset/roberta-base-squad2`, `distilbert/distilbert-base-uncased-finetuned-sst-2-english`, `sentence-transformers/all-MiniLM-L6-v2`
+- **N/A (1 model):** `hustvl/yolos-small`
+
+> **Finding**: Mixed results (2 help, 1 hurt, 4 neutral). Architecture-dependent. Confirm ORT `kMaxSupportedOpset` version before drawing conclusions.
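The helps/hurts/neutral tally above can be approximated from the per-model tables. The sketch below is a reconstruction under assumptions, not the sweep script's logic: it assumes npu-001 compares h3 (opset 21) against h1 (opset 17 explicit) on "Full p50 (median)", uses the 5% median threshold from the header, and treats a loss of 5% or more as "hurts" (the symmetric threshold is a guess). `gain_pct` and `classify` are hypothetical names, and `hustvl/yolos-small` is left out because the summary marks it N/A.

```python
from collections import Counter


def gain_pct(reference_ms: float, candidate_ms: float) -> float:
    """Latency gain in percent; positive means the candidate is faster."""
    return (reference_ms - candidate_ms) / reference_ms * 100.0


def classify(gain: float, threshold: float = 5.0) -> str:
    """Assumed rule: at least +threshold helps, at most -threshold hurts."""
    if gain >= threshold:
        return "helps"
    if gain <= -threshold:
        return "hurts"
    return "neutral"


# (h1 opset 17, h3 opset 21) "Full p50 (median)" in ms, from the tables above.
full_p50 = {
    "apple/mobilevit-small": (5.6, 5.4),
    "deepset/roberta-base-squad2": (14.7, 14.9),
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english": (19.5, 19.5),
    "facebook/dinov2-small": (7.2, 5.0),
    "google/vit-base-patch16-224": (9.3, 10.0),
    "microsoft/resnet-18": (2.7, 2.2),
    "sentence-transformers/all-MiniLM-L6-v2": (5.9, 5.9),
}

verdicts = {model: classify(gain_pct(h1, h3)) for model, (h1, h3) in full_p50.items()}
tally = Counter(verdicts.values())
print(dict(tally))
```

Because this uses medians only, it ignores the session CV values in the tables; several of the "helps" models have CV well above 0.3, so the tally should be read as a screening signal rather than a confirmed effect.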
+ +### Feature Gaps + +- No feature gaps observed + +### Build / Compatibility Issues + +**`deepset/roberta-base-squad2`** + - Model timed out at 1466s (before h4) + - Model timed out at 1466s (before h5) +**`distilbert/distilbert-base-uncased-finetuned-sst-2-english`** + - Model timed out at 1385s (before h5) +**`facebook/dinov2-small`** + - Model timed out at 1333s (before h4) + - Model timed out at 1333s (before h5) +**`google/vit-base-patch16-224`** + - h2: BUILD_FAIL + - Model timed out at 1204s (before h4) + - Model timed out at 1204s (before h5) +**`hustvl/yolos-small`** + - h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos is 0.1% conv / 99.9% transformer, so conv-fusion and intermediate-opset hypotheses are low expected-value). +**`microsoft/resnet-18`** + - Model timed out at 1560s (before h5) +**`sentence-transformers/all-MiniLM-L6-v2`** + - Model timed out at 1346s (before h5) + +--- + +## Updated Recommendations for `ep_knowledge/qnn_npu.json` + +Based on this cross-architecture sweep: + +- **npu-001**: Broaden scope beyond ConvNext. Architectures that benefit: facebook/dinov2-small, microsoft/resnet-18. Update `scope` field and set `gate1_statistical` confidence accordingly. +- **search_space_rules.opset.recommended_order**: Retain `[21, 17]` as default order. 
+ +### Conv Fusion Findings (h4 vs h1, h5 vs h3) + +- **`apple/mobilevit-small`**: conv-fusions on opset17: -16.0% (5.6→6.5ms); conv-fusions on opset21: -25.3% (5.4→6.7ms) +- **`distilbert/distilbert-base-uncased-finetuned-sst-2-english`**: conv-fusions on opset17: -0.5% (19.5→19.6ms) +- **`microsoft/resnet-18`**: conv-fusions on opset17: -4771.1% (2.7→132.3ms) +- **`sentence-transformers/all-MiniLM-L6-v2`**: conv-fusions on opset17: -1.5% (5.9→6.0ms) diff --git a/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md b/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md new file mode 100644 index 000000000..0dc697d3e --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/VALIDATION_SUMMARY.md @@ -0,0 +1,108 @@ +# Validation Sweep Results — QNN NPU (2026-06-16) + +**Device:** Snapdragon X Elite X1E80100 +**ORT:** onnxruntime-windowsml==1.24.5 +**QNN SDK:** 2.2450.47.0 +**Protocol:** 3 × 500 iters, 30s cool-down, `quantized.onnx` (W8A16), `--no-compile` +**Script:** `validation_sweep.py` — targeted 4-hypothesis sweep (h0/h1/h3/h4) + +## Hypothesis Matrix + +| ID | Config | Purpose | +|----|--------|---------| +| h0 | auto-config baseline (W8A16, opset auto) | baseline reference | +| h1 | opset 17 explicit (W8A16) | npu-001 baseline | +| h3 | opset 21 (W8A16) | **npu-001 test** — does opset21 help? | +| h4 | opset 17 + conv fusions | **npu-006 test** — do conv fusions regress? 
| + +--- + +## Results by Model + +### facebook/dinov2-base (ViT-B DINOv2, image-feature-extraction) + +| Hyp | Median p50 | Sessions (ms) | CV note | +|-----|-----------|---------------|---------| +| h0 auto | 38.68 ms | [38.99, 38.68, 36.26] | stable (stale build artifact) | +| **h1 opset17** | **34.56 ms** | [34.56, 34.67, 33.15] | rock stable | +| **h3 opset21** | **26.23 ms** | [33.00, 26.22, 26.23] | s0 elevated (JIT warmup), s1+s2 stable | +| h4 fusions | 25.92 ms | [26.06, 25.92, 25.87] | rock stable | + +**npu-001: opset21 → +24.1% speedup** `(34.56 → 26.23ms)` +**npu-006: conv fusions → -25% (fusions FASTER, not regression)** — DINOv2 is attention-dominant, few Conv ops to fuse + +--- + +### microsoft/rad-dino (ViT-L DINOv2 medical, image-feature-extraction) + +| Hyp | Median p50 | Sessions (ms) | CV note | +|-----|-----------|---------------|---------| +| **h1 opset17** | **274.98 ms** | [274.98, 274.56, 275.10] | CV=0.009, CPU-deterministic | +| **h3 opset21** | **275.36 ms** | [275.30, 275.36, 275.56] | CV=0.022 | + +**npu-001: -0.1% — NEUTRAL (CPU-bound)** +Model runs entirely on CPU (~275ms). QNN NPU cannot accelerate rad-dino (ViT-L too large or incompatible ops). Opset has no effect when model is CPU-bound. 
+ +--- + +### facebook/dino-vitb16 (plain DINO ViT-B/16, image-feature-extraction) + +| Hyp | Median p50 | Sessions (ms) | CV note | +|-----|-----------|---------------|---------| +| **h1 opset17** | **19.92 ms** | [19.92, 19.97, 19.90] | rock stable | +| **h3 opset21** | **20.07 ms** | [20.20, 20.07, 19.99] | rock stable | +| h4 fusions | 20.12 ms | [20.12, 20.04, 20.41] | rock stable | + +**npu-001: -0.7% — NEUTRAL** ← **critical control** +**npu-006: +1.0% — NEUTRAL** (no Conv layers to fuse, patch-embed Conv fusion is benign) + +--- + +## Cross-Model Summary — npu-001 (opset21 vs opset17) + +| Model | Architecture | opset17 (h1) | opset21 (h3) | Gain | Verdict | +|-------|-------------|-------------|-------------|------|---------| +| facebook/dinov2-small | DINOv2 ViT-S | 7.18 ms* | 4.98 ms* | **+30.6%** | ✅ CONFIRMED | +| facebook/dinov2-base | DINOv2 ViT-B | 34.56 ms | 26.23 ms | **+24.1%** | ✅ CONFIRMED | +| apple/mobilevit-small | Conv+Attn hybrid | 11.72 ms* | 8.62 ms* | **+26.5%** ⚠️ | 🟡 LIKELY (DVFS spike in h1) | +| facebook/dino-vitb16 | plain ViT-B/16 | 19.92 ms | 20.07 ms | **-0.7%** | ❌ NEUTRAL — critical control | +| microsoft/rad-dino | ViT-L DINOv2 | 274.98 ms | 275.36 ms | **-0.1%** | ⬛ CPU-BOUND (untestable) | +| google/vit-base-patch16-224 | plain ViT-B | n/a | n/a | **-7.4%** ⚠️* | ❌ REGRESSION | + +_*Original catalog_qnn_sweep.py data (optimized.onnx, not quantized.onnx — different pipeline)_ + +**Key architectural discriminant:** opset21 consistently helps **DINOv2 family** (+24-31%) but has **zero effect on plain ViT** (dino-vitb16: -0.7%, noise-level). This is NOT a general ViT property. DINOv2-specific op patterns must explain the difference — mechanism TBD. 
+
+---
+
+## Cross-Model Summary — npu-006 (conv fusions)
+
+| Model | Architecture | h1 no-fusions | h4 fusions | Regression | Verdict |
+|-------|-------------|--------------|-----------|------------|---------|
+| microsoft/resnet-18 | Conv-dominant | ~1–4 ms* | 132–135 ms* | **+4900%** 🔥 | ✅ CATASTROPHIC |
+| apple/mobilevit-small | Conv+Attn | ~10–12 ms* | ~10–12 ms* | **≈0%** | 🟢 SAFE |
+| facebook/dinov2-base | DINOv2 ViT-B | 34.56 ms | 25.92 ms | **-25%** (faster) | 🟢 SAFE / beneficial |
+| facebook/dino-vitb16 | plain ViT-B | 19.92 ms | 20.12 ms | **+1.0%** | 🟢 SAFE (neutral) |
+
+_*Original `catalog_qnn_sweep.py` data_
+
+**Conclusion:** In this data, conv fusions regress only the Conv-dominant model (ResNet-18). Attention-dominant models (DINOv2, ViT) are safe, and dinov2-base ran 25% faster with fusions enabled. The hazard appears to scale with Conv op density.
+
+---
+
+## Bugs Found and Fixed in validation_sweep.py
+
+| Bug | Impact | Fix |
+|-----|--------|-----|
+| `bench_screen` parsed `d.get("p50_ms")` instead of `d["latency_ms"]["p50"]` | All hypotheses marked BENCH_FAIL in v1/v2 runs | Fixed to read nested `latency_ms.p50` |
+| Reuse check triggered on any `.onnx` (including truncated `export.onnx`) | h1 was benchmarked on FP32 unoptimized model | Changed to require `quantized.onnx` or `optimized.onnx` |
+| Model file selection preferred `optimized.onnx` over `quantized.onnx` alphabetically | Benchmarked FP32 graph instead of W8A16 quantized | Fixed to explicitly prefer `quantized` > `optimized` > other |
+
+---
+
+## Known Limitations
+
+1. **`--no-compile` throughout**: All runs omit `winml compile` (pre-built QNN context binary). Production use would include compile, which npu-003 suggests adds a further ~1.7x speedup. The npu-001 ratio should hold with compile enabled, but absolute latencies will be lower.
+2. **3 sessions only**: DVFS on the QNN NPU can spike any single session. With only 3 sessions, the median is still skewed if 2 of the 3 sessions spike.
See h3 dinov2-base s0=33ms (warmup effect) vs s1+s2=26ms. +3. **rad-dino untestable**: When a model falls back entirely to CPU, no NPU-related findings can be extracted. The reason for CPU fallback (model size? unsupported ops?) was not investigated. +4. **dinov2-small not re-validated with v2 pipeline**: The original +30.6% result was from `catalog_qnn_sweep.py` using `optimized.onnx`. The v2 pipeline uses `quantized.onnx`. For full comparability, dinov2-small should be re-run with `validation_sweep.py`. diff --git a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json new file mode 100644 index 000000000..72a1a9465 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/champion_qnn_npu.json @@ -0,0 +1,59 @@ +{ + "export": { + "opset_version": 17, + "batch_size": 1, + "export_params": true, + "do_constant_folding": true, + "verbose": false, + "dynamo": false, + "enable_hierarchy_tags": true, + "clean_onnx": false, + "hierarchy_tag_format": "full", + "input_tensors": [ + { + "name": "pixel_values", + "dtype": "float32", + "shape": [ + 1, + 3, + 256, + 256 + ], + "value_range": [ + 0, + 1 + ] + } + ], + "output_tensors": [ + { + "name": "logits" + } + ] + }, + "optim": {}, + "quant": { + "mode": "qdq", + "samples": 10, + "calibration_method": "minmax", + "weight_type": "uint8", + "activation_type": "uint16", + "per_channel": false, + "symmetric": false, + "save_calibration": false, + "distribution": "uniform", + "seed": null, + "calibration_load_path": null, + "calibration_save_path": null, + "op_types_to_quantize": null, + "nodes_to_exclude": null, + "task": "image-classification", + "model_name": "apple/mobilevit-small" + }, + "compile": null, + "loader": { + "task": "image-classification", + "model_class": "AutoModelForImageClassification", + "model_type": "mobilevit" + } +} diff --git 
a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html new file mode 100644 index 000000000..85d8074da --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/report.html @@ -0,0 +1,535 @@ + + + + + + QNN NPU Optimization Report — apple/mobilevit-small + + + +

QNN NPU Optimization Report — apple/mobilevit-small

+
mobilevit arch · 2026-06-22 · 11 hypotheses tested
+ +
+
+
Best Gain %
+
+2.8%
+
Champion: h3 · ⚠ neutral within noise — ship baseline
+
+
+
Baseline → Champion ms
+
5.51 ms → 5.36 ms
+
Latency reduction: 0.15 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h0 (baseline)
+
⚠ neutral within noise — ship baseline
+
+
+
Total experiments
+
11
+
0 KEEP / 8 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: apple/mobilevit-small
Task: image-classification
Arch type: mobilevit
Baseline opset: 17
EP: qnn
Device: npu
Conv%: 2.5%
npu-006 risk: LOW
npu-001 note: neutral
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h10; axes "Hypothesis" and "Gain vs baseline (%)", ticks from -200% to 200%. The plotted values are the Gain % figures listed in the All Hypotheses table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|----|--------------|-------|-------------|------------|-------------------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | 17 | not stored | 5.51 ms | [4.98 · 5.51 · 5.72] | +0.0% | OK | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 5.61 ms | [5.63 · 5.31 · 5.61] | -1.9% | OK_HIGH_CV | ranges overlap |
| h2 | opset 19 | 19 | not stored | 6.59 ms | [6.68 · 6.59 · 5.29] | -19.5% | OK | ranges overlap |
| h3 ★ | opset 21 (tests npu-001 bypass) | 21 | not stored | 5.36 ms | [5.36 · 5.26 · 5.89] | +2.8% | OK_HIGH_CV | ranges overlap |
| h4 | opset 17 + conv fusions | 17 | conv_bn_fusion, conv_add_fusion, conv_activation_fusion | 6.51 ms | [6.43 · 6.84 · 6.51] | -18.2% | OK_HIGH_CV | ranges separated |
| h5 | opset 21 + conv fusions | 21 | conv_bn_fusion, conv_add_fusion, conv_activation_fusion | 6.71 ms | [5.60 · 6.84 · 6.71] | -21.8% | OK_HIGH_CV | ranges overlap |
| h6 | opset 21 + matmul_transpose_fusion | 21 | matmul_transpose_fusion | 6.22 ms | [5.76 · 6.26 · 6.22] | -12.8% | OK_HIGH_CV | ranges separated |
| h7 | opset 21 + bias_softmax_fusion | 21 | bias_softmax_fusion | 6.43 ms | [6.43 · 6.47 · 5.59] | -16.7% | OK | ranges overlap |
| h8 | opset 21 + attention_fusion | 21 | attention_fusion | 5.75 ms | [5.75 · 5.67 · 6.72] | -4.4% | OK_HIGH_CV | ranges overlap |
| h9 | opset 21 + highdimRTR_lowdimRTR | 21 | highdimRTR_lowdimRTR | 6.54 ms | [6.63 · 5.54 · 6.54] | -18.8% | OK_HIGH_CV | ranges overlap |
| h10 | opset 17 + conv_add_fusion only | 17 | conv_add_fusion | 5.88 ms | [5.88 · 6.07 · 5.55] | -6.7% | OK_HIGH_CV | ranges overlap |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
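The Confidence column in this report is consistent with a simple interval test: a hypothesis reads "ranges separated" only when the min-max range of its session p50s does not intersect the baseline's range. A minimal sketch under that assumption; `ranges_separated` is a hypothetical name, and the report generator's actual logic is not shown in this diff:

```python
def ranges_separated(baseline_p50s: list[float], candidate_p50s: list[float]) -> bool:
    """True when the [min, max] intervals of the two session sets do not intersect."""
    return (
        min(candidate_p50s) > max(baseline_p50s)
        or max(candidate_p50s) < min(baseline_p50s)
    )


# Session p50s (ms) from results.json for apple/mobilevit-small.
h0 = [4.976, 5.51, 5.716]
sessions = {
    "h2": [6.678, 6.586, 5.293],
    "h3": [5.355, 5.256, 5.895],
    "h4": [6.43, 6.842, 6.515],
    "h6": [5.762, 6.263, 6.218],
}

for hyp, p50s in sessions.items():
    label = "ranges separated" if ranges_separated(h0, p50s) else "ranges overlap"
    print(hyp, label)
```

With only three sessions per hypothesis this is a coarse check: one outlier session in either set is enough to make the ranges overlap, which is why h2 overlaps the baseline despite a slower median.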
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h2 | opset 19 | -19.5% | OK | ranges overlap |
| h4 | opset 17 + conv fusions | -18.2% | OK_HIGH_CV | ranges separated |
| h5 | opset 21 + conv fusions | -21.8% | OK_HIGH_CV | ranges overlap |
| h6 | opset 21 + matmul_transpose_fusion | -12.8% | OK_HIGH_CV | ranges separated |
| h7 | opset 21 + bias_softmax_fusion | -16.7% | OK | ranges overlap |
| h8 | opset 21 + attention_fusion | -4.4% | OK_HIGH_CV | ranges overlap |
| h9 | opset 21 + highdimRTR_lowdimRTR | -18.8% | OK_HIGH_CV | ranges overlap |
| h10 | opset 17 + conv_add_fusion only | -6.7% | OK_HIGH_CV | ranges overlap |
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
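| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | +0.0% | OK | ranges overlap |
| h1 | opset 17 explicit | -1.9% | OK_HIGH_CV | ranges overlap |
| h3 | opset 21 (tests npu-001 bypass) | +2.8% | OK_HIGH_CV | ranges overlap |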
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json new file mode 100644 index 000000000..8736e8048 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json @@ -0,0 +1,274 @@ +{ + "model_id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit", + "timestamp": "2026-06-22T08:34:17", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "conv_pct": 2.5, + "npu006_risk": false, + "npu006_regression": false, + "hypotheses": { + "h0": { + "status": "OK", + "screen": { + "p50_ms": 5.05, + "cv": 0.0935, + "stable": true + }, + "full": { + "p50s_ms": [ + 4.976, + 5.51, + 5.716 + ], + "median_p50_ms": 5.51 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17, + "extra_optim": {} + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.844, + "cv": 0.3039, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.634, + 5.307, + 5.614 + ], + "median_p50_ms": 5.614 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": {} + }, + "h2": { + "status": "OK", + "screen": { + "p50_ms": 5.81, + "cv": 0.1203, + "stable": true + }, + "full": { + "p50s_ms": [ + 6.678, + 6.586, + 5.293 + ], + "median_p50_ms": 6.586 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19, + "extra_optim": {} + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.218, + "cv": 0.1631, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.355, + 5.256, + 5.895 + ], + "median_p50_ms": 5.355 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21, + "extra_optim": {} + }, + "h4": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 6.73, + "cv": 0.1811, + "stable": false, + "note": 
"DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 6.43, + 6.842, + 6.515 + ], + "median_p50_ms": 6.515 + }, + "accuracy": null, + "label": "opset 17 + conv fusions", + "opset": 17, + "extra_optim": { + "conv_bn_fusion": true, + "conv_add_fusion": true, + "conv_activation_fusion": true + }, + "npu006_expected_regression": false + }, + "h5": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 6.187, + "cv": 0.1526, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.604, + 6.837, + 6.711 + ], + "median_p50_ms": 6.711 + }, + "accuracy": null, + "label": "opset 21 + conv fusions", + "opset": 21, + "extra_optim": { + "conv_bn_fusion": true, + "conv_add_fusion": true, + "conv_activation_fusion": true + }, + "npu006_expected_regression": false + }, + "h6": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.921, + "cv": 0.2292, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.762, + 6.263, + 6.218 + ], + "median_p50_ms": 6.218 + }, + "accuracy": null, + "label": "opset 21 + matmul_transpose_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true + } + }, + "h7": { + "status": "OK", + "screen": { + "p50_ms": 4.618, + "cv": 0.0427, + "stable": true + }, + "full": { + "p50s_ms": [ + 6.431, + 6.47, + 5.586 + ], + "median_p50_ms": 6.431 + }, + "accuracy": null, + "label": "opset 21 + bias_softmax_fusion", + "opset": 21, + "extra_optim": { + "bias_softmax_fusion": true + } + }, + "h8": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 6.451, + "cv": 0.4551, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.75, + 5.675, + 6.718 + ], + "median_p50_ms": 5.75 + }, + "accuracy": null, + "label": "opset 21 + attention_fusion", + "opset": 21, + "extra_optim": { + "attention_fusion": true + } + }, + "h9": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 
5.72, + "cv": 0.1899, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 6.627, + 5.535, + 6.545 + ], + "median_p50_ms": 6.545 + }, + "accuracy": null, + "label": "opset 21 + highdimRTR_lowdimRTR", + "opset": 21, + "extra_optim": { + "highdimRTR_lowdimRTR": true + } + }, + "h10": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 6.726, + "cv": 0.1875, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.879, + 6.067, + 5.55 + ], + "median_p50_ms": 5.879 + }, + "accuracy": null, + "label": "opset 17 + conv_add_fusion only", + "opset": 17, + "extra_optim": { + "conv_add_fusion": true + } + } + }, + "best_hypothesis": "h3", + "baseline_p50_ms": 5.51, + "best_p50_ms": 5.355, + "best_gain_pct": 2.81, + "npu001_generalized": "neutral", + "npu001_ranges_non_overlapping": false, + "feature_gaps": [], + "errors": [], + "best_gain_noise_floor_pct": 14.14, + "best_gain_ranges_separated": false, + "best_gain_reliable": false, + "best_gain_verdict": "NEUTRAL_WITHIN_NOISE" +} diff --git a/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html new file mode 100644 index 000000000..d316dc973 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/report.html @@ -0,0 +1,412 @@ + + + + + + QNN NPU Optimization Report — deepset/roberta-base-squad2 + + + +

QNN NPU Optimization Report — deepset/roberta-base-squad2

+
roberta arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+1.5%
+
Champion: h1
+
+
+
Baseline → Champion ms
+
14.94 ms → 14.72 ms
+
Latency reduction: 0.23 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h1
+
opset 17 + autoconf defaults
+
+
+
Total experiments
+
6
+
0 KEEP / 0 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: deepset/roberta-base-squad2
Task: question-answering
Arch type: roberta
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: neutral
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h5; axes "Hypothesis" and "Gain vs baseline (%)", ticks from -200% to 200%. h4 and h5 timed out and have no bar; the plotted values are the Gain % figures listed in the All Hypotheses table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
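| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|----|--------------|-------|-------------|------------|-------------------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | 17 | not stored | 14.94 ms | [14.94 · 14.71 · 14.97] | +0.0% | OK | ranges overlap |
| h1 ★ | opset 17 explicit | 17 | not stored | 14.72 ms | [14.64 · 14.87 · 14.72] | +1.5% | OK | ranges overlap |
| h2 | opset 19 | 19 | not stored | 14.88 ms | [14.95 · 14.88 · 14.83] | +0.4% | OK_HIGH_CV | ranges overlap |
| h3 | opset 21 (tests npu-001 bypass) | 21 | not stored | 14.92 ms | [16.68 · 14.74 · 14.92] | +0.1% | OK | ranges overlap |
| h4 | opset 17 + conv fusions | | not stored | | | | TIMEOUT | single-point only |
| h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only |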
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
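| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | +0.0% | OK | ranges overlap |
| h1 | opset 17 explicit | +1.5% | OK | ranges overlap |
| h2 | opset 19 | +0.4% | OK_HIGH_CV | ranges overlap |
| h3 | opset 21 (tests npu-001 bypass) | +0.1% | OK | ranges overlap |
| h4 | opset 17 + conv fusions | | TIMEOUT | single-point only |
| h5 | opset 21 + conv fusions | | TIMEOUT | single-point only |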
+
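The summary fields in the accompanying results.json (`best_hypothesis`, `best_p50_ms`, `best_gain_pct`) are consistent with picking the lowest median p50 among the hypotheses that finished and expressing it relative to h0. A minimal sketch under that assumption; the `pick_champion` helper is hypothetical, and the sweep script's own selection code is not shown in this diff:

```python
from typing import Optional


def pick_champion(hypotheses: dict, baseline: str = "h0") -> Optional[dict]:
    """Return the fastest completed hypothesis and its gain over the baseline."""
    # TIMEOUT entries carry no "full" block, so they are skipped here.
    finished = {
        name: h["full"]["median_p50_ms"]
        for name, h in hypotheses.items()
        if "full" in h
    }
    if baseline not in finished:
        return None
    best = min(finished, key=finished.get)
    base_ms, best_ms = finished[baseline], finished[best]
    return {
        "best_hypothesis": best,
        "best_p50_ms": best_ms,
        "best_gain_pct": round((base_ms - best_ms) / base_ms * 100.0, 2),
    }


# Trimmed copy of the deepset/roberta-base-squad2 results.
roberta = {
    "h0": {"status": "OK", "full": {"median_p50_ms": 14.941}},
    "h1": {"status": "OK", "full": {"median_p50_ms": 14.716}},
    "h2": {"status": "OK_HIGH_CV", "full": {"median_p50_ms": 14.877}},
    "h3": {"status": "OK", "full": {"median_p50_ms": 14.919}},
    "h4": {"status": "TIMEOUT"},
    "h5": {"status": "TIMEOUT"},
}
print(pick_champion(roberta))
```

This only selects the fastest median. A noise check like the `best_gain_verdict` field in the mobilevit results would sit on top of the selection and is not modeled here.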
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json new file mode 100644 index 000000000..fa8a959f4 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/deepset--roberta-base-squad2/results.json @@ -0,0 +1,106 @@ +{ + "model_id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta", + "timestamp": "2026-06-13T16:21:18", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "screen": { + "p50_ms": 14.919, + "cv": 0.1188, + "stable": true + }, + "full": { + "p50s_ms": [ + 14.941, + 14.711, + 14.97 + ], + "median_p50_ms": 14.941 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK", + "screen": { + "p50_ms": 14.747, + "cv": 0.1286, + "stable": true + }, + "full": { + "p50s_ms": [ + 14.645, + 14.873, + 14.716 + ], + "median_p50_ms": 14.716 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 15.309, + "cv": 0.2344, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 14.951, + 14.877, + 14.834 + ], + "median_p50_ms": 14.877 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19 + }, + "h3": { + "status": "OK", + "screen": { + "p50_ms": 14.798, + "cv": 0.1159, + "stable": true + }, + "full": { + "p50s_ms": [ + 16.685, + 14.743, + 14.919 + ], + "median_p50_ms": 14.919 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21 + }, + "h4": { + "status": "TIMEOUT", + "label": "opset 17 + conv fusions" + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + } + }, + "best_hypothesis": "h1", + "baseline_p50_ms": 14.941, + "best_p50_ms": 14.716, + "best_gain_pct": 1.51, + "npu001_generalized": "neutral", 
+ "feature_gaps": [], + "errors": [ + "Model timed out at 1466s (before h4)", + "Model timed out at 1466s (before h5)" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html new file mode 100644 index 000000000..9566543c7 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/report.html @@ -0,0 +1,412 @@ + + + + + + QNN NPU Optimization Report — distilbert/distilbert-base-uncased-finetuned-sst-2-english + + + +

QNN NPU Optimization Report — distilbert/distilbert-base-uncased-finetuned-sst-2-english

+
distilbert arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+0.0%
+
Champion: h2
+
+
+
Baseline → Champion ms
+
19.48 ms → 19.48 ms
+
Latency reduction: 0.00 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h2
+
opset 19 + autoconf defaults
+
+
+
Total experiments
+
6
+
0 KEEP / 0 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: distilbert/distilbert-base-uncased-finetuned-sst-2-english
Task: text-classification
Arch type: distilbert
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: neutral
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h5; axes "Hypothesis" and "Gain vs baseline (%)", ticks from -200% to 200%. h5 timed out and has no bar; the plotted values are the Gain % figures listed in the All Hypotheses table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|----|--------------|-------|-------------|------------|-------------------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | 17 | not stored | 19.48 ms | [19.51 · 19.46 · 19.48] | +0.0% | OK_HIGH_CV | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 19.50 ms | [19.50 · 19.42 · 19.52] | -0.1% | OK_HIGH_CV | ranges overlap |
| h2 ★ | opset 19 | 19 | not stored | 19.48 ms | [19.47 · 19.68 · 19.48] | +0.0% | OK_HIGH_CV | ranges overlap |
| h3 | opset 21 (tests npu-001 bypass) | 21 | not stored | 19.50 ms | [19.59 · 19.45 · 19.50] | -0.1% | OK_HIGH_CV | ranges overlap |
| h4 | opset 17 + conv fusions | 17 | not stored | 19.59 ms | [19.59 · 19.63 · 19.50] | -0.6% | OK_HIGH_CV | ranges overlap |
| h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only |
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | +0.0% | OK_HIGH_CV | ranges overlap |
| h1 | opset 17 explicit | -0.1% | OK_HIGH_CV | ranges overlap |
| h2 | opset 19 | +0.0% | OK_HIGH_CV | ranges overlap |
| h3 | opset 21 (tests npu-001 bypass) | -0.1% | OK_HIGH_CV | ranges overlap |
| h4 | opset 17 + conv fusions | -0.6% | OK_HIGH_CV | ranges overlap |
| h5 | opset 21 + conv fusions | | TIMEOUT | single-point only |
+
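Every row in this report carries OK_HIGH_CV, which in the results.json files tracks the screen-phase `stable` flag. The cut-off itself is not stated anywhere in this diff. The sketch below assumes a threshold of 0.15, chosen only because it fits the recorded data (the largest CV marked stable is 0.1286, the smallest marked unstable is 0.1526); both helper names and the use of population standard deviation are assumptions as well:

```python
from statistics import mean, pstdev

# Assumed cut-off, inferred from the recorded screens rather than from source code.
CV_STABLE_MAX = 0.15


def coefficient_of_variation(latencies_ms: list[float]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(latencies_ms) / mean(latencies_ms)


def screen_status(cv: float) -> str:
    """Map a screen CV to the status strings that appear in results.json."""
    return "OK" if cv < CV_STABLE_MAX else "OK_HIGH_CV"


# Screen CVs for distilbert h0..h4, copied from results.json.
screen_cvs = {"h0": 0.156, "h1": 0.2715, "h2": 0.1945, "h3": 0.2903, "h4": 0.237}
for hyp, cv in screen_cvs.items():
    print(hyp, screen_status(cv))

# CV of a made-up latency sample, to show how the input to screen_status is obtained.
print(round(coefficient_of_variation([19.4, 19.5, 19.6]), 4))
```

Note that a high screen CV does not imply noisy full-bench sessions: the distilbert session p50s in the table above agree to within about 0.2 ms even though every screen exceeded the assumed cut-off.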
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json new file mode 100644 index 000000000..9d10a6736 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/distilbert--distilbert-base-uncased-finetuned-sst-2-english/results.json @@ -0,0 +1,124 @@ +{ + "model_id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "task": "text-classification", + "model_type": "distilbert", + "timestamp": "2026-06-13T15:34:52", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.511, + "cv": 0.156, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.512, + 19.459, + 19.48 + ], + "median_p50_ms": 19.48 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.721, + "cv": 0.2715, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.498, + 19.417, + 19.519 + ], + "median_p50_ms": 19.498 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.431, + "cv": 0.1945, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.471, + 19.684, + 19.477 + ], + "median_p50_ms": 19.477 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.443, + "cv": 0.2903, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.591, + 19.447, + 19.505 + ], + "median_p50_ms": 19.505 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 
bypass)", + "opset": 21 + }, + "h4": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.404, + "cv": 0.237, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.588, + 19.628, + 19.502 + ], + "median_p50_ms": 19.588 + }, + "accuracy": null, + "label": "opset 17 + conv fusions", + "opset": 17 + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + } + }, + "best_hypothesis": "h2", + "baseline_p50_ms": 19.48, + "best_p50_ms": 19.477, + "best_gain_pct": 0.02, + "npu001_generalized": "neutral", + "feature_gaps": [], + "errors": [ + "Model timed out at 1385s (before h5)" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json b/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json new file mode 100644 index 000000000..b8c34f0d3 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/facebook--dino-vitb16/results_v2.json @@ -0,0 +1,92 @@ +{ + "model_id": "facebook/dino-vitb16", + "task": "image-feature-extraction", + "model_type": "vit", + "timestamp": "2026-06-16T18:19:46", + "ep": "qnn", + "device": "npu", + "validation_sweep": true, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 20.367, + "cv": 0.2452, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 20.037, + 20.009, + 20.048 + ], + "median_p50_ms": 20.037 + }, + "label": "baseline (auto-config, W8A16)", + "opset": "auto" + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 20.027, + "cv": 0.4804, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 19.924, + 19.975, + 19.897 + ], + "median_p50_ms": 19.924 + }, + "label": "opset 17 explicit", + "opset": 17 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 20.369, + "cv": 0.9085, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + 
}, + "full": { + "p50s_ms": [ + 20.197, + 20.071, + 19.988 + ], + "median_p50_ms": 20.071 + }, + "label": "opset 21 (tests npu-001)", + "opset": 21 + }, + "h4": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 19.871, + "cv": 0.3492, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 20.123, + 20.037, + 20.413 + ], + "median_p50_ms": 20.123 + }, + "label": "opset 17 + conv fusions", + "opset": 17 + } + }, + "errors": [], + "npu001_opset21_vs_17_gain_pct": -0.7, + "npu001_note": "opset21 median 20.071ms vs opset17 19.924ms = -0.7%", + "npu006_conv_fusion_regression_pct": 1.0, + "npu006_note": "conv fusions median 20.123ms vs no-fusion 19.924ms = +1.0%" +} diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json new file mode 100644 index 000000000..416ddce95 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-base/results_v2.json @@ -0,0 +1,92 @@ +{ + "model_id": "facebook/dinov2-base", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-16T16:12:15", + "ep": "qnn", + "device": "npu", + "validation_sweep": true, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 41.108, + "cv": 1.2524, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 38.991, + 38.68, + 36.256 + ], + "median_p50_ms": 38.68 + }, + "label": "baseline (auto-config, W8A16)", + "opset": "auto" + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 36.348, + "cv": 0.7429, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 34.556, + 34.668, + 33.148 + ], + "median_p50_ms": 34.556 + }, + "label": "opset 17 explicit", + "opset": 17 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 32.742, + "cv": 0.8357, + "stable": 
false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 33.001, + 26.224, + 26.227 + ], + "median_p50_ms": 26.227 + }, + "label": "opset 21 (tests npu-001)", + "opset": 21 + }, + "h4": { + "status": "OK", + "screen": { + "p50_ms": 25.83, + "cv": 0.1082, + "stable": true, + "note": null + }, + "full": { + "p50s_ms": [ + 26.064, + 25.921, + 25.872 + ], + "median_p50_ms": 25.921 + }, + "label": "opset 17 + conv fusions", + "opset": 17 + } + }, + "errors": [], + "npu001_opset21_vs_17_gain_pct": 24.1, + "npu001_note": "opset21 median 26.227ms vs opset17 34.556ms = +24.1%", + "npu006_conv_fusion_regression_pct": -25.0, + "npu006_note": "conv fusions median 25.921ms vs no-fusion 34.556ms = -25.0%" +} diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html new file mode 100644 index 000000000..432deb35e --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/report.html @@ -0,0 +1,448 @@ + + + + + + QNN NPU Optimization Report — facebook/dinov2-small + + + +

QNN NPU Optimization Report — facebook/dinov2-small

+
dinov2 arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+24.1%
+
Champion: h3
+
+
+
Baseline → Champion ms
+
6.56 ms → 4.98 ms
+
Latency reduction: 1.58 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h3
+
opset 21 + autoconf defaults
+
+
+
Total experiments
+
6
+
1 KEEP / 2 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: facebook/dinov2-small
Task: image-feature-extraction
Arch type: dinov2
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: True
+
+ + +
+
Hypothesis Gain Chart
+
+ [Bar chart: gain vs baseline (%) for hypotheses h0 to h5; axes "Hypothesis" and "Gain vs baseline (%)", ticks from -200% to 200%. h4 and h5 timed out and have no bar; the plotted values are the Gain % figures listed in the All Hypotheses table.]
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
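| ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence |
|----|--------------|-------|-------------|------------|-------------------|--------|---------|------------|
| h0 | baseline (auto-config, W8A16) | 17 | not stored | 6.56 ms | [6.56 · 6.35 · 12.41] | +0.0% | OK_HIGH_CV | ranges overlap |
| h1 | opset 17 explicit | 17 | not stored | 7.18 ms | [7.18 · 6.39 · 9.44] | -9.4% | OK_HIGH_CV | ranges overlap |
| h2 | opset 19 | 19 | not stored | 7.19 ms | [8.45 · 7.19 · 6.19] | -9.6% | OK_HIGH_CV | ranges overlap |
| h3 ★ | opset 21 (tests npu-001 bypass) | 21 | not stored | 4.98 ms | [4.98 · 4.88 · 6.88] | +24.1% | OK_HIGH_CV | ranges overlap |
| h4 | opset 17 + conv fusions | | not stored | | | | TIMEOUT | single-point only |
| h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only |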
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + +
+
✅ Effective Optimizations
+ + + + + + + + + + + + + + + + + + + + + +
| Hypothesis | Label | Gain % | Verdict | Confidence |
|------------|-------|--------|---------|------------|
| h3 | opset 21 (tests npu-001 bypass) | +24.1% | OK_HIGH_CV | ranges overlap |
+
+ + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | -9.4% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | -9.6% | OK_HIGH_CV | ranges overlap
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | +0.0% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | | TIMEOUT | single-point only
h5 | opset 21 + conv fusions | | TIMEOUT | single-point only
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json new file mode 100644 index 000000000..521b465de --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/facebook--dinov2-small/results.json @@ -0,0 +1,109 @@ +{ + "model_id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-13T14:49:59", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 7.213, + "cv": 0.3437, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 6.561, + 6.353, + 12.408 + ], + "median_p50_ms": 6.561 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 4.897, + "cv": 0.4572, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 7.176, + 6.392, + 9.436 + ], + "median_p50_ms": 7.176 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 6.953, + "cv": 1.8047, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 8.454, + 7.191, + 6.194 + ], + "median_p50_ms": 7.191 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 9.432, + "cv": 0.936, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 4.977, + 4.876, + 6.884 + ], + "median_p50_ms": 4.977 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21 + }, + "h4": { + "status": "TIMEOUT", + "label": "opset 17 + conv fusions" + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv 
fusions" + } + }, + "best_hypothesis": "h3", + "baseline_p50_ms": 6.561, + "best_p50_ms": 4.977, + "best_gain_pct": 24.14, + "npu001_generalized": true, + "feature_gaps": [], + "errors": [ + "Model timed out at 1333s (before h4)", + "Model timed out at 1333s (before h5)" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html new file mode 100644 index 000000000..a66c1b47d --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/report.html @@ -0,0 +1,430 @@ + + + + + + QNN NPU Optimization Report — google/vit-base-patch16-224 + + + +

QNN NPU Optimization Report — google/vit-base-patch16-224

+
vit arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+0.0%
+
Champion: h0
+
+
+
Baseline → Champion ms
+
9.04 ms → 9.04 ms
+
Latency reduction: 0.00 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h0
+
opset 17 + autoconf defaults
+
+
+
Total experiments
+
6
+
0 KEEP / 3 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: google/vit-base-patch16-224
Task: image-classification
Arch type: vit
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: False
+
+ + +
+
Hypothesis Gain Chart
+
+ [Figure: bar chart of gain vs baseline (%) per hypothesis (h0 to h5), axis from -200% to 200%. Per-hypothesis status, median p50, and gain are listed in the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 ★ | baseline (auto-config, W8A16) | 17 | not stored | 9.04 ms | [9.04 · 8.60 · 9.78] | +0.0% | OK_HIGH_CV | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 9.33 ms | [9.33 · 12.72 · 9.06] | -3.2% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | 19 | not stored | BUILD_FAIL | BUILD_FAIL | | BUILD_FAIL | build failed
h3 | opset 21 (tests npu-001 bypass) | 21 | not stored | 10.02 ms | [15.27 · 10.02 · 7.81] | -10.8% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | | not stored | | | | TIMEOUT | single-point only
h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
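
The champion marked in the table can be reproduced with a short selection rule. The sketch below assumes the champion is simply the hypothesis with the lowest median p50 among those that completed a full bench, with `TIMEOUT` and `BUILD_FAIL` entries skipped. `pick_champion` is a hypothetical helper, not a function confirmed to exist in `catalog_sweep.py`; the data mirrors the `hypotheses` block of this model's `results.json`.

```python
def pick_champion(hypotheses):
    """Return the id of the hypothesis with the lowest full-bench median p50.

    Assumption: entries without a "full" block (TIMEOUT, BUILD_FAIL)
    were never measured and cannot be champion.
    """
    measured = {
        hid: h["full"]["median_p50_ms"]
        for hid, h in hypotheses.items()
        if "full" in h
    }
    if not measured:
        raise ValueError("no hypothesis completed a full bench")
    # min() keeps the first entry on ties, so the baseline wins a tie with a later hypothesis
    return min(measured, key=measured.get)


# Trimmed copy of the vit-base-patch16-224 results
vit = {
    "h0": {"status": "OK_HIGH_CV", "full": {"median_p50_ms": 9.039}},
    "h1": {"status": "OK_HIGH_CV", "full": {"median_p50_ms": 9.33}},
    "h2": {"status": "BUILD_FAIL"},
    "h3": {"status": "OK_HIGH_CV", "full": {"median_p50_ms": 10.019}},
    "h4": {"status": "TIMEOUT"},
    "h5": {"status": "TIMEOUT"},
}

print(pick_champion(vit))  # h0
```

This matches the `best_hypothesis` value stored for this model, but the rule ignores measurement noise, so a champion chosen this way can still fall within noise of the baseline.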
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | -3.2% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | | BUILD_FAIL | build failed
h3 | opset 21 (tests npu-001 bypass) | -10.8% | OK_HIGH_CV | ranges overlap
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | +0.0% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | | TIMEOUT | single-point only
h5 | opset 21 + conv fusions | | TIMEOUT | single-point only
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json new file mode 100644 index 000000000..42edb241b --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/google--vit-base-patch16-224/results.json @@ -0,0 +1,96 @@ +{ + "model_id": "google/vit-base-patch16-224", + "task": "image-classification", + "model_type": "vit", + "timestamp": "2026-06-13T14:05:37", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 9.245, + "cv": 1.2887, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 9.039, + 8.6, + 9.779 + ], + "median_p50_ms": 9.039 + }, + "accuracy": 0.74, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 9.656, + "cv": 0.7434, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 9.33, + 12.723, + 9.064 + ], + "median_p50_ms": 9.33 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "BUILD_FAIL", + "label": "opset 19", + "opset": 19, + "build_error": "MzU3NTk3NTM4NmY1YzY0YjEzZjgwNTlkYmY3MWVkNDBkYWEwMGFcXD91c2VyX2lkPXB1YmxpYyZYLVhldC1DYXMtVWlkPXB1YmxpYyZyZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPWlubGluZSUzQitmaWxlbmFtZSUyQSUzRFVURi04JTI3JTI3dHJhaW4tMDAwMDAtb2YtMDAwMTMucGFycXVldCUzQitmaWxlbmFtZSUzRCUyMnRyYWluLTAwMDAwLW9mLTAwMDEzLnBhcnF1ZXQlMjIlM0IiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkVwb2NoVGltZSI6MTc4MTMzNTIwOH0sIkJ5dGVSYW5nZSI6eyJFeHBlY3RlZEhlYWRlciI6ImJ5dGVzPTQ4NTEzNzYwNC00ODUyMDMxMzkifX19XX0_&Signature=MEUCIQD51-TIZFhcd8Id1yCa5oFvcfXtxBJQLnbeG3PPgDJm5AIgBbqpmbciOJZpxVhunYiYCwhL8FT6ymJ72UKocE3aygs_&Key-Pair-Id=01KAYHXK2CBJSW0YZTMNXK9W1M\n\n" + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 11.564, + 
"cv": 2.1585, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 15.271, + 10.019, + 7.808 + ], + "median_p50_ms": 10.019 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21 + }, + "h4": { + "status": "TIMEOUT", + "label": "opset 17 + conv fusions" + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + } + }, + "best_hypothesis": "h0", + "baseline_p50_ms": 9.039, + "best_p50_ms": 9.039, + "best_gain_pct": 0.0, + "npu001_generalized": false, + "feature_gaps": [], + "errors": [ + "h2: BUILD_FAIL", + "Model timed out at 1204s (before h4)", + "Model timed out at 1204s (before h5)" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json new file mode 100644 index 000000000..3e73b6c4f --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/champion_qnn_npu.json @@ -0,0 +1,62 @@ +{ + "export": { + "opset_version": 17, + "batch_size": 1, + "export_params": true, + "do_constant_folding": true, + "verbose": false, + "dynamo": false, + "enable_hierarchy_tags": true, + "clean_onnx": false, + "hierarchy_tag_format": "full", + "input_tensors": [ + { + "name": "pixel_values", + "dtype": "float32", + "shape": [ + 1, + 3, + 512, + 864 + ], + "value_range": [ + 0, + 1 + ] + } + ], + "output_tensors": [ + { + "name": "logits" + }, + { + "name": "pred_boxes" + } + ] + }, + "optim": {}, + "quant": { + "mode": "qdq", + "samples": 10, + "calibration_method": "minmax", + "weight_type": "uint8", + "activation_type": "uint16", + "per_channel": false, + "symmetric": false, + "save_calibration": false, + "distribution": "uniform", + "seed": null, + "calibration_load_path": null, + "calibration_save_path": null, + "op_types_to_quantize": null, + "nodes_to_exclude": null, + "task": "object-detection", + "model_name": "hustvl/yolos-small" + 
}, + "compile": null, + "loader": { + "task": "object-detection", + "model_class": "AutoModelForObjectDetection", + "model_type": "yolos" + } +} diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html new file mode 100644 index 000000000..c9422c1ad --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/report.html @@ -0,0 +1,493 @@ + + + + + + QNN NPU Optimization Report — hustvl/yolos-small + + + +

QNN NPU Optimization Report — hustvl/yolos-small

+
auto arch · 2026-06-22 · 9 hypotheses tested
+ +
+
+
Best Gain %
+
+2.0%
+
Champion: h3 · ⚠ neutral within noise — ship baseline
+
+
+
Baseline → Champion ms
+
49.60 ms → 48.60 ms
+
Latency reduction: 0.99 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h0 (baseline)
+
⚠ neutral within noise — ship baseline
+
+
+
Total experiments
+
9
+
0 KEEP / 2 DISCARD
+
+
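
The "neutral within noise" note on the cards above can be checked against the `paired_ab` block stored for h3 in this model's `results.json`. The sketch below recomputes the mean gain and a 95% half-width using a normal approximation (z = 1.96) over the sample standard deviation. That formula is an assumption: it reproduces the stored `mean_gain_pct` and `ci_half_95`, but the sweep script's actual method is not shown in this diff. `paired_ab_summary` is a hypothetical name.

```python
from math import sqrt
from statistics import mean, stdev


def paired_ab_summary(gains_pct, z=1.96):
    """Mean paired gain and the half-width of its confidence interval.

    Assumption: normal approximation, half = z * s / sqrt(n), where s is
    the sample standard deviation of the per-pair gains.
    """
    m = mean(gains_pct)
    half = z * stdev(gains_pct) / sqrt(len(gains_pct))
    return m, half


# Per-pair gains (%) recorded for h3 (opset 21) vs baseline on hustvl/yolos-small
gains = [-0.39, -1.85, -0.05, -0.42, -0.16, -0.31, -0.73, -0.22]

m, half = paired_ab_summary(gains)
print(round(m, 2), round(half, 2))  # -0.52 0.4
```

Under this reading the paired runs show a small mean loss for h3, while the three-session sweep shows a +2.0% gain, which is consistent with treating the result as noise and shipping the baseline.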
+ + +
+
Model Characteristics
+ + +
Model ID: hustvl/yolos-small
Task: object-detection
Arch type: auto
Baseline opset: 17
EP: qnn
Device: npu
Conv%: 0.1%
npu-006 risk: LOW
npu-001 note: N/A (high-CV opset17 reference)
+
+ + +
+
Hypothesis Gain Chart
+
+ [Figure: bar chart of gain vs baseline (%) per hypothesis (h0 to h8), axis from -200% to 200%. Per-hypothesis status, median p50, and gain are listed in the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | 17 | not stored | 49.60 ms | [50.72 · 49.32 · 49.60] | +0.0% | OK | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 65.89 ms | [67.96 · 65.68 · 65.89] | -32.8% | OK_HIGH_CV | ranges separated
h2 | opset 19 | | not stored | | | | TIMEOUT | single-point only
h3 ★ | opset 21 (tests npu-001 bypass) | 21 | not stored | 48.60 ms | [48.74 · 48.60 · 48.60] | +2.0% | OK | ranges separated
h4 | opset 17 + conv fusions | | not stored | | | | TIMEOUT | single-point only
h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only
h6 | opset 21 + matmul_transpose_fusion | 21 | matmul_transpose_fusion | 49.96 ms | [50.15 · 49.57 · 49.96] | -0.7% | OK | ranges overlap
h7 | opset 21 + bias_softmax_fusion | 21 | bias_softmax_fusion | 51.63 ms | [49.06 · 51.63 · 51.90] | -4.1% | OK | ranges overlap
h8 | opset 21 + attention_fusion | 21 | attention_fusion | 49.53 ms | [51.23 · 49.20 · 49.53] | +0.1% | OK | ranges overlap
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
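
The Confidence column distinguishes "ranges overlap" from "ranges separated". One plausible reading, sketched below, is a min/max interval test on the session p50s of the baseline and the candidate. This is an assumption about how the report derives the label; `ranges_separated` is a hypothetical helper, and the values are the three-decimal session p50s from this model's `results.json`.

```python
def ranges_separated(baseline_p50s_ms, candidate_p50s_ms):
    """True when the [min, max] spans of the two sample sets do not intersect."""
    return (
        max(candidate_p50s_ms) < min(baseline_p50s_ms)
        or max(baseline_p50s_ms) < min(candidate_p50s_ms)
    )


# Session p50s (ms) for hustvl/yolos-small
h0 = [50.715, 49.324, 49.598]  # baseline
h1 = [67.96, 65.68, 65.888]    # opset 17 explicit
h3 = [48.741, 48.599, 48.605]  # opset 21
h6 = [50.151, 49.574, 49.956]  # opset 21 + matmul_transpose_fusion

for name, sessions in {"h1": h1, "h3": h3, "h6": h6}.items():
    label = "ranges separated" if ranges_separated(h0, sessions) else "ranges overlap"
    print(name, label)
```

With only three sessions per hypothesis, separated ranges are a weak signal: h3 is separated from the baseline here, yet the report still classifies its +2.0% gain as neutral within noise.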
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | -32.8% | OK_HIGH_CV | ranges separated
h7 | opset 21 + bias_softmax_fusion | -4.1% | OK | ranges overlap
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | +0.0% | OK | ranges overlap
h2 | opset 19 | | TIMEOUT | single-point only
h3 | opset 21 (tests npu-001 bypass) | +2.0% | OK | ranges separated
h4 | opset 17 + conv fusions | | TIMEOUT | single-point only
h5 | opset 21 + conv fusions | | TIMEOUT | single-point only
h6 | opset 21 + matmul_transpose_fusion | -0.7% | OK | ranges overlap
h8 | opset 21 + attention_fusion | +0.1% | OK | ranges overlap
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json new file mode 100644 index 000000000..2af500c4e --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/hustvl--yolos-small/results.json @@ -0,0 +1,183 @@ +{ + "model_id": "hustvl/yolos-small", + "task": "object-detection", + "model_type": "auto", + "timestamp": "2026-06-22T12:06:44", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK", + "screen": { + "p50_ms": 48.663, + "cv": 0.0666, + "stable": true + }, + "full": { + "p50s_ms": [ + 50.715, + 49.324, + 49.598 + ], + "median_p50_ms": 49.598 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17, + "extra_optim": {} + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 66.447, + "cv": 0.2261, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 67.96, + 65.68, + 65.888 + ], + "median_p50_ms": 65.888 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17, + "extra_optim": {} + }, + "h2": { + "status": "TIMEOUT", + "label": "opset 19" + }, + "h3": { + "status": "OK", + "screen": { + "p50_ms": 48.787, + "cv": 0.0503, + "stable": true + }, + "full": { + "p50s_ms": [ + 48.741, + 48.599, + 48.605 + ], + "median_p50_ms": 48.605 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21, + "extra_optim": {}, + "paired_ab": { + "gains_pct": [ + -0.39, + -1.85, + -0.05, + -0.42, + -0.16, + -0.31, + -0.73, + -0.22 + ], + "mean_gain_pct": -0.52, + "ci_half_95": 0.4, + "n_pairs": 8, + "verdict": "MARGINAL" + } + }, + "h4": { + "status": "TIMEOUT", + "label": "opset 17 + conv fusions" + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + }, + "h6": { + "status": "OK", + "screen": { + "p50_ms": 49.012, + "cv": 0.048, + "stable": true + }, + "full": { + 
"p50s_ms": [ + 50.151, + 49.574, + 49.956 + ], + "median_p50_ms": 49.956 + }, + "accuracy": null, + "label": "opset 21 + matmul_transpose_fusion", + "opset": 21, + "extra_optim": { + "matmul_transpose_fusion": true + } + }, + "h7": { + "status": "OK", + "screen": { + "p50_ms": 49.042, + "cv": 0.0618, + "stable": true + }, + "full": { + "p50s_ms": [ + 49.06, + 51.631, + 51.895 + ], + "median_p50_ms": 51.631 + }, + "accuracy": null, + "label": "opset 21 + bias_softmax_fusion", + "opset": 21, + "extra_optim": { + "bias_softmax_fusion": true + } + }, + "h8": { + "status": "OK", + "screen": { + "p50_ms": 51.292, + "cv": 0.078, + "stable": true + }, + "full": { + "p50s_ms": [ + 51.226, + 49.202, + 49.531 + ], + "median_p50_ms": 49.531 + }, + "accuracy": null, + "label": "opset 21 + attention_fusion", + "opset": 21, + "extra_optim": { + "attention_fusion": true + } + } + }, + "best_hypothesis": "h3", + "baseline_p50_ms": 49.598, + "best_p50_ms": 48.605, + "best_gain_pct": 2.0, + "npu001_generalized": "N/A (high-CV opset17 reference)", + "feature_gaps": [], + "errors": [ + "h2 (opset 19), h4/h5 (conv fusions): not measured — agent deprioritized (yolos is 0.1% conv / 99.9% transformer, so conv-fusion and intermediate-opset hypotheses are low expected-value)." 
+ ], + "conv_pct": 0.1, + "npu006_risk": false, + "npu006_regression": false, + "best_gain_reliable": false, + "best_gain_verdict": "NEUTRAL_WITHIN_NOISE", + "best_gain_noise_floor_pct": 2.95, + "best_gain_ranges_separated": true, + "npu001_ranges_non_overlapping": true +} diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json b/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json new file mode 100644 index 000000000..20cf14836 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/microsoft--rad-dino/results_v2.json @@ -0,0 +1,71 @@ +{ + "model_id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2", + "timestamp": "2026-06-16T16:43:10", + "ep": "qnn", + "device": "npu", + "validation_sweep": true, + "hypotheses": { + "h0": { + "status": "OK", + "screen": { + "p50_ms": 274.506, + "cv": 0.0134, + "stable": true, + "note": null + }, + "full": { + "p50s_ms": [ + 274.727, + 274.621, + 274.949 + ], + "median_p50_ms": 274.727 + }, + "label": "baseline (auto-config, W8A16)", + "opset": "auto" + }, + "h1": { + "status": "OK", + "screen": { + "p50_ms": 274.204, + "cv": 0.0088, + "stable": true, + "note": null + }, + "full": { + "p50s_ms": [ + 274.979, + 274.557, + 275.099 + ], + "median_p50_ms": 274.979 + }, + "label": "opset 17 explicit", + "opset": 17 + }, + "h3": { + "status": "OK", + "screen": { + "p50_ms": 275.269, + "cv": 0.0222, + "stable": true, + "note": null + }, + "full": { + "p50s_ms": [ + 275.298, + 275.355, + 275.564 + ], + "median_p50_ms": 275.355 + }, + "label": "opset 21 (tests npu-001)", + "opset": 21 + } + }, + "errors": [], + "npu001_opset21_vs_17_gain_pct": -0.1, + "npu001_note": "opset21 median 275.355ms vs opset17 274.979ms = -0.1%" +} diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html new file mode 100644 index 000000000..8a6e36f71 --- /dev/null +++ 
b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/report.html @@ -0,0 +1,430 @@ + + + + + + QNN NPU Optimization Report — microsoft/resnet-18 + + + +

QNN NPU Optimization Report — microsoft/resnet-18

+
resnet arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+0.0%
+
Champion: h0
+
+
+
Baseline → Champion ms
+
0.96 ms → 0.96 ms
+
Latency reduction: 0.00 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h0
+
opset 17 + autoconf defaults
+
+
+
Total experiments
+
6
+
0 KEEP / 4 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: microsoft/resnet-18
Task: image-classification
Arch type: resnet
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: True
+
+ + +
+
Hypothesis Gain Chart
+
+ [Figure: bar chart of gain vs baseline (%) per hypothesis (h0 to h5), axis from -200% to 200%. Per-hypothesis status, median p50, and gain are listed in the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 ★ | baseline (auto-config, W8A16) | 17 | not stored | 0.96 ms | [1.31 · 0.95 · 0.96] | +0.0% | OK_HIGH_CV | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 2.72 ms | [0.99 · 4.00 · 2.72] | -181.7% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | 19 | not stored | 1.15 ms | [1.15 · 1.11 · 1.95] | -19.0% | OK_HIGH_CV | ranges overlap
h3 | opset 21 (tests npu-001 bypass) | 21 | not stored | 2.17 ms | [1.05 · 2.17 · 4.11] | -125.6% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | 17 | not stored | 132.30 ms | [132.30 · 134.97 · 130.67] | -13624.1% | OK_HIGH_CV | ranges separated
h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h1 | opset 17 explicit | -181.7% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | -19.0% | OK_HIGH_CV | ranges overlap
h3 | opset 21 (tests npu-001 bypass) | -125.6% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | -13624.1% | OK_HIGH_CV | ranges separated
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | +0.0% | OK_HIGH_CV | ranges overlap
h5 | opset 21 + conv fusions | | TIMEOUT | single-point only
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json new file mode 100644 index 000000000..555428793 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/microsoft--resnet-18/results.json @@ -0,0 +1,124 @@ +{ + "model_id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet", + "timestamp": "2026-06-13T13:38:52", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 4.031, + "cv": 1.6902, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 1.311, + 0.952, + 0.964 + ], + "median_p50_ms": 0.964 + }, + "accuracy": 0.66, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 3.111, + "cv": 2.0363, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 0.99, + 4.003, + 2.716 + ], + "median_p50_ms": 2.716 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 3.992, + "cv": 1.5168, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 1.147, + 1.114, + 1.947 + ], + "median_p50_ms": 1.147 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 2.968, + "cv": 1.1762, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 1.054, + 2.175, + 4.107 + ], + "median_p50_ms": 2.175 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21 + }, + "h4": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 128.104, + "cv": 1.4049, + "stable": false, + "note": "DVFS noise — high CV expected on QNN 
NPU" + }, + "full": { + "p50s_ms": [ + 132.3, + 134.97, + 130.669 + ], + "median_p50_ms": 132.3 + }, + "accuracy": null, + "label": "opset 17 + conv fusions", + "opset": 17 + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + } + }, + "best_hypothesis": "h0", + "baseline_p50_ms": 0.964, + "best_p50_ms": 0.964, + "best_gain_pct": 0.0, + "npu001_generalized": true, + "feature_gaps": [], + "errors": [ + "Model timed out at 1560s (before h5)" + ] +} diff --git a/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json b/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json new file mode 100644 index 000000000..ad2ca7a54 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/rizvandwiki--gender-classification/results_new.json @@ -0,0 +1,31 @@ +{ + "model_id": "rizvandwiki/gender-classification", + "task": "image-classification", + "hypotheses": { + "h0": { + "description": "opset17 no opts", + "model_file": "quantized.onnx", + "screen_p50_ms": 29.602, + "screen_cv": 0.5068, + "full_p50s_ms": [ + 14.151, + 14.942, + 13.889 + ], + "avg_p50_ms": 14.327 + }, + "h3": { + "description": "opset21 no opts", + "model_file": "quantized.onnx", + "screen_p50_ms": 15.056, + "screen_cv": 0.579, + "full_p50s_ms": [ + 13.698, + 13.921, + 13.868 + ], + "avg_p50_ms": 13.829 + } + }, + "opset21_gain_pct": 3.48 +} diff --git a/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html new file mode 100644 index 000000000..edf2604a2 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/report.html @@ -0,0 +1,430 @@ + + + + + + QNN NPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2 + + + +

QNN NPU Optimization Report — sentence-transformers/all-MiniLM-L6-v2

+
bert arch · 2026-06-13 · 6 hypotheses tested
+ +
+
+
Best Gain %
+
+0.0%
+
Champion: h0
+
+
+
Baseline → Champion ms
+
5.81 ms → 5.81 ms
+
Latency reduction: 0.00 ms
+
+
+
EP + Device
+
QNN / NPU
+
Baseline opset 17
+
+
+
Champion Config
+
h0
+
opset 17 + autoconf defaults
+
+
+
Total experiments
+
6
+
0 KEEP / 2 DISCARD
+
+
+ + +
+
Model Characteristics
+ + +
Model ID: sentence-transformers/all-MiniLM-L6-v2
Task: sentence-similarity
Arch type: bert
Baseline opset: 17
EP: qnn
Device: npu
npu-001 note: neutral
+
+ + +
+
Hypothesis Gain Chart
+
+ [Figure: bar chart of gain vs baseline (%) per hypothesis (h0 to h5), axis from -200% to 200%. Per-hypothesis status, median p50, and gain are listed in the All Hypotheses table.] +
+
+ + +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ID | Config Label | Opset | Extra Flags | Median p50 | Session p50s (ms) | Gain % | Verdict | Confidence
h0 ★ | baseline (auto-config, W8A16) | 17 | not stored | 5.81 ms | [5.81 · 5.65 · 5.83] | +0.0% | OK_HIGH_CV | ranges overlap
h1 | opset 17 explicit | 17 | not stored | 5.88 ms | [5.81 · 5.88 · 5.91] | -1.2% | OK_HIGH_CV | ranges overlap
h2 | opset 19 | 19 | not stored | 5.98 ms | [5.98 · 5.80 · 6.02] | -3.0% | OK_HIGH_CV | ranges overlap
h3 | opset 21 (tests npu-001 bypass) | 21 | not stored | 5.85 ms | [6.00 · 5.85 · 5.84] | -0.7% | OK_HIGH_CV | ranges separated
h4 | opset 17 + conv fusions | 17 | not stored | 5.97 ms | [6.06 · 5.97 · 5.47] | -2.7% | OK | ranges overlap
h5 | opset 21 + conv fusions | | not stored | | | | TIMEOUT | single-point only
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ + + +
+
❌ Ineffective or Harmful
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h2 | opset 19 | -3.0% | OK_HIGH_CV | ranges overlap
h4 | opset 17 + conv fusions | -2.7% | OK | ranges overlap
+
+ + +
+
⚪ Neutral / Build Fail
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Hypothesis | Label | Gain % | Verdict | Confidence
h0 | baseline (auto-config, W8A16) | +0.0% | OK_HIGH_CV | ranges overlap
h1 | opset 17 explicit | -1.2% | OK_HIGH_CV | ranges overlap
h3 | opset 21 (tests npu-001 bypass) | -0.7% | OK_HIGH_CV | ranges separated
h5 | opset 21 + conv fusions | | TIMEOUT | single-point only
+
+ + + + + + diff --git a/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json new file mode 100644 index 000000000..67483f470 --- /dev/null +++ b/research/autoconfig/catalog-qnn-sweep/sentence-transformers--all-MiniLM-L6-v2/results.json @@ -0,0 +1,123 @@ +{ + "model_id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert", + "timestamp": "2026-06-13T15:58:36", + "ep": "qnn", + "device": "npu", + "baseline_opset": 17, + "hypotheses": { + "h0": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.934, + "cv": 0.2221, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.808, + 5.647, + 5.829 + ], + "median_p50_ms": 5.808 + }, + "accuracy": null, + "label": "baseline (auto-config, W8A16)", + "opset": 17 + }, + "h1": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.851, + "cv": 0.9986, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.814, + 5.88, + 5.912 + ], + "median_p50_ms": 5.88 + }, + "accuracy": null, + "label": "opset 17 explicit", + "opset": 17 + }, + "h2": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.309, + "cv": 0.2051, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 5.98, + 5.799, + 6.021 + ], + "median_p50_ms": 5.98 + }, + "accuracy": null, + "label": "opset 19", + "opset": 19 + }, + "h3": { + "status": "OK_HIGH_CV", + "screen": { + "p50_ms": 5.959, + "cv": 1.1272, + "stable": false, + "note": "DVFS noise — high CV expected on QNN NPU" + }, + "full": { + "p50s_ms": [ + 6.0, + 5.851, + 5.844 + ], + "median_p50_ms": 5.851 + }, + "accuracy": null, + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21 + }, + "h4": { + "status": "OK", + "screen": { + "p50_ms": 5.478, + "cv": 0.1344, + "stable": 
true + }, + "full": { + "p50s_ms": [ + 6.059, + 5.966, + 5.469 + ], + "median_p50_ms": 5.966 + }, + "accuracy": null, + "label": "opset 17 + conv fusions", + "opset": 17 + }, + "h5": { + "status": "TIMEOUT", + "label": "opset 21 + conv fusions" + } + }, + "best_hypothesis": "h0", + "baseline_p50_ms": 5.808, + "best_p50_ms": 5.808, + "best_gain_pct": 0.0, + "npu001_generalized": "neutral", + "feature_gaps": [], + "errors": [ + "Model timed out at 1346s (before h5)" + ] +} diff --git a/research/autoconfig/docs/agent-design.html b/research/autoconfig/docs/agent-design.html new file mode 100644 index 000000000..ae2a050d4 --- /dev/null +++ b/research/autoconfig/docs/agent-design.html @@ -0,0 +1,426 @@ + + + + + +WinML CLI Agent Design Doc + + + +
+ +
+

WinML CLI Agent Design Doc

+

Status: Draft (research POC) · Updated: 2026-06-21

+ +

Problem statement

+

+Although winml-cli generates a default config, real deployments usually require model- and target-specific trade-offs. +The same default can behave very differently as the objective shifts across latency, accuracy, memory, EP, and device. +

+ + + + + + + + +
Customer signalObserved askDesign implication
Teams / ecosystem feedbackNeed usable CLI/tooling for model analysis, optimization, and benchmarking; WinML CLI introduced as recommended entry.Agent layer must turn primitives into guided workflows, not raw command output only.
Canva / AffinityOne universal model across IHVs, near-real-time performance, minimal per-vendor tuning, better debuggability than black-box behavior.Cross-device confidence + explainable diagnostics are core requirements.
AdobeDML memory footprint and GPU↔CPU fallback ping-pong called out as major blockers.Need EP-behavior visibility, fallback analysis, and actionable optimization guidance.
CyberLinkNeed parity vs native runtimes, one model across silicon, and minimal engineering overhead (auto EP preference).Agent must optimize for portability + performance while reducing expert intervention.
+

+Problem focus in this design: +

+
    +
  1. Default config is not always enough. Users have different constraints (perf, accuracy, memory, cross-EP/device portability), and trade-offs should be made from actual runtime evidence on target devices, sometimes requiring coordinated cross-device tuning.
  2. +
  3. Negative optimization exists on some models. Current measurements show default config can regress specific models; this must be identified systematically and resolved with explainable diagnostics.
  4. +
  5. Optimization behavior changes over time. EP/driver/winml version upgrades can shift optimal settings; the system should capture these shifts and stay up-to-date instead of freezing historical assumptions.
  6. +
  7. Analyzer → optimizer coverage is still incomplete. Not all meaningful fusion opportunities are currently detected and translated into optimization actions; we need to identify which missed fusion opportunities matter most on real models and prioritize them.
  8. +
+ +

Key user scenarios

+ + + + + + + + + + + + + + + + + + + +
ScenarioExample askExpected outcome
Constraint-driven config search"Find a ConvNeXt config with accuracy drop < 10% and memory < 800MB."Feasible config set + ranked recommendation + trade-off table.
Cross-device / cross-EP search"Find a ConvNeXt model/config that runs on 3 NPUs." / "Find a model that can run on all EPs."Portability-aware recommendation, fallback chain, and confidence by device/EP scope.
Model optimization upgrade"Find a model/config faster than current one with accuracy drop < 5%."Candidate replacements with verified speedup, bounded accuracy loss, and migration guidance.
+ +

Execution plan and deliverables

+

+Execution will use built-in model recipe tuning as the driver task. Each iteration improves two things at once: +the skill's reasoning quality and the per-model built-in recipe quality. +

+
    +
  • Run iterative tuning on built-in model set across target EP/device scopes.
  • +
  • Capture gaps in analyzer/optimizer coverage and convert high-impact misses into prioritized feature work.
  • +
  • Feed validated results back to recipes, EP-specific knowledge, and skill evaluation signals.
  • +
+ +
+ + + + + + + + Built-in model recipe tuning + + Run + measure + + Analyze misses / regressions + + Update skill + recipes + requirements + + + + + + iterate with next recipe/model batch + + + + + + + + Improved built-in model recipes + EP-specific optimization experience / knowledge + WinML CLI feature requirements + Self-evaluated skills + +
+ +

winml-cli vs Olive

+

+The overlap is real at the implementation level (both build on the ORT ecosystem), but the optimization philosophy is intentionally different. +winml-cli focuses on a limited set of high-impact, commonly needed tuning levers with regular verification. +Olive is a broader optimization framework with deeper knobs and stronger low-level control. +

+ + + + + + + + + + + +
DimensionOlivewinml-cli (+ agent layer)
Tuning scopeComprehensive optimization surface, including many advanced pass combinationsCurated, high-impact optimization set for common Windows deployment paths
Control depthFine-grained and expert-orientedConstrained and opinionated by design to reduce operational complexity
Verification modelUser-driven validation strategyRegular verification built into flow (baseline checks, eval gates, confidence-aware decisions)
Primary investmentBroad model transformation and deep optimization controlStronger debugging/diagnostic capability, explainability, and safer decision support
Deep model adjustment ownershipFirst-class (advanced pass-level tuning)Often delegated to Olive / Mobius for heavy model surgery
Primary userML/optimization engineerWinApp developer / product engineer prioritizing time-to-ship
Default UX goalMaximum controllabilityMinimum engineering effort with explainable, reliable outcomes
+ +

Design principles

+
    +
  1. Agent for judgment, tools for computation: keep heavy search deterministic and use the agent for diagnosis, prioritization, and explanation.
  2. +
  3. Lifecycle orchestration first: one orchestrator role spans Intake → Insight → Opt Loop → Outcome.
  4. +
  5. Evidence over intuition: all recommendations are backed by validation signals and confidence semantics.
  6. +
  7. Cross-device by default: design for deployment fleets, not only the developer machine.
  8. +
  9. Self-evolving knowledge: findings are promoted through confidence levels before broad reuse.
  10. +
+
+Lifecycle visual reference: autoconfig_diagram.html. +
+ +

Solution

+

+The solution is the lifecycle shown in autoconfig_diagram.html: +an orchestrated pipeline from intake to outcome, with an explicit optimization loop in the middle. +

+ +

Diagram walkthrough

+
    +
  1. Phase 0 · Intake: establish baseline + correctness contract, and load resume state.
  2. +
  3. Phase 1 · Insight: collect profiling/analyzer/graph evidence and generate hypothesis_pool.
  4. +
  5. Phase 2 · Opt Loop: Explorer → Optimizer → Reviewer repeatedly evaluate candidate deltas.
  6. +
  7. Phase 3 · Outcome: emit champion config, report, experiment artifacts, and KB draft findings.
  8. +
+

+The key is the loop: hypotheses are not run once in a fixed sequence. Reviewer verdicts feed back into the next Explorer iteration until stop conditions are met (objective reached, queue exhausted, or plateau). +
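The stop conditions named above can be sketched as a small driver loop. This is a minimal illustration, not the code in `autoconfig.py`: `run_opt_loop`, `evaluate`, and the tuple layout of `hypotheses` are hypothetical names, the plateau of 30 is taken from the lifecycle diagram, and the Reviewer feedback that re-ranks the queue is deliberately left out.

```python
import heapq
from typing import Callable, Iterable, Optional, Tuple


def run_opt_loop(
    hypotheses: Iterable[Tuple[int, str]],
    evaluate: Callable[[str], float],
    target_gain: Optional[float] = None,
    plateau: int = 30,
) -> Tuple[Optional[str], float, str]:
    """Pop hypotheses in priority order until a stop condition triggers.

    hypotheses: (priority, name) pairs; lower priority value runs first.
    evaluate:   hypothetical callback returning the fractional gain vs baseline.
    Returns (best_name, best_gain, stop_reason).
    """
    queue = list(hypotheses)
    heapq.heapify(queue)
    best_name, best_gain = None, 0.0
    consecutive_discards = 0

    while queue:
        _, name = heapq.heappop(queue)
        gain = evaluate(name)
        if gain > best_gain:
            # New champion: reset the plateau counter.
            best_name, best_gain = name, gain
            consecutive_discards = 0
        else:
            consecutive_discards += 1

        if target_gain is not None and best_gain >= target_gain:
            return best_name, best_gain, "objective_met"
        if consecutive_discards >= plateau:
            return best_name, best_gain, "plateau"

    return best_name, best_gain, "queue_empty"
```

A real loop would also let the Reviewer push follow-up hypotheses back onto the queue; that feedback path is where it differs from a fixed sequence.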

+ +

Autoconfig positioning

+

+Autoconfig is a sub-tool, not the primary UX entry point. The agent uses it for a targeted sweep over +EP × opset × graph options, then returns an explainable report with a feasible-options comparison table. +Correctness validation (winml eval) is mandatory before any recommendation. +

+ +

Loop v3 vs agent layer

+

Autoconfig loop v3 already improved core execution quality (thresholded verdicts, early exit, crash-resume, KB-guided pruning, DVFS-aware handling). The agent layer still adds capabilities the loop alone does not provide:

+
    +
  1. Architecture-aware reasoning: explain why a hypothesis exists for this model, not only run it.
  2. +
  3. Failure explanation: convert DISCARD/failure traces into actionable diagnosis.
  4. +
  5. Cross-device confidence: reason about deployment behavior beyond the local machine.
  6. +
  7. Adaptive strategy: stop/reprioritize based on evidence trajectory, not only fixed counters.
  8. +
  9. Knowledge narration: present promoted findings in developer-readable form, not just raw artifacts.
  10. +
+ +

Input

+
+ 👤 + User input + — Model ID + Target EP + Objective: + accuracy-primary + latency-primary + Pareto + + optional budget / accuracy floor +
+ +

Output

+ + + + + + + + + +
OutputSource in autoconfig_diagramValue
Champion configconfig_<ep>_optimal.jsonDirectly consumable best-known config
HTML benchmark report + comparison tablereport.html with experiment chart/tableExplainable recommendation and tradeoff visibility
Experiment artifactsexperiments/<n>/, plus loop telemetry in results.tsv and session.jsonAudit trail, reproducibility, crash-resume continuity
KB draft entryep_knowledge/<ep>.json with new entries marked status="draft"Feeds confidence-gated knowledge evolution
Feature requirementsIssue references for capability gaps (e.g. fused-op diagnostics, DVFS-aware perf)Turns findings into product backlog action
+ +

Roles and responsibilities

+ + + + + + + + +
RoleResponsibilityKey artifacts
OrchestratorControls phase transitions, loop gates, stop conditions, and resume behaviorsession.json, run state, final synthesis
ExplorerBuilds and ranks candidate experiments from hypothesis_pool under KB constraintsskip_set, priority_queue, candidate deltas
OptimizerExecutes build/perf/eval for each candidate and records measurement evidenceperf/eval logs, results.tsv, experiment folders
ReviewerApplies acceptance policy and returns verdict/suggestions to next loop iterationKEEP/MARGINAL/DISCARD outcomes + rationale
+

+The Insight phase can leverage debug-model as a sub-skill to enrich failure analysis and bottleneck interpretation before hypothesis ranking. +

+ +

Auto-research-inspired policy

+

+This design borrows from auto-research thinking: Explorer expands search and manages both termination and acceptance conditions through explicit policy, not ad-hoc trial-and-error. +

+

Search space definition

+

Explorer currently mutates all major tunable knobs exposed in the winml build path, including:

+
    +
  1. opset version
  2. +
  3. winml optimize options
  4. +
  5. quantization parameters
  6. +
  7. ORT runtime configuration options
  8. +
+

Termination condition

+

The loop stops when all experiments considered worth exploring have been executed (or when global stop conditions trigger).

+

Acceptance condition

+

+An experiment is accepted only when performance, accuracy, and memory all satisfy user requirements, and the observed performance gain is stable rather than noise. +
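The "stable rather than noise" part of this condition could be checked with the thresholded verdict shown in the lifecycle diagram (`threshold = max(1%, 2.0 × CV)`, KEEP above 1.5× the threshold, MARGINAL between 1× and 1.5×). The sketch below is one reading of that policy; the function name, the parameter names, and the exact handling of the boundaries are assumptions, and the accuracy and memory gates are not modelled.

```python
def throughput_verdict(
    baseline_p50_ms: float,
    candidate_p50_ms: float,
    screen_cv: float,
    floor: float = 0.01,        # 1% minimum improvement
    cv_multiplier: float = 2.0,  # threshold scales with measured noise
    keep_factor: float = 1.5,    # KEEP needs 1.5x the threshold
) -> str:
    """Classify a latency delta as KEEP, MARGINAL, or DISCARD."""
    if baseline_p50_ms <= 0:
        raise ValueError("baseline_p50_ms must be positive")

    threshold = max(floor, cv_multiplier * screen_cv)
    # Positive improvement means the candidate is faster than the baseline.
    improvement = (baseline_p50_ms - candidate_p50_ms) / baseline_p50_ms

    if improvement > keep_factor * threshold:
        return "KEEP"
    if improvement > threshold:
        return "MARGINAL"
    return "DISCARD"
```

With a screen CV of 0.02 the threshold is 4%, so a 5% improvement is only MARGINAL and a 10% improvement is a KEEP. A noisy run with a CV of 0.2 would need more than a 40% gain before it counts at all.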

+ +

How it works

+

Lifecycle orchestration

+

+The orchestrator controls phase transitions and loop gates. Explorer/Optimizer/Reviewer execute the optimization loop, +while outcome synthesis consolidates recommendation and evidence into final outputs. +

+
    +
  • Phase governance: enforce intake prerequisites (baseline + correctness contract) before optimization starts.
  • +
  • Loop governance: drive priority_queue consumption, enforce stop conditions (objective met, plateau, queue empty), and keep run state resumable via session.json.
  • +
  • Decision governance: ensure each recommendation is backed by benchmark evidence and clear verdict logic, then package it into operator-friendly outputs.
  • +
  • Reliability governance: preserve crash-resume semantics and avoid losing completed experiments when long sweeps are interrupted.
  • +
+

+In practice, Orchestrator is the control plane and Explorer/Optimizer/Reviewer are the execution plane. This separation keeps compute deterministic while allowing higher-level strategy updates without rewriting bench primitives. +
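The crash-resume behavior described under reliability governance usually relies on an atomic file replace, so an interrupted run never leaves a half-written `session.json`. The following is a generic sketch of that technique under stated assumptions: `save_session_atomic` and `load_session` are hypothetical helpers, and the only field name taken from the diagram is `completed_iters`.

```python
import json
import os
import tempfile
from pathlib import Path


def save_session_atomic(path, state: dict) -> None:
    """Write state to a temp file in the same directory, then swap it in."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes reach disk before the swap
        os.replace(tmp_name, path)  # atomic on the same volume
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_session(path) -> dict:
    """Return the saved state, or an empty session if none exists yet."""
    path = Path(path)
    if not path.exists():
        return {"completed_iters": []}
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `save_session_atomic` after every experiment means a restart can skip everything listed in `completed_iters` and continue from the next hypothesis.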

+ +

Cross-device integration

+

+From cross-device-design.html: treat winml serve Phase 0 endpoints as distributed workers, +optimize with a joint objective across devices, and add a device axis to portability confidence. +

+
    +
  • Execution model: each device runs winml serve as a worker; orchestrator fans out build/perf/eval calls and aggregates results centrally.
  • +
  • Objective model: replace single-host “best local config” with a weighted multi-device objective (latency/accuracy/coverage by deployment mix).
  • +
  • Portability model: mark findings with device scope (local-only vs cross-device stable) so recommendations can express confidence per hardware tier.
  • +
  • Operational model: generate fallback chains (for example QNN → DML → CPU) when a single universal winner is not feasible.
  • +
+

+This directly addresses customer requests for “one model across IHVs” and reduced manual per-vendor tuning by shifting complexity from app teams into orchestration logic. +
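One way to picture the weighted multi-device objective is as a deployment-mix-weighted mean speedup, where a device the config cannot run on contributes nothing. This is an illustrative assumption, not the formula from cross-device-design.html; `fleet_score`, the device keys, and the zero-credit rule for missing coverage are all hypothetical.

```python
from typing import Dict, Optional


def fleet_score(
    candidate_p50_ms: Dict[str, Optional[float]],
    baseline_p50_ms: Dict[str, float],
    deployment_mix: Dict[str, float],
) -> float:
    """Weighted mean speedup of a candidate config across a device fleet.

    candidate_p50_ms maps device -> latency, or None when the config
    failed to build or run there (no coverage, so zero credit).
    deployment_mix maps device -> share of the install base (any scale).
    """
    total_weight = sum(deployment_mix.values())
    if total_weight <= 0:
        raise ValueError("deployment_mix must have positive total weight")

    score = 0.0
    for device, weight in deployment_mix.items():
        latency = candidate_p50_ms.get(device)
        if latency is None:
            continue  # uncovered device: contributes 0 to the score
        speedup = baseline_p50_ms[device] / latency
        score += (weight / total_weight) * speedup
    return score
```

A score of 1.0 means "no better than baseline on average". A config that doubles speed on half the fleet but fails on the other half also scores 1.0, which is the kind of case where a fallback chain becomes the better answer.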

+ +

Self-evolution integration

+

+From self-evolution-design.html: use a paired A/B protocol and adaptive sampling to stabilize conclusions, +then promote findings through L1→L5 confidence levels before broad KB reuse. +

+
    +
  • Measurement robustness: paired A/B reduces thermal/order bias; adaptive session count increases confidence only when needed.
  • +
  • Knowledge quality: promotion gates prevent noisy one-off wins from entering reusable KB rules prematurely.
  • +
  • Search efficiency: once findings are promoted, skip_set and ranking improve, reducing wasted experiments in future runs.
  • +
  • Governance loop: each sweep contributes structured evidence back to KB, making later recommendations faster and more reliable.
  • +
+

+The result is a closed-loop system: run experiments → accumulate evidence → promote stable patterns → improve next orchestration cycle. +
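The paired A/B idea can be illustrated with per-pair deltas: each baseline run is matched with a candidate run taken right next to it, so slow thermal drift affects both sides of a pair. The sketch below is a generic version of that analysis, not the protocol from self-evolution-design.html; `paired_gain` and its return shape are hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple


def paired_gain(
    baseline_ms: Sequence[float], candidate_ms: Sequence[float]
) -> Tuple[float, float]:
    """Summarize paired A/B sessions.

    Each index i is one pair measured back to back. Returns
    (median relative gain, fraction of pairs where the candidate won).
    """
    if len(baseline_ms) != len(candidate_ms) or not baseline_ms:
        raise ValueError("need equally sized, non-empty samples")

    # Relative gain per pair; positive means the candidate was faster.
    deltas = [(a - b) / a for a, b in zip(baseline_ms, candidate_ms)]
    wins = sum(1 for d in deltas if d > 0)
    return median(deltas), wins / len(deltas)
```

An adaptive sampler could keep adding pairs while the win fraction stays close to 0.5 and stop early once the pairs agree, which matches the "only when needed" intent above.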

+ +

Evidence constraints

+ + + + + + + +
FindingImplicationRequired control
npu-001opset 21 benefits Conv+residual patternsKeep opset as first-class search lever
npu-006Conv fusion can cause catastrophic NPU regressionsHard-block risky fusion hypotheses
npu-007DVFS distorts naive perf conclusionsUse DVFS-aware bench protocol + confidence gating
+ +

Key concerns

+ + + + + + + + +
ConcernMitigation in design
Device heterogeneity may invalidate local optimumCross-Device Confidence Agent + multi-device objective and fallback chains
Trust/auditability of recommendationsRequire provenance artifacts and report-level explanation
Noise-driven false winsDVFS-aware protocol, thresholded verdict policy, confidence gates
Overlap concerns with OliveDifferentiate on UX/explainability and Windows deployment reasoning
+ +

Open questions

+
    +
  1. Should this ship as winml agent or as agent-assist modes on existing commands?
  2. +
  3. How should cross-device execution be provisioned: local lab fleet, cloud runners, or hybrid?
  4. +
  5. What is the minimal offline fallback for restricted environments?
  6. +
+ +

References

+ +
+
+ + diff --git a/research/autoconfig/docs/agent-design.md b/research/autoconfig/docs/agent-design.md new file mode 100644 index 000000000..688dad029 --- /dev/null +++ b/research/autoconfig/docs/agent-design.md @@ -0,0 +1,254 @@ +# WinML CLI Agent Design + +> Status: Draft — 2026-06-17 (updated: autoconfig loop V3 changes incorporated) +> Context: Strategic design for the agent layer of winml-cli + +--- + +## 1. Context: Why Agent Matters for winml-cli + +### 1.1 winml-cli vs Olive — The Real Distinction + +Microsoft Olive already exists as a pass-based optimization framework supporting QNN, DML, and other Windows EPs. The temptation is to dismiss winml-cli's agent as redundant with Olive. That would be wrong — the distinction is fundamental: + +| Dimension | Olive | winml-cli | +| --- | --- | --- | +| Target user | ML engineer who understands ORT internals | WinApp developer who wants their model to work on Windows | +| Workflow | Compose passes manually, specify EP upfront | `config` + `build` — two commands, full pipeline | +| Hardware selection | Manual EP specification | `--device auto` — detects hardware, selects EP | +| Explainability | Silent pipeline output | Designed for transparency | +| Windows-first | Cross-platform, Windows supported | Built exclusively for Windows hardware diversity | +| Operator diagnostics | Not available | `winml analyze` — operator linting, EP compatibility | +| Agent-ready | Not designed for it | First-class design goal | + +**Analogy:** Olive is webpack (powerful, expert-configured); winml-cli is Vite (opinionated, works for most cases out of the box). 
+ +### 1.2 The Core Gap Agent Should Fill + +WinApp developers lack access to a senior ML engineer who: + +- Knows why a model fails on QNN NPU for this specific operator pattern +- Can read an error message and immediately know the root cause +- Understands which optimization knob to turn for which problem +- Knows how a config that works on Snapdragon X Elite will behave on Intel Meteor Lake + +**The agent's job is to be that person.** + +--- + +## 2. Agent Design Philosophy + +### 2.1 The Improved Loop (autoconfig V3) vs The Agent Layer + +The autoconfig search loop has been significantly improved since the initial draft. As of v3 (`59e7329d`): + +**What the improved loop does well:** +- Statistical significance via `ThroughputOnly` verdict policy: `improvement > max(1% floor, 2× screen_CV)` — noise-level deltas no longer pass as KEEP +- Screen early exit: if screen improvement < 1%, skip 3× full bench — saves 25–90 min per rejected hypothesis +- Crash-resume via `session.json`: atomic state persistence, restartable without re-running completed experiments +- KB-guided search: `ep_knowledge/*.json` confirmed rules prune the search space before any experiment runs +- DVFS-aware bench protocol: npu-007 CV gate disabled on QNN NPU; 3× 500-iter sessions with cool-down +- npu-006 guard: Conv% > 20% → hard-block conv fusions before they cause 4900% regression + +**What still requires the agent layer:** + +The loop is a *computation engine*, not an *intelligence layer*. It needs an agent because: + +1. **No architecture-aware hypothesis generation** — hypotheses are hardcoded per EP, not generated from model analysis. An attention-heavy model gets the same hypotheses as a Conv-heavy one. +2. **No failure explanation** — DISCARD is logged but not explained. Developers can't learn from results without reading raw JSON. +3. **No cross-device reasoning** — a config found on Snapdragon X Elite has unknown behavior on Intel Meteor Lake. The loop can't tell you that. +4. 
**No adaptive stopping** — 30-DISCARD plateau is a static heuristic. An agent would recognize when all architectural levers for this model/EP pair have been exhausted. +5. **No KB self-update** — KB is manually maintained. An agent with memory extraction (cf. AgenticGPUOptimizer `memory_extractor.py`) would auto-update `ep_knowledge/*.json` after each run. + +The revised framing: **autoconfig is a sub-tool that the agent invokes and explains, not a headless replacement for the agent**. + +### 2.2 The Wrong Design (Original Autoconfig) + +The *original* autoconfig ran a **headless search loop** with no statistical significance, no crash-resume, and no KB-guided pruning: +Explorer → Optimizer → Reviewer → repeat + +**Problems that were present (now fixed in V3):** + +- No statistical significance — 1% hardcoded floor meant noise-level deltas passed as KEEP +- No screen early exit — every hypothesis ran 3× full bench regardless of screen result +- No crash-resume — an interrupted run lost all state +- All optim keys in kebab-case → `build_config()` silently used snake_case lookups → every hypothesis ran as baseline (critical bug, fixed) + +**Remaining problems (require agent layer to fix):** + +- A Python script can do benchmark loops faster, cheaper, and more reliably than an LLM agent — the loop is good, the LLM overhead is not worth it +- Results (config files) are not auditable — developer cannot verify why a config was chosen +- No explainability — developer doesn't understand what was decided or why +- Treats developer as absent; no collaborative interaction +- The "agentic" overhead (LLM inference cost per loop iteration) adds nondeterminism without intelligence + +Autoconfig search is useful as a **sub-tool**, not as the primary value proposition of the agent layer. + +### 2.3 The Right Design: Diagnosis + Guidance over Search + +Agent excels at **judgment, diagnosis, and explanation** — not computation.
The redesign centers on: + +> **When a developer encounters a problem, the agent gives an explanation plus an executable next step — not a config file.** + +#### Design Principles + +1. **Explain, don't just output** + Instead of silently picking an EP, say: *"I picked QNN EP because your device has a Qualcomm NPU. Operator coverage is 97% — the remaining 3% fall back to CPU, which is acceptable for these specific ops."* +2. **Fix, don't just diagnose** + When an incompatible operator is found, apply the graph transformation — don't just flag it. +3. **Developer talks, agent acts** + The agent is interactive and conversational. Developer says "this model is slow on GPU" → agent asks clarifying questions, runs targeted experiments, explains findings. +4. **Progressive trust** + Show confidence levels. Be explicit about uncertainty. Let the developer see what the agent is doing. Never give false precision (e.g., "Config A is 3% faster" when standard deviation is 5%). +5. **Windows device diversity as first-class concern** + Always reason about what happens on devices the developer doesn't have — not just the machine the agent runs on. + +--- + +## 3. Agent Types + +### 3.1 Diagnostic Agent *(highest priority)* + +**Trigger:** Model fails to load, crashes at inference, throws EP compatibility error +**Developer question:** "My model fails on QNN NPU — why? What do I do?" + +**Agent responsibilities:** + +- Parse error message → identify root cause (unsupported op, shape mismatch, driver version, etc.) +- Analyze model graph → enumerate incompatible operators per EP +- Propose and apply concrete fix (graph transformation, operator substitution, fallback EP) +- Verify fix with `winml eval` accuracy check + +**Why this is Olive-incompatible:** Olive doesn't converse, doesn't diagnose, doesn't explain. It fails silently or produces a broken model. + +**Example interaction:** + +```text +Developer: winml build failed.
Error: "QNNExecutionProvider: Unsupported op at node /conv/Conv_3" +Agent: Found it. Conv_3 has dynamic padding — QNN NPU requires static shapes. + I'll apply DynamicToFixedShape transform and re-run the compile. + [applies fix] → Build succeeded. NPU latency: 12.3ms. Accuracy delta: 0.01%. +``` + +--- + +### 3.2 Decision Guidance Agent + +**Trigger:** Developer is at a decision point in the pipeline (which EP? which precision? to quantize or not?) +**Developer question:** "I don't know what options to pick. What's the tradeoff?" + +**Agent responsibilities:** + +- Run quick comparative benchmarks (not exhaustive search) +- Present tradeoffs with numbers: latency gain vs accuracy delta vs model size +- Make a recommendation with reasoning, not just a number +- Let developer override with understanding of consequences + +**Key difference from autoconfig:** This is interactive and decision-oriented, not headless. The developer is in the loop. + +--- + +### 3.3 Cross-Device Confidence Agent *(winml-cli unique)* + +**Trigger:** Developer has a working config, asks "will this work on my users' devices?" +**Developer question:** "My app ships on many Windows hardware configs. Will this be okay?" + +**Agent responsibilities:** + +- Given a config optimized for Device A, reason about behavior on Device B, C... +- Identify configs that are device-specific (compiled QNN binaries only work on Qualcomm) +- Generate multi-device config with automatic EP fallback chain (QNN → DML → CPU) +- Surface warnings: "This config will fail on Intel Meteor Lake — here's the fallback" + +**Why this matters:** WinApp developers ship to millions of devices. No other tool addresses Windows hardware diversity in the deployment sense. 
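
The EP fallback chain mentioned above (QNN → DML → CPU) amounts to picking the first preferred provider a device actually offers. A minimal sketch, assuming short EP names like those used by the sweep (`qnn`, `dml`, `cpu`); `resolve_ep` and `DEFAULT_CHAIN` are hypothetical and not part of winml-cli:

```python
from typing import Iterable, Optional, Sequence

# Preference order from the design: try the NPU path first, CPU last.
DEFAULT_CHAIN = ("qnn", "dml", "cpu")


def resolve_ep(
    available: Iterable[str], chain: Sequence[str] = DEFAULT_CHAIN
) -> Optional[str]:
    """Return the first EP in the chain that the device reports as available.

    Returns None when nothing in the chain is usable, which is the case
    the agent would surface as a warning rather than silently ignoring.
    """
    have = {ep.lower() for ep in available}
    for ep in chain:
        if ep in have:
            return ep
    return None
```

On a device that reports only `dml` and `cpu`, `resolve_ep` yields `dml`; a device-specific artifact such as a compiled QNN binary would still need a separately built model for that fallback.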
+ +--- + +### 3.4 Regression Detection Agent *(CI/CD scenario)* + +**Trigger:** ORT version bump, driver update, or scheduled CI run +**Developer question:** "Something changed — my model got slower / broke" + +**Agent responsibilities:** + +- Compare before/after perf numbers with statistical validity (not point estimates) +- Correlate change with known ORT/EP changelog entries +- Identify which layer / operator regressed using profiler output +- Propose workaround or file structured bug report + +--- + +## 4. Role of Autoconfig (Perf Search) in This Design + +Autoconfig (opset × EP × opt\_level search) is **not abandoned** — it becomes a sub-tool invoked by the agents above when appropriate. + +**When it's invoked:** + +- Diagnostic Agent: "Let me try a few configs to see if this performance issue is config-related" +- Decision Guidance Agent: "Let me run a quick sweep across opt\_level to show you the tradeoff" + +**What changes:** + +- It's not the primary entry point +- Its output is explained, not just surfaced as a config file +- It runs fewer, targeted experiments (not exhaustive grid search) +- Accuracy validation (`winml eval`) is mandatory before any recommendation + +--- + +### 3.5 Model Recommendation Agent *(developer onboarding scenario)* + +**Trigger:** Developer describes a problem they want to solve but doesn't know which model to use +**Developer question:** "I want to add real-time object detection to my app. What model should I use?" + +**Agent responsibilities:** + +- Understand the developer's task from natural language (classification, detection, segmentation, NLP, etc.) 
+- Query `winml catalog` to enumerate available built-in models for that task +- Factor in the developer's current hardware (`winml sys`) — which EPs are available, memory constraints, NPU presence +- Recommend the best-fit model with reasoning: why this model, expected latency on their device, accuracy tier +- Output a ready-to-run command so the developer can immediately try it + +**Why this is valuable:** Developers often don't know which model architecture fits their use case, let alone which one performs well on their specific Windows hardware. The catalog has verified, pre-tested models — the agent bridges "I have a problem" to "here is a model you can ship today." + +**Example interaction:** + +```text +Developer: I want to classify product images in my e-commerce app. Should be fast on low-end laptops. +Agent: For image classification on low-end hardware without an NPU, I recommend: + + → microsoft/resnet-50 (catalog built-in) + - Task: image-classification + - Latency on CPU: ~18ms p50 (tested on Intel Core i5) + - Size: 98MB FP32, 25MB w8a16 + - Best EP on your machine: CPUExecutionProvider + + Runner-up: google/vit-base-patch16-224 (better accuracy, ~2x slower on CPU) + + To build and benchmark: + winml build -c $(winml config -m microsoft/resnet-50 --device auto) -o resnet_out/ + winml perf -m resnet_out/model.onnx --device auto --iterations 100 +``` + +**What makes this different from a search engine:** The recommendation is hardware-aware — the same question asked on a machine with a Qualcomm NPU would surface a different model (or a different EP for the same model) with different expected numbers. It's not a static lookup, it's a contextual match. + +--- + +## 5.
Key Concerns to Track + +| Concern | Mitigation | +| --- | --- | +| Device heterogeneity: config found on Dev's machine may not generalize | Cross-Device Confidence Agent explicitly addresses this; output includes device scope | +| Trust/auditability: developer can't verify agent recommendation | All recommendations include reasoning + confidence + "how I tested this" | +| Olive overlap at implementation layer | winml-cli uses ORT under the hood like Olive; the differentiation is UX + Windows-first + explainability, not reimplementing optimization passes | +| Accuracy validation | `winml eval` is mandatory in every agent loop that modifies the model | +| Agent hallucinating perf numbers | All perf claims require iteration ≥ 1000 and report p50/p90/p99 with std dev | + +--- + +## 6. Open Questions + +1. **Scope**: Should the agent be a CLI mode (`winml agent`) or embedded into existing commands (`winml build --agent`)? +2. **Olive relationship**: Should winml-cli contribute opset search back to Olive, or maintain it independently? Needs alignment with Olive team. +3. **Offline / no-LLM mode**: Should the agent work without LLM (rule-based fallback) for air-gapped CI environments? +4. **Multi-device testing**: Cross-Device Confidence Agent requires access to multiple devices or a device simulation layer — how to implement? diff --git a/research/autoconfig/docs/autoconfig_diagram.html b/research/autoconfig/docs/autoconfig_diagram.html new file mode 100644 index 000000000..9b5b9e69a --- /dev/null +++ b/research/autoconfig/docs/autoconfig_diagram.html @@ -0,0 +1,451 @@ + + + + +autoconfig Skill — Architecture + + + + +

autoconfig — Skill Architecture

+

Profile-guided autonomous config search for WinApp developers

+
v3 · 2026-06-17 · AgenticGPUOptimizer V2 patterns applied
+ +
+ + +
+
👤
+
+ User input  —  + Model ID  +  Target EP  +  Objective: + accuracy-primary + latency-primary + Pareto +  + optional budget / accuracy floor +
+
+ +
+ +
+
+
Orchestrator
+ + +
+
Phase 0 · Intake
+
+
+
Inspect
+
    +
  • winml inspect
  • +
  • EP availability check
  • +
  • Load session.json (crash-resume)
  • +
+
+
+
+
Baseline Build
+
    +
  • winml build (opset17, no quant)
  • +
  • Record baseline p50
  • +
+
+
+
+
Correctness Contract
+
    +
  • winml eval --mode compare
  • +
  • Reference: original ONNX or HF PyTorch
  • +
  • Lock cosine similarity = 1.000
  • +
+
+
+
+ +
+ + +
+
Phase 1 · Insight
+
+ +
+
+
Runtime Profile
+
    +
  • winml perf --profile (pending #158)
  • +
  • Per-op kernel time, bottleneck %
  • +
+
+
+
Static Analyzer
+
    +
  • winml analyze --ep <ep>
  • +
  • Conv% → npu-006 risk flag
  • +
  • Partial-support op list
  • +
+
+
+
Graph Analysis
+
    +
  • Op counts by type
  • +
  • Fusion opportunities
  • +
  • Static vs dynamic axes
  • +
+
+
+ +
+
+
Insight Engine — hypothesis_pool (unfiltered candidates)
+
+
+ +
+
+ +
+ + +
+
Phase 2 · Opt Loop
+
+
+ +
+
+
Experiment loop (until stop condition)
+ +
+
Explorer
+
+
+
skip_set
+
KB hard-block pruning after hypothesis generation
+
+
+
priority_queue
+
Ranked hypotheses after pruning
+
+
+
    +
  • Skip completed iters from session.json NEW
  • +
  • Load hypothesis_pool from Insight Engine
  • +
  • Apply KB hard blocks → skip_set
  • +
  • Rank remaining hypotheses → priority_queue
  • +
  • Pop next hypothesis from priority_queue
  • +
  • Build config.json delta
  • +
+
+ +
spawn per experiment ↓
+
+ +
+
Optimizer
+
    +
  • winml build -c config.json
  • +
  • Phase A — screen (200 iters): CV gate for CPU/GPU; disabled for QNN NPU (DVFS)
  • +
  • Early exit NEW: screen Δ < 1% → DISCARD, skip full bench
  • +
  • Phase B — full bench (3 × 1000 iters, 60s cool-down)
  • +
  • winml eval → accuracy gate
  • +
+
+ +
benchmark + accuracy data ↓
+
+ +
+
Reviewer — ThroughputOnly NEW
+
    +
  • threshold = max(1%, 2.0 × CV)
  • +
+
+ KEEP >1.5×thr + MARGINAL 1×–1.5× + DISCARD + EARLY DISCARD + ACC/BUILD FAIL +
+
+
reviewer verdict / suggestions → back to Explorer (next iteration)
+
+ +
+
Crash-Resume NEW
+
    +
  • Atomic write after every experiment
  • +
  • Stores: completed iters, baseline/best p50, discard counters
  • +
+
+ +
+ +
+
+
Stop conditions
+
    +
  • Objective met
  • +
  • 30 consecutive DISCARDs
  • +
  • Queue empty
  • +
  • User stops
  • +
+
+
+
results.tsv
+ config · screen_p50 · median_p50
+ CV · delta_pct · status +
+
+
session.json
+ completed_iters
+ baseline/best p50
+ discard counters +
+
+
ep_knowledge/
+ New entries as
+ status="draft" +
+
+ +
+
+
+ +
+ + +
+
Phase 3 · Outcome
+
+
+
+
Champion Config
+ Best config + provenance + config_<ep>_optimal.json +
+
+
HTML Report
+ Chart + experiment table + report.html +
+
+
Experiment Artifacts
+ Per-hypothesis logs + experiments/<n>/ +
+
+
KB Draft Entry
+ New findings, promoted after Gate 2 + ep_knowledge/<ep>.json +
+
+
Feature Requirements
+ Issues filed per finding + #NNN · <feature gap title> +
+
+
+
+
+ + +
+ v3 · 2026-06-17: + ThroughputOnly verdict policy (threshold = max(1%, 2×CV)); + screen early exit (Δ<1% skips full bench, saves ~25–90 min); + crash-resume via atomic session.json. +  ·  + Key constraints: + npu-006 (Conv%>20% → block conv fusions); + npu-007 (CV gate off on NPU); + cpu-001 (opset17 on CPU); + gpu-004 (no quant on QNN GPU). +
+ +
+ + diff --git a/research/autoconfig/docs/cross-device-design.html b/research/autoconfig/docs/cross-device-design.html new file mode 100644 index 000000000..04cf3cdf3 --- /dev/null +++ b/research/autoconfig/docs/cross-device-design.html @@ -0,0 +1,696 @@ + + + + +autoconfig Skill — Cross-Device / Cross-EP Auto-Config Design + + + + +

autoconfig Skill — Cross-Device / Cross-EP Auto-Config Design

+

Turning the single-machine sweep into a fleet-wide, multi-objective tuner — orchestrated through winml serve

+DESIGN +CORE IDEA: winml serve Phase 0 = the fleet worker (no new mode) +PROPOSAL → V3 ROADMAP + +
+ + + + +
+ + +
+ +

1 · The Gap — Single-Device Search Can't Answer Fleet Questions

+ +

+Every sweep today (catalog_qnn_sweep.py, catalog_gpu_sweep.py, catalog_cpu_sweep.py) +runs on one machine and optimizes for one (EP, device) pair. But a WinApp developer ships to +millions of heterogeneous Windows devices. The champion config found on a Snapdragon X Elite NPU has +unknown behavior on an Intel Lunar Lake NPU, an AMD Ryzen AI XDNA part, or a CPU-only budget laptop.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Limitation Today | Why It Hurts | Severity
One device per sweep | Champion config is only validated where the sweep ran; portability is a guess. | CRITICAL
No joint objective | Running N independent sweeps yields N local optima, never a config that is jointly good across the fleet. | HIGH
Can't co-locate hardware | You cannot put a Qualcomm NPU, Intel NPU, and AMD NPU in one box. The search must reach across machines. | HIGH
Serial wall-clock | A 14-hypothesis sweep is ~6–8 h on one device; the same matrix across a 3-device fleet is 3× longer, not parallel. | MEDIUM
KB has no device axis | ep_knowledge/*.json findings are implicitly "true on X Elite". No way to express "true across 3 NPU SKUs". | MEDIUM
+ +

2 · Core Idea — winml serve (Phase 0) is already the tune worker

+ +

+winml serve (in commands/serve.py, currently in _DISABLED_COMMANDS — implemented, not yet shipped) +has a Phase 0 mode that exposes every winml command as an HTTP endpoint — including build, +perf, eval, and sys. Bound to --host 0.0.0.0 it is reachable across machines. +That is the whole worker: no new flag, no new RPC protocol. Each physical device in a lab/fleet just runs +winml serve --host 0.0.0.0; a central orchestrator — a thin layer over the existing sweep loop — +acts as an HTTP client, treating the fleet as one distributed benchmark backend.

+ +
+ Earlier drafts proposed a dedicated winml serve --tune mode. It was dropped — Phase 0 already + exposes build/perf/eval/sys over HTTP, --idle-timeout 0 already keeps the session warm during a sweep, and + --host 0.0.0.0 already opens the network. A --tune flag added surface area without adding capability. +
+ +
+
+
❌ Today: local CLI calls
+

The sweep calls winml build + winml perf as local subprocesses. The (EP, device) is whatever box you launched on.

+
# single host, single EP
+b = run_perf_session(baseline)  # local
+h = run_perf_session(hyp)       # local
+
+
+
✅ Proposed: fan out to serve workers
+

The orchestrator dispatches the same job to every device that owns a relevant EP — in parallel — and collects a result vector keyed by device.

+
# N hosts, N EPs, one round-trip
+results = fleet.bench(baseline, hyp,
+            devices=["npu-a","npu-b","npu-c"])
+# → {device: gain_vector}
+
+
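The fan-out in the proposed snippet could look roughly like the following. This is a sketch under stated assumptions: `fleet.bench` is not an existing API, so the transport is injected as a `bench_on` callable (hypothetical) rather than hard-coding HTTP calls, and each device receives exactly one job per call so that no worker interleaves two benches.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fleet_bench(
    devices: Iterable[str],
    bench_on: Callable[[str, dict, dict], dict],
    baseline: dict,
    hyp: dict,
) -> Dict[str, dict]:
    """Run the same baseline/hypothesis pair on every device in parallel."""
    device_list = list(devices)
    if not device_list:
        return {}
    # One thread per device: the threads only wait on remote workers.
    with ThreadPoolExecutor(max_workers=len(device_list)) as pool:
        futures = {d: pool.submit(bench_on, d, baseline, hyp) for d in device_list}
        # Result vector keyed by device; a worker error re-raises here.
        return {d: f.result() for d, f in futures.items()}
```

In a real orchestrator `bench_on` would wrap the Phase 0 HTTP calls for one worker; in tests it can be a plain function that returns canned results.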
+ +

Fleet topology

+
+
+
+ autoconfig orchestrator + job scheduler · result aggregator
multi-device objective
+
+
+
↓   HTTP client calls to Phase 0 endpoints: /build · /perf · /eval · /sys   ↓
+
+
winml serve --host 0.0.0.0Snapdragon X Elite
QNN · Hexagon HTP NPU
+
winml serve --host 0.0.0.0Intel Lunar Lake
OpenVINO · NPU
+
winml serve --host 0.0.0.0AMD Ryzen AI
Vitis AI · XDNA NPU
+
winml serve --host 0.0.0.0discrete / iGPU
DML · GPU
+
winml serve --host 0.0.0.0budget laptop
CPU only · 8 GB
+
+

+ Each worker runs the Paired A/B protocol locally (so DVFS/thermal drift cancels per device) and returns only result JSON — never raw weights on the hot path. +

+
+ +
+ Why this is cheap to build: the worker already exists (Phase 0 HTTP wrapper), and the orchestrator reuses + everything from self-evolution-design.html — Paired A/B bench, adaptive n_sessions, the confidence + ladder, champion-config output. The only genuinely new pieces are (a) the fleet client + scheduler, + (b) the multi-device objective function, and (c) a device axis on the KB. Networking is plumbing; the + intellectual work is the objective and the portability taxonomy (Tab 2). +
+ +

3 · Distributed Bench Protocol — Reuse Phase 0 Endpoints

+ +

+No new protocol is invented. Phase 0 already maps each winml command to an HTTP endpoint; the orchestrator just calls them +remotely instead of as local subprocesses. The mapping below is existing commands over the existing wrapper — +the only orchestrator-side convention is that the worker runs the Paired A/B loop locally so thermal/DVFS cancels per device.

+ + + + + + + + + +
Phase 0 endpoint | Backing command | Used for | Notes
GET /sys | winml sys | Hardware fingerprint: EP list, NPU SKU, driver versions, RAM, ISA | Orchestrator builds the device matrix from these.
POST /build | winml build | Compile a candidate config (opset, EP flags, graph passes, precision) | Artifact stays on the worker; orchestrator references it by output dir.
POST /perf | winml perf | Latency distribution: p50/p90/p99, CV, CPU-fallback% | Orchestrator drives the A/B pairing by sequencing calls; thermal cancels locally.
POST /eval | winml eval | accuracy, top-1/top-5 delta, cosine vs FP baseline | Mandatory before any cross-device recommendation.
+ +
+ Two small gaps worth noting (neither needs a new mode): (1) Phase 0 today exposes commands one call at a time — + a config-hash artifact cache on the worker (compile once, perf many) would be a nice server-side optimization but is not required for correctness; + (2) the Paired A/B sequencing lives in the orchestrator, which means it must trust the worker not to interleave other jobs mid-pair — + enforced by scheduling discipline (one bench job per worker at a time), not by a flag. +
+ +
+ Weight transfer is off the hot path. The model is pushed to each worker once (or pulled from a shared + catalog URL); thereafter only config specs and result JSON cross the wire. +
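To make the "HTTP client instead of local subprocess" idea concrete, the sketch below builds (but does not send) a request for each endpoint in the table, using only the standard library. The endpoint paths and methods come from the table; the JSON body shape, the port, and the function name are assumptions, since the Phase 0 request schema is not specified here.

```python
import json
import urllib.request
from typing import Optional

# Methods as listed in the endpoint table above.
ENDPOINT_METHODS = {"sys": "GET", "build": "POST", "perf": "POST", "eval": "POST"}


def phase0_request(worker_url: str, command: str,
                   payload: Optional[dict] = None) -> urllib.request.Request:
    """Build the HTTP request for one winml command on one worker."""
    method = ENDPOINT_METHODS[command]  # KeyError for commands not in the table
    url = f"{worker_url.rstrip('/')}/{command}"
    data = None
    headers = {}
    if method == "POST":
        # Assumption: the worker accepts a JSON body describing the config.
        data = json.dumps(payload or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; keeping construction separate makes the mapping easy to unit-test without a live worker.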
+ +
+ + +
+ +

4 · The Hard Part — Multi-Device Objective

+ +

+Single-device search optimizes a scalar (latency). A fleet produces a metric vector per device +— so "best" is no longer a single number. The orchestrator scores each candidate config c against +the device set D, then selects by the aggregation strategy the user asked for.

+ +
# per device d, candidate config c → metric tuple
+m(c, d) = (latency_p50_ms, accuracy_delta, peak_mem_mb, cpu_fallback_pct, portability_class)
+
+def score(c, D, strategy, constraints):
+    rows = [m(c, d) for d in D]
+    # hard gates first — any device violating a constraint disqualifies c
+    if any(violates(r, constraints[d]) for r, d in zip(rows, D)):
+        return DISQUALIFIED
+    if strategy == "worst_case":                 # Scenario A: best for N NPUs
+        return max(r.latency_p50_ms for r in rows)   # minimize the slowest device
+    if strategy == "weighted":                   # Scenario B: balance tiers
+        return sum(w[d] * norm(r) for r, d in zip(rows, D))
+ +
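A runnable version of the scoring pseudocode might look like this. It is a sketch, not the planned `fleet_objective.py`: the `Metric` and `Limits` types are hypothetical, rows are passed in already measured (keyed by device), `None` stands in for DISQUALIFIED, and `norm(r)` is assumed to be latency divided by that device's latency budget. Lower scores are better in both strategies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Metric:
    latency_p50_ms: float
    accuracy_delta: float
    peak_mem_mb: float
    cpu_fallback_pct: float


@dataclass(frozen=True)
class Limits:
    max_latency_ms: float
    max_accuracy_delta: float
    max_peak_mem_mb: float
    max_cpu_fallback_pct: float


def violates(m: Metric, lim: Limits) -> bool:
    """True when any hard gate is exceeded on this device."""
    return (m.latency_p50_ms > lim.max_latency_ms
            or m.accuracy_delta > lim.max_accuracy_delta
            or m.peak_mem_mb > lim.max_peak_mem_mb
            or m.cpu_fallback_pct > lim.max_cpu_fallback_pct)


def score(rows: Dict[str, Metric], strategy: str,
          constraints: Dict[str, Limits],
          weights: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Score one candidate config across the fleet; None means disqualified."""
    # Hard gates first: one violating device disqualifies the candidate.
    if any(violates(m, constraints[d]) for d, m in rows.items()):
        return None
    if strategy == "worst_case":
        return max(m.latency_p50_ms for m in rows.values())
    if strategy == "weighted":
        if weights is None:
            raise ValueError("weighted strategy needs per-device weights")
        # Assumed normalisation: latency as a fraction of the device budget.
        return sum(weights[d] * m.latency_p50_ms / constraints[d].max_latency_ms
                   for d, m in rows.items())
    raise ValueError(f"unknown strategy: {strategy}")
```

The explicit `ValueError` for an unknown strategy avoids silently returning `None`, which would otherwise be indistinguishable from a disqualified candidate.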
+
+
Worst-case (minimax)
+

Minimize the slowest device's latency. Guarantees a floor for every device in the set. Used for "best for N NPUs".

+
+
+
Weighted-sum
+

Tier weights (e.g. low-end 0.7 / high-end 0.3) with hard per-device constraints. Used for the balanced high/low scenario.

+
+
+
Pareto frontier
+

Return the non-dominated set over {perf, accuracy, mem} plus the knee point, so the agent can explain tradeoffs rather than hide them.

+
+
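The Pareto option does not reduce to a scalar, so it needs its own selection step. The sketch below computes the non-dominated set over (latency, accuracy delta, memory), all treated as lower-is-better, and picks a knee point as the front member closest to the per-objective best after min-max normalisation. That knee heuristic is one common choice and an assumption here; the document does not define how the knee is chosen.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (latency_ms, accuracy_delta, peak_mem_mb)


def dominates(a: Point, b: Point) -> bool:
    """a dominates b when it is no worse everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(cands: Dict[str, Point]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return [name for name, p in cands.items()
            if not any(dominates(q, p) for other, q in cands.items() if other != name)]


def knee_point(cands: Dict[str, Point], front: List[str]) -> str:
    """Front member closest to the ideal corner after min-max scaling."""
    columns = list(zip(*(cands[n] for n in front)))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid divide-by-zero

    def distance(name: str) -> float:
        return sum(((v - lo) / span) ** 2
                   for v, lo, span in zip(cands[name], lows, spans)) ** 0.5

    return min(front, key=distance)
```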
+ +

5 · Config Portability Taxonomy

+ +

+"One config for 3 NPUs" is ambiguous until you classify what is portable. Graph-level decisions travel; compiled +vendor binaries do not. The orchestrator tags every champion with a portability class.

+ + + + + + + + + + + + + + + + + + + + + + + +
Class | What's shared | Travels across… | Example
PORTABLE | opset, graph passes, precision (pure ONNX-level decisions) | any device + any EP | opset 21 NHWC-bypass (npu-001); w8a16 quantization
EP-PORTABLE | Same EP family + flags; recompiled per device | same EP vendor, different SKU | QNN HTP flags shared across two Hexagon SKUs
DEVICE-LOCKED | Compiled context binary / vendor blob | one device + driver only | QNN context binary; OpenVINO compiled blob
+ +
+ Key consequence: for a heterogeneous-vendor NPU set (Qualcomm + Intel + AMD), a literal "single config" can only + be PORTABLE — i.e. shared graph-level choices (opset, passes, precision) plus a + per-device EP selection. The orchestrator searches the portable dimensions jointly and locks the EP per device. + The deliverable is therefore "one shared config + N compiled artifacts", not one binary. +
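One simple way to tag a champion is to look at which kinds of fields the shared deliverable contains. The sketch below is an illustration only: the field names (`opset`, `graph_passes`, `precision`, `ep_flags`, `compiled_binary`) and the function itself are hypothetical, chosen to mirror the three rows of the taxonomy table.

```python
from typing import Iterable

# Fields that are pure ONNX-level decisions (assumed names).
GRAPH_LEVEL_FIELDS = frozenset({"opset", "graph_passes", "precision"})


def portability_class(shared_fields: Iterable[str]) -> str:
    """Classify a shared config by the least portable thing it contains."""
    fields = set(shared_fields)
    if "compiled_binary" in fields:
        return "DEVICE-LOCKED"   # vendor blob: one device + driver only
    if "ep_flags" in fields:
        return "EP-PORTABLE"     # same EP family, recompiled per SKU
    if fields and fields <= GRAPH_LEVEL_FIELDS:
        return "PORTABLE"        # graph-level choices travel anywhere
    raise ValueError(f"cannot classify shared fields: {sorted(fields)}")
```

For a Qualcomm + Intel + AMD set, only the first three fields can be shared, which is why the deliverable is one shared config plus N compiled artifacts.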
+ +

6 · Cross-Device KB — Adding the Device Axis

+ +

+The confidence ladder from self-evolution-design.html generalizes findings along the architecture axis +(2+ models → arch rule). The fleet adds an orthogonal device axis: a finding confirmed on 3 NPU SKUs is a +device-general rule.

+ + + + + + + + +
Field (new) | Meaning | Promotion gate
device_scope | List of (EP, SKU, driver) where the finding holds | Recorded per confirming worker
device_general | Holds across ≥3 SKUs of the same EP class | ≥3 device_scope entries, same verdict sign
cross_ep | Holds across ≥2 EP vendors (truly portable) | Confirmed on ≥2 distinct EP families
+ +
+ Payoff: once a rule is cross_ep, the orchestrator can predict it on an unseen device and + skip that hypothesis on new fleet members — the device-axis analogue of the L5 predictive tier. The fleet is what lets a + finding earn that scope in the first place. +
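The promotion gates in the table can be derived directly from the `device_scope` list. The sketch below is a hedged illustration: the `ScopeEntry` type, the integer `verdict_sign` encoding, and the requirement that `cross_ep` confirmations share the same sign are assumptions layered on top of the table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ScopeEntry:
    ep: str            # EP family, e.g. "QNN"
    sku: str           # device SKU the finding was confirmed on
    driver: str        # driver version from the /sys fingerprint
    verdict_sign: int  # +1 helps, -1 hurts, 0 neutral (assumed encoding)


def promotion_flags(device_scope: Iterable[ScopeEntry]) -> Dict[str, bool]:
    """Compute device_general and cross_ep from confirming workers."""
    entries = list(device_scope)
    skus_by_ep_sign: Dict[tuple, Set[str]] = {}
    eps_by_sign: Dict[int, Set[str]] = {}
    for e in entries:
        skus_by_ep_sign.setdefault((e.ep, e.verdict_sign), set()).add(e.sku)
        eps_by_sign.setdefault(e.verdict_sign, set()).add(e.ep)
    return {
        # >=3 distinct SKUs of one EP class agreeing on the verdict sign
        "device_general": any(len(s) >= 3 for s in skus_by_ep_sign.values()),
        # >=2 distinct EP families agreeing on the verdict sign
        "cross_ep": any(len(s) >= 2 for s in eps_by_sign.values()),
    }
```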
+ +
+ + +
+ +

7 · Marked User Scenarios

+

The two scenarios this design must serve directly. Both reduce to the same engine — they differ only in the strategy and constraints passed to score().

+ + +
+
Scenario A · worst-case / minimax
+
"Find me a config best for 3 NPUs."
+ +

Fleet

+

3 NPU workers — e.g. Snapdragon X Elite (QNN/Hexagon), Intel Lunar Lake (OpenVINO/NPU), AMD Ryzen AI (Vitis/XDNA) — each running winml serve --host 0.0.0.0.

+ +

Objective

+
  • strategy = worst_case: minimize the slowest NPU's p50 latency, so every device gets an acceptable floor.
  • hard gate: accuracy_delta ≤ ε and cpu_fallback% ≈ 0 on each NPU (a config that CPU-falls-back on one vendor is disqualified; cf. npu-006 conv-fusion hazard).
+ +

How it runs

+
  • Orchestrator searches the PORTABLE dimensions (opset, graph passes, precision) jointly; locks the best EP per NPU vendor.
  • Each candidate is fanned out to all 3 workers; the minimax score is the slowest of the 3.
  • Portability verdict surfaced: shared graph config travels, but each NPU keeps its own compiled artifact (DEVICE-LOCKED binary).
+ +

Output

+
champion (shared, PORTABLE):  opset=21, passes=[layout_bypass], precision=w8a16
+per-device EP:   { snapdragon: QNN-HTP, lunar_lake: OpenVINO-NPU, ryzen_ai: Vitis-XDNA }
+worst-case p50:  14.9 ms   (slowest = ryzen_ai)
+per-device p50:  { snapdragon: 11.2, lunar_lake: 13.4, ryzen_ai: 14.9 } ms
+accuracy_delta:  all ≤ 0.4%   ✔ gate passed
+artifacts:       3 compiled bundles (one per NPU)
+
+ + +
+
Scenario B · constrained weighted-sum
+
"Find me a config that balances high-end and low-end machines — perf / accuracy / memory requirements are xxx."
+ +

Fleet (spans tiers)

+

High-end: NPU + GPU, 32 GB.  Low-end: CPU-only, 8 GB. Both running winml serve --host 0.0.0.0.

+ +

Constraints (the "xxx" — supplied by the developer)

+ + + + + + + +
Requirement | Low-end (binding) | High-end
perf (p50) | ≤ 30 ms | ≤ 10 ms
accuracy_delta | ≤ 1% | ≤ 1%
peak memory | ≤ 2 GB | ≤ 8 GB
+ +

How it runs

+
  • strategy = weighted with tier weights; the low-end device is usually the binding constraint.
  • Constrained multi-objective: satisfy every hard gate on the low-end box first, then maximize high-end perf within the feasible set.
  • The orchestrator may discover that no single config satisfies both, in which case it returns a tier-conditional config map (one config per tier) with the EP fallback chain, not a forced compromise.
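The "single config if feasible, otherwise a tier map" decision above can be sketched as follows. The data layout, key names, and function names are hypothetical; measurements are plain dicts keyed by tier, and every limit is an upper bound.

```python
from typing import Dict, Optional

Measure = Dict[str, float]  # keys (assumed): p50_ms, acc_delta_pct, mem_gb


def meets(m: Measure, limits: Measure) -> bool:
    """True when every limited quantity is at or under its cap."""
    return all(m[key] <= cap for key, cap in limits.items())


def choose_output(candidates: Dict[str, Dict[str, Measure]],
                  limits: Dict[str, Measure],
                  primary_tier: str = "high-end") -> dict:
    """Prefer one config feasible on all tiers; else fall back to a per-tier map."""
    feasible = [name for name, per_tier in candidates.items()
                if all(meets(per_tier[t], limits[t]) for t in limits)]
    if feasible:
        # Within the feasible set, maximise perf on the primary tier.
        best = min(feasible, key=lambda n: candidates[n][primary_tier]["p50_ms"])
        return {"shape": "single", "config": best}
    tier_map: Dict[str, Optional[str]] = {}
    for tier, lim in limits.items():
        ok = [n for n, per_tier in candidates.items() if meets(per_tier[tier], lim)]
        tier_map[tier] = min(ok, key=lambda n: candidates[n][tier]["p50_ms"]) if ok else None
    return {"shape": "tier_map", "config": tier_map}
```

A `None` entry in the tier map signals that no measured config satisfies that tier at all, which is worth surfacing to the developer rather than hiding.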
+ +

Output — two shapes, agent picks the honest one

+
+
+
Single balanced config (if feasible)
+
config: opset=21, w8a16, layout_bypass
+low-end CPU:  p50 27 ms · mem 1.6 GB · Δacc 0.6%  ✔
+high-end NPU: p50  8 ms · mem 2.1 GB · Δacc 0.6%  ✔
+binding:  low-end perf (27/30 ms)
+
+
+
Tier-conditional map (if not)
+
low-end:  { ep: CPU,  precision: w8a16, opset: 21 }
+high-end: { ep: QNN,  precision: fp16,  opset: 21 }
+fallback chain:  QNN → DML → CPU
+note: no single config meets the 2 GB cap
+      on low-end at fp16 — split required.
+
+
+

The tier-conditional path connects directly to the Cross-Device Confidence Agent in agent-design.md §3.3 — same EP-fallback-chain concept, now produced from real fleet measurements instead of reasoning.

+
+ +
+ One engine, two questions. Scenario A is strategy=worst_case over a homogeneous-role (all-NPU) set; + Scenario B is strategy=weighted + hard constraints over a heterogeneous-tier set. Nothing in the search loop, + bench protocol, or KB changes between them — only the objective arguments. +
+ +

8 · Skill Outcome — Possible Results & Trade-offs

+ +

+The skill does not silently pick one answer shape. It runs the fleet, then presents all outcomes that +satisfy the hard constraints, with the trade-offs spelled out — the user makes the final call. "Best" is a +judgement (simplicity vs peak perf vs maintenance cost) that only the developer can make.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Outcome shape | What you ship | Pros | Cons | Best when
Single portable config | 1 shared graph config (opset/passes/precision) + per-device EP + N compiled artifacts | One source of truth; one config to reason about; portable across the set | Compromise perf on every device; still N binaries to build/store | Homogeneous-role fleet (e.g. all NPUs); you want simplicity over peak perf
Single balanced config | 1 config that meets every hard constraint across all tiers | Simplest possible: one config, possibly one artifact path if EP shared | Often optimal on no device; may be infeasible under tight constraints | Tiers share an EP and constraints are loose enough to leave slack
Tier-conditional map | One config per tier + EP fallback chain (QNN → DML → CPU) | Each tier near-optimal; honest when no single config can satisfy all | More configs to maintain/test; runtime needs device→tier routing logic | Wide tier spread (NPU↔CPU) and/or tight perf/memory budgets
Pareto frontier set | The non-dominated set over {perf, accuracy, mem} + the knee point | Full visibility into every trade-off; nothing hidden | No single answer, so the user must choose; more to digest | Priorities are unclear up front; you want to explore before committing
Per-device champion | Independent best config for each device (N configs, no sharing) | Max achievable perf on every single device | Worst portability; N configs + N artifacts to manage; no shared story | You already ship per-SKU builds, so maintenance cost is already paid
+ +
+ The skill's job is to make the trade-off legible, not to decide it. Every row above comes with measured + per-device numbers (latency, accuracy delta, peak mem, portability class) so the developer chooses with evidence — e.g. + "the single portable config is 18% slower on the fast NPU but saves me a second build pipeline — I'll take it." +
+ +
+ + +
+ +

9 · Implementation Plan

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Priority | Component | File(s) | Status | Key change
P0 | Enable winml serve Phase 0 | cli.py MOD | TODO | Remove serve from _DISABLED_COMMANDS; verify build/perf/eval/sys endpoints reachable over 0.0.0.0
P0 | Fleet client + scheduler | fleet.py NEW | TODO | Register workers from /sys, fan out jobs, collect result vectors
P0 | Multi-device objective | fleet_objective.py NEW | TODO | score(c, D, strategy, constraints) → worst_case / weighted / Pareto
P1 | Reuse Paired A/B per worker | sweep_utils.py REUSE | TODO | Worker runs existing Paired A/B locally; orchestrator only aggregates
P1 | Portability classifier | fleet_objective.py NEW | TODO | Tag champion PORTABLE / EP-PORTABLE / DEVICE-LOCKED; emit per-device artifacts
P1 | Device axis in KB | ep_knowledge/*.json MOD | TODO | Add device_scope · device_general · cross_ep; promote_findings reads them
P2 | Tier-conditional config map output | fleet.py NEW | TODO | When no single config is feasible, emit per-tier map + EP fallback chain
P2 | Cross-device prediction | analyze_insight.py MOD | TODO | Use cross_ep rules to skip hypotheses on new fleet members
+ +

10 · Conclusions

+ +
+
+
✅ winml serve makes this a small build
+

The worker already exists (Phase 0 HTTP wrapper) and the orchestrator reuses Paired A/B, adaptive sampling, the confidence ladder, and champion output. The only new code is a fleet client plus an objective function.

+
+
+
⚠ The hard part is the objective, not the network
+

Networking is plumbing. The real design work is the multi-device objective (minimax vs weighted vs Pareto) and the portability taxonomy that decides what "one config" even means across vendors.

+
+
+
📦 Two honest deliverable shapes
+

A fleet answer is either one shared (portable) config + N compiled artifacts, or a tier-conditional config map with an EP fallback chain. Forcing a single binary across vendors is dishonest — the system must be allowed to return a map.

+
+
+
🧠 The fleet is what grows the KB's device axis
+

Only a multi-device fleet can promote a finding from "true on X Elite" to cross_ep. That scope, in turn, lets the system predict behavior on unseen devices — closing the loop with the Cross-Device Confidence Agent.

+
+
+ +
+ Bottom line: winml serve (Phase 0) + a fleet orchestrator + a multi-device objective converts autoconfig + from a single-device tuner into a fleet tuner that answers the two questions developers actually ask — "best for N NPUs" + and "balance my high-end and low-end fleet under perf/accuracy/memory budgets" — with measured, auditable, per-device evidence. +
+ +

Honest limitations

+
  • DVFS is still per-device. Thermal noise doesn't disappear; each worker must still run the full Paired A/B protocol locally, so wall-clock is bounded by the slowest/noisiest device.
  • Heterogeneous-vendor "single config" is a graph-level config, not a binary. Compiled NPU blobs are DEVICE-LOCKED by construction.
  • Fleet provisioning is real cost. The value of cross-device search scales with how representative the fleet is of the shipping device population: a 3-device fleet generalizes better than 1, but is not the world.
  • Driver/EP version drift across workers can confound results; /sys fingerprints must be recorded with every finding so the KB stays honest.
+ +
+ +

+
Generated 2026-06-18 · research/autoconfig/docs/cross-device-design.html
+ + + + + diff --git a/research/autoconfig/docs/ep-findings-summary.html b/research/autoconfig/docs/ep-findings-summary.html new file mode 100644 index 000000000..26e5d6ce6 --- /dev/null +++ b/research/autoconfig/docs/ep-findings-summary.html @@ -0,0 +1,941 @@ + + + + +WinML EP Findings — Validated Catalog + + + + +

WinML EP Findings — Validated Catalog

+

+ Hardware: Snapdragon X Elite CRD  |  ORT: 1.24.5 (onnxruntime-windowsml)  |  + QNN SDK: Hexagon HTP (NPU) + Adreno X1-85 (GPU)  |  + Last updated: 2026-06-22  |  15 models (QNN NPU), 8 models (QNN GPU), 4 models (CPU partial)  |  npu-001 revised: MobileViT opset21 NEUTRAL on clean rerun +

+ +
+
31
total findings
+
11
visible (multi-model / cross-EP)
+
20
hidden (single-model)
+
15
NPU models tested
+
8
GPU models tested
+
8
feature requests
+
+ +
+ Scope warning: All findings from 1 hardware device (Snapdragon X Elite CRD, Oryon CPU + Adreno X1-85 + Hexagon HTP NPU). + DML EP not available on this device (package conflict with onnxruntime-windowsml). + QNN NPU: 15 models tested (8 catalog + 3 recipe + 4 validation). QNN GPU: 8 models (full catalog sweep 2026-06-18). + CPU: partial sweep (4/8 models done; ResNet-18/MobileViT/DINOv2/rad-dino; BERT/NLP in progress 2026-06-18). + Always re-validate on new model architectures before using findings to prune search space. +
+ +
+ + + Showing only multi-model / cross-EP findings by default. Single-model and LOW-confidence findings are hidden. +
+ + +
+
+ QNN NPU  —  Hexagon HTP (Snapdragon X Elite) +  15 models, 3×500-iter sessions, 30s cool-down | h0-h10 hypotheses (catalog_qnn_sweep.py) +
+
+ +
+
npu-006
+
HIGH
confirmed
+
+
Full conv fusion pack causes catastrophic CPU fallback on Conv-dominant models (~130x regression)
+
+ ResNet-18 full pack (conv-bn + conv-add + conv-activation fusions): + 3-session p50 = [132, 135, 131]ms vs baseline ~1ms. + ~130x regression, near-zero CV = deterministic CPU fallback. + DINOv2-base (Conv%<1%): fusion is neutral. + ORT FusedConv op produced by full pack is not dispatchable by QNN EP. + Refinement 2026-06-17: conv_add_fusion alone (h10 ResNet-18) = +0.93% NEUTRAL. Regression requires full pack that creates FusedConv. +
Scope: Conv-dominant models when Conv% > 20%. Not applicable to transformer or NLP models.
+
+
+
+
Autoconfig action
+
+ Hard-block full conv fusion pack when Conv% > 20%. + conv_add_fusion alone is safe. + Gate in catalog_qnn_sweep.py via count_conv_pct(). +
+
+
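The Conv% gate for npu-006 could be expressed roughly as below. This is not the real `count_conv_pct()` from `catalog_qnn_sweep.py`, whose implementation is not shown here; the sketch assumes Conv% means the share of graph nodes whose op type is `Conv`, takes a plain list of op-type strings to stay dependency-free, and uses hypothetical pass names for the three fusions in the full pack.

```python
from collections import Counter
from typing import Iterable

# Hypothetical pass names for the "full conv fusion pack".
FULL_CONV_FUSION_PACK = frozenset(
    {"conv_bn_fusion", "conv_add_fusion", "conv_activation_fusion"}
)


def conv_pct(op_types: Iterable[str]) -> float:
    """Percentage of graph nodes that are Conv (assumed definition of Conv%)."""
    ops = list(op_types)
    if not ops:
        return 0.0
    return 100.0 * Counter(ops)["Conv"] / len(ops)


def blocks_full_pack(op_types: Iterable[str], requested_passes: Iterable[str],
                     threshold_pct: float = 20.0) -> bool:
    """True when the request is the full pack on a Conv-dominant graph."""
    wants_full_pack = FULL_CONV_FUSION_PACK <= set(requested_passes)
    return wants_full_pack and conv_pct(op_types) > threshold_pct
```

Requesting `conv_add_fusion` alone is never blocked by this check, matching the 2026-06-17 refinement.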
+ +
+
npu-007
+
HIGH
confirmed
+
+
DVFS thermal noise makes CV-based stability gating unreliable on QNN NPU
+
+ Across all catalog models, within-session CV ranges 0.1–2.0+ even on warm device. + CV gate (<15%) blocks most valid candidates — the noise is DVFS, not model instability. + Reliable signal: 3+ independent sessions × 500+ iters with 30s cool-down. Use median p50 across sessions. + Differences <10% are within noise floor. +
Scope: General — all models on QNN NPU / Snapdragon X Elite.
+
+
+
+
Autoconfig action
+
+ CV gate DISABLED for QNN NPU (SCREEN_CV_MAX_NPU = 999.0). + Always run 3×500 Phase B regardless of screen CV. + Feature request: winml perf --sessions 3 --cool-down 30s (#155). +
+
+
+ +
+
npu-001
+
MEDIUM
empirical
+
+
opset 21 export gives +24–31% speedup on DINOv2-family models — mechanism UNKNOWN, NOT a general ViT property; MobileViT benefit NOT reproduced on clean rerun
+
+ DINOv2-small: opset17 7.2ms → opset21 5.0ms (+30.6%). + DINOv2-base: opset17 34.6ms → opset21 26.2ms (+24.1%). + MobileViT-small: REVISED to NEUTRAL. The original +28.6% / +42.1% (matmul_transpose) was measured against an inflated ~12ms baseline. A clean from-scratch 11-hypothesis rerun (2026-06-22, fresh winml config+build, 3×500-iter) gave baseline 5.51ms → opset21 5.355ms = +2.81% with overlapping session ranges; matmul_transpose (h6) = 6.218ms = SLOWER. The earlier “win” was a DVFS/thermal baseline artifact. + Critical controls: dino-vitb16 −0.7% NEUTRAL; ViT-base −10.8% HURTS; all NLP tested neutral. Not a general ViT property. + Also: bias_softmax_fusion adds +14% incremental on DINOv2 on top of opset21 (npu-009). Original kMaxSupportedOpset bypass mechanism INVALIDATED (ORT 1.24.5 has kMaxSupportedOpset≥23). +
Scope: DINOv2-family confirmed. MobileViT REVISED to neutral. NOT plain ViT (ViT-base HURTS), NOT NLP.
+
+
+
+
Autoconfig action
+
+ DINOv2-family: try opset21 + bias_softmax_fusion bundle first. + MobileViT-class hybrids: do NOT assume opset21 helps — clear the effect-size gate (gain ≥ 2×session-CV AND ranges separated) before trusting any win. + Plain ViT: SKIP — confirmed harmful. + NLP: SKIP — consistently neutral. + Architecture check required. +
+
+
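The effect-size gate named in the action above (gain ≥ 2×session-CV and ranges separated) can be sketched from per-session p50 values. The details are assumptions: session CV is taken as the larger of the two arms' across-session coefficient of variation, "ranges separated" means the slowest candidate session is still faster than the fastest baseline session, and the function name is hypothetical.

```python
from statistics import mean, median, stdev
from typing import Dict, Sequence, Union


def effect_size_gate(baseline_p50s: Sequence[float],
                     candidate_p50s: Sequence[float]) -> Dict[str, Union[float, bool]]:
    """Decide whether a latency win clears the noise floor (needs >=2 sessions per arm)."""
    b_med = median(baseline_p50s)
    c_med = median(candidate_p50s)
    gain_pct = (b_med - c_med) / b_med * 100.0  # positive = candidate is faster

    def cv_pct(xs: Sequence[float]) -> float:
        return stdev(xs) / mean(xs) * 100.0

    session_cv = max(cv_pct(baseline_p50s), cv_pct(candidate_p50s))
    separated = max(candidate_p50s) < min(baseline_p50s)
    return {
        "gain_pct": gain_pct,
        "session_cv_pct": session_cv,
        "separated": separated,
        "passes": gain_pct >= 2.0 * session_cv and separated,
    }
```

A small gain with overlapping session ranges, like the MobileViT rerun described above, fails on the `separated` check regardless of the median difference.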
+ +
+
npu-010
+
HIGH
cross-EP
+
+
highdimRTR causes regression on CNN-ViT hybrids: −19% NPU, −6.9% GPU — spurious +36 Reshape insertion
+
+ MobileViT NPU (h9 opset21+highdimRTR): median 14.4ms vs baseline 12.1ms = −18.9%. + MobileViT GPU (h9 bundle): 19.2ms vs 18.0ms = −6.89%. + ONNX diff: h9 graph has +36 extra Reshape nodes (108→144). The 12 original RTR patterns are UNCHANGED. + Root cause: highdimRTR misidentifies Gemm→Reshape→Transpose sequences in MobileViT hybrid unfold. Inserts Reshape intermediaries after Gemm → breaks dispatch merging → extra DMA. + Contrast: DINOv2 (pure ViT): h9 = +38.1% NPU — pure ViT benefits. +
Scope: CNN-ViT hybrids with Gemm→Reshape→Transpose unfold. Pure-ViT models benefit. Architecture-dependent.
+
+
+
+
Autoconfig action
+
+ Hard-block highdimRTR for Gemm→Reshape→Transpose hybrid unfold models. + analyze_insight.py must detect and add highdimRTR to skip_set. + Safe for pure-ViT (DINOv2 +38%). Architecture check required. +
+
+
+ +
+ 5 single-model findings hidden (npu-002/003/004/008/009) + Click “Show single-model findings” above to expand +
+ +
+
npu-002
+
MEDIUM
1 model
+
+
W8A16 quantization gives ~1.9x speedup over FP32 on QNN NPU
+
+ ConvNext FP32: 19.4ms → W8A16: 10.3ms (1.9x). 1 model only. +
Scope: ConvNext only for magnitude. Mechanism generalizes; magnitude does not.
+
+
+
+
Autoconfig action
+
Always quantize for QNN NPU. W8A16 is the starting point.
+
+
+ +
+
npu-003
+
MEDIUM
1 model
+
+
winml compile (EPContext) adds ~1.7x speedup on top of W8A16
+
+ ConvNext W8A16: 10.3ms → EPContext: 6.0ms (1.7x). 1 model only. +
Scope: ConvNext only for magnitude. Mechanism generalizes to all QNN NPU models.
+
+
+
+
Autoconfig action
+
Always run winml compile after finding best config for QNN NPU.
+
+
+ +
+
npu-004
+
LOW
anecdote
+
+
W8A8 may cause accuracy collapse on models with LN+GELU (UNVALIDATED)
+
+ Experiment aborted early — no accuracy numbers preserved. Recalled anecdote only. + Do NOT skip W8A8 without running eval first. +
Scope: UNVALIDATED. ConvNext only.
+
+
+
+
Autoconfig action
+
Treat as anecdotal. Run W8A8 eval before deciding.
+
+
+ +
+
npu-008
+
HIGH
1 model
+
+
microsoft/rad-dino fails to build on QNN NPU across ALL opset variants (access violation rc=0xC0000005)
+
+ winml build crash (rc=0xC0000005) for opset 17, 19, and 21 on QNN NPU. + rad-dino is ViT-L scale (large model, non-standard medical imaging architecture). + Builds successfully on CPU EP (~275ms). QNN GPU also BUILD_FAIL all hypotheses. +
Scope: microsoft/rad-dino only. Likely unsupported op or tensor size in QNN SDK for ViT-L scale.
+
+
+
+
Autoconfig action
+
Route rad-dino to CPU EP only. Feature gap: winml build should fast-fail with diagnostic rather than access violation.
+
+
+ +
+
npu-009
+
MEDIUM
1 model
+
+
bias_softmax_fusion adds +14% incremental speedup on DINOv2 NPU when combined with opset21
+
+ DINOv2-small h7 (opset21 + bias_softmax_fusion): p50=4.03ms (+38.6% total) + vs h3 (opset21 only): 4.98ms. Incremental gain: +14.1%. + Outperforms attention_fusion (h8=+28.4%) and matmul_transpose (h6=+24.8%) on DINOv2. + Mechanism: folds Add(qk, bias)+Softmax → single FusedSoftmax with native HTP path. +
Scope: DINOv2-small confirmed. Not tested on DINOv2-base or plain ViT.
+
+
+
+
Autoconfig action
+
+ For DINOv2-family: use opset21 + bias_softmax_fusion bundle. + Prioritize over attention_fusion (outperformed h8 by 10pp). +
+
+
+ +
+
+ + +
+
+ CPU EP  —  Oryon CPU (Snapdragon X Elite) +  4 models done (ResNet/MobileViT/DINOv2/rad-dino); BERT/NLP in progress | 3×300-iter sessions, Phase C +
+
+ + +
+
cpu-006
+
HIGH
empirical
+
+
CPU EP, QNN GPU, and QNN NPU respond DIFFERENTLY to opset changes — EP isolation is mandatory
+
+ CPU opset17 vs 21 (ConvNext): 3.9x SLOWER at opset21. + CPU opset17 vs 21 (DINOv2): ~10x SLOWER at opset21 (cpu-001/cpu-009). + QNN GPU opset17 vs 21: neutral-to-slightly-negative (−5.4% to +3.3%) across 7 models. + QNN NPU opset17 vs 21 (DINOv2): +24% FASTER. + Same opset change, three different outcomes on the same chip. DINOv2 goes +24% on NPU but −10x on CPU. +
Scope: Meta-rule about EP isolation. Applies to all models.
+
+
+
+
Autoconfig action
+
+ NEVER transfer opset findings across EPs. + Always validate per EP independently. + CPU, GPU, and NPU search spaces are fully independent. +
+
+
+ + +
+
cpu-007
+
HIGH
KEEP_CONFIRMED
1 model
+
+
matmul_transpose_fusion gives +92% speedup on ResNet-18 CPU EP (237ms → 17.8ms) — baseline config is severely suboptimal
+
+ ResNet-18 h9 (matmul_transpose_fusion): 17.8ms vs auto-config baseline 237ms = +92.51% KEEP_CONFIRMED (all 5 Phase C sessions passed). + Also confirmed: h12 (transpose_optimizer) +84.46%, h13 (gelu_fusion) +88.89%, h10 (bundle) +91.54%, h6 (layer_norm_fusion) +10.43%. + 237ms baseline = severely suboptimal auto-config for ResNet-18 on CPU. matmul_transpose_fusion enables BLAS-level transposed GEMM dispatch that ORT cannot reach with unfused MatMul+Transpose pairs. +
Scope: ResNet-18 confirmed. Models with unfused MatMul+Transpose chains likely benefit. DINOv2: NOT applicable (all fusion flags regress DINOv2 on CPU via cpu-001 interference).
+
+
+
+
Autoconfig action
+
+ For ResNet-class on CPU: apply matmul_transpose_fusion + transpose_optimizer + gelu_fusion bundle. + Critical: verify whether auto-config baseline is always this suboptimal for ResNet. + Needs re-test with fresh config to confirm finding is reproducible. +
+
+
+ + +
+
cpu-008
+
HIGH
1 model
+
+
layer_norm_fusion causes −997% regression on MobileViT CPU EP (73ms → 803ms) — wrong LN pattern match
+
+ MobileViT h6 (layer_norm_fusion): 803ms vs baseline 73ms = −997% DISCARD. + Also: matmul_transpose_fusion −165%, attention_fusion bundle −164%, skip_layer_norm_fusion −10%. + Only bias_softmax_fusion helps: 64ms (+12.3% MARGINAL_UNCONFIRMED). + Mechanism: MobileViT places LayerNorm after Conv2D outputs (CNN-ViT hybrid). layer_norm_fusion expects pure transformer LN (post-MLP). Fusing the wrong pattern creates an op the CPU runtime cannot dispatch to an optimized kernel. +
Scope: CNN-ViT hybrid models (MobileViT). Pure transformers (BERT/ViT) are expected safe.
+
+
+
+
Autoconfig action
+
+ Block layer_norm_fusion, matmul_transpose_fusion, and attention_fusion for CNN-ViT hybrid models on CPU. + analyze_insight.py must detect CNN-ViT hybrid architecture and skip these fusions. +
+
+
+ +
+
+ 🔍 CPU sweep: BERT/NLP models in progress (2026-06-18) — roberta-base-squad2, tinyroberta, bge-small, MiniLM-L6-v2. + Key open question: does cpu-001 (opset regression) fire on pure-BERT models (sparse Transpose) or is it only Transpose-heavy architectures? + Expected finding: BERT is safe at opset19/21; attention_fusion may help BERT significantly. +
+
+ +
+ 5 single-model findings hidden (cpu-001/002/005/009 — single-arch, + cpu-004 anecdote) + Click “Show single-model findings” above to expand +
+ +
+
cpu-001
+
HIGH
confirmed
ConvNext+DINOv2
+
+
opset 19+ causes 3–10x slowdown on models with dense Transpose graphs — NOT ConvNext-specific
+
+ ConvNext: opset19=160ms (3.7x), opset21=170ms (3.9x) vs baseline 43.7ms. + DINOv2-small: opset19=1106ms (9.8x), opset21=1095ms (9.7x) vs baseline 112ms — CPU001_REGRESSION verdict. + ResNet-18: opset19=231ms (+2.4% neutral), opset21=227ms (+4.5% neutral) — NOT affected. + MobileViT: opset19=−9.1% (mild, not catastrophic). + Pattern: models with ≥49 Transpose nodes (ConvNext, DINOv2) hit cpu-001; sparse-Transpose models (ResNet) do not. + BERT/NLP pending (expected neutral based on Transformer LN-dominant graph with few Transposes). +
Scope: Dense-Transpose models confirmed (ConvNext, DINOv2). ResNet confirmed safe. BERT/NLP pending.
+
+
+
+
Autoconfig action
+
Default to opset17 for CPU EP. For DINOv2/ConvNext: hard-block opset19+. For ResNet: opset is safely neutral.
+
+
+ +
+
cpu-009
+
HIGH
1 model
+
+
DINOv2 CPU EP is constrained to auto-config only — ANY explicit flag causes catastrophic regression
+
+ DINOv2-small: auto-config baseline=112ms. h1 opset17-explicit: 762ms (−577%). + h2/h3 opset19/21: ~1100ms (−880% CPU001_REGRESSION). + h4 attention_fusion: 1083ms (−862%). h7 bias_softmax_fusion: 1121ms (−896%). + Even forcing opset17 explicitly (h1) regresses −577% vs auto-config — the auto-config default must use a specific graph optimization path that is disrupted by any explicit override. +
Scope: DINOv2-small confirmed. Likely generalizes to all pure-ViT models with dense Transpose graphs on CPU.
+
+
+
+
Autoconfig action
+
For DINOv2/ViT-class on CPU: use auto-config ONLY. Do not force any opset. Do not apply any fusion flags. All deviations regress.
+
+
+ +
+
cpu-002
+
HIGH
confirmed
+
+
matmul_add_fusion regresses CPU p50 by 87% on models where ORT L2 has already produced Gemm nodes
+
+ ConvNext: p50=81.7ms vs baseline 43.7ms (+87%). + ORT L2 already converts MatMul+Add → Gemm at baseline. Applying fusion on top conflicts. + catalog_cpu_sweep.py auto-skips via _model_has_gemm() guard. +
Scope: Models where ORT L2 baseline already has Gemm nodes.
+
+
+
+
Autoconfig action
+
Skip matmul_add_fusion when model.onnx already contains Gemm. Guard implemented.
+
+
+ +
+
cpu-005
+
HIGH
confirmed
+
+
Baseline (no extra flags) is optimal for ConvNext CPU — graph pass sweep is wasted
+
+ 22-experiment ablation: no flag improved p50 beyond noise. Baseline at 43.7ms is floor. + ORT L2 already applies gelu_fusion and MatMul→Gemm. +
Scope: ConvNext-class. Transformer: awaiting CPU catalog sweep completion.
+
+
+
+
Autoconfig action
+
For CPU + ConvNext: skip graph pass sweep. ResNet: apply matmul_transpose_fusion (cpu-007).
+
+
+ +
+
+ + +
+
+ DML EP  —  Adreno X1-85 via Direct3D 12 +  1 model only (facebook/convnext-tiny-224). DML not available on test device (onnxruntime-windowsml package conflict) +
+
+ +
+ 3 single-model findings hidden (dml-001/002/003 — ConvNext only) + Click “Show single-model findings” above to expand +
+ +
+
dml-001
+
MEDIUM
stability
+
+
DML is more stable than QNN GPU — p50 difference is within noise
+
+ DML FP32: p50=16.9ms, std=0.52. QNN GPU FP32: p50=17.7ms, std=0.97. + p50 diff = 0.82σ of QNN GPU — distributions OVERLAP. Not a separable p50 advantage. + DML meaningfully more stable: CV 3% vs 5.5%. +
Scope: Adreno X1-85, ConvNext. 3-run comparison only.
+
+
+
+
Autoconfig action
+
Prefer DML over QNN GPU for lower tail latency (p90). Do NOT claim DML is faster based on p50 alone.
+
+
+ +
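The separability argument in dml-001 comes down to two small calculations: the p50 gap expressed in units of the noisier backend's standard deviation, and each backend's coefficient of variation (std / p50). The sketch below re-derives both from the numbers quoted in the finding; the helper names are illustrative, not part of any existing script.

```python
def gap_in_sigma(p50_a, p50_b, std_ref):
    """Absolute p50 gap expressed in units of a reference standard deviation."""
    return abs(p50_a - p50_b) / std_ref


def cv_percent(std, p50):
    """Coefficient of variation in percent, using p50 as the location estimate."""
    return 100.0 * std / p50


dml = {"p50": 16.9, "std": 0.52}      # DML FP32, as quoted in dml-001
qnn_gpu = {"p50": 17.7, "std": 0.97}  # QNN GPU FP32, as quoted in dml-001

gap = gap_in_sigma(dml["p50"], qnn_gpu["p50"], qnn_gpu["std"])
print(f"p50 gap: {gap:.2f} sigma of QNN GPU")
print(f"DML CV: {cv_percent(dml['std'], dml['p50']):.1f}%")
print(f"QNN GPU CV: {cv_percent(qnn_gpu['std'], qnn_gpu['p50']):.1f}%")
```

A gap below one sigma means the p50 distributions overlap, so only the CV difference supports preferring DML.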
+
dml-002
+
MEDIUM
1 run
+
+
NHWC transformer increases latency variance on DML — p50 neutral, p90 +19%
+
+ DML NHWC: p50=16.5ms, p90=21.0ms (+19%), std=1.89 (3.6x worse). +
Scope: Adreno X1-85 + DML, ConvNext.
+
+
+
+
Autoconfig action
+
Do NOT apply nhwc-transformer for DML EP. p90 +19% is unacceptable.
+
+
+ +
+
dml-003
+
LOW
CLI gap
+
+
DML FP16 gives ~1.4x speedup with clean unimodal distribution — BLOCKED (#867)
+
+ DML FP16 (Python hack): p50=11.8ms, p90=12.8ms, std=0.66 vs FP32 16.9ms. + Cannot reproduce with winml CLI today. Blocked on #867 (--precision fp16). +
Scope: Adreno X1-85 + DML. 1 experiment only.
+
+
+
+
Autoconfig action
+
Marked SKIPPED until #867 ships. FP16 is the primary DML lever.
+
+
+ +
+
+ + +
+
+ QNN GPU EP  —  Adreno X1-85 via QNN SDK +  8 models, 3×300-iter sessions + Phase C confirmation | catalog_gpu_sweep.py h0-h12 (2026-06-18) +
+
+ +
+
gpu-004
+
HIGH
confirmed
+
+
W8A8 QDQ hangs indefinitely on QNN GPU EP (QNN SDK limitation)
+
+ Any W8A8 QDQ-annotated ONNX passed to QNN GPU EP → infinite hang. + winml build already protects via _patch_device() (quant=null for GPU). + Fast-fail enhancement: #868. +
Scope: QNN GPU EP. QNN SDK limitation. Not a concern in normal user path.
+
+
+
+
Autoconfig action
+
Skip ALL quantization for QNN GPU EP. winml build protects automatically. Tracked: #868.
+
+
+ +
+
gpu-006
+
HIGH
confirmed
7 models
+
+
opset 21 is neutral-to-negative on QNN GPU — CONFIRMED across 7 models
+
+ Full sweep 2026-06-18 (h3 = opset21): + DINOv2 +1.2%, ResNet-18 +3.3% (both MARGINAL), + MobileViT −3.4%, roberta −1.1%, + tinyroberta −2.7%, rad-dino −2.6%, bge-small +0.2%. + Range: −5.4% to +3.3%. No KEEP verdict. All MARGINAL or DISCARD. + Opposite of QNN NPU: DINOv2 +30% on NPU vs +1.2% on GPU. +
Scope: QNN GPU EP. Confirmed across 7 diverse architectures (CNN, ViT, transformer, NLP).
+
+
+
+
Autoconfig action
+
+ Do NOT try opset 19 or 21 for QNN GPU. Default to opset 17. + Rule confirmed — remove opset sweep from GPU search space. +
+
+
+ +
+
gpu-007
+
HIGH
confirmed
DINOv2
+
+
transpose_optimizer gives +8–17% on Conv/ViT models on QNN GPU — KEEP_CONFIRMED on DINOv2
+
+ DINOv2-small h12 (transpose_optimizer): 26.4ms → 22.0ms = +16.67% (KEEP_CONFIRMED — all 5 sessions passed Phase C). + ResNet-18 h12: 6.82ms → 6.25ms = +8.38% (MARGINAL_UNCONFIRMED — Phase C inconclusive). + gelu_fusion explicit (h11) also KEEP_CONFIRMED on DINOv2: +13.86%. + NLP models (roberta, bge-small): mostly BUILD_FAIL with transpose_optimizer — likely IR incompatibility. +
Scope: Conv-dominant and ViT models. NLP: BUILD_FAIL needs investigation.
+
+
+
+
Autoconfig action
+
+ Apply transpose_optimizer as default for QNN GPU on Conv/ViT models. + Skip for NLP models until BUILD_FAIL is resolved. + Feature gap: diagnose why transpose_optimizer causes BUILD_FAIL on BERT/RoBERTa. +
+
+
+ +
+
gpu-008
+
HIGH
cross-EP
+
+
highdimRTR causes −6.9% regression on MobileViT GPU — same root cause as npu-010
+
+ MobileViT h9 (bundle including highdimRTR): p50=19.2ms vs baseline 18.0ms = −6.89% DISCARD. + Less severe than NPU (−19%) due to lower DMA sensitivity on Adreno vs Hexagon HTP. + Root cause: same +36 spurious Reshape nodes confirmed by npu-010 ONNX diff. +
Scope: CNN-ViT hybrids with Gemm→Reshape→Transpose unfold. See npu-010 for full mechanism.
+
+
+
+
Autoconfig action
+
Same block rule as npu-010. Cross-EP: same architecture check protects both GPU and NPU sweeps.
+
+
+ +
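The architecture check that gpu-008 shares with npu-010 can be sketched as a chain counter. This follows the discriminator and `code_sketch` recorded in the feature-gap JSON for issue #921 (a Reshape fed by Gemm or MatMul whose single consumer is a Transpose). The `Node` class is a hypothetical stand-in for an ONNX node so the example runs without the `onnx` package; the function names are also illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Minimal stand-in for an ONNX node (hypothetical; real graphs come from an ONNX model)."""
    op_type: str
    inputs: list
    outputs: list


def count_gemm_unfold_chains(nodes):
    """Count Gemm/MatMul -> Reshape -> Transpose chains where the Reshape has one consumer."""
    producer = {out: n for n in nodes for out in n.outputs}
    consumers = {}
    for n in nodes:
        for name in n.inputs:
            consumers.setdefault(name, []).append(n)

    count = 0
    for n in nodes:
        if n.op_type != "Reshape" or not n.inputs or not n.outputs:
            continue
        pred = producer.get(n.inputs[0])
        if pred is None or pred.op_type not in ("Gemm", "MatMul"):
            continue
        users = consumers.get(n.outputs[0], [])
        if len(users) == 1 and users[0].op_type == "Transpose":
            count += 1
    return count


def skip_highdim_rtr(nodes):
    """Block rule: any unfold chain present puts highdimRTR in the skip set."""
    return count_gemm_unfold_chains(nodes) > 0


# Tiny synthetic graphs, not real model exports.
hybrid = [
    Node("Gemm", ["x", "w", "b"], ["g"]),
    Node("Reshape", ["g", "shape"], ["r"]),
    Node("Transpose", ["r"], ["t"]),
]
pure_vit = [
    Node("MatMul", ["x", "w"], ["m"]),
    Node("Softmax", ["m"], ["s"]),
]
print(skip_highdim_rtr(hybrid), skip_highdim_rtr(pure_vit))
```

Because the check depends only on graph topology, one detection result can feed both the GPU and NPU skip sets.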
+ 4 single-model findings hidden (gpu-001/002/003/005 — ConvNext only) + Click “Show single-model findings” above to expand +
+ +
+
gpu-001
+
HIGH
confirmed
+
+
FP32 baseline is already optimal for ConvNext on QNN GPU — no optimization pass helps
+
+ 11-pass sweep on ConvNext: all 0% node reduction or worse. 251/0/0/0 analyze output. +
Scope: ConvNext-class. Transformer models may benefit (see gpu-007).
+
+
+
+
Autoconfig action
+
Skip all graph pass experiments for QNN GPU on ConvNext-class. FP16 is the only remaining lever (#867).
+
+
+ +
+
gpu-002
+
MEDIUM
consistent
+
+
NHWC transformer hurts QNN GPU on Adreno — ~10% worse p50, +21% p90
+
+ NHWC: p50=19.5ms (+10%), p90=23.8ms (+21%), std=3.43 (3.5x worse). +
Scope: Adreno X1-85 + QNN GPU.
+
+
+
+
Autoconfig action
+
Do NOT apply nhwc-transformer for QNN GPU EP.
+
+
+ +
+
gpu-005
+
HIGH
confirmed
+
+
gelu_fusion improves latency STABILITY on QNN GPU — not p50
+
+ Unfused GELU (287 nodes): p50=17.4ms, p90=29.2ms, std=5.90. + Fused GELU (251 nodes): p50=17.7ms, p90=19.7ms (−33%), std=0.97 (6x lower). +
Scope: Any model with GELU activations on QNN GPU.
+
+
+
+
Autoconfig action
+
Always apply gelu_fusion for QNN GPU (stability benefit, not p50).
+
+
+ +
+
gpu-003
+
LOW
1 run
+
+
winml compile regresses QNN GPU by ~34% — single experiment, low confidence
+
+ FP32 + compile: p50=23.7ms vs baseline 17.7ms (+34%). Single experiment only. +
Scope: QNN GPU EP. QNN NPU: compile always helps.
+
+
+
+
Autoconfig action
+
Avoid winml compile for QNN GPU EP. Re-validate if behavior changes.
+
+
+ +
+
+ + + +
+
+ Feature Requests & CLI Gaps + — required to complete the autoconfig skill +
| Feature | Issue | Priority | Motivation | EP Finding | Status |
|---|---|---|---|---|---|
| winml perf multi-session bench protocol (--sessions N --cool-down S) | #155 | P0 | npu-007: reliable QNN NPU measurement requires 3 independent sessions with 30s cool-down. Single-session p50 is meaningless due to DVFS. catalog_qnn_sweep.py works around this but CLI support is needed for production autoconfig. | npu-007 | OPEN |
| winml analyze: detect FusedConv and warn when Conv% > threshold pre-build | not filed | P0 | npu-006: conv fusions create FusedConv ops that QNN EP cannot dispatch, causing 130x regression. autoconfig guards via Conv% counter but a CLI-level lint in winml analyze --ep qnn would make this generally available without custom Python code. | npu-006 | NEEDS ISSUE |
| winml analyze: detect Gemm→Reshape→Transpose hybrid unfold; warn before applying highdimRTR | not filed | P1 | npu-010 / gpu-008: highdimRTR inserts +36 spurious Reshape nodes on CNN-ViT hybrids (MobileViT), causing −19% NPU / −6.9% GPU regression. analyze_insight.py adds to skip_set internally, but a winml analyze --flag highdimRTR_lowdimRTR lint check would make this available to all users. | npu-010, gpu-008 | NEEDS ISSUE |
| winml build --precision fp16 | #867 | P1 | dml-003: DML FP16 gives ~1.4x speedup with clean distribution, only achievable via Python workaround. Same for QNN GPU (FP16 is the only remaining lever after all graph passes exhausted). | dml-003, gpu-001 | OPEN |
| winml perf --profile (per-op kernel time) | #158 | P1 | Phase 1 Insight in autoconfig needs dynamic op-dominance data (Gemm% vs Conv% vs Attention%) to prioritize hypotheses. POC bridges via static analyze_graph.py, but dynamic profiling is needed for accurate attribution. | all EPs | OPEN |
| winml build --json (structured output) | #443 | P2 | autoconfig parses winml build stdout to detect failures, which is fragile string parsing. A --json flag should emit per-step status, elapsed time, and output artifact paths. Would enable precise partial-failure detection and resume. | all EPs | OPEN |
| winml eval --mode compare: support local PyTorch model as reference | not filed | P2 | autoconfig correctness gate requires HuggingFace model ID as golden reference. Local .pt/.pth files and custom fine-tunes are not supported, blocking cosine-similarity correctness checks for non-HF models. | all EPs | NEEDS ISSUE |
+
+ +
+ How to read confidence levels: + HIGH confirmed = mechanism understood + data from ≥3 independent sessions with non-overlapping ranges. +   + MEDIUM empirical = data is reliable but mechanism unconfirmed or from 1 model only. +   + LOW = single experiment, anecdote, or CLI gap blocking proper validation. +
+ All findings from Snapdragon X Elite CRD (Oryon CPU + Adreno X1-85 GPU + Hexagon HTP NPU). + ORT 1.24.5 (onnxruntime-windowsml). Findings may not generalize to x86 hardware or older ORT versions. +
+ + + + diff --git a/research/autoconfig/docs/ep-knowledge-review.md b/research/autoconfig/docs/ep-knowledge-review.md new file mode 100644 index 000000000..288467396 --- /dev/null +++ b/research/autoconfig/docs/ep-knowledge-review.md @@ -0,0 +1,246 @@ +# EP Knowledge Base — Critical Review + +> Date: 2026-06-16 +> Reviewer: internal audit +> Scope: `ep_knowledge/qnn_npu.json` findings npu-001 through npu-007 +> +> This document records issues found in the original KB entries and the +> reasoning behind corrections applied in the June 2026 update. + +--- + +## Summary of Issues Found + +| Finding | Status Before Review | Issue | Corrected Status | +|---------|---------------------|-------|-----------------| +| npu-001 | `mechanism_confirmed: true` | ORT version used has kMaxSupportedOpset ≥ 22 — bypass mechanism does not apply; ResNet-18 data is noise | `mechanism_confirmed: false`, mechanism UNKNOWN | +| npu-002 | scope: "General / most vision models" | Tested on 1 model only (ConvNext) | scope narrowed to ConvNext | +| npu-003 | scope: "General / all QNN NPU" | Tested on 1 model only (ConvNext) | scope narrowed to ConvNext | +| npu-004 | confidence: "medium" | No recorded data; experiment aborted before measurements saved | confidence: "very_low / anecdote" | +| npu-005 | confidence: "medium" | Compares ORT QNN EP vs qairt native stack — different compilation pipeline entirely | added fairness caveat | +| npu-006 | `mechanism_confirmed: false` | Observation is solid (3-session consistent). Mechanism is unconfirmed but regression is unambiguous | no change to confirmed status; added session evidence | +| npu-007 | `mechanism_confirmed: true` | Solid, confirmed across all 8 models | no change | + +--- + +## Detailed Analysis + +### npu-001 — opset 21 speedup + +#### ORT version issue (critical) + +The catalog sweep used `onnxruntime-windowsml==1.24.5`. 
The npu-001 mechanism +explanation relies on ORT's `kMaxSupportedOpset` gate: + +> "On older ORT where kMaxSupportedOpset < 21, opset 21 models bypass the +> NCHW→NHWC layout transformer entirely." + +But the `kMaxSupportedOpset` version table (from `cpu.json`) shows: + +| ORT version | kMaxSupportedOpset | +|-------------|-------------------| +| v1.14.x | 18 | +| v1.16.x | 19 | +| v1.17.x | 20 | +| v1.18.x | 21 | +| main_HEAD | 26 | + +At ORT 1.24.x, `kMaxSupportedOpset` is almost certainly ≥ 22. This means BOTH +opset 17 and opset 21 models go through the NHWC layout transform in the ORT +version actually used in the sweep. **The "bypass" mechanism does not apply.** + +Consequence: `mechanism_confirmed` must be `false`. The speedup for DINOv2 and +MobileViT is empirically real but the cause is **unknown**. The ORT source code +investigation confirmed the bypass mechanism for *older* ORT versions, not for +the ORT version actually used. + +Possible alternative mechanisms (uninvestigated): +1. PyTorch ONNX exporter produces a structurally different graph at opset 21 + (different op decompositions, fewer reshape/squeeze nodes) +2. QNN EP's graph partitioner behaves differently with opset 21 operator + semantics even when the NHWC transform fires +3. Quantization calibration path differs between opset export versions +4. The NHWC transform at opset 21 still inserts fewer Transposes for some reason + despite firing (investigation needed via optimized graph dump) + +#### ResNet-18 data is noise-dominated + +ResNet-18 baseline p50 is ~1ms. At this latency, the 3×500-iter protocol +produces per-session p50s that vary 4x between sessions: + +``` +h1 (opset17): sessions = [0.990, 4.003, 2.716] ms ← 4x range +h3 (opset21): sessions = [1.054, 2.175, 4.107] ms ← 4x range +``` + +The two distributions fully overlap. Declaring a "+20.2% speedup" from comparing +medians (2.716 vs 2.175ms) is not statistically valid. 
This data point is +**removed** from `validated_models.benefits_from_opset21`. + +To get reliable data for ResNet-18, a minimum of ~3000 iterations per session +and ≥ 5 sessions would be needed. + +#### MobileViT DVFS spike in h1 + +h1 (opset17) sessions: [10.557, 11.721, **27.436**] ms + +The third session at 27.4ms is a clear DVFS thermal event (2.4x spike). The +median (11.721ms) is upward-biased by this session. The "true" opset17 p50 is +likely ~11ms, making the "+26.5%" speedup calculation overstated. A more +conservative estimate is ~20-22%. + +However, h3 (opset21) sessions [10.814, 8.625, 8.449] show two highly consistent +low-latency sessions. The speedup is real, magnitude uncertain (~20-26%). + +#### DINOv2 — most reliable evidence for npu-001 + +h1 (opset17): [7.176, 6.392, 9.436] ms — range 6.4–9.4ms +h3 (opset21): [4.977, 4.876, 6.884] ms — range 4.9–6.9ms + +The two distributions barely overlap only at extremes (h3 max 6.884 ≈ h1 min +6.392). h3 sessions 1 and 2 (4.977, 4.876ms) are tightly clustered at ~4.9ms, +well below the h1 range. The speedup appears real (≥24% vs h1's non-spiked +sessions, up to 31% vs h1 median). + +DINOv2-small's benefit is notable because it is primarily a Vision Transformer — +it has a patch embedding Conv layer but attention-dominant compute. Why opset21 +helps DINOv2 but NOT ViT-base is unknown. This architecture distinction needs +investigation. + +#### Updated empirical claim for npu-001 + +**Observable fact**: For DINOv2-small and MobileViT-small on QNN NPU (ORT 1.24.5, +Snapdragon X Elite), using opset 21 export instead of opset 17 produces a +consistent latency reduction of ~20-31% across 3-session benchmarks. + +**What is NOT known**: Why this occurs in ORT 1.24.x where the kMaxSupportedOpset +bypass should not apply. + +**What needs investigation**: +1. Dump optimized.onnx for both opset17 and opset21 DINOv2, count Transpose nodes + — if opset21 has fewer Transposes, explains speedup via a different mechanism +2. 
Verify ORT 1.24.x kMaxSupportedOpset value from compiled binary +3. Test 3+ additional Conv+residual models: EfficientNet-B0, MobileNet-V3, + ConvNeXt-tiny (already done for CPU; needs QNN NPU validation) + +--- + +### npu-002 — W8A16 speedup over FP32 + +**Issue**: Scope states "General (applies to most vision models on QNN NPU)". +Evidence base: 1 model (ConvNext), 1 device. + +The 1.9x speedup is plausible from HTP architecture (INT8 weight path), but +the magnitude varies by model: a model with few weight-heavy ops (e.g., pure +attention) may see less speedup than a Conv-heavy model. "Most vision models" +is over-claimed. + +**Correction**: Scope narrowed to "ConvNext — single model validation". The +catalog sweep provides indirect evidence (all 8 models used W8A16 and ran +faster than FP32 would on HTP) but no direct FP32 comparison baseline for +those models. + +--- + +### npu-003 — compile speedup + +**Issue**: Scope states "General (applies to all QNN NPU deployments)". Evidence +base: 1 model (ConvNext), 1 device. + +The compile (EPContext) mechanism is well-understood and applies generally, but +the 1.7x magnitude is model-specific. Models with simpler graphs may see less +benefit; models with many ops may see more. + +**Correction**: Scope narrowed. The mechanism claim ("eliminates JIT partitioning") +is generally correct; the magnitude claim (1.7x) is ConvNext-specific. + +--- + +### npu-004 — W8A8 accuracy collapse + +**Issue**: The observation is "Exact numbers not recorded — aborted early." This +is an anecdote, not a finding. The confidence of "medium" is unjustified without +data. + +The claim may well be correct (W8A8 on LN+GELU is problematic), but without +recorded accuracy numbers it cannot be treated as a KB finding. + +**Correction**: Confidence downgraded to "very_low". The finding is relabeled +as an unrecorded anecdote pending a proper experiment with recorded numbers. 
+ +--- + +### npu-006 — conv fusions catastrophic regression + +This finding is the **most statistically solid** in the entire KB: + +ResNet-18 h4 sessions: [132.3, 134.97, 130.669] ms — CV = 0.016 (extremely stable) +ResNet-18 h1 sessions: [0.990, 4.003, 2.716] ms — median 2.716ms + +Even comparing the slowest h1 session (4.003ms) with the fastest h4 session +(130.669ms), the regression is about 33x; the best h1 session (0.990ms) against +the worst h4 session (134.97ms) gives 136x. The 3-session consistency of h4 +(~130-135ms) with near-zero variance is unusual for QNN NPU (all other hypotheses +show high CV). This suggests the fused ops cause a deterministic CPU fallback +with no DVFS noise, consistent with the mechanism hypothesis. + +The only issue is "mechanism_confirmed: false": the CPU fallback has not been +verified via EP partition dump. The regression is unambiguous; the mechanism is +a strong hypothesis. + +**No changes needed** except documenting the 3-session evidence more explicitly. + +--- + +## Additional Models Needed for Validation + +### For npu-001 (opset21 benefit for Conv+residual) + +| Model | Why useful | Predicted result | +|-------|-----------|-----------------| +| `microsoft/efficientnet-b0` | Conv-dominant, no residual-add structure | uncertain | +| `microsoft/mobilenet-v3-small` | Conv-dominant + SE blocks | likely benefits | +| `timm/convnextv2-nano` | ConvNext variant, already confirmed for ConvNext | should benefit | +| `facebook/deit-small-patch16-224` | Pure ViT (no Conv), similar to ViT-base | should be neutral | +| `timm/regnetx-002` | ResNet-like but with group Conv | uncertain | + +Goal: determine whether the benefit is "Conv+residual" or something more specific +to the DINOv2/MobileViT architectures (e.g., hybrid Conv+attention).
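The stability and regression figures in the npu-006 analysis above can be re-derived from the raw per-session p50s. This is a minimal sketch using only the session values quoted there. It assumes CV is the sample standard deviation divided by the mean, which reproduces the quoted 0.016. It also prints the most conservative pairing (slowest h1 session against fastest h4 session) next to the 136x figure.

```python
from statistics import mean, stdev

# Per-session p50 latencies in ms, as quoted in the npu-006 analysis.
h4_ms = [132.3, 134.97, 130.669]  # ResNet-18 with the conv fusion pack
h1_ms = [0.990, 4.003, 2.716]     # ResNet-18 opset17 reference

# Coefficient of variation: sample standard deviation over mean (assumed definition).
cv_h4 = stdev(h4_ms) / mean(h4_ms)

# Largest ratio: worst fused session against best reference session.
ratio_max = max(h4_ms) / min(h1_ms)

# Smallest ratio: best fused session against worst reference session.
ratio_min = min(h4_ms) / max(h1_ms)

print(f"h4 CV: {cv_h4:.3f}")
print(f"regression range: {ratio_min:.1f}x to {ratio_max:.0f}x")
```

Reporting the smallest ratio alongside the largest shows that the verdict does not depend on which noisy h1 session is chosen as the reference.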
+ +### For npu-006 (conv fusions) + +| Model | Why useful | Predicted result | +|-------|-----------|-----------------| +| `microsoft/efficientnet-b0` | Conv+BN heavy (many fuseable patterns) | should regress | +| `google/mobilenet-v2-1.0-224` | Depthwise Conv dominant | should regress | +| `timm/vgg16` | Pure Conv-BN | should regress | +| `microsoft/beit-base-patch16-224` | Pure transformer | should be neutral | + +Goal: confirm that the regression generalizes to all Conv-dominant models, not +just ResNet-18. + +### For npu-002/003 (W8A16 and compile) + +Run FP32 vs W8A16 and W8A16 vs W8A16+compile on at least: +- `apple/mobilevit-small` (already benchmarked W8A16; need FP32 baseline) +- `microsoft/resnet-18` (same) +- `facebook/dinov2-small` (same) + +This would promote npu-002 and npu-003 from "1-model observations" to +"catalog-validated" findings. + +--- + +## Minimum Experiment Protocol for Validation + +For any new model added to the KB: + +1. Run 3 independent sessions × 500 iters with 30s cool-down (npu-007 protocol) +2. Record raw per-session p50s, not just the median +3. Verify session-to-session range is < 50% of the median before reporting a gain +4. For sub-2ms models: increase to 3 sessions × 2000 iters minimum +5. Always dump the optimized graph (`--save-optimized-model`) for opset comparison +6. 
Record ORT version (`winml --version`) at experiment time in the finding + +--- + +*This review document should be re-run after any ORT or QNN SDK version update.* diff --git a/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json b/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json new file mode 100644 index 000000000..320f4b475 --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/921-analyze-highdimRTR-hybrid-unfold.json @@ -0,0 +1,59 @@ +{ + "issue_number": 921, + "github_url": "https://github.com/microsoft/winml-cli/issues/921", + "title": "analyze: detect Gemm→Reshape→Transpose hybrid-unfold pattern; warn before applying highdimRTR", + "status": "OPEN", + "labels": ["static-analyzer", "graph-optimizer", "P2", "triaged"], + "filed_date": "2026-06-18", + "category": "analyze", + "source_findings": ["npu-010", "gpu-008"], + "affected_eps": ["qnn_npu", "qnn_gpu"], + "affected_arch": ["mobilevit", "cnn_vit_hybrid"], + "summary": "analyze_insight.py does not detect Gemm→Reshape→Transpose unfold blocks (CNN-ViT hybrid fingerprint). When highdimRTR_lowdimRTR is applied to models with this pattern, it inserts ~36 spurious Reshape nodes after Gemm layers, increasing memory traffic and causing regression.", + "root_cause": "MobileViT's CNN encoder implements a sliding-window unfold via Gemm→Reshape→Transpose. highdimRTR_lowdimRTR misidentifies these as optimizable RTR chains. The optimizer tries to lower dimensionality but fails — in the process inserting layout-conversion Reshape nodes before/after each Gemm-unfold block to meet expected tensor formats. 
Net effect: +36 extra nodes, more DMA traffic on NPU/GPU.", + "measured_impact": [ + { + "model": "apple/mobilevit-small", + "ep": "qnn_npu", + "hypothesis": "h9", + "baseline_ms": 26.6, + "result_ms": 31.8, + "gain_pct": -19.5, + "verdict": "DISCARD", + "protocol": "3x500 iters, Phase C confirmed", + "date": "2026-06-17" + }, + { + "model": "apple/mobilevit-small", + "ep": "qnn_gpu", + "hypothesis": "h9", + "baseline_ms": null, + "result_ms": null, + "gain_pct": -6.9, + "verdict": "DISCARD", + "protocol": "3x300 iters, Phase C confirmed", + "date": "2026-06-18" + }, + { + "model": "facebook/dinov2-small", + "ep": "qnn_npu", + "hypothesis": "h9", + "baseline_ms": null, + "result_ms": null, + "gain_pct": 38.1, + "verdict": "KEEP_CONFIRMED", + "note": "Pure-ViT — no Gemm-unfold blocks. highdimRTR works correctly here.", + "protocol": "3x500 iters", + "date": "2026-06-17" + } + ], + "fix_needed": { + "file": "analyze_insight.py", + "function": "detect_fusion_candidates", + "description": "Add a pass that counts Gemm→Reshape→Transpose chains. If count > 0, emit FusionCandidate with tag 'highdimRTR_risky' and add hypothesis h9 to skip_set.", + "code_sketch": "for node in graph.node:\n if node.op_type == 'Reshape':\n pred = producer.get(node.input[0])\n if pred and pred.op_type in ('Gemm', 'MatMul'):\n consumer = _single_consumer(node)\n if consumer and consumer.op_type == 'Transpose':\n gemm_unfold_count += 1" + }, + "discriminator": "Gemm→Reshape→Transpose count > 0 → add highdimRTR to skip_set; count == 0 → highdimRTR is a candidate (may give +38% on pure-ViT)", + "related_issues": [180], + "notes": "Issue #180 is a companion question about whether unmergeable RTR patterns should be surfaced by the pattern matcher. This issue is about pre-detection of the *source* pattern before rewrite is attempted." 
+} diff --git a/research/autoconfig/docs/feature-gaps/README.md b/research/autoconfig/docs/feature-gaps/README.md new file mode 100644 index 000000000..9e349faf8 --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/README.md @@ -0,0 +1,68 @@ +# Feature Gap Issues — WinML autoconfig Research + +Each issue is a separate JSON file in this directory. Filed issues have `issue_number` set; +pending issues have `issue_number: null`. + +## JSON Schema + +```json +{ + "issue_number": 921, // null if not yet filed + "github_url": "https://...", // null if pending + "title": "...", + "status": "OPEN | CLOSED | PENDING", + "labels": ["..."], + "filed_date": "YYYY-MM-DD", // null if pending + "category": "analyze | build | optimize | perf | ...", + "source_findings": ["npu-010"], // KB finding IDs that motivated this issue + "affected_eps": ["qnn_npu"], + "affected_arch": ["mobilevit"], + "summary": "One paragraph", + "root_cause": "Detailed explanation", + "measured_impact": [ + { + "model": "apple/mobilevit-small", + "ep": "qnn_npu", + "hypothesis": "h9", + "baseline_ms": 26.6, + "result_ms": 31.8, + "gain_pct": -19.5, + "verdict": "DISCARD", + "protocol": "3x500 iters", + "date": "YYYY-MM-DD" + } + ], + "fix_needed": { + "file": "analyze_insight.py", + "function": "...", + "description": "...", + "code_sketch": "..." // optional + }, + "discriminator": "How to detect this case at analysis time", + "related_issues": [180], + "notes": "..." 
+} +``` + +## Index + +| File | Issue | Status | Category | Source Findings | +|---|---|---|---|---| +| `921-analyze-highdimRTR-hybrid-unfold.json` | [#921](https://github.com/microsoft/winml-cli/issues/921) | OPEN | analyze | npu-010, gpu-008 | +| `pending-cpu001-opset-regression-warning.json` | pending | PENDING | build | cpu-001 | +| `pending-cpu008-layer-norm-fusion-guard.json` | pending | PENDING | optimize | cpu-008 | +| `pending-npu006-fusedconv-unfuse.json` | pending | PENDING | optimize | npu-006 | +| `pending-npu007-dvfs-protocol-flag.json` | pending | PENDING | perf | npu-007 | + +## How to file a pending issue + +```bash +gh issue create --repo microsoft/winml-cli \ + --title "" \ + --body "$(cat pending-<name>.json | python -c 'import json,sys; d=json.load(sys.stdin); print(d["summary"] + "\n\n" + d["root_cause"])')" \ + --label "P2,triaged" + +# Then update the JSON: +# - Set issue_number, github_url, status = "OPEN", filed_date +# - Rename file from pending-* to <number>-<slug>.json +``` diff --git a/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json b/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json new file mode 100644 index 000000000..738050e6d --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/pending-cpu001-opset-regression-warning.json @@ -0,0 +1,67 @@ +{ + "issue_number": null, + "github_url": null, + "title": "winml build: warn when opset 19/21 regresses dense-Transpose models on CPU EP", + "status": "PENDING", + "labels": ["bug", "cpu", "dev experience", "P2"], + "filed_date": null, + "category": "build", + "source_findings": ["cpu-001"], + "affected_eps": ["cpu"], + "affected_arch": ["convnext", "dinov2", "dense_transpose_vit"], + "summary": "Auto-config baseline for ConvNext and DINOv2 uses a special Transpose-optimizer bypass path. Any explicit opset override (17, 19, or 21) disrupts this path and causes 3–10x slowdown on CPU EP.
Users have no warning when this happens.", + "root_cause": "ORT's CPU EP Transpose optimizer uses a code path that only activates when opset is left at the ONNX model's native value. Forcing an explicit opset — even opset 17 (same as baseline) — triggers a different code path that materializes all Transpose operations explicitly, causing catastrophic memory overhead on dense-Transpose graph topologies.", + "measured_impact": [ + { + "model": "microsoft/convnext-base", + "ep": "cpu", + "hypothesis": "h2 (opset19)", + "baseline_ms": null, + "gain_pct": -290.0, + "verdict": "DISCARD", + "protocol": "3x300 iters", + "date": "2026-06 (prior sweep)" + }, + { + "model": "facebook/dinov2-small", + "ep": "cpu", + "hypothesis": "h1 (opset17 explicit)", + "baseline_ms": 112.6, + "result_ms": 762.0, + "gain_pct": -577.0, + "verdict": "DISCARD", + "note": "Even forcing opset17 (same as baseline) causes 6.8x regression — it's the explicitness of the override, not the version number.", + "protocol": "3x300 iters", + "date": "2026-06-18" + }, + { + "model": "facebook/dinov2-small", + "ep": "cpu", + "hypothesis": "h2 (opset19)", + "baseline_ms": 112.6, + "result_ms": 1106.0, + "gain_pct": -882.0, + "verdict": "DISCARD", + "note": "9.8x slowdown — cpu-001 fires on DINOv2 as hard as ConvNext.", + "protocol": "3x300 iters", + "date": "2026-06-18" + }, + { + "model": "microsoft/resnet-18", + "ep": "cpu", + "hypothesis": "h2 (opset19)", + "baseline_ms": null, + "gain_pct": 2.4, + "verdict": "MARGINAL", + "note": "ResNet is SAFE — sparse Transpose graph, opset changes neutral to slightly positive.", + "date": "2026-06-18" + } + ], + "fix_needed": { + "file": "winml build (pipeline)", + "description": "When user requests opset override on a dense-Transpose model targeting CPU EP, emit a warning: 'cpu-001: opset override may disrupt ORT Transpose optimizer bypass path on this model. 
Baseline auto-config is recommended for CPU EP with dense-Transpose architectures.'" + }, + "discriminator": "Transpose count >= 49 AND ep == 'cpu' → warn before applying opset override", + "related_issues": [], + "notes": "cpu-001 was originally documented as ConvNext-specific. 2026-06-18 sweep confirms it fires equally on DINOv2 (both have dense Transpose graphs). Scope updated in cpu.json." +} diff --git a/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json b/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json new file mode 100644 index 000000000..a9d57b3f7 --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/pending-cpu008-layer-norm-fusion-guard.json @@ -0,0 +1,46 @@ +{ + "issue_number": null, + "github_url": null, + "title": "winml optimize: guard layer_norm_fusion against CNN-ViT hybrid LayerNorm patterns", + "status": "PENDING", + "labels": ["bug", "cpu", "graph-optimizer", "P2"], + "filed_date": null, + "category": "optimize", + "source_findings": ["cpu-008"], + "affected_eps": ["cpu"], + "affected_arch": ["mobilevit", "cnn_vit_hybrid"], + "summary": "layer_norm_fusion causes catastrophic regression (-997%) on MobileViT CPU. The CNN encoder's LayerNorm is applied to different tensor shapes than standard Transformer LayerNorm — the fusion pattern mismatch results in incorrect kernel selection and extreme slowdown.", + "root_cause": "MobileViT uses LayerNorm inside its CNN unfold blocks on patch-level feature tensors (e.g., [B, N_patches, C]). Standard layer_norm_fusion targets [B, seq, hidden] Transformer LN. 
The fusion incorrectly matches MobileViT LN due to shape similarity, but the fused kernel paths produce much slower execution on the CNN-LN tensor layout.", + "measured_impact": [ + { + "model": "apple/mobilevit-small", + "ep": "cpu", + "hypothesis": "h6 (layer_norm_fusion)", + "baseline_ms": 73.0, + "result_ms": 803.0, + "gain_pct": -997.8, + "verdict": "DISCARD", + "note": "11x slowdown. Most severe regression observed in the entire CPU catalog sweep.", + "protocol": "3x300 iters", + "date": "2026-06-18" + }, + { + "model": "apple/mobilevit-small", + "ep": "cpu", + "hypothesis": "h9 (matmul_transpose_fusion)", + "baseline_ms": 73.0, + "gain_pct": -165.0, + "verdict": "DISCARD", + "note": "Also regresses badly — consistent with CNN-ViT hybrid pattern mismatch theme.", + "date": "2026-06-18" + } + ], + "fix_needed": { + "file": "analyze_insight.py", + "description": "Detect Gemm→Reshape→Transpose unfold blocks (same as #921 discriminator). If present, add layer_norm_fusion and matmul_transpose_fusion to CPU skip_set.", + "note": "Same discriminator as issue #921 — models with CNN-ViT hybrid fingerprint should have both highdimRTR (NPU/GPU) AND layer_norm_fusion (CPU) in their skip_set." + }, + "discriminator": "Gemm→Reshape→Transpose count > 0 AND ep == 'cpu' → skip layer_norm_fusion, matmul_transpose_fusion", + "related_issues": [921], + "notes": "This is the CPU analog of issue #921. The Gemm-unfold fingerprint is the common discriminator for both. A single detection in analyze_insight.py can feed both skip_sets." 
+} diff --git a/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json b/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json new file mode 100644 index 000000000..042f65dd2 --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/pending-npu006-fusedconv-unfuse.json @@ -0,0 +1,46 @@ +{ + "issue_number": null, + "github_url": null, + "title": "winml optimize: add FusedConv detection and unfuse path for QNN EP", + "status": "PENDING", + "labels": ["bug", "qnn", "graph-optimizer", "P1"], + "filed_date": null, + "category": "optimize", + "source_findings": ["npu-006"], + "affected_eps": ["qnn_npu", "qnn_gpu"], + "affected_arch": ["resnet", "cnn_dense", "convnext"], + "summary": "Conv fusions (conv-bn + conv-add + conv-activation) produce FusedConv nodes that QNN EP cannot dispatch, causing CPU fallback and catastrophic regression (up to 4900%). winml optimize should detect FusedConv nodes when targeting QNN EP and either block the fusion or unfuse them post-build.", + "root_cause": "ORT graph optimizer's conv fusion pass combines Conv+BatchNorm+Add+Activation into a single FusedConv node. QNN EP's op support list does not include FusedConv — it falls back to CPU EP for these nodes, which defeats the purpose of QNN execution. 
The full pack (all 3 fusions together) is catastrophic; individual fusions (conv_add alone) are neutral-to-safe.", + "measured_impact": [ + { + "model": "microsoft/resnet-18", + "ep": "qnn_npu", + "hypothesis": "h4 (full conv fusion pack)", + "baseline_ms": 7.23, + "result_ms": 361.5, + "gain_pct": -4899.0, + "verdict": "DISCARD", + "note": "4900% regression — pure CPU fallback due to FusedConv unsupported by QNN EP.", + "protocol": "3x500 iters", + "date": "2026-06 (prior sweep)" + }, + { + "model": "microsoft/resnet-18", + "ep": "qnn_npu", + "hypothesis": "h10 (conv_add_fusion only)", + "baseline_ms": 7.23, + "gain_pct": 0.93, + "verdict": "NEUTRAL", + "note": "conv_add alone is SAFE — only the full 3-fusion pack creates FusedConv.", + "date": "2026-06-17" + } + ], + "fix_needed": { + "description": "Option A (preferred): In winml build pipeline, after applying conv fusions, scan graph for FusedConv nodes. If EP is QNN (NPU or GPU), unfuse them back to Conv+BN+Add+Activation before compilation.", + "alternative": "Option B: Block conv_bn_fusion + conv_activation_fusion flags when EP=QNN. Allow conv_add_fusion alone (confirmed safe).", + "detection": "ORT graph: node.op_type == 'FusedConv' — these nodes should never reach QNN EP compilation." + }, + "discriminator": "Conv% of total ops > 20% AND ep in ('qnn_npu', 'qnn_gpu') → warn before applying full conv fusion pack", + "related_issues": [], + "notes": "npu-006 refinement (2026-06-17): conv_add_fusion alone is neutral (+0.93%) and safe. Only the combination (conv-bn + conv-add + conv-activation) creates FusedConv. The catalog sweep h4/h5 guard already warns based on Conv% threshold." 
+} diff --git a/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json b/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json new file mode 100644 index 000000000..6f6be79cc --- /dev/null +++ b/research/autoconfig/docs/feature-gaps/pending-npu007-dvfs-protocol-flag.json @@ -0,0 +1,48 @@ +{ + "issue_number": null, + "github_url": null, + "title": "winml perf: add --dvfs-protocol flag for reliable QNN NPU benchmarking (multi-session + cool-down)", + "status": "PENDING", + "labels": ["dev experience", "qnn", "NPU", "P2"], + "filed_date": null, + "category": "perf", + "source_findings": ["npu-007"], + "affected_eps": ["qnn_npu"], + "affected_arch": ["all"], + "summary": "Single-session winml perf on QNN NPU can show ±30% variance due to DVFS (Dynamic Voltage/Frequency Scaling) thermal noise. Reliable benchmarking requires 3+ independent sessions with 30s cool-down between sessions. This protocol should be built into winml perf as a first-class flag.", + "root_cause": "Snapdragon X Elite HTP (QNN NPU) uses DVFS to manage power/thermal. After sustained inference load, the NPU frequency drops to manage heat. A single bench session may start at full frequency (fast) and end at throttled frequency (slow), making the result unrepresentative. Multi-session protocol with cool-down ensures each session starts at a consistent thermal baseline.", + "measured_impact": [ + { + "description": "Single-session CV on MobileViT QNN NPU", + "cv_observed": "0.37 (37%)", + "note": "CV > 0.15 is common on QNN NPU — current catalog_qnn_sweep.py Phase A screen marks this as 'DVFS noise — high CV expected' and always proceeds to Phase B regardless." + }, + { + "description": "Multi-session stability", + "note": "3x500 iters with 30s cool-down between sessions achieves consistent results. All 3 session p50s should be compared (range non-overlap criterion) rather than using a single p50." 
+ } + ], + "proposed_api": { + "flag": "--dvfs-protocol", + "description": "When set, winml perf runs N_sessions independent sessions with cool_down_s seconds between them. Reports median of session p50s, range, and CV-per-session.", + "defaults": { + "n_sessions": 3, + "iterations_per_session": 500, + "cool_down_s": 30, + "warmup": 10 + }, + "output": { + "session_p50s_ms": [6.9, 7.0, 8.0], + "median_p50_ms": 7.0, + "range_ms": 1.1, + "dvfs_stable": true + } + }, + "fix_needed": { + "file": "winml perf command", + "description": "Add --dvfs-protocol flag. When active: run N sessions with cool-down, report median + range. Also report whether ranges overlap with a reference run (useful for A/B comparison)." + }, + "discriminator": "ep == 'qnn_npu' → recommend --dvfs-protocol for reliable results", + "related_issues": [], + "notes": "The autoconfig catalog_qnn_sweep.py already implements this protocol internally (FULL_SESSIONS=3, COOL_DOWN_S=30, FULL_ITERS=500). Promoting to winml perf as a first-class flag would let users get reliable NPU numbers without needing the full autoconfig sweep infrastructure." +} diff --git a/research/autoconfig/docs/self-evolution-design.html b/research/autoconfig/docs/self-evolution-design.html new file mode 100644 index 000000000..28d15ee9d --- /dev/null +++ b/research/autoconfig/docs/self-evolution-design.html @@ -0,0 +1,698 @@ +<!DOCTYPE html> +<html lang="en"> +<head> +<meta charset="UTF-8"> +<title>autoconfig Skill — Self-Evolution Design + + + + +

autoconfig Skill — Self-Evolution Design

+

How the sweep loop learns, stabilizes, and improves itself over time

+DESIGN +POC → V2 ROADMAP + +
+ + +
+ + +
+ + +

1 · Problem Statement

+ +

Performance noise makes sweep conclusions unreliable — results vary run-to-run, causing false KB promotions and wasted sweep time.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Pain Point | Root Cause | Severity | Gap |
|---|---|---|---|
| DVFS thermal noise | NPU frequency scales with temp; same model 2× slower when hot | CRITICAL | Fixed 30s cool-down ignores actual temperature |
| Sequence bias | Baseline runs cold, late hypotheses run hot (unfair) | HIGH | Systematic order bias, no mitigation |
| No paired comparison | Baseline & hypothesis measured at different thermal moments | HIGH | Delta confounds drift with gain (unpaired) |
| Fixed sample size | 3 sessions regardless of variance | MEDIUM | No adaptive sampling during sweep |
| Manual KB promotion | Findings written by hand from logs | MEDIUM | KB grows only when a human reads logs |
| No prioritization | All 14 hypotheses run per model, even irrelevant ones | LOW-MED | Sweeps don't consume skip_set yet |
+ + +

2 · Confidence-Gated Promotion (L1 → L5)

+ +

Findings climb confidence levels via quantitative gates — no manual judgement. KB holds L3+ only.

+ +
+ +
+
L1
+
+
Observed — Single Model, Single Run
+
Beats baseline in one sweep. Stored in results.json.
+
Gate: median gain > 5%.
+
+
+ +
+
+
+ Paired A/B bench → range non-overlap +
+
+ +
+
L2
+
+
Confirmed — Statistically Robust
+
All hypothesis p50s beat all baseline p50s; 95% CI excludes 0.
+
Gate: max(hyp_p50s) < min(baseline_p50s).
+
+
+ +
+
+
+ promote_findings.py — same flags on 2+ models, same arch +
+
+ +
+
L3
+
+
Generalized — Architecture Rule
+
Same flags give L2 gains on ≥2 models of one arch class. Written to ep_knowledge/<ep>.json.
+
Gate: ≥2 L2 with same (flags, arch_class).
+
+
+ +
+
+
+ promote_findings.py — confirmed across arch classes +
+
+ +
+
L4
+
+
Cross-Cutting Rule
+
Applies across ≥3 arch classes; scope broadens to EP-wide.
+
Gate: ≥3 L2 across ≥3 arch_class values.
+
+
+ +
+
+
+ analyze_insight.py predicts from graph fingerprint +
+
+ +
+
L5
+
+
Predictive — No Sweep Required
+
Graph fingerprint predicts help/hurt before running; sweep skips it and emits the optimal config.
+
Gate: L4 rule + pattern match, <5% false-positive rate on held-out models.
+
+
+ +
+ +
+ Current state (2026-06-18): Most findings sit at L2 (manual). npu-001 & npu-007 have L3-grade evidence. promote_findings.py and L5 prediction are not yet built.
+ +
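The two single-model gates (L1 and L2) can be sketched as below. This is a minimal illustration, not the planned promote_findings.py: the function name `single_model_level` and its return convention are hypothetical, and it applies only the median-gain and range non-overlap gates, not the 95% CI condition.

```python
import statistics


def single_model_level(baseline_p50s, hyp_p50s, min_gain_pct=5.0):
    """Return 0 (no finding), 1 (L1 observed) or 2 (L2 confirmed).

    Hypothetical helper: inputs are per-session p50 latencies in ms.
    """
    base = statistics.median(baseline_p50s)
    hyp = statistics.median(hyp_p50s)
    gain_pct = (base - hyp) / base * 100.0
    # L1 gate: median gain must exceed the threshold (5% in the ladder).
    if gain_pct <= min_gain_pct:
        return 0
    # L2 gate: every hypothesis session beats every baseline session.
    if max(hyp_p50s) < min(baseline_p50s):
        return 2
    return 1
```

A result with a large median gain but one slow hypothesis session stays at L1, which is the intent of the range non-overlap gate.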
+ + +
+ +

Each fix resolves a pain point from §1. #1 kills thermal noise + sequence bias + unpaired comparison · #2 fixed sample size · #3 no prioritization · #4 manual KB promotion · #5 thermal noise (calibration backstop).

+ + +

Fix #1 — Paired A/B Bench Protocol

+ +
+
+
❌ Current: Sequential (Biased)
+

All baseline runs, then all hypothesis runs. Baseline cold, hypothesis warm — "gain" includes thermal drift.

+
# device heats up →
+h0: [base] [base] [base]   # cool
+h6: [hyp]  [hyp]  [hyp]    # warm
+# delta = optimization + drift (confounded)
+
+
+
✅ New: Paired A/B (Unbiased)
+

Each pair runs baseline then hypothesis in one thermal window. Average the within-pair ratios — drift cancels.

+
pair_n: [base] → [hyp]   # ratio = (base-hyp)/base
+gain = mean(ratios) ± 95% CI
+# drift appears in both → cancels
+
+
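A toy calculation shows why the within-pair ratio is insensitive to drift. It assumes thermal drift acts as a multiplicative slowdown shared by both runs in a window, which is a simplification; the function names and numbers are illustrative only.

```python
def sequential_gain(base_ms, hyp_ms, drift):
    """All baselines first, all hypotheses later; drift differs between halves."""
    n = len(drift) // 2
    base = [base_ms * d for d in drift[:n]]
    hyp = [hyp_ms * d for d in drift[n:]]
    base_mean = sum(base) / n
    hyp_mean = sum(hyp) / n
    return (base_mean - hyp_mean) / base_mean * 100


def paired_gain(base_ms, hyp_ms, window_drift):
    """One baseline + one hypothesis per window; both see the same drift."""
    ratios = [(base_ms * d - hyp_ms * d) / (base_ms * d) * 100
              for d in window_drift]
    return sum(ratios) / len(ratios)


# True gain is 10% (10 ms -> 9 ms). The device is 20% slower when hot.
seq = sequential_gain(10.0, 9.0, [1.0, 1.0, 1.0, 1.2, 1.2, 1.2])
paired = paired_gain(10.0, 9.0, [1.0, 1.1, 1.2])
```

Under these assumptions the sequential estimate reports a regression of about 8% for a config that is really 10% faster, while the paired estimate recovers the 10% gain because the drift factor divides out of each ratio.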
+ +
import math
+import statistics
+import time
+
+def paired_ab_bench(baseline, hyp, n_pairs=3, iters=500, cool_down_s=30) -> dict:
+    """Interleaved A/B bench → gains_pct list + CI + verdict."""
+    gains = []
+    for i in range(n_pairs):
+        # Baseline and hypothesis run back to back, inside one thermal window.
+        b = run_perf_session(baseline, iters)
+        h = run_perf_session(hyp, iters)
+        if b and h: gains.append((b - h) / b * 100)   # skip pairs where a session failed
+        if i < n_pairs - 1: time.sleep(cool_down_s)
+    if not gains: return {"verdict": "BENCH_FAIL"}
+    mean = statistics.mean(gains)
+    # 1.96 is the normal-approximation z; at n=3 the Student-t value (4.30) gives a wider CI.
+    ci   = 1.96 * statistics.stdev(gains) / math.sqrt(len(gains)) if len(gains) > 1 else 999
+    verdict = ("KEEP_CONFIRMED" if mean - ci > 5 else
+               "DISCARD" if mean + ci < -2 else "MARGINAL")
+    return {"gains_pct": gains, "mean_gain_pct": round(mean, 2),
+            "ci_half_95": round(ci, 2), "verdict": verdict}
+ + +

Fix #2 — Adaptive n_sessions

+ +

Keep sampling until the 95% CI is decisive — or budget runs out — instead of a fixed N.

+ +
+
+
Stopping Criterion
+

Stop early: CI_lower > +5% (KEEP) or CI_upper < -2% (DISCARD).

+

Force stop at MAX_PAIRS = 8 → MARGINAL.

+

Stable models finish in 3 pairs; noisy ones get more automatically.

+
+
+
Budget Allocation
+

Priority queue: test highest-prior hypotheses first.

+

Once a KEEP_CONFIRMED is found, remaining hypotheses get fewer pairs (quick reject/confirm).

+
+
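The stopping criterion could look roughly like this. It is a sketch under assumptions: `measure_pair` is a hypothetical callable that runs one baseline/hypothesis pair and returns the gain in percent, the minimum of 3 pairs is taken from the "stable models finish in 3 pairs" remark, and the CI uses the same normal approximation as paired_ab_bench.

```python
import math
import statistics

MAX_PAIRS = 8


def adaptive_verdict(measure_pair, min_pairs=3, max_pairs=MAX_PAIRS):
    """Sample pairs until the 95% CI is decisive or the budget runs out.

    Returns (verdict, pairs_used).
    """
    gains = []
    while len(gains) < max_pairs:
        gains.append(measure_pair())
        if len(gains) < min_pairs:
            continue  # too few samples for a meaningful interval
        mean = statistics.mean(gains)
        ci = 1.96 * statistics.stdev(gains) / math.sqrt(len(gains))
        if mean - ci > 5:       # CI_lower > +5%
            return "KEEP_CONFIRMED", len(gains)
        if mean + ci < -2:      # CI_upper < -2%
            return "DISCARD", len(gains)
    return "MARGINAL", len(gains)
```

A stable hypothesis returns after three pairs; a noisy one keeps sampling and falls through to MARGINAL once MAX_PAIRS is reached.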
+ + +

Fix #3 — Architecture-Based Hypothesis Pruning

+ +

Sweeps consume analyze_insight.py graph patterns to skip irrelevant/harmful hypotheses — cutting 14 to 4–5 per model.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Architecture Class | Graph Fingerprint | Skip (known harmful) | Prioritize (likely helpful) |
|---|---|---|---|
| Pure ViT (DINOv2, ViT-B, YOLOS) | Dense Transpose (≥49), no Gemm-unfold blocks | conv fusions (h4/h5/h10), layer_norm_fusion | opset21 (npu-001), highdimRTR (L4), bias_softmax |
| Pure CNN (ResNet-18, EfficientNet) | Conv% > 20%, sparse Transpose | attention_fusion, highdimRTR, bias_softmax | matmul_transpose_fusion (cpu-007), opset19/21 safe |
| CNN-ViT Hybrid (MobileViT, EfficientFormer) | Gemm→Reshape→Transpose unfold blocks present | highdimRTR ⚠ -19% NPU, layer_norm_fusion ⚠ -997% CPU, matmul_transpose CPU | opset21 + matmul_transpose_fusion (NPU h6: +42%) |
| BERT / NLP Encoder (BERT, RoBERTa, DistilBERT, MiniLM) | Attention pattern, sparse Transpose, Add→Softmax | conv fusions, layer_norm_fusion (BERT LN≠CV LN), opset21 (cpu-001 on dense-Transpose subclass) | attention_fusion, bias_softmax_fusion (npu-009 +14%) |
| Dense-Transpose ViT (ConvNext, DINOv2-style) | Transpose count ≥ 49 AND Gemm-unfold absent | opset19/21 on CPU ⚠ cpu-001 10x slowdown | opset17 explicit (baseline), highdimRTR |
+ +
def get_hypothesis_skip_set(model_type, candidates) -> set[str]:
+    """Skip hypotheses by arch fingerprint + KB rules."""
+    skip, tags = set(), {c.tag for c in candidates}
+    if "conv_dense" in tags:                  skip |= {"h4", "h5"}   # npu-006
+    if "gemm_reshape_transpose_unfold" in tags: skip.add("h9")       # npu-010 highdimRTR
+    if model_type == "mobilevit":             skip.add("h6")       # cpu-008 layer_norm
+    if "dense_transpose" in tags:             skip |= {"h2", "h3"}   # cpu-001 opset19/21
+    return skip
+ + +

Fix #4 — promote_findings.py

+ +

Post-processing script: reads all results.json, applies the confidence ladder, auto-updates the KB.

+ [Diagram: results.json × N (catalog-*-sweep/*/results.json) → promote_findings.py (L1→L2: range non-overlap; L2→L3: 2+ models, same arch) → ep_knowledge/*.json (auto-generated findings with evidence list) → analyze_insight.py (reads KB → skip_set for next sweep)]
import json
+from collections import defaultdict
+
+def collect_l2_candidates(sweep_dirs):
+    """L1→L2: range non-overlap for a single model."""
+    out = []
+    for d in sweep_dirs:   # each d is a pathlib.Path to one sweep directory
+        r = json.loads((d / "results.json").read_text())
+        base = r["hypotheses"]["h0"]["full"]["p50s_ms"]
+        for h_id, h in r["hypotheses"].items():
+            if h.get("verdict") == "KEEP_CONFIRMED" and max(h["full"]["p50s_ms"]) < min(base):
+                # "or {}" guards against extra_optim being stored as null.
+                out.append({"arch": r["model_type"], "ep": r["ep"],
+                            "flags": h.get("extra_optim") or {}, "gain_pct": h["mean_gain_pct"]})
+    return out
+
+def promote_to_l3(l2s):
+    """L2→L3: same flags on ≥2 models of one arch class."""
+    g = defaultdict(list)
+    for c in l2s:
+        # Group by (ep, flags, arch); one L2 entry per model is assumed, so len(ev) counts models.
+        g[(c["ep"], frozenset(c["flags"].items()), c["arch"])].append(c)
+    return [{"ep": ep, "flags": dict(f), "arch": a, "evidence": ev}
+            for (ep, f, a), ev in g.items() if len(ev) >= 2]
+ + +

Fix #5 — Thermal Reference Model (P2)

+ +
+
+
Concept
+

Run a fixed tiny model (100 iters) before each session — its latency proxies device thermal state.

+

Store thermal_ref_p50_ms; normalize gains by it for valid cross-run comparison.

+
+
+
When to Use
+

Cool (≤ 1.05× cold): proceed.

+

Hot (> 1.3× cold): wait 60s, retry up to 3×, then flag "HOT_RUN".

+

HOT_RUN sessions excluded from L2 promotion.

+
+
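A possible shape for the thermal gate is sketched below. The helper name `thermal_gate`, the injected `measure_ref_p50_ms` callable, and the "WARM" label for the band between 1.05× and 1.3× are assumptions; the design above only defines the cool and hot cases.

```python
import time


def thermal_gate(measure_ref_p50_ms, cold_ref_ms, wait_s=60, max_retries=3,
                 sleep=time.sleep):
    """Classify device thermal state from a tiny reference model's latency.

    measure_ref_p50_ms: hypothetical callable returning the reference p50 in ms.
    Returns (state, ratio), where ratio = current p50 / cold p50.
    """
    ratio = measure_ref_p50_ms() / cold_ref_ms
    retries = 0
    # Hot (> 1.3x cold): wait, then re-measure, up to max_retries times.
    while ratio > 1.3 and retries < max_retries:
        sleep(wait_s)
        retries += 1
        ratio = measure_ref_p50_ms() / cold_ref_ms
    if ratio > 1.3:
        return "HOT_RUN", ratio   # caller excludes this session from L2 promotion
    if ratio <= 1.05:
        return "COOL", ratio
    return "WARM", ratio          # assumed label; the design leaves this band open
```

Passing `sleep` as a parameter keeps the gate testable without real 60-second waits.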
+ +
+ + +
+ + +

3 · Full Self-Evolution Loop

+ [Diagram: analyze_insight.py (graph fingerprint → skip_set + priority) → catalog_*_sweep.py (Paired A/B · adaptive n; 4–5 hyps, pruned) → results.json (verdict · CI · gain; extra_optim stored) → promote_findings.py (L1→L2→L3→L4; auto-updates KB) → ep_knowledge/*.json (L3+ findings, config_optimal); skip_set feeds back → sweeps get shorter]

4 · Implementation Plan

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Priority | Component | File(s) | Status | Key change |
|---|---|---|---|---|
| P0 | Paired A/B bench primitive | sweep_utils.py NEW | TODO | paired_ab_bench(baseline_path, hyp_path, n_pairs, iters) → verdict + CI |
| P0 | Adaptive n_sessions | sweep_utils.py NEW | TODO | Stop when CI_lower > 5% or CI_upper < -2%, max 8 pairs |
| P0 | promote_findings.py | promote_findings.py NEW | TODO | L1→L2→L3→L4 gates, auto-write to ep_knowledge/*.json |
| P0 | Champion config output | All 3 sweep scripts MOD | TODO | After sweep: write config_<ep>_<device>_optimal.json from best hypothesis |
| P1 | Architecture hypothesis pruning | analyze_insight.py MOD | TODO | get_hypothesis_skip_set(model_type, candidates) → set of h_ids to skip |
| P1 | Wire Paired A/B into QNN sweep | catalog_qnn_sweep.py MOD | TODO | Replace bench_full() with paired_ab_bench(); add PAIRED_AB=True flag |
| P1 | Wire Paired A/B into GPU + CPU sweeps | catalog_gpu_sweep.py, catalog_cpu_sweep.py MOD | TODO | Same as QNN; import from sweep_utils |
| P1 | Feature gaps issue log | docs/feature-gaps/issues.md NEW | TODO | Persistent log of all research-derived GitHub issues with context + date |
| P2 | Thermal reference model | sweep_utils.py NEW | TODO | thermal_calibrate(ep, device) → thermal_ref_p50_ms, HOT_RUN detection |
| P2 | L5 prediction in analyze_insight | analyze_insight.py MOD | TODO | Read L4 KB rules → predict winner before sweep; emit champion config directly |
+ +
+ Key insight: Once Paired A/B + promote_findings.py exist, the system self-corrects — each sweep adds L2 candidates that reinforce or surface KB rules. skip_set grows richer, sweeps get shorter and more reliable, with no human in the loop between runs. +
+ +
+ +

+
Generated 2026-06-18 · research/autoconfig/docs/self-evolution-design.html
+ + + + + diff --git a/research/autoconfig/ep_device_knowledge/README.md b/research/autoconfig/ep_device_knowledge/README.md new file mode 100644 index 000000000..51310f233 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/README.md @@ -0,0 +1,56 @@ +# Per-EP Empirical Knowledge Base + +Each JSON file stores empirical findings for one EP/device combination. + +## ⚠️ CRITICAL EPISTEMICS + +These findings are **observational hypotheses, not ground truth**. They were derived +from a small number of experiments on a single model (ConvNext-tiny) on a single device +(Snapdragon X Elite CRD). Every finding carries a `confidence` field and a `falsified_by` +field. Before using a finding to prune a search space, check: + +1. **Is the model architecture similar?** (ConvNext ≠ BERT ≠ ResNet) +2. **Is the hardware the same?** (X Elite CRD ≠ X Plus ≠ X1E-80-100) +3. **Is the ORT/QNN SDK version the same?** +4. **Is the mechanism confirmed?** (see `mechanism_confirmed` field) + +**Dialectical rule**: A finding that prunes a search dimension must be re-enabled +if a new experiment on a new model/hardware contradicts it. Findings degrade over time +as ORT and QNN SDK versions change. + +## ✅ Promotion checklist (before a finding becomes a pruning rule) + +These rules exist because of the **npu-001 / MobileViT failure**: a `+26.5%` opset-21 +"win" was recorded from a single sweep whose baseline (~12 ms) was silently inflated by +DVFS/thermal throttling. A clean from-scratch rerun (2026-06-22) measured the baseline at +~5.5 ms and the same config at +2.8% — fully within noise. The fake gain came from a +**polluted baseline and a cross-run comparison**, the two least reliable things on a +DVFS NPU. To avoid recording artifacts as findings, a result must clear ALL of these +before its `confidence` is raised above `draft` / before it is used to prune search space: + +1. 
**Paired / same-thermal-window measurement.** Compare a config against its baseline + measured in the *same* thermal window (interleave A/B/A/B), and compare the + within-window **delta** — never an absolute baseline carried over from another run. +2. **Clean baseline gate.** Reject the whole comparison if the baseline session-to-session + CV is high or contains a >2σ spike. A noisy baseline poisons every ratio derived from it. +3. **Effect size > noise floor.** Require `gain% >= 2 × (session-to-session CV)` AND + non-overlapping session p50 ranges. A sub-5% median win on QNN NPU is noise by default. + (`catalog_sweep.py` now emits `best_gain_verdict`: `RELIABLE` / + `NEUTRAL_WITHIN_NOISE` / `UNRELIABLE_RANGES_OVERLAP` for exactly this.) +4. **Independent reruns, then tiered confidence.** A single sweep is **L1 (draft)** only. + Promote to **L3** only after ≥N independent reruns (fresh build) agree in direction; + reach **L5** only after cross-time / cross-device stability. Only ≥L3 findings may be + used to prune the search space (see `docs/self-evolution-design.html`, L1–L5). +5. **Track absolute-baseline drift.** Record each model's absolute baseline over time. If + the baseline shifts beyond threshold between runs, **invalidate dependent findings** and + re-measure — a baseline that moves 2× is itself a regression signal, not a constant. + +> One-line rule: on DVFS hardware, trust only **same-window paired deltas that exceed the +> noise floor and reproduce across independent reruns** — never single-run absolute +> baselines or cross-run ratios. 
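
The effect-size and range checks in item 3 can be sketched as follows. This is an illustration only, not the logic in `catalog_sweep.py`: the function name `gain_verdict`, the use of the baseline's session-to-session CV as the noise estimate, and the mapping of each failed check to a verdict string are all assumptions. The clean-baseline gate from item 2 is not included.

```python
import statistics


def gain_verdict(baseline_p50s, hyp_p50s, cv_mult=2.0):
    """Classify a gain against the noise floor (hypothetical helper).

    Inputs are per-session p50 latencies in ms from the same thermal window.
    """
    base_med = statistics.median(baseline_p50s)
    hyp_med = statistics.median(hyp_p50s)
    gain_pct = (base_med - hyp_med) / base_med * 100.0
    # Session-to-session coefficient of variation of the baseline, in percent.
    cv_pct = statistics.stdev(baseline_p50s) / statistics.mean(baseline_p50s) * 100.0
    noise_floor_pct = cv_mult * cv_pct
    if gain_pct < noise_floor_pct:
        return "NEUTRAL_WITHIN_NOISE"
    # Range non-overlap: every hypothesis session must beat every baseline session.
    if max(hyp_p50s) >= min(baseline_p50s):
        return "UNRELIABLE_RANGES_OVERLAP"
    return "RELIABLE"
```

With a baseline like `[12.0, 6.0, 5.5]` (one throttled session), the noise floor is so high that a 10% median win is reported as noise, which matches the polluted-baseline failure described above.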
+ +## Files +- `qnn_npu.json` — QNN HTP (NPU) EP findings +- `qnn_gpu.json` — QNN GPU EP findings +- `dml.json` — DirectML EP findings +- `cpu.json` — CPU EP findings diff --git a/research/autoconfig/ep_device_knowledge/_auto_promoted.json b/research/autoconfig/ep_device_knowledge/_auto_promoted.json new file mode 100644 index 000000000..90147b3b0 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/_auto_promoted.json @@ -0,0 +1,305 @@ +{ + "_meta": { + "generated_by": "promote_findings.py", + "status": "draft", + "note": "Auto-generated promotion candidates. NOT curated KB. Apply the promotion checklist in ep_device_knowledge/README.md (paired A/B, clean baseline, effect-size > noise floor, independent reruns, baseline-drift check) before merging into _.json.", + "gates": { + "L1_gain_pct": 5.0, + "L2_effect_size_cv_mult": 2.0, + "L3_min_models": 2, + "L4_min_arch_classes": 3 + } + }, + "L1_observed": [ + { + "model_id": "apple/mobilevit-small", + "arch_class": "mobilevit", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h7", + "label": "opset 17 + bias_softmax_fusion", + "flags": "bias_softmax_fusion=True + opset=17", + "gain_pct": 12.34, + "noise_floor_pct": 81.13, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h6", + "label": "opset 17 + layer_norm_fusion", + "flags": "layer_norm_fusion=True + opset=17", + "gain_pct": 10.43, + "noise_floor_pct": 7.58, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h9", + "label": "opset 17 + matmul_transpose_fusion", + "flags": "matmul_transpose_fusion=True + opset=17", + "gain_pct": 92.51, + "noise_floor_pct": 71.15, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h10", + "label": "opset 
17 + attention + skip_layer_norm + layer_norm", + "flags": "attention_fusion=True + layer_norm_fusion=True + opset=17 + skip_layer_norm_fusion=True", + "gain_pct": 91.54, + "noise_floor_pct": 115.77, + "ranges_separated": true, + "level": 1 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h11", + "label": "opset 17 + nchwc_transformer (Conv-heavy models)", + "flags": "nchwc_transformer=True + opset=17", + "gain_pct": 82.79, + "noise_floor_pct": 223.59, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h12", + "label": "opset 17 + transpose_optimizer", + "flags": "opset=17 + transpose_optimizer=True", + "gain_pct": 84.46, + "noise_floor_pct": 61.25, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h13", + "label": "opset 17 + gelu_fusion explicit", + "flags": "gelu_fusion=True + opset=17", + "gain_pct": 88.89, + "noise_floor_pct": 254.22, + "ranges_separated": true, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h4", + "label": "opset 17 + matmul_transpose_fusion", + "flags": "matmul_transpose_fusion=True + opset=17", + "gain_pct": 8.45, + "noise_floor_pct": 14.83, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h5", + "label": "opset 17 + attention_fusion", + "flags": "attention_fusion=True + opset=17", + "gain_pct": 10.55, + "noise_floor_pct": 14.83, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h6", + "label": "opset 17 + bias_softmax_fusion", + "flags": 
"bias_softmax_fusion=True + opset=17", + "gain_pct": 6.39, + "noise_floor_pct": 14.83, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h9", + "label": "opset 21 + matmul_transpose + attention_fusion", + "flags": "attention_fusion=True + matmul_transpose_fusion=True + opset=21", + "gain_pct": 12.85, + "noise_floor_pct": 14.83, + "ranges_separated": true, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h11", + "label": "opset 17 + gelu_fusion explicit", + "flags": "gelu_fusion=True + opset=17", + "gain_pct": 13.86, + "noise_floor_pct": 14.83, + "ranges_separated": true, + "level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h12", + "label": "opset 17 + transpose_optimizer", + "flags": "opset=17 + transpose_optimizer=True", + "gain_pct": 16.67, + "noise_floor_pct": 14.83, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/rad-dino", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h11", + "label": "opset 17 + gelu_fusion explicit", + "flags": "gelu_fusion=True + opset=17", + "gain_pct": 2.0, + "noise_floor_pct": 1.72, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h11", + "label": "opset 17 + gelu_fusion explicit", + "flags": "gelu_fusion=True + opset=17", + "gain_pct": 6.4, + "noise_floor_pct": 14.6, + "ranges_separated": false, + "level": 1 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h12", + "label": "opset 17 + transpose_optimizer", + "flags": "opset=17 + transpose_optimizer=True", + "gain_pct": 8.38, + "noise_floor_pct": 14.6, + "ranges_separated": false, + 
"level": 1 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "npu", + "hyp_id": "h3", + "label": "opset 21 (tests npu-001 bypass)", + "flags": "opset=21", + "gain_pct": 24.14, + "noise_floor_pct": 81.45, + "ranges_separated": false, + "level": 1 + } + ], + "L2_confirmed_single_model": [ + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h6", + "label": "opset 17 + layer_norm_fusion", + "flags": "layer_norm_fusion=True + opset=17", + "gain_pct": 10.43, + "noise_floor_pct": 7.58, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h9", + "label": "opset 17 + matmul_transpose_fusion", + "flags": "matmul_transpose_fusion=True + opset=17", + "gain_pct": 92.51, + "noise_floor_pct": 71.15, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/resnet-18", + "arch_class": "resnet", + "ep": "cpu", + "device": "cpu", + "hyp_id": "h12", + "label": "opset 17 + transpose_optimizer", + "flags": "opset=17 + transpose_optimizer=True", + "gain_pct": 84.46, + "noise_floor_pct": 61.25, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "facebook/dinov2-small", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h12", + "label": "opset 17 + transpose_optimizer", + "flags": "opset=17 + transpose_optimizer=True", + "gain_pct": 16.67, + "noise_floor_pct": 14.83, + "ranges_separated": true, + "level": 2 + }, + { + "model_id": "microsoft/rad-dino", + "arch_class": "dinov2", + "ep": "qnn", + "device": "gpu", + "hyp_id": "h11", + "label": "opset 17 + gelu_fusion explicit", + "flags": "gelu_fusion=True + opset=17", + "gain_pct": 2.0, + "noise_floor_pct": 1.72, + "ranges_separated": true, + "level": 2 + } + ], + "L3_generalized_arch_rule": [], + "L4_cross_cutting_rule": [] +} diff --git 
a/research/autoconfig/ep_device_knowledge/cpu_cpu.json b/research/autoconfig/ep_device_knowledge/cpu_cpu.json new file mode 100644 index 000000000..9ed9fe3f3 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/cpu_cpu.json @@ -0,0 +1,401 @@ +{ + "_meta": { + "ep": "cpu", + "device": "cpu", + "hardware": "Snapdragon X Elite CRD (Oryon CPU)", + "ort_version": "1.x (check winml version at experiment time)", + "model": "facebook/convnext-tiny-224 (ALL findings from this model only)", + "last_updated": "2026-06-18", + "epistemics_warning": "⚠️ All findings from rigorous 3-run ablation. However, still 1 model, 1 device. CPU behavior can differ significantly between x86 and ARM (Oryon). Check architecture before applying rules.", + "models_tested": [ + "facebook/convnext-tiny-224 (original ablation)", + "microsoft/resnet-18 (catalog_cpu_sweep 2026-06-18)", + "apple/mobilevit-small (catalog_cpu_sweep 2026-06-18)", + "facebook/dinov2-small (catalog_cpu_sweep 2026-06-18)", + "deepset/roberta-base-squad2 (sweep in progress)", + "deepset/tinyroberta-squad2 (sweep in progress)", + "BAAI/bge-small-en-v1.5 (sweep in progress)", + "sentence-transformers/all-MiniLM-L6-v2 (sweep in progress)" + ] + }, + "sweep_config": { + "results_dir": "catalog-cpu-sweep", + "quant": false, + "compile": false, + "screen": { + "warmup": 10, + "iters": 200, + "cv_max": 0.1, + "thermal_aware": false + }, + "full": { + "warmup": 10, + "iters": 300, + "sessions": 3, + "cool_down_s": 2 + }, + "confirm_sessions": 2, + "min_improvement_pct": 5.0, + "effect_size_gate": false, + "effect_size_cv_mult": 2.0, + "accuracy_eval": false, + "eval_samples": 50, + "paired_ab_available": false, + "baseline_priority": [ + "h0" + ], + "timeouts": { + "config_s": 300, + "build_s": 600, + "bench_s": 480, + "eval_s": 360, + "model_s": null + } + }, + "hypotheses": [ + { + "id": "h0", + "label": "baseline (opset 17, autoconf defaults)", + "opset": null, + "optim": null + }, + { + "id": "h1", + "label": "opset 17 
explicit", + "opset": 17, + "optim": null + }, + { + "id": "h2", + "label": "opset 19 (cpu-001 risk - transformer test)", + "opset": 19, + "optim": null + }, + { + "id": "h3", + "label": "opset 21 (cpu-001 risk - transformer test)", + "opset": 21, + "optim": null + }, + { + "id": "h4", + "label": "opset 17 + attention_fusion", + "opset": 17, + "optim": { + "attention_fusion": true + } + }, + { + "id": "h5", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "optim": { + "skip_layer_norm_fusion": true + } + }, + { + "id": "h6", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "optim": { + "layer_norm_fusion": true + } + }, + { + "id": "h7", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "optim": { + "bias_softmax_fusion": true + } + }, + { + "id": "h8", + "label": "opset 17 + matmul_add_fusion (cpu-002 guarded)", + "opset": 17, + "optim": { + "matmul_add_fusion": true + }, + "guard": { + "type": "skip_if_gemm", + "finding": "cpu-002" + } + }, + { + "id": "h9", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "optim": { + "matmul_transpose_fusion": true + } + }, + { + "id": "h10", + "label": "opset 17 + attention + skip_layer_norm + layer_norm", + "opset": 17, + "optim": { + "attention_fusion": true, + "skip_layer_norm_fusion": true, + "layer_norm_fusion": true + } + }, + { + "id": "h11", + "label": "opset 17 + nchwc_transformer (Conv-heavy models)", + "opset": 17, + "optim": { + "nchwc_transformer": true + } + }, + { + "id": "h12", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "optim": { + "transpose_optimizer": true + } + }, + { + "id": "h13", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "optim": { + "gelu_fusion": true + } + }, + { + "id": "h14", + "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)", + "opset": null, + "optim": null, + "build_flags": [ + "--no-analyze" + ] + } + ], + "models": [ + { + "id": "microsoft/resnet-18", + "task": 
"image-classification", + "model_type": "resnet" + }, + { + "id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit" + }, + { + "id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "task": "text-classification", + "model_type": "distilbert" + }, + { + "id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert" + }, + { + "id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "deepset/tinyroberta-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "model_type": "bert" + } + ], + "cross_checks": [ + { + "id": "cpu-001", + "type": "regression_probe", + "hypotheses": [ + "h2", + "h3" + ], + "gain_threshold_pct": -50.0, + "label": "opset 19/21 regression on Transpose-dense models" + } + ], + "findings": [ + { + "id": "cpu-001", + "title": "opset 19+ causes 3-10x slowdown on models with Transpose-heavy graphs (ConvNext + DINOv2 confirmed) — NOT ConvNext-specific", + "observation": "ConvNext: opset17=43.7ms, opset19=160ms (3.7x), opset21=170ms (3.9x). DINOv2-small catalog_cpu_sweep 2026-06-18: baseline (auto-config)=112.6ms, opset19=1106ms (9.8x CPU001_REGRESSION), opset21=1095ms (9.7x). CRITICAL: cpu-001 is NOT ConvNext-specific. DINOv2 is a pure-ViT model with no ConvNext architecture overlap. ResNet-18: opset17=237ms, opset19=231ms (+2.4% neutral), opset21=226ms (+4.5% neutral) — ResNet NOT affected. MobileViT: opset19=-9.1%, opset21=-7.4% (mild slowdown, not catastrophic). 
Pattern: models with dense Transpose usage (DINOv2, ConvNext) hit cpu-001; models with sparse Transpose (ResNet) do not.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "Original hypothesis: ORT C++ Transpose Optimizer has a kMaxSupportedOpset gate (optimizer_api.h). If model opset > kMaxSupportedOpset, Transpose Optimizer is skipped silently. ConvNext has 42 Transpose nodes; without optimization, each executes as a full memory-layout copy. HOWEVER: the non-monotonic recovery at opset 22 (85ms vs 160-170ms at opset 19-21) is inconsistent with a simple binary gate. If the gate fires for opset > N, opset 22 should behave identically to opset 19. The actual mechanism is more complex. Additionally, ORT 1.24.x has kMaxSupportedOpset >= 23 confirmed (separate NHWC gate); the Transpose Optimizer gate threshold may differ but is unverified.", + "action_for_autoconfig": "For CPU EP: default to opset 17. The empirical data (2 models: ConvNext and DINOv2, consistent across opsets) is unambiguous: opset 17 is the best option. Do NOT try opset 19+. The mechanism reason is uncertain but the practical conclusion is solid.", + "confidence": "high on empirical observation (consistent data across opsets for 2 models: ConvNext and DINOv2). Low on mechanism: the gate hypothesis does not fully explain the non-monotonic opset 22 partial recovery.", + "falsified_by": null, + "scope": "Models with dense Transpose graphs (ConvNext + DINOv2 confirmed). ResNet-18 is NOT affected. MobileViT mildly affected. BERT/RoBERTa unknown (sweep in progress 2026-06-18).", + "ort_kMaxSupportedOpset_by_version": { + "note": "These values are for the NHWC layout_transformation gate, NOT the Transpose Optimizer gate. 
The two constants may differ within the same ORT release.", + "v1.14.x": 18, + "v1.16.x": 19, + "v1.17.x": 20, + "v1.18.x": 21, + "v1.24.x": ">= 23 (confirmed for NHWC gate; Transpose Optimizer gate unknown)", + "main_HEAD": 26 + }, + "do_not_generalize_to": "QNN NPU EP or DML EP — kMaxSupportedOpset is a CPU-only ORT optimizer gate. These EPs have their own kernel dispatch unaffected by this.", + "validated_regressions": [ + "facebook/convnext-tiny-224: opset19 3.7x, opset21 3.9x", + "facebook/dinov2-small: opset19 9.8x, opset21 9.7x (CPU001_REGRESSION)" + ], + "validated_neutral": [ + "microsoft/resnet-18: opset19 +2.4% neutral, opset21 +4.5% neutral", + "apple/mobilevit-small: opset19 -9.1%, opset21 -7.4% (mild, not catastrophic)" + ], + "pending": "BERT/RoBERTa/MiniLM (sweep in progress 2026-06-18 — expected: neutral based on few Transpose nodes)", + "last_updated": "2026-06-18" + }, + { + "id": "cpu-002", + "title": "matmul_add_fusion is a CONFIRMED REGRESSION on ConvNext CPU (+38ms, ~87%)", + "observation": "matmul_add_fusion: p50=81.7ms, runs=[63.0, 70.8, 111.2ms]. Baseline p50=43.7ms. All 3 runs far above highest baseline run (45.4ms).", + "mechanism_confirmed": false, + "mechanism_hypothesis": "ORT baseline already converts MatMul+Add→Gemm (37 Gemm in model.onnx). Applying matmul_add_fusion on top may create redundant kernel dispatch or conflicting operator mapping. 
Requires profiling to confirm.", + "action_for_autoconfig": "Do NOT apply matmul_add_fusion for CPU EP on models where baseline already uses Gemm (check model.onnx for Gemm nodes before applying this pass).", + "confidence": "high — 3 independent runs, all far above baseline; direction is unambiguous", + "falsified_by": null, + "scope": "ConvNext and models where ORT L2 baseline already fuses MatMul+Add→Gemm", + "do_not_generalize_to": "Models where baseline does NOT have Gemm (the pass may legitimately help there)" + }, + { + "id": "cpu-003", + "title": "transpose_optimizer is neutral on ConvNext CPU (NOT +270ms as previously reported)", + "observation": "winml perf (warmup=10, iter=50): 42.3 / 52.3 / 41.8ms — overlapping baseline. Earlier winml eval-based measurement showed +270ms — this was a measurement artifact.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "winml eval includes HF preprocessing + model load + no warmup. The +270ms was preprocessing overhead, not inference regression. Pure inference measurement (winml perf) shows no effect.", + "action_for_autoconfig": "transpose_optimizer is neutral for ConvNext CPU — neither helpful nor harmful. Can be omitted from search space.", + "confidence": "high — measurement methodology confirmed; tool comparison validated", + "falsified_by": "Earlier winml eval measurement — RETRACTED. Use winml perf for all latency comparisons.", + "scope": "ConvNext CPU", + "measurement_lesson": "Always use winml perf (warmup=10, iter=50) for latency experiments. Never use winml eval latency to compare configs." + }, + { + "id": "cpu-004", + "title": "nchwc_transformer is neutral on ConvNext CPU", + "observation": "nchwc: 43.4 / 48.0 / 44.7ms — overlapping baseline (42.5–45.4ms). No improvement.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "NCHWc SIMD layout benefits Conv-heavy models. ConvNext has 22 Conv nodes but 57.7% of kernel time is Gemm. 
The bottleneck is not memory layout but compute throughput — NCHWc doesn't help.", + "action_for_autoconfig": "nchwc_transformer is low-priority for ConvNext-class models. Profile first — if Conv% > 40%, try nchwc. If Gemm% > 50%, skip.", + "confidence": "medium — 3 runs, neutral result; mechanism is a hypothesis", + "falsified_by": null, + "scope": "ConvNext CPU (Gemm-dominated, not Conv-dominated)" + }, + { + "id": "cpu-005", + "title": "Baseline (no extra flags) is the optimal config for ConvNext CPU", + "observation": "No flag in 22-experiment ablation improved p50 beyond noise. Baseline p50=43.7ms is the floor.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "ORT L2 baseline already applies gelu_fusion and MatMul→Gemm before any user flags. The effective optimization space is narrow for ConvNext on CPU. Compute bottleneck (Gemm=57.7%) is not addressable via graph passes.", + "action_for_autoconfig": "For CPU EP on ConvNext-class models: skip optimization pass sweep. Go directly to quantization experiments.", + "confidence": "high — 22 experiments, no improvement found", + "falsified_by": null, + "scope": "ConvNext-class vision models on CPU", + "do_not_generalize_to": "BERT/Transformer models where attention_fusion + skip_layer_norm can significantly help" + }, + { + "id": "cpu-006", + "title": "CPU EP opset 21 is 3.9x SLOWER — opposite of QNN NPU behavior", + "observation": "CPU opset 21: p50=170ms. CPU opset 17: p50=43.7ms. QNN NPU opset 21 (DINOv2): p50=26ms (~24% FASTER than opset 17 at 34ms). Note: the NPU and CPU experiments used DIFFERENT models (CPU=ConvNext, NPU=DINOv2) — the comparison is directional only, not quantitative.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "CPU regression from Transpose Optimizer bypass (see cpu-001 — mechanism uncertain). QNN NPU speedup from unknown cause (original Transpose bypass hypothesis invalidated; Transpose counts identical in opset17/21 graphs). 
The key insight is that CPU and QNN NPU respond oppositely to opset changes, regardless of the root cause.", + "action_for_autoconfig": "EP ISOLATION: CPU opset findings MUST NOT influence QNN NPU search space, and vice versa. Always validate per EP independently.", + "confidence": "high on empirical observation. Low on mechanism for both directions.", + "falsified_by": null, + "scope": "ALL — this is a meta-rule about EP isolation, not model-specific" + }, + { + "id": "cpu-007", + "title": "matmul_transpose_fusion gives +92% speedup on ResNet-18 CPU EP (237ms -> 17.8ms)", + "confidence": "high — KEEP_CONFIRMED (all 5 sessions passed Phase C)", + "scope": "Conv-dominant models with MatMul+Transpose sequences. ResNet-18 confirmed. DINOv2 tested but ALL fusion flags regressed (cpu-001 interference). MobileViT partial.", + "observation": "catalog_cpu_sweep 2026-06-18: ResNet-18 h9 (opset17+matmul_transpose_fusion): median_p50=17.797ms vs baseline 237.472ms = +92.51% KEEP_CONFIRMED. Also: h12 (transpose_optimizer) +84.46% KEEP_CONFIRMED, h13 (gelu_fusion) +88.89% KEEP_CONFIRMED, h10 (bundle) +91.54% KEEP_CONFIRMED, h6 (layer_norm_fusion) +10.43% KEEP_CONFIRMED. All 5 phase-C sessions passed.", + "mechanism_hypothesis": "ResNet-18 on CPU at default config has 237ms latency (extremely slow for a tiny model). matmul_transpose_fusion folds MatMul+Transpose into a transposed GEMM call, enabling BLAS-level fused execution. ORT CPU provider has a highly optimized transposed-matmul path. The baseline 237ms suggests the default config exports with a suboptimal graph (possibly unfused MatMul+Transpose pairs that prevent BLAS dispatch).", + "mechanism_confirmed": false, + "baseline_note": "ResNet-18 baseline=237ms on CPU is extremely slow (17.8ms after optimization = 13x speedup). This suggests the default auto-config for ResNet-18 on CPU is severely suboptimal. 
The baseline uses auto-config which may not be correctly detecting the model architecture for CPU optimization.", + "affected_models": [ + "microsoft/resnet-18 (+92.51% KEEP_CONFIRMED)" + ], + "autoconfig_action": "For ResNet-18 class models on CPU: apply matmul_transpose_fusion (h9) + transpose_optimizer (h12) + gelu_fusion (h13) bundle. Test h10 bundle for single combined build.", + "added": "2026-06-18", + "source": "catalog_cpu_sweep.py h0-h13 sweep" + }, + { + "id": "cpu-008", + "title": "layer_norm_fusion causes catastrophic -997% regression on MobileViT CPU EP (73ms -> 803ms)", + "confidence": "high — 3-session consistent", + "scope": "CNN-ViT hybrid models where layer_norm_fusion mismatches the LN implementation. MobileViT confirmed. Pure transformer (BERT/ViT) expected safe.", + "observation": "catalog_cpu_sweep 2026-06-18: MobileViT h6 (opset17+layer_norm_fusion): median_p50=803.217ms vs baseline 73.166ms = -997.8% DISCARD. 3-session consistent. For comparison: bias_softmax_fusion (h7) = 64.137ms (+12.34% MARGINAL_UNCONFIRMED). layer_norm_fusion, skip_layer_norm_fusion, attention_fusion, matmul_transpose_fusion all severely regress MobileViT on CPU.", + "mechanism_hypothesis": "MobileViT uses a hybrid CNN-ViT architecture where LayerNorm is placed after Conv2D outputs. layer_norm_fusion expects pure transformer LN sequences (MLP-style). Fusing the wrong LN pattern creates a combined op that the CPU runtime cannot dispatch to an optimized kernel path, forcing fallback to element-wise operations.", + "mechanism_confirmed": false, + "affected_models": [ + "apple/mobilevit-small (-997% layer_norm, -165% matmul_transpose, -164% attention bundle)" + ], + "autoconfig_action": "Block layer_norm_fusion for CNN-ViT hybrid models. Also block matmul_transpose_fusion and attention_fusion for MobileViT-class models on CPU. 
analyze_insight.py should detect CNN-ViT hybrid architecture and skip these fusions.", + "added": "2026-06-18", + "source": "catalog_cpu_sweep.py h0-h13 sweep" + }, + { + "id": "cpu-009", + "title": "cpu-001 opset regression fires on DINOv2 pure-ViT: ~10x slowdown at opset19/21 on CPU EP", + "confidence": "high — CPU001_REGRESSION verdict confirmed (pattern matches ConvNext)", + "scope": "Pure-ViT models with dense Transpose graphs on CPU EP. DINOv2-small confirmed. BERT/NLP expected neutral (sparse Transpose). ResNet-18 confirmed neutral.", + "observation": "catalog_cpu_sweep 2026-06-18: DINOv2-small h2 (opset19): 1106ms vs baseline 112ms (-882% CPU001_REGRESSION). h3 (opset21): 1095ms (-873%). h4 attention_fusion: 1083ms (-862%). h7 bias_softmax_fusion: 1121ms (-896%). The baseline (auto-config, opset not forced) = 112ms. Any forced opset or attention-style fusion causes catastrophic regression. Also: h1 opset17-explicit = 762ms (-577%) — even forcing opset17 explicitly regresses DINOv2 vs auto-config baseline.", + "mechanism_note": "DINOv2 has 169 Reshape nodes in opset21 vs 121 in opset17. Dense Transpose (49 nodes). cpu-001 mechanism (Transpose Optimizer bypass) applies here as strongly as ConvNext. The auto-config baseline (h0) at 112ms is already the optimized path; ANY deviation from auto-config triggers regression.", + "autoconfig_action": "For DINOv2/ViT-class on CPU EP: use auto-config default opset ONLY. Do not force any opset. Do not apply attention_fusion or bias_softmax_fusion (all regress DINOv2 on CPU). CPU EP for DINOv2 is constrained to baseline config only.", + "added": "2026-06-18", + "source": "catalog_cpu_sweep.py h0-h13 sweep" + } + ], + "search_space_rules": { + "opset": { + "recommended_order": [ + 17 + ], + "skip": [ + "19, 20, 21, 22 — kMaxSupportedOpset regression (cpu-001). Only safe to try if ORT version's kMaxSupportedOpset >= target." + ], + "dialectical_note": "⚠️ This rule is ORT-version dependent. 
Check kMaxSupportedOpset for the shipping ORT build before skipping higher opsets." + }, + "quantization": { + "recommended": "w8a8 (CPU benefits most from small model size)", + "dialectical_note": "⚠️ W8A8 on CPU not yet validated for ConvNext. General guidance — run accuracy gate." + }, + "compile": { + "always_run": false, + "skip": true, + "dialectical_note": "⚠️ winml compile targets QNN EPContext. Not applicable to CPU EP." + }, + "graph_passes": { + "recommended": "autoconf defaults only", + "skip": [ + "matmul_add_fusion if model already has Gemm (cpu-002)", + "nchwc_transformer if Gemm% > 50% in profile (cpu-004)" + ], + "dialectical_note": "⚠️ Skip rules are Gemm-bottleneck specific. Conv-heavy models may still benefit from nchwc_transformer." + } + }, + "meta_lessons": { + "measurement_discipline": "Always use winml perf (warmup=10, iter=50) for latency. Never use winml eval latency. See cpu-003.", + "ep_isolation": "CPU findings (especially opset regression) DO NOT transfer to QNN NPU or DML. Each EP has its own optimizer path. See cpu-006.", + "baseline_check": "Before applying any fusion flag, check model.onnx for existing fused ops. If Gemm already present, matmul_add_fusion is likely a no-op or regression." + } +} diff --git a/research/autoconfig/ep_device_knowledge/dml_gpu.json b/research/autoconfig/ep_device_knowledge/dml_gpu.json new file mode 100644 index 000000000..21f3361f6 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/dml_gpu.json @@ -0,0 +1,271 @@ +{ + "_meta": { + "ep": "dml", + "device": "gpu", + "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / DirectML via D3D12)", + "ort_version": "1.x with onnxruntime-directml package", + "model": "facebook/convnext-tiny-224 (ALL findings from this model only)", + "last_updated": "2026-06-17", + "epistemics_warning": "⚠️ DML experiments required swapping onnxruntime-directml for onnxruntime (Python package conflict). 
Results reflect DML EP behavior via winml's DML DLL, not the Python onnxruntime-directml package directly. Re-validate if package setup changes." + }, + "sweep_config": { + "results_dir": "catalog-dml-sweep", + "quant": false, + "compile": false, + "screen": { + "warmup": 20, + "iters": 200, + "cv_max": 0.15, + "thermal_aware": false + }, + "full": { + "warmup": 20, + "iters": 300, + "sessions": 3, + "cool_down_s": 5 + }, + "confirm_sessions": 2, + "min_improvement_pct": 5.0, + "effect_size_gate": false, + "effect_size_cv_mult": 2.0, + "accuracy_eval": false, + "eval_samples": 50, + "paired_ab_available": false, + "baseline_priority": [ + "h0" + ], + "timeouts": { + "config_s": 300, + "build_s": 600, + "bench_s": 480, + "eval_s": 360, + "model_s": null + } + }, + "hypotheses": [ + { + "id": "h0", + "label": "baseline FP32 (auto-config, no compile)", + "opset": null, + "optim": null + }, + { + "id": "h1", + "label": "opset 17 explicit", + "opset": 17, + "optim": null + }, + { + "id": "h2", + "label": "opset 19", + "opset": 19, + "optim": null + }, + { + "id": "h3", + "label": "opset 21 (tests dml-005)", + "opset": 21, + "optim": null + }, + { + "id": "h4", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "optim": { + "transpose_optimizer": true + } + }, + { + "id": "h5", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "optim": { + "layer_norm_fusion": true + } + }, + { + "id": "h6", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "optim": { + "skip_layer_norm_fusion": true + } + }, + { + "id": "h7", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "optim": { + "matmul_transpose_fusion": true + } + }, + { + "id": "h8", + "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)", + "opset": null, + "optim": null, + "build_flags": [ + "--no-analyze" + ] + } + ], + "models": [ + { + "id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet" + }, + { + "id": 
"google/vit-base-patch16-224", + "task": "image-classification", + "model_type": "vit" + }, + { + "id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit" + }, + { + "id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "hustvl/yolos-small", + "task": "object-detection", + "model_type": "yolos" + }, + { + "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "task": "text-classification", + "model_type": "distilbert" + }, + { + "id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert" + }, + { + "id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "deepset/tinyroberta-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "model_type": "bert" + } + ], + "cross_checks": [ + { + "id": "dml-005", + "type": "opset_bypass", + "candidate": "h3", + "stress_ref": "h1", + "baseline_ref": "h0" + } + ], + "findings": [ + { + "id": "dml-001", + "title": "DML FP32 is more stable than QNN GPU FP32 — p50 difference is within noise", + "observation": "DML FP32: p50=16.9ms, p90=17.7ms, std=0.52. QNN GPU FP32: p50=17.7ms, p90=19.7ms, std=0.97. p50 diff = 0.8ms = 0.82σ of QNN GPU measurement — distributions OVERLAP. NOT a separable performance difference. DML is meaningfully more stable (std 0.52 vs 0.97, CV 3% vs 5.5%).", + "mechanism_confirmed": false, + "mechanism_hypothesis": "DML JIT-compiles HLSL shaders at model load time — shader compilation done once, producing stable execution. 
QNN GPU EP does graph partitioning at each session creation — more overhead and jitter.", + "action_for_autoconfig": "CORRECTED: Do NOT claim DML is faster than QNN GPU based on this data — the 0.8ms difference is within noise. DML IS more stable (lower CV). Prefer DML for lower tail latency (p90) and variance. p50 advantage is unconfirmed.", + "confidence": "low on p50 speedup (not statistically separable). Medium on stability advantage (std 0.52 vs 0.97 is real difference even if p50 overlaps).", + "falsified_by": "Statistical analysis: 0.8ms diff < 1σ of GPU measurement. Removed from 'DML is faster' claims.", + "scope": "Adreno X1-85, ConvNext-class models, 3-run comparison (insufficient for definitive p50 ranking)", + "do_not_generalize_to": "NVIDIA/Intel GPUs (QNN GPU not available there anyway)" + }, + { + "id": "dml-002", + "title": "NHWC transformer increases latency variance on DML — p50 is neutral or marginally better", + "observation": "DML NHWC: p50=16.5ms (-0.4ms vs baseline 16.9ms), p90=21.0ms (+19% vs baseline 17.7ms), std=1.89 (3.6x worse than FP32 baseline 0.52). NOTE: p50 is marginally BETTER with NHWC, not worse. The regression is in tail latency and variance.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "D3D12 on Adreno X1-85 handles tensor layouts internally via HLSL shaders. Adding explicit ORT NHWC Transposes does not improve memory alignment for DML but adds dispatch overhead that occasionally causes scheduling jitter, inflating p90 and std.", + "action_for_autoconfig": "Do NOT apply nhwc-transformer for DML EP if tail latency stability matters. p50 may be marginally better but p90 is 19% worse and std is 3.6x worse. For applications sensitive to worst-case latency, NHWC is harmful.", + "confidence": "low — single run comparison, different baselines (run_count unspecified). 
Direction for variance is clear; p50 benefit is marginal and unreliable.", + "falsified_by": null, + "scope": "Adreno X1-85 + DML, ConvNext", + "do_not_generalize_to": "NVIDIA GPUs (NHWC may help with CUDNN)" + }, + { + "id": "dml-003", + "title": "DML FP16 gives ~1.4x speedup with NO DVFS bimodal (unlike QNN GPU FP16)", + "observation": "DML FP16 (via Python hack, not official CLI): p50=11.8ms, p90=12.8ms, std=0.66. Clean unimodal distribution.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "DML HLSL shader compilation locks in FP16 compute paths at load time — no dynamic voltage/frequency switching surprises. QNN GPU FP16 showed DVFS bimodal distribution (some runs in high-power state, some in low-power state).", + "action_for_autoconfig": "FP16 is the primary optimization lever for DML. Unblock via #867 (--precision fp16 flag).", + "confidence": "low — experiment used Python hack (not official winml CLI). Mark as SKIPPED/CLI-gap until #867 ships.", + "falsified_by": null, + "scope": "Adreno X1-85 + DML", + "tracked_issue": "#867", + "cli_gap": true, + "cli_gap_note": "⚠️ This finding was produced via a Python workaround, not winml CLI. Cannot be reproduced with winml build today. Blocked on #867." + }, + { + "id": "dml-004", + "title": "winml analyze returns 0/0/0/251 (all Unknown) for DML EP — no rule data", + "observation": "winml analyze --ep dml outputs: supported=0, partial=0, unsupported=0, unknown=251.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "DML EP supports all standard ONNX ops by design (D3D12 universal op coverage). winml analyze has no DML-specific rule data file. This is a cosmetic gap — DML actually runs all ops natively.", + "action_for_autoconfig": "Do not use winml analyze output to prune search space for DML. 
Assume all ops supported.", + "confidence": "high — confirmed by DML running all 251 ops with no CPU fallback", + "falsified_by": null, + "scope": "DML EP (all models)", + "tracked_issue": "not filed — cosmetic gap, low priority" + }, + { + "id": "dml-005", + "title": "opset 21 on DML not yet validated", + "observation": "opset 21 sweep only run on QNN NPU. DML behavior with opset 21 is unknown.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "DML uses D3D12 dispatch — different from QNN EP kernel registry. opset 21 speedup on QNN NPU may not apply.", + "action_for_autoconfig": "Include opset 21 in DML search sweep. No prior data — must run experiment.", + "confidence": "low — no data", + "falsified_by": null, + "scope": "UNKNOWN — needs experiment" + } + ], + "search_space_rules": { + "opset": { + "recommended_order": [ + 17, + 21 + ], + "rationale": "dml-005: unknown. Include both in sweep.", + "dialectical_note": "⚠️ No data on DML + opset 21. Do not assume NPU behavior transfers." + }, + "quantization": { + "recommended": "fp16 (when #867 ships)", + "skip": [ + "w8a8", + "w8a16 — quantization rarely helps on GPU via DML" + ], + "dialectical_note": "⚠️ Quantization skip is based on general DML behavior. Some models with large weights may benefit from W8A16 even on DML. Test empirically." + }, + "compile": { + "always_run": false, + "skip": true, + "dialectical_note": "⚠️ DML uses HLSL, not QNN binary compilation. winml compile targets QNN EPContext only. Not applicable to DML." + }, + "graph_passes": { + "recommended": "autoconf defaults only", + "skip": [ + "nhwc-transformer (dml-002)" + ], + "dialectical_note": "⚠️ Same as QNN GPU: NHWC hurts on Adreno. NVIDIA/Intel may differ." 
+ } + } +} diff --git a/research/autoconfig/ep_device_knowledge/qnn_gpu.json b/research/autoconfig/ep_device_knowledge/qnn_gpu.json new file mode 100644 index 000000000..697390ce7 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/qnn_gpu.json @@ -0,0 +1,366 @@ +{ + "_meta": { + "ep": "qnn", + "device": "gpu", + "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / QNN GPU EP)", + "ort_version": "1.x (check winml version at experiment time)", + "qnn_sdk_version": "unknown (check QnnSystem.dll version)", + "model": "8 models (catalog sweep 2026-06-18)", + "last_updated": "2026-06-18", + "epistemics_warning": "⚠️ Earlier findings (gpu-001 to gpu-005) are hypotheses derived from 1 model on 1 device; gpu-006 and gpu-007 come from the 8-model catalog sweep, still on 1 device. Confidence levels reflect mechanism understanding, not universal applicability. GPU EP behavior varies significantly by model architecture and Adreno driver version.", + "models_tested": [ + "facebook/dinov2-small", + "microsoft/resnet-18", + "apple/mobilevit-small", + "deepset/roberta-base-squad2", + "deepset/tinyroberta-squad2", + "BAAI/bge-small-en-v1.5", + "sentence-transformers/all-MiniLM-L6-v2", + "microsoft/rad-dino" + ] + }, + "sweep_config": { + "results_dir": "catalog-gpu-sweep", + "quant": false, + "compile": false, + "screen": { + "warmup": 20, + "iters": 200, + "cv_max": 0.15, + "thermal_aware": false + }, + "full": { + "warmup": 20, + "iters": 300, + "sessions": 3, + "cool_down_s": 5 + }, + "confirm_sessions": 2, + "min_improvement_pct": 5.0, + "effect_size_gate": false, + "effect_size_cv_mult": 2.0, + "accuracy_eval": false, + "eval_samples": 50, + "paired_ab_available": false, + "baseline_priority": [ + "h0" + ], + "timeouts": { + "config_s": 300, + "build_s": 600, + "bench_s": 480, + "eval_s": 360, + "model_s": null + } + }, + "hypotheses": [ + { + "id": "h0", + "label": "baseline FP32 (no quant, no compile)", + "opset": null, + "optim": null + }, + { + "id": "h1", + "label": "opset 17 explicit", + "opset": 17, + "optim": null + }, + { + "id": "h2", + "label": "opset 19", + 
"opset": 19, + "optim": null + }, + { + "id": "h3", + "label": "opset 21 (tests gpu-006)", + "opset": 21, + "optim": null + }, + { + "id": "h4", + "label": "opset 17 + matmul_transpose_fusion", + "opset": 17, + "optim": { + "matmul_transpose_fusion": true + } + }, + { + "id": "h5", + "label": "opset 17 + attention_fusion", + "opset": 17, + "optim": { + "attention_fusion": true + } + }, + { + "id": "h6", + "label": "opset 17 + bias_softmax_fusion", + "opset": 17, + "optim": { + "bias_softmax_fusion": true + } + }, + { + "id": "h7", + "label": "opset 17 + layer_norm_fusion", + "opset": 17, + "optim": { + "layer_norm_fusion": true + } + }, + { + "id": "h8", + "label": "opset 17 + skip_layer_norm_fusion", + "opset": 17, + "optim": { + "skip_layer_norm_fusion": true + } + }, + { + "id": "h9", + "label": "opset 21 + matmul_transpose + attention_fusion", + "opset": 21, + "optim": { + "matmul_transpose_fusion": true, + "attention_fusion": true + } + }, + { + "id": "h10", + "label": "opset 17 + ln + skip_ln + matmul_transpose", + "opset": 17, + "optim": { + "layer_norm_fusion": true, + "skip_layer_norm_fusion": true, + "matmul_transpose_fusion": true + } + }, + { + "id": "h11", + "label": "opset 17 + gelu_fusion explicit", + "opset": 17, + "optim": { + "gelu_fusion": true + } + }, + { + "id": "h12", + "label": "opset 17 + transpose_optimizer", + "opset": 17, + "optim": { + "transpose_optimizer": true + } + }, + { + "id": "h13", + "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)", + "opset": null, + "optim": null, + "build_flags": [ + "--no-analyze" + ] + } + ], + "models": [ + { + "id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet" + }, + { + "id": "google/vit-base-patch16-224", + "task": "image-classification", + "model_type": "vit" + }, + { + "id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit" + }, + { + "id": "facebook/dinov2-small", + "task": 
"image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "hustvl/yolos-small", + "task": "object-detection", + "model_type": "yolos" + }, + { + "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "task": "text-classification", + "model_type": "distilbert" + }, + { + "id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert" + }, + { + "id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "deepset/tinyroberta-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "model_type": "bert" + } + ], + "cross_checks": [ + { + "id": "gpu-006", + "type": "opset_bypass", + "candidate": "h3", + "stress_ref": "h1", + "baseline_ref": "h0" + } + ], + "findings": [ + { + "id": "gpu-001", + "title": "FP32 baseline is already optimal for ConvNext on QNN GPU — no optimization pass helps", + "observation": "Full sweep of 11 passes/combinations on ConvNext QNN GPU: all returned 0% node reduction or worse latency. Baseline p50=17.7ms, p90=19.7ms, std=0.97.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "251/0/0/0 (all ops native on GPU, zero CPU fallback). ConvNext linear layers use Reshape→MatMul→Reshape, not bare MatMul+Add — so MatMulAdd→Conv2D rewrites don't match. autoconf (gelu_fusion + matmul_add_fusion) already applied all applicable transforms.", + "action_for_autoconfig": "Skip all graph optimization experiments for QNN GPU on ConvNext-class models. 
Use FP32 baseline directly.", + "confidence": "high — confirmed by 0% node delta on all rewrites + 251/0/0/0 analyze output", + "falsified_by": null, + "scope": "ConvNext-class models (Reshape→MatMul→Reshape pattern)", + "do_not_generalize_to": "Transformer models with bare MatMul+Add (those may benefit from rewrites)" + }, + { + "id": "gpu-002", + "title": "NHWC transformer hurts QNN GPU on Adreno X1-85 (~10% worse)", + "observation": "NHWC transformer: p50=19.5ms (+10%), p90=23.8ms (+21%), std=3.43 (3.5x worse). Consistent across multiple runs.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "Adreno X1-85 + QNN GPU EP does not benefit from explicit NHWC layout transforms. QNN GPU EP handles layout internally; forcing NHWC via ORT creates additional Reshape overhead without the memory alignment benefit.", + "action_for_autoconfig": "Do NOT apply nhwc-transformer for QNN GPU EP.", + "confidence": "medium — observed consistently; mechanism hypothesis, not confirmed", + "falsified_by": null, + "scope": "Adreno X1-85 + QNN GPU EP", + "do_not_generalize_to": "Non-Adreno GPUs (NVIDIA, Intel Arc) — NHWC may help there" + }, + { + "id": "gpu-003", + "title": "winml compile appears to hurt QNN GPU (~34% regression) — SINGLE EXPERIMENT, LOW CONFIDENCE", + "observation": "FP32 + compile: p50=23.7ms vs baseline 17.7ms (+34%). Single experiment only.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "QNN GPU EP compile (EPContext) is designed for NPU (HTP). On GPU EP, the compilation path may force a different dispatch mode that bypasses the optimized GPU shader path. QNN SDK likely has a GPU-specific compilation flow that winml compile doesn't trigger correctly.", + "action_for_autoconfig": "AVOID winml compile for QNN GPU EP. Direction (regression) is consistent with mechanism hypothesis and 34% is a large signal, but this is a single experiment. Until replicated, treat as likely harmful but not confirmed.", + "confidence": "low — single experiment. 
34% gap is above DVFS noise level (CV ~0.05 → noise ~1ms, gap is 6ms). Direction probably real but magnitude uncertain.", + "falsified_by": null, + "scope": "QNN GPU EP", + "do_not_generalize_to": "QNN NPU EP (compile always helps NPU)" + }, + { + "id": "gpu-004", + "title": "W8A8 QDQ hangs indefinitely on QNN GPU EP", + "observation": "Passing a W8A8 QDQ-annotated ONNX to QNN GPU EP causes infinite hang. winml build's _patch_device() sets quant=null for GPU, preventing this in normal user path.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "QNN SDK's GPU EP does not support QDQ-quantized graphs. This is a known QNN SDK limitation. winml build already protects against this via _patch_device().", + "action_for_autoconfig": "Skip ALL quantization experiments for QNN GPU EP. Do not even attempt W8A8 or W8A16.", + "confidence": "high — hang confirmed; protection mechanism in _patch_device() confirmed by code inspection", + "falsified_by": null, + "scope": "QNN GPU EP (QNN SDK limitation)", + "tracked_issue": "#868 (fast-fail enhancement)" + }, + { + "id": "gpu-005", + "title": "gelu_fusion improves latency STABILITY (p90/std) on QNN GPU, not p50", + "observation": "Raw export (287 nodes, unfused Gelu): p50=17.4ms, p90=29.2ms, std=5.90. Autoconf (251 nodes, fused Gelu): p50=17.7ms, p90=19.7ms, std=0.97. p50 nearly identical, p90 -48%, std -6x.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "5 separate GPU kernel dispatches (Mul→Div→Erf→Mul→Add) for unfused GELU create scheduling jitter. Single Gelu kernel eliminates dispatch overhead → dramatically lower tail latency.", + "action_for_autoconfig": "Always apply gelu_fusion for QNN GPU (stability benefit). 
Do not expect p50 improvement.", + "confidence": "high — mechanism is well-understood (GPU kernel dispatch overhead)", + "falsified_by": null, + "scope": "Any model with GELU activations on QNN GPU" + }, + { + "id": "gpu-006", + "title": "opset 21 on QNN GPU is neutral-to-negative — CONFIRMED across 7 models", + "observation": "catalog_gpu_sweep.py full sweep 2026-06-18 (8 models, 13 hypotheses, 3x300 iters + Phase C confirmation): opset21 gains: DINOv2-small +1.22% (MARGINAL), ResNet-18 +3.27% (MARGINAL), MobileViT -3.42% (DISCARD), roberta-squad2 -1.14% (DISCARD), tinyroberta -2.68% (DISCARD), rad-dino -2.63% (DISCARD), bge-small +0.16% (DISCARD). Range: -5.42% to +3.27%. No model shows meaningful opset21 gain on GPU. Opposite of QNN NPU behavior (DINOv2 +30.6% on NPU).", + "mechanism_confirmed": true, + "mechanism_hypothesis": "QNN GPU EP does not have architecture-specific optimizations that benefit from opset21 graph differences (unlike NPU which shows DINOv2-specific speedup). GPU shader compilation is independent of ONNX opset semantics.", + "action_for_autoconfig": "Do NOT try opset 19 or opset 21 for QNN GPU EP. Default to opset 17. Rule is now confirmed across 7 models.", + "confidence": "high — confirmed across 7 diverse architectures", + "falsified_by": null, + "scope": "UNKNOWN — needs experiment", + "last_updated": "2026-06-18" + }, + { + "id": "gpu-007", + "title": "transpose_optimizer gives +8-17% on Conv-dominant and ViT models on QNN GPU — KEEP_CONFIRMED", + "confidence": "high", + "scope": "Conv-dominant (ResNet) and ViT-class (DINOv2) models on QNN GPU. Likely architecture-general for models with Transpose-heavy graphs.", + "observation": "catalog_gpu_sweep.py sweep 2026-06-18: h12 (transpose_optimizer) KEEP_CONFIRMED. DINOv2-small: p50 26.372ms -> 21.977ms = +16.67% (all 5 sessions passed, Phase C confirmed). ResNet-18: p50 6.823ms -> 6.251ms = +8.38% (MARGINAL_UNCONFIRMED — Phase C did not confirm, needs more sessions). 
NLP models: neutral or BUILD_FAIL. rad-dino: +1.33% (MARGINAL). gelu_fusion explicit (h11) also KEEP_CONFIRMED on DINOv2: +13.86%.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "transpose_optimizer eliminates redundant Transpose(NCHW->NHWC->NCHW) pairs around Conv/pooling in the graph. QNN GPU EP benefits from fewer Transpose ops because each requires a memory layout pass on Adreno. For DINOv2 and ResNet, the optimizer removes enough Transposes to provide meaningful latency reduction.", + "affected_models": [ + "facebook/dinov2-small (+16.67% KEEP_CONFIRMED)", + "microsoft/resnet-18 (+8.38% MARGINAL_UNCONFIRMED)" + ], + "no_benefit_models": [ + "NLP models — most failed to build with transpose_optimizer; likely due to IR version incompatibility" + ], + "autoconfig_action": "Apply transpose_optimizer as default for QNN GPU EP on Conv+ViT models. AVOID for NLP models until BUILD_FAIL issue is resolved. Feature gap: diagnose why h12 causes BUILD_FAIL on BERT/RoBERTa models.", + "added": "2026-06-18", + "source": "catalog_gpu_sweep.py h0-h12 full sweep" + }, + { + "id": "gpu-008", + "title": "highdimRTR_lowdimRTR causes -6.9% regression on MobileViT QNN GPU — same root cause as npu-010", + "confidence": "high", + "scope": "Models with Gemm->Reshape->Transpose hybrid unfold patterns (MobileViT). DINOv2 was not tested with highdimRTR on GPU separately.", + "observation": "catalog_gpu_sweep.py 2026-06-18: MobileViT h9 (opset21+matmul_transpose+attention_fusion bundle) p50=19.224ms vs baseline 17.985ms = -6.89% (DISCARD). Root cause analysis via ONNX diff on NPU version shows +36 extra Reshape nodes (same issue as npu-010). GPU regression is less severe than NPU (-6.9% vs -19%) due to lower DMA sensitivity on Adreno vs Hexagon HTP.", + "mechanism_confirmed": true, + "mechanism_detail": "Same as npu-010: highdimRTR inserts spurious Reshape pairs after Gemm in MobileViT hybrid unfold mechanism. Breaks Gemm+Reshape dispatch merging. 
Less severe on GPU than NPU.", + "cross_ep_note": "npu-010 and gpu-008 share the same root cause. Fix is the same: block highdimRTR for Gemm->Reshape->Transpose models.", + "autoconfig_action": "Same as npu-010: hard-block highdimRTR for models with Gemm->Reshape->Transpose patterns. analyze_insight.py skip_set hint required.", + "added": "2026-06-18", + "source": "catalog_gpu_sweep.py h0-h12 full sweep + npu-010 ONNX diff" + } + ], + "search_space_rules": { + "opset": { + "recommended_order": [ + 17 + ], + "rationale": "gpu-006 CONFIRMED: opset 21 neutral-to-negative across 7 models. Stay at opset 17.", + "dialectical_note": "⚠️ Based on the 7 models in gpu-006. Re-validate before applying to new architectures." + }, + "quantization": { + "recommended": "skip", + "skip": [ + "all — QDQ hangs on GPU EP (gpu-004)" + ], + "dialectical_note": "⚠️ This is a QNN SDK limitation, not winml. May change with future QNN SDK versions that support GPU quantization." + }, + "compile": { + "always_run": false, + "skip": true, + "dialectical_note": "⚠️ gpu-003: compile appears to regress QNN GPU. Observed in a single experiment only (low confidence, not replicated). Re-validate if winml compile behavior changes." + }, + "graph_passes": { + "recommended": "autoconf defaults + transpose_optimizer for Conv/ViT models", + "skip": [ + "nhwc-transformer (gpu-002)", + "highdimRTR (gpu-008)" + ], + "dialectical_note": "⚠️ Skip rules are ConvNext-specific. Transformer models may benefit from attention_fusion etc." 
+ } + } +} diff --git a/research/autoconfig/ep_device_knowledge/qnn_npu.json b/research/autoconfig/ep_device_knowledge/qnn_npu.json new file mode 100644 index 000000000..4d945aed2 --- /dev/null +++ b/research/autoconfig/ep_device_knowledge/qnn_npu.json @@ -0,0 +1,729 @@ +{ + "_meta": { + "ep": "qnn", + "device": "npu", + "hardware": "Snapdragon X Elite CRD (Adreno X1-85 / Hexagon HTP)", + "ort_version": "1.24.5 (onnxruntime-windowsml; confirmed kMaxSupportedOpset >= 23)", + "qnn_sdk_version": "unknown — check QnnSystem.dll version", + "models_tested": [ + "facebook/convnext-tiny-224", + "microsoft/resnet-18", + "google/vit-base-patch16-224", + "apple/mobilevit-small", + "facebook/dinov2-small", + "hustvl/yolos-small", + "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "sentence-transformers/all-MiniLM-L6-v2", + "deepset/roberta-base-squad2", + "deepset/tinyroberta-squad2", + "facebook/dinov2-base", + "microsoft/rad-dino", + "facebook/dino-vitb16", + "BAAI/bge-small-en-v1.5", + "rizvandwiki/gender-classification" + ], + "last_updated": "2026-06-22", + "epistemics_warning": "⚠️ All findings are hypotheses derived from limited models on 1 device (Snapdragon X Elite). Confidence levels reflect how well the mechanism is understood, not how universally applicable the finding is. ALWAYS re-validate on new model architectures before using to prune search space." 
+ }, + "sweep_config": { + "results_dir": "catalog-qnn-sweep", + "quant": "auto", + "compile": false, + "screen": { + "warmup": 20, + "iters": 200, + "cv_max": 0.15, + "thermal_aware": true + }, + "full": { + "warmup": 50, + "iters": 500, + "sessions": 3, + "cool_down_s": 30 + }, + "confirm_sessions": 2, + "min_improvement_pct": 5.0, + "effect_size_gate": true, + "effect_size_cv_mult": 2.0, + "accuracy_eval": true, + "eval_samples": 50, + "paired_ab_available": true, + "baseline_priority": [ + "h0", + "h1" + ], + "timeouts": { + "config_s": 240, + "build_s": 900, + "bench_s": 720, + "eval_s": 360, + "model_s": 14400 + } + }, + "hypotheses": [ + { + "id": "h0", + "label": "baseline (auto-config, W8A16)", + "opset": null, + "optim": null + }, + { + "id": "h1", + "label": "no optimization (analyzer auto-optimization disabled, --no-analyze)", + "opset": null, + "optim": null, + "build_flags": [ + "--no-analyze" + ] + }, + { + "id": "h2", + "label": "opset 19", + "opset": 19, + "optim": null + }, + { + "id": "h3", + "label": "opset 21 (tests npu-001 bypass)", + "opset": 21, + "optim": null + }, + { + "id": "h4", + "label": "opset 17 + conv fusions", + "opset": 17, + "optim": { + "conv_bn_fusion": true, + "conv_add_fusion": true, + "conv_activation_fusion": true + }, + "guard": { + "type": "conv_pct_regression", + "finding": "npu-006", + "threshold_pct": 20.0 + } + }, + { + "id": "h5", + "label": "opset 21 + conv fusions", + "opset": 21, + "optim": { + "conv_bn_fusion": true, + "conv_add_fusion": true, + "conv_activation_fusion": true + }, + "guard": { + "type": "conv_pct_regression", + "finding": "npu-006", + "threshold_pct": 20.0 + } + }, + { + "id": "h6", + "label": "opset 21 + matmul_transpose_fusion", + "opset": 21, + "optim": { + "matmul_transpose_fusion": true + } + }, + { + "id": "h7", + "label": "opset 21 + bias_softmax_fusion", + "opset": 21, + "optim": { + "bias_softmax_fusion": true + } + }, + { + "id": "h8", + "label": "opset 21 + attention_fusion", + 
"opset": 21, + "optim": { + "attention_fusion": true + } + }, + { + "id": "h9", + "label": "opset 21 + highdimRTR_lowdimRTR", + "opset": 21, + "optim": { + "highdimRTR_lowdimRTR": true + } + }, + { + "id": "h10", + "label": "opset 17 + conv_add_fusion only", + "opset": 17, + "optim": { + "conv_add_fusion": true + } + } + ], + "models": [ + { + "id": "microsoft/resnet-18", + "task": "image-classification", + "model_type": "resnet" + }, + { + "id": "google/vit-base-patch16-224", + "task": "image-classification", + "model_type": "vit" + }, + { + "id": "apple/mobilevit-small", + "task": "image-classification", + "model_type": "mobilevit" + }, + { + "id": "facebook/dinov2-small", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "hustvl/yolos-small", + "task": "object-detection", + "model_type": "yolos" + }, + { + "id": "distilbert/distilbert-base-uncased-finetuned-sst-2-english", + "task": "text-classification", + "model_type": "distilbert" + }, + { + "id": "sentence-transformers/all-MiniLM-L6-v2", + "task": "sentence-similarity", + "model_type": "bert" + }, + { + "id": "deepset/roberta-base-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "microsoft/rad-dino", + "task": "image-feature-extraction", + "model_type": "dinov2" + }, + { + "id": "deepset/tinyroberta-squad2", + "task": "question-answering", + "model_type": "roberta" + }, + { + "id": "BAAI/bge-small-en-v1.5", + "task": "sentence-similarity", + "model_type": "bert" + } + ], + "cross_checks": [ + { + "id": "npu-001", + "type": "opset_bypass", + "candidate": "h3", + "stress_ref": "h1", + "baseline_ref": "h0" + }, + { + "id": "npu-006", + "type": "catastrophic_regression", + "hypotheses": [ + "h4", + "h5" + ], + "ratio_threshold": 5.0 + } + ], + "findings": [ + { + "id": "npu-001", + "title": "opset 21 export gives +24-31% speedup on DINOv2-family models on QNN NPU — mechanism UNKNOWN, NOT a general ViT property, MobileViT benefit NOT reproduced on 
clean rerun", + "observation": "Catalog sweep 2026-06-13 + validation sweep 2026-06-16 (ORT 1.24.5, W8A16 quantized.onnx, 3×500-iter sessions): DINOv2-small +30.6% (opset17 7.18ms → opset21 4.98ms). DINOv2-base +24.1% (opset17 34.56ms → opset21 26.23ms). CRITICAL CONTROL: dino-vitb16 (plain DINO ViT-B/16) -0.7% — NEUTRAL. rad-dino (ViT-L medical) -0.1% — CPU-bound, no NPU effect. ViT-base: -7.4%. BERT/RoBERTa/DistilBERT: neutral. MobileViT-small: REVISED — the original +26.5% (2026-06-13) was on an inflated ~12ms baseline. Clean from-scratch 11-hypothesis rerun 2026-06-22 (fresh winml config+build, 3×500-iter) gave baseline (h0 opset17) median 5.51ms and h3 opset21 median 5.355ms = +2.81% with FULLY OVERLAPPING session ranges (h0=[4.98,5.51,5.72] vs h3=[5.36,5.26,5.90]) → npu-001 NEUTRAL on MobileViT. h6 (opset21+matmul_transpose), previously cited as +42.1%, was 6.218ms = SLOWER than baseline. The earlier MobileViT speedup was a thermal/DVFS artifact of a slow baseline, not an opset21 effect.", + "mechanism_confirmed": false, + "mechanism_invalidation": "Original hypothesis: kMaxSupportedOpset < 21 gate causes NHWC bypass on older ORT. INVALIDATED: sweep used onnxruntime-windowsml==1.24.5 where kMaxSupportedOpset >= 22. Both opset 17 and opset 21 go through the same NHWC layout transform path on this ORT version. The bypass mechanism does NOT apply. The observed speedup is real but the cause is unknown.", + "mechanism_status": "ORIGINAL_MECHANISM_INVALIDATED — must re-investigate", + "mechanism_source": "ORT source code investigation (2026-06-10) for ORT < 1.18. Sweep used onnxruntime-windowsml==1.24.5 where this mechanism no longer applies.", + "ort_version_critical_note": "The original mechanism (kMaxSupportedOpset gate in IsSupportedOpset()) requires kMaxSupportedOpset < 21. onnxruntime-windowsml==1.24.5 (ORT 1.24.x) has kMaxSupportedOpset >= 22, so BOTH opset17 and opset21 go through the NHWC layout transform. 
The bypass mechanism does NOT apply to the ORT version used in the sweep. The observed speedup for DINOv2 has an UNKNOWN root cause; the MobileViT speedup did not reproduce on the clean 2026-06-22 rerun.", + "architecture_requirement": [ + "empirically: DINOv2 family (facebook/dinov2-*) consistently benefits. Plain ViT (dino-vitb16) does NOT. Hybrid Conv+attention (MobileViT) showed an apparent speedup in original data but did NOT reproduce on clean rerun (neutral). Pure Conv (ResNet) insufficient data. NLP: neutral." + ], + "critical_caveats": [ + "MECHANISM UNKNOWN: Transpose count is IDENTICAL in opset17 and opset21 (both 49 nodes on dinov2-small). The original Transpose-elimination hypothesis is RULED OUT. The +48 Reshape nodes in opset21 are the most observable structural difference but why this speeds up QNN NPU is not understood.", + "RESNET-18 EXCLUDED: apparent +20% is statistical noise — 3 sessions span 4x range at sub-ms latency. Need 3 sessions × 2000 iters for reliable data at this scale.", + "DVFS NOISE: always use 3 sessions × 500+ iters with cool-down. Single-session CV is meaningless on QNN NPU.", + "SCOPE IS DINOV2-FAMILY NOT GENERAL VIT: dino-vitb16 (same ViT-B size as dinov2-base) shows -0.7% NEUTRAL. The speedup is DINOv2-architecture-specific." + ], + "validated_models": { + "benefits_from_opset21": [ + "facebook/dinov2-small (+30.6%, original catalog sweep 2026-06-13, 3-session)", + "facebook/dinov2-base (+24.1%, validation sweep 2026-06-16, fresh quantized.onnx builds, 3-session h1=[34.56,34.67,33.15]ms h3=[33.00,26.22,26.23]ms)" + ], + "no_benefit_neutral": [ + "apple/mobilevit-small: REVISED to NEUTRAL. Original +42.1% (h6) / +26.5% (h3) was measured against an inflated ~12ms baseline. Clean from-scratch rerun 2026-06-22 (3×500-iter): baseline h0 opset17 5.51ms, h3 opset21 5.355ms = +2.81% with overlapping session ranges; h6 (opset21+matmul_transpose) 6.218ms = SLOWER. 
The earlier 'win' was a DVFS/thermal baseline artifact.", + "facebook/dino-vitb16 (-0.7%, validation sweep 2026-06-16, h1=[19.92,19.97,19.90]ms h3=[20.20,20.07,19.99]ms — NEUTRAL, critical control)", + "google/vit-base-patch16-224 (-7.4%, original catalog)", + "hustvl/yolos-small (timeout, no data)", + "rizvandwiki/gender-classification (+3.5% apparent, ranges overlap 13.89/13.92ms, NEUTRAL — plain ViT, CRITICAL: near-identical op counts to DINOv2-small (49 Transpose, 121 Reshape) yet NO benefit)", + "distilbert/distilbert-base-uncased-finetuned-sst-2-english (-0.1%, NLP neutral)", + "sentence-transformers/all-MiniLM-L6-v2 (-0.7%, NLP neutral)", + "deepset/roberta-base-squad2 (+0.1%, NLP neutral)" + ], + "marginal_inconclusive": [ + "BAAI/bge-small-en-v1.5 (+7.3%, h0=[10.52,10.32,11.01]ms h3=[10.25,9.33,9.94]ms — ranges barely non-overlapping but CV=0.3; NOT CONFIRMED. Needs 5+ sessions to differentiate from noise. Unusual for BERT architecture; all other NLP models tested at <1%)" + ], + "not_benchmarked_predicted_neutral": [ + "openai/clip-vit-base-patch32 — build failed at quantization (feature-extraction task calibration not supported); pure transformer, expected neutral based on all NLP data", + "cardiffnlp/twitter-roberta-base-sentiment-latest — not run; RoBERTa architecture, predicted neutral (consistent with roberta-base-squad2 +0.1%)", + "distilbert/distilbert-base-cased-distilled-squad — not run; DistilBERT architecture, predicted neutral (consistent with distilbert-base-uncased -0.1%)" + ], + "cpu_bound_cannot_test": [ + "microsoft/rad-dino (-0.1% on CPU EP, all hypotheses ~275ms CV<0.022 — model runs on CPU, opset irrelevant; QNN NPU BUILD_FAIL 2026-06-17, see npu-008)" + ], + "data_unreliable": [ + "resnet-18 — sub-ms latency, 3-session range spans 4x; no reliable signal (see data_reliability_notes)" + ] + }, + "original_mechanism_explanation": { + "root_cause_for_old_ort": "kMaxSupportedOpset gate in IsSupportedOpset() 
(onnxruntime/core/optimizer/layout_transformation/layout_transformation.cc). On ORT where kMaxSupportedOpset < 21, opset 21 models bypass the NCHW→NHWC layout transformer entirely.", + "why_bypass_helped_convnext": "NHWC layout transform inserts Transpose(NCHW→NHWC) around Conv. For ConvNext, residual connections prevent Transpose cancellation → opset17 graph has MORE Transposes on HTP than opset21 graph.", + "why_cpu_is_opposite": "CPU relies on TransposeOptimizer to REMOVE existing Transposes. Skipping the optimizer (opset > kMaxSupportedOpset) leaves Transposes in place → CPU SLOWER. Same gate, opposite effect.", + "ort_kMaxSupportedOpset_by_version": { + "v1.14.x": 18, + "v1.16.x": 19, + "v1.17.x": 20, + "v1.18.x": 21, + "v1.24.x": ">= 23 (CONFIRMED: ORT 1.24.4 in C:\\tmp\\autoconfig-demo accepts opset 22 and 23 via InferenceSession with CPUExecutionProvider; opset 24 fails with 'No op registered for ...' not 'Unsupported opset')", + "main_HEAD": 26 + }, + "key_files": [ + "onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc:2724-2746 — MakeOptimizerContext() gate", + "onnxruntime/core/optimizer/layout_transformation/layout_transformation.cc — IsSupportedOpset()", + "onnxruntime/core/session/inference_session.cc:1589-1626 — transform_layout_fn=nullptr path" + ] + }, + "transpose_analysis_2026_06_16": { + "method": "onnx.load() on winml-built optimized.onnx and quantized.onnx for h0 (opset17) and h3 (opset21) from catalog_qnn_sweep facebook--dinov2-small. 
Op counts via collections.Counter on graph.node.", + "opset17_optimized": { + "total_nodes": 391, + "Transpose": 49, + "Reshape": 121, + "Gemm": 72, + "Mul": 48, + "Conv": 1 + }, + "opset21_optimized": { + "total_nodes": 439, + "Transpose": 49, + "Reshape": 169, + "Gemm": 72, + "Mul": 48, + "Conv": 1 + }, + "opset17_quantized": { + "total_nodes": 1398, + "Transpose": 49, + "Reshape": 121, + "DequantizeLinear": 615, + "QuantizeLinear": 392 + }, + "opset21_quantized": { + "total_nodes": 1542, + "Transpose": 49, + "Reshape": 169, + "DequantizeLinear": 663, + "QuantizeLinear": 440 + }, + "key_finding": "Transpose count is IDENTICAL (49 nodes) in both opset17 and opset21. The NHWC Transpose-reduction hypothesis is RULED OUT. opset21 has MORE Reshape nodes (+48), more QDQ pairs (+48 DQ, +48 Q), and more total nodes. Despite more nodes, opset21 runs 30% faster on QNN NPU — mechanism still unknown.", + "rules_out": [ + "NHWC Transpose elimination as speedup cause", + "Fewer total ops as explanation" + ], + "consistent_with": [ + "Different graph structure at opset21 enabling better QNN NPU internal scheduling or graph partitioning, possibly via the +48 Reshape nodes acting as data-layout hints or memory access pattern changes" + ] + }, + "alternative_mechanism_hypotheses": [ + "QNN EP graph partitioner assigns ops differently when the model has opset21 Reshape semantics — the +48 Reshape nodes may segment the graph into better-aligned HTP subgraphs", + "Quantization calibration path differs between opset exports → quantized.onnx has different scale/zero-point distributions at opset21 → better QNN NPU numeric alignment", + "PyTorch ONNX exporter produces different intermediate tensor shapes at opset 21 → better memory access locality on QNN NPU HBM", + "The +48 Reshape ops in opset21 are 'free' no-ops on QNN NPU (identity reshape with same shape) that happen to trigger a faster QNN internal code path" + ], + "data_reliability_notes": { + "dinov2_small": { + 
"h1_opset17_sessions_ms": [ + 7.176, + 6.392, + 9.436 + ], + "h3_opset21_sessions_ms": [ + 4.977, + 4.876, + 6.884 + ], + "assessment": "RELIABLE. Ranges barely overlap only at extremes. h3 sessions 1+2 (4.97/4.88ms) are well below entire h1 range. Speedup is real.", + "tool": "catalog_qnn_sweep.py, optimized.onnx (v1 pipeline)" + }, + "dinov2_base_v3": { + "h1_opset17_sessions_ms": [ + 34.556, + 34.668, + 33.148 + ], + "h3_opset21_sessions_ms": [ + 33.001, + 26.224, + 26.227 + ], + "assessment": "RELIABLE. h1 sessions fully consistent (~34ms). h3 s0 slightly elevated (JIT warmup) but s1+s2 consistent at 26.2ms. Speedup +24.1% is well-separated from noise.", + "tool": "validation_sweep.py v3, quantized.onnx W8A16 (fresh builds for both hyps)" + }, + "dino_vitb16": { + "h1_opset17_sessions_ms": [ + 19.924, + 19.975, + 19.897 + ], + "h3_opset21_sessions_ms": [ + 20.197, + 20.071, + 19.988 + ], + "assessment": "RELIABLE CONTROL. Extremely stable. +0.7% regression (within noise). Opset21 has NO EFFECT on plain DINO ViT-B/16. Critical discriminant: npu-001 speedup is NOT a general ViT property.", + "tool": "validation_sweep.py, quantized.onnx W8A16 (fresh builds)" + }, + "mobilevit_small": { + "h1_opset17_sessions_ms_ORIGINAL_2026_06_13": [ + 10.557, + 11.721, + 27.436 + ], + "h3_opset21_sessions_ms_ORIGINAL_2026_06_13": [ + 10.814, + 8.625, + 8.449 + ], + "clean_rerun_2026_06_22": { + "h0_opset17_sessions_ms": [ + 4.98, + 5.51, + 5.72 + ], + "h3_opset21_sessions_ms": [ + 5.36, + 5.26, + 5.9 + ], + "h6_opset21_matmul_transpose_p50_ms": 6.218, + "verdict": "NEUTRAL_WITHIN_NOISE (+2.81%, ranges overlap)" + }, + "assessment": "REVISED to UNRELIABLE/NEUTRAL. The original h1 (opset17) median ~11.7ms was inflated by a 27.4ms DVFS spike, making opset21 look ~20-26% faster. 
A clean from-scratch 11-hypothesis rerun 2026-06-22 (fresh winml config+build, 3×500-iter) measured a true baseline median of 5.51ms; h3 opset21 = 5.355ms = +2.81% with FULLY OVERLAPPING session ranges → effect-size gate verdict NEUTRAL_WITHIN_NOISE. h6 (opset21+matmul_transpose, previously cited +42.1%) = 6.218ms = SLOWER. The original 'speedup' was a polluted-baseline / DVFS artifact, not an opset21 effect." + }, + "resnet_18": { + "h1_opset17_sessions_ms": [ + 0.99, + 4.003, + 2.716 + ], + "h3_opset21_sessions_ms": [ + 1.054, + 2.175, + 4.107 + ], + "assessment": "UNRELIABLE. Sub-ms model. Session range spans 4x for same config. Reported '+20.2% speedup' (h1 median 2.72ms vs h3 median 2.18ms) is NOT a real signal — the two distributions fully overlap. REMOVED from benefits list." + }, + "gender_classification_vit": { + "h0_opset17_sessions_ms": [ + 14.15, + 14.94, + 13.89 + ], + "h3_opset21_sessions_ms": [ + 13.7, + 13.92, + 13.87 + ], + "assessment": "NEUTRAL. Ranges barely not overlapping (h0 min=13.89ms, h3 max=13.92ms). +3.5% is within DVFS noise (CV ~0.35). CRITICAL: this ViT model has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape, ~72 Gemm) yet shows NO benefit. Confirms npu-001 is not explainable by op-count or general ViT architecture.", + "tool": "run_one.py 2026-06-17, quantized.onnx W8A16" + }, + "bge_small_en": { + "h0_opset17_sessions_ms": [ + 10.52, + 10.32, + 11.01 + ], + "h3_opset21_sessions_ms": [ + 10.25, + 9.33, + 9.94 + ], + "assessment": "MARGINAL / INCONCLUSIVE. Ranges barely not overlapping but CV ~0.3 means high within-session variance. +7.3% apparent gain — larger than all other NLP models (distilbert -0.1%, MiniLM -0.7%, RoBERTa +0.1%) but may be DVFS noise. Needs 5+ sessions to confirm. Do NOT cite as benefit.", + "tool": "run_one.py 2026-06-17, quantized.onnx W8A16, bert model-type" + } + }, + "action_for_autoconfig": "Include opset 21 in search for DINOv2-family models (facebook/dinov2-*). 
Do NOT assume it helps MobileViT-class Conv+attention hybrids — the original MobileViT win did NOT reproduce on a clean rerun (neutral, +2.81% within noise). Do NOT apply to plain ViT (dino-vitb16, gender-classification both neutral), YOLOS, or NLP (BERT-family all neutral at ±0.7%). CRITICAL: gender-classification ViT has IDENTICAL op counts to DINOv2-small (49 Transpose, 121 Reshape) but shows NO benefit — the effect is deeper than op counts. For ResNet-class Conv-only: insufficient data. ALWAYS dump optimized graph to compare Transpose counts if speedup is unexpected, and ALWAYS clear the effect-size gate (gain >= 2×session-CV AND ranges separated) before trusting a win.", + "confidence": "medium-high on empirical observation (DINOv2-small +30.6% and DINOv2-base +24.1% both confirmed with clean 3-session protocol, fresh builds). Low on mechanism — original Transpose-bypass explanation ruled out (Transpose count identical opset17/21), kMaxSupportedOpset>=23 confirmed. Mechanism unknown. Scope: DINOv2 family only until mechanism is understood. 12 models now tested: 2 benefit (DINOv2-small/base), 8 neutral (incl. MobileViT after clean 2026-06-22 rerun), 1 marginal/inconclusive (BGE-small +7.3% with high CV), 1 CPU-bound.", + "falsified_by": null, + "scope": "ORT 1.24.5 (onnxruntime-windowsml). DINOv2-small and DINOv2-base confirmed. MobileViT-small REVISED to NEUTRAL (original win was a DVFS baseline artifact; clean rerun 2026-06-22 = +2.81% within noise). Does NOT apply to plain ViT (dino-vitb16 and rizvandwiki/gender-classification both confirmed NEUTRAL despite identical op counts to DINOv2-small), YOLOS-small, BERT-family NLP, CPU-bound models (rad-dino). ResNet-18 data inconclusive. BGE-small-en +7.3% marginal, inconclusive.", + "tracked_issue": "#869", + "perf_gain_validation_gates": { + "gate1_statistical": "PASSED for DINOv2 (3-session, ranges separate). 
FAILED for MobileViT (clean rerun 2026-06-22: ranges overlap, +2.81% < effect-size noise floor → NEUTRAL). FAILED for ResNet-18.", + "gate2_mechanism": "FAILED — original kMaxSupportedOpset bypass mechanism does not apply to ORT 1.24.x. New mechanism uninvestigated.", + "gate3_thermal_control": "PARTIALLY — 3×500-iter with 30s cool-down is better than single-session but DVFS spikes still occur and CAN poison the baseline (the MobileViT win was traced to exactly this; see ep_knowledge/README.md promotion checklist)." + }, + "follow_up_required": [ + "DONE: kMaxSupportedOpset >= 23 confirmed for ORT 1.24.4 (accepts opset 22 and 23 at InferenceSession level)", + "DONE: Transpose analysis — opset17 vs opset21 DINOv2-small: IDENTICAL (49 Transpose both). Not the mechanism.", + "OPEN: Investigate QNN EP graph partitioning diff for opset17 vs opset21. Why do +48 Reshape nodes help?", + "Run 5+ sessions (not 3) on DINOv2 opset17 vs opset21 to reduce DVFS uncertainty", + "Test EfficientNet-B0, MobileNet-V3 to determine if benefit is 'Conv+residual' or 'Conv+attention hybrid' specific", + "For ResNet-18: run 3 sessions x 2000 iters to get reliable sub-ms measurements" + ], + "experiments_convnext_early": [ + { + "opset": 17, + "p50_ms": 54.2, + "p90_ms": 104.5, + "min_ms": 9.56, + "std_ms": 44.1, + "iters": 50, + "note": "warm device, DVFS-dominated, NOT reliable" + }, + { + "opset": 19, + "p50_ms": 12.1, + "p90_ms": 77.7, + "min_ms": 9.11, + "std_ms": 60.0, + "iters": 50, + "note": "NOT reliable — 50 iters, DVFS" + }, + { + "opset": 21, + "p50_ms": 12.2, + "p90_ms": 38.0, + "min_ms": 9.73, + "std_ms": 10.1, + "iters": 20, + "note": "only 20 iters — NOT reliable" + } + ], + "last_updated": "2026-06-18" + }, + { + "id": "npu-002", + "title": "W8A16 quantization provides ~1.9x speedup over FP32 on QNN NPU (ConvNext only — not yet generalized)", + "observation": "ConvNext FP32 baseline: p50=19.4ms. W8A16 quantized (minmax, 128 samples): p50=10.29ms. 
1 model, 1 device.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "QNN HTP has native INT8 weight / FP16 activation datapath. W8A16 maps directly to HTP's weight-compressed matmul kernels.", + "action_for_autoconfig": "Always quantize for QNN NPU. W8A16 is the starting point. Validate accuracy after quantization.", + "confidence": "medium — mechanism is well-understood (HTP architecture), but 1.9x magnitude is from 1 model only. Speedup will vary by architecture.", + "falsified_by": null, + "scope": "ConvNext only — single model validation. The catalog sweep used W8A16 for all 8 models but did not include FP32 baselines for those models, so the 1.9x figure cannot be generalized. Need FP32 baseline runs on at least 3 diverse models before claiming 'most vision models'.", + "do_not_generalize_to": "Models with unusual op types not supported by QNN W8A16 path. Magnitude claim (1.9x) is ConvNext-specific.", + "follow_up_required": [ + "Measure FP32 baseline for MobileViT, DINOv2, ResNet-18 to verify speedup generalizes" + ] + }, + { + "id": "npu-003", + "title": "winml compile adds ~1.7x speedup on top of quantization for QNN NPU (ConvNext only — not yet generalized)", + "observation": "ConvNext W8A16 quantized: p50=10.29ms. W8A16 + compiled (EPContext): p50=6.01ms. 1 model, 1 device.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "Compilation pre-builds the QNN binary graph (.bin) and eliminates JIT graph partitioning at session creation time. EPContext model loads the pre-built binary directly.", + "action_for_autoconfig": "Always run winml compile after finding best quantized config for QNN NPU.", + "confidence": "medium — mechanism is well-understood (EPContext documented by QNN SDK). 1.7x magnitude is ConvNext-specific. Simpler models may see less benefit; complex models may see more.", + "falsified_by": null, + "scope": "ConvNext only — single model validation. Mechanism generalizes; magnitude (1.7x) does not. 
The catalog sweep results.json baseline p50 values already include the effects of whatever auto-config winml chose (which may or may not include compile) — not directly comparable.", + "follow_up_required": [ + "Verify compile speedup on MobileViT and DINOv2" + ] + }, + { + "id": "npu-004", + "title": "⚠️ ANECDOTE (NO DATA): W8A8 may cause accuracy collapse on models with LN+GELU — UNVALIDATED", + "observation": "W8A8 quantization was attempted on ConvNext. The experiment was aborted early — exact accuracy numbers were NOT recorded. The claim 'top-1 < 15%' is a recalled anecdote from the experimenter, not a measured result.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "ConvNext uses LayerNormalization + GELU in every block. Quantizing both weights AND activations to INT8 in these ops introduces severe numerical error. However, this is a hypothesis — the aborted experiment does not confirm or refute it.", + "action_for_autoconfig": "Treat as anecdotal. Do NOT use this to skip W8A8 without running eval first. If W8A8 top-1 drops > 15 points vs W8A16 baseline on first attempt, then skip.", + "confidence": "very_low — anecdotal, no preserved data, experiment not reproducible as recorded", + "falsified_by": null, + "scope": "UNVALIDATED. May apply to models with LN+GELU blocks but this is unconfirmed.", + "do_not_generalize_to": "BERT/ResNet models where W8A8 is often fine", + "required_experiment": "Run W8A8 quantization on ConvNext-tiny-224, record exact top-1 accuracy (eval on ImageNet-1k, 1000 samples minimum). Compare to W8A16 baseline. If collapse observed, also run with calibration_method=percentile to see if calibration quality is the issue." + }, + { + "id": "npu-005", + "title": "QNN Hub W8A16 model is slower on ORT QNN EP stack than ORT-quantized W8A16 — but comparison is not fair", + "observation": "QNN Hub W8A16 on winml ORT QNN EP: p50=14.82ms, std=8.8ms. 
ORT-quantized W8A16 (opset 17 QDQ): p50=6.01ms stable.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "QNN Hub uses opset 21 QDQ format with uint16 input tensor — this format may be incompatible with ORT QNN EP's expected quantization format.", + "fairness_caveat": "⚠️ This is NOT a fair comparison. QNN Hub models are compiled for the qairt native stack (qualcomm AI runtime), not for ORT QNN EP. Running a qairt-compiled model through ORT QNN EP is an unsupported use case. The comparison only shows that you should use ORT-generated quantization when targeting ORT QNN EP — which is obvious.", + "action_for_autoconfig": "Use ORT-generated W8A16 quantization (winml build), NOT QNN Hub pre-quantized models, when targeting ORT QNN EP stack.", + "confidence": "low — the finding is trivially true (use the right tool for the right stack) but the experiment doesn't tell us anything useful about relative performance.", + "falsified_by": null, + "scope": "ORT QNN EP stack only. QNN Hub models on their native qairt stack are likely much faster — that comparison was never made." + }, + { + "id": "npu-006", + "title": "Conv fusions (conv-bn/add/activation) cause catastrophic QNN NPU CPU fallback on Conv-dominant models", + "observation": "ResNet-18 with conv-bn-fusion+conv-add-fusion+conv-activation-fusion: 3-session p50s = [132.3, 134.97, 130.67]ms (CV=0.016, extremely stable) vs baseline [0.99, 4.00, 2.72]ms. ~130-135x regression. MobileViT with same fusions: [11.60, 11.36, 10.52]ms — neutral vs baseline [10.56, 11.72, 27.44]ms. BERT-family: neutral (no Conv ops to fuse). VALIDATION SWEEP 2026-06-16: dinov2-base h4=[26.06,25.92,25.87]ms vs h1=[34.56,34.67,33.15]ms → fusions actually -25% (FASTER, not regression). dino-vitb16 h4=[20.12,20.04,20.41]ms vs h1=[19.92,19.97,19.90]ms → +1.0% (neutral). 
Conv fusions are only hazardous for Conv-dominant models.", + "session_evidence_note": "The h4 sessions for ResNet-18 (132.3, 134.97, 130.67ms) show near-zero variance (CV=0.016) — in stark contrast to all other hypotheses. This is unusual for QNN NPU and strongly suggests deterministic CPU fallback (not DVFS noise). The regression is 50-136x even comparing best sessions.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "ORT conv fusion pass (ConvAddActivationFusion, ConvBNFusion) produces fused op types (e.g., Conv+BN fused) that QNN EP cannot map to HTP kernels. These ops fall back to CPU execution, adding PCIe round-trip overhead per-op for a Conv-heavy graph like ResNet.", + "action_for_autoconfig": "⚠️ CRITICAL: Do NOT apply conv-bn-fusion / conv-add-fusion / conv-activation-fusion for QNN NPU on Conv-dominant models (ResNet, EfficientNet, MobileNet). These passes are beneficial for CPU EP but hazardous for QNN NPU. Always run accuracy + latency gate after applying any Conv fusion. If regression > 5x, disable all conv fusions immediately.", + "confidence": "high on regression observation (4900%); medium on mechanism (CPU fallback hypothesis not yet confirmed via EP partition dump)", + "falsified_by": null, + "scope": "Conv-dominant models (ResNet, EfficientNet, MobileNet). MobileViT safe (original data). DINOv2 and plain ViT: fusions are neutral or slightly beneficial (2026-06-16 validation). Not applicable to NLP.", + "severity": "critical — can produce 50x regression", + "follow_up_required": [ + "Dump QNN EP partition to confirm fused ops cause CPU fallback", + "Test EfficientNet and MobileNet to confirm generalization", + "Check if winml analyze linter can detect this pattern pre-build" + ], + "refinement": "2026-06-17 delta sweep: ResNet-18 h10 (conv_add_fusion ONLY, no conv-bn or conv-activation) = p50 0.955ms vs baseline 0.964ms (+0.93%) — NEUTRAL. 
Catastrophic regression ONLY occurs with full fusion pack (conv-bn-fusion + conv-add-fusion + conv-activation-fusion) which produces FusedConv ops. Individual conv-add-fusion is safe. Root cause is confirmed: FusedConv op created by the bundle is not dispatchable by QNN EP.", + "last_updated": "2026-06-18" + }, + { + "id": "npu-007", + "title": "DVFS thermal noise on QNN NPU makes CV-based stability gating unreliable — requires session-level averaging", + "observation": "Across all 8 catalog models, QNN NPU CV ranges 0.1–2.0+ even on warm device. Original CV<15% gate blocks most candidates. Differences < 10% are within noise floor.", + "mechanism_confirmed": true, + "mechanism_hypothesis": "Snapdragon X Elite HTP Hexagon core runs DVFS aggressively. Single-session CV is dominated by thermal state, not model performance. The only reliable signal comes from session-level averaging (3+ independent sessions with cool-down).", + "action_for_autoconfig": "DISABLE CV gate for QNN NPU. Replace with: (1) minimum 3 independent sessions × 500+ iters with 30s cool-down between sessions. (2) Use median p50 across sessions as the signal. (3) Only trust gains > 10% — anything below is within noise floor. 
(4) Do NOT compare within-session std to declare stability.", + "confidence": "high — consistent across 8 models in catalog sweep", + "falsified_by": null, + "scope": "General — applies to all models on QNN NPU / Snapdragon X Elite HTP", + "bench_protocol_update": { + "screen_phase": "SKIP CV gate; run 200 iters as warmup only", + "full_phase": "3 sessions × 500 iters, 30s cool-down between sessions", + "signal": "median p50 across sessions", + "noise_floor": ">10% gain required to declare improvement" + } + }, + { + "id": "npu-008", + "title": "microsoft/rad-dino fails to build on QNN NPU across all opset variants (winml crash rc=0xC0000005)", + "observation": "catalog_qnn_sweep run 2026-06-17: all 6 hypotheses for microsoft/rad-dino (opset 17/19/21, with/without conv fusions) returned rc=3221225794 (0xC0000005, access violation) in <2s. No stderr captured — winml process crashed before producing any output. This is distinct from a build error: it is a hard crash of the winml CLI itself.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "rad-dino is a ViT encoder with a non-standard DINOv2 variant (larger heads, custom CLS token handling). Likely contains one or more ONNX operators or graph shapes that trigger an unguarded null-dereference or out-of-bounds access in the QNN EP quantization or compilation path (winml build calls QNN SDK compilation under the hood). Could also be a model size / dynamic axis issue.", + "action_for_autoconfig": "Skip QNN NPU for microsoft/rad-dino. If QNN NPU is required, file a bug with the crash dump and test with winml analyze first to identify unsupported ops before attempting build.", + "confidence": "high on observation (reproducible across all 6 hypotheses in same run); low on mechanism (no stack trace available)", + "falsified_by": null, + "scope": "microsoft/rad-dino only (confirmed). 
DINOv2-family models in general (facebook/dinov2-small, facebook/dinov2-base) are NOT affected — they build and run on QNN NPU successfully.", + "severity": "blocker — model is incompatible with QNN NPU build", + "follow_up_required": [ + "Run winml analyze --ep qnn on rad-dino ONNX to check unsupported ops", + "Capture crash dump (ProcDump) to get stack trace", + "Compare ONNX graph structure of rad-dino vs facebook/dinov2-small to isolate differentiating ops" + ], + "date_observed": "2026-06-17" + }, + { + "id": "npu-009", + "title": "bias_softmax_fusion adds incremental +14% on DINOv2 QNN NPU when combined with opset21", + "confidence": "medium", + "scope": "ViT-class models with attention+bias patterns. Confirmed on DINOv2-small; untested on plain ViT or BERT.", + "observation": "Catalog sweep 2026-06-17 delta sweep: DINOv2-small h7 (opset21+bias_softmax_fusion) p50=4.027ms vs h3 (opset21 only) p50=4.977ms. Incremental gain = +14.1% on top of opset21 alone. Total gain vs baseline: +38.6% (h7) vs +24.1% (h3). bias_softmax_fusion hypothesis also outperforms attention_fusion (h8=+28.4%) and matmul_transpose (h6=+24.8%) on DINOv2.", + "mechanism_confirmed": false, + "mechanism_hypothesis": "bias_softmax_fusion folds Add(qk_scores, bias)+Softmax into a single FusedSoftmax op. QNN Hexagon HTP has a native hardware path for fused attention-head softmax. Reduces dispatch overhead between the bias addition and the softmax kernel.", + "affected_models": [ + "facebook/dinov2-small" + ], + "validated_on": [ + "facebook/dinov2-small (+38.6% total, h7=4.027ms vs baseline 6.561ms)" + ], + "not_tested": [ + "DINOv2-base", + "plain ViT", + "BERT/RoBERTa (expected neutral based on npu-001 scope)" + ], + "autoconfig_action": "For DINOv2-family: include bias_softmax_fusion in the opset21 bundle. 
Prioritize over attention_fusion (h8) since h7 outperformed h8 by 10 percentage points on DINOv2-small.", + "added": "2026-06-18", + "source": "catalog_qnn_sweep.py h6-h10 delta sweep, DINOv2-small results.json" + }, + { + "id": "npu-010", + "title": "highdimRTR_lowdimRTR causes -19% regression on MobileViT QNN NPU due to spurious Reshape insertion", + "confidence": "high", + "scope": "Models with Gemm->Reshape->Transpose hybrid unfold patterns (MobileViT confirmed). DINOv2 (pure ViT) benefits: +38.1%. Architecture-dependent.", + "observation": "Catalog sweep 2026-06-17 delta sweep: MobileViT-small h9 (opset21+highdimRTR_lowdimRTR) median_p50=14.363ms vs h0 baseline 12.075ms = -18.9% regression. GPU sweep: same model h9 (GPU) = -6.89%. ONNX diff: h9 NPU graph has +36 extra Reshape nodes (395->431 total; 108->144 Reshape). The 12 original RTR patterns in h0 are UNCHANGED in h9. Instead, optimizer inserted Reshape pairs as intermediaries after Gemm nodes, breaking dispatch merging.", + "mechanism_confirmed": true, + "mechanism_detail": "highdimRTR misidentifies Gemm->Reshape->Transpose sequences (MobileViT CNN-ViT hybrid patch-unfold mechanism) as reducible RTR patterns. Inserts 36 extra intermediate Reshape nodes. These break Gemm+Reshape dispatch merging on QNN NPU and add DMA traffic. NPU more severely affected (-19%) than GPU (-6.9%) due to higher HTP DMA sensitivity.", + "affected_models": [ + "apple/mobilevit-small (-19% NPU, -6.9% GPU)" + ], + "validated_safe_models": [ + "facebook/dinov2-small (+38.1% NPU with h9 — pure ViT benefits from highdimRTR)" + ], + "architectural_discriminator": "Gemm->Reshape->Transpose hybrid unfold pattern (CNN-ViT). Detect via analyze_insight.py op-sequence scan before applying.", + "autoconfig_action": "Hard-block highdimRTR for models with Gemm->Reshape->Transpose hybrid sequences. 
analyze_insight.py should detect this pattern and add highdimRTR to skip_set.", + "added": "2026-06-18", + "source": "catalog_qnn_sweep.py h6-h10 delta sweep + ONNX graph diff (MobileViT h0 vs h9 node count comparison)" + }, + { + "id": "npu-011", + "title": "Fusions that fire (graph topology changes) but yield no perf benefit should be recorded and benefit-gated, not auto-kept", + "confidence": "medium", + "scope": "Cross-architecture on QNN NPU. Observed strongest on transformer/attention models (BERT, ConvNeXt) where conv/attention fusions fire cleanly but p50 is unchanged. Distinct from npu-006 (fusion fires AND regresses) and npu-001 (change helps).", + "observation": "A fusion flag can be confirmed *applicable* — after enabling it the post-optimize op count drops and graph topology changes vs the baseline build — yet the measured p50 delta stays inside the DVFS noise band (|delta| < CV-derived threshold). BERT-base h2/h3/h4 on NPU: graphs change per hypothesis but full-session p50s all land 39-43ms with CV 0.6-1.1, indistinguishable from h0 baseline. ConvNeXt-tiny: every hypothesis within 0.24% of baseline. These are 'applied-but-not-beneficial' fusions: real graph transforms with zero perf return.", + "criterion": "Classify a fusion as APPLIED when pre-vs-post-optimize op count and/or graph topology change (the fusion fired); classify as BENEFICIAL only when median p50 improves beyond the noise band. APPLIED && !BENEFICIAL = record here. Use op-count + topology diff (not input-graph pattern match) to prove the fusion fired, and session-averaged p50 with CV-derived threshold to judge benefit. IMPORTANT first cut: many neutral fusions are not 'applied but useless' — they are NO-OPS that never fired. 
The op-count diff is what separates the two.", + "evidence": "BERT-base QNN NPU recipe sweep 2026-06-24 (winml analyze metadata.total_operators per hypothesis): h0 baseline opset17 = 392 ops; h4 opset17+conv_fusions = 392 ops IDENTICAL op breakdown (MatMul 24 / Gelu 13 / Add 38 / LayerNorm 26 unchanged) → the conv-fusion flag was a complete NO-OP (BERT has no Conv ops to fuse). h5 opset21+conv_fusions = 440 ops = same as h3 opset21 ALONE (440) → conv fusion again added zero ops on top of the opset bump. The only thing that changed BERT's graph was the opset export (opset17 = 392 ops; opset19/21 = 440 ops, +48 nodes) — and that topology change did NOT improve p50 (opset21 medians no better than baseline, within thermal noise). So BERT's 'neutral conv fusions' are NOT npu-011 instances at all — they never fired; the genuine npu-011 instance here is opset17->21 (+48 ops, no benefit). Without the op-count diff one would have wrongly logged a 'neutral fusion' when nothing happened.", + "perf_reliability_note": "BERT per-session p50s were thermally dominated: h0 and h1 have IDENTICAL build configs (both 392 ops, same resolved optim) yet h1's 3 sessions read 50/42/82ms vs h0's clean 29.1/29.5/29.5ms — pure DVFS throttling because hypotheses run back-to-back and the chip heats up, biasing later hypotheses slower. The current 'median of 3 back-to-back session p50s' is not robust for benefit-gating; interleaved or cooldown-separated sampling is required (reinforces npu-007).", + "mechanism_confirmed": false, + "mechanism_hypothesis": "Fused ops are dispatchable by the QNN EP (no CPU fallback, unlike npu-006) so correctness/perf is preserved, but the fused kernel is not faster than the unfused sequence on HTP for these shapes — the win the fusion targets (CPU EP op-dispatch overhead) does not exist on NPU. 
Net effect is neutral, while build time and EP-mapping risk still increase.", + "autoconfig_action": "The analyzer should not auto-keep a fusion merely because its pattern matches the input graph. Steps: (1) build with and without the flag, diff op counts + topology to confirm the fusion fired; (2) compare session-averaged p50 against the noise band; (3) if it fired but delta is within noise, drop the fusion from the emitted config (or flag it 'neutral — omitted') rather than retaining it. Retaining neutral fusions costs build time and adds EP-dispatch risk for no return. This benefit gate is feature gap #4 in the README.", + "added": "2026-06-24", + "source": "catalog_qnn_sweep.py recipe NPU sweep — BERT-base + ConvNeXt-tiny (fusions fire, p50 within noise)" + } + ], + "search_space_rules": { + "opset": { + "recommended_order_conv_residual": [ + 21, + 17 + ], + "recommended_order_pure_attention": [ + 17 + ], + "recommended_order_nlp": [ + 17 + ], + "recommended_order_pure_conv": [ + 17, + "21 only if time allows — insufficient data" + ], + "architecture_gate": "DINOv2 family (facebook/dinov2-*) → try opset 21 first (+24-31% confirmed). MobileViT-class Conv+attention hybrid → try opset 21 (+26% original data). Plain ViT (dino-vitb16-class) → opset 17 only (NEUTRAL confirmed 2026-06-16). YOLOS → opset 17 only. NLP (BERT-family) → opset 17 only. Pure Conv (ResNet) → opset 17 (data insufficient for opset21 recommendation).", + "rationale": "npu-001 validated 2026-06-13 and 2026-06-16: DINOv2-small +30.6%, DINOv2-base +24.1% (fresh builds, clean protocol). Critical control: dino-vitb16 -0.7% NEUTRAL. This proves the speedup is DINOv2-architecture-specific, not a general ViT property.", + "dialectical_note": "⚠️ The original mechanism explanation (kMaxSupportedOpset bypass) does NOT apply to ORT 1.24.x (onnxruntime-windowsml 1.24.5). The speedup for DINOv2/MobileViT is empirically real but mechanistically unexplained. 
Always validate on the actual ORT version being shipped." + }, + "quantization": { + "recommended": "w8a16", + "skip": [ + "w8a8 if initial top1 < 15%" + ], + "dialectical_note": "⚠️ W8A8 skip rule is ConvNext-specific (LN+GELU sensitivity). Try W8A8 for models without LN in every block." + }, + "compile": { + "always_run": true, + "dialectical_note": "⚠️ Compile benefit is well-understood (EPContext pre-built binary). Low risk of being wrong, but verify compile output loads correctly." + }, + "graph_passes": { + "recommended": "autoconf defaults (gelu_fusion, matmul_add_fusion)", + "NEVER_apply_for_qnn_npu": [ + "conv-bn-fusion", + "conv-add-fusion", + "conv-activation-fusion" + ], + "hazard_note": "npu-006 CRITICAL: Conv fusions cause 4900% regression on ResNet-18. Do NOT apply conv fusions to Conv-dominant models on QNN NPU.", + "dialectical_note": "⚠️ Conv fusion ban is confirmed for ResNet. MobileViT was safe. Always run latency gate after applying any fusion to catch regressions." + }, + "bench_protocol": { + "cv_gate": "DISABLED for QNN NPU (npu-007)", + "sessions": 3, + "iters_per_session": 500, + "cool_down_s": 30, + "noise_floor_pct": 10, + "signal": "median p50 across sessions" + } + } +} diff --git a/research/autoconfig/lib/gen_model_report.py b/research/autoconfig/lib/gen_model_report.py new file mode 100644 index 000000000..401e5eaf9 --- /dev/null +++ b/research/autoconfig/lib/gen_model_report.py @@ -0,0 +1,841 @@ +#!/usr/bin/env python3 +# ------------------------------------------------------------------------- +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. 
+# -------------------------------------------------------------------------- + +"""Generate per-model HTML optimization reports from autoconfig sweep results.""" + +from __future__ import annotations + +import argparse +import html +import json +from pathlib import Path + + +BASE_DIR = Path(__file__).parent +CHART_MIN_GAIN = -200.0 +CHART_MAX_GAIN = 200.0 + + +def _resolve_path(path_str: str) -> Path: + path = Path(path_str) + if path.is_absolute(): + return path + if path.exists(): + return path.resolve() + return (BASE_DIR / path).resolve() + + +def _escape(value: object) -> str: + return html.escape("" if value is None else str(value)) + + +def _fmt_ms(value: float | None) -> str: + return "—" if value is None else f"{value:.2f} ms" + + +def _fmt_pct(value: float | None, signed: bool = True) -> str: + if value is None: + return "—" + return f"{value:+.1f}%" if signed else f"{value:.1f}%" + + +def _status_class(gain_pct: float | None) -> str: + if gain_pct is None: + return "neutral" + if gain_pct > 0: + return "good" + if gain_pct < 0: + return "bad" + return "neutral" + + +def _short_label(label: str, max_len: int = 26) -> str: + if len(label) <= max_len: + return label + return label[: max_len - 1] + "…" + + +def _sort_hypothesis_ids(hyp_id: str) -> tuple[int, str]: + if hyp_id.startswith("h"): + try: + return int(hyp_id[1:]), hyp_id + except ValueError: + pass + return 9999, hyp_id + + +def _get_p50(hyp: dict) -> float | None: + """Get median p50 from either nested (QNN/CPU) or flat (GPU) schema.""" + if "full" in hyp: + return hyp["full"].get("median_p50_ms") + return hyp.get("median_p50_ms") or hyp.get("overall_median_p50_ms") + + +def _get_runs(hyp: dict) -> list[float]: + if "full" in hyp: + return [float(v) for v in hyp.get("all_p50s_ms") or hyp.get("full", {}).get("p50s_ms", [])] + return [float(v) for v in hyp.get("all_p50s_ms") or hyp.get("full_p50s_ms", [])] + + +def _get_gain_pct(hyp_id: str, hyp: dict, baseline_p50_ms: float | None) -> float | 
None: + if hyp_id == "h0" and baseline_p50_ms is not None: + return 0.0 + for key in ("overall_gain_pct", "confirm_overall_gain_pct", "gain_vs_baseline_pct"): + value = hyp.get(key) + if value is not None: + return float(value) + p50 = _get_p50(hyp) + if baseline_p50_ms and p50: + return (baseline_p50_ms - p50) / baseline_p50_ms * 100 + return None + + +def _format_extra_optim(extra_optim: dict | None) -> str: + if not extra_optim: + return "autoconf defaults" + enabled = [key for key, value in extra_optim.items() if value] + return ", ".join(enabled) if enabled else "autoconf defaults" + + +def _format_champion_config(hyp: dict) -> str: + opset = hyp.get("opset") + flags = _format_extra_optim(hyp.get("extra_optim")) + if opset is None: + return flags + if flags == "autoconf defaults": + return f"opset {opset} + autoconf defaults" + return f"opset {opset} + {flags}" + + +def _confidence_text(hyp_id: str, hyp: dict, baseline_runs: list[float]) -> str: + status = str(hyp.get("status", "")) + verdict = str(hyp.get("verdict", "")) + + if status.startswith("BUILD"): + return "build failed" + if status == "BENCH_FAIL": + return "bench failed" + if status.startswith("SKIPPED"): + return "guarded skip" + if hyp.get("confirm_verdict") == "CONFIRMED": + return "ranges separated" + if hyp.get("confirm_verdict") == "MARGINAL_UNCONFIRMED": + return "ranges overlap" + if verdict == "KEEP_CONFIRMED": + wins = hyp.get("sessions_above_threshold") + total = hyp.get("total_sessions") + if wins is not None and total is not None: + return f"{wins}/{total} sessions confirm" + return "confirmation passed" + if verdict == "MARGINAL_UNCONFIRMED": + wins = hyp.get("sessions_above_threshold") + total = hyp.get("total_sessions") + if wins is not None and total is not None: + return f"{wins}/{total} sessions confirm" + return "confirmation incomplete" + + runs = _get_runs(hyp) + if baseline_runs and runs: + if max(runs) < min(baseline_runs) or min(runs) > max(baseline_runs): + return "ranges 
separated" + return "ranges overlap" + + if hyp_id == "h0": + return "baseline reference" + return "single-point only" + + +def _table_rows( + hyps: list[tuple[str, dict]], + baseline_p50_ms: float | None, + champion_hyp: str | None, + predicate, +) -> list[dict]: + rows: list[dict] = [] + baseline_runs = _get_runs(dict(hyps).get("h0", {})) + for hyp_id, hyp in hyps: + gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms) + status = str(hyp.get("status", "")) + verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or status or "—") + row = { + "hyp_id": hyp_id, + "label": hyp.get("label", ""), + "gain_pct": gain_pct, + "verdict": verdict, + "confidence": _confidence_text(hyp_id, hyp, baseline_runs), + "status": status, + "is_champion": hyp_id == champion_hyp, + } + if predicate(row, hyp): + rows.append(row) + return rows + + +def _render_table(title: str, icon: str, rows: list[dict], champion_hyp: str | None) -> str: + if not rows: + return "" + + table_rows = [] + for row in rows: + champion_class = " champion-row" if row["hyp_id"] == champion_hyp else "" + gain_style = ( + "gain-neg" if row["gain_pct"] is not None and row["gain_pct"] < 0 else "gain-pos" + ) + table_rows.append( + f""" + + {_escape(row["hyp_id"])} + {_escape(row["label"])} + {_fmt_pct(row["gain_pct"])} + {_escape(row["verdict"])} + {_escape(row["confidence"])} + + """ + ) + + return f""" +
+
{icon} {title}
+ + + + + + + + + + + + {"".join(table_rows)} + +
HypothesisLabelGain %VerdictConfidence
+
+ """ + + +def _render_characteristics(results: dict) -> str: + rows = [ + ("Model ID", results.get("model_id")), + ("Task", results.get("task")), + ("Arch type", results.get("model_type")), + ("Baseline opset", results.get("baseline_opset")), + ("EP", results.get("ep")), + ("Device", results.get("device")), + ] + + conv_pct = results.get("conv_pct") + if "npu006_risk" in results: + conv_text = "N/A" if conv_pct is None else f"{conv_pct:.1f}%" + rows.append(("Conv%", conv_text)) + rows.append(("npu-006 risk", "HIGH" if results.get("npu006_risk") else "LOW")) + + if "npu001_generalized" in results: + rows.append(("npu-001 note", results.get("npu001_generalized"))) + + cells = "".join( + f"{_escape(label)}{_escape(value if value is not None else '—')}" + for label, value in rows + ) + return f""" +
+
Model Characteristics
+ + {cells} +
+
+ """ + + +def _chart_bar_color(gain_pct: float | None) -> str: + if gain_pct is None: + return "#90a4ae" + if gain_pct > 5: + return "#43a047" + if gain_pct < -5: + return "#e53935" + return "#90a4ae" + + +def _render_chart( + hyps: list[tuple[str, dict]], baseline_p50_ms: float | None, champion_hyp: str | None +) -> str: + row_h = 40 + header_h = 48 + footer_h = 26 + label_w = 150 + bar_w = 520 + value_w = 78 + total_w = label_w + bar_w + value_w + total_h = header_h + footer_h + len(hyps) * row_h + center_x = label_w + bar_w / 2 + + elements: list[str] = [ + f'', + "", + '', + '', + '', + "", + "", + 'Hypothesis', + f'Gain vs baseline (%)', + f'', + ] + + for tick in (-200, -100, 0, 100, 200): + x = label_w + ((tick - CHART_MIN_GAIN) / (CHART_MAX_GAIN - CHART_MIN_GAIN)) * bar_w + elements.append( + f'' + ) + elements.append( + f'{tick}%' + ) + + for idx, (hyp_id, hyp) in enumerate(hyps): + y = header_h + idx * row_h + bar_mid = y + row_h / 2 + bar_top = y + 8 + bar_height = row_h - 16 + gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms) + clipped_gain = ( + None if gain_pct is None else max(min(gain_pct, CHART_MAX_GAIN), CHART_MIN_GAIN) + ) + status = str(hyp.get("status", "")) + verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or "") + p50 = _get_p50(hyp) + title = ( + f"{hyp_id}: {hyp.get('label', '')}\n" + f"status={status or '—'} verdict={verdict or '—'}\n" + f"p50={_fmt_ms(p50)} gain={_fmt_pct(gain_pct)}" + ) + + elements.append(f"{_escape(title)}") + elements.append( + f'' + ) + elements.append( + f'{_escape(hyp_id)}' + f'{_escape(_short_label(str(hyp.get("label", ""))))}' + ) + + if hyp_id == "h0": + elements.append( + f'' + ) + elements.append( + f'0.0%' + ) + elif status.startswith("BUILD"): + fail_w = 92 + fail_x = center_x - fail_w / 2 + stroke = "#1e88e5" if hyp_id == champion_hyp else "#78909c" + stroke_w = 4 if hyp_id == champion_hyp else 1.5 + elements.append( + f'' + ) + elements.append( + f'' + "BUILD_FAIL" + ) + elif 
clipped_gain is not None: + target_x = ( + label_w + + ((clipped_gain - CHART_MIN_GAIN) / (CHART_MAX_GAIN - CHART_MIN_GAIN)) * bar_w + ) + x = min(center_x, target_x) + width = max(abs(target_x - center_x), 2.0) + stroke = "#1e88e5" if hyp_id == champion_hyp else "none" + stroke_w = 4 if hyp_id == champion_hyp else 0 + value_x = target_x + 8 if clipped_gain >= 0 else target_x - 8 + anchor = "start" if clipped_gain >= 0 else "end" + elements.append( + f'' + ) + elements.append( + f'' + f"{_escape(_fmt_pct(gain_pct))}" + ) + + elements.append("") + + elements.append("") + return f""" +
+
Hypothesis Gain Chart
+
+ {"".join(elements)} +
+
+ """ + + +def _render_all_hypotheses( + hyps: list[tuple[str, dict]], + baseline_p50_ms: float | None, + champion_hyp: str | None, +) -> str: + """Full hypothesis table with opset, flags, all session p50s, and verdict.""" + baseline_runs = _get_runs(dict(hyps).get("h0", {})) + rows: list[str] = [] + + for hyp_id, hyp in hyps: + status = str(hyp.get("status", "")) + verdict = str(hyp.get("verdict") or hyp.get("confirm_verdict") or status or "—") + label = hyp.get("label", "") + opset = hyp.get("opset", "—") + extra_optim = hyp.get("extra_optim") + gain_pct = _get_gain_pct(hyp_id, hyp, baseline_p50_ms) + p50 = _get_p50(hyp) + all_runs = _get_runs(hyp) + + is_champion = hyp_id == champion_hyp + row_class = "champion-row" if is_champion else "" + + # Format extra_optim flags + if extra_optim: + enabled = [k for k, v in extra_optim.items() if v] + flags_str = ( + ", ".join(f'{_escape(f)}' for f in enabled) + if enabled + else 'none' + ) + else: + # extra_optim was not recorded for this hypothesis: show a placeholder + # (the flags are not parsed back out of the label) + flags_str = 'not stored' + + # Format all session p50s + if all_runs: + runs_html = " · ".join(f"{r:.2f}" for r in all_runs) + runs_cell = f'[{runs_html}]' + elif status.startswith("BUILD"): + runs_cell = f'{_escape(status)}' + else: + runs_cell = "—" + + # p50 cell + p50_cell = _fmt_ms(p50) if p50 else ("—" if not status.startswith("BUILD") else status) + + # gain cell + if gain_pct is not None: + gain_class = "gain-pos" if gain_pct > 0 else ("gain-neg" if gain_pct < 0 else "") + gain_cell = f'{_fmt_pct(gain_pct)}' + else: + gain_cell = "—" + + # verdict / confidence + verdict_class = ( + "verdict-keep" + if "KEEP" in verdict.upper() + else "verdict-discard" + if ( + "DISCARD" in verdict.upper() + or "BUILD" in verdict.upper() + or "FAIL" in verdict.upper() + ) + else "" + ) + conf = _confidence_text(hyp_id, hyp, baseline_runs) + champion_star = ( + ' ' if is_champion else "" + ) + + rows.append(f""" + + {_escape(hyp_id)}{champion_star} + {_escape(label)} + {_escape(str(opset))} +
{flags_str} + {_escape(p50_cell)} + {runs_cell} + {gain_cell} + {_escape(verdict)} + {_escape(conf)} + """) + + return f""" +
+
🔬 All Hypotheses — Full Detail
+
+ + + + + + + + + + + + + + + + {"".join(rows)} + +
IDConfig LabelOpsetExtra FlagsMedian p50Session p50s (ms)Gain %VerdictConfidence
+
+
+ ★ = champion hypothesis  ·  Session p50s are individual bench sessions (median used for comparison) +
+
+ """ + + +def _render_feature_gaps(results: dict) -> str: + feature_gaps = results.get("feature_gaps") or [] + if not feature_gaps: + return "" + + cards = "".join(f'
{_escape(gap)}
' for gap in feature_gaps) + return f""" +
+
Feature Gaps
+
{cards}
+
+ """ + + +def generate_model_report(results: dict, output_path: Path) -> None: + """Generate a single self-contained HTML report.""" + hypotheses_map = results.get("hypotheses", {}) + hyps = sorted(hypotheses_map.items(), key=lambda item: _sort_hypothesis_ids(item[0])) + baseline_p50_ms = results.get("baseline_p50_ms") + champion_hyp = results.get("best_hypothesis") + champion = hypotheses_map.get(champion_hyp or "", {}) + champion_p50_ms = results.get("best_p50_ms") or _get_p50(champion) + best_gain_pct = results.get("best_gain_pct") + best_gain_verdict = results.get("best_gain_verdict") + gain_reliable = best_gain_verdict == "RELIABLE" + # When the best observed gain is not statistically reliable, the recommended + # ship config is the auto-config baseline, not the fastest-observed hypothesis. + if best_gain_verdict and not gain_reliable: + reliability_note = f"⚠ {best_gain_verdict.replace('_', ' ').lower()} — ship baseline" + else: + reliability_note = "" + + keep_rows = _table_rows( + hyps, + baseline_p50_ms, + champion_hyp, + lambda row, _: (row["gain_pct"] is not None and row["gain_pct"] > 5) + or row["verdict"] == "KEEP_CONFIRMED", + ) + discard_rows = _table_rows( + hyps, + baseline_p50_ms, + champion_hyp, + lambda row, hyp: row["status"].startswith("BUILD") + or (row["gain_pct"] is not None and row["gain_pct"] < -2), + ) + neutral_rows = _table_rows( + hyps, + baseline_p50_ms, + champion_hyp, + lambda row, hyp: row not in keep_rows and row not in discard_rows, + ) + + sweep_ts = results.get("timestamp") + sweep_date = ( + sweep_ts.split("T", 1)[0] if isinstance(sweep_ts, str) and "T" in sweep_ts else sweep_ts + ) + header_title = ( + f"{str(results.get('ep', 'unknown')).upper()} {str(results.get('device', 'unknown')).upper()} " + f"Optimization Report — {results.get('model_id', 'unknown')}" + ) + subtitle = ( + f"{results.get('model_type', 'unknown')} arch · {sweep_date or 'unknown date'} · " + f"{len(hyps)} hypotheses tested" + ) + baseline_delta_ms = 
None + if baseline_p50_ms is not None and champion_p50_ms is not None: + baseline_delta_ms = baseline_p50_ms - champion_p50_ms + + keep_count = len(keep_rows) + discard_count = len(discard_rows) + champion_summary = _format_champion_config(champion) if champion else "—" + + html_doc = f""" + + + + + {_escape(header_title)} + + + +

{_escape(header_title)}

+
{_escape(subtitle)}
+ +
+
+
Best Gain %
+
{_fmt_pct(best_gain_pct)}
+
Champion: {_escape(champion_hyp or "—")}{(" · " + _escape(reliability_note)) if reliability_note else ""}
+
+
+
Baseline → Champion ms
+
{_escape(_fmt_ms(baseline_p50_ms))} → {_escape(_fmt_ms(champion_p50_ms))}
+
Latency reduction: {_escape(_fmt_ms(baseline_delta_ms))}
+
+
+
EP + Device
+
{_escape(str(results.get("ep", "unknown")).upper())} / {_escape(str(results.get("device", "unknown")).upper())}
+
Baseline opset {_escape(results.get("baseline_opset", "—"))}
+
+
+
Champion Config
+
{_escape(("h0 (baseline)" if reliability_note else (champion_hyp or "—")))}
+
{_escape(reliability_note or champion_summary)}
+
+
+
Total experiments
+
{len(hyps)}
+
{keep_count} KEEP / {discard_count} DISCARD
+
+
+ + {_render_characteristics(results)} + {_render_chart(hyps, baseline_p50_ms, champion_hyp)} + {_render_all_hypotheses(hyps, baseline_p50_ms, champion_hyp)} + {_render_table("Effective Optimizations", "✅", keep_rows, champion_hyp)} + {_render_table("Ineffective or Harmful", "❌", discard_rows, champion_hyp)} + {_render_table("Neutral / Build Fail", "⚪", neutral_rows, champion_hyp)} + {_render_feature_gaps(results)} + + + + +""" + + output_path.write_text(html_doc, encoding="utf-8") + + +def _load_results(results_path: Path) -> dict: + return json.loads(results_path.read_text(encoding="utf-8")) + + +def _generate_for_results_file(results_path: Path) -> Path: + results = _load_results(results_path) + output_path = results_path.with_name("report.html") + generate_model_report(results, output_path) + return output_path + + +def _generate_for_sweep_dir(sweep_dir: Path) -> list[Path]: + outputs: list[Path] = [] + for results_path in sorted(sweep_dir.rglob("results.json")): + outputs.append(_generate_for_results_file(results_path)) + return outputs + + +def main() -> int: + parser = argparse.ArgumentParser(description="Generate per-model autoconfig HTML report(s).") + parser.add_argument("results_json", nargs="?", help="Path to a single results.json file") + parser.add_argument( + "--sweep-dir", help="Sweep directory containing per-model results.json files" + ) + args = parser.parse_args() + + if bool(args.results_json) == bool(args.sweep_dir): + parser.error("Provide exactly one of results_json or --sweep-dir.") + + if args.sweep_dir: + sweep_dir = _resolve_path(args.sweep_dir) + outputs = _generate_for_sweep_dir(sweep_dir) + for output in outputs: + print(output) + return 0 + + results_path = _resolve_path(args.results_json) + output = _generate_for_results_file(results_path) + print(output) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/research/autoconfig/lib/report_gen.py b/research/autoconfig/lib/report_gen.py new file mode 100644 index 
000000000..0a4769bc5 --- /dev/null +++ b/research/autoconfig/lib/report_gen.py @@ -0,0 +1,280 @@ +# ------------------------------------------------------------------------- +# Copyright (c) Microsoft Corporation. All rights reserved. +# Licensed under the MIT License. +# -------------------------------------------------------------------------- + +"""report_gen.py — Phase 3 HTML report generator for autoconfig. + +Reads results.tsv and generates report.html with: + - Summary bar chart (p50 per hypothesis, colour-coded by status) + - Experiment table (config / delta_pct / status / CV) + - Champion config box +""" + +from __future__ import annotations + +import csv +import html as html_lib +from datetime import datetime +from pathlib import Path + + +# ── helpers ─────────────────────────────────────────────────────────────────── + + +def _load_tsv(results_tsv: Path) -> list[dict]: + if not results_tsv.exists(): + return [] + with results_tsv.open(encoding="utf-8") as f: + return list(csv.DictReader(f, delimiter="\t")) + + +def _status_color(status: str) -> str: + s = status.lower() + if "new best" in s or (s.startswith("keep") and "marginal" not in s): + return "#2e7d32" # dark green + if "marginal" in s: + return "#f57f17" # amber + if "discard" in s: + return "#b0bec5" # grey + if "crash" in s or "fail" in s: + return "#c62828" # red + return "#78909c" + + +def _status_bg(status: str) -> str: + s = status.lower() + if "new best" in s or (s.startswith("keep") and "marginal" not in s): + return "#e8f5e9" + if "marginal" in s: + return "#fff8e1" + if "crash" in s or "fail" in s: + return "#ffebee" + return "#f5f5f5" + + +def _p50_float(val: str | None) -> float | None: + if not val or val == "N/A" or "UNSTABLE" in str(val): + return None + try: + return float(str(val).replace("ms", "").strip()) + except ValueError: + return None + + +# ── bar chart ───────────────────────────────────────────────────────────────── + + +def _bar_chart_html(rows: list[dict], 
baseline_p50: float | None) -> str: + valid = [(r, _p50_float(r.get("median_p50_ms") or r.get("screen_p50_ms"))) for r in rows] + valid = [(r, v) for r, v in valid if v is not None] + if not valid: + return "

No benchmark data yet.

" + + max_val = max(v for _, v in valid) * 1.1 + bars = [] + for r, p50 in valid: + label = html_lib.escape(r.get("label", "?")) + status = r.get("status", "") + color = _status_color(status) + width_pct = p50 / max_val * 100 + delta = r.get("delta_pct", "") + baseline_marker = "" + if baseline_p50: + bx = baseline_p50 / max_val * 100 + baseline_marker = ( + f'
' + ) + bars.append(f""" +
+ {baseline_marker} +
{label}
+
+
+
+
+
{p50:.1f}ms + {html_lib.escape(delta)} +
+
+
""") + + return ( + '
\n' + '
' + "— baseline (blue line)
\n" + "".join(bars) + "\n
" + ) + + +# ── experiment table ────────────────────────────────────────────────────────── + + +def _table_html(rows: list[dict]) -> str: + cols = [ + "iter", + "label", + "dimension", + "optim_flags", + "opset", + "screen_p50_ms", + "median_p50_ms", + "delta_pct", + "cv", + "status", + ] + hdrs = "".join( + f'{c.replace("_", " ")}' + for c in cols + ) + trs = [] + for r in rows: + status = r.get("status", "") + bg = _status_bg(status) + color = _status_color(status) + cells = [] + for c in cols: + val = html_lib.escape(str(r.get(c, ""))) + if c == "status": + cells.append( + f'{val}' + ) + else: + cells.append(f'{val}') + trs.append( + f'' + "".join(cells) + "" + ) + return ( + '' + f"{hdrs}" + f"{''.join(trs)}" + "
" + ) + + +# ── champion box ───────────────────────────────────────────────────────────── + + +def _champion_html(rows: list[dict], model_id: str, ep: str) -> str: + keeps = [r for r in rows if r.get("status", "").lower().startswith("keep")] + if not keeps: + return ( + '
' + "No KEEP verdict yet — search in progress.
" + ) + # Rows without a parseable median sort last; a finite sentinel such as 999 could beat a real p50 above 999 ms. + best = min(keeps, key=lambda r: _p50_float(r.get("median_p50_ms")) or float("inf")) + flags = html_lib.escape(best.get("optim_flags", "(none)")) + opset = html_lib.escape(str(best.get("opset", 17))) + p50 = html_lib.escape(best.get("median_p50_ms", "N/A")) + delta = html_lib.escape(best.get("delta_pct", "N/A")) + label = html_lib.escape(best.get("label", "?")) + return f""" +
+
+ Champion Config
+ + + + + + + + + + + + + +
Model{html_lib.escape(model_id)}
EP{html_lib.escape(ep.upper())}
Hypothesis{label}
Optim flags{flags}
Opset{opset}
Median p50{p50} ms + ({delta})
+
""" + + +# ── main entry ──────────────────────────────────────────────────────────────── + + +def generate_report( + results_tsv: Path, + work_dir: Path, + model_id: str, + ep: str, + insight_notes: list[str] | None = None, +) -> Path: + """Generate report.html inside work_dir. Returns the output path.""" + rows = _load_tsv(results_tsv) + out_path = work_dir / "report.html" + + # Find baseline p50 from h0 row + baseline_p50: float | None = None + for r in rows: + if r.get("iter") == "0" or "baseline" in r.get("label", "").lower(): + baseline_p50 = _p50_float(r.get("median_p50_ms")) + if baseline_p50: + break + + chart = _bar_chart_html(rows, baseline_p50) + table = _table_html(rows) + champion = _champion_html(rows, model_id, ep) + ts = datetime.now().strftime("%Y-%m-%d %H:%M") + n_done = len(rows) + n_keep = sum(1 for r in rows if r.get("status", "").lower().startswith("keep")) + + insight_section = "" + if insight_notes: + items = "".join(f"
  • {html_lib.escape(n)}
  • " for n in insight_notes) + insight_section = f""" +

    Phase 1 Insight Engine

    +
      {items}
    """ + + html = f""" + + + +autoconfig report — {html_lib.escape(model_id)} ({html_lib.escape(ep.upper())}) + + + + +

    autoconfig — {html_lib.escape(model_id)}

    +
    EP: {html_lib.escape(ep.upper())}  ·  + {n_done} experiments  ·  {n_keep} KEEP  ·  + Generated: {ts}
    + +
    + {champion} +
    + +
    +

    Benchmark Chart (median p50)

    + {chart} +
    + +{f'
    {insight_section}
    ' if insight_section else ""} + +
    +

    All Experiments

    + {table} +
    + + +""" + + out_path.write_text(html, encoding="utf-8") + print(f" Report written: {out_path}") + return out_path diff --git a/research/autoconfig/skills/explorer/SKILL.md b/research/autoconfig/skills/explorer/SKILL.md new file mode 100644 index 000000000..659506e87 --- /dev/null +++ b/research/autoconfig/skills/explorer/SKILL.md @@ -0,0 +1,75 @@ +--- +name: explorer +description: > + Use this sub-skill (driven by orchestrator) to decide WHAT to try next + in a winml-cli config search. It builds the hypothesis pool, applies confirmed-KB + hard-blocks and the Phase 1 Insight Engine skip_set to prune dead-end passes, then + ranks the survivors by Insight priority boost into a priority_queue and yields the + next hypothesis. It never builds or benchmarks — it only chooses the next experiment. +--- + +# explorer + +The Explorer is the **"what to try next"** sub-skill of the autoconfig loop +(Phase 2). It owns search *order* only. Mirrors the `Explorer` class in +`skills/orchestrator/autoconfig.py` and the Explorer box in +`research/autoconfig/docs/autoconfig_diagram.html`. + +**Implementation in this folder:** `analyze_insight.py` (the Phase 1 Insight Engine +that produces the `skip_set` + `priority_boosts` this skill ranks by) and +`analyze_graph.py` (ONNX graph-pattern helper). + +## When to use + +Invoked by `orchestrator` at the top of each Phase 2 iteration to get +the next candidate config delta. Not used standalone. + +## Inputs + +- `hypothesis_pool` — the full OFAT search grid from the orchestrator: from a FP32 + baseline, one factor varied at a time — opset (17–21), quant precision + (fp32/fp16/int8/int16/w8a16), or one single graph pass — as + `(label, patch_fn, dimension)` triples (~74 combinations). The Explorer + prunes/reorders it; it does not generate it. +- `kb` — confirmed `ep_device_knowledge/_.json` rules, especially `skip_passes` hard-blocks. 
+- `insight` — Phase 1 output: `skip_set` (passes to prune for this model) + `priority_boosts` (per-label ranking weight). + +## Procedure + +1. **Build the priority_queue** — stable-sort the hypothesis pool by descending + Insight `priority_boosts` (model-aware ranking; ties keep pool order). +2. **Pop the next hypothesis** from the queue. +3. **Skip-check before yielding** (`skip_reason`): + - KB hard-block: if the candidate's flags match a confirmed `skip_passes` rule, skip with that rule as the reason (e.g. npu-006 conv-fusion block when Conv% > 20%). + - Insight skip_set: if the label is in `insight.skip_set`, skip with "Insight Engine: