examples: add facebook/bart-large-mnli text-classification recipe #933
ssss141414 wants to merge 1 commit into
Ships an fp32 NLI head for facebook/bart-large-mnli at task=text-classification.
Recipe carries value_range=[2,3] on input_ids to deterministically inject the
eos_token_id required by BartForSequenceClassification's eos-pooling head.
Goal-ladder verdict (CPU):
- L0 build PASS - 1042 ops, 21 unique types, 407M params, 384 KB graph + 1.6 GB external data
- L1-CPU perf PASS - 1.64 s/iter on 1024-token real-tokenized input
(custom Python script; winml perf ignores recipe value_range
and crashes on eos-pooling models with random ints - winml
CLI feature gap to file separately)
- L2 numerical PASS - cosine = 1.000000, max_abs = 1e-6 vs PyTorch reference
(argmax = 2, ENTAILMENT, on both sides)
- L3 task-metric PASS - accuracy = 0.88, latency = 1.89 s/sample on
glue/mnli/validation_matched/100 samples, seed=42
(matches published ~0.886 within MC noise; first end-to-end
Goal-L3 PASS for this repo)
DML/QNN/OpenVINO are HOST-BLOCKED on producer host (DML 0xC0000409, QNN absent,
OpenVINO DLL-load-fails) - not penalized per local skill convention.
Optimum-coverage: VENDOR-COVERED on text-classification via Optimum BartOnnxConfig;
recipe is pure-data, no per-architecture code change needed.
Producer notes from running the recipe live in research/adding-model-support/
model_knowledge/bart.json on the skills-poc working branch (not landed to main
yet; pending separate skill-research PR for the full research/ tree).
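The eos-pooling constraint behind the `value_range` workaround can be illustrated with a minimal pure-Python sketch. It follows the general idea of eos pooling (take the hidden state at the last `<eos>` position in each row, and require every row to carry the same number of `<eos>` tokens); the function name `eos_pool` and the toy data are hypothetical, and this is not the library's implementation.

```python
EOS_TOKEN_ID = 2  # eos_token_id for facebook/bart-large-mnli, as stated in the recipe note


def eos_pool(input_ids, hidden_states, eos_token_id=EOS_TOKEN_ID):
    """Return, per row, the hidden state at the last <eos> position.

    input_ids:     list of rows, each a list of token ids
    hidden_states: list of rows, each a list of per-token vectors
    """
    eos_counts = [row.count(eos_token_id) for row in input_ids]
    # Rows with differing <eos> counts cannot be pooled into one batch.
    if len(set(eos_counts)) > 1:
        raise ValueError("All examples must have the same number of <eos> tokens.")
    # Random ids that never hit eos_token_id leave nothing to pool.
    if eos_counts and eos_counts[0] == 0:
        raise ValueError("No <eos> token found; nothing to pool.")
    pooled = []
    for ids, states in zip(input_ids, hidden_states):
        last_eos = max(i for i, tok in enumerate(ids) if tok == eos_token_id)
        pooled.append(states[last_eos])
    return pooled


# Toy batch: two rows, each with two <eos> tokens, one scalar "hidden state" per token.
toy_ids = [[0, 31, 2, 45, 2], [0, 77, 2, 12, 2]]
toy_states = [[[float(i)] for i in range(5)] for _ in toy_ids]
toy_pooled = eos_pool(toy_ids, toy_states)
```

With unconstrained random integers, rows can easily end up with zero or mismatched `<eos>` counts, which is the failure mode the recipe's narrowed range is meant to avoid.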
Reviewer verdict: APPROVE
Reviewer ran REVIEW.md on this PR head (
Step 0 - Scope check
L0 - Structural validation
Loaded
PASS.
L1-CPU - Perf
Producer custom Python perf script (real tokenized inputs) → 1638 ms/iter on 1024-token sequence. Reviewer accepts this in lieu of DML / QNN / OpenVINO - HOST-BLOCKED on producer host per
L2 - PyTorch-vs-ONNX numerical
Cosine = 1.000000, max_abs = 1e-6, PT argmax = ONNX argmax = 2 (ENTAILMENT) on premise+hypothesis pair. Producer log preserved at
L3 - Task-metric (independent re-run)
Goal-ladder verdict (per
| Tier | Verdict |
|---|---|
| L0 | PASS |
| L1-CPU | PASS |
| L1-DML / QNN / OpenVINO | HOST-BLOCKED (not penalized) |
| L2 | PASS |
| L3 | PASS (independently re-run) |
Short-circuit honored — no FAIL anywhere.
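The L2 numerical comparison reported above (cosine similarity, maximum absolute difference, and argmax agreement between PyTorch and ONNX logits) can be sketched as follows. This is a minimal illustration, not the producer's script: the function names, the thresholds `cos_min` and `abs_max`, and the toy logits are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def l2_numerical_check(pt_logits, onnx_logits, cos_min=0.9999, abs_max=1e-4):
    """Compare reference and exported logits; thresholds are hypothetical."""
    cosine = cosine_similarity(pt_logits, onnx_logits)
    max_abs = max_abs_diff(pt_logits, onnx_logits)
    same_argmax = argmax(pt_logits) == argmax(onnx_logits)
    return {
        "cosine": cosine,
        "max_abs": max_abs,
        "argmax": argmax(onnx_logits),
        "passed": cosine >= cos_min and max_abs <= abs_max and same_argmax,
    }


# Toy 3-class logits (made up); index 2 is ENTAILMENT in the report above.
toy_result = l2_numerical_check([-2.0, 0.5, 3.0], [-2.000001, 0.5, 3.0])
```

Checking argmax agreement alongside the two distance metrics guards against the case where logits are numerically close but the predicted label still flips.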
Outcome-L0
- PR description carries the 9-item structure (_meta-032). ✓
- Real PR URL present at hand-off. ✓
- Scope-matches-Effort-tier (L0★ = recipe + README only). ✓
Methodology-evolution audit (_meta-031)
Producer declared "No NEW methodology friction observed in this contribution" with cited Step 4b trigger inventory (all 7 triggers checked). Sanity-check: no friction signals leaked (no --help mid-PR, custom Python perf was for documented _meta-017 workaround, not new friction). PASS.
Sign-off
- Reviewer re-ran: `git diff origin/main...HEAD`, `onnx.load` structural probe, `winml eval` n=20 L3.
- All evidence on disk, all numbers within tolerance.
APPROVE.
…ta-035 (same-author Approve block)

Iter-6 first reviewer-side run on PRs #933 (bart-large-mnli) and #934 (vit-gpt2) surfaced two reviewer-flow gaps not previously codified:

_meta-034: REVIEW.md must instruct the reviewer to explicitly check out the PR branch (stash dirty WT, `gh pr checkout` / `git checkout <branch>`, diff-scope check, artifact-reuse rule for cached `temp/verify_*/` dirs, restore producer branch with `git stash pop`). Without this, the reviewer scores the producer's working tree (with N months of untracked work) instead of PR scope against main. Mechanism confirmed same day via end-to-end Step 0 runs on both PRs. REVIEW.md Step 0 section already landed in commit 1f11b0b; this commit adds the matching _meta-034 finding.

_meta-035: `gh pr review --approve` returns HTTP 422 'Can not approve your own pull request' when producer + reviewer agents run under the same GitHub identity. Falls back to `gh pr comment --body-file`, which lands the structured verdict in the PR conversation but loses GitHub-side APPROVED metadata. REVIEW.md 'How to deliver the verdict' subsection added under Verdict format. Also documents GH_TOKEN env var re-leak between PowerShell commands (`Remove-Item Env:GH_TOKEN` at start of every gh invocation).

Reviewer verdicts for iter-6:
- PR #933 (bart-large-mnli): APPROVE (issuecomment-4775278723)
- PR #934 (vit-gpt2): APPROVE (issuecomment-4775278822)

Files:
- REVIEW.md 'How to deliver the verdict' subsection under Verdict format
- skill_meta/findings.json _meta-034 + _meta-035 (both mechanism_confirmed=true)
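The _meta-035 fallback (try a formal approval, then fall back to a plain comment when approval is rejected) could look roughly like the sketch below. It is an illustration under assumptions: the helper name `deliver_verdict` and its injectable `run` parameter are hypothetical, and it falls back on any non-zero exit code rather than inspecting the HTTP 422 specifically.

```python
import os
import subprocess


def deliver_verdict(pr_number, body_file, run=subprocess.run):
    """Post a review verdict; returns "approved" or "commented".

    `run` defaults to subprocess.run and is injectable so the flow can be
    exercised without a real GitHub session.
    """
    # Drop GH_TOKEN for each invocation, mirroring the Remove-Item Env:GH_TOKEN step.
    env = {k: v for k, v in os.environ.items() if k != "GH_TOKEN"}
    common = {"capture_output": True, "text": True, "env": env}

    approve = run(
        ["gh", "pr", "review", str(pr_number), "--approve", "--body-file", body_file],
        **common,
    )
    if approve.returncode == 0:
        return "approved"

    # Same-identity approvals are rejected, so land the verdict as a comment instead.
    comment = run(
        ["gh", "pr", "comment", str(pr_number), "--body-file", body_file],
        **common,
    )
    if comment.returncode != 0:
        raise RuntimeError(f"gh pr comment failed: {comment.stderr.strip()}")
    return "commented"
```

A call such as `deliver_verdict(933, "verdict.md")` would then report which path was taken, so the caller can record that the APPROVED metadata is missing when the result is `"commented"`.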
Reviewer verification: OV cpu / gpu / npu - main @ b448652
Commands
```powershell
# config
uv run winml config -m facebook/bart-large-mnli --task text-classification -o temp/verify_pr933_bart_config.json
# build (OV CPU, fp32, eos-safe config)
uv run winml build -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json -o temp/verify_pr933_bart_build --ep openvino --device cpu --precision fp32 --no-analyze --no-optimize --no-quant --no-compile --rebuild
# perf - cpu / gpu / npu
uv run winml perf -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --ep openvino --device cpu --precision fp32 --iterations 1 --warmup 0 --no-analyze --no-optimize --no-quant --no-compile -f json
# eval - cpu / gpu / npu (samples=2, glue/mnli validation_matched)
uv run winml eval -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --task text-classification --dataset glue --dataset-name mnli --split validation_matched --samples 2 --device cpu --ep openvino --column input_column=premise --column second_input_column=hypothesis --column label_column=label -f json
```
Results
Notes:
Closing as catalog-only - baseline build PASSES without this recipe
Reviewer (myself) ran the
Gate 2 - baseline build PASSES out-of-box: No
On the
Verdict: same as #934 / #943 / #944 / #945 / #946 - model is supported by
Skill amendment landed in
Step 1b added: run BOTH gates before claiming Goal-Lx PASS.
- Gate 1: `winml config` diff against shipped recipe (strip `_note`).
- Gate 2: `winml build` baseline on main without `-c`.

If both gates show parity, the recipe is catalog-only - do not file.

Audit on 2026-06-23 found 6 of 6 recent recipe PRs (#933 #934 #943 #944 #945 #946) had zero CLI-surface delta over auto-config output. All 6 closed; replacement = user runs `winml build -m <id>` direct.

SKILL.md additions:
- Step 0 Effort L0/L0★ guardrail
- Step 1b full procedure with verdict table
- Goal-axis guardrail (Lx evidence requires Step 1b real-delta)
- Step 4b trigger #8 (catalog-only escape) + next-id bump to 039

findings.json: _meta-038 with refines [_meta-013, _meta-018], mechanism_confirmed=true, evidence cites the 6-PR audit.
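Gate 1 amounts to a structural diff between the auto-generated config and the shipped recipe once `_note` fields are ignored. A minimal sketch of that comparison is shown below; the helper names (`strip_notes`, `config_delta`), the `MISSING` marker, and the toy configs are hypothetical and do not reflect the real recipe schema.

```python
MISSING = "<missing>"  # marker for a key present on only one side


def strip_notes(node):
    """Recursively drop `_note` keys so documentation-only fields are ignored."""
    if isinstance(node, dict):
        return {k: strip_notes(v) for k, v in node.items() if k != "_note"}
    if isinstance(node, list):
        return [strip_notes(v) for v in node]
    return node


def config_delta(auto_config, recipe):
    """Return {dotted.path: (auto_value, recipe_value)} for every difference."""
    delta = {}

    def walk(a, b, path):
        if isinstance(a, dict) and isinstance(b, dict):
            for key in sorted(set(a) | set(b)):
                child = f"{path}.{key}" if path else key
                walk(a.get(key, MISSING), b.get(key, MISSING), child)
        elif a != b:
            delta[path] = (a, b)

    walk(strip_notes(auto_config), strip_notes(recipe), "")
    return delta


# Toy configs with made-up keys: only the `_note` differs, so Gate 1 shows parity.
toy_auto = {"task": "text-classification", "export": {"precision": "fp32"}}
toy_recipe = {"_note": "why this recipe exists", "task": "text-classification",
              "export": {"precision": "fp32", "_note": "inline remark"}}
toy_delta = config_delta(toy_auto, toy_recipe)
```

An empty delta only covers Gate 1; per the procedure above, the recipe counts as catalog-only when the baseline build in Gate 2 also passes without `-c`.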
PR: facebook/bart-large-mnli — close Goal-L3 ladder on text-classification
Iter: 6 (Goal-ladder extension; recipe shipped in iter-5 as bart-004)
Producer: main agent (2026-06-23)
Claimed tier: (Effort = L0★, Goal = L3, Outcome = L1)
Summary
This PR closes the full Goal ladder L0..L3 on `facebook/bart-large-mnli` (text-classification, fp32, CPU). The recipe was shipped in iter-5 with L0+L1-CPU+L2 PASS (bart-004); this PR adds the L3 task-metric evidence via `winml eval` on `glue/mnli/validation_matched/100-sample` and records the result as the first L3 PASS in repo. No source-code changes; no new recipe. The contribution is a structured outcome update against an already-shipped artifact plus the appended `bart-005` finding.
1. Recipe file
examples/recipes/facebook_bart-large-mnli/text-classification_config.json - unchanged from iter-5 (bart-004). Recipe carries the `value_range: [2, 3]` workaround on `input_ids` to deterministically inject `eos_token_id=2`; documented inline under `_note` per `_meta-013` convention.
2. README index row
examples/recipes/README.md line 21 - present (`facebook/bart-large-mnli | text-classification | ...`). No edit needed.
3. Build output directory + artifact inventory
`temp/verify_bart_build/` (gitignored - referenced by path for reviewer re-execution):
- `model.onnx` (… `optimize` pass)
- `model.onnx.data`
- `export.onnx` + `.data`
- `optimized.onnx` + `.data`
- `analyze_result.json`
- `export_htp_metadata.json`
- `winml_build_config.json`

External-data layout check (`_meta-023`): `model.onnx` and `model.onnx.data` are co-located in the same directory. PASS.
4. Build log
Iter-5 build log: `temp/verify_bart_build/build.log` (referenced in bart-004 mechanism_notes). Iter-6 used the iter-5 artifact unchanged; no re-build needed for the L3 closure.
L3 eval log (this PR): temp/bart_mnli_l3.log - 6,354 B; preserved via `Tee-Object`.
5. Appended findings
Per-model - `model_knowledge/bart.json`
bart-005 - "VALIDATED Goal-L3 for facebook/bart-large-mnli - `winml eval` on GLUE/mnli validation_matched (100 samples, CPU) gives accuracy=0.8800, latency=1.89s/sample. Closes the full Goal ladder L0..L3 for the first encoder-decoder family in repo. Cross-refs `_meta-019..030` from iter-6 PR-mining."
Falsifies: `_meta-015` scope for single-head NLI tasks (translation/summarization remain CLI-blocked, but text-classification on a seq2seq architecture IS reachable).
Refines: bart-004.
Skill-meta - `skill_meta/findings.json`
This PR does not introduce new `_meta-NNN` findings; the iter-6 methodology findings (`_meta-019..031`) shipped in a separate PR bundle. See `_meta-029` (L3 verdict triage with TIMEOUT-at-scale third tier) and `_meta-018` (March + Short-circuit rules), which gate this PR's evidence requirements.
6. Optimum-coverage probe verdict
Verdict: VENDOR-COVERED on `text-classification`. Effort L0★ (no code; pure recipe) is the correct classification. Verified at iter-5 (bart-002) and re-confirmed by the bart-005 build.
7. Claimed (Effort, Goal, Outcome) tier
- … `value_range` narrowing on a vendor-covered task)
- … `bart-005` finding + this report; no source-code changes ⇒ no Outcome-L1 feature-gap issues filed for THIS PR, but the iter-6 methodology-evolution PR carries the cross-cutting feature gaps)

8. Goal-ladder verdict table (per `_meta-018`)
- `winml build` produced `model.onnx` + `.data` co-located; opset 17, fp32, 1042 nodes, 21 unique op types; external-data layout per `_meta-023`
- … `_meta-017` - `winml perf` ignores recipe `value_range` and crashes on eos-pooling models with random ints)
- `_meta-016`: DML crash 0xC0000409, QNN absent, OpenVINO DLL-load-fails on this host. `--ep-options enable_graph_capture=false` retry per `_meta-026` NOT attempted on this host (would not help: DLL-load is a packaging issue). Not penalized per `_meta-016` honest-floor rule.
- accuracy = 0.8800, latency = 1.89 s/sample, throughput 0.53 samples/sec, total 189.05 s on glue/mnli/validation_matched/100 samples, seed=42. Reference (published bart-large-mnli on full validation_matched): ~0.886, within MC noise of 100-sample subset. Result JSON: temp/bart_mnli_l3_eval.json. Log: temp/bart_mnli_l3.log
- … `_meta-029` - full run would take ~5h CPU; out of turn budget. Marker file convention not yet dropped; cited here so future contributors know the gap.

Short-circuit honored (per `_meta-018`): no FAIL verdict anywhere in the ladder; CPU-PASS at L0..L3 supports the claimed ceiling honestly. Non-CPU EPs are HOST-BLOCKED (not FAIL), so they don't short-circuit higher tiers.
9. Methodology-evolution declaration (per `_meta-031`)
No NEW methodology friction observed in this contribution. The iter-6 meta-experiment that surfaced `_meta-019..031` was the vehicle that ran this contribution; those findings shipped in a separate methodology PR. Within the bart-mnli L3 closure itself, the only friction was the `--dataset-config` vs `--dataset-name` flag confusion, already captured under bart-005's gotchas section, which is the correct scope (per-model knowledge, not skill-meta, because the wrong flag is the same flag for any task).
Step 4b trigger inventory:
- … `--dataset-config` → `--dataset-name`. Captured in bart-005 gotchas (per-model scope, not `_meta-NNN`).

Artifact mining (Step 4)
`analyze_result.json`
- `total_operators`: 1042
- `unique_operator_types`: 21
- `_meta-013`: runtime-rule parquet files not available on this external host; re-run analyze against an available EP is structurally blocked. Reviewer with internal host should re-run.

`export_htp_metadata.json`
- `model.total_parameters`: 407,344,131 (407M, matches HF config card)
- `model.total_modules`: 353
- `tracing.modules_traced`: 93 (26% trace coverage; partial, because the classification head is not fully traced: `BartForSequenceClassification` does eos-pooling via Python indexing rather than as a traceable module)

`winml_build_config.json` (autoconf diff vs producer recipe)
- `optim` block: autoconf added `clamp_constant_values=true`, `gelu_fusion=true`, `matmul_add_fusion=true`, `remove_isnan_in_attention_mask=true` (recipe specified `optim: null`)
- `loader.model_class`: `AutoModelForSequenceClassification` (auto-resolved from `task=text-classification`)

Reviewer next steps
- … accuracy ∈ [0.85, 0.91] within MC noise at seed=42, n=100.
- … `model.onnx` + `.data` co-located via `Get-ChildItem temp\verify_bart_build` per `_meta-023`.
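The "within MC noise" claim and the latency figures can be checked with a little arithmetic. The sketch below treats the 100-sample accuracy as a binomial proportion around the published ~0.886; that modelling choice, the variable names, and the full-split size of 9,815 examples for MNLI validation_matched are assumptions not stated in this PR.

```python
import math


def binomial_standard_error(p, n):
    """Standard error of an accuracy estimate from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


published_acc = 0.886   # reference accuracy quoted in the PR
observed_acc = 0.88     # 100-sample winml eval result
n_samples = 100
total_seconds = 189.05  # total eval time quoted in the PR

# Sampling noise expected for a 100-sample subset: about 0.032.
std_err = binomial_standard_error(published_acc, n_samples)
# Observed gap in units of standard error: about -0.19, well inside one sigma.
z_score = (observed_acc - published_acc) / std_err

# Latency and throughput follow directly from the total runtime.
latency_s = total_seconds / n_samples   # about 1.89 s/sample
throughput = n_samples / total_seconds  # about 0.53 samples/sec

# Assumed full-split size (not from the PR): 9,815 validation_matched examples.
FULL_SPLIT_SIZE = 9815
full_run_hours = FULL_SPLIT_SIZE * latency_s / 3600.0  # roughly 5 hours
```

Under these assumptions the acceptance band [0.85, 0.91] spans roughly one standard error on either side of the published figure, which is consistent with treating a 0.88 result as a PASS.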