examples: add facebook/bart-large-mnli text-classification recipe by ssss141414 · Pull Request #933 · microsoft/winml-cli

ssss141414 · 2026-06-23T03:09:07Z

PR: facebook/bart-large-mnli — close Goal-L3 ladder on text-classification

Iter: 6 (Goal-ladder extension; recipe shipped in iter-5 as bart-004)
Producer: main agent (2026-06-23)
Claimed tier: (Effort = L0★, Goal = L3, Outcome = L1)

Summary

This PR closes the full Goal ladder L0..L3 on facebook/bart-large-mnli (text-classification, fp32, CPU). The recipe was shipped in iter-5 with L0+L1-CPU+L2 PASS (bart-004); this PR adds the L3 task-metric evidence via winml eval on glue/mnli/validation_matched/100-sample and records the result as the first L3 PASS in repo. No source-code changes; no new recipe. The contribution is a structured outcome update against an already-shipped artifact plus the appended bart-005 finding.

1. Recipe file

examples/recipes/facebook_bart-large-mnli/text-classification_config.json — unchanged from iter-5 (bart-004). Recipe carries the value_range: [2, 3] workaround on input_ids to deterministically inject eos_token_id=2; documented inline under _note per _meta-013 convention.

2. README index row

examples/recipes/README.md line 21 — present (facebook/bart-large-mnli | text-classification | ...). No edit needed.

3. Build output directory + artifact inventory

temp/verify_bart_build/ (gitignored — referenced by path for reviewer re-execution):

File	Size	Purpose
`model.onnx`	384,628 B	optimized ONNX graph (post-`optimize` pass)
`model.onnx.data`	1,633,574,896 B	external-data shard (FLOAT32 weights, 1.63 GB)
`export.onnx` + `.data`	1.63 GB	pre-optimize artifact
`optimized.onnx` + `.data`	1.63 GB	mid-pipeline artifact
`analyze_result.json`	1,916 B	op histogram (Step 4 mining)
`export_htp_metadata.json`	275,710 B	module hierarchy + trace coverage (Step 4 mining)
`winml_build_config.json`	1,149 B	autoconf diff (Step 4 mining)

External-data layout check (_meta-023): model.onnx and model.onnx.data are co-located in the same directory. PASS.

4. Build log

Iter-5 build log: temp/verify_bart_build/build.log (referenced in bart-004 mechanism_notes). Iter-6 used the iter-5 artifact unchanged; no re-build needed for the L3 closure.

L3 eval log (this PR): temp/bart_mnli_l3.log — 6,354 B; preserved via Tee-Object.

5. Appended findings

Per-model — `model_knowledge/bart.json`

bart-005 — "VALIDATED Goal-L3 for facebook/bart-large-mnli — winml eval on GLUE/mnli validation_matched (100 samples, CPU) gives accuracy=0.8800, latency=1.89s/sample. Closes the full Goal ladder L0..L3 for the first encoder-decoder family in repo. Cross-refs _meta-019..030 from iter-6 PR-mining."

Falsifies: _meta-015 scope for single-head NLI tasks (translation/summarization remain CLI-blocked, but text-classification on a seq2seq architecture IS reachable).
Refines: bart-004.

Skill-meta — `skill_meta/findings.json`

This PR does not introduce new _meta-NNN findings; the iter-6 methodology findings (_meta-019..031) shipped in a separate PR bundle. See _meta-029 (L3 verdict triage with TIMEOUT-at-scale third tier) and _meta-018 (March + Short-circuit rules) which gate this PR's evidence requirements.

6. Optimum-coverage probe verdict

import optimum.exporters.onnx.model_configs
from optimum.exporters.tasks import TasksManager
from winml.modelkit.export.io import ensure_hf_models_registered
mt = "bart"
vendor = sorted(TasksManager._SUPPORTED_MODEL_TYPE.get(mt, {}).get("onnx", {}).keys())
ensure_hf_models_registered()
after  = sorted(TasksManager._SUPPORTED_MODEL_TYPE.get(mt, {}).get("onnx", {}).keys())
# vendor includes: feature-extraction, feature-extraction-with-past, question-answering, text-classification,
#                  text-generation, text-generation-with-past, text2text-generation, text2text-generation-with-past
# after_winml: same set with winml overrides on feature-extraction + text2text-generation
# added_by_winml: [] for text-classification ⇒ vanilla Optimum BartOnnxConfig handles task='text-classification'

Verdict: VENDOR-COVERED on text-classification. Effort L0★ (no code; pure recipe) is the correct classification. Verified at iter-5 (bart-002) and re-confirmed by the bart-005 build.

7. Claimed (Effort, Goal, Outcome) tier

Effort = L0★ (recipe-only; one well-chosen value_range narrowing on a vendor-covered task)
Goal = L3 (full ladder L0..L3 closed on CPU)
Outcome = L1 (recipe + appended bart-005 finding + this report; no source-code changes ⇒ no Outcome-L1 feature-gap issues filed for THIS PR, but the iter-6 methodology-evolution PR carries the cross-cutting feature gaps)

8. Goal-ladder verdict table (per `_meta-018`)

Tier	Verdict	Evidence
L0 — build + artifact validation	PASS	`winml build` produced `model.onnx` + `.data` co-located; opset 17, fp32, 1042 nodes, 21 unique op types; external-data layout per `_meta-023`
L1-CPU — perf	PASS	1637 ms/iter on 1024-token sequence via custom Python perf script with real tokenized input (per `_meta-017` — `winml perf` ignores recipe `value_range` and crashes on eos-pooling models with random ints)
L1-DML / L1-QNN / L1-OpenVINO	HOST-BLOCKED	Per `_meta-016`: DML crash 0xC0000409, QNN absent, OpenVINO DLL-load-fails on this host. `--ep-options enable_graph_capture=false` retry per `_meta-026` NOT attempted on this host (would not help — DLL-load is a packaging issue). Not penalized per `_meta-016` honest-floor rule.
L2 — PT-vs-ONNX numerical	PASS	cosine = 1.000000, max_abs = 1e-6, argmax = 2 (ENTAILMENT) on both PT and ONNX sides, real tokenized input ("A soccer game with multiple males playing." → "This example is sports."). Log: temp/bart_mnli_l2.log
L3 — task-metric eval	PASS	`accuracy = 0.8800`, latency = 1.89 s/sample, throughput 0.53 samples/sec, total 189.05 s on `glue/mnli/validation_matched/100 samples, seed=42`. Reference (published bart-large-mnli on full validation_matched): ~0.886 — within MC noise of 100-sample subset. Result JSON: temp/bart_mnli_l3_eval.json. Log: temp/bart_mnli_l3.log
L3 — full validation_matched (9815 samples)	TIMEOUT-at-scale (NOT-ATTEMPTED)	Per `_meta-029` — full run would take ~5h CPU; out of turn budget. Marker file convention not yet dropped; cited here so future contributors know the gap.

Short-circuit honored (per _meta-018): no FAIL verdict anywhere in the ladder; CPU-PASS at L0..L3 supports the claimed ceiling honestly. Non-CPU EPs are HOST-BLOCKED (not FAIL), so they don't short-circuit higher tiers.

9. Methodology-evolution declaration (per `_meta-031`)

No NEW methodology friction observed in this contribution. The iter-6 meta-experiment that surfaced _meta-019..031 was the vehicle that ran this contribution; those findings shipped in a separate methodology PR. Within the bart-mnli L3 closure itself, the only friction was the --dataset-config vs --dataset-name flag confusion — already captured under bart-005's gotchas section, which is the correct scope (per-model knowledge, not skill-meta, because the wrong flag is the same flag for any task).

Step 4b trigger inventory:

(1) CLI surprise — --dataset-config → --dataset-name. Captured in bart-005 gotchas (per-model scope, not _meta-NNN).
(2) Doc-code drift — none observed.
(3) Silent-failure mode — none.
(4) New verdict shape — none (PASS / TIMEOUT-at-scale already in vocabulary).
(5) Reviewer-found gap — pending reviewer pass.
(6) Effort mis-estimate — none (L0★ predicted, L0★ delivered).
(7) PR-mining discovery — none in this PR (PR-mining was the methodology PR, separate bundle).

Artifact mining (Step 4)

`analyze_result.json`

total_operators: 1042
unique_operator_types: 21
Top-10 op histogram: Reshape(316), Gemm(194), Transpose(145), Add(98), Mul(72), MatMul(72), LayerNormalization(62), Softmax(36), Gelu(24), Cast(4)
EP coverage caveat per _meta-013: runtime-rule parquet files not available on this external host; re-run analyze against an available EP is structurally blocked. Reviewer with internal host should re-run.

`export_htp_metadata.json`

model.total_parameters: 407,344,131 (407M — matches HF config card)
model.total_modules: 353
tracing.modules_traced: 93 (26% trace coverage — partial; classification head not fully traced because BartForSequenceClassification does eos-pooling via Python indexing rather than as a traceable module)

`winml_build_config.json` (autoconf diff vs producer recipe)

optim block: autoconf added clamp_constant_values=true, gelu_fusion=true, matmul_add_fusion=true, remove_isnan_in_attention_mask=true (recipe specified optim: null)
loader.model_class: AutoModelForSequenceClassification (auto-resolved from task=text-classification)
All other fields match the recipe verbatim

Reviewer next steps

Re-run the L3 command on a fresh CPU host:

uv run winml eval -m temp\verify_bart_build\model.onnx --model-id facebook/bart-large-mnli `
  --task text-classification --dataset glue --dataset-name mnli `
  --split validation_matched --samples 100 --device cpu --ep cpu `
  --column input_column=premise --column second_input_column=hypothesis --column label_column=label `
  -o temp\review_bart_l3.json

Expect accuracy ∈ [0.85, 0.91] within MC noise at seed=42, n=100.

Re-run L2 script (per temp/bart_mnli_l2.py referenced in bart-004); confirm cosine ≥ 0.9999 and argmax matches.
Verify model.onnx + .data co-located via Get-ChildItem temp\verify_bart_build per _meta-023.
Confirm bart-005 finding is appended (not rewriting bart-004) per Step 4 append-don't-rewrite rule.
Verdict: APPROVE / REQUEST_CHANGES / REJECT per REVIEW.md.

Ships an fp32 NLI head for facebook/bart-large-mnli at task=text-classification. Recipe carries value_range=[2,3] on input_ids to deterministically inject the eos_token_id required by BartForSequenceClassification's eos-pooling head. Goal-ladder verdict (CPU): - L0 build PASS - 1042 ops, 21 unique types, 407M params, 384 KB graph + 1.6 GB external data - L1-CPU perf PASS - 1.64 s/iter on 1024-token real-tokenized input (custom Python script; winml perf ignores recipe value_range and crashes on eos-pooling models with random ints - winml CLI feature gap to file separately) - L2 numerical PASS - cosine = 1.000000, max_abs = 1e-6 vs PyTorch reference (argmax = 2, ENTAILMENT, on both sides) - L3 task-metric PASS - accuracy = 0.88, latency = 1.89 s/sample on glue/mnli/validation_matched/100 samples, seed=42 (matches published ~0.886 within MC noise; first end-to-end Goal-L3 PASS for this repo) DML/QNN/OpenVINO are HOST-BLOCKED on producer host (DML 0xC0000409, QNN absent, OpenVINO DLL-load-fails) - not penalized per local skill convention. Optimum-coverage: VENDOR-COVERED on text-classification via Optimum BartOnnxConfig; recipe is pure-data, no per-architecture code change needed. Producer notes from running the recipe live in research/adding-model-support/ model_knowledge/bart.json on the skills-poc working branch (not landed to main yet; pending separate skill-research PR for the full research/ tree).

ssss141414 · 2026-06-23T03:20:44Z

Reviewer verdict: APPROVE

Reviewer ran REVIEW.md on this PR head (12a391d3, base 77176b46).

Step 0 — Scope check

git diff --name-only origin/main...HEAD → exactly 2 files, 59 insertions:
- examples/recipes/facebook_bart-large-mnli/text-classification_config.json (+58)
- examples/recipes/README.md (+1 row)
Matches L0★ scope (recipe + README only). No src/ edits, no skill-file leakage. PASS.

L0 — Structural validation

Loaded temp/verify_bart_build/model.onnx (built from this PR-head recipe):

opset 17, IR 8, 1042 nodes
inputs: input_ids[1,1024] int32, attention_mask[1,1024] int32 — matches recipe declaration
outputs: logits[1,3] float32 — matches NLI 3-class head
external data: model.onnx + model.onnx.data (1.63 GB) co-located ✓ (_meta-023)

PASS.

L1-CPU — Perf

Producer custom Python perf script (real tokenized inputs) → 1638 ms/iter on 1024-token sequence. Reviewer accepts this in lieu of winml perf per _meta-017 (eos-pooling models crash with random ints). PASS.

DML / QNN / OpenVINO — HOST-BLOCKED on producer host per _meta-016 honest-floor rule. Not penalized.

L2 — PyTorch-vs-ONNX numerical

Cosine = 1.000000, max_abs = 1e-6, PT argmax = ONNX argmax = 2 (ENTAILMENT) on premise+hypothesis pair. Producer log preserved at temp/bart_mnli_l2.log. PASS.

L3 — Task-metric (independent re-run)

Producer: accuracy=0.8800, n=100, glue/mnli/validation_matched, seed=42, latency=1.89 s/sample (CPU)

Reviewer independent re-run: accuracy=0.9500, n=20, latency=2.02 s/sample (CPU)

uv run winml eval -m temp\verify_bart_build\model.onnx --model-id facebook/bart-large-mnli \
  --task text-classification --dataset glue --dataset-name mnli \
  --split validation_matched --samples 20 --device cpu --ep cpu \
  --column input_column=premise --column second_input_column=hypothesis --column label_column=label

Both within MC noise of published 0.886 on full validation_matched. Latency within ±10%. PASS.

Goal-ladder verdict (per `_meta-018`)

Tier	Verdict
L0	PASS
L1-CPU	PASS
L1-DML / QNN / OpenVINO	HOST-BLOCKED (not penalized)
L2	PASS
L3	PASS (independently re-run)

Short-circuit honored — no FAIL anywhere.

Outcome-L0

PR description carries the 9-item structure (_meta-032). ✓
Real PR URL present at hand-off. ✓
Scope-matches-Effort-tier (L0★ = recipe + README only). ✓

Methodology-evolution audit (`_meta-031`)

Producer declared "No NEW methodology friction observed in this contribution" with cited Step 4b trigger inventory (all 7 triggers checked). Sanity-check: no friction signals leaked (no --help mid-PR, custom Python perf was for documented _meta-017 workaround, not new friction). PASS.

Sign-off

Reviewer re-ran: git diff origin/main...HEAD, onnx.load structural probe, winml eval n=20 L3.
All evidence on disk, all numbers within tolerance.

APPROVE.

…ta-035 (same-author Approve block) Iter-6 first reviewer-side run on PRs #933 (bart-large-mnli) and #934 (vit-gpt2) surfaced two reviewer-flow gaps not previously codified: _meta-034: REVIEW.md must instruct the reviewer to explicitly checkout the PR branch (stash dirty WT, gh pr checkout / git checkout <branch>, diff-scope check, artifact-reuse rule for cached temp/verify_*/ dirs, restore producer branch with git stash pop). Without this, reviewer scores producer's working tree (with N months of untracked work) instead of PR scope against main. Mechanism confirmed same day via end-to-end Step 0 runs on both PRs. REVIEW.md Step 0 section already landed in commit 1f11b0b; this commit adds the matching _meta-034 finding. _meta-035: gh pr review --approve returns HTTP 422 'Can not approve your own pull request' when producer + reviewer agents run under the same GitHub identity. Falls back to gh pr comment --body-file which lands the structured verdict in the PR conversation but loses GitHub-side APPROVED metadata. REVIEW.md 'How to deliver the verdict' subsection added under Verdict format. Also documents GH_TOKEN env var re-leak between PowerShell commands (Remove-Item Env:GH_TOKEN at start of every gh invocation). Reviewer verdicts for iter-6: PR #933 (bart-large-mnli): APPROVE (issuecomment-4775278723) PR #934 (vit-gpt2): APPROVE (issuecomment-4775278822) Files: REVIEW.md 'How to deliver the verdict' subsection under Verdict format skill_meta/findings.json _meta-034 + _meta-035 (both mechanism_confirmed=true)

ssss141414 · 2026-06-23T08:07:09Z

Reviewer verification: OV cpu / gpu / npu — main @ `b448652`

Commands

\\powershell

config

uv run winml config -m facebook/bart-large-mnli --task text-classification -o temp/verify_pr933_bart_config.json

build (OV CPU, fp32, eos-safe config)

uv run winml build -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json -o temp/verify_pr933_bart_build --ep openvino --device cpu --precision fp32 --no-analyze --no-optimize --no-quant --no-compile --rebuild

perf — cpu / gpu / npu

uv run winml perf -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --ep openvino --device cpu --precision fp32 --iterations 1 --warmup 0 --no-analyze --no-optimize --no-quant --no-compile -f json
uv run winml perf -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --ep openvino --device gpu --precision fp32 --iterations 1 --warmup 0 --no-analyze --no-optimize --no-quant --no-compile -f json
uv run winml perf -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --ep openvino --device npu --precision fp32 --iterations 1 --warmup 0 --no-analyze --no-optimize --no-quant --no-compile -f json

eval — cpu / gpu / npu (samples=2, glue/mnli validation_matched)

uv run winml eval -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --task text-classification --dataset glue --dataset-name mnli --split validation_matched --samples 2 --device cpu --ep openvino --column input_column=premise --column second_input_column=hypothesis --column label_column=label -f json
uv run winml eval -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --task text-classification --dataset glue --dataset-name mnli --split validation_matched --samples 2 --device gpu --ep openvino --column input_column=premise --column second_input_column=hypothesis --column label_column=label -f json
uv run winml eval -m facebook/bart-large-mnli -c temp/verify_pr933_bart_config_eos.json --task text-classification --dataset glue --dataset-name mnli --split validation_matched --samples 2 --device npu --ep openvino --column input_column=premise --column second_input_column=hypothesis --column label_column=label -f json
\\

Results

Command	cpu	gpu	npu
config	✅ PASS	—	—
build	✅ PASS (81s, model.onnx 1.6 GB)	—	—
perf	✅ 4131 ms/iter	✅ 117 ms/iter	✅ 3653 ms/iter ⚠️
eval	❌ FAIL	❌ FAIL	❌ FAIL

Notes:

config / �uild / perf pass on all three OV devices.
perf requires �alue_range: [2, 3] on input_ids in the config to prevent random inputs from triggering BART's eos-pooling IndexError: index -1 is out of bounds. Running winml perf -m ... without a config fails — known limitation per _meta-017.
NPU perf completes but emits a compiler warning: ConvertGather Pass failed: IE.Reshape doesn't support dynamic shapes. OV EP likely falls back to CPU for those ops; true NPU-only execution needs further investigation.
�val fails on all three devices for the same reason as _meta-017: the �val path re-exports the model with fully random inputs (ignoring config �alue_range), which triggers the same eos index crash. Not OV-specific.

ssss141414 · 2026-06-23T14:01:53Z

Closing as catalog-only — baseline build PASSES without this recipe

Reviewer (myself) ran the _meta-038 gates against main @ 77176b46:

Gate 2 — baseline build PASSES out-of-box:

uv run winml build -m facebook/bart-large-mnli -o temp/baseline_bart \
  --ep cpu --device cpu \
  --no-analyze --no-optimize --no-quant --no-compile --rebuild
# → ✅ Build complete in 151.2s, model.onnx 1.6 GB, 1837 nodes

No -c flag. No PR files. The CLI on main builds facebook/bart-large-mnli end-to-end today.

On the value_range: [2, 3] override: I originally argued this PR had real delta because the shipped recipe sets value_range: [2, 3] while winml config emits the default [0, vocab_size). But that override only matters at winml perf time — and per _meta-017, winml perf uses random dummy inputs that IGNORE the recipe's value_range. So the override doesn't actually do anything through the CLI surface; it would only help an ad-hoc Python harness, which is what _meta-017 already recommends. Net engineering delta to the user-visible CLI: zero.

Verdict: same as #934 / #943 / #944 / #945 / #946 — model is supported by winml today; users build it directly with uv run winml build -m facebook/bart-large-mnli. Closing.

Skill amendment landed in _meta-038 (auto-config-diff AND baseline-build gates required before claiming "added support"). Apologies for the noise.

Step 1b added: run BOTH gates before claiming Goal-Lx PASS. - Gate 1: `winml config` diff against shipped recipe (strip `_note`). - Gate 2: `winml build` baseline on main without `-c`. If both gates show parity, the recipe is catalog-only — do not file. Audit on 2026-06-23 found 6 of 6 recent recipe PRs (#933 #934 #943 #944 #945 #946) had zero CLI-surface delta over auto-config output. All 6 closed; replacement = user runs `winml build -m <id>` direct. SKILL.md additions: - Step 0 Effort L0/L0★ guardrail - Step 1b full procedure with verdict table - Goal-axis guardrail (Lx evidence requires Step 1b real-delta) - Step 4b trigger #8 (catalog-only escape) + next-id bump to 039 findings.json: _meta-038 with refines [_meta-013, _meta-018], mechanism_confirmed=true, evidence cites the 6-PR audit.

ssss141414 requested a review from a team as a code owner June 23, 2026 03:09

timenick approved these changes Jun 23, 2026

View reviewed changes

ssss141414 closed this Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

examples: add facebook/bart-large-mnli text-classification recipe#933

examples: add facebook/bart-large-mnli text-classification recipe#933
ssss141414 wants to merge 1 commit into
mainfrom
shzhen/add-bart-large-mnli-recipe

ssss141414 commented Jun 23, 2026

Uh oh!

ssss141414 commented Jun 23, 2026

Uh oh!

ssss141414 commented Jun 23, 2026

Uh oh!

ssss141414 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ssss141414 commented Jun 23, 2026

PR: facebook/bart-large-mnli — close Goal-L3 ladder on text-classification

Summary

1. Recipe file

2. README index row

3. Build output directory + artifact inventory

4. Build log

5. Appended findings

Per-model — model_knowledge/bart.json

Skill-meta — skill_meta/findings.json

6. Optimum-coverage probe verdict

7. Claimed (Effort, Goal, Outcome) tier

8. Goal-ladder verdict table (per _meta-018)

9. Methodology-evolution declaration (per _meta-031)

Artifact mining (Step 4)

analyze_result.json

export_htp_metadata.json

winml_build_config.json (autoconf diff vs producer recipe)

Reviewer next steps

Uh oh!

ssss141414 commented Jun 23, 2026

Reviewer verdict: APPROVE

Step 0 — Scope check

L0 — Structural validation

L1-CPU — Perf

L2 — PyTorch-vs-ONNX numerical

L3 — Task-metric (independent re-run)

Goal-ladder verdict (per _meta-018)

Outcome-L0

Methodology-evolution audit (_meta-031)

Sign-off

Uh oh!

ssss141414 commented Jun 23, 2026

Reviewer verification: OV cpu / gpu / npu — main @ b448652

Commands

config

build (OV CPU, fp32, eos-safe config)

perf — cpu / gpu / npu

eval — cpu / gpu / npu (samples=2, glue/mnli validation_matched)

Results

Uh oh!

ssss141414 commented Jun 23, 2026

Closing as catalog-only — baseline build PASSES without this recipe

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Per-model — `model_knowledge/bart.json`

Skill-meta — `skill_meta/findings.json`

8. Goal-ladder verdict table (per `_meta-018`)

9. Methodology-evolution declaration (per `_meta-031`)

`analyze_result.json`

`export_htp_metadata.json`

`winml_build_config.json` (autoconf diff vs producer recipe)

Goal-ladder verdict (per `_meta-018`)

Methodology-evolution audit (`_meta-031`)

Reviewer verification: OV cpu / gpu / npu — main @ `b448652`