feat(skillopt): cross-family judge jury (#349) by jerryfane · Pull Request #512 · jerryfane/gitmoot

jerryfane · 2026-06-27T15:42:16Z

Generalizes the off-by-default single cross-family LLM judge (#483) into an N-diverse-family judge jury for promotion-boundary / pairwise comparisons, per the evaluator-hardening epic (#344).

What it does

Pure aggregator internal/skillopt/jury.go (EvaluateJury, in the style of the existing pure EvaluateAutoPromote/EvaluateCanaryRegression — pure, deterministic, total, no I/O): per-dimension median (robust to one outlier; correct for even N), majority vote (tie => fail-safe false), fail-closed minority-veto over configured safety dimensions, and a disagreement flag (per-dimension std > tau, or a non-unanimous "2:1" vote). Fully unit-tested.
N-diverse-family picker workflow.PickCrossFamilyJury: returns up to N reviewers from distinct model families, deduped by family (diversity > headcount, never padded with near-identical families). Fewer than N distinct families => runs as many as exist; size < 2 (or an unknown implementer family) => nil so the caller uses the single-judge path.
Wired into the existing runSkillOptABJudge seam: when jury_size >= 2 and >= 2 distinct families are available, each juror judges the same blind, shuffled A/B (serialized Deliver). Any erroring or unparseable juror is dropped (fail-safe) and the jury proceeds. The aggregated verdict is recorded under the canonical skillopt-ab-judge tag, and each juror's pick under the distinct skillopt-ab-juror source. The disagreement flag rides the eval_review_item metadata and routes to a human (feeds #345, "SkillOpt: capture judge↔human outcomes → optimize the judge prompt (per task-kind)"). Falls back to the single judge when < 2 families are available.
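The aggregation rules above (per-dimension median, majority vote with a fail-safe tie, disagreement flag) can be sketched roughly as follows. This is an illustration only, not the code in internal/skillopt/jury.go: the `JurorVerdict` and `JuryResult` types, the `aggregate` function name, and the use of population standard deviation are all assumptions, and the minority veto is left out for brevity.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// JurorVerdict is a hypothetical shape for one juror's output.
type JurorVerdict struct {
	Pass   bool               // the juror's overall vote
	Scores map[string]float64 // per-dimension scores
}

// JuryResult is a hypothetical aggregate; the real EvaluateJury may differ.
type JuryResult struct {
	Pass         bool
	Medians      map[string]float64
	Disagreement bool
}

// median copies its input, sorts the copy, and averages the two middle
// values when N is even.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// stddev is the population standard deviation (an assumption; the source
// only says "std").
func stddev(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var mean float64
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	var ss float64
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	return math.Sqrt(ss / float64(len(xs)))
}

// aggregate is pure: no I/O, and the result does not depend on map order.
func aggregate(vs []JurorVerdict, tau float64) JuryResult {
	res := JuryResult{Medians: map[string]float64{}}
	yes := 0
	byDim := map[string][]float64{}
	for _, v := range vs {
		if v.Pass {
			yes++
		}
		for dim, s := range v.Scores {
			byDim[dim] = append(byDim[dim], s)
		}
	}
	res.Pass = 2*yes > len(vs)                    // strict majority; a tie fails safe
	res.Disagreement = yes != 0 && yes != len(vs) // non-unanimous vote
	for dim, ss := range byDim {
		res.Medians[dim] = median(ss)
		if stddev(ss) > tau {
			res.Disagreement = true
		}
	}
	return res
}

func main() {
	jury := []JurorVerdict{
		{Pass: true, Scores: map[string]float64{"quality": 4}},
		{Pass: true, Scores: map[string]float64{"quality": 5}},
		{Pass: false, Scores: map[string]float64{"quality": 1}},
	}
	res := aggregate(jury, 1.0)
	fmt.Println(res.Pass, res.Medians["quality"], res.Disagreement) // true 4 true
}
```

In this example the outlier score of 1 does not move the median, while the 2:1 vote is enough on its own to raise the disagreement flag.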

Invariants

Off-by-default / byte-identical: with jury_size <= 1 (the default) behavior is exactly today's single cross-family judge — no extra judges picked, no behavior change. Covered by a test that fails if a jury runs by default.
Additive, no contract bump: contract_version=1 unchanged; jury/disagreement metadata rides existing eval metadata.
Manual promotion preserved: the jury is evidence-only — never promotes, never touches the bandit posterior (verified by a bandit-pull test).
Diversity + fail-safe: families deduped; graceful degradation below N (and below 2); a single failing juror is dropped and the jury proceeds; the eval is never aborted.
Blinding: candidate origin never leaks to any juror.
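As a rough illustration of the diversity and degradation rules, a family-deduplicating picker might look like the sketch below. The `Reviewer` type and `pickJury` function are hypothetical stand-ins for workflow.PickCrossFamilyJury, and excluding the implementer's own family is an assumption drawn from the "cross-family" naming rather than something the description states.

```go
package main

import "fmt"

// Reviewer is a hypothetical stand-in for the workflow package's reviewer type.
type Reviewer struct {
	Name   string
	Family string // model family; "" means unknown
}

// pickJury returns up to n reviewers, one per distinct family. It returns nil
// when the jury is switched off (n < 2) or the implementer family is unknown,
// so the caller can take the single-judge path.
func pickJury(candidates []Reviewer, implementerFamily string, n int) []Reviewer {
	if n < 2 || implementerFamily == "" {
		return nil
	}
	// Assumption: the implementer's own family is never eligible.
	seen := map[string]bool{implementerFamily: true}
	var jury []Reviewer
	for _, c := range candidates {
		if c.Family == "" || seen[c.Family] {
			continue // skip unknown or already-represented families; never pad
		}
		seen[c.Family] = true
		jury = append(jury, c)
		if len(jury) == n {
			break
		}
	}
	return jury
}

func main() {
	candidates := []Reviewer{
		{Name: "a", Family: "alpha"},
		{Name: "b", Family: "alpha"}, // same family as "a": deduped
		{Name: "c", Family: "beta"},
		{Name: "d", Family: "gamma"},
	}
	jury := pickJury(candidates, "gamma", 3)
	// Only two distinct non-implementer families exist, so the jury
	// degrades from the requested 3 to 2.
	for _, r := range jury {
		fmt.Println(r.Name, r.Family)
	}
	if len(jury) < 2 {
		fmt.Println("fall back to the single judge")
	}
}
```

The caller, not the picker, decides what to do with a jury smaller than two in this sketch; the real code may draw that line elsewhere.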

Config knobs (all off by default)

mode_b_jury_size (0/1 = off), mode_b_jury_veto_dimensions, mode_b_jury_veto_floor, mode_b_jury_disagreement_tau (+ a --jury-size flag). Added to SkillOptPolicy and the commented init.go stub.
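To make the "off by default" semantics concrete, here is a minimal sketch of how the four knobs might gate the jury path. The `JuryKnobs` struct, its field names and types, and the `useJury` helper are hypothetical; only the knob names in the comments come from this PR.

```go
package main

import "fmt"

// JuryKnobs is a hypothetical mirror of the four jury knobs. The zero value
// corresponds to the defaults, which leave the jury off.
type JuryKnobs struct {
	Size            int      // mode_b_jury_size; 0 or 1 means off
	VetoDimensions  []string // mode_b_jury_veto_dimensions
	VetoFloor       float64  // mode_b_jury_veto_floor
	DisagreementTau float64  // mode_b_jury_disagreement_tau
}

// useJury reports whether the jury path should run: the knob must ask for at
// least two jurors and at least two distinct families must be available.
func useJury(k JuryKnobs, distinctFamilies int) bool {
	return k.Size >= 2 && distinctFamilies >= 2
}

func main() {
	var defaults JuryKnobs
	fmt.Println(useJury(defaults, 3))           // false: off by default
	fmt.Println(useJury(JuryKnobs{Size: 3}, 1)) // false: too few families
	fmt.Println(useJury(JuryKnobs{Size: 3}, 2)) // true
}
```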

Tests

Pure aggregator (median odd/even, majority incl. tie, minority-veto, disagreement via std>tau and 2:1), picker (distinct-family dedupe, graceful degradation < N and < 2, size<2 off, unknown-implementer skip, registered-over-ephemeral), and CLI wiring (aggregate + per-juror rows, disagreement flag, per-juror failure fail-safe, single-judge fallback, off-by-default no-jury, config-knob enable, manual-promotion preserved).

Gate green: go build ./... && go vet ./... && go test ./... + -race on internal/cli, internal/skillopt, internal/workflow.

Closes #349

🤖 Generated with Claude Code

…ries (#349)

Generalize the off-by-default single cross-family LLM judge (#483) into an N-diverse-family judge JURY for promotion-boundary / pairwise decisions.

- Pure aggregator (internal/skillopt/jury.go, in the style of EvaluateAutoPromote): EvaluateJury computes per-dimension MEDIAN (robust to one outlier, even-N safe), MAJORITY vote (tie => fail-safe false), fail-closed MINORITY-VETO over configured safety dimensions, and a DISAGREEMENT flag (per-dimension std > tau, or a non-unanimous vote). Pure, deterministic, total, no I/O. Fully unit-tested.
- N-diverse-family picker (workflow.PickCrossFamilyJury): up to N reviewers from DISTINCT families, deduped by family (diversity over headcount, never padded); graceful degradation when < N families; size < 2 / unknown implementer => nil so the caller takes the single-judge path.
- Wired into the existing runSkillOptABJudge seam: jury_size >= 2 AND >= 2 distinct families runs the jury (each juror judges the SAME blind A/B, serialized), drops any erroring/unparseable juror (fail-safe), records the aggregated verdict under the canonical skillopt-ab-judge tag + per-juror rows under skillopt-ab-juror, and stamps the disagreement flag onto eval_review_item metadata (no contract bump).
- Config knobs (off by default): mode_b_jury_size, mode_b_jury_veto_dimensions, mode_b_jury_veto_floor, mode_b_jury_disagreement_tau (+ --jury-size flag).

OFF-BY-DEFAULT / BYTE-IDENTICAL: with jury_size <= 1 behavior is exactly today's single cross-family judge.
ADDITIVE: contract_version=1 unchanged.
MANUAL PROMOTION preserved: the jury is evidence-only, never promotes, never touches the bandit.
BLINDING preserved: candidate origin never leaks to any juror.

Closes #349

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The fail-closed minority veto matched configured veto dimensions against judge dimension keys exactly/case-sensitively, so a config like "Safety" vs the lowercase rubric key "safety" would silently fail OPEN on a control documented as fail-closed. Normalize both sides (lower + trim) before matching; add a regression subtest that fails without the fix.

Caught by adversarial re-review of PR #512 (latent: the pairwise A/B path produces no DimensionScores today, but the config knobs ship now).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
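The normalization fix described in this commit can be illustrated with a small sketch. The `normalizeDim` and `vetoed` names are hypothetical, and treating a score strictly below the floor as a veto is an assumption about how mode_b_jury_veto_floor is interpreted.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDim lowercases and trims a dimension name so that "Safety " and
// "safety" compare equal.
func normalizeDim(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// vetoed reports whether any single juror scored any configured veto
// dimension below the floor (assumption: strictly below). Both the configured
// names and the jurors' dimension keys are normalized before matching.
func vetoed(jurorScores []map[string]float64, vetoDims []string, floor float64) bool {
	want := make(map[string]bool, len(vetoDims))
	for _, d := range vetoDims {
		want[normalizeDim(d)] = true
	}
	for _, scores := range jurorScores {
		for dim, s := range scores {
			if want[normalizeDim(dim)] && s < floor {
				return true
			}
		}
	}
	return false
}

func main() {
	jurors := []map[string]float64{
		{"safety": 5},
		{"safety": 1}, // a single low safety score should be enough to veto
	}
	// The config spells the dimension "Safety"; without normalization this
	// lookup would miss the lowercase key and the veto would fail open.
	fmt.Println(vetoed(jurors, []string{"Safety"}, 3)) // true
}
```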

jerryfane and others added 2 commits June 27, 2026 17:41

jerryfane merged commit 2b4b5f9 into main Jun 27, 2026
1 check passed

jerryfane deleted the feat/349-judge-jury branch June 27, 2026 16:45

jerryfane mentioned this pull request Jun 27, 2026

Epic: harden the SkillOpt evaluator — measure it, lean on hard verifiers, prefer pairwise #344

Open

jerryfane merged 2 commits into main from feat/349-judge-jury
