feat(skillopt): cross-family judge jury (#349) #512

Merged
jerryfane merged 2 commits into main from feat/349-judge-jury on Jun 27, 2026

@jerryfane

Owner

Generalizes the off-by-default single cross-family LLM judge (#483) into an N-diverse-family judge jury for promotion-boundary / pairwise comparisons, per the evaluator-hardening epic (#344).

What it does

  • Pure aggregator internal/skillopt/jury.go (EvaluateJury, in the style of the existing pure EvaluateAutoPromote/EvaluateCanaryRegression — pure, deterministic, total, no I/O): per-dimension median (robust to one outlier; correct for even N), majority vote (tie => fail-safe false), fail-closed minority-veto over configured safety dimensions, and a disagreement flag (per-dimension std > tau, or a non-unanimous "2:1" vote). Fully unit-tested.
  • N-diverse-family picker workflow.PickCrossFamilyJury: returns up to N reviewers from distinct model families, deduped by family (diversity > headcount, never padded with near-identical families). Fewer than N distinct families => runs as many as exist; size < 2 (or an unknown implementer family) => nil so the caller uses the single-judge path.
  • Wired into the existing runSkillOptABJudge seam: when jury_size >= 2 and >= 2 distinct families are available, each juror judges the same blind, shuffled A/B pair (serialized Deliver). Any erroring or unparseable juror is dropped (fail-safe) and the jury proceeds. The aggregated verdict is recorded under the canonical skillopt-ab-judge tag, and each juror's pick under the distinct skillopt-ab-juror source. The disagreement flag rides the eval_review_item metadata and routes to a human, feeding #345 (SkillOpt: capture judge↔human outcomes → optimize the judge prompt (per task-kind)). Falls back to the single judge when < 2 families are available.
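
For readers who want a concrete picture of the aggregation rules in the first bullet, here is a minimal sketch of median, majority vote, and the disagreement flag. It is not the code in internal/skillopt/jury.go: the `jurorVerdict` type, the `aggregate` signature, and the choice of population standard deviation are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// jurorVerdict is a hypothetical shape for one juror's output; the real
// types behind EvaluateJury may differ.
type jurorVerdict struct {
	PrefersCandidate bool               // the juror's pairwise pick
	Scores           map[string]float64 // per-dimension scores
}

// median returns the middle value, averaging the two middle values for even N.
// It sorts a copy so the caller's slice is left untouched.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// stddev returns the population standard deviation (an assumption; the PR
// only says "std").
func stddev(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var sq float64
	for _, x := range xs {
		sq += (x - mean) * (x - mean)
	}
	return math.Sqrt(sq / float64(len(xs)))
}

// aggregate computes per-dimension medians, a strict-majority vote (a tie
// yields false), and a disagreement flag that is set when any dimension's
// std exceeds tau or the vote is not unanimous.
func aggregate(vs []jurorVerdict, tau float64) (medians map[string]float64, pass, disagree bool) {
	byDim := map[string][]float64{}
	yes := 0
	for _, v := range vs {
		if v.PrefersCandidate {
			yes++
		}
		for dim, s := range v.Scores {
			byDim[dim] = append(byDim[dim], s)
		}
	}
	medians = make(map[string]float64, len(byDim))
	for dim, xs := range byDim {
		medians[dim] = median(xs)
		if stddev(xs) > tau {
			disagree = true
		}
	}
	pass = yes*2 > len(vs)
	if yes != 0 && yes != len(vs) {
		disagree = true
	}
	return medians, pass, disagree
}

func main() {
	vs := []jurorVerdict{
		{PrefersCandidate: true, Scores: map[string]float64{"quality": 4}},
		{PrefersCandidate: true, Scores: map[string]float64{"quality": 5}},
		{PrefersCandidate: false, Scores: map[string]float64{"quality": 1}},
	}
	m, pass, disagree := aggregate(vs, 1.0)
	fmt.Println(m["quality"], pass, disagree) // prints: 4 true true
}
```

The one outlier score (1) does not move the median, while the 2:1 vote still raises the disagreement flag so the item can be routed to a human.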

Invariants

  • Off-by-default / byte-identical: with jury_size <= 1 (the default) behavior is exactly today's single cross-family judge — no extra judges picked, no behavior change. Covered by a test that fails if a jury runs by default.
  • Additive, no contract bump: contract_version=1 unchanged; jury/disagreement metadata rides existing eval metadata.
  • Manual promotion preserved: the jury is evidence-only — never promotes, never touches the bandit posterior (verified by a bandit-pull test).
  • Diversity + fail-safe: families deduped; graceful degradation below N (and below 2); a single failing juror is dropped and the jury proceeds; the eval is never aborted.
  • Blinding: candidate origin never leaks to any juror.

Config knobs (all off by default)

mode_b_jury_size (0/1 = off), mode_b_jury_veto_dimensions, mode_b_jury_veto_floor, mode_b_jury_disagreement_tau (+ a --jury-size flag). Added to SkillOptPolicy and the commented init.go stub.
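
To illustrate how "off by default" falls out of these knobs, here is a hedged sketch of a policy struct whose zero value disables the jury. The struct name `juryPolicy`, its field names, and the `Enabled` method are hypothetical; only the four knob names in the comments come from this PR, and the actual SkillOptPolicy fields may be shaped differently.

```go
package main

import "fmt"

// juryPolicy is a hypothetical mirror of the four jury knobs. The zero value
// of every field corresponds to "off".
type juryPolicy struct {
	JurySize        int      // mode_b_jury_size: 0 or 1 means single-judge path
	VetoDimensions  []string // mode_b_jury_veto_dimensions
	VetoFloor       float64  // mode_b_jury_veto_floor
	DisagreementTau float64  // mode_b_jury_disagreement_tau
}

// Enabled reports whether the jury path should even be attempted.
func (p juryPolicy) Enabled() bool {
	return p.JurySize >= 2
}

func main() {
	var def juryPolicy // zero value: the default configuration
	on := juryPolicy{JurySize: 3, VetoDimensions: []string{"safety"}, VetoFloor: 2, DisagreementTau: 1}
	fmt.Println(def.Enabled(), on.Enabled()) // prints: false true
}
```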

Tests

Pure aggregator (median odd/even, majority incl. tie, minority-veto, disagreement via std>tau and 2:1), picker (distinct-family dedupe, graceful degradation < N and < 2, size<2 off, unknown-implementer skip, registered-over-ephemeral), and CLI wiring (aggregate + per-juror rows, disagreement flag, per-juror failure fail-safe, single-judge fallback, off-by-default no-jury, config-knob enable, manual-promotion preserved).

Gate green: go build ./... && go vet ./... && go test ./... + -race on internal/cli, internal/skillopt, internal/workflow.

Closes #349

🤖 Generated with Claude Code

jerryfane and others added 2 commits June 27, 2026 17:41
…ries (#349)

Generalize the off-by-default single cross-family LLM judge (#483) into an
N-diverse-family judge JURY for promotion-boundary / pairwise decisions.

- Pure aggregator (internal/skillopt/jury.go, in the style of EvaluateAutoPromote):
  EvaluateJury computes per-dimension MEDIAN (robust to one outlier, even-N safe),
  MAJORITY vote (tie => fail-safe false), fail-closed MINORITY-VETO over configured
  safety dimensions, and a DISAGREEMENT flag (per-dimension std > tau, or a
  non-unanimous vote). Pure, deterministic, total, no I/O. Fully unit-tested.
- N-diverse-family picker (workflow.PickCrossFamilyJury): up to N reviewers from
  DISTINCT families, deduped by family (diversity over headcount, never padded);
  graceful degradation when < N families; size < 2 / unknown implementer => nil so
  the caller takes the single-judge path.
- Wired into the existing runSkillOptABJudge seam: jury_size >= 2 AND >= 2 distinct
  families runs the jury (each juror judges the SAME blind A/B, serialized), drops
  any erroring/unparseable juror (fail-safe), records the aggregated verdict under
  the canonical skillopt-ab-judge tag + per-juror rows under skillopt-ab-juror, and
  stamps the disagreement flag onto eval_review_item metadata (no contract bump).
- Config knobs (off by default): mode_b_jury_size, mode_b_jury_veto_dimensions,
  mode_b_jury_veto_floor, mode_b_jury_disagreement_tau (+ --jury-size flag).
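
The picker's dedupe-by-family behavior can be sketched as follows. This is not workflow.PickCrossFamilyJury itself: the `reviewer` type and the `pickJury` signature are hypothetical, and excluding the implementer's own family is an assumption drawn from the "cross-family" wording. The sketch returns however many distinct families it finds and leaves the ">= 2 families" check to the caller.

```go
package main

import "fmt"

// reviewer is a hypothetical record for one available judge.
type reviewer struct {
	Name   string
	Family string
}

// pickJury returns up to n reviewers, at most one per model family, skipping
// the implementer's family. It returns nil when n < 2 or the implementer's
// family is unknown, so the caller can take the single-judge path.
func pickJury(candidates []reviewer, implementerFamily string, n int) []reviewer {
	if n < 2 || implementerFamily == "" {
		return nil
	}
	seen := map[string]bool{implementerFamily: true}
	var jury []reviewer
	for _, c := range candidates {
		if c.Family == "" || seen[c.Family] {
			continue // never pad the jury with a family already represented
		}
		seen[c.Family] = true
		jury = append(jury, c)
		if len(jury) == n {
			break
		}
	}
	return jury
}

func main() {
	candidates := []reviewer{
		{Name: "r1", Family: "alpha"},
		{Name: "r2", Family: "alpha"}, // same family as r1, skipped
		{Name: "r3", Family: "beta"},
		{Name: "r4", Family: "impl"}, // implementer's family, skipped
	}
	// Asking for 5 jurors degrades gracefully to the distinct families available.
	for _, r := range pickJury(candidates, "impl", 5) {
		fmt.Println(r.Name, r.Family)
	}
}
```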

OFF-BY-DEFAULT / BYTE-IDENTICAL: with jury_size <= 1 behavior is exactly today's
single cross-family judge. ADDITIVE: contract_version=1 unchanged. MANUAL
PROMOTION preserved: the jury is evidence-only, never promotes, never touches the
bandit. BLINDING preserved: candidate origin never leaks to any juror.

Closes #349

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fail-closed minority veto matched configured veto dimensions against
judge dimension keys exactly/case-sensitively, so a config like "Safety"
vs the lowercase rubric key "safety" would silently fail OPEN on a control
documented as fail-closed. Normalize both sides (lower + trim) before
matching; add a regression subtest that fails without the fix.

Caught by adversarial re-review of PR #512 (latent: the pairwise A/B path
produces no DimensionScores today, but the config knobs ship now).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
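
The normalization fix described in the second commit can be illustrated with a small sketch. The function names, the "any single juror below the floor triggers the veto" rule, and the strict `<` comparison are assumptions for illustration; only the lower-case-and-trim matching on both sides is taken from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDim lower-cases and trims a dimension name so that "Safety",
// " safety " and "safety" all compare equal.
func normalizeDim(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// vetoed reports whether any juror scored a configured veto dimension below
// the floor. Both the configured names and the jurors' dimension keys are
// normalized before matching, so a casing mismatch cannot fail open.
func vetoed(jurorScores []map[string]float64, vetoDims []string, floor float64) bool {
	veto := make(map[string]bool, len(vetoDims))
	for _, d := range vetoDims {
		veto[normalizeDim(d)] = true
	}
	for _, scores := range jurorScores {
		for dim, s := range scores {
			if veto[normalizeDim(dim)] && s < floor {
				return true
			}
		}
	}
	return false
}

func main() {
	scores := []map[string]float64{
		{"safety": 5},
		{"safety": 1}, // a single low safety score
	}
	// The config says "Safety" while the rubric key is "safety".
	fmt.Println(vetoed(scores, []string{"Safety"}, 2)) // prints: true
}
```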
@jerryfane jerryfane merged commit 2b4b5f9 into main Jun 27, 2026
1 check passed
@jerryfane jerryfane deleted the feat/349-judge-jury branch June 27, 2026 16:45

Development

Successfully merging this pull request may close these issues.

SkillOpt: judge jury via the review-panel recipe (diverse-judge ensemble)
