feat(skillopt): cross-family judge jury (#349) #512
Merged
…ries (#349)

Generalize the off-by-default single cross-family LLM judge (#483) into an N-diverse-family judge JURY for promotion-boundary / pairwise decisions.

- Pure aggregator (internal/skillopt/jury.go, in the style of EvaluateAutoPromote): EvaluateJury computes per-dimension MEDIAN (robust to one outlier, even-N safe), MAJORITY vote (tie => fail-safe false), fail-closed MINORITY-VETO over configured safety dimensions, and a DISAGREEMENT flag (per-dimension std > tau, or a non-unanimous vote). Pure, deterministic, total, no I/O. Fully unit-tested.
- N-diverse-family picker (workflow.PickCrossFamilyJury): up to N reviewers from DISTINCT families, deduped by family (diversity over headcount, never padded); graceful degradation when < N families; size < 2 / unknown implementer => nil so the caller takes the single-judge path.
- Wired into the existing runSkillOptABJudge seam: jury_size >= 2 AND >= 2 distinct families runs the jury (each juror judges the SAME blind A/B, serialized), drops any erroring/unparseable juror (fail-safe), records the aggregated verdict under the canonical skillopt-ab-judge tag + per-juror rows under skillopt-ab-juror, and stamps the disagreement flag onto eval_review_item metadata (no contract bump).
- Config knobs (off by default): mode_b_jury_size, mode_b_jury_veto_dimensions, mode_b_jury_veto_floor, mode_b_jury_disagreement_tau (+ --jury-size flag).

OFF-BY-DEFAULT / BYTE-IDENTICAL: with jury_size <= 1 behavior is exactly today's single cross-family judge.
ADDITIVE: contract_version=1 unchanged.
MANUAL PROMOTION preserved: the jury is evidence-only, never promotes, never touches the bandit.
BLINDING preserved: candidate origin never leaks to any juror.

Closes #349

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The fail-closed minority veto matched configured veto dimensions against judge dimension keys exactly/case-sensitively, so a config like "Safety" vs the lowercase rubric key "safety" would silently fail OPEN on a control documented as fail-closed.

Normalize both sides (lower + trim) before matching; add a regression subtest that fails without the fix.

Caught by adversarial re-review of PR #512 (latent: the pairwise A/B path produces no DimensionScores today, but the config knobs ship now).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
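The normalization fix described above can be illustrated with a small Go sketch. The helper names (`normalizeDim`, `vetoed`) and the veto rule itself (any single juror scoring a configured dimension strictly below the floor triggers the veto) are assumptions for illustration; the PR text does not spell out the exact comparison or how a missing dimension is treated.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDim lower-cases and trims a dimension name so a configured value
// such as " Safety " matches the rubric key "safety".
func normalizeDim(s string) string {
	return strings.ToLower(strings.TrimSpace(s))
}

// vetoed reports whether any juror scored a configured veto dimension below
// floor. jurors holds one per-dimension score map per juror. Hypothetical
// helper, not the PR's actual API.
func vetoed(jurors []map[string]float64, vetoDims []string, floor float64) bool {
	want := make(map[string]bool, len(vetoDims))
	for _, d := range vetoDims {
		want[normalizeDim(d)] = true
	}
	for _, scores := range jurors {
		for dim, score := range scores {
			// Normalize the judge-side key too, so both sides of the
			// comparison go through the same function.
			if want[normalizeDim(dim)] && score < floor {
				return true // one low score from a minority juror is enough
			}
		}
	}
	return false
}

func main() {
	jurors := []map[string]float64{
		{"safety": 0.2, "clarity": 0.9},
		{"safety": 0.9, "clarity": 0.8},
	}
	// The config spells the dimension with different case and padding.
	fmt.Println(vetoed(jurors, []string{" Safety "}, 0.5))
}
```

Without the normalization step, the lookup for `" Safety "` would never match `"safety"` and the function would return false, which is the silent fail-open behavior the commit fixes.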
Generalizes the off-by-default single cross-family LLM judge (#483) into an N-diverse-family judge jury for promotion-boundary / pairwise comparisons, per the evaluator-hardening epic (#344).
What it does
- `internal/skillopt/jury.go` (`EvaluateJury`, in the style of the existing pure `EvaluateAutoPromote` / `EvaluateCanaryRegression`: pure, deterministic, total, no I/O): per-dimension median (robust to one outlier; correct for even N), majority vote (tie => fail-safe `false`), fail-closed minority-veto over configured safety dimensions, and a disagreement flag (per-dimension std > tau, or a non-unanimous "2:1" vote). Fully unit-tested.
- `workflow.PickCrossFamilyJury`: returns up to N reviewers from distinct model families, deduped by family (diversity > headcount, never padded with near-identical families). Fewer than N distinct families => runs as many as exist; size < 2 (or an unknown implementer family) => `nil` so the caller uses the single-judge path.
- `runSkillOptABJudge` seam: when `jury_size >= 2` and >= 2 distinct families are available, each juror judges the same blind shuffled A/B (serialized Deliver), any erroring/unparseable juror is dropped (fail-safe) and the jury proceeds, the aggregated verdict is recorded under the canonical `skillopt-ab-judge` tag, each juror's pick under the distinct `skillopt-ab-juror` source, and the disagreement flag rides the `eval_review_item` metadata and routes to a human (feeds "SkillOpt: capture judge↔human outcomes → optimize the judge prompt (per task-kind)" #345). Falls back to the single judge when < 2 families.

Invariants
- With `jury_size <= 1` (the default), behavior is exactly today's single cross-family judge: no extra judges picked, no behavior change. Covered by a test that fails if a jury runs by default.
- `contract_version=1` unchanged; jury/disagreement metadata rides existing eval metadata.

Config knobs (all off by default)
`mode_b_jury_size` (0/1 = off), `mode_b_jury_veto_dimensions`, `mode_b_jury_veto_floor`, `mode_b_jury_disagreement_tau` (+ a `--jury-size` flag). Added to `SkillOptPolicy` and the commented `init.go` stub.

Tests
Pure aggregator (median odd/even, majority incl. tie, minority-veto, disagreement via std>tau and 2:1), picker (distinct-family dedupe, graceful degradation < N and < 2, size<2 off, unknown-implementer skip, registered-over-ephemeral), and CLI wiring (aggregate + per-juror rows, disagreement flag, per-juror failure fail-safe, single-judge fallback, off-by-default no-jury, config-knob enable, manual-promotion preserved).
Gate green: `go build ./... && go vet ./... && go test ./...` + `-race` on `internal/cli`, `internal/skillopt`, `internal/workflow`.

Closes #349
🤖 Generated with Claude Code