Part of #1155.
Objective
Fix an existing AgentV outlier behavior: agentv compare's pairwise mode currently exits based on mean delta (determineExitCode(comparison.summary.meanDelta) at apps/cli/src/commands/compare/index.ts:259,532). Industry convention (Jest, Betterer, Braintrust, Langfuse) is per-test regression detection — any test dropping below baseline - threshold fails CI. Mean aggregation can hide serious per-test drops behind unrelated improvements.
Additionally: add --update-baseline (short -u) so the regression-gating workflow has a write-back-on-success step, completing the loop against AgentV's existing sidecar <eval>.baseline.jsonl convention.
Design latitude
Breaking change: default exit logic
- Pairwise mode (two positional files, or single-manifest --baseline/--candidate): exit 1 if any per-test score in candidate drops below baseline_score - threshold. Remove the mean-delta exit path.
- Matrix / N-way mode (N targets in one manifest, no --baseline): unchanged, since per-test gating doesn't naturally apply when each test has N scores, one per target.
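A minimal sketch of the per-test exit rule described above, contrasted with the mean delta it replaces. The `TestScore` shape and the `findRegressions` / `pairwiseExitCode` names are hypothetical, chosen for illustration; they are not the actual types or helpers in apps/cli/src/commands/compare/index.ts. Tests missing from the candidate are skipped here, which is an assumption this issue does not settle.

```typescript
// Hypothetical record shape; the real result rows in AgentV may differ.
interface TestScore {
  testId: string;
  score: number;
}

/** Ids of tests whose candidate score fell below baseline score - threshold. */
function findRegressions(
  baseline: TestScore[],
  candidate: TestScore[],
  threshold: number,
): string[] {
  const candidateById = new Map<string, number>();
  for (const t of candidate) candidateById.set(t.testId, t.score);

  const regressed: string[] = [];
  for (const b of baseline) {
    const next = candidateById.get(b.testId);
    // Assumption: a test absent from the candidate is ignored rather than failed.
    if (next === undefined) continue;
    if (next < b.score - threshold) regressed.push(b.testId);
  }
  return regressed;
}

/** Pairwise exit code: 1 if any single test regressed, otherwise 0. */
function pairwiseExitCode(
  baseline: TestScore[],
  candidate: TestScore[],
  threshold: number,
): 0 | 1 {
  return findRegressions(baseline, candidate, threshold).length > 0 ? 1 : 0;
}

function mean(rows: TestScore[]): number {
  return rows.reduce((sum, r) => sum + r.score, 0) / rows.length;
}

// Toy eval: test "a" drops sharply while "b" and "c" improve.
const baselineRows: TestScore[] = [
  { testId: "a", score: 0.9 },
  { testId: "b", score: 0.5 },
  { testId: "c", score: 0.5 },
];
const candidateRows: TestScore[] = [
  { testId: "a", score: 0.5 },
  { testId: "b", score: 0.8 },
  { testId: "c", score: 0.8 },
];

// The mean delta is positive, so a mean-based gate would pass this run.
console.log(mean(candidateRows) - mean(baselineRows) > 0); // true
// The per-test rule flags "a" and fails the run.
console.log(findRegressions(baselineRows, candidateRows, 0.1)); // [ 'a' ]
console.log(pairwiseExitCode(baselineRows, candidateRows, 0.1)); // 1
```

The comparison is a strict "below", matching the wording above, so a score exactly at baseline - threshold does not count as a regression in this sketch.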
New flag: --update-baseline / -u
- Only meaningful in pairwise mode with a file-path first positional.
- On success (no regression): overwrite the first positional with the candidate's scores (the contents of the second positional).
- On failure: do not touch the baseline file.
- Error clearly if used in matrix mode or if the first positional is not a writable path.
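The four rules above could be sketched roughly as follows. Every name here (`UpdateRequest`, `BaselineIo`, `updateBaseline`) is hypothetical and only illustrates the ordering of checks; file access is injected so the logic can be exercised without touching disk. In a Node CLI, `isWritable` and `copyFile` would plausibly map onto `fs.accessSync(path, fs.constants.W_OK)` and `fs.copyFileSync(from, to)`.

```typescript
// Hypothetical types for illustration; not AgentV's actual internals.
type CompareMode = "pairwise" | "matrix";

interface BaselineIo {
  isWritable(path: string): boolean;
  copyFile(from: string, to: string): void;
}

interface UpdateRequest {
  mode: CompareMode;
  baselinePath: string; // first positional
  candidatePath: string; // second positional
  hasRegression: boolean; // outcome of the per-test comparison
}

type UpdateOutcome = "updated" | "skipped";

function updateBaseline(req: UpdateRequest, io: BaselineIo): UpdateOutcome {
  // Misuse is reported as an error regardless of the comparison result.
  if (req.mode !== "pairwise") {
    throw new Error("--update-baseline is only supported in pairwise mode");
  }
  if (!io.isWritable(req.baselinePath)) {
    throw new Error(`baseline is not a writable path: ${req.baselinePath}`);
  }
  // On failure the baseline file is left untouched.
  if (req.hasRegression) return "skipped";
  // On success the candidate's contents replace the baseline.
  io.copyFile(req.candidatePath, req.baselinePath);
  return "updated";
}

// In-memory stand-in for the file system, used only for this demonstration.
const files = new Map<string, string>([
  ["baseline.jsonl", "old scores"],
  ["candidate.jsonl", "new scores"],
]);
const memoryIo: BaselineIo = {
  isWritable: (path) => files.has(path),
  copyFile: (from, to) => {
    files.set(to, files.get(from) ?? "");
  },
};

const request: UpdateRequest = {
  mode: "pairwise",
  baselinePath: "baseline.jsonl",
  candidatePath: "candidate.jsonl",
  hasRegression: true,
};

console.log(updateBaseline(request, memoryIo)); // skipped
console.log(updateBaseline({ ...request, hasRegression: false }, memoryIo)); // updated
console.log(files.get("baseline.jsonl")); // new scores
```

Validating the mode and the path before looking at the comparison result means a misconfigured invocation fails the same way whether or not the run happened to regress.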
Resulting CLI
# Pairwise, per-test exit by default (behavior change from today):
agentv compare baseline.jsonl candidate.jsonl
# Regression-gating workflow against a sidecar baseline:
agentv compare evals/my-eval.baseline.jsonl runs/latest/my-eval.jsonl --update-baseline
No --ratchet, --snapshot, --per-test, or --baseline-file — positional ordering establishes baseline-vs-candidate; per-test is the default; --update-baseline is the only new flag.
Acceptance signals
- Pairwise mode catches an injected single-test regression on a toy eval (previously mean-delta would have let it through).
- Pairwise mode still exits 0 when all per-test deltas are within threshold.
- Matrix mode exit behavior unchanged.
- --update-baseline writes candidate scores to the first positional only on success.
- --update-baseline errors cleanly in matrix mode or with a non-writable first positional.
- Release notes / changelog call out the default behavior change for anyone relying on mean-delta exit semantics.
Non-goals
- Not a hosted dashboard or visualization layer.
- Not a remote baseline store — sidecar file + git remains the persistence model.
- Not auto-sidecar resolution (matching result files to their sidecar baselines in a directory) in this issue — follow-up if users need it.
Lineage
- Related: agentv eval --threshold (absolute suite gate). That is an absolute mean gate; this is per-test regression against a stored baseline. The maintainer note in #698 (feat(cli): --threshold flag for suite-level quality gates) explicitly distinguishes the two.