feat(core): per-test regression as default exit for agentv compare; add --update-baseline flag #1158

@christso

Part of #1155.

Objective

Fix an outlier behavior in AgentV: the pairwise mode of agentv compare currently derives its exit code from the mean delta (determineExitCode(comparison.summary.meanDelta) at apps/cli/src/commands/compare/index.ts:259,532). The industry convention (Jest, Betterer, Braintrust, Langfuse) is per-test regression detection: any test whose score drops below baseline - threshold fails CI. Mean aggregation can hide a serious drop on one test behind unrelated improvements on others.
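To make the masking effect concrete, here is a minimal sketch comparing the two gating strategies on a toy score set. The helper names (meanDelta, regressedTests), the test ids, and the scores are all hypothetical and are not taken from the AgentV codebase.

```typescript
// Hypothetical shape: test id -> score. Not AgentV's actual result schema.
type Scores = Record<string, number>;

// Mean of (candidate - baseline) over the tests present in both runs.
function meanDelta(baseline: Scores, candidate: Scores): number {
  const ids = Object.keys(baseline).filter((id) => id in candidate);
  if (ids.length === 0) return 0;
  const total = ids.reduce((sum, id) => sum + (candidate[id] - baseline[id]), 0);
  return total / ids.length;
}

// Ids of tests whose candidate score drops below baseline - threshold.
function regressedTests(baseline: Scores, candidate: Scores, threshold: number): string[] {
  return Object.keys(baseline).filter(
    (id) => id in candidate && candidate[id] < baseline[id] - threshold,
  );
}

// Toy data: one test collapses while two unrelated tests improve.
const toyBaseline: Scores = { "test-a": 0.9, "test-b": 0.5, "test-c": 0.5 };
const toyCandidate: Scores = { "test-a": 0.4, "test-b": 0.8, "test-c": 0.8 };

// The mean delta is positive, so a mean-based gate sees an overall improvement.
console.log(meanDelta(toyBaseline, toyCandidate) > 0);
// The per-test check still flags test-a.
console.log(regressedTests(toyBaseline, toyCandidate, 0.1));
```

In this toy run the drop of 0.5 on test-a is outweighed by two gains of 0.3, which is the situation the per-test default is meant to catch.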

Additionally: add --update-baseline (short -u) so the regression-gating workflow has a write-back-on-success step, completing the loop against AgentV's existing sidecar <eval>.baseline.jsonl convention.

Design latitude

Breaking change: default exit logic

  • Pairwise mode (two positional files, or single-manifest --baseline/--candidate): exit 1 if any per-test score in candidate drops below baseline_score - threshold. Remove the mean-delta exit path.
  • Matrix / N-way mode (N targets in one manifest, no --baseline): unchanged. Per-test regression detection doesn't naturally apply when each test has N scores, one per target.
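The pairwise rule above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the record fields test_id and score, and the functions parseScores and pairwiseExitCode, are hypothetical, and skipping tests that are missing from the candidate is a choice this issue does not specify.

```typescript
// Hypothetical record shape for one JSONL line; the real schema may differ.
interface ScoreRecord {
  test_id: string;
  score: number;
}

type Scores = Record<string, number>;

// Parse JSONL text (one JSON object per line) into a test id -> score map.
function parseScores(jsonl: string): Scores {
  const scores: Scores = {};
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // tolerate blank lines and a trailing newline
    const record = JSON.parse(trimmed) as ScoreRecord;
    scores[record.test_id] = record.score;
  }
  return scores;
}

// Exit 1 if any per-test candidate score drops below baseline - threshold.
function pairwiseExitCode(baseline: Scores, candidate: Scores, threshold: number): 0 | 1 {
  for (const testId of Object.keys(baseline)) {
    // Assumption: tests absent from the candidate are skipped, not failed.
    if (!(testId in candidate)) continue;
    if (candidate[testId] < baseline[testId] - threshold) return 1;
  }
  return 0;
}
```

Note the strict comparison: a candidate score exactly equal to baseline_score - threshold does not count as a regression, matching the "drops below" wording.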

New flag: --update-baseline / -u

  • Only meaningful in pairwise mode when the first positional argument is a file path.
  • On success (no regression): overwrite the first positional with the candidate's scores (the contents of the second positional).
  • On failure: do not touch the baseline file.
  • Error clearly if used in matrix mode or if the first positional is not a writable path.
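One way the write-back rules above might fit together is sketched below. Everything here is hypothetical: the FileOps interface, the updateBaseline function, and the error messages are invented for illustration, and validating before checking for regressions is an assumed ordering. File access is injected so the logic stays testable; in a real CLI the two operations would likely wrap Node's fs.accessSync and fs.copyFileSync.

```typescript
type CompareMode = "pairwise" | "matrix";

// Hypothetical abstraction over the filesystem, injected for testability.
interface FileOps {
  isWritable(path: string): boolean;
  copy(from: string, to: string): void;
}

interface UpdateBaselineRequest {
  mode: CompareMode;
  baselinePath: string; // first positional
  candidatePath: string; // second positional
  hasRegression: boolean; // result of the per-test check
}

type UpdateOutcome = "updated" | "skipped-regression";

function updateBaseline(req: UpdateBaselineRequest, ops: FileOps): UpdateOutcome {
  // Reject invalid usage up front, regardless of the comparison result.
  if (req.mode !== "pairwise") {
    throw new Error("--update-baseline is only supported in pairwise mode");
  }
  if (!ops.isWritable(req.baselinePath)) {
    throw new Error(`baseline is not a writable path: ${req.baselinePath}`);
  }
  // On failure, leave the baseline file untouched.
  if (req.hasRegression) return "skipped-regression";
  // On success, overwrite the baseline with the candidate's contents.
  ops.copy(req.candidatePath, req.baselinePath);
  return "updated";
}

// In-memory stand-in for the filesystem, used only for demonstration.
function memoryFileOps(files: Record<string, string>, readOnly: string[] = []): FileOps {
  return {
    isWritable: (path) => path in files && readOnly.indexOf(path) === -1,
    copy: (from, to) => {
      files[to] = files[from];
    },
  };
}
```

Copying the candidate file wholesale mirrors the description of overwriting the first positional with the contents of the second; a real implementation might instead re-serialize only the scores.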

Resulting CLI

# Pairwise, per-test exit by default (behavior change from today):
agentv compare baseline.jsonl candidate.jsonl

# Regression-gating workflow against a sidecar baseline:
agentv compare evals/my-eval.baseline.jsonl runs/latest/my-eval.jsonl --update-baseline

This adds no --ratchet, --snapshot, --per-test, or --baseline-file flags: positional ordering establishes baseline versus candidate, per-test is the default, and --update-baseline is the only new flag.

Acceptance signals

  • Pairwise mode catches an injected single-test regression on a toy eval (previously mean-delta would have let it through).
  • Pairwise mode still exits 0 when all per-test deltas are within threshold.
  • Matrix mode exit behavior unchanged.
  • --update-baseline writes candidate scores to the first positional only on success.
  • --update-baseline errors cleanly in matrix mode or with non-writable first positional.
  • Release notes / changelog call out the default behavior change for anyone relying on mean-delta exit semantics.

Non-goals

  • Not a hosted dashboard or visualization layer.
  • Not a remote baseline store — sidecar file + git remains the persistence model.
  • Not auto-sidecar resolution (matching result files to their sidecar baselines in a directory) in this issue — follow-up if users need it.

Lineage

    Labels

    core (Anything pertaining to core functionality of AgentV), enhancement (New feature or request)

    Projects

    Status

    Backlog
