feat(core): per-test regression as default exit for agentv compare; add --update-baseline flag

Part of #1155.

## Objective

`agentv compare` is currently an outlier: its pairwise mode exits based on **mean delta** (`determineExitCode(comparison.summary.meanDelta)` at `apps/cli/src/commands/compare/index.ts:259,532`). The convention in comparable tools (Jest, Betterer, Braintrust, Langfuse) is **per-test regression detection**: any test that drops below `baseline - threshold` fails CI. Mean aggregation can hide a serious per-test drop behind unrelated improvements.

Additionally: add `--update-baseline` (short `-u`) so the regression-gating workflow has a write-back-on-success step, completing the loop against AgentV's existing sidecar `<eval>.baseline.jsonl` convention.

## Design latitude

### Breaking change: default exit logic

- **Pairwise mode** (two positional files, or single-manifest `--baseline`/`--candidate`): exit 1 if any per-test score in candidate drops below `baseline_score - threshold`. Remove the mean-delta exit path.
- **Matrix / N-way mode** (N targets in one manifest, no `--baseline`): unchanged. A per-test regression check doesn't naturally apply when each test has N scores, one per target, with no designated baseline.
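
A minimal sketch of the pairwise rule above, not the actual implementation. The `TestScore` shape and the `findRegressions` and `pairwiseExitCode` names are hypothetical, and skipping tests that are missing from the candidate is an assumption this issue does not settle.

```typescript
// Hypothetical record shape; AgentV's real result schema may differ.
interface TestScore {
  testId: string;
  score: number;
}

interface Regression {
  testId: string;
  baseline: number;
  candidate: number;
}

// Returns every test whose candidate score falls below `baseline - threshold`.
function findRegressions(
  baseline: TestScore[],
  candidate: TestScore[],
  threshold: number,
): Regression[] {
  const candidateById = new Map<string, number>();
  for (const row of candidate) {
    candidateById.set(row.testId, row.score);
  }

  const regressions: Regression[] = [];
  for (const row of baseline) {
    const candidateScore = candidateById.get(row.testId);
    // Assumption: tests absent from the candidate are skipped, not failed.
    if (candidateScore === undefined) continue;
    if (candidateScore < row.score - threshold) {
      regressions.push({
        testId: row.testId,
        baseline: row.score,
        candidate: candidateScore,
      });
    }
  }
  return regressions;
}

// Exit 1 if any single test regressed, 0 otherwise.
function pairwiseExitCode(regressions: Regression[]): 0 | 1 {
  return regressions.length > 0 ? 1 : 0;
}

function meanScore(rows: TestScore[]): number {
  return rows.reduce((sum, row) => sum + row.score, 0) / rows.length;
}

// Toy data: test "a" drops sharply while "b" and "c" improve.
const demoBaseline: TestScore[] = [
  { testId: "a", score: 0.9 },
  { testId: "b", score: 0.5 },
  { testId: "c", score: 0.5 },
];
const demoCandidate: TestScore[] = [
  { testId: "a", score: 0.4 },
  { testId: "b", score: 0.8 },
  { testId: "c", score: 0.8 },
];

// The mean delta is slightly positive, so a mean-based gate would pass.
const demoMeanDelta = meanScore(demoCandidate) - meanScore(demoBaseline);
// The per-test gate flags test "a" and fails the run.
const demoRegressions = findRegressions(demoBaseline, demoCandidate, 0.1);
console.log(demoMeanDelta > 0, demoRegressions, pairwiseExitCode(demoRegressions));
```

The toy data is the case the mean-delta exit path misses: the average improves while one test falls well past the threshold.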

### New flag: `--update-baseline` / `-u`

- Only meaningful in pairwise mode with a file-path first positional.
- On success (no regression): overwrite the first positional with the candidate's scores (the contents of the second positional).
- On failure: do not touch the baseline file.
- Error clearly if used in matrix mode or if the first positional is not a writable path.
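
The four rules above reduce to a small decision function, sketched here under stated assumptions. All names (`CompareMode`, `UpdateBaselineInput`, `decideBaselineUpdate`) and the error messages are hypothetical. The sketch only decides what to do; the writability check and the file write are left to the caller.

```typescript
// Hypothetical types; none of these names come from the AgentV codebase.
type CompareMode = "pairwise" | "matrix";

interface UpdateBaselineInput {
  mode: CompareMode;
  // Whether the first positional is a writable file path. A caller could
  // determine this with Node's fs.accessSync(path, fs.constants.W_OK).
  baselineIsWritablePath: boolean;
  // Result of the per-test regression check.
  hasRegression: boolean;
}

type UpdateBaselineDecision =
  | { action: "write" }
  | { action: "skip"; reason: string }
  | { action: "error"; message: string };

// Decides what `--update-baseline` should do; performs no I/O itself.
function decideBaselineUpdate(input: UpdateBaselineInput): UpdateBaselineDecision {
  if (input.mode === "matrix") {
    return {
      action: "error",
      message: "--update-baseline is only supported in pairwise mode",
    };
  }
  if (!input.baselineIsWritablePath) {
    return {
      action: "error",
      message: "--update-baseline requires a writable baseline file path",
    };
  }
  if (input.hasRegression) {
    // On failure the baseline file must stay untouched.
    return { action: "skip", reason: "regression detected; baseline not updated" };
  }
  // On success the caller overwrites the first positional with the
  // contents of the second, for example via fs.copyFileSync.
  return { action: "write" };
}

const demoDecision = decideBaselineUpdate({
  mode: "pairwise",
  baselineIsWritablePath: true,
  hasRegression: false,
});
console.log(demoDecision.action);
```

Checking mode and writability before the regression result means a misconfigured invocation errors out even when the comparison itself would have passed. That ordering is a design choice in this sketch, not something the issue specifies.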

### Resulting CLI

```bash
# Pairwise, per-test exit by default (behavior change from today):
agentv compare baseline.jsonl candidate.jsonl

# Regression-gating workflow against a sidecar baseline:
agentv compare evals/my-eval.baseline.jsonl runs/latest/my-eval.jsonl --update-baseline
```

No `--ratchet`, `--snapshot`, `--per-test`, or `--baseline-file` — positional ordering establishes baseline-vs-candidate; per-test is the default; `--update-baseline` is the only new flag.

## Acceptance signals

- Pairwise mode catches an injected single-test regression on a toy eval (previously mean-delta would have let it through).
- Pairwise mode still exits 0 when all per-test deltas are within `threshold`.
- Matrix mode exit behavior unchanged.
- `--update-baseline` writes candidate scores to the first positional only on success.
- `--update-baseline` errors cleanly in matrix mode or with a non-writable first positional.
- Release notes / changelog call out the default behavior change for anyone relying on mean-delta exit semantics.

## Non-goals

- Not a hosted dashboard or visualization layer.
- Not a remote baseline store — sidecar file + git remains the persistence model.
- Not automatic sidecar resolution (matching result files to their sidecar baselines in a directory); that can be a follow-up issue if users need it.

## Lineage

- Extends #381 (closed, N-way matrix comparison) — same command.
- Distinct from #698 (closed, `agentv eval --threshold` absolute suite gate) — that's an absolute mean gate; this is per-test regression vs a stored baseline. Maintainer note in #698 explicitly distinguishes the two.
