tracking: PR/issue-mined evals + per-test regression gating

## Objective

Make PR/issue history a first-class eval authoring source, fix AgentV's `compare` exit logic to match industry per-test regression convention, and give the sidecar baseline workflow read-tooling.

## Why

Internal research (see `agentevals-research` repo) identified that closing the loop between PRs → mined eval suites → regression-gated baseline → skill harvesting is a pattern worth adopting. AgentV already implements the runner layer (executors, evaluators, scoring, comparison, CI gating). The gaps are upstream (eval authoring from git history) and in comparison exit semantics (AgentV currently uses mean-delta; industry convention is per-test regression).

## Design latitude

- PR/issue mining goes in a skill (`plugins/agentv-dev/skills/agentv-eval-writer`), not a new CLI subcommand. Authoring is LLM-assisted decision work that fits the skill system.
- GitHub only via `gh` CLI. No GitLab/Linear/Bitbucket adapters in v1.
- Sidecar baseline convention (`<eval>.baseline.jsonl`) already exists — reuse it, don't invent a new format.
- `compare` changes are additive except for the default exit logic (mean → per-test). That's a breaking change — accepted per maintainer direction.

## Sub-issues

- [ ] #1156 — spike: sample merged PRs and classify eval-case yield *(blocks #1159)*
- [ ] #1157 — feat(core): optional provenance block in EVAL.yaml schema
- [ ] #1158 — feat(core): per-test regression as default exit for agentv compare; add --update-baseline
- [ ] #1159 — feat(skill): extend agentv-eval-writer with PR/issue generation sources *(blocked by #1156; depends on #1157)*
- [ ] #1160 — feat(cli): agentv history read command

## Sequencing

1. #1156 (gate) → yes/no on PR-mining direction.
2. #1157 + #1158 in parallel (independent).
3. #1159 (depends on #1157, unblocked by #1156).
4. #1160 (independent).

## Acceptance signals

Tracking closes when all sub-issues are closed or consciously deferred with a follow-up note.

## Non-goals

- Building a coding agent.
- GitLab/Bitbucket/Linear adapters.
- Hosted dashboard or baseline visualization UI.
- Remote baseline store (sidecar + git is the persistence model).
- Labeling UI for review-feedback ingestion.

## Related

- #1003 — tracking: optimization loop roadmap (bench automation, autoresearch, iteration state) — overlapping scope, not a parent.
- #698 (closed) — added `agentv eval --threshold` as absolute suite gate; explicitly distinct from per-test regression gating.
- #381 (closed) — added N-way matrix comparison in compare.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tracking: PR/issue-mined evals + per-test regression gating #1155

Objective

Why

Design latitude

Sub-issues

Sequencing

Acceptance signals

Non-goals

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

tracking: PR/issue-mined evals + per-test regression gating #1155

Description

Objective

Why

Design latitude

Sub-issues

Sequencing

Acceptance signals

Non-goals

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions