Skip to content

tracking: PR/issue-mined evals + per-test regression gating #1155

@christso

Description

@christso

Objective

Make PR/issue history a first-class eval authoring source, fix AgentV's compare exit logic to match industry per-test regression convention, and give the sidecar baseline workflow read-tooling.

Why

Internal research (see agentevals-research repo) identified that closing the loop between PRs → mined eval suites → regression-gated baseline → skill harvesting is a pattern worth adopting. AgentV already implements the runner layer (executors, evaluators, scoring, comparison, CI gating). The gaps are upstream (eval authoring from git history) and in comparison exit semantics (AgentV currently uses mean-delta; industry convention is per-test regression).

Design latitude

  • PR/issue mining goes in a skill (plugins/agentv-dev/skills/agentv-eval-writer), not a new CLI subcommand. Authoring is LLM-assisted decision work that fits the skill system.
  • GitHub only via gh CLI. No GitLab/Linear/Bitbucket adapters in v1.
  • Sidecar baseline convention (<eval>.baseline.jsonl) already exists — reuse it, don't invent a new format.
  • compare changes are additive except for the default exit logic (mean → per-test). That's a breaking change — accepted per maintainer direction.

Sub-issues

Sequencing

  1. spike: sample merged PRs in agentv and classify eval-case yield #1156 (gate) → yes/no on PR-mining direction.
  2. feat(core): optional provenance block in EVAL.yaml schema #1157 + feat(core): per-test regression as default exit for agentv compare; add --update-baseline flag #1158 in parallel (independent).
  3. feat(skill): extend agentv-eval-writer with PR/issue generation sources #1159 (depends on feat(core): optional provenance block in EVAL.yaml schema #1157, unblocked by spike: sample merged PRs in agentv and classify eval-case yield #1156).
  4. feat(cli): agentv history read command for per-test score timeline #1160 (independent).

Acceptance signals

Tracking closes when all sub-issues are closed or consciously deferred with a follow-up note.

Non-goals

  • Building a coding agent.
  • GitLab/Bitbucket/Linear adapters.
  • Hosted dashboard or baseline visualization UI.
  • Remote baseline store (sidecar + git is the persistence model).
  • Labeling UI for review-feedback ingestion.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    eval-writerWork on or enabling the agentv-eval-writer skillprojectTracking issue for a multi-issue initiativeroadmap

    Type

    No type

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions