Objective
Make PR/issue history a first-class eval authoring source, fix AgentV's compare exit logic to match industry per-test regression convention, and give the sidecar baseline workflow read-tooling.
Why
Internal research (see the agentevals-research repo) identified that closing the loop between PRs → mined eval suites → regression-gated baseline → skill harvesting is a pattern worth adopting. AgentV already implements the runner layer (executors, evaluators, scoring, comparison, CI gating). The gaps are upstream (eval authoring from git history) and in comparison exit semantics (AgentV currently gates on mean delta; the industry convention is per-test regression).
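To make the difference in exit semantics concrete, here is a minimal Python sketch. It is not AgentV's implementation: the score dictionaries, test names, and the tolerance parameter are hypothetical, and it only contrasts a mean-delta gate with a per-test regression gate on the same data.

```python
from statistics import mean

def mean_delta_regressed(baseline, candidate, tolerance=0.0):
    """Mean-delta gate: fails only if the average score over shared tests dropped."""
    shared = baseline.keys() & candidate.keys()
    if not shared:
        return False  # nothing to compare
    delta = mean(candidate[t] for t in shared) - mean(baseline[t] for t in shared)
    return delta < -tolerance

def per_test_regressions(baseline, candidate, tolerance=0.0):
    """Per-test gate: lists every shared test whose own score dropped."""
    return sorted(
        t for t in baseline.keys() & candidate.keys()
        if candidate[t] < baseline[t] - tolerance
    )

# Hypothetical scores: one test gets much worse while two others improve.
baseline = {"login": 0.9, "search": 0.6, "export": 0.6}
candidate = {"login": 0.5, "search": 0.9, "export": 0.9}

mean_gate_fails = mean_delta_regressed(baseline, candidate)
regressed = per_test_regressions(baseline, candidate)
```

On this data the mean rises, so the mean-delta gate passes, while the per-test gate reports the login test as regressed. Hiding a single-test regression behind improvements elsewhere is the case a per-test gate is meant to catch.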
Design latitude
PR/issue mining goes in a skill (plugins/agentv-dev/skills/agentv-eval-writer), not a new CLI subcommand. Authoring is LLM-assisted decision work that fits the skill system.
GitHub only via gh CLI. No GitLab/Linear/Bitbucket adapters in v1.
Sidecar baseline convention (<eval>.baseline.jsonl) already exists — reuse it, don't invent a new format.
Changes to compare are additive except for the default exit logic (mean → per-test). That's a breaking change, accepted per maintainer direction.
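As an illustration of what read-tooling over the sidecar baseline could look like, the sketch below parses JSONL text into a per-test score map. Only the <eval>.baseline.jsonl naming comes from the text above. The record fields (test_id, score) and the rule that <eval> is the eval file's stem are assumptions for illustration, not AgentV's documented schema.

```python
import json
from pathlib import Path

def baseline_path_for(eval_file):
    """Assumed mapping: evals/auth.yaml -> evals/auth.baseline.jsonl."""
    p = Path(eval_file)
    return p.with_name(p.stem + ".baseline.jsonl")

def parse_baseline(text):
    """Parse JSONL text into {test_id: score}; blank lines are skipped."""
    scores = {}
    for line in text.splitlines():
        if line.strip():
            record = json.loads(line)
            # "test_id" and "score" are hypothetical field names.
            scores[record["test_id"]] = record["score"]
    return scores

# Hypothetical sidecar contents: one JSON object per line.
sample = '{"test_id": "login", "score": 0.9}\n\n{"test_id": "search", "score": 0.6}\n'
scores = parse_baseline(sample)
```

Keeping the parser separate from path resolution means the same function can serve a file on disk or text piped in from elsewhere.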
Sub-issues
Sequencing
Acceptance signals
Tracking closes when all sub-issues are closed or consciously deferred with a follow-up note.
Non-goals
Related
agentv eval --threshold as absolute suite gate; explicitly distinct from per-test regression gating.
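The distinction can be sketched as follows. This assumes the absolute gate compares the suite's mean score against a fixed threshold, which the text above does not spell out. The function names and scores are hypothetical.

```python
from statistics import mean

def passes_suite_threshold(scores, threshold):
    """Absolute gate: needs no baseline, only the current run's scores."""
    return mean(scores.values()) >= threshold

def has_per_test_regression(baseline, scores):
    """Relative gate: any shared test scoring below its own baseline value."""
    return any(scores[t] < baseline[t] for t in baseline.keys() & scores.keys())

# Hypothetical run: the suite is comfortably above the bar, yet one test got worse.
baseline = {"login": 0.95, "search": 0.9}
current = {"login": 0.85, "search": 0.95}

suite_ok = passes_suite_threshold(current, threshold=0.8)
regressed = has_per_test_regression(baseline, current)
```

Here the run clears the absolute threshold and still contains a regression, so the two gates answer different questions and can disagree on the same run.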