feat(skill): extend agentv-eval-writer with PR/issue generation sources #1159

@christso

Description

Part of #1155. Blocked by #1156 (yield spike). Depends on #1157 (provenance schema) for provenance stamping.

Objective

Extend plugins/agentv-dev/skills/agentv-eval-writer/ to generate EVAL.yaml test cases from two new sources: merged pull requests and resolved issues. This makes PR/issue-driven eval authoring a first-class capability in AgentV's skill system without baking it into core.

Design latitude

Why a skill, not a CLI subcommand

PR/issue mining is authoring work: deciding which PRs are eval-worthy, phrasing the criteria line, choosing a grader style. That calls for LLM-assisted judgment, not a pure transform. This skill already owns the analogous "generate from chat transcripts" workflow, so keeping PR/issue generation adjacent preserves "how do I generate an eval from X?" as one mental model.

Scope

  • GitHub only, mined via the gh CLI. No GitLab/Bitbucket/Linear adapters in v1 — add them later only if someone asks.
  • Deterministic structural extraction in scripts/*.sh; LLM judgment (which PRs to include, how to phrase criteria) in skill prose.
  • Open sub-decision: move the existing inline chat-transcript section into references/from-chat-transcripts.md for progressive disclosure, or leave it inline. Moving is recommended for consistency with the new sources, but not required in this issue.

Proposed layout

plugins/agentv-dev/skills/agentv-eval-writer/
├── SKILL.md                              # Extend: add generation-source routing
├── references/
│   ├── config-schema.json                # existing
│   ├── custom-evaluators.md              # existing
│   ├── eval-schema.json                  # existing
│   ├── rubric-evaluator.md               # existing
│   ├── from-pull-requests.md             # NEW — gh pr view patterns, mapping PR fields to test case fields
│   ├── from-issues.md                    # NEW — gh issue view, linked-PR resolution
│   ├── grader-patterns-diff.md           # NEW — llm-grader with agent target for diff behavioral equivalence
│   └── provenance-block.md               # NEW — how to populate provenance metadata
└── scripts/                              # NEW directory
    ├── extract_pr.sh                     # gh pr view --json … → normalized JSON
    └── extract_issue.sh                  # gh issue view --json …
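
A minimal sketch of scripts/extract_pr.sh under the layout above. The gh and git invocations are real; the normalized output fields (diff, merge_base) and the first-parent approximation of the pre-PR state are assumptions, not settled design.

#!/usr/bin/env bash
# extract_pr.sh: deterministic structural extraction for one merged PR.
# Usage: extract_pr.sh <pr-number>   (run inside an up-to-date clone)
set -euo pipefail

pr_number="$1"

# Structured fields straight from the gh CLI.
meta=$(gh pr view "$pr_number" --json number,title,body,url,mergeCommit)

# The full diff, fetched separately (gh pr view has no diff field).
diff=$(gh pr diff "$pr_number")

# Pre-PR state: the merge commit's first parent. This holds for merge and
# squash strategies; rebase merges would need a different computation.
merge_commit=$(jq -r '.mergeCommit.oid' <<<"$meta")
merge_base=$(git rev-parse "${merge_commit}^1")

# Emit one normalized JSON object on stdout.
jq --arg diff "$diff" --arg merge_base "$merge_base" \
  '{number, title, body, url, diff: $diff, merge_base: $merge_base}' <<<"$meta"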

Test case mapping per mined PR

  • PR title → test id (slugified) and seed for criteria.
  • PR body + linked issue body → test input (for issue-mined cases, linked-PR resolution is sketched after this list).
  • PR merge-base commit → workspace.checkout.ref (agent starts from pre-PR state).
  • PR diff → test expected_output as inline multi-line string (YAML | block scalar; the schema is z.string() | MessageContent, no length cap — no fixture files needed).
  • Grader: llm-grader with target: claude-code. No max_steps — the agent target controls its own loop. Grader prompt compares {{file_changes}} (agent's output) against {{expected_output}} (PR's diff) for behavioral equivalence, not textual match.
  • provenance block stamped (depends on #1157, "feat(core): optional provenance block in EVAL.yaml schema").
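
The issue-mining path needs one step the PR path does not: resolving which PR(s) closed the issue. A companion sketch of scripts/extract_issue.sh; the gh issue view fields are real, while the GraphQL closedByPullRequestsReferences lookup is an assumption about how linked-PR resolution could work.

#!/usr/bin/env bash
# extract_issue.sh: structural extraction for one resolved issue.
# Usage: extract_issue.sh <issue-number>   (run inside a clone of the repo)
set -euo pipefail

issue_number="$1"

# Structured fields from the gh CLI.
meta=$(gh issue view "$issue_number" --json number,title,body,url)

# Resolve the PRs that closed this issue. gh fills the {owner}/{repo}
# placeholders from the current repo.
linked=$(gh api graphql -F owner='{owner}' -F name='{repo}' -F number="$issue_number" -f query='
  query($owner: String!, $name: String!, $number: Int!) {
    repository(owner: $owner, name: $name) {
      issue(number: $number) {
        closedByPullRequestsReferences(first: 5) {
          nodes { number url }
        }
      }
    }
  }' --jq '.data.repository.issue.closedByPullRequestsReferences.nodes')

# Emit one normalized JSON object on stdout.
jq --argjson linked "$linked" '. + {linked_prs: $linked}' <<<"$meta"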

Emitted test case shape

- id: pr-1234-add-retry-logic
  input: "Add exponential backoff retries to the HTTP client per issue #1200"
  expected_output: |
    diff --git a/src/http.ts b/src/http.ts
    @@ -12,6 +12,20 @@
    ...
  assertions:
    - type: llm-grader
      target: claude-code
      prompt: "file://judges/diff-behavioral-equivalence.md"
  provenance:
    source: pr
    url: https://github.com/owner/name/pull/1234
    commit: abc123def
    generated_by: "agentv-eval-writer@<version>"
    generated_at: 2026-04-23T10:00:00Z
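
For concreteness, a hypothetical sketch of the judge prompt referenced above (judges/diff-behavioral-equivalence.md). The {{file_changes}} and {{expected_output}} template variables come from the mapping above; the wording and pass criterion are assumptions.

You are judging whether an agent's changes are behaviorally equivalent
to a reference diff. Do not require a textual match.

Agent's changes:
{{file_changes}}

Reference diff (from the merged PR):
{{expected_output}}

Pass if the agent's changes would produce the same observable behavior
as the reference diff: same API surface, same error handling, same
control-flow outcomes. Cosmetic differences (naming, ordering, comments)
do not fail the case.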

Acceptance signals

  • Skill runs on a real repo (agentv itself works; use the yield from the #1156 spike, "sample merged PRs in agentv and classify eval-case yield", to pick specific PRs) and emits a syntactically valid EVAL.yaml that passes agentv eval lint (a smoke-check sketch follows this list).
  • Each generated test stamps a complete provenance block.
  • Running the emitted eval against any configured target completes without harness/wiring errors. Scoring quality is not an acceptance criterion for this issue — only that the pipeline runs.
  • references/from-pull-requests.md and references/from-issues.md document the mapping and gh-CLI commands clearly enough that an LLM driving the skill can follow them without extra guidance.
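
A minimal smoke-check sketch tying the signals together. agentv eval lint is the command named above; the PR number, eval path, and exact lint invocation are assumptions.

# 1. Deterministic extraction (hypothetical PR number).
./scripts/extract_pr.sh 1234 > /tmp/pr-1234.json

# 2. The skill authors EVAL.yaml from the normalized JSON
#    (LLM-driven; there is deliberately no CLI step here).

# 3. Lint the emitted eval. Acceptance is a clean pass plus a complete
#    provenance block, not scoring quality.
agentv eval lint path/to/EVAL.yaml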

Non-goals

  • Not a CLI subcommand.
  • Not LLM-summarizing PRs/issues at extraction time (keep structural extraction deterministic; reproducibility matters).
  • Not solving large-PR prompt-size truncation — observe behavior first, solve if it becomes a problem.
  • Not supporting arbitrary Git providers — GitHub only in v1.

References

Design derivation and full design-iteration history: agentevals-research repo (internal).
