feat: Structured execution trace capture for agent workflows #148

Description

@suuus

Summary

Add structured execution trace capture to git-ape agent and skill workflows, emitting semantic state transitions as machine-readable events. This is the foundation for trace-based validation, false-success detection, and auto-generated eval baselines (future work).

Why This Matters

The Problem

Git-ape orchestrates multi-stage deployment workflows with mandatory checkpoints (security gate, user confirmation, preflight validation). Today, the only record of what happened is:

  • Chat transcript (unstructured, model-dependent phrasing)
  • state.json (final deployment result — no journey, just destination)

We have no structural record of which stages the agent actually executed, in what order, or whether mandatory gates were genuinely evaluated vs. skipped.

The Research

Sharma, Mittal & Hu (2025) — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents" demonstrates that:

"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."

Their key insight: agents cannot reliably self-assess their own execution completeness. Structural trace validation, which checks that essential states were reached in order, outperformed agent self-report by a wide margin in their evaluation (100% vs. 82.2% accuracy).

This applies directly to git-ape: when the coding agent reports "deployment successful", we currently trust that claim. With trace capture, we can independently verify the agent actually passed through all mandatory checkpoints.

What This Unlocks (Future Issues)

Once we have structured traces, we can build:

  1. Dominator-tree validation skill — Learn essential vs. optional states from 3–5 golden runs, validate all future executions structurally. Ships as a customer-facing skill.
  2. False-success detection — Post-deployment check: "agent claimed success, but trace shows security gate was never evaluated." Integrates into azure-integration-tester.
  3. Auto-generated eval baselines: /skill-bench records golden traces and builds the dominator model automatically, with no manual grader authoring.
  4. Audit trail for compliance — Customers in regulated industries get a machine-readable proof that mandatory gates (security, policy, cost review) were executed.

Per the paper:

"We collect 3–5 passing execution traces where the agent successfully completes the task. [...] Our system automatically identifies that [optional states] appear in some traces but not others [and] extracts the dominator tree showing essential states."

Traces are the raw material for all of this.
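
As a rough illustration of how a handful of golden traces could separate essential from optional states, the sketch below takes the set intersection of states across passing runs. It is a deliberately simplified stand-in, not the paper's dominator-tree extraction: it ignores ordering entirely. The function name `split_states` and the sample traces are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def split_states(golden_traces: Iterable[List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (essential, optional) state sets from passing traces.

    A state counts as essential if it appears in every golden trace,
    and as optional if it appears in at least one trace but not all.
    """
    state_sets = [set(trace) for trace in golden_traces]
    if not state_sets:
        return set(), set()
    essential = set.intersection(*state_sets)
    optional = set.union(*state_sets) - essential
    return essential, optional


# Hypothetical golden runs: each list holds the "state" values of one trace.
golden = [
    ["requirements_gathered", "template_generated", "cost_estimated", "deployment_executed"],
    ["requirements_gathered", "template_generated", "deployment_executed"],
    ["requirements_gathered", "drift_checked", "template_generated", "deployment_executed"],
]
essential, optional = split_states(golden)
```

Here `cost_estimated` and `drift_checked` land in the optional set because each is absent from at least one passing run.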

What to Build

Trace Event Schema

Define a trace event format emitted at each significant state transition:

// .azure/deployments/<deployment-id>/trace.jsonl
{"ts":"2025-06-01T08:30:00Z","state":"requirements_gathered","stage":1,"action":"delegate","target":"azure-requirements-gatherer","meta":{"resources":["func","st"]}}
{"ts":"2025-06-01T08:31:12Z","state":"api_reference_lookup","stage":2,"action":"skill_invoke","target":"azure-rest-api-reference","meta":{"resource_types":["Microsoft.Web/sites","Microsoft.Storage/storageAccounts"]}}
{"ts":"2025-06-01T08:32:45Z","state":"template_generated","stage":2,"action":"complete","target":"azure-template-generator","meta":{"file":"template.json"}}
{"ts":"2025-06-01T08:33:10Z","state":"security_gate_evaluated","stage":2.5,"action":"skill_invoke","target":"azure-security-analyzer","meta":{"result":"passed"}}
{"ts":"2025-06-01T08:34:00Z","state":"user_confirmation","stage":3,"action":"checkpoint","meta":{"confirmed":true}}
{"ts":"2025-06-01T08:36:30Z","state":"deployment_executed","stage":3,"action":"complete","target":"azure-resource-deployer","meta":{"status":"Succeeded"}}
{"ts":"2025-06-01T08:37:15Z","state":"integration_tests_passed","stage":4,"action":"skill_invoke","target":"azure-integration-tester","meta":{"checks_passed":4,"checks_total":4}}
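
One way to read these events back in a typed form is sketched below. The field names come from the example above; which fields are required is an assumption (here `target` and `meta` are treated as optional because the `user_confirmation` event has no `target`), and `TraceEvent` and `parse_event` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Union


@dataclass
class TraceEvent:
    ts: str                         # ISO 8601 UTC timestamp
    state: str                      # semantic state ID, e.g. "template_generated"
    stage: Union[int, float, str]   # 1, 2.5, or a label such as "pre"
    action: str                     # e.g. "delegate", "complete", "skill_invoke", "checkpoint"
    target: Optional[str] = None    # subagent or skill name, when there is one
    meta: Dict[str, Any] = field(default_factory=dict)


# Assumption: these four keys must be present on every event.
REQUIRED_KEYS = ("ts", "state", "stage", "action")


def parse_event(line: str) -> TraceEvent:
    """Parse one trace.jsonl line, rejecting events that lack required keys."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"trace event missing keys: {missing}")
    return TraceEvent(
        ts=record["ts"],
        state=record["state"],
        stage=record["stage"],
        action=record["action"],
        target=record.get("target"),
        meta=record.get("meta", {}),
    )


sample = (
    '{"ts":"2025-06-01T08:34:00Z","state":"user_confirmation",'
    '"stage":3,"action":"checkpoint","meta":{"confirmed":true}}'
)
event = parse_event(sample)
```

A JSON Schema version of the same contract would carry the same required/optional split.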

State Vocabulary (Derived from Agent Workflow Stages)

These are the semantic states corresponding to the git-ape agent's documented workflow in git-ape.agent.md:

State ID                  Stage  Essential?  Description
------------------------  -----  ----------  ------------------------------------------------------
requirements_gathered     1      ✅          Requirements collection complete
api_reference_lookup      2      -           ARM schema verified via REST API reference
template_generated        2      ✅          ARM template created
security_gate_evaluated   2.5    ✅          Security analysis run, gate decision made
security_gate_passed      2.5    ✅          Gate result: passed or overridden
policy_assessed           2      -           Policy advisor run (advisory, not blocking)
cost_estimated            2      -           Cost estimator invoked
architecture_reviewed     2.75   -           WAF review (optional)
preflight_validated       2      ✅          What-if analysis completed
user_confirmation         3      ✅*         User confirmed deployment intent (*interactive only)
deployment_executed       3      ✅          ARM deployment completed
integration_tests_passed  4      ✅          Post-deploy health checks pass
drift_checked             pre    -           Pre-deployment drift check (optional)

Essential states (✅) map to dominator nodes in the paper's model — every valid deployment must pass through them.

Emission Points

Add trace emission calls at these locations:

  1. git-ape.agent.md — Document the trace contract in the agent instructions so the orchestrator emits events at stage boundaries
  2. Each subagent — Emit on entry (action: delegate) and completion (action: complete)
  3. Each skill invocation — Emit when a skill is called (action: skill_invoke) with result metadata
  4. Checkpoints — Emit at blocking gates (security gate, user confirmation)
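
For the code-driven emission points (for example a workflow step), an append-only helper along these lines could write the events. This is a sketch under assumptions: `emit_event` is a hypothetical name, the timestamp format mirrors the example trace, and the demo writes to a temporary directory rather than a real `.azure/deployments/<deployment-id>/` folder.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Union


def emit_event(
    trace_path: str,
    state: str,
    stage: Union[int, float, str],
    action: str,
    target: Optional[str] = None,
    meta: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Append one event to trace.jsonl and return the record that was written."""
    event: Dict[str, Any] = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "state": state,
        "stage": stage,
        "action": action,
    }
    if target is not None:
        event["target"] = target
    event["meta"] = meta or {}
    os.makedirs(os.path.dirname(trace_path) or ".", exist_ok=True)
    # Compact separators produce "state":"..." with no spaces, one JSON object per line.
    with open(trace_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, separators=(",", ":")) + "\n")
    return event


# Demo against a throwaway directory.
demo_dir = tempfile.mkdtemp()
trace_path = os.path.join(demo_dir, "trace.jsonl")
emit_event(trace_path, "security_gate_evaluated", 2.5, "skill_invoke",
           target="azure-security-analyzer", meta={"result": "passed"})
emit_event(trace_path, "user_confirmation", 3, "checkpoint", meta={"confirmed": True})
```

Opening the file in append mode keeps earlier events intact, so agent-written and workflow-written events can share one file.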

Implementation Approach

Option A: Agent-emitted (instruction-driven)
Add trace emission instructions to git-ape.agent.md and subagent .agent.md files. The agent writes to trace.jsonl at each checkpoint. Pros: no code changes. Cons: depends on LLM compliance.

Option B: Workflow-emitted (GitHub Actions)
Add trace emission steps in the git-ape-plan.yml and git-ape-deploy.yml workflows. Pros: deterministic. Cons: only captures CI stages, misses interactive-mode flow.

Option C: Hybrid (recommended)

  • Agent instructions mandate writing trace.jsonl during interactive/headless execution
  • Workflow steps append their own trace events (validation, deployment, integration tests)
  • A validate-trace post-step in the deploy workflow checks trace completeness before committing state.json

File Location

.azure/deployments/<deployment-id>/
├── template.json
├── parameters.json
├── metadata.json
├── state.json          # existing: final result
├── trace.jsonl         # NEW: execution trace
└── architecture.md

Validation (Minimal — This PR)

For this initial PR, add a lightweight post-deployment check in the deploy workflow:

# After deployment succeeds, verify the trace contains all essential states.
# user_confirmation is left out because it applies to interactive runs only.
ESSENTIAL="requirements_gathered template_generated security_gate_evaluated security_gate_passed preflight_validated deployment_executed integration_tests_passed"
for state in $ESSENTIAL; do
  # Tolerate optional whitespace around the colon in case an emitter pretty-prints.
  if ! grep -Eq "\"state\"[[:space:]]*:[[:space:]]*\"$state\"" trace.jsonl; then
    echo "⚠️ TRACE INCOMPLETE: missing essential state: $state"
  fi
done

This is the simplest possible version of the dominator-tree validation — a flat checklist. The full topological subsequence matching (paper Section 3.3) comes in a follow-up issue.
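
The same flat checklist can also be expressed by parsing each line as JSON instead of pattern-matching text, which avoids depending on key order or whitespace. The sketch below is one possible shape, not the workflow's actual step; `missing_essential_states` and the `interactive` flag are hypothetical, and the demo feeds in only the states that appear in the example trace from the schema section.

```python
import json
from typing import Iterable, List

ESSENTIAL_STATES = [
    "requirements_gathered",
    "template_generated",
    "security_gate_evaluated",
    "security_gate_passed",
    "preflight_validated",
    "deployment_executed",
    "integration_tests_passed",
]


def missing_essential_states(lines: Iterable[str], interactive: bool = False) -> List[str]:
    """Return essential states that never appear in the trace, in checklist order."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as a leading path comment
        if isinstance(record, dict):
            seen.add(record.get("state"))
    required = list(ESSENTIAL_STATES)
    if interactive:
        required.append("user_confirmation")  # essential for interactive runs only
    return [state for state in required if state not in seen]


# States present in the example trace shown under "Trace Event Schema".
example_states = [
    "requirements_gathered", "api_reference_lookup", "template_generated",
    "security_gate_evaluated", "user_confirmation", "deployment_executed",
    "integration_tests_passed",
]
example_lines = [json.dumps({"state": state}) for state in example_states]
missing = missing_essential_states(example_lines)
print(missing)  # prints ['security_gate_passed', 'preflight_validated']
```

Note that the example trace, as written, would be flagged by this checklist because it has no `security_gate_passed` or `preflight_validated` event; a committed fixture would presumably need both.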

Acceptance Criteria

  • Trace event schema defined (TypeScript interface or JSON Schema in .github/schemas/trace-event.schema.json)
  • git-ape.agent.md updated with trace emission instructions at each stage boundary
  • Subagent .agent.md files updated to emit entry/exit events
  • git-ape-deploy.exampleyml updated with trace validation post-step
  • At least one end-to-end example trace.jsonl committed as a fixture in .github/evals/
  • README or docs updated to describe the trace format

Non-Goals (This Issue)

  • Full dominator-tree extraction algorithm (future issue)
  • Customer-facing trace-validator skill (future issue)
  • LLM-based semantic equivalence for state matching (future issue)
  • Trace-based eval graders in waza (future issue)

References

  • Paper: arXiv:2605.03159 — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents" (Sharma, Mittal, Hu — UW/Microsoft, 2025)
  • Git-Ape agent workflow: .github/agents/git-ape.agent.md (Stages 1–4 + Security Gate)
  • Existing state artifact: .azure/deployments/<id>/state.json
  • Eval harness: .github/evals/README.md
