Summary
Add structured execution trace capture to git-ape agent and skill workflows, emitting semantic state transitions as machine-readable events. This is the foundation for trace-based validation, false-success detection, and auto-generated eval baselines (future work).
Why This Matters
The Problem
Git-ape orchestrates multi-stage deployment workflows with mandatory checkpoints (security gate, user confirmation, preflight validation). Today, the only records of what happened are:
- Chat transcript (unstructured, model-dependent phrasing)
- state.json (final deployment result: no journey, just destination)
We have no structural record of which stages the agent actually executed, in what order, or whether mandatory gates were genuinely evaluated vs. skipped.
The Research
Sharma, Mittal & Hu (2025) — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents" demonstrates that:
"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."
Their key insight: agents cannot reliably self-assess their own execution completeness. Structural trace validation — checking that essential states were hit in order — outperforms agent self-report by a wide margin (100% vs 82% accuracy).
This applies directly to git-ape: when the coding agent reports "deployment successful", we currently trust that claim. With trace capture, we can independently verify the agent actually passed through all mandatory checkpoints.
What This Unlocks (Future Issues)
Once we have structured traces, we can build:
- Dominator-tree validation skill: Learn essential vs. optional states from 3–5 golden runs, validate all future executions structurally. Ships as a customer-facing skill.
- False-success detection: Post-deployment check, e.g. "agent claimed success, but trace shows security gate was never evaluated." Integrates into azure-integration-tester.
- Auto-generated eval baselines: /skill-bench records golden traces → builds dominator model automatically, no manual grader authoring.
- Audit trail for compliance: Customers in regulated industries get machine-readable proof that mandatory gates (security, policy, cost review) were executed.
Per the paper:
"We collect 3–5 passing execution traces where the agent successfully completes the task. [...] Our system automatically identifies that [optional states] appear in some traces but not others [and] extracts the dominator tree showing essential states."
Traces are the raw material for all of this.
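As a rough illustration of the learning step quoted above, the sketch below treats any state that appears in every golden trace as a candidate essential state. This is plain set intersection, not the paper's dominator-tree extraction, and it assumes each trace line carries a compact "state":"<id>" field. The function name common_states is hypothetical.

```bash
#!/usr/bin/env bash
# Print the states that occur in every given trace file
# (candidate essential states). States seen in only some
# traces are treated as optional and are not printed.
common_states() {
  local total=$#
  local f
  for f in "$@"; do
    # One line per distinct state per trace file.
    grep -o '"state":"[^"]*"' "$f" | cut -d'"' -f4 | sort -u
  done | sort | uniq -c | awk -v n="$total" '$1 == n { print $2 }'
}
```

Called as `common_states run1.jsonl run2.jsonl run3.jsonl` (file names are placeholders), it prints one state ID per line in sorted order.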
What to Build
Trace Event Schema
Define a trace event format emitted at each significant state transition:
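One possible shape for such an event, sketched as a small helper that appends a single JSON line to trace.jsonl. Only the state and action fields are grounded in this issue; the ts, stage, and status fields, the TRACE_FILE variable, and the emit_trace name are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append one trace event as a single JSON line.
# Field names other than "state" and "action" are assumptions.
TRACE_FILE="${TRACE_FILE:-trace.jsonl}"

emit_trace() {
  local state="$1" stage="$2" action="$3" status="$4"
  # Compact JSON (no spaces) so a plain grep on "state":"<id>" matches.
  # Arguments are not JSON-escaped, so pass simple identifiers only.
  printf '{"ts":"%s","state":"%s","stage":"%s","action":"%s","status":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$state" "$stage" "$action" "$status" \
    >> "$TRACE_FILE"
}

# Example call:
#   emit_trace security_gate_evaluated 2.5 complete passed
```

The stage is kept as a string because the state vocabulary uses values such as 2.5 and pre.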
State Vocabulary (Derived from Agent Workflow Stages)
These are the semantic states corresponding to the git-ape agent's documented workflow in git-ape.agent.md:
| State ID | Stage | Essential? | Description |
|---|---|---|---|
| requirements_gathered | 1 | ✅ | Requirements collection complete |
| api_reference_lookup | 2 | ✅ | ARM schema verified via REST API reference |
| template_generated | 2 | ✅ | ARM template created |
| security_gate_evaluated | 2.5 | ✅ | Security analysis run, gate decision made |
| security_gate_passed | 2.5 | ✅ | Gate result: passed or overridden |
| policy_assessed | 2 | ❌ | Policy advisor run (advisory, not blocking) |
| cost_estimated | 2 | ❌ | Cost estimator invoked |
| architecture_reviewed | 2.75 | ❌ | WAF review (optional) |
| preflight_validated | 2 | ✅ | What-if analysis completed |
| user_confirmation | 3 | ✅* | User confirmed deployment intent (*interactive only) |
| deployment_executed | 3 | ✅ | ARM deployment completed |
| integration_tests_passed | 4 | ✅ | Post-deploy health checks pass |
| drift_checked | pre | ❌ | Pre-deployment drift check (optional) |
Essential states (✅) map to dominator nodes in the paper's model — every valid deployment must pass through them.
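Because user_confirmation is essential only in interactive runs, a validator needs a mode-dependent essential set. A minimal sketch, assuming a hypothetical mode argument with the values interactive and headless:

```bash
#!/usr/bin/env bash
# Print the essential state IDs from the table above, in workflow order.
# The "mode" argument and its values are assumptions for illustration.
essential_states() {
  local mode="${1:-headless}"
  local states="requirements_gathered api_reference_lookup template_generated security_gate_evaluated security_gate_passed preflight_validated"
  # user_confirmation is essential only when a human is in the loop.
  if [ "$mode" = "interactive" ]; then
    states="$states user_confirmation"
  fi
  echo "$states deployment_executed integration_tests_passed"
}
```

For example, `for s in $(essential_states interactive); do ...; done` iterates over nine states, while headless mode yields eight.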
Emission Points
Add trace emission calls at these locations:
- git-ape.agent.md: Document the trace contract in the agent instructions so the orchestrator emits events at stage boundaries
- Each subagent: Emit on entry (action: delegate) and completion (action: complete)
- Each skill invocation: Emit when a skill is called (action: skill_invoke) with result metadata
- Checkpoints: Emit at blocking gates (security gate, user confirmation)
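For emitters that run as scripts (for example, workflow steps), the entry/completion pairing above could be sketched as a wrapper around the delegated command. The with_trace name, the TRACE_FILE variable, and the status field are assumptions for illustration, not part of an existing git-ape API.

```bash
#!/usr/bin/env bash
TRACE_FILE="${TRACE_FILE:-trace.jsonl}"

# Run a command, recording a "delegate" event before it starts and a
# "complete" event after it finishes, then pass its exit code through.
with_trace() {
  local state="$1"; shift
  printf '{"state":"%s","action":"delegate"}\n' "$state" >> "$TRACE_FILE"
  local rc=0
  "$@" || rc=$?
  local outcome="ok"
  [ "$rc" -eq 0 ] || outcome="failed"
  printf '{"state":"%s","action":"complete","status":"%s"}\n' \
    "$state" "$outcome" >> "$TRACE_FILE"
  return "$rc"
}
```

One consequence of emitting on entry: a validator that only greps for the state ID would also match the delegate event of a step that later failed, so a stricter check would count only complete events with a successful status.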
Implementation Approach
Option A: Agent-emitted (instruction-driven)
Add trace emission instructions to git-ape.agent.md and subagent .agent.md files. The agent writes to trace.jsonl at each checkpoint. Pros: no code changes. Cons: depends on LLM compliance.
Option B: Workflow-emitted (GitHub Actions)
Add trace emission steps in the git-ape-plan.yml and git-ape-deploy.yml workflows. Pros: deterministic. Cons: only captures CI stages, misses interactive-mode flow.
Option C: Hybrid (recommended)
- Agent instructions mandate writing trace.jsonl during interactive/headless execution
- Workflow steps append their own trace events (validation, deployment, integration tests)
- A validate-trace post-step in the deploy workflow checks trace completeness before committing state.json
File Location
```
.azure/deployments/<deployment-id>/
├── template.json
├── parameters.json
├── metadata.json
├── state.json          # existing: final result
├── trace.jsonl         # NEW: execution trace
└── architecture.md
```
Validation (Minimal — This PR)
For this initial PR, add a lightweight post-deployment check in the deploy workflow:
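```bash
# After deployment succeeds, verify that the trace contains every essential state.
# user_confirmation is omitted because it applies to interactive runs only.
ESSENTIAL="requirements_gathered api_reference_lookup template_generated security_gate_evaluated security_gate_passed preflight_validated deployment_executed integration_tests_passed"
for state in $ESSENTIAL; do
  # Tolerate optional whitespace around the colon in the JSON line.
  if ! grep -Eq "\"state\"[[:space:]]*:[[:space:]]*\"$state\"" trace.jsonl; then
    echo "⚠️ TRACE INCOMPLETE: missing essential state: $state"
  fi
done
```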
This is the simplest possible version of the dominator-tree validation — a flat checklist. The full topological subsequence matching (paper Section 3.3) comes in a follow-up issue.
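A step between the flat checklist and full matching is to check that the essential states also appear in the expected relative order. The sketch below does a greedy subsequence match over a single linear order. It is a simplification for illustration, not the paper's Section 3.3 algorithm, and the states_in_order name is hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds if the expected states appear in the trace in the given
# relative order; other states may be interleaved between them.
# Usage: states_in_order trace.jsonl state1 state2 ...
states_in_order() {
  local trace="$1"; shift
  grep -o '"state":"[^"]*"' "$trace" | cut -d'"' -f4 |
    awk -v want="$*" '
      BEGIN { n = split(want, expected, " "); i = 1 }
      # Advance to the next expected state each time the current one is seen.
      i <= n && $0 == expected[i] { i++ }
      END { exit (i > n ? 0 : 1) }
    '
}
```

For instance, `states_in_order trace.jsonl security_gate_evaluated security_gate_passed deployment_executed` fails if the deployment event was recorded before the gate events.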
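Acceptance Criteria
- Trace event schema defined (.github/schemas/trace-event.schema.json)
- git-ape.agent.md updated with trace emission instructions at each stage boundary
- Subagent .agent.md files updated to emit entry/exit events
- git-ape-deploy.example.yml updated with trace validation post-step
- trace.jsonl committed as a fixture in .github/evals/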
Non-Goals (This Issue)
- Full dominator-tree extraction algorithm (future issue)
- Customer-facing trace-validator skill (future issue)
- LLM-based semantic equivalence for state matching (future issue)
- Trace-based eval graders in waza (future issue)
References
- Paper: arXiv:2605.03159 — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents" (Sharma, Mittal, Hu — UW/Microsoft, 2025)
- Git-Ape agent workflow: .github/agents/git-ape.agent.md (Stages 1–4 + Security Gate)
- Existing state artifact: .azure/deployments/<id>/state.json
- Eval harness: .github/evals/README.md