feat: Structured execution trace capture for agent workflows #148

## Summary

Add structured execution trace capture to git-ape agent and skill workflows, emitting semantic state transitions as machine-readable events. This is the **foundation** for trace-based validation, false-success detection, and auto-generated eval baselines (future work).

## Why This Matters

### The Problem

Git-ape orchestrates multi-stage deployment workflows with mandatory checkpoints (security gate, user confirmation, preflight validation). Today, the only records of what happened are:
- Chat transcript (unstructured, model-dependent phrasing)
- `state.json` (final deployment result — no journey, just destination)

We have **no structural record** of which stages the agent actually executed, in what order, or whether mandatory gates were genuinely evaluated vs. skipped.

### The Research

[Sharma, Mittal & Hu (2025) — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents"](https://arxiv.org/abs/2605.03159) demonstrates that:

> *"The CUA frequently misreported failures as successes, often due to timing out or misinterpreting its own state, achieving only 82.2% accuracy and 60.0% recall. In contrast, the dominator tree achieved perfect differentiation by focusing on whether essential milestones were actually reached rather than relying on the agent's internal assessment."*

Their key insight: **agents cannot reliably assess their own execution completeness**. Structural trace validation, which checks that essential states were hit in order, outperforms agent self-report by a wide margin (100% vs. 82.2% accuracy in the paper's evaluation).

This applies directly to git-ape: when the coding agent reports "deployment successful", we currently trust that claim. With trace capture, we can independently verify the agent actually passed through all mandatory checkpoints.

### What This Unlocks (Future Issues)

Once we have structured traces, we can build:

1. **Dominator-tree validation skill** — Learn essential vs. optional states from 3–5 golden runs, validate all future executions structurally. Ships as a customer-facing skill.
2. **False-success detection** — Post-deployment check: "agent claimed success, but trace shows security gate was never evaluated." Integrates into `azure-integration-tester`.
3. **Auto-generated eval baselines** — `/skill-bench` records golden traces → builds dominator model automatically, no manual grader authoring.
4. **Audit trail for compliance** — Customers in regulated industries get a machine-readable proof that mandatory gates (security, policy, cost review) were executed.

Per the paper:
> *"We collect 3–5 passing execution traces where the agent successfully completes the task. [...] Our system automatically identifies that [optional states] appear in some traces but not others [and] extracts the dominator tree showing essential states."*

Traces are the raw material for all of this.
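
As a rough illustration of the idea in the quote (not the paper's dominator-tree algorithm), states that appear in every golden trace are candidates for essential, and states that appear in only some are optional. The function name `split_essential_optional` and the toy traces below are hypothetical.

```python
def split_essential_optional(golden_traces: list[list[str]]) -> tuple[set[str], set[str]]:
    """Split states into those seen in every trace and those seen in only some.

    This is a set-intersection simplification; it ignores ordering, which a
    real dominator tree would capture.
    """
    if not golden_traces:
        raise ValueError("need at least one golden trace")
    seen_sets = [set(trace) for trace in golden_traces]
    essential = set.intersection(*seen_sets)
    optional = set.union(*seen_sets) - essential
    return essential, optional


# Toy traces for illustration only; not recorded from real git-ape runs.
golden = [
    ["requirements_gathered", "template_generated", "cost_estimated", "deployment_executed"],
    ["requirements_gathered", "template_generated", "deployment_executed"],
    ["requirements_gathered", "policy_assessed", "template_generated", "deployment_executed"],
]
essential, optional = split_essential_optional(golden)
print(sorted(essential))  # ['deployment_executed', 'requirements_gathered', 'template_generated']
print(sorted(optional))   # ['cost_estimated', 'policy_assessed']
```

Presence in every golden run is only a necessary condition here; the ordering constraints that make a state a true dominator are left to the follow-up work.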

## What to Build

### Trace Event Schema

Define a trace event format emitted at each significant state transition:

```jsonc
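// .azure/deployments/<deployment-id>/trace.jsonl
// Note: this sample does not show the essential security_gate_passed and preflight_validated events; a complete trace includes them.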
{"ts":"2025-06-01T08:30:00Z","state":"requirements_gathered","stage":1,"action":"delegate","target":"azure-requirements-gatherer","meta":{"resources":["func","st"]}}
{"ts":"2025-06-01T08:31:12Z","state":"api_reference_lookup","stage":2,"action":"skill_invoke","target":"azure-rest-api-reference","meta":{"resource_types":["Microsoft.Web/sites","Microsoft.Storage/storageAccounts"]}}
{"ts":"2025-06-01T08:32:45Z","state":"template_generated","stage":2,"action":"complete","target":"azure-template-generator","meta":{"file":"template.json"}}
{"ts":"2025-06-01T08:33:10Z","state":"security_gate_evaluated","stage":2.5,"action":"skill_invoke","target":"azure-security-analyzer","meta":{"result":"passed"}}
{"ts":"2025-06-01T08:34:00Z","state":"user_confirmation","stage":3,"action":"checkpoint","meta":{"confirmed":true}}
{"ts":"2025-06-01T08:36:30Z","state":"deployment_executed","stage":3,"action":"complete","target":"azure-resource-deployer","meta":{"status":"Succeeded"}}
{"ts":"2025-06-01T08:37:15Z","state":"integration_tests_passed","stage":4,"action":"skill_invoke","target":"azure-integration-tester","meta":{"checks_passed":4,"checks_total":4}}
```
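
One way to make the schema concrete is a small emitter that appends one JSON object per line. The sketch below is illustrative only: the `TraceEvent` dataclass, the `emit_trace` helper, and the choice to omit `target` when it is absent are assumptions, not part of the existing git-ape codebase. Field names mirror the sample events above.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional, Union


@dataclass
class TraceEvent:
    """One line of trace.jsonl; field names follow the sample events."""
    state: str                      # semantic state ID, e.g. "template_generated"
    stage: Union[int, float, str]   # 1, 2.5, or a label such as "pre"
    action: str                     # "delegate", "skill_invoke", "complete", "checkpoint"
    target: Optional[str] = None    # subagent or skill name; absent for checkpoints
    meta: dict[str, Any] = field(default_factory=dict)
    ts: str = ""                    # filled in at emit time if left empty


def emit_trace(path: Path, event: TraceEvent) -> str:
    """Append the event to the JSONL file and return the serialized line."""
    # UTC timestamp with second precision, matching the sample format.
    ts = event.ts or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record: dict[str, Any] = {
        "ts": ts,
        "state": event.state,
        "stage": event.stage,
        "action": event.action,
    }
    if event.target is not None:
        record["target"] = event.target  # assumption: omit the key rather than write null
    record["meta"] = event.meta
    line = json.dumps(record, separators=(",", ":"))
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

Called with `state="user_confirmation"`, `stage=3`, `action="checkpoint"`, `meta={"confirmed": True}` and the sample timestamp, this helper reproduces the `user_confirmation` line shown above. Opening the file in append mode lets agent-emitted and workflow-emitted events share one file.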

### State Vocabulary (Derived from Agent Workflow Stages)

These are the semantic states corresponding to the git-ape agent's documented workflow in `git-ape.agent.md`:

| State ID | Stage | Essential? | Description |
|----------|-------|:---:|-------------|
| `requirements_gathered` | 1 | ✅ | Requirements collection complete |
| `api_reference_lookup` | 2 | ✅ | ARM schema verified via REST API reference |
| `template_generated` | 2 | ✅ | ARM template created |
| `security_gate_evaluated` | 2.5 | ✅ | Security analysis run, gate decision made |
| `security_gate_passed` | 2.5 | ✅ | Gate result: passed or overridden |
| `policy_assessed` | 2 | ❌ | Policy advisor run (advisory, not blocking) |
| `cost_estimated` | 2 | ❌ | Cost estimator invoked |
| `architecture_reviewed` | 2.75 | ❌ | WAF review (optional) |
| `preflight_validated` | 2 | ✅ | What-if analysis completed |
| `user_confirmation` | 3 | ✅* | User confirmed deployment intent (*interactive only) |
| `deployment_executed` | 3 | ✅ | ARM deployment completed |
| `integration_tests_passed` | 4 | ✅ | Post-deploy health checks pass |
| `drift_checked` | pre | ❌ | Pre-deployment drift check (optional) |

Essential states (✅) map to dominator nodes in the paper's model — every valid deployment must pass through them.
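
For validation tooling, the table can be carried as data. The following sketch assumes a hypothetical `STATE_VOCABULARY` mapping and two helpers, `essential_states` and `missing_states` (none of these exist in git-ape today). The entries are copied from the table, with `user_confirmation` treated as essential only in interactive mode.

```python
# Essentiality per state, copied from the State Vocabulary table.
# "interactive" marks a state that is essential only in interactive mode.
STATE_VOCABULARY = {
    "requirements_gathered": "essential",
    "api_reference_lookup": "essential",
    "template_generated": "essential",
    "security_gate_evaluated": "essential",
    "security_gate_passed": "essential",
    "policy_assessed": "optional",
    "cost_estimated": "optional",
    "architecture_reviewed": "optional",
    "preflight_validated": "essential",
    "user_confirmation": "interactive",
    "deployment_executed": "essential",
    "integration_tests_passed": "essential",
    "drift_checked": "optional",
}


def essential_states(interactive: bool) -> set[str]:
    """Return the states every valid run must contain for the given mode."""
    wanted = {"essential", "interactive"} if interactive else {"essential"}
    return {state for state, kind in STATE_VOCABULARY.items() if kind in wanted}


def missing_states(observed: set[str], interactive: bool) -> set[str]:
    """Essential states that never appear in the observed trace."""
    return essential_states(interactive) - observed
```

Keeping the mode-dependent case explicit avoids flagging headless runs for a confirmation step they never perform.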

### Emission Points

Add trace emission calls at these locations:

1. **`git-ape.agent.md`** — Document the trace contract in the agent instructions so the orchestrator emits events at stage boundaries
2. **Each subagent** — Emit on entry (`action: delegate`) and completion (`action: complete`)
3. **Each skill invocation** — Emit when a skill is called (`action: skill_invoke`) with result metadata
4. **Checkpoints** — Emit at blocking gates (security gate, user confirmation)

### Implementation Approach

**Option A: Agent-emitted (instruction-driven)**
Add trace emission instructions to `git-ape.agent.md` and subagent `.agent.md` files. The agent writes to `trace.jsonl` at each checkpoint. Pros: no code changes. Cons: depends on LLM compliance.

**Option B: Workflow-emitted (GitHub Actions)**
Add trace emission steps in the `git-ape-plan.yml` and `git-ape-deploy.yml` workflows. Pros: deterministic. Cons: only captures CI stages, misses interactive-mode flow.

**Option C: Hybrid (recommended)**
- Agent instructions mandate writing `trace.jsonl` during interactive/headless execution
- Workflow steps append their own trace events (validation, deployment, integration tests)
- A `validate-trace` post-step in the deploy workflow checks trace completeness before committing `state.json`

### File Location

```
.azure/deployments/<deployment-id>/
├── template.json
├── parameters.json
├── metadata.json
├── state.json          # existing: final result
├── trace.jsonl         # NEW: execution trace
└── architecture.md
```

### Validation (Minimal — This PR)

For this initial PR, add a **lightweight post-deployment check** in the deploy workflow:

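```bash
# After deployment succeeds, verify the trace contains every essential state.
# user_confirmation is excluded because it is essential only in interactive mode.
TRACE_FILE="trace.jsonl"
ESSENTIAL="requirements_gathered api_reference_lookup template_generated security_gate_evaluated security_gate_passed preflight_validated deployment_executed integration_tests_passed"
missing=0
for state in $ESSENTIAL; do
  # Tolerate optional whitespace after the colon in the JSON.
  if ! grep -Eq "\"state\":[[:space:]]*\"$state\"" "$TRACE_FILE"; then
    echo "⚠️ TRACE INCOMPLETE: missing essential state: $state"
    missing=1
  fi
done
exit "$missing"  # non-zero fails the workflow step; drop this line to keep the check advisory
```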

This is the simplest possible version of the dominator-tree validation: a flat checklist that checks presence but not order. The full topological subsequence matching (paper Section 3.3) comes in a follow-up issue.
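
To show what an order-aware check adds over the flat checklist, here is a minimal in-order test. It is a plain subsequence check, not the paper's topological matching, and the helper names `read_states` and `first_order_violation` as well as the expected order used in the example are assumptions.

```python
import json
from typing import Iterable, Optional


def read_states(jsonl_lines: Iterable[str]) -> list[str]:
    """Extract the `state` field from each non-empty JSONL line, in file order."""
    return [json.loads(line)["state"] for line in jsonl_lines if line.strip()]


def first_order_violation(expected: list[str], observed: list[str]) -> Optional[str]:
    """Return the first expected state not found in order, or None if all are.

    `expected` must appear as a subsequence of `observed`; extra (optional)
    states in between are allowed.
    """
    remaining = iter(observed)
    for state in expected:
        # `in` on an iterator consumes items up to and including the match,
        # so each expected state must come after the previous one.
        if state not in remaining:
            return state
    return None


# Assumed order for illustration; the real order comes from git-ape.agent.md.
expected = ["requirements_gathered", "template_generated", "deployment_executed"]
trace = [
    '{"state":"requirements_gathered"}',
    '{"state":"cost_estimated"}',
    '{"state":"deployment_executed"}',
    '{"state":"template_generated"}',
]
print(first_order_violation(expected, read_states(trace)))  # deployment_executed
```

The toy trace contains all three expected states, so a presence-only checklist would accept it; the ordered check rejects it because the deployment ran before the template was generated.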

## Acceptance Criteria

- [ ] Trace event schema defined (TypeScript interface or JSON Schema in `.github/schemas/trace-event.schema.json`)
- [ ] `git-ape.agent.md` updated with trace emission instructions at each stage boundary
- [ ] Subagent `.agent.md` files updated to emit entry/exit events
- [ ] `git-ape-deploy.exampleyml` updated with trace validation post-step
- [ ] At least one end-to-end example `trace.jsonl` committed as a fixture in `.github/evals/`
- [ ] README or docs updated to describe the trace format

## Non-Goals (This Issue)

- Full dominator-tree extraction algorithm (future issue)
- Customer-facing `trace-validator` skill (future issue)
- LLM-based semantic equivalence for state matching (future issue)
- Trace-based eval graders in waza (future issue)

## References

- Paper: [arXiv:2605.03159](https://arxiv.org/abs/2605.03159) — "Learning Correct Behavior from Examples: Validating Sequential Execution in Autonomous Agents" (Sharma, Mittal, Hu — UW/Microsoft, 2025)
- Git-Ape agent workflow: `.github/agents/git-ape.agent.md` (Stages 1–4 + Security Gate)
- Existing state artifact: `.azure/deployments/<id>/state.json`
- Eval harness: `.github/evals/README.md`

