feat(interactive-agents): async-only background task UX + Cedar HITL design#52

Open
scoropeza wants to merge 80 commits into aws-samples:main from scoropeza:feature/interactive-background-agents
Conversation


@scoropeza scoropeza commented Apr 29, 2026

Summary

Introduces interactive background agents — users can watch, steer, trace, and review agent runs through a unified async-only architecture. Ships five user-facing surfaces (watch, nudge, status, trace, webhook) plus the first real notification-plane dispatcher (GitHub edit-in-place). Includes the locked design for Phase 3 Cedar-driven HITL approval gates — no implementation; design only.

All shipped features are deployed to a dev stack and E2E-validated against scoropeza/agent-plugins PRs #32/#33/#43. Three runtime bugs were caught and fixed during deploy-validation — see Deploy validation below.

Branch state: ~82 commits. Will squash on merge.

What ships (deployed + tested)

Polling / observability

  • bgagent watch <task-id> — adaptive polling (500 ms active; 1 / 2 / 5 s idle backoff; resets on event arrival), cold-start retry wrapper, clean terminal exit. Streams turn-by-turn events from TaskEventsTable via GET /v1/tasks/{id}/events?after=<cursor>.
  • bgagent status <task-id> — deterministic snapshot (AD-6): Type, Channel, Description, Turn, Last milestone, Current, Cost, Reason, Trace S3, Last event. --wait is a pure blocking flag; same layout in both modes.
  • bgagent events <task-id> — full event log (paged).
  • --verbose — HTTP debug output on stderr.
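The adaptive schedule above reduces to a small pure function. A minimal sketch (nextDelayMs and the ladder constant are illustrative names, not the actual CLI code):

```typescript
// Idle backoff ladder from the watch description: 1 s → 2 s → 5 s.
const IDLE_LADDER_MS = [1_000, 2_000, 5_000];

// idlePolls = consecutive polls that returned no new events. The caller
// resets its counter to 0 on event arrival, restoring 500 ms polling.
function nextDelayMs(idlePolls: number): number {
  if (idlePolls === 0) return 500; // active: events just arrived
  const idx = Math.min(idlePolls - 1, IDLE_LADDER_MS.length - 1);
  return IDLE_LADDER_MS[idx]; // stays at the 5 s cap once reached
}
```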

Interactive steering

  • bgagent nudge <task-id> "<message>" — mid-run steering via POST /v1/tasks/{id}/nudges → TaskNudgesTable → agent's between-turns hook. Combined-turn acknowledgement per AD-5 — the agent emits a nudge_acknowledged milestone before incorporating the nudge, giving users a visible handshake without the complexity of a mid-turn interrupt.
  • bgagent cancel <task-id> — terminates the running agent, persists task_cancelled, prevents PR push.

Trace / debug

  • --trace on submit — raises the progress-writer preview cap from 200 chars to 4 KB and uploads a full gzipped NDJSON trajectory to S3 when the task reaches a terminal state.
  • bgagent trace download <task-id> — streams gzipped NDJSON to stdout (pipe-friendly: | gunzip | jq -s .) or writes raw gzip to -o <file> with a --force overwrite guard.

Notification plane (fanout)

  • FanOutConsumer Lambda on TaskEventsTable DDB Streams (ParallelizationFactor: 1 for per-task ordering) routes events to per-channel dispatchers with configurable default filters (AD-4).
  • GitHub edit-in-place dispatcher — single status comment per task on the target PR, edited in place as progress events fire. Slack / Email dispatchers are log-only stubs pending integration.

Webhook submission

  • bgagent webhook create / list / revoke — registers external integrations. HMAC-SHA256 request verification (X-Webhook-Signature) + REQUEST authorizer for lookup. channel_source: "webhook" propagates through the task record for observability.
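A sketch of the request verification, assuming the usual HMAC-over-raw-body pattern; only the header name and algorithm come from this PR, and the hex digest encoding plus timing-safe comparison are assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an X-Webhook-Signature header against the shared webhook secret.
// Hex encoding is an assumption; the real service may differ.
function verifyWebhookSignature(secret: string, body: string, signatureHeader: string): boolean {
  const expected = createHmac("sha256", secret).update(body).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so guard the lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```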

Design-only (not implemented)

Phase 3 — Cedar-driven HITL approval gates

  • docs/design/PHASE3_CEDAR_HITL.md (1,883 lines) — 22 locked decisions across 17 sections: three-outcome model (allow / hard-deny / soft-deny), policy authoring guide, evaluate_tool_use skeleton, ApprovalAllowlist + RecentDecisionCache, REST API contract, CLI UX, state-machine concurrency, data model, observability, security, failure modes, scenarios, carry-forward list.
  • docs/diagrams/phase3-cedar-hitl.drawio — 12-page companion.
  • Plumbing only: cedar_policies is threaded through blueprintConfig → repoConfig → orchestrator → agent payload shape (agent/src/models.py), and agent/src/hooks.py has a documented PreToolUse stub calling out Cedar evaluation as the next step. No Cedar engine, no policy evaluation, no approval flow. Implementation is a separate PR (~3–4 weeks for 3a core).
  • @cedar-policy/cedar-wasm@4.10.0 locked as the engine choice after WASM vs native spike.

Architectural highlights

  • Async-only polling (rev-6). No SSE, no two-runtime split. Single AgentCore runtime, GET /events?after=<cursor> replaces the prior streaming surface.
  • DDB-Stream ordering for fan-out — ParallelizationFactor: 1 guarantees per-task ordering; no SQS FIFO, no ETag. (AD-9, with an explicit "rejected alternative" for GitHub's If-Match — see Deploy validation bug #3.)
  • Combined-turn nudge ack (AD-5) — agent emits nudge_acknowledged BEFORE folding the nudge into the next turn; single milestone, no consume-side event.
  • Deterministic status snapshot (AD-6) — formatStatusSnapshot derives every rendered field from TaskDetail + a small window of recent events; no LLM in the loop, no fabricated state.
  • agent_milestone unwrap — agent writes named checkpoints (pr_created, nudge_acknowledged, …) as agent_milestone events with metadata.milestone. Fanout unwraps against a narrow ROUTABLE_MILESTONES allowlist before matching channel filters.
  • Numeric coercion at the DDB boundary — shared coerceNumericOrNull helper handles the Document-client number-as-string quirk uniformly.
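The milestone unwrap can be sketched as follows; pr_created and nudge_acknowledged are named in this PR, but the shape of the allowlist and the helper name here are illustrative:

```typescript
// Narrow allowlist of milestone names the fanout may route as if they were
// first-class event types. Contents beyond the two named here are unknown.
const ROUTABLE_MILESTONES = new Set(["pr_created", "nudge_acknowledged"]);

interface TaskEvent {
  event_type: string;
  metadata?: { milestone?: string };
}

// If an agent_milestone wraps an allowlisted milestone, match channel
// filters against the inner name; otherwise match the literal event_type.
function effectiveEventType(ev: TaskEvent): string {
  if (ev.event_type === "agent_milestone") {
    const m = ev.metadata?.milestone;
    if (m && ROUTABLE_MILESTONES.has(m)) return m;
  }
  return ev.event_type;
}
```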

Design doc

  • docs/design/INTERACTIVE_AGENTS.md — canonical rev-6 design. Sections: polling (§5), notification plane (§6), fanout routing (§6.2), GitHub edit-in-place (§6.4), observability / --trace (§10.1), concurrency rationale (§8.9), AD-1 through AD-9.
  • docs/diagrams/interactive-agents-phases.drawio — 5-page architecture (1. rev-6 topology, 2. watch polling, 3. nudge flow, 4. status + trace, 5. fanout + GitHub edit-in-place).
  • docs/design/PHASE3_CEDAR_HITL.md + 12-page drawio (design-only; see above).

Deploy validation

Every shipped surface was E2E-validated against a dev stack (backgroundagent-dev, us-east-1). The validation pass caught three runtime bugs that passed all unit tests but failed on live AWS — all are fixed in this PR:

  1. costUsd.toFixed is not a function — DDB Document-client returns Number attributes as strings; fanout dispatcher was not coercing. Fix: shared coerceNumericOrNull helper at the fanout + orchestrator boundaries.
  2. agent_milestone wrapper events never routed — fanout filter matched event_type literally; missed pr_created wrapped in agent_milestone. Fix: effectiveEventType unwrap against ROUTABLE_MILESTONES allowlist.
  3. GitHub rejects If-Match on PATCH /issues/comments/{id} with HTTP 400 ("Conditional request headers are not allowed in unsafe requests unless supported by the endpoint"). The earlier ETag optimistic-concurrency design never worked in production. Fix: drop If-Match entirely; rely on DDB-Stream ordering + single-writer invariant (AD-9 rewritten; §6.4 rewritten).
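Bug 1's fix can be sketched as below. The helper name comes from this PR; the exact behavior (strings parsed, non-finite values mapped to null, warning path omitted) is inferred from the bug description:

```typescript
// Coerce a DDB Document-client attribute that should be numeric. The client
// can surface Number attributes as strings; anything unparseable or
// non-finite maps to null so downstream .toFixed() calls never throw.
function coerceNumericOrNull(value: unknown): number | null {
  if (typeof value === "number") return Number.isFinite(value) ? value : null;
  if (typeof value === "string" && value.trim() !== "") {
    const n = Number(value);
    return Number.isFinite(n) ? n : null;
  }
  return null;
}
```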

Also surfaced + fixed during validation: Bedrock Guardrail inputStrength over-blocked (lowered HIGH → MEDIUM); fanout dispatcher silently ignored non-finite numeric input (now warns); bgagent watch didn't exit cleanly on terminal state (deferred-exit fallback); bgagent status snapshot missed Type / Channel / Description / Reason (all now surfaced); channel_source silently dropped by the API response (now included in TaskDetail); specific classifiers added for agent_status=error_max_turns / error_max_budget_usd / error_during_execution (previously fell through to a generic "Agent task did not succeed").

Test plan

Known limitations / follow-ups

Filed separately (see .l4-issue-drafts/ on this branch; 6 issue drafts ready to file post-merge):

  • failTask double-emits task_failed + over-decrements concurrency on Durable Execution step retry.
  • Bedrock Guardrail PROMPT_ATTACK over-triggers on imperative-mood prompts even at MEDIUM (example log in the draft).
  • Idempotent submit returns DUPLICATE_TASK error instead of the original task_id — non-standard semantics.
  • webhook_url missing from CreateWebhookResponse — users have to construct it by hand.
  • Pre-existing drafts: session-tag IAM tightening; container CVE rebase cadence.

Deferred feature work (tracked elsewhere):

  • Phase 3 Cedar HITL implementation — ~3–4 weeks for 3a core; separate PR after merge.
  • Slack / Email dispatchers (currently log-only stubs).
  • GitHub webhook body-shape adapter (today's webhook path expects {repo, task_description, ...}; GitHub sends IssueCommentEvent / PullRequestEvent).

Reviewers — suggested reading order

  1. docs/design/INTERACTIVE_AGENTS.md — canonical rev-6 design (ADs, architecture, §5 / §6 / §10).
  2. docs/diagrams/interactive-agents-phases.drawio — 5-page architecture.
  3. CDK handlers: cdk/src/handlers/fanout-task-events.ts, cdk/src/handlers/shared/github-comment.ts, cdk/src/handlers/shared/orchestrator.ts, cdk/src/handlers/shared/error-classifier.ts, cdk/src/handlers/shared/numeric.ts.
  4. CDK constructs: cdk/src/constructs/task-api.ts, cdk/src/constructs/task-events-table.ts, cdk/src/constructs/task-nudges-table.ts, cdk/src/stacks/agent.ts.
  5. Agent runtime: agent/src/pipeline.py, agent/src/runner.py, agent/src/hooks.py, agent/src/progress_writer.py, agent/src/telemetry.py.
  6. CLI: cli/src/commands/{watch,nudge,status,submit,trace,webhook}.ts, cli/src/format.ts, cli/src/wait.ts, cli/src/bin/bgagent.ts.
  7. Phase 3 design (optional; will be a separate PR): docs/design/PHASE3_CEDAR_HITL.md, docs/diagrams/phase3-cedar-hitl.drawio.

GitHub's Squash and merge is the intended merge strategy.

scoropeza and others added 30 commits April 28, 2026 13:53
- INTERACTIVE_AGENTS.md: comprehensive design for bidirectional
  agent-user communication (progress streaming, nudges, HITL
  approval gates, pause/resume)
- Architecture diagrams (4 pages): current state, SSE streaming,
  nudge flow, state machine extensions
- Research prompt that drove the design work
Detailed prompt for implementing DDB progress events + CLI watch
command. Covers: ProgressWriter, entrypoint integration, CLI watch
command, tests, and constraints. Ready for a fresh agent session.
…e 1a prompt

- Use subagents to parallelize agent-side and CLI-side work
- Keep main context clean by delegating reads/research to subagents
- CRITICAL: never deviate from approved design without explicit approval
- On blockers: research first, then surface for discussion
Introduces _ProgressWriter that emits structured AG-UI-style progress
events to the existing TaskEventsTable during agent execution. Events
cover turns, tool calls, tool results, milestones, cost updates, and
errors.

- progress_writer.py: lazy boto3 init, fail-open writes, circuit breaker
  after 3 consecutive DDB failures, ULID-sortable event IDs, 90-day TTL
- entrypoint.py: integrated into run_agent() message loop and run_task()
  pipeline; handles UserMessage branch to capture ToolResultBlocks
- Dockerfile: add progress_writer.py to the agent image COPY layer
- 25 unit tests cover all event types, truncation, fail-open, and
  circuit breaker behavior
Adds a new bgagent watch <task_id> command that polls GET /tasks/{id}/events
every 2 seconds, renders progress events in a human-readable format, and
exits cleanly when the task reaches a terminal state.

- watch.ts: polling loop, event rendering for all 6 progress event types,
  JSON output mode (--output json), Ctrl+C handling, last-seen event
  tracking to avoid duplicates
- bgagent.ts: register the new command
- 14 unit tests cover event rendering, polling, terminal state detection,
  deduplication, and JSON mode

Exit codes: 0 on COMPLETED, 1 on FAILED/CANCELLED/TIMED_OUT.
Enables end-to-end local testing of progress events without deploying
to AWS. The agent container connects to a local DynamoDB instance on
a shared Docker network; boto3 endpoint redirection via
AWS_ENDPOINT_URL_DYNAMODB means zero code changes — the same
_ProgressWriter code path runs locally and in production.

- docker-compose.yml: dynamodb-local service on agent-local network
- scripts/create-local-tables.sh: idempotent creation of TaskEventsTable
  and TaskTable schemas matching the CDK constructs
- mise.toml: local:up (start + create tables), local:down, local:events,
  local:events:json
- run.sh: new --local-events flag that joins the agent container to the
  agent-local network and sets the right env vars

Workflow:
  cd agent && mise run local:up
  ./run.sh --local-events "owner/repo" 42
  mise run local:events    # in another terminal
  mise run local:down
- docs/design/INTERACTIVE_AGENTS.md rev 3: adds Section 9.9 local testing
  subsection with the DDB Local workflow; updates "last updated" marker
- docs/guides/DEVELOPER_GUIDE.md: new "Testing with progress events
  (DynamoDB Local)" section under Local testing
- docs/guides/USER_GUIDE.md: new "Watching a task in real time"
  subsection documenting bgagent watch (behavior only — transport
  details intentionally omitted, will change in Phase 1b)
- AGENTS.md: routing table row for progress_writer.py so future coding
  sessions know where progress events are emitted from
- docs/src/content/docs/**: Starlight sync output from the above
Captures two issues discovered during Phase 1a E2E validation that are
scoped outside the interactive-agents feature. Each file contains
problem statement, reproduction, proposed design, test cases, and
acceptance criteria so a future session can pick it up independently.

- CLI_COGNITO_NEW_PASSWORD_CHALLENGE.md: bgagent login does not handle
  Cognito's NEW_PASSWORD_REQUIRED challenge, blocking first-time users
  without admin intervention
- EARLY_PROGRESS_MILESTONES.md: UX enhancement to emit finer-grained
  milestones during setup_repo (clone / install / baseline build) so
  watch output appears within seconds instead of minutes
Team discussion (Sam ↔ Alain, 2026-04-17) agreed to replace the
hardcoded 3-tier HITL model with Cedar policy-driven approvals,
reusing the in-process Cedar engine already in the repo (on branch
fix/validate-aws-before-docker-build as agent/src/policy.py).

The existing ALLOW/DENY decisions extend to include REQUIRE_APPROVAL —
same policy language, unified governance, supports workflows like
AI-DLC where users gate per phase and relax over time.

Phase 1a (done) and Phase 1b (next) are unaffected. Phase 3 (HITL)
blocked on design revision and the Cedar branch landing on main.
Re-integrates Phase 1a ProgressWriter event emission after rebasing onto
upstream/main, which decomposed agent/entrypoint.py into agent/src/
modules (pipeline.py, runner.py, etc.).

- pipeline.py: instantiate _ProgressWriter; emit repo_setup_complete,
  agent_execution_complete, pr_created milestones
- runner.py: instantiate _ProgressWriter; track tool_use_id -> tool_name
  for UserMessage ToolResultBlock correlation; emit agent_turn,
  agent_tool_call, agent_tool_result, agent_cost_update, agent_error

No behavior change vs. Phase 1a - same event shapes, same call sites
moved to their new module homes. All 305 agent tests pass.
Updates the interactive agents design doc and adds two draw.io files
documenting the Phase 1b architecture.

Design doc (rev 4, 2026-04-17):
- Revision history section added
- New: section 8.9 (control / streaming / fan-out planes + channel-fit
  matrix), 9.1.1 (two-runtime split, D1 resolved), 9.10 (thread + queue
  bridge, D2 resolved), 9.11 (CLI hybrid client, D3 resolved), 9.12
  (AgentCore streaming limits: 60-min cap, LifecycleConfiguration,
  no SSE-native resume, SDK timeout overrides, API GW REST
  incompatibility)
- Updated: section 4 Phase 1b parameters, 5 SSEAdapter sibling pattern,
  8.2 reconnection details, 9 heading bumped to rev 4, 10 Phase 1b
  implementation plan rewritten, Appendix C file change map completed
- Fixed: exec-summary inaccuracy about SSE 60-min cap (line 23)

Diagrams:
- docs/interactive-agents-phases.drawio — v1, 8 pages, documents the
  decision-space (options to choose from) for D1/D2/D3
- docs/interactive-agents-phases-v2.drawio — v2, 8 pages, documents the
  final resolved Phase 1b architecture with rejected alternatives
  captured on the rationale page for future reference

No code changes. No impact on deployed stack. Phase 1a continues to
pass all 305 agent / 721 CDK / 84 CLI tests.
Infrastructure foundation for Phase 1b SSE streaming. Introduces a
second AgentCore Runtime with Cognito JWT authorizer for direct
CLI-to-AgentCore SSE consumption, while preserving the existing
IAM-authed runtime for the orchestrator path. No agent or CLI code
changes in this step — those land in Steps 2-6.

Changes:
- cdk/src/stacks/agent.ts: rename Runtime -> RuntimeIam, add RuntimeJwt
  with RuntimeAuthorizerConfiguration.usingCognito(userPool,
  [appClient]). Same artifact, env vars, VPC, models, memory,
  filesystem mount, and log/trace wiring shared across both runtimes.
  Explicit LifecycleConfiguration (8h idle + 8h max) on both per
  §9.12 of design doc. Lazy.string breaks the TaskApi <-> Runtime-JWT
  <-> Orchestrator circular dependency. OrchestratorFn IAM policy
  scoped to RuntimeIam only. New CFN outputs RuntimeIamArn and
  RuntimeJwtArn; RuntimeArn kept as deprecated alias.
- cdk/src/constructs/task-api.ts: expose appClient as public attr for
  Runtime-JWT to reference.
- cdk/src/constructs/task-events-table.ts: enable DynamoDB Streams
  with NEW_IMAGE (in-place update, Phase 1a event data preserved).
  Prerequisite for the fan-out plane.
- cdk/test/**: 8 new tests covering two-runtime topology, JWT authorizer
  config, LifecycleConfiguration shape, scoped IAM, DDB Streams, new
  CFN outputs. Full suite 729 passed / 0 failed (--runInBand).

Refs docs/design/INTERACTIVE_AGENTS.md §9.1.1, §9.10, §9.12.
First deploy of Phase 1b Step 1 rolled back with:
  AWS::BedrockAgentCore::Runtime 'jean_cloude' already exists

Root cause: I had renamed the existing Runtime construct from 'Runtime'
to 'RuntimeIam'. CloudFormation interprets construct id changes as
replacement — CREATE new resource, then DELETE old. Because the old
Runtime has runtimeName 'jean_cloude' (immutable) and the new RuntimeIam
was declared with the same name, CFN tried to create a duplicate and
failed on AgentCore's account-level name uniqueness constraint.

Fix: revert the construct id to 'Runtime' so CFN updates the existing
resource in place (only the new LifecycleConfiguration is added). The
TS variable name `runtimeIam` is kept for readability + Phase 1b role
documentation. Test updated to match logical id shape via regex.

Verified via cdk diff: existing Runtime gets only [~] LifecycleConfig
(in-place), RuntimeJwt + log infra are net-new, TaskEventsTable gets
StreamSpecification in-place. No destroys of load-bearing resources.
…f ProgressWriter)

Introduces _SSEAdapter, the producer side of a per-task asyncio.Queue
that will feed the SSE handler in server.py (to be wired in Step 3).
Semantic API mirrors _ProgressWriter 1:1 so integration is symmetric:
pipeline.py / runner.py will call both adapters at the same sites.

Design contract (see docs/design/INTERACTIVE_AGENTS.md §9.10):
- Thread-safe: write_agent_* methods run in the pipeline background
  thread; bridge to the asyncio loop via loop.call_soon_threadsafe
  (chosen over run_coroutine_threadsafe — Queue.put_nowait is a plain
  sync call, no need for coroutine + Future wrapping).
- Backpressure: bounded asyncio.Queue, drop-oldest when full. Counter
  dropped_count only ever grows. Never blocks the pipeline thread.
- No-subscribers case: writes before attach_loop or after detach_loop
  are silent drops with counter increment. Pipeline must NEVER be
  affected by whether a client is connected.
- Fail-open: every enqueue path swallows exceptions and bumps the
  counter. Same philosophy as ProgressWriter. write_agent_error has a
  second-level try/except — it must NEVER raise.
- Close sentinel: object() sentinel distinguishable from any real
  event dict. get() returns None when drained for clean stream shutdown.
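The real adapter is Python/asyncio, but the bounded drop-oldest contract is language-neutral. An illustrative TypeScript sketch (not the repo's code):

```typescript
// Bounded drop-oldest buffer: never blocks the producer, evicts the oldest
// item when full, and droppedCount only ever grows.
class DropOldestQueue<T> {
  private buf: T[] = [];
  droppedCount = 0;
  constructor(private readonly maxSize: number) {}

  put(item: T): void {
    if (this.buf.length >= this.maxSize) {
      this.buf.shift(); // drop oldest, keep newest
      this.droppedCount++;
    }
    this.buf.push(item);
  }

  take(): T | undefined {
    return this.buf.shift(); // FIFO order for the consumer
  }
}
```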

Wire-format translation (semantic dict → AG-UI TEXT_MESSAGE_* /
TOOL_CALL_* frames) is NOT done here — that's the SSE handler's job in
Step 3. This adapter traffics in semantic Python dicts.

Tests: 27 new in test_sse_adapter.py covering happy path, all 6
semantic methods, FIFO ordering, no-subscribers drop, post-detach
drop, re-attach, queue-full drop-oldest, close sentinel, post-close
behavior, thread-safe enqueue from non-loop thread, concurrent
producers, loop-closed scenarios, dropped_count monotonicity, payload
integrity, large-payload pass-through, bulletproof write_agent_error.

Test counts: 305 Phase 1a baseline + 27 new = 332 passed, 0 failed.
Lint (ruff check): clean. Format: clean. Type check (ty): clean.
Phase 1a ProgressWriter unchanged (deployed + working).

No integration in pipeline.py / runner.py / server.py yet — that lands
in Step 3.
…ions

Phase 1b Step 3 implementation resolves the pending tactical decision
in §9.10 and Appendix C: the SSE handler lives on the existing
/invocations endpoint via content-type negotiation rather than a new
/invocations/stream endpoint. Rationale: one endpoint contract, matches
AgentCore's documented AG-UI pattern, zero risk of the orchestrator's
sync path being misrouted.

- §9.10: replace "deferred to PR" sentence with the resolved choice —
  Accept: text/event-stream routes to SSE, anything else preserves the
  existing sync behavior byte-for-byte.
- Appendix C: update the server.py row to describe content-type
  negotiation explicitly. Add new row for agent/src/sse_wire.py —
  the pure-function translator from semantic events to AG-UI frames,
  kept separate from transport for testability.
- v2 drawio: surgical text updates to 7 labels that referenced
  /invocations/stream; no edge or layout changes. Validator passes.

v1 drawio (interactive-agents-phases.drawio, kept as decision history)
is untouched.
…erver.py

Makes Phase 1b SSE functional end-to-end inside the agent container.
Client-side CLI work lands in Steps 5/6.

Endpoint: content-type negotiation on existing /invocations (locked in
docs fa622b4). Accept: text/event-stream → new SSE handler. Any other
Accept (missing, application/json, */*) → existing sync acceptance
response, byte-for-byte preserved. Orchestrator's InvokeAgentRuntime
calls are unaffected because they do not send text/event-stream.

server.py:
- content-type negotiation via _wants_sse() — case-insensitive substring
  match so qualified types and lists work (e.g. 'text/event-stream;q=1').
- SSE path: create _SSEAdapter, attach_loop, spawn background thread
  running run_task(..., sse_adapter=adapter), return StreamingResponse.
- Async generator drains adapter.get() with 15s timeout; emits
  ': ping\n\n' keepalive on timeout, AG-UI 'data: <json>\n\n' frames on
  events. RUN_STARTED synthesised up front; RUN_FINISHED on clean close,
  RUN_ERROR if any agent_error observed during the stream.
- Client disconnect detection: GeneratorExit → detach_loop; background
  run_task keeps running so DDB durability is unaffected, events then
  drop silently until the adapter is closed.
- Idempotent close() in both server.py finally and pipeline.py finally
  (belt-and-braces — close is idempotent by design).
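Two of the mechanics above are easy to pin down in code. A sketch (wantsSse mirrors the Python _wants_sse's case-insensitive substring match; the frame helpers follow the standard SSE wire format, where ": …" comment lines are ignored by clients):

```typescript
// Content-type negotiation: substring match so qualified media types and
// comma-separated Accept lists (e.g. 'text/event-stream;q=1') still route.
function wantsSse(acceptHeader: string | undefined): boolean {
  return (acceptHeader ?? "").toLowerCase().includes("text/event-stream");
}

// One AG-UI event as an SSE data frame: 'data: <json>\n\n'.
function sseDataFrame(event: unknown): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Keepalive comment emitted on the 15 s drain timeout.
function sseKeepalive(): string {
  return ": ping\n\n";
}
```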

sse_wire.py (new): pure-function translator from semantic SSEAdapter
dicts to AG-UI wire-format events. Kept separate from transport for
testability. Mappings per design doc §9.10 and the AG-UI facts section
of the resolved-decisions memory:
- agent_turn → TEXT_MESSAGE_START / CONTENT / END (+ optional CUSTOM
  for non-empty thinking since AG-UI has no native thinking event)
- agent_tool_call → TOOL_CALL_START / ARGS / END
- agent_tool_result → TOOL_CALL_RESULT (role=tool, is_error propagated)
- agent_milestone → STEP_STARTED + STEP_FINISHED pair
- agent_cost_update → CUSTOM (no native cost event)
- agent_error → RUN_ERROR (terminal) or CUSTOM (non-terminal); Phase 1b
  treats all runner errors as terminal
Uses camelCase on the wire, SCREAMING_SNAKE_CASE event types, ULIDs
for message/tool_call ids, ms-since-epoch timestamps.

pipeline.py: add sse_adapter: _SSEAdapter | None = None kwarg; mirror
the 3 progress.write_agent_milestone calls on the adapter. finally
block calls adapter.close() so the consumer stream ends cleanly.

runner.py: add sse_adapter kwarg; mirror the 5 progress.write_X calls
(turn, tool_call, tool_result, cost_update, error). Zero change to
ProgressWriter behavior — DDB remains durable source of truth.

Tests (38 new, 371 total, 305 Phase 1a baseline preserved):
- test_sse_wire.py (29 tests): per-event-type roundtrip, camelCase,
  ULID shape, timestamp presence, CUSTOM fallback for cost, is_error
  on tool results, empty-field edge cases, terminality rule for errors.
- test_server.py (+6): content-type negotiation routing, SSE happy path,
  SSE terminates with RUN_ERROR on agent_error, 15s keepalive ping,
  client-disconnect detach_loop + background continues, sync path
  regression ×2 (no Accept / Accept: application/json).
- test_pipeline.py (+3): sse_adapter=None preserves current behavior,
  adapter mirrored at milestone sites + close() in finally, runner
  signature accepts sse_adapter kwarg.

Test hardening: test_background_thread_failure_503 now waits for the
mock to be called (race existed in baseline too — /ping flips 503
before task_state.write_terminal runs because of intervening print +
traceback.print_exc in the except block).

No CDK change (Step 1 shipped). No CLI change (Steps 5/6). Not
deployed yet — deployment gated on Step 4 (get-task-events ?after= for
catch-up). ProgressWriter unchanged.
…tch-up

Adds the backend + CLI client pieces for SSE reconnect catch-up. When
the CLI's SSE stream drops (network blip, AgentCore 60-min cap), it
queries REST /tasks/{id}/events?after=<last_seen_event_id> to replay
missed events from DynamoDB, then reopens the SSE stream. This is the
ONLY reconnection mechanism because AG-UI has no Last-Event-ID and
AgentCore has no SSE-native resume (design §9.12).

Handler (cdk/src/handlers/get-task-events.ts):
- Accept ?after=<ulid> alongside existing ?next_token. Back-compat: if
  neither present, existing from-beginning behavior preserved.
- Query mode routing: after → KeyConditionExpression 'task_id AND
  event_id > :after'. ULIDs are Crockford Base32 lexicographically
  sortable, so string compare is correct.
- Collision policy: both ?after and ?next_token present → after wins
  + WARN log (indicates a client bug).
- ?after validation: Crockford Base32 regex, 26 chars. Invalid → 400
  VALIDATION_ERROR.
- Empty after= string falls through to from-beginning (matches how
  URLSearchParams omits empty values; avoids spurious 400s).
- limit-truncated responses still emit next_token when using after,
  so callers can paginate beyond the first page.
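The cursor mechanics above can be sketched as follows; the Crockford Base32 alphabet (digits plus uppercase letters minus I, L, O, U) and the 26-char length come from the commit, while the exact regex is an assumption:

```typescript
// ?after cursor validation: 26 chars of Crockford Base32 (no I, L, O, U).
// Canonical ULIDs are uppercase; case handling in the real handler is unknown.
const ULID_PATTERN = /^[0-9ABCDEFGHJKMNPQRSTVWXYZ]{26}$/;

function isValidUlid(s: string): boolean {
  return ULID_PATTERN.test(s);
}

// Because ULIDs are lexicographically sortable, plain string comparison
// stands in for the KeyConditionExpression 'event_id > :after'.
function isAfter(eventId: string, cursor: string): boolean {
  return eventId > cursor;
}
```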

Shared validation (cdk/src/handlers/shared/validation.ts):
- New isValidUlid() + ULID_PATTERN. Exported for handler + future uses.

Shared types (cdk/src/handlers/shared/types.ts + cli/src/types.ts):
- New GetTaskEventsQuery type formalizing the query shape. Mirrored per
  AGENTS.md "cdk types must stay in sync with cli types" rule.

CLI (cli/src/api-client.ts):
- getTaskEvents now accepts { after?: string } in the options object
  alongside existing { next_token?, limit? }. URL-encoded properly.
- New catchUpEvents(taskId, afterEventId, pageSize=100) helper that
  paginates internally. First page uses ?after; subsequent pages use
  ?next_token (no re-sending after to avoid server WARN). Returns
  flattened event array — one-call API for Step 5/6 reconnect path.
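The internal pagination described above might look like this sketch, where fetchPage is a stand-in for the API client call and the page shape is illustrative:

```typescript
interface EventsPage {
  events: { event_id: string }[];
  next_token?: string;
}
type FetchPage = (q: { after?: string; next_token?: string; limit: number }) => Promise<EventsPage>;

// First page uses ?after; later pages use only ?next_token, so the server
// never logs its both-params WARN. Returns the flattened event array.
async function catchUpEvents(
  fetchPage: FetchPage,
  afterEventId: string,
  pageSize = 100,
): Promise<{ event_id: string }[]> {
  const all: { event_id: string }[] = [];
  let page = await fetchPage({ after: afterEventId, limit: pageSize });
  all.push(...page.events);
  while (page.next_token) {
    page = await fetchPage({ next_token: page.next_token, limit: pageSize });
    all.push(...page.events);
  }
  return all;
}
```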

Comprehensive structured logging (per user emphasis this step):
- INFO on handler entry: {request_id, task_id, limit, query_mode:
  'next_token' | 'after' | 'from_beginning'}.
- INFO on handler exit: {request_id, task_id, event_count, has_more}.
- WARN when both ?after and ?next_token provided, when ?after fails
  ULID validation, when DDB returns unexpectedly empty after a cursor.
- ERROR on DDB exceptions, malformed response, unexpected errors —
  all include error_type.
- DEBUG-style gated via LOG_LEVEL=DEBUG env; emits INFO with tag
  level_override='DEBUG' so CloudWatch filters distinguish them
  without requiring a logger-module change.

Tests: CDK 729 → 745 (+16: 9 handler + 7 validation). CLI 84 → 90
(+6). All pass, compile clean, lint clean. No agent code touched. No
deploy yet — chained with Steps 5/6/7.
…+ fetch)

Transport layer for bgagent watch's upcoming SSE path (Step 6). D3
hybrid bet locked in: import @ag-ui/core for event TYPES + Zod schemas
only; own transport with native fetch + eventsource-parser; own the
reconnection/backoff/JWT-refresh/60-min-restart logic.

No @ag-ui/client anywhere (only grep hits are docstrings explicitly
calling out why we do not use it).

Deps:
- @ag-ui/core@^0.0.52 (types + Zod schemas, pulls zod transitively)
- eventsource-parser@^3.0.8 (SSE frame parser, zero deps, ~32.7M weekly downloads)

cli/src/sse-client.ts (746 lines):
- runSseClient(options): Promise<SseRunResult>. Happy path, reconnect,
  catch-up, terminal detection, external cancellation.
- AgentCore URL builder + required header
  (x-amzn-bedrock-agentcore-runtime-session-id = task_id, matches
  server.py extraction).
- Frame parsing: eventsource-parser → JSON.parse → Zod
  EventSchemas.safeParse; Zod failures still forward by type sniff
  so future AG-UI additions do not break terminal detection.
- Keepalive watchdog (default 30s grace) resets on any byte incl
  ': ping\n\n' comments; on starve, aborts and reconnects.
- Proactive 60-min restart at maxStreamSeconds (default 3500)
  pre-empts AgentCore's hard cap.
- Exponential backoff reconnect (1s initial, factor 2, max 30s).
- Dedup by id chain: id → messageId → toolCallId → runId →
  stepName+ts → name+ts → type+ts fallback. Set + ordered array;
  capped at 10000 with oldest-half eviction.
- 401 handling: one forced refresh via getAuthToken(), retry once;
  double-401 rejects with CliError(UNAUTHORIZED) + login hint.
- External AbortSignal support.
- Terminal codes fatal (no reconnect): AGENT_ERROR, AgentError,
  UNAUTHORIZED, ACCESS_DENIED, missing code; transient RUN_ERROR
  reconnects.
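The dedup-key fallback chain above can be sketched as one function; the field names mirror AG-UI camelCase, and the event shape here is illustrative rather than the repo's types:

```typescript
interface AgUiEvent {
  type: string;
  id?: string;
  messageId?: string;
  toolCallId?: string;
  runId?: string;
  stepName?: string;
  name?: string;
  timestamp?: number;
}

// Fallback chain from the commit: id → messageId → toolCallId → runId →
// stepName+ts → name+ts → type+ts. Callers store keys in a Set capped at
// 10000 entries (oldest-half eviction), which is omitted here.
function dedupKey(ev: AgUiEvent): string {
  if (ev.id) return ev.id;
  if (ev.messageId) return ev.messageId;
  if (ev.toolCallId) return ev.toolCallId;
  if (ev.runId) return ev.runId;
  if (ev.stepName) return `${ev.stepName}:${ev.timestamp}`;
  if (ev.name) return `${ev.name}:${ev.timestamp}`;
  return `${ev.type}:${ev.timestamp}`; // last-resort fallback
}
```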

Catch-up cursor architecture:
- SSE events from sse_wire.py do NOT carry DDB event_ids.
- Client accepts initialCatchUpCursor + delegates cursor management
  to the caller. Cursor advances only on events that carry an
  explicit 'id' field. watch.ts (Step 6) will inject DDB event_id as
  id on events returned from its catchUp closure; live events never
  advance the cursor; dedup handles replay overlap.

Comprehensive structured logging at DEBUG/INFO/WARN/ERROR as the user
requested this step.

Tests: 35 new (90 baseline -> 125 total). Coverage 95.6% stmts / 77%
branch / 88% funcs. mise //cli:build clean. No CDK, agent, or
design-doc changes.

Step 6 consumes this: watch.ts must inject DDB event_id as id on
catchUp() events so the cursor advances; absence only causes re-fetch
(dedup prevents double-emit).
…lback

Rewires bgagent watch to use the Step 5 SSE client as primary transport
with REST polling as fallback. Final client-side piece before deploy +
E2E (Step 7).

New --transport <sse|polling|auto> flag (default auto).
--stream-timeout-seconds passes through to maxStreamSeconds (default
3500 = 58 min; pre-empts AgentCore's 60-min cap). Existing --output
and existing text/JSON formatting preserved byte-for-byte.

New cli/src/ag-ui-translator.ts:
- translateDbRowToAgUi(row): mirror of agent/src/sse_wire.py in TS.
  Emits the same AG-UI triad/pair shapes (TEXT_MESSAGE_*, TOOL_CALL_*,
  STEP_*, CUSTOM for cost, RUN_ERROR) so live-SSE and catch-up-REST
  frames are indistinguishable to dedup + formatter.
- CRITICAL: every returned event carries the DDB event_id as `id` with
  a suffix (:start, :content, :end, :args, :call, :step-started,
  :step-finished, :thinking, :msg) so multi-frame groups dedup
  distinctly AND the SSE client's cursor advances on replay.
- agUiToSemantic(ev): inverse translator for formatter rendering
  (Option A boundary — text output byte-identical between transports).
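The suffixed-id scheme can be illustrated with a minimal sketch (one DDB row fanning out into a TEXT_MESSAGE triad; frame shapes and the helper name are hypothetical, following the suffix list above):

```typescript
// Illustrative only: each frame carries the DDB event_id plus a
// per-frame suffix, so dedup treats the frames of one group as
// distinct while the underlying DDB id remains recoverable.
interface AgUiFrame {
  type: string;
  id: string;
}

function translateTextMessageRow(eventId: string): AgUiFrame[] {
  return [
    { type: "TEXT_MESSAGE_START", id: `${eventId}:start` },
    { type: "TEXT_MESSAGE_CONTENT", id: `${eventId}:content` },
    { type: "TEXT_MESSAGE_END", id: `${eventId}:end` },
  ];
}
```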

watch.ts flow:
1. Snapshot fetch (getTaskEvents + getTask in parallel) serves three
   purposes: detect already-terminal task → print tail + exit;
   print history tail for late joiners; seed initialCatchUpCursor.
2. Transport selection:
   - --transport sse: runSseClient only; errors propagate.
   - --transport polling: pollTaskEvents only (Phase 1a behavior).
   - --transport auto (default): SSE first; on any error from
     runSseClient, fall back to polling with seeded cursor so no
     duplicates. Missing runtime_jwt_arn config → WARN + polling.
3. Post-SSE authoritative status: one more getTask after
   RUN_FINISHED/RUN_ERROR to set exit code from REST truth
   (COMPLETED → 0; anything else → 1).
4. Ctrl+C unification: single AbortController plumbed into both
   runSseClient({signal}) and pollTaskEvents' abortable sleep.
5. Consecutive-reconnect counter → WARN "network may be flaky"
   after 3 in a row.
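The transport selection in step 2 can be sketched as follows (names are illustrative; the real watch.ts wires in the actual SSE client and poller, and this synchronous version elides async/abort plumbing):

```typescript
// Hedged sketch of --transport handling: sse and polling are explicit;
// auto tries SSE first and falls back to polling on any error or when
// runtime_jwt_arn is missing from config.
type Transport = "sse" | "polling" | "auto";

interface WatchDeps {
  hasRuntimeJwtArn: boolean;
  runSse: () => void; // throws on any SSE failure
  runPolling: () => void; // resumes from the seeded cursor
}

function selectAndRun(transport: Transport, deps: WatchDeps): string {
  if (transport === "polling") {
    deps.runPolling();
    return "polling";
  }
  if (transport === "sse") {
    deps.runSse(); // errors propagate to the caller
    return "sse";
  }
  // auto
  if (!deps.hasRuntimeJwtArn) {
    deps.runPolling(); // WARN + polling in the real command
    return "polling (no runtime_jwt_arn)";
  }
  try {
    deps.runSse();
    return "sse";
  } catch {
    deps.runPolling(); // seeded cursor, so no duplicate events
    return "polling (fallback)";
  }
}
```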

configure.ts: new --runtime-jwt-arn <arn> flag persisting to config.
tryLoadConfig() helper enables partial-update merge — running
bgagent configure --runtime-jwt-arn <arn> alone works on an already-
configured install. First-time configure still requires the four
core fields (post-merge check). CliConfig's runtime_jwt_arn is
optional for back-compat; old config.json files without it load fine.
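The partial-update merge can be sketched like this (field names are hypothetical stand-ins for the four core fields, which the text above does not enumerate; the real helper reads/writes config.json):

```typescript
// Illustrative sketch: overlay only the flags given on this run onto
// any existing config, then enforce required core fields post-merge.
// runtime_jwt_arn stays optional for back-compat.
interface CliConfig {
  api_base_url?: string;
  user_pool_id?: string;
  app_client_id?: string;
  region?: string;
  runtime_jwt_arn?: string;
}

const REQUIRED: (keyof CliConfig)[] = [
  "api_base_url",
  "user_pool_id",
  "app_client_id",
  "region",
];

function mergeConfig(existing: CliConfig | null, updates: CliConfig): CliConfig {
  const merged: CliConfig = { ...(existing ?? {}), ...updates };
  const missing = REQUIRED.filter((k) => merged[k] === undefined);
  if (missing.length > 0) {
    // first-time configure without the core fields fails the post-merge check
    throw new Error(`Missing required config fields: ${missing.join(", ")}`);
  }
  return merged;
}
```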

Comprehensive structured logging (emphasized by the user for this step):
- DEBUG: transport choice, config read, snapshot (N events), cursor
  seed, each SSE event (type + id), reconnect attempts, catch-up
  replay, fallback rationale.
- INFO: transport selected, reconnecting, replayed N events, task
  completed/failed, clean exit.
- WARN: fallback-to-polling (with reason), missing runtime_jwt_arn,
  duplicate skipped, consecutive-reconnect > 3.
- ERROR: terminal RUN_ERROR fatal, config invalid, not authenticated,
  --transport sse unrecoverable.
- JSON mode discipline: every log line routes to process.stderr;
  stdout stays pure NDJSON. Test 11 asserts every stdout line
  JSON.parses cleanly.

Tests: 51 new (125 baseline → 176 total, 14 → 15 suites).
- ag-ui-translator.test.ts: 29 tests (10 per-type translator + 11
  agUiToSemantic + 6 round-trip + 2 edge).
- watch.test.ts: 18 new transport tests covering all flag
  combinations, happy/sad paths, missing-config, already-terminal,
  SIGINT, JSON mode discipline, catch-up id injection, fallback
  ordering + no-duplicates.
- configure.test.ts: 4 new tests (flag accepted, merge, first-time
  required-field check, back-compat old config).

No CDK changes. No agent changes. No design-doc prose changes.
mise //cli:build clean. No new runtime deps.

Step 7 risks (for the parent to validate in E2E):
- AG-UI messageId/toolCallId mismatch between live SSE (fresh ULIDs)
  and catch-up REST (DDB event_id with suffix). Expected — live wins
  dedup-wise, catch-up fills gaps. Verify no double-rendering on a
  mid-turn reconnect.
- Polling snapshot re-render on SSE-fallback: the cursor carries forward
  the DDB prefix of the last event id emitted during the SSE run;
  claimed correct + verified by Test 15, but worth eyeballing during E2E.
E2E of Phase 1b revealed that the two endpoints require different
Cognito JWT types because they check different claims:

- API Gateway's Cognito authorizer validates the 'aud' claim, which
  only Cognito ID tokens carry. It 401s on access tokens.
- AgentCore Runtime's customJWTAuthorizer (configured via
  RuntimeAuthorizerConfiguration.usingCognito(pool, [appClient]) in
  Step 1 CDK) renders CloudFormation with allowedClients set, which
  AgentCore validates against the 'client_id' claim — only present on
  access tokens. It rejects ID tokens.

Before this fix the CLI cached only the ID token and sent it to both
endpoints. REST worked; SSE to Runtime-JWT failed with UNAUTHORIZED.
The SSE client's 401 handler retried once and the --transport auto
path correctly fell back to polling, so the user-visible failure was
a warning line + a transport downgrade, not a hard error. But the
primary SSE path was effectively unusable.

Fix:
- Cache BOTH IdToken and AccessToken from Cognito in
  ~/.bgagent/credentials.json (new optional access_token field; old
  credential files continue to load and fall back to id_token with a
  debug line).
- Split the single getAuthToken() into:
  - getIdToken() → REST API (API Gateway Cognito authorizer)
  - getAccessToken() → AgentCore Runtime-JWT SSE endpoint
  - getAuthToken() kept as a backward-compat alias for getIdToken so
    api-client.ts and any other REST call path is unchanged.
- watch.ts's SSE client closure now calls getAccessToken() specifically.
- Login and refresh flows require both tokens from Cognito; tests
  updated to mock AccessToken alongside IdToken.
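The token split can be condensed into a sketch (shapes are illustrative; the legacy-fallback behaviour follows the back-compat description above, and the real helpers also handle refresh):

```typescript
// Hedged sketch: REST uses the ID token ('aud' claim), the SSE
// Runtime-JWT path uses the access token ('client_id' claim). A legacy
// credentials file without access_token falls back to id_token.
interface Credentials {
  id_token: string;
  access_token?: string; // new optional field; absent in legacy files
}

function getIdToken(creds: Credentials): string {
  return creds.id_token; // API Gateway Cognito authorizer validates 'aud'
}

function getAccessToken(creds: Credentials): string {
  // AgentCore's customJWTAuthorizer validates 'client_id' (access tokens)
  if (creds.access_token !== undefined) return creds.access_token;
  return creds.id_token; // legacy fallback (debug-logged in the real CLI)
}

// Backward-compat alias so existing REST call sites are unchanged.
const getAuthToken = getIdToken;
```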

Verified with curl:
- API Gateway + ID token → HTTP 200
- API Gateway + access token → HTTP 401 (WHY we keep id_token for REST)
- AgentCore Runtime-JWT + ID token → would 401 (WHY we need access token)
- AgentCore Runtime-JWT + access token → JWT authenticates; request
  reaches the container (separate unrelated container 502 under
  investigation).

Tests: 176 → 177 (+1: legacy credentials fallback). All CLI tests
pass. No CDK, agent, design-doc, or diagram changes. Users must
re-run 'bgagent login' once to populate access_token in their
credentials file; otherwise SSE falls back to polling with a hint.
Exercising the SSE path end-to-end against the deployed stack surfaced
three production bugs that blocked streaming from reaching the client.
Each is fixed and verified: the latest test run received the full live
stream of 74 AG-UI events (turns, tool calls, tool results, cost,
milestones) with correct timestamps and a real PR created at
scoropeza/agent-plugins#10. Two known issues remain for a fresh session
— see "Remaining work" in the doc + memory entries updated elsewhere.

Bugs fixed in this commit:

1. agent/src/server.py — _debug_cw was blocking module import with
   synchronous boto3 CloudWatch Logs writes. AgentCore runs a /ping
   health check within ~1s of container boot; if uvicorn hasn't
   bound port 8080 by then, the container is marked unhealthy and
   the runtime returns 424 "Runtime health check failed or timed
   out". Fix: CW writes run in a daemon thread (fire-and-forget),
   and the boot-time log uses plain print() to avoid spawning any
   thread during import. The debug path (used after the first
   request arrives) still writes to CloudWatch.

2. cli/src/sse-client.ts — was sending the 26-char task_id ULID as
   the X-Amzn-Bedrock-AgentCore-Runtime-Session-Id header, but
   AgentCore validates >=33 chars and returns HTTP 400 Bad Request.
   Fix: new buildRuntimeSessionId() helper that prefixes the task_id
   with 'bgagent-watch-' (40 chars total). Deterministic so reconnect
   attempts re-use the same session and AgentCore routes back to the
   same microVM (preserving in-progress session state).

3. cli/src/commands/watch.ts — AG-UI translator mints suffixed event
   ids like "01KPPVWM...:step-started" to keep TEXT_MESSAGE/
   TOOL_CALL triads dedup-unique. On SSE reconnect, the CLI was
   passing the full suffixed id to GET /events?after=, which fails
   the ULID validator added in Step 4 with "Invalid `after`
   parameter: must be a 26-character ULID." Fix: strip ':suffix'
   in the catchUp closure before calling apiClient.catchUpEvents.
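The fixes for bugs 2 and 3 reduce to two small helpers, sketched here with the prefix and suffix conventions quoted above (helper names mirror the text; the length check is an assumption based on the >=33-char validation described):

```typescript
// Bug 2: deterministic session id long enough for AgentCore's
// >=33-char validation; reconnects reuse it so routing returns to the
// same microVM. 'bgagent-watch-' (14) + 26-char ULID = 40 chars.
const SESSION_PREFIX = "bgagent-watch-";

function buildRuntimeSessionId(taskId: string): string {
  const sessionId = SESSION_PREFIX + taskId;
  if (sessionId.length < 33) {
    throw new Error(`session id too short for AgentCore (${sessionId.length} < 33)`);
  }
  return sessionId;
}

// Bug 3: strip the translator's ':suffix' so GET /events?after= receives
// the bare 26-char ULID its validator expects.
function stripIdSuffix(eventId: string): string {
  const colon = eventId.indexOf(":");
  return colon === -1 ? eventId : eventId.slice(0, colon);
}
```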

Debug infrastructure added (stays enabled by default for the rest of
Phase 1b development per explicit user direction — so adding new
debug lines later doesn't require redeploy to toggle a log level):

- _debug_cw helper in server.py writes to server_debug/<task_id>
  CloudWatch streams (AgentCore doesn't forward container stdout;
  only explicit CloudWatch Logs API writes land in APPLICATION_LOGS,
  matching the existing telemetry._emit_metrics_to_cloudwatch
  pattern). Wired through invoke_agent entry, _extract_invocation_params
  result, sync vs SSE routing decision, _invoke_sse entry, _SSEAdapter
  construction, attach_loop, _spawn_background, _run_task_background
  entry, _sse_event_stream entry/RUN_STARTED yield/event translations/
  keepalive pings/close sentinel/disconnect.
- Every CW write is non-blocking (daemon thread) so request latency
  is unaffected.
- Local-stdout print() is retained for docker-compose runs.

No test changes in this commit (tests still pass: agent 371, CLI 177,
CDK 745). No new deps. Known remaining issues documented separately:
(a) terminal RUN_FINISHED frame not delivered to the CLI after task
completion — CLI hits repeated 424s and never detects terminality;
(b) suspected duplicate-pipeline execution — REST-submit path fires
orchestrator → Runtime-IAM pipeline AND the subsequent SSE
invocation on Runtime-JWT spawns ANOTHER pipeline, yielding ~2×
expected event volume in DDB. These are Phase 1b design-level gaps
(not one-line fixes) flagged for the next session.
…esearch

First live SSE bring-up against the deployed stack surfaced a design gap
in the rev-4 plan: SSE invocations on Runtime-JWT were spawning a
SECOND pipeline when the task had already been launched via the
orchestrator on Runtime-IAM. Root cause: AgentCore's "same session_id
→ same microVM" routing is per-runtime-ARN only; cross-runtime live
attach requires an external pub/sub layer.

Competitive survey (docs/research/agent-streaming-patterns.md, new;
31 sources across CopilotKit, LangGraph Platform, OpenAI Assistants,
Mastra, Temporal, Vercel resumable-stream, AgentCore itself)
identified three dominant shapes: same-process streaming (used by
CopilotKit / Mastra / OpenAI), orchestrator+observer with pub/sub
(LangGraph join_stream, Vercel resumable-stream), and pull-based
(Temporal, OpenAI fallback). LangGraph Platform is documented as the
clearest reference for orchestrator+observer (Postgres log + Redis
pubsub); Vercel resumable-stream is the simpler AWS-friendly
equivalent.

Branch A chosen for Phase 1b (§9.13, new in rev 5):

- Path 1 — `bgagent submit --watch` / `bgagent run`: direct-submit
  via POST /v1/tasks with `execution_mode=interactive` → Lambda
  admits + writes TaskTable, SKIPS orchestrator → CLI opens SSE to
  Runtime-JWT → server.py runs pipeline same-process. True real-time.
  Reconnection within microVM lifetime attaches to existing adapter
  via new {task_id: adapter} registry (attach-don't-spawn).
- Path 2 — `bgagent submit` plain: default `execution_mode=orchestrator`
  keeps Phase 1a behaviour (pipeline on Runtime-IAM). CLI `watch` on
  such a task = polling (no cross-runtime live attach; that's Phase 1c).
- Non-interactive (Slack/webhook/cron): Path 2 plus DDB Streams fan-out
  per §8.9.

Trade-offs documented:
- Pipeline lifetime for both paths bounded by AgentCore maxLifetime
  (8h in our CDK). DDB persists log only, not execution continuation.
- Tasks > 8h need to leave AgentCore Runtime (Fargate / Step Functions,
  out of Phase 1b scope).
- Direct-submit task dies with its microVM if CLI disconnects long
  enough for idle / maxLifetime eviction; orchestrator-submit is
  indifferent to CLI presence.

New concrete deliverables flagged in the doc (to be implemented next):

1. CDK: add `execution_mode` field to POST /v1/tasks.
2. CreateTask Lambda: if `interactive`, skip orchestrator Lambda.Invoke.
3. server.py: attach-don't-spawn logic via {task_id: _SSEAdapter}
   registry.
4. server.py: /ping HealthyBusy while pipeline thread alive.
5. CLI: bgagent submit --watch (or bgagent run) flow — POST /v1/tasks
   with execution_mode=interactive, then open SSE to Runtime-JWT.

Phase 1c roadmap: add pub/sub (IoT Core MQTT or ElastiCache Redis +
resumable-stream port) for real-time cross-runtime attach. Additive;
Branch A code does not change.

Diagram updates to v2 (surgical text edits, no layout change):
- Page 2 caption: explicit two-path architecture.
- Page 5 caption + sequence arrow: reflect same-process pipeline on
  Runtime-JWT for interactive path.

Research brief: docs/research/agent-streaming-patterns.md, ~2300 words,
31 sources, flagged time-sensitive claims (especially AgentCore quotas).

No code changes in this commit.
… + multi-subscriber

Implements the server-side Branch A design from §9.13:

1. _SSEAdapter multi-subscriber fan-out (sse_adapter.py):
   - New _Subscriber dataclass (queue + dropped_count).
   - Default subscriber created eagerly at __init__ (backward-compatible
     get() still works even when write_*() fires before first get()).
   - New subscribe() returns a fresh per-observer queue; unsubscribe(q)
     removes it.
   - _broadcast_from_loop writes to all subscriber queues with per-sub
     drop-oldest backpressure (one slow consumer can't stall others).
   - _broadcast_sentinel_from_loop fans out close sentinel to all.
   - dropped_count is the sum of per-sub drops + _undelivered_count
     (no-loop / closed cases).

2. server.py attach-don't-spawn logic:
   - New module-level _active_sse_adapters: dict[str, _SSEAdapter]
     tracked under _threads_lock.
   - _invoke_sse checks the registry first: if an adapter with active
     subscribers exists for this task_id, subscribe() a new queue and
     return a StreamingResponse backed by it — do NOT call
     _spawn_background. Solves the duplicate-pipeline bug (two
     pipelines running the same task, observed in the previous
     bring-up run).
   - Spawn path now: register in _active_sse_adapters BEFORE spawning
     (so a rapid reconnect race attaches instead of double-spawning),
     then subscribe() the observer's queue BEFORE spawning (so no
     events are missed between spawn and first drain iteration).
   - _run_task_background's finally block removes the adapter from
     the registry. Only removes if the current entry matches — guards
     against a newer adapter having replaced it.
   - Rollback on spawn failure: adapter removed from registry if
     _spawn_background raises.

3. _sse_event_stream takes sub_queue per-observer parameter:
   - Drains THIS observer's queue (not the adapter's default) so
     multiple observers attached to the same pipeline each receive
     the full event stream independently.
   - On client disconnect / generator cancellation, calls
     adapter.unsubscribe(sub_queue) — leaves the adapter and other
     subscribers intact. The background pipeline keeps running and
     ProgressWriter keeps writing to DDB.

4. /ping HealthyBusy (§9.13.2 idle-evict defense-in-depth):
   - Returns {"status": "HealthyBusy"} while any pipeline thread is
     alive, signalling AgentCore not to idle-evict the microVM. Per
     AWS runtime-long-run docs.
   - Returns {"status": "healthy"} otherwise.
   - 503 + "unhealthy" on _background_pipeline_failed unchanged.
   - _last_ping_status module-level: /ping transitions logged to
     CW once (healthy <-> HealthyBusy <-> unhealthy) so the stream
     is not spammed with per-probe lines.

5. Tests — 371 baseline → 377:
   - test_ping_reports_healthy_when_idle.
   - test_ping_reports_healthybusy_when_pipeline_alive.
   - test_sse_attach_does_not_spawn_second_pipeline: registry
     pre-populated, _invoke_sse returns StreamingResponse WITHOUT
     calling _spawn_background; subscriber_count increments.
   - test_multi_subscriber_broadcast: two subscribers both receive
     the same event.
   - test_multi_subscriber_close_sentinel_fans_out: close()
     sentinel reaches every subscriber.
   - test_registry_cleanup_on_pipeline_completion: finally block
     removes the adapter from the registry.
   - Existing test_sse_stream_client_disconnect_calls_detach renamed
     and updated to verify unsubscribe(queue) instead of detach_loop.

Comprehensive CW debug logs at every state transition (registry
insert / remove / attach / rollback, ping status transitions,
subscribe / unsubscribe). All via the existing _debug_cw helper
(writes in daemon thread so container startup is not blocked).

No CLI changes, no CDK changes, no design-doc changes (already
landed in 48837af). Part 1 (CDK admission execution_mode +
bgagent run command) will ship in a separate commit.

Known gap: the CDK POST /v1/tasks still unconditionally fires the
orchestrator Lambda when a task is created, so `bgagent run` (Part 1,
not yet implemented) cannot yet use this path to submit a task
without ALSO firing the orchestrator. Part 1 adds the
execution_mode field to skip the orchestrator invoke when the CLI
is going to drive the pipeline itself via SSE.
…xport helpers

Partial landing of the CDK admission + CLI helper-export work for rev 5
Branch A. Stops short of the final `bgagent run` command — that needs
one more session of work. Everything below is test-green and the branch
is in a clean state for handoff.

CDK changes:

1. types.ts + cli/src/types.ts (kept in sync per AGENTS.md):
   - New ExecutionMode = 'orchestrator' | 'interactive'.
   - CreateTaskRequest now has optional execution_mode.

2. create-task-core.ts:
   - New optional allowedExecutionModes parameter (default
     ['orchestrator']). Validates body.execution_mode against the
     allowlist; 400 VALIDATION_ERROR with a clear message if not
     allowed.
   - execution_mode='interactive' SKIPS the orchestrator Lambda.Invoke
     and logs "Admission: interactive mode, orchestrator invoke
     skipped". A default, undefined, or explicit 'orchestrator' value
     preserves Phase 1a behaviour byte-for-byte.

3. create-task.ts (Cognito-authed POST /v1/tasks):
   - Passes ['orchestrator', 'interactive'] → both modes allowed.
   - webhook-create-task.ts unchanged → default ['orchestrator'] →
     webhook callers cannot request interactive mode (returns 400).

4. Tests (create-task-core.test.ts): 5 new cases — interactive skips
   orchestrator, explicit orchestrator still fires, undefined is
   orchestrator (regression), webhook rejects interactive, bogus
   mode rejected. Full CDK suite: 750 passed / 0 failed (745
   baseline + 5 new).
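The admission allowlist from item 2 can be sketched as follows (return shape is illustrative; the real handler builds a full API Gateway 400 response):

```typescript
// Hedged sketch: each endpoint passes its own allowedExecutionModes,
// defaulting to ['orchestrator'] so webhook callers cannot request
// interactive mode. Undefined preserves Phase 1a behaviour.
type ExecutionMode = "orchestrator" | "interactive";

const KNOWN_MODES: ExecutionMode[] = ["orchestrator", "interactive"];

function validateExecutionMode(
  requested: string | undefined,
  allowed: ExecutionMode[] = ["orchestrator"]
): { ok: true; mode: ExecutionMode } | { ok: false; status: 400; message: string } {
  const mode = (requested ?? "orchestrator") as ExecutionMode;
  if (!KNOWN_MODES.includes(mode) || !allowed.includes(mode)) {
    return {
      ok: false,
      status: 400,
      message: `execution_mode '${requested}' is not allowed for this endpoint`,
    };
  }
  return { ok: true, mode };
}
```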

CLI changes (helper exports for the upcoming `bgagent run` command):

5. cli/src/commands/watch.ts:
   - Exported makeFormatter, fetchInitialSnapshot, runSse, RunSseArgs
     so a new run command can reuse the same SSE-watch machinery
     without duplicating the renderer / reconnect / terminal-state
     logic.
   - No behaviour change; purely public-API widening.

6. cli/test/sse-client.test.ts: updated the session-id header
   expectation to match the buildRuntimeSessionId() prefix (fix
   landed earlier in cd093b2 — this test was still expecting the
   bare task_id).

Not yet in this commit (next session):

- cli/src/commands/run.ts — composes apiClient.createTask({execution_mode:
  'interactive'}) + runSse(). Should be a small file (~100-150 lines).
- cli/src/bin/bgagent.ts — register run command alongside submit/watch.
- cli/src/api-client.ts — verify createTask forwards execution_mode
  (current shape spreads ...CreateTaskRequest so it already does, but
  needs a test assertion).
- cli/test/commands/run.test.ts — new tests for the run flow.
- Deploy + E2E validation on the deployed stack.

Test counts: CDK 745 -> 750 (+5). CLI unchanged at 177 (no new tests
yet for run command). Agent unchanged at 377 (Part 2 committed in
3dab225 already).
…RE guard

Completes Phase 1b rev-5 Branch A (design §9.13). Adds the direct-submit
interactive CLI path and the cross-runtime safety guard that prevents
duplicate pipeline execution when a watcher opens SSE against a task
that was submitted via the orchestrator.

CLI — `bgagent run`:
* New command composes createTask({execution_mode: 'interactive'}) + runSse
  so the pipeline executes same-process with the SSE stream on Runtime-JWT
  (real-time, no orchestrator hop).
* Requires runtime_jwt_arn in config; errors with a clear pointer to
  `bgagent configure` otherwise.
* Registered in bin/bgagent.ts between `submit` and `list`.
* 11 tests in cli/test/commands/run.test.ts.
* 1 api-client test: execution_mode is forwarded in POST body.

RUN_ELSEWHERE guard (§9.13.4):
* CDK: TaskRecord gains execution_mode; create-task-core persists it on
  the TaskTable record so server.py can read it. 1 new test asserts
  persistence for both interactive and orchestrator modes.
* Agent: _invoke_sse now checks task_state.get_task() before spawning.
  If the task record says execution_mode != 'interactive', returns
  HTTP 409 {code: RUN_ELSEWHERE, execution_mode} instead of spawning.
  Fails open when record is missing (blueprints / legacy tasks).
  3 new tests cover orchestrator-rejected, interactive-allowed, and
  fail-open behaviours.
* CLI: sse-client recognizes 409 RUN_ELSEWHERE as a non-retriable error,
  throws CliError so `watch --transport auto` catch-block falls back to
  polling cleanly (no reconnect storm). Generic 409s still take the
  regular reconnect path. 2 new sse-client tests cover both paths.
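The guard's decision logic (implemented in the Python agent) reduces to a small rule, sketched here with illustrative shapes; treating a record with a missing execution_mode as fail-open is an assumption based on the legacy-task behaviour described above:

```typescript
// Hedged sketch of the RUN_ELSEWHERE guard: orchestrator-mode tasks get
// 409 instead of a duplicate pipeline; interactive tasks may spawn;
// missing records/modes fail open (blueprints / legacy tasks).
type GuardDecision =
  | { action: "spawn" }
  | { action: "reject"; status: 409; code: "RUN_ELSEWHERE"; execution_mode: string };

function runElsewhereGuard(
  record: { execution_mode?: string } | null
): GuardDecision {
  if (record === null || record.execution_mode === undefined) {
    return { action: "spawn" }; // fail open: legacy task, no record/mode
  }
  if (record.execution_mode !== "interactive") {
    return {
      action: "reject",
      status: 409,
      code: "RUN_ELSEWHERE",
      execution_mode: record.execution_mode,
    };
  }
  return { action: "spawn" };
}
```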

Test status: agent 380 (was 377), CDK 751 (was 750), CLI 191 (was 177).
Live bring-up of the rev-5 Branch A path surfaced three production bugs
after the initial rev-5 deploy. All three are fixed here and validated
end-to-end against backgroundagent-dev/us-east-1.

1. CDK: RuntimeJwt execution role was missing ECR pull perms.
   Root cause: the L2 `AgentRuntimeArtifact.fromAsset` AssetImage.bind()
   guards against double-grant with `this.bound = true`, so when the
   SAME artifact instance is attached to two Runtimes the second
   runtime's role never receives the ECR pull statements. Image pull
   failed with "no basic auth credentials", AgentCore then returned
   424 Failed Dependency on every /invocations for the JWT runtime.
   Fix: create two AssetImage instances (one per runtime). The
   DockerImageAsset dedupes on asset hash so only one image is
   published to ECR.

2. Agent: server.py on Runtime-JWT could not run pipelines for
   CLI-submitted tasks. The rev-5 design (§9.13.4) requires server.py
   to hydrate repo_url + task description + max_turns / max_budget /
   task_type / branch_name / pr_number from TaskTable when the SSE
   body only carries `{task_id}`. Previously `_extract_params` read
   only from the invocation body, so pipelines blew up with
   `ValueError: repo_url is required` immediately after spawn.
   Fix: extend the RUN_ELSEWHERE guard block in `_invoke_sse` to
   also hydrate missing params from the TaskTable record when the
   mode is `'interactive'`. Preserves the orchestrator contract
   (which passes full params in the invocation body — we only fill
   fields the caller didn't send).

3. CLI+CDK: `bgagent watch --transport auto` against an orchestrator-
   submitted task hit a reconnect storm. server.py correctly returns
   409 RUN_ELSEWHERE, but AgentCore wraps non-2xx responses as 424
   to the client, so the CLI never saw the 409 code and fell into
   generic http_error retry. Design-correct fix: expose `execution_mode`
   on the TaskDetail REST response so watch can choose the transport
   from the snapshot instead of probing Runtime-JWT. watch.ts now
   routes directly to polling whenever `snapshot.executionMode !==
   'interactive'` under `--transport auto`. The 409 RUN_ELSEWHERE
   guard stays server-side as defence-in-depth.
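The hydration fix in bug 2 follows a simple fill-missing rule, sketched here (field names follow the text; the real logic lives in the Python agent's `_invoke_sse`):

```typescript
// Hedged sketch: parameters present in the SSE invocation body always
// win; only fields the caller did not send are filled from the
// TaskTable record. Preserves the orchestrator contract, which passes
// full params in the body.
interface TaskParams {
  repo_url?: string;
  description?: string;
  max_turns?: number;
  max_budget?: number;
  task_type?: string;
  branch_name?: string;
  pr_number?: number;
}

function hydrateParams(body: TaskParams, record: TaskParams): TaskParams {
  const hydrated: TaskParams = { ...body };
  for (const [key, value] of Object.entries(record)) {
    if ((hydrated as Record<string, unknown>)[key] === undefined) {
      (hydrated as Record<string, unknown>)[key] = value; // fill gap only
    }
  }
  return hydrated;
}
```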

CDK: TaskDetail gains `execution_mode: ExecutionMode | null`
(null for legacy tasks that predate rev 5). `toTaskDetail` forwards
the field.

CLI: watch.ts SnapshotResult gains `executionMode`. Two new tests
cover orchestrator-mode → polling and legacy (null) → polling.

E2E results against backgroundagent-dev:
* E2E-A (`bgagent run`, interactive): task 01KPRN3XKQ2V4AVNT6GV1E3PEN
  → PR scoropeza/agent-plugins#11, COMPLETED in 54s, 61 real-time
  events streamed, exit 0.
* E2E-B (`submit` + `watch --transport auto`, orchestrator):
  snapshot resolves execution_mode=orchestrator, CLI skips SSE,
  polling streams turn events and tool calls, final status FAILED
  (max-turns exhausted — orchestrator-side behaviour unaffected by
  this fix), exit 1. No 424 reconnect storm.

Tests: agent 380, CDK 751, CLI 193. TS typecheck clean.
… nits)

Multi-agent validation pass (code-reviewer, silent-failure-hunter,
pr-test-analyzer, type-design-analyzer) surfaced five P0 concerns and
several key nits. Per user direction (Option C — no cutting corners),
land all P0s inline + key nits; defer P1s and design refactors to
`docs/design/PHASE_1B_REV5_FOLLOWUPS.md` with full rationale.

P0 fixes:

* P0-a — latent AttributeError in _SSEAdapter.write_agent_error:
  referenced non-existent `self._dropped_count` inside a "last-ditch"
  except block. Fix to the real `_undelivered_count` + regression test
  (test_write_agent_error_fallback_uses_undelivered_counter).
* P0-b — task_state.get_task conflated NotFound with FetchFailed (both
  returned None via bare `except Exception`). Rev-5 RUN_ELSEWHERE guard
  on Runtime-JWT would spawn a duplicate pipeline on orchestrator-mode
  tasks during a DDB blip. Introduce new `TaskFetchError` exception;
  `get_task` raises on DDB/boto failures, returns None only on real
  absence. server.py `_invoke_sse` now returns 503
  TASK_STATE_UNAVAILABLE on fetch failure (fail-closed) and still
  fails-open for legacy tasks with no record. Four new tests cover
  the three return shapes + the 503 response.
* P0-d — `bgagent run` wraps `runSse` in try/catch. On fatal SSE
  error: cancels the task (so it doesn't sit stranded in SUBMITTED
  occupying a concurrency slot), prints the task_id + a
  `bgagent status <task_id>` resume hint, and exits non-zero. Three
  new tests: SSE-fails-immediately-and-cancels, cancel-also-fails-
  bubbles-SSE-error, SIGINT-propagates-abort-to-runSseClient.
* P0-e — post-hydration validation in server.py `_invoke_sse`. New
  `_validate_required_params` helper checks for the minimum viable
  param set (repo_url; new_task needs issue OR description;
  pr_iteration/pr_review need pr_number). Returns 500
  TASK_RECORD_INCOMPLETE with a list of missing fields instead of
  letting the pipeline crash inside setup_repo. Two new tests
  (hydration-fills-missing, explicit-caller-wins-over-record) + one
  for the 500 + one for the helper's type-dispatch logic.

Key nits:

* Lift `validateStreamTimeout` + `DEFAULT_STREAM_TIMEOUT_SECONDS=3500`
  into `cli/src/commands/_stream.ts` (shared between run + watch).
* `SnapshotResult.executionMode: string | null` → `ExecutionMode |
  null` (compile-time exhaustiveness restored).
* `TaskDetail.execution_mode?:` in CLI → required `ExecutionMode |
  null` (matches CDK side, which always sets it).
* server.py now exports `EXECUTION_MODE_ORCHESTRATOR` and
  `EXECUTION_MODE_INTERACTIVE` constants; inline string literals
  replaced.
* Heartbeat 45s magic number → `_HEARTBEAT_INTERVAL_SECONDS`.
* `logInfo` no-op `if/else` branches in watch.ts removed; parameter
  renamed `_isJson` for call-site documentation.
* `run.ts` `logInfo(message)` signature → `logInfo(isJson, message)`
  to match watch.ts; `Verbose mode:` line gated on `isVerbose()`.
* CDK two-artifact workaround comment: replace `<check>` placeholder
  with a pointer to `DEFER_FOLLOWUPS.md` CDK-1 + detailed location of
  the L2 bug (`AssetImage.bind` guard in the vendored module).

Tests added:

* agent: +9 tests (hydration-full, hydration-preserves-caller, 503-
  TASK_STATE_UNAVAILABLE, 500-TASK_RECORD_INCOMPLETE, validator
  dispatch for pr_iteration/pr_review/new_task, get_task-notfound,
  get_task-found, get_task-empty-item, get_task-raises-TaskFetchError,
  write_agent_error-fallback).
* cdk: +1 test (both runtime execution roles carry ECR pull perms —
  regression for the L2 double-attach bug).
* cli: +3 tests (run-SSE-fatal-cancels-task, run-cancel-also-fails-
  bubbles-SSE-error, run-SIGINT-forwards-abort-to-runSseClient).

Other:

* New docs/interactive-agents-phases-v3.drawio (generated by
  diagram-specialist) with execution_mode fork on POST /tasks,
  RUN_ELSEWHERE guard + hydration annotation, Path-1 vs Path-2
  sequence, per-microVM reconnection-attach note, and a "Rev-5
  invariants" callout. v2 kept as a reference baseline.
* New docs/design/PHASE_1B_REV5_FOLLOWUPS.md catalogs the 14 deferred
  items (P0-c stranded-task reconciler, all P1s, type refactors,
  observability gaps, polling-interval design deviation, upstream
  L2 bug filing).

Test totals: agent 380 → 390 (+10), CDK 751 → 752 (+1), CLI 193 →
196 (+3). TS typecheck clean.

Re-validated against deployed backgroundagent-dev:
* POST-P0 E2E-A (interactive/JWT/SSE) — task 01KPS5A90S6PPX8D9083BCC8P3
  → PR scoropeza/agent-plugins#14, COMPLETED, hydration log confirms
  5 params filled from TaskTable.
* POST-P0 E2E-B (orchestrator/IAM/polling) — task
  01KPS6TM1K74QAV2ZGPBB8GDTG → snapshot reads execution_mode=
  orchestrator, CLI prints "using polling (SSE only supported for
  interactive tasks)", no SSE attempt, 33 turns streamed via polling,
  terminal status FAILED (max_turns=6 exhaustion — pipeline-side,
  unrelated to watch).
…S_PER_USER 3→10

Follow-up to fe84de5 pre-push hardening: land the last P0 from the
multi-agent validation pass. User feedback during the review exposed a
live instance of the stranded-task problem (2 interactive tasks
occupying concurrency slots after CLI-side failures pre-P0-d), which
made the case to fix it in-scope rather than deferring to a follow-up
PR.

Stranded-task reconciler:

* New construct `cdk/src/constructs/stranded-task-reconciler.ts` wires an
  EventBridge schedule to a new Lambda running every 5 minutes by
  default.
* New handler `cdk/src/handlers/reconcile-stranded-tasks.ts` queries
  `TaskTable.StatusIndex` for `status IN (SUBMITTED, HYDRATING)` with
  `created_at < cutoff`, classifies each row by `execution_mode`, and
  fails any exceeding its applicable timeout:
    - interactive:       300 s (STRANDED_INTERACTIVE_TIMEOUT_SECONDS)
    - orchestrator:     1200 s (STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS)
    - legacy (missing): 1200 s (treated as orchestrator)
  Thresholds are env-tunable per Lambda without code changes.
* Transition to FAILED is conditional on the current status to avoid
  racing a legitimate transition; on CCFE (ConditionalCheckFailedException)
  we log `advanced_during_reconcile` and move on.
* Emits two events on fail: `task_stranded` carrying
  `{code: 'STRANDED_NO_HEARTBEAT', prior_status, execution_mode,
  age_seconds}` for observability, and the standard `task_failed` that
  existing consumers already react to.
* Decrements the user's concurrency counter conditionally. If the
  counter is already 0 (double-release or drift), we swallow the CCFE —
  the existing concurrency reconciler catches lingering drift.
* Does NOT touch RUNNING / FINALIZING — those remain handled by
  `pollTaskStatus`'s `agent_heartbeat_at` timeout path in
  `orchestrator.ts`.
* Handler `entry` uses `@aws-sdk/client-dynamodb` inline (not
  `lib-dynamodb`) to match the sibling reconciler pattern and keep the
  Lambda bundle lean.
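The reconciler's classification rule can be sketched as follows (threshold values are the defaults quoted above; env-var overrides and the DDB query are elided):

```typescript
// Hedged sketch: a SUBMITTED/HYDRATING row is stranded when its age
// exceeds the timeout for its execution mode; legacy rows with no mode
// use the orchestrator threshold.
const STRANDED_INTERACTIVE_TIMEOUT_SECONDS = 300;
const STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS = 1200;

function isStranded(
  executionMode: string | undefined,
  ageSeconds: number
): boolean {
  const timeout =
    executionMode === "interactive"
      ? STRANDED_INTERACTIVE_TIMEOUT_SECONDS
      : STRANDED_ORCHESTRATOR_TIMEOUT_SECONDS; // legacy -> orchestrator
  return ageSeconds > timeout;
}
```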

8 new handler unit tests cover: no-candidates no-op, interactive stale
→ fails + events + decrement, orchestrator young → ignored,
orchestrator old → fails, legacy no-mode → orchestrator threshold, race
with legitimate transition (CCFE clean skip), both SUBMITTED + HYDRATING
queried, pagination via `ExclusiveStartKey`.

MAX_CONCURRENT_TASKS_PER_USER 3 → 10:

* `cdk/src/constructs/task-orchestrator.ts:163` default raised. 3 was
  too tight for power-user CLI flows (`bgagent run` during iterative
  development triggered the cliff routinely). The stranded-task
  reconciler now prevents abandoned tasks from occupying slots, so the
  bump is ergonomic rather than load-bearing.
* Existing test updated (`task-orchestrator.test.ts`).
* The `?? 3` fallback → `?? 10` only; callers that pass an explicit
  value are unaffected.

Docs:

* `docs/design/PHASE_1B_REV5_FOLLOWUPS.md` — move P0-c from "deferred"
  to "LANDED" with the actual design delivered. Same for
  MAX_CONCURRENT_TASKS_PER_USER. Also add new entry DATA-1 (split DDB
  `turns` field into `turns_attempted` + derived `turns_completed` —
  user flagged during review; SDK reports attempted count including
  the cap-check, cosmetic not bug).

Tests: agent unchanged at 390, CDK 752 → 760 (+8 reconciler handler
tests; existing task-orchestrator test updated for new default), CLI
unchanged at 196. TS typecheck clean.

Deployed to backgroundagent-dev: StrandedTaskReconciler Lambda live,
env verified (INTERACTIVE=300 ORCHESTRATOR=1200). Running every 5
minutes on EventBridge.

Round 1 of the pre-push follow-ups. Three correctness fixes from the
silent-failure-hunter pass, plus a fix for the observability gap the
concurrency incident exposed. Plan in
docs/design/PHASE_1B_REV5_FOLLOWUPS.md.

P1-3 — attach-path subscribe() failure no longer duplicates pipeline
(agent/src/server.py):
* When `_invoke_sse` finds a live adapter in the registry but
  `adapter.subscribe()` raises (close race, queue-lifecycle bug), we
  now return HTTP 503 `{"code": "SSE_ATTACH_RACE"}` instead of falling
  through to spawn. Falling through would have duplicated the pipeline
  inside a single microVM.
* New test
  `test_sse_attach_subscribe_failure_returns_503_no_spawn` asserts no
  spawn is called and the registry is left untouched on subscribe
  failure.

P1-1 — 409 on the SSE path is always terminal (cli/src/sse-client.ts):
* Prior behavior: if the 409 body wasn't JSON or didn't carry
  `code: RUN_ELSEWHERE`, the code fell through to the generic
  reconnect path — i.e., retried against a server that was
  deliberately rejecting the request (reconnect storm). Now any 409 is
  terminal: RUN_ELSEWHERE → `CliError('RUN_ELSEWHERE...')` (caller
  falls back to polling), other 409 → `CliError('HTTP 409 ... body:
  ...')` with a 500-byte body excerpt so operators can see what the
  server said. Non-retriable.
* Updated existing test and added
  `HTTP 409 with non-JSON body → terminal CliError`.

OBS-4 — interactive path records session_id on TaskTable
(agent/src/task_state.py + agent/src/server.py):
* New helper `task_state.write_session_info(task_id, session_id,
  agent_runtime_arn)`. Conditional UpdateItem on
  `status IN (SUBMITTED, HYDRATING)` so a concurrent legitimate
  transition wins cleanly.
* `_invoke_sse` calls it just before spawning the pipeline for
  `execution_mode='interactive'` tasks, passing the AgentCore
  session header value + (for now) an empty runtime ARN.
* Cancel-task Lambda resolves the runtime ARN from `execution_mode`
  against two new env vars `RUNTIME_IAM_ARN` + `RUNTIME_JWT_ARN`
  (cdk/src/handlers/cancel-task.ts + cdk/src/constructs/task-api.ts).
  `StopRuntimeSession` now picks the correct runtime for interactive
  tasks that only have `session_id` on the record.
* Four new tests in test_task_state.py (writes both fields, no-op on
  empty, ConditionalCheckFailed skip, partial session-only), two in
  test_server.py (spawn-writes-session-info + skip-when-both-missing).

Important design note added to agent.ts: an earlier attempt injected
each runtime's own ARN as an env var on the runtime itself. That
creates a CFN cycle (the Runtime's properties reference its own
`AgentRuntimeArn` GetAtt). The solution moved to cancel-task where
both runtime ARNs are already in scope at CDK synth time with no
cycle. `agent_runtime_arn` stays null on the interactive TaskTable
record; cancel resolves from `execution_mode`.

Test totals: agent 390 → 397 (+7), CDK 760 → 760 (existing tests
unchanged), CLI 196 → 197 (+1 non-JSON 409 test). TS typecheck clean.

Deployed to backgroundagent-dev and validated with a fresh E2E-A run:
task 01KPSEF0G8C2141JFF0CVGY4CD → PR scoropeza/agent-plugins#15,
COMPLETED, session_id "bgagent-watch-01KPSEF..." now recorded on
TaskTable (was null on all prior interactive-path tasks).

Two follow-ups from the silent-failure-hunter pass that surface
triage-critical information previously lost at debug-level only.

P1-2 — post-SSE getTask failure no longer silent
(cli/src/commands/watch.ts):

Before, if the authoritative-status REST call after the SSE terminal
event failed, the CLI silently inferred status from the AG-UI frame
type, logged only via `debug()` (invisible without `--verbose`), and
printed "Task completed." / "Task failed." without any indication
that the status was inferred. In practice this means a user sees a
green "completed" even when REST is down and the server might have
finalized as CANCELLED or FAILED.

Now on getTask failure we:
* Emit a `WARN:` line to stderr with the underlying error and a
  `bgagent status <task_id>` suggestion.
* Suffix the terminal line with ` (inferred)` so it is self-evident
  that the status may not match the server's view.
* Exit code still uses the inferred status (nothing else to do).

Two new tests: RUN_FINISHED + getTask fails → warn + inferred
completed (exit 0); RUN_ERROR + getTask fails → warn + inferred
failed (exit 1).

P1-5 — rev-5 except blocks now include tracebacks in CloudWatch
(agent/src/server.py):

The AgentCore runtime container's stdout is not forwarded to
APPLICATION_LOGS (server.py docstring already flags this). Before,
every bare `except Exception` in the rev-5 paths logged only
`{type(exc).__name__}: {exc}` via `_debug_cw`, losing the call-site
traceback and making programming-bug triage expensive.

* New helper `_debug_cw_exc(message, exc, *, task_id=None)` formats
  the full traceback and hands it to `_debug_cw` so it reaches the
  custom CloudWatch stream alongside the exception.
* Replaced 7 rev-5-era call sites:
  `_extract_invocation_params FAILED`, `_invoke_sse FAILED`,
  attach-path `subscribe() FAILED`, spawn-path `_SSEAdapter
  construction FAILED`, `attach_loop FAILED`, `subscribe FAILED in
  spawn path`, `_spawn_background FAILED`. Removed now-redundant
  `traceback.print_exc()` calls at those sites (stdout was
  swallowed in production anyway).

Test totals: agent 397 (unchanged — no new behavior assertion; the
helper is exercised indirectly through the existing P1-3 + SSE
construction tests), CDK 760 (unchanged), CLI 197 → 199 (+2 P1-2
tests). TS typecheck clean.

No infra change — agent/src changes but no deploy needed until we
actually want to exercise P1-5 in live CloudWatch.

bgagent added 26 commits April 30, 2026 15:27

Implements design INTERACTIVE_AGENTS.md §5.2: a recency-biased, LLM-free
snapshot of task state assembled from a single parallel fetch of
TaskRecord + most-recent N events. Never calls an LLM, never hallucinates,
safe to call repeatedly.

Server:
- GET /tasks/{id}/events grows ?desc=1 to return newest-first, so the
  status snapshot runs O(limit) instead of walking the full event stream.
- desc+after is rejected with 400 (forward cursor vs descending scan is
  a client bug).
- next_token is suppressed on descending pages - LastEvaluatedKey
  carries no direction, so a follow-up caller that drops desc would
  silently interleave ascending results.

CLI:
- New formatStatusSnapshot(task, events, now) - pure, side-effect-free,
  injectable clock. Defensively re-sorts events by ULID so an upstream
  ordering regression can't silently render a stale tool call as
  'Current'.
- Graceful degradation: missing fields render as '-' (just-submitted
  task, no cost recorded, malformed timestamp, no milestones yet).
- Terminal tasks prefer the SDK-reported duration_s over orchestrator
  wall-clock math.
- ApiClient.getStatusSnapshot runs getTask + getTaskEvents(desc=1) in
  parallel via Promise.all (fail-fast by design).
- status command: default path -> snapshot; --wait -> existing full
  detail block-and-poll; --output json -> raw TaskDetail contract
  preserved.
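The defensive re-sort leans on ULIDs being lexicographically sortable. A minimal sketch, assuming a simplified event shape (`sortEventsByUlid` and `TaskEvent` are illustrative names, not the CLI's real exports):

```typescript
// Hypothetical sketch: a ULID's prefix encodes a millisecond timestamp, so
// plain lexicographic order matches creation order. Re-sorting before picking
// 'Current' means an upstream ordering regression (e.g. a newest-first ?desc=1
// page passed through unreversed) cannot render a stale tool call as latest.
interface TaskEvent {
  event_id: string; // ULID; lexicographic order == chronological order
  event_type: string;
}

function sortEventsByUlid(events: TaskEvent[]): TaskEvent[] {
  // Copy first so the caller's array is left untouched (pure function).
  return [...events].sort((a, b) =>
    a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0,
  );
}
```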

Tests: CLI 110->125 (+15), CDK 867->872 (+5). Agent unchanged at 391.

References: docs/design/INTERACTIVE_AGENTS.md §5.2, AD-6.

Implements INTERACTIVE_AGENTS.md §5.3 adaptive polling and both Chunk D
carry-forwards: transient-failure retry with exponential backoff, and
AbortSignal propagation for clean Ctrl+C.

Adaptive cadence (new `nextCadence` pure state machine):
- 500 ms when events are arriving
- 1 s / 2 s / 5 s ladder on consecutive empty polls, capped at 5 s
- Resets to fast on the next event regardless of backoff depth
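A minimal sketch of that ladder as a pure function, using the constants above (the real `nextCadence` signature is not reproduced here, so the shape is an assumption):

```typescript
const FAST_MS = 500;                          // events arriving
const IDLE_LADDER_MS = [1_000, 2_000, 5_000]; // consecutive empty polls, capped at 5 s

// Pure state machine: caller threads emptyPolls back in on each iteration.
function nextCadence(
  emptyPolls: number,
  gotEvents: boolean,
): { delayMs: number; emptyPolls: number } {
  if (gotEvents) return { delayMs: FAST_MS, emptyPolls: 0 }; // reset, regardless of depth
  const next = emptyPolls + 1;
  const idx = Math.min(next, IDLE_LADDER_MS.length) - 1;     // clamp to ladder top
  return { delayMs: IDLE_LADDER_MS[idx], emptyPolls: next };
}
```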

Retry policy (`withTransientRetry`):
- Whitelist classification - only 5xx ApiError + fetch network errors
  (TypeError with 'fetch failed') are retried. 4xx, 401 auth-expired,
  CliError, and all other throws propagate immediately so users see
  real errors instead of 're-run to resume' messages that would never
  succeed.
- Equal-jitter backoff (AWS-style): half the base delay is fixed, half
  is randomized. Prevents the degenerate retry-storm where Math.random
  rolls near-zero on every attempt.
- Budget 5 retries per op; exhaustion throws with the exact resume
  command in the message.
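Equal-jitter in one line, with illustrative constants (the CLI's actual base/cap values aren't shown in this log, so treat these as placeholders):

```typescript
// AWS-style equal jitter: half the capped exponential delay is fixed, half is
// randomized. The fixed half guarantees delay >= exp/2, which prevents the
// retry-storm where Math.random rolls near-zero on every attempt.
function equalJitterDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 10_000,
  rand: () => number = Math.random, // injectable for deterministic tests
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt); // exponential, capped
  return exp / 2 + rand() * (exp / 2);                // fixed half + random half
}
```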

AbortSignal plumbing:
- ApiClient.request threads signal into fetch({signal}); getTask,
  getTaskEvents, catchUpEvents all accept and propagate it.
- Non-JSON HTTP error responses now throw ApiError (not CliError) with
  the status preserved, so the retry classifier can correctly reject
  4xx WAF/CloudFront pages.
- SIGINT fires abortController.abort(); signal is checked before
  finalStatus in runPolling so Ctrl+C always wins (POSIX exit 130).
- Snapshot-SIGINT path only honors exit 130 if the thrown error is
  genuinely an AbortError from our signal - real errors that race with
  SIGINT (e.g. expired token) surface their login hint instead of
  getting silently swallowed.

Tests: CLI 125 -> 142 (+17). Agent unchanged at 391. Prek pre-commit
passes all 11 hooks.

References: docs/design/INTERACTIVE_AGENTS.md §5.3, AD-2.
Resolves: Chunk D carry-forwards S#4 (retry) and S#5 (AbortSignal).

Replaces the single-filter FanOutConsumer with a per-channel router.
Each channel (Slack / Email / GitHub) gets its own default event-type
subscription matching design §6.2, and the router threads an optional
per-task overrides object (§6.5) forward-compatibly so Chunk K can
wire a DDB read without touching routing logic.

CHANNEL_DEFAULTS (matches §6.2 exactly):
- Slack  : terminal (all 4) + pr_created + agent_error
           + approval_required + status_response
- Email  : task_completed + task_failed + approval_required
           (task_cancelled and task_stranded are NOT delivered --
           user knows they cancelled; strands are an operator signal)
- GitHub : terminal (all 4) + pr_created

Router + telemetry:
- resolveChannelFilter(ch, overrides?) with enabled:false / 'default'
  token expansion / explicit event list replacement.
- Empty events list ({events: []}) mutes the channel AND emits a
  fanout.resolve.empty_events_override warn -- a silently muted
  channel is exactly the silent-failure trap that Chunk K's
  submit-time validation needs defense in depth against.
- routeEvent uses Promise.allSettled and returns only the channels
  that actually fulfilled. Rejected channels log
  fanout.dispatcher.rejected with channel context, and are NOT
  counted as dispatched in batch telemetry.
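A sketch of the resolution semantics described above (§6.5 token expansion; the event names follow the §6.2 table, and the types here are assumptions rather than the handler's real exports):

```typescript
type Channel = "slack" | "email" | "github";

// §6.2 defaults as listed above; "terminal (all 4)" is assumed to mean
// task_completed / task_failed / task_cancelled / task_stranded.
const CHANNEL_DEFAULTS: Record<Channel, string[]> = {
  slack: ["task_completed", "task_failed", "task_cancelled", "task_stranded",
          "pr_created", "agent_error", "approval_required", "status_response"],
  email: ["task_completed", "task_failed", "approval_required"],
  github: ["task_completed", "task_failed", "task_cancelled", "task_stranded",
           "pr_created"],
};

interface ChannelOverride {
  enabled?: boolean;
  events?: Array<string | "default">;
}

function resolveChannelFilter(channel: Channel, override?: ChannelOverride): string[] {
  if (override?.enabled === false) return [];          // channel muted outright
  if (!override?.events) return CHANNEL_DEFAULTS[channel];
  const out = new Set<string>();
  for (const e of override.events) {
    // 'default' token expands to the channel default; names are explicit adds.
    if (e === "default") CHANNEL_DEFAULTS[channel].forEach((d) => out.add(d));
    else out.add(e);
  }
  return [...out]; // {events: []} resolves to an empty filter (mute)
}
```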

Types consolidated:
- ChannelConfig and TaskNotificationsConfig live in handlers/shared/
  types.ts as the single source of truth. TaskRecord.notifications
  and the fanout handler both reference them - type drift caught at
  compile time.

Dispatcher simplification:
- Removed per-dispatcher try/catch. Single error sink in the router
  guarantees batch telemetry and per-channel warns stay consistent.

Tests: CDK 872 -> 892 (+20). Agent unchanged at 391. Prek pre-commit
passes all 11 hooks.

References: docs/design/INTERACTIVE_AGENTS.md §6.2, §6.5, AD-4.

Replaces the FanOutConsumer's GitHub log-only stub with a real
edit-in-place implementation per design §6.4: maintain a single
comment per task, PATCH with If-Match ETag for optimistic concurrency,
single retry on 412, fallback POST on 404.

New handlers/shared/github-comment.ts:
- upsertTaskComment({existingCommentId, existingEtag, ...}): when
  called with a cached etag, PATCH directly (steady-state ONE GitHub
  call per event). Falls back to GET+retry only on 412. Without a
  cached etag, defensive GET-then-PATCH cold-start path.
- renderCommentBody(): pure Markdown. Sanitizes event-type strings
  (defensive against future writers); caps at 60k chars with visible
  truncation marker.
- Exported BGAGENT_COMMENT_MARKER_PREFIX so downstream forensics /
  reconciliation callers don't re-invent the marker regex.
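The upsert state machine reduces to a small decision table. This is an illustrative pure sketch of that logic (the real `upsertTaskComment` calls the GitHub API directly; `planUpsert` and `Plan` are hypothetical names):

```typescript
type Plan =
  | { action: "patch"; commentId: number; ifMatch: string } // steady state: ONE call
  | { action: "get_then_patch"; commentId: number }          // refresh ETag first
  | { action: "post" };                                      // create the comment

function planUpsert(
  state: { commentId?: number; etag?: string },
  lastStatus?: number, // HTTP status of the previous attempt, if retrying
): Plan {
  if (state.commentId === undefined) return { action: "post" };          // no comment yet
  if (lastStatus === 412) return { action: "get_then_patch", commentId: state.commentId }; // stale ETag: single retry
  if (lastStatus === 404) return { action: "post" };                      // comment deleted: fallback POST
  if (state.etag !== undefined)
    return { action: "patch", commentId: state.commentId, ifMatch: state.etag }; // cached-ETag fast path
  return { action: "get_then_patch", commentId: state.commentId };        // defensive cold start
}
```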

Handler changes (fanout-task-events.ts):
- dispatchToGitHubComment reads TaskRecord, resolves per-repo token
  (falling back to platform default), renders body, upserts, and
  persists {comment_id, etag} back to TaskTable via conditional
  UpdateCommand.
- On 401 from GitHub: evict token cache and retry once. Self-heals
  token rotation without operator intervention.
- resolveTokenSecretArn distinguishes IAM deny (hard throw -- misconfig
  must surface) from RepoTable-absent (INFO, dev fallback) from
  transient DDB errors (ERROR + fallback with error_id).
- saveCommentState failure branches: ConditionalCheckFailedException
  (task TTL-evicted) logs INFO (benign); other failures log ERROR
  with error_id FANOUT_GITHUB_PERSIST_FAILED so operators can alarm
  on the duplicate-comment risk distinctly.

TaskRecord gains github_comment_id + github_comment_etag (read path
now alive -- the BLOCKER from review).

Construct plumbing:
- FanOutConsumer accepts optional taskTable, repoTable,
  githubTokenSecret props. Each guarded so dev deploys without
  wiring still work (dispatcher log-and-skips). agent.ts wires
  all three in production.

Tests: CDK 909 -> 923 (+14). Agent unchanged at 391. Prek pre-commit
passes all 11 hooks. Unit tests cover cached-etag fast path, PATCH
body content, PATCH missing ETag, event-type sanitization, body
truncation, marker constant. Integration tests cover etag
pass-through, issue_number fallback, loadRepoConfig throw,
resolveGitHubToken throw, saveCommentState conditional vs
non-conditional, 401 cache-eviction retry.

References: docs/design/INTERACTIVE_AGENTS.md §6.4, AD-9.
…K1 of 2)

Plumbs --trace end-to-end from CLI -> CDK -> agent so opt-in debug
tasks get 4 KB preview fields in TaskEventsTable instead of the
default 200-char truncation. Routine observability still goes through
watch / notifications; --trace is opt-in per task.

K1 ships the preview-cap half of design §10.1. K2 (S3 trajectory
upload + bgagent trace download) deferred to a follow-up PR -- needs
new CDK S3 infra + pre-signed URL handler that's cleaner reviewed
alone.

CLI:
- --trace flag on submit; threaded into CreateTaskRequest body only
  when opt-in (absent flag omits the field, keeping wire payloads
  slim for the common path).
- TaskDetail.trace non-optional, matching the CDK's toTaskDetail
  default-false guarantee.

CDK:
- strict-boolean validation rejects 'true'/0/1/null with 400 so a
  misbehaving client can't accidentally enable trace with truthy
  non-boolean values.
- Orchestrator threads trace: true into the agent payload only when
  set on the TaskRecord.

Agent:
- TaskConfig gains trace: bool; build_config accepts it; pipeline
  forwards from run_task(trace=...) through to config.
- runner.py:227 constructs _ProgressWriter(task_id, trace=config.trace).
  This is the load-bearing fix -- the runner's writer emits the turn /
  tool_call / tool_result / error previews that --trace is meant to
  raise. Dropping trace here silently no-ops the feature for every
  field that matters; pipeline.py's milestone writer is a separate
  instance.
- _ProgressWriter gains per-instance _preview_max_len (200 / 4096);
  all six preview call sites now use self._preview(...).
- Strict-boolean extraction in server.py: inp.get('trace') is True.

Tests: CLI 142 -> 144 (+2), CDK 923 -> 933 (+10), agent 391 -> 408
(+17). Boundary conditions locked at 4095/4096/4097 and 199/200/201.
run_task -> TaskConfig.trace threading locked. Adversarial validator
inputs (string/int/null/object) all rejected.

References: docs/design/INTERACTIVE_AGENTS.md §10.1. K2 deferred:
S3 trajectory upload + bgagent trace download.
… of 2)

Completes design §10.1 by wiring the end-to-end --trace debug-escape-hatch:
agent uploads a gzipped JSONL trajectory dump to S3 on terminal state;
a new GET /v1/tasks/{task_id}/trace handler returns a 15-min presigned
download URL after Cognito + per-caller-prefix auth; bgagent trace
download streams it back.

CDK:
- TraceArtifactsBucket construct (7-day TTL, SSE, enforceSSL, BLOCK_ALL,
  autoDeleteObjects; targeted AwsSolutions-S1 suppression with honest
  threat-model rationale).
- get-trace-url handler: Cognito auth, record-ownership check, two
  defense-in-depth guards (bucket must match TRACE_ARTIFACTS_BUCKET_NAME;
  key must live under the CALLER's traces/<sub>/ prefix, NOT the
  record's -- hardens against DDB-spoofed trace_s3_uri redirection).
- TaskRecord.trace_s3_uri?, TaskDetail.trace_s3_uri: string | null,
  ErrorCode.TRACE_NOT_AVAILABLE; toTaskDetail surfaces it.
- Orchestrator threads task.user_id into the agent payload
  (unconditional -- trace consumes it but it's plumbed once).
- @aws-sdk/client-s3 + @aws-sdk/s3-request-presigner added.
- TODO on grantPut(runtime): Chunk L will tighten to a per-prefix
  IAM condition via AgentCore session tags.
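The two key guards can be sketched as a single validation step (illustrative; the real handler's names and error handling differ):

```typescript
// Defense in depth against a DDB-spoofed trace_s3_uri: the URI must point at
// OUR trace bucket, and the key must live under the CALLER's traces/<sub>/
// prefix -- not whatever prefix the record claims. Returns null on any
// mismatch so the handler can 404 instead of presigning a redirect.
function validateTraceUri(
  uri: string,
  bucketName: string, // expected TRACE_ARTIFACTS_BUCKET_NAME
  callerSub: string,  // Cognito sub of the authenticated caller
): { bucket: string; key: string } | null {
  const m = /^s3:\/\/([^/]+)\/(.+)$/.exec(uri);
  if (m === null) return null;
  const [, bucket, key] = m;
  if (bucket !== bucketName) return null;                   // guard 1: bucket pin
  if (!key.startsWith(`traces/${callerSub}/`)) return null; // guard 2: caller prefix
  return { bucket, key };
}
```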

Agent:
- TaskConfig.user_id; server.py coerces non-string payloads to empty
  with a WARN (avoids silent unreachable-key writes).
- _TrajectoryWriter grows an opt-in 50 MB in-memory accumulator +
  dump_gzipped_jsonl() returning a self-describing gzipped JSONL
  artifact with a drop-count header; fire-once truncation callback
  emits a trace_truncated milestone on bgagent watch.
- upload_trace_to_s3: fail-open, gates on non-empty user_id +
  task_id + TRACE_ARTIFACTS_BUCKET_NAME; flags IAM misconfig in logs.
- pipeline.run_task runs _maybe_upload_trace on BOTH the happy path
  (after post-hooks, before write_terminal so trace_s3_uri lands
  atomically with terminal status) AND the crash path (inside the
  outer except, so a failing task still produces a debug artifact --
  the exact case --trace was designed for). Upload exceptions on the
  crash path never mask the original pipeline exception.
- task_state.write_terminal persists trace_s3_uri when present; new
  orphan-diagnostic log fires when ConditionalCheckFailed races a
  successful S3 upload (named trace URI + 7-day lifecycle note so
  operators can reason about orphaned objects).
- get_config() reads TRACE + USER_ID env vars for local-batch parity.
- CANCELLED fast-path logs a WARN directing users to the
  trajectory/<task_id> CloudWatch log stream; S3 upload deferred
  pending a conditional UpdateItem follow-up.

CLI:
- TraceUrlResponse type; ApiClient.getTraceUrl.
- bgagent trace download <task-id> [-o <file>]: default streams
  gunzipped JSONL to stdout (pipe-friendly for `| jq -s .`);
  -o <file> writes raw gzipped bytes. Friendly TRACE_NOT_AVAILABLE
  404 message, expired-URL hint on S3 403.

Tests: +89 total (CDK 933->973, agent 408->449, CLI 144->152). Includes
boundary pins, defense-in-depth assertions, crash-path invariants,
cross-package type-sync updates, and orphan-diagnostic regression
guards. All pre-existing tests untouched.

Carry-forwards captured in Chunk L -- per-prefix IAM tightening,
self-healing UpdateItem for orphans, CLI fetch timeout,
formatTaskDetail trace display, grantRead narrowing.

Review gate: code-reviewer + silent-failure-hunter + pr-test-analyzer
all approved with pre-commit fixes applied (S1, M1, N1, SIG-1).

Refs: design §10.1, AD-6 review Findings aws-samples#1/aws-samples#3/SIG-1.

L1 is the hygiene bucket of Chunk L. Non-functional cleanups + one
tiny UX addition. All changes are surface-level; no behavior changes.

- agent/src/telemetry.py: add peak-memory comment on
  dump_gzipped_jsonl. The 50 MB cap is fine, but gzip'ing it
  transiently peaks to ~55-60 MB; a future reader raising the cap
  needs to match container memory headroom.
- cdk/src/handlers/shared/create-task-core.ts: remove unused
  DEFAULT_MAX_TURNS import (still used in orchestrator.ts).
- cdk/src/handlers/shared/validation.ts: remove unused isPrTaskType
  import (still exported from types.ts; still used in create-task-core).
- cli/src/format.ts: surface trace_s3_uri in formatTaskDetail when
  non-null so `bgagent status <id>` shows the trace URI to terminal
  users, not just --output json consumers. +3 LOC render + +2 tests.

Stranded-ref grep across source trees: zero hits for
STRANDED_INTERACTIVE / runtime_jwt / execution_mode / sse_* /
RUN_ELSEWHERE. The retired-design references that remain are all
in design-history markdown + drawio (separate carry-forward).

Additive-only pass. No production changes. Closes specific gaps
pr-test-analyzer raised on the K2 commit-gate review:

- agent/tests/test_trace_key_contract.py (NEW, 4 tests): cross-
  language drift detector. The literal ``traces/<user_id>/
  <task_id>.jsonl.gz`` layout is asserted by THREE independent
  codepaths (CDK TRACE_OBJECT_KEY_PREFIX constant, CDK handler's
  expectedKeyPrefix at runtime, agent upload_trace_to_s3 Key
  construction). Parses the two TS files as text and round-trips
  the Python upload against a mocked boto3 to confirm all three
  agree. Fails loudly if anyone renames ``traces/`` on one side.

- agent/tests/test_trace_upload.py (+132 LOC, +5 tests):
  * exact-boundary test at _ACCUMULATOR_MAX_BYTES pins the <=
    upper-bound on _put_event (guards against a future < off-by-one).
  * AWS_REGION / AWS_DEFAULT_REGION unset -> boto3 client accepts
    region_name=None (default-chain path).
  * body=b"" -> put_object still called; URI returned as normal.
  * body=None -> current fail-open behavior pinned (swallowed,
    returns None). Flagged in review: L3+ could add an early guard.
  * truncation callback fires once but _accumulator_dropped keeps
    incrementing past the announcement.

Test counts: agent 459 -> 468 (+9). CDK / CLI unchanged.

Five small production fixes addressing silent-failure-hunter and
code-reviewer findings from K2's final commit-gate review. All items
keep fail-open semantics; none change user-facing behavior on the
happy path.

- cli/src/commands/trace.ts: AbortSignal-backed 2-minute fetch
  timeout + SIGINT handler so a stalled S3 download no longer wedges
  the CLI with no recovery signal. Raw ``Z_DATA_ERROR`` / ``zlib``
  pipeline failures on the stdout gunzip path are now wrapped in a
  ``CliError`` naming the real cause (corrupt or not gzipped),
  replacing a stack trace that previously looked like a CLI bug. The
  duck-typed ``isZlibError`` avoids a Jest-realm ``instanceof Error``
  fragility.
- cdk/src/constructs/task-api.ts: tighten ``get-trace-url`` IAM from
  ``grantRead`` (which expands to ``s3:GetObject*`` + ``s3:GetBucket*``
  + ``s3:List*``) to an explicit ``s3:GetObject`` statement on
  ``bucket/*``. HeadObject is implicitly covered per IAM docs.
- cdk/src/handlers/get-trace-url.ts: HEAD-check the S3 object before
  presigning. Returns the existing ``TRACE_NOT_AVAILABLE`` 404 when
  the agent wrote ``trace_s3_uri`` to DDB before the PUT propagated
  (or after a lifecycle sweep), replacing an S3-XML 404 the user
  had no user-friendly remedy for.
- cdk/src/handlers/shared/github-comment.ts: ``logRateLimitIfLow``
  WARNs when ``x-ratelimit-remaining`` drops below 500 on any PATCH /
  POST response. Operators get advance notice of the approaching 403
  instead of discovering it in the dispatcher-rejected log stream.
- cli/src/commands/watch.ts: session-level retry counter across
  ``withTransientRetry`` poll iterations. A flapping 5xx at ~50%
  previously never tripped the per-iteration budget and never
  surfaced; now, after 10 retries in a session the CLI emits a
  one-shot stderr WARN naming the flap.

Tests: CDK 973 -> 982 (+9), CLI 155 -> 158 (+3). Agent unchanged.

Two behavioral changes addressing K2 final-review SIG-1 and MIN-2.

Self-healing trace_s3_uri (SIG-1):
A race between the agent's happy-path write_terminal and a concurrent
cancel / reconciler was causing orphaned S3 uploads — the trace
object existed but TaskRecord.trace_s3_uri never landed, so
``get-trace-url`` returned 404 TRACE_NOT_AVAILABLE for a trace that
was actually uploaded. K2 shipped a diagnostic log; this commit
adds the self-heal.

- agent/src/task_state.py: new ``write_trace_uri_conditional(task_id,
  uri)`` helper. Conditional UpdateItem scoped to
  ``attribute_not_exists(trace_s3_uri) AND status IN (CANCELLED,
  COMPLETED, FAILED, TIMED_OUT)`` — only adds the URI, never touches
  ``status`` or ``status_created_at``, so the state-machine atomic-
  transition invariant is preserved. CCF and transient DDB errors
  are fail-open (log + False).
- agent/src/task_state.py: ``write_terminal``'s CCF handler now
  invokes the helper when ``result.trace_s3_uri`` is present. The
  pre-existing orphan-diagnostic log still prints above the heal
  result so the race remains operator-visible even when healed.
- agent/src/pipeline.py: the cancel short-circuit now also does a
  best-effort upload + self-heal when ``config.trace=True`` (task
  cancelled mid-run with --trace now produces a downloadable
  artifact instead of just a CloudWatch note). Fully fail-open.

Cancel-path behavior change: S3 PutObject is now attempted on the
cancel fast-path when ``trace=True``. The upload is gated by
``_maybe_upload_trace``'s existing empty-trajectory early-return, so
an early cancel (no events buffered) is a no-op. Runtime IAM role
already carries ``s3:PutObject`` on the traces bucket — no new
permission needed.
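Expressed as DynamoDB UpdateItem parameters (illustrative; the real helper is Python/boto3, and the table/attribute names here mirror the description above rather than the actual schema), the self-heal condition looks like:

```typescript
// Only adds trace_s3_uri when it is absent AND the record is already terminal.
// Never touches status or status_created_at, preserving the state-machine
// atomic-transition invariant; a ConditionalCheckFailed is treated as benign.
function buildTraceUriHealParams(tableName: string, taskId: string, uri: string) {
  return {
    TableName: tableName,
    Key: { task_id: taskId },
    UpdateExpression: "SET trace_s3_uri = :uri",
    ConditionExpression:
      "attribute_not_exists(trace_s3_uri) AND #s IN (:c1, :c2, :c3, :c4)",
    ExpressionAttributeNames: { "#s": "status" }, // 'status' is a reserved word
    ExpressionAttributeValues: {
      ":uri": uri,
      ":c1": "CANCELLED",
      ":c2": "COMPLETED",
      ":c3": "FAILED",
      ":c4": "TIMED_OUT",
    },
  };
}
```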

--force overwrite guard (MIN-2):
- cli/src/commands/trace.ts: ``bgagent trace download -o <file>``
  now refuses to overwrite an existing file unless ``--force`` is
  passed. Previously silently clobbered, which is a footgun when a
  user re-runs the command against the same output path. Check
  fires before the presigned-URL fetch so the refusal path is
  zero-round-trip.

Tests: CDK unchanged. Agent 459 -> 468 (+9: TestWriteTraceUriConditional
class of 7 + 2 cancel-path pipeline tests). CLI 158 -> 160 (+2 for
overwrite-blocked / overwrite-with-force).

GetTraceUrlFn deployed with Lambda's defaults (3s timeout, 128 MB)
caused every request to hit INIT timeout and return 502 Bad Gateway.
The bundle includes @aws-sdk/client-s3, @aws-sdk/s3-request-presigner,
and @aws-sdk/lib-dynamodb; cold-start init was observed at 10-13s
with memory pegged at ~126 MB of the 128 MB default.

Raise to 15s / 512 MB — matches the CancelTaskFn pattern (itself
bumped for bedrock-agentcore SDK cold starts) with extra memory
headroom since this handler loads more SDK packages.

Caught during K2+L deploy validation (Scenario 1, control path).
Not caught by CDK unit tests because those verify synth shape, not
runtime behavior. Added a regression test that asserts
Timeout: 15 / MemorySize: 512 on the synthesized function.

Refs: K2, post-deploy-validation bug

The ``ContentEncoding: gzip`` metadata on the ``--trace`` S3 upload
causes Node's fetch (undici) to transparently decompress the
presigned-URL response, which breaks both CLI download paths:

  * ``bgagent trace download <id>`` (stdout) — the already-decompressed
    bytes are piped through ``createGunzip()``, which throws
    ``Z_DATA_ERROR: incorrect header check`` (L3's zlib-wrapper
    reframes it as "Trace artifact is corrupt or not gzipped" —
    misleading since the artifact is fine).
  * ``bgagent trace download <id> -o <file>`` — writes the
    decompressed bytes, violating the documented "raw gzipped bytes"
    contract.

Drop ``ContentEncoding: gzip`` and change ``ContentType`` from
``application/x-ndjson`` to ``application/gzip``. The bytes actually
stored ARE gzipped; describing them as gzipped without the encoding
hint is the honest representation, and matches how the CLI treats
them (pipe through gunzip for stdout, write raw for ``-o``).

Trade-off: the S3 console / ``aws s3 cp`` no longer auto-decompress
on download — users have to gunzip manually. Acceptable for a
CLI-mediated debug artifact; the primary consumer is
``bgagent trace download``.

Caught by Scenario 2 deploy validation; not caught by unit tests
because they asserted the header was SET rather than asserting
round-trip behavior.

Refs: K2, post-deploy-validation Finding 1

L1 added the ``Trace S3:`` line to ``formatTaskDetail`` (the
``--wait`` and detail view), but missed the parallel
``formatStatusSnapshot`` function used by the default
``bgagent status <task_id>`` command. A trace-enabled task's URI
was only visible via ``--output json``; terminal users had to
know about the secondary path to discover it.

Add the same conditional render to ``formatStatusSnapshot`` so
both entry points surface the URI the same way. Zero-diff when
``trace_s3_uri`` is null (matches the conditional rendering of
``PR:`` and ``Cost:``).

Caught by Scenario 2 deploy validation.

Refs: K2, L1, post-deploy-validation Finding 2
…r routing

Scenario 7-extended in-account deploy-validation of the GitHub
edit-in-place dispatcher (§6.2 / §6.4) caught two runtime bugs that
caused zero status comments to be posted on pr_iteration tasks.

1. Numeric coercion at the fan-out boundary
   The DynamoDB Document-client returns Number attributes as strings,
   but TaskRecord.{duration_s,cost_usd} are typed `number | null`.
   renderCommentBody called `costUsd.toFixed(4)` on a string and threw
   TypeError on every terminal event, rejecting the dispatcher.
   Sibling orchestrator.ts:527-528 already coerces; fanout now does
   too via `coerceNumericOrNull`. Non-finite inputs log
   `fanout.numeric_coercion_failed` so writer bugs surface instead of
   silently dropping the Cost/Duration rows.

2. agent_milestone wrapper events routing
   The agent writes `pr_created` / `nudge_acknowledged` /
   `repo_setup_complete` as `agent_milestone` events with
   `metadata.milestone` carrying the name. Per-channel default filters
   target the milestone names directly (§6.2 GitHub default includes
   `pr_created`), so fanout must unwrap before matching. New
   `effectiveEventType()` helper unwraps against a
   `ROUTABLE_MILESTONES` allowlist (today `{pr_created}`) as a
   structural defense against future naming drift.
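A sketch of the unwrap helper as described (event shape and names are assumptions based on this log, not the handler's exact types):

```typescript
// Allowlist of milestone names that route like first-class events. Today only
// pr_created; anything else keeps the agent_milestone wrapper type so future
// milestone names cannot silently widen routing.
const ROUTABLE_MILESTONES = new Set(["pr_created"]);

interface FanoutEvent {
  event_type: string;
  metadata?: { milestone?: unknown };
}

function effectiveEventType(event: FanoutEvent): string {
  if (event.event_type !== "agent_milestone") return event.event_type;
  const m = event.metadata?.milestone;
  // Defensive: malformed or non-string metadata falls back to the wrapper.
  return typeof m === "string" && ROUTABLE_MILESTONES.has(m) ? m : event.event_type;
}
```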

Observability: log lines across the fan-out plane (slack/email stubs,
rate-limit hit, dispatcher rejection, github dispatched) now carry
`effective_event_type` so operators can grep CloudWatch for the
milestone name rather than the wrapper.

Tests: +10 regression tests covering string-typed DDB numerics,
allowlist collision defense, malformed metadata, absent fields, and
end-to-end milestone dispatch. CDK suite: 993 passing (baseline 983).

Design doc §4.2 updated to clarify that named milestones are
agent_milestone carriers with a selectively-unwrapped subset.

Refs: §6.2 per-channel routing, §6.4 GitHub edit-in-place

Deploy validation in Scenarios 2, 6, and 7-extended surfaced two
``bgagent watch`` rough edges.

1. Initial snapshot had no transient-retry wrap
   ``fetchInitialSnapshot`` called ``apiClient.getTaskEvents`` and
   ``apiClient.getTask`` raw, while the polling loop already wraps
   every subsequent call in ``withTransientRetry``. A single Lambda
   cold-start ``fetch failed`` / 5xx on the first request crashed the
   watch command before the polling loop could stabilise (observed
   Scenario 2: one cold-start failure, recovered on re-run).

   The two snapshot calls are now wrapped, with the same retry
   contract as the polling loop: 4xx propagates immediately, 5xx and
   network transients retry with jittered exponential backoff up to
   ``MAX_TRANSIENT_RETRIES``. Budget exhaustion surfaces the same
   "re-run to resume" hint.

   Hardening: ``fetchInitialSnapshot`` now requires a concrete
   ``signal`` (was optional) so SIGINT during a retry backoff actually
   aborts, and the snapshot error path calls ``abortController.abort()``
   before rethrow so a sibling ``Promise.all`` retry leg cannot keep
   burning retries after the command has already decided to fail.

2. Watch process hung past terminal
   Node's global ``fetch`` (undici) keeps TCP sockets alive in a
   connection pool; after ``bgagent watch`` finished its logical work
   on a terminal task, those sockets kept the event loop open for the
   pool's idle timeout. Observed in Scenarios 6 and 7-extended: the
   process had to be ``pkill``-ed after COMPLETED.

   The program entry point now schedules a deferred
   ``process.exit(process.exitCode ?? 0)`` via an ``unref``'d 50 ms
   timer in the ``.finally`` chain. Exit codes stay whatever each
   command set (0 / 1 / 130), stderr/stdout writes flush naturally,
   ``on('exit')`` handlers still run — but the process doesn't linger
   on the pool's keep-alive timeout. The entry point is also guarded
   by ``require.main === module`` so accidental transitive imports
   from test harnesses don't execute ``parseAsync`` on the importer's
   argv.
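
The deferred-exit half of the fix can be sketched as follows — a minimal stand-in, not the shipped entry point; the names below are illustrative:

```typescript
// Sketch of the deferred-exit pattern: an unref'd timer wins the race
// against undici's keep-alive sockets without holding the loop open itself.
function resolveExitCode(current: number | undefined): number {
  // Preserve whatever the command set (0 / 1 / 130); default to success.
  return current ?? 0;
}

function scheduleDeferredExit(delayMs = 50): void {
  const timer = setTimeout(() => {
    process.exit(resolveExitCode(process.exitCode as number | undefined));
  }, delayMs);
  // unref: this timer must never be the thing keeping the process alive.
  timer.unref();
}

// Guarded entry point, so transitive imports from test harnesses never run
// the CLI against the importer's argv (illustrative; assumes a CJS build):
//
//   if (require.main === module) {
//     program.parseAsync().finally(() => scheduleDeferredExit());
//   }
```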

Tests: +4 regression tests covering snapshot transient retry, 4xx
fast-fail, retry budget exhaustion + "re-run" hint surfaced, and the
resource-leak guard where a non-transient error on one snapshot leg
aborts the sibling's retry loop. CLI suite: 166 passing (baseline 162).
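
The retry contract described above can be sketched as follows — an illustrative stand-in for the CLI helper, with an injectable ``sleep`` so tests don't wait; the retry budget value is an assumption, not the shipped constant:

```typescript
// Illustrative stand-in for the CLI's transient-retry contract.
const MAX_TRANSIENT_RETRIES = 4; // assumed budget, not the shipped value

class HttpError extends Error {
  constructor(public status: number) { super(`HTTP ${status}`); }
}

// 5xx and network-level failures retry; 4xx propagates immediately.
function isTransient(err: unknown): boolean {
  if (err instanceof HttpError) return err.status >= 500;
  return true; // e.g. undici "fetch failed" on a Lambda cold start
}

// Exponential backoff with full jitter.
function backoffMs(attempt: number, baseMs = 200): number {
  return Math.floor(Math.random() * (baseMs * 2 ** attempt));
}

async function withTransientRetry<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= MAX_TRANSIENT_RETRIES) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```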
… memory.ts NaN hazard

Code-review follow-up from PR aws-samples#52's fanout fix (commit 9fe704e).

The numeric-coercion helper introduced inline in fanout-task-events.ts
is now needed in a second place: the orchestrator's fallback-episode
write path. Hoist it to ``cdk/src/handlers/shared/numeric.ts`` with a
small structural ``CoerceLogger`` interface so test suites can pass a
spy without importing the real logger.

Why a second call site: ``shared/memory.ts::writeMinimalEpisode`` line
325 calls ``costUsd.toFixed(4)`` on the value the orchestrator passes
in. Pre-refactor the orchestrator did ``Number(task.cost_usd)``, which
silently produces ``NaN`` on a corrupt DDB string ("abc", etc.) and
ships ``Cost: $NaN.`` into the episode text. The shared helper:

  - passes real numbers and parseable strings straight through;
  - treats ``null`` / ``undefined`` / empty-string as "absent" without
    a warn (the common "not set yet" case);
  - collapses non-finite coercions to ``null`` AND emits a
    ``numeric.coercion_failed`` warn with the raw input so writer
    bugs surface in CloudWatch rather than silently dropping data.

Changes:
  - New: ``cdk/src/handlers/shared/numeric.ts`` (+ tests covering
    passthrough, string parse, absent inputs, NaN/Infinity coercion,
    warn payload contract).
  - ``fanout-task-events.ts``: delete the inline helper, import the
    shared one. Warn event name harmonised from
    ``fanout.numeric_coercion_failed`` to ``numeric.coercion_failed``
    (the helper is now cross-handler).
  - ``orchestrator.ts::onOrchestrate``: replace the bare
    ``Number(task.cost_usd) / Number(task.duration_s)`` coercion with
    the shared helper so a corrupt string no longer reaches
    ``writeMinimalEpisode``. Matches the same coercion the fanout
    plane already does for the GitHub comment body.

Tests: CDK suite 998 passing (baseline 993 from previous commit;
+5 from the new ``shared/numeric`` suite).

Refs: code-review SIGNIFICANTs on commit 9fe704e (memory.ts hazard
+ shared-helper placement).
…ional headers)

Scenario 7-extended deploy-validation (after the redeploy of the
previous fanout fix, commit 9fe704e) uncovered that GitHub's REST API
rejects ``If-Match`` on ``PATCH /repos/{o}/{r}/issues/comments/{id}``
with HTTP 400:

  "Conditional request headers are not allowed in unsafe requests
  unless supported by the endpoint"

The original design (§6.4 + AD-9) assumed ``If-Match`` ETag drove
optimistic concurrency. It never did — every steady-state PATCH
returned 400 in production. Comments were created on the initial POST
(``pr_created``) and then never edited on subsequent events
(``task_completed`` etc.), leaving the comment body frozen at whatever
state the agent happened to be in when ``pr_created`` fired.

Root-cause verified live against api.github.com with an otherwise-
identical request:
  - PATCH with ``If-Match: "dummy"`` → HTTP 400 (error string above)
  - PATCH without ``If-Match`` → HTTP 200

Fix: drop ``If-Match`` everywhere. Per-``task_id`` ordering is already
guaranteed upstream by ``TaskEventsTable`` DDB Streams with
``ParallelizationFactor: 1``, and the fanout Lambda is the only writer
on its own comment, so last-writer-wins is safe.

Code changes:
  - ``shared/github-comment.ts``: drop ``If-Match`` header from PATCH.
    Drop the GET-before-PATCH cold-start pattern and the 412 retry
    branch (neither reachable without ``If-Match``). ``upsertTaskComment``
    no longer takes ``existingEtag``. ``UpsertCommentResult.etag``
    removed (no longer load-bearing). POST no longer throws when the
    ETag header is absent — the comment_id comes from the body.
    404 → POST fallback preserved.
  - ``fanout-task-events.ts``: drop ``existingEtag`` pass-through;
    stop persisting ``github_comment_etag``. ``saveCommentState``
    now fires only on the POST path (initial create or 404 re-POST),
    not on every PATCH. Add a ConditionExpression to the UpdateItem
    so a 404 re-POST cannot silently overwrite a sibling fanout
    invocation's freshly-written comment_id (first-ever POST
    requires ``attribute_not_exists(github_comment_id)``; 404 re-POST
    requires ``github_comment_id = :prev``).
  - ``shared/types.ts``: remove ``TaskRecord.github_comment_etag``.
    Legacy items carrying the field pass through DocumentClient
    untouched on read — no migration required. Documented in §6.4.
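
The two ``ConditionExpression`` guards can be sketched as a small params builder (key and table wiring omitted; attribute names per the text above, everything else illustrative):

```typescript
// Sketch of the guarded saveCommentState persist paths. A first-ever POST
// must not clobber a sibling invocation's freshly written comment_id; a
// 404 re-POST may only replace the id this invocation saw go stale.
interface PersistParams {
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeValues: Record<string, unknown>;
}

function buildPersistParams(
  newCommentId: number,
  previousCommentId?: number, // set only on the 404 re-POST path
): PersistParams {
  if (previousCommentId === undefined) {
    return {
      UpdateExpression: "SET github_comment_id = :id",
      ConditionExpression: "attribute_not_exists(github_comment_id)",
      ExpressionAttributeValues: { ":id": newCommentId },
    };
  }
  return {
    UpdateExpression: "SET github_comment_id = :id",
    ConditionExpression: "github_comment_id = :prev",
    ExpressionAttributeValues: { ":id": newCommentId, ":prev": previousCommentId },
  };
}
```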

Test changes:
  - Delete 4 tests that only made sense under ``If-Match`` semantics:
    412-retry, stale-etag-fast-path, POST-without-ETag-throws,
    PATCH-without-ETag-throws.
  - Add regression guards: PATCH carries no conditional headers,
    legacy ``github_comment_etag`` on a DDB record is ignored by the
    dispatcher, POST without ETag succeeds, 400 from PATCH propagates
    without being retried as POST, first-ever POST uses the
    ``attribute_not_exists`` condition, 404 re-POST uses the
    ``github_comment_id = :prev`` condition, end-to-end 400 surfaces
    as a ``fanout.dispatcher.rejected`` warn without duplicate write.
  - CDK suite: 996 passing (was 998 pre-fix; net −2 after deletions
    and new additions).

Design doc:
  - §3.8 dispatcher bullet, §6.4 full section, AD-9, and the §12
    Phase-1 checklist line all rewritten to reflect DDB-Stream
    ordering as the sole concurrency mechanism, with the HTTP-400
    finding cited as the rejected-alternative rationale.
  - §6.4 now names the three bounded races we tolerate (persist
    failure after POST; 404 + sibling POST race; transient task-load
    failure) and explains the logging / ConditionExpression defenses
    for each.
  - Starlight mirror regenerated.

Follow-up (not in this PR):
  - Marker-scan sweeper for orphaned ``<!-- bgagent:task-id=... -->``
    comments left behind when ``saveCommentState`` fails non-benignly
    after a successful POST. Today the duplicate-comment risk is
    bounded per invocation and surfaces via
    ``error_id: FANOUT_GITHUB_PERSIST_FAILED`` for alarming.

Refs: PR aws-samples#52 Scenario 7-extended deploy validation, §6.2, §6.4, AD-9
…lassification

Pre-fix, ``bgagent watch`` printed ``Task completed.`` or ``Task failed.``
on terminal, with no task_id and no hint at *why* a non-COMPLETED task
failed. A user watching multiple tasks (or scrolling back through a log
or CI capture) couldn't tell which task ended, and couldn't tell from
the watch output whether a FAILED task was blocked by a guardrail, timed
out, or hit a different class of error — they had to chase a separate
``bgagent status`` round-trip.

Observed during PR aws-samples#52 Scenario 7-extended take 3 where a ``pr_iteration``
task failed at hydration due to a Bedrock Guardrail PROMPT_ATTACK block.
Watch printed ``Task failed.`` with no context.

Changes:
  - New exported ``formatTerminalMessage`` helper in
    ``cli/src/commands/watch.ts``. Renders ``Task <id> <status>.``
    always, and on non-COMPLETED terminals appends
    ``<category>: <title>`` from ``error_classification`` when present,
    or the trimmed raw ``error_message`` as a fallback. Never emits an
    orphan trailing colon or space when neither is available.
  - ``runPolling`` receives the full ``TaskDetail`` through
    ``onTerminal`` (was just the status string) so it can pass the
    classification through without a second ``getTask`` round-trip.
  - The already-terminal path in ``makeWatchCommand`` now makes a
    best-effort ``getTask`` to retrieve classification before
    rendering the terminal line. Wrapped in ``withTransientRetry`` and
    a try/catch so a flaky GET doesn't surface after the snapshot tail
    has already streamed — the degraded fallback is the bare
    ``Task <id> <status>.`` form.

Examples (before → after):
  Task completed. → Task 01KQT96W0B8J48HDWGF4YA9V7D completed.
  Task failed.    → Task 01KQT96W0B8J48HDWGF4YA9V7D failed. guardrail: PR context blocked
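
The rendering rules above can be sketched as (field names per the text; the type is an approximation of ``TaskDetail``, not the shipped declaration):

```typescript
// Approximate sketch of formatTerminalMessage's documented rules.
interface TaskDetailLike {
  task_id: string;
  status: string; // e.g. COMPLETED | FAILED | CANCELLED
  error_classification?: { category: string; title: string } | null;
  error_message?: string | null;
}

function formatTerminalMessage(task: TaskDetailLike): string {
  const base = `Task ${task.task_id} ${task.status.toLowerCase()}.`;
  if (task.status === "COMPLETED") return base;
  // Prefer the structured classification, fall back to the raw message,
  // and never emit an orphan trailing colon or space.
  const c = task.error_classification;
  if (c) return `${base} ${c.category}: ${c.title}`;
  const raw = task.error_message?.trim();
  return raw ? `${base} ${raw}` : base;
}
```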

Tests: +5 regression tests for the formatter covering COMPLETED (no
clause), FAILED with classification, FAILED without classification
(error_message fallback with whitespace trim), FAILED with neither
(degrades cleanly), and CANCELLED (non-COMPLETED terminals other than
FAILED still render their classification). CLI suite: 171 passing.

Refs: PR aws-samples#52 Scenario 7-ext take 3 carry-forward (watch terminal UX)
… → MEDIUM

PR aws-samples#52 deploy-validation uncovered that the task-input-guardrail at
HIGH strength blocks legitimate imperative-mood user input at LOW
classifier confidence — both on submit-time task descriptions and on
pr_iteration hydration of PR bodies containing ordinary developer
documentation.

Observed failures on the current deploy:

  $ bgagent submit --repo scoropeza/agent-plugins \
      --task "Make no changes, just inspect README.md and finish."
  Error: Task description was blocked by content policy. (VALIDATION_ERROR)

  $ bgagent submit --repo scoropeza/agent-plugins \
      --task "Please enumerate every plugin in this repo in extreme detail, one at a time."
  Error: Task description was blocked by content policy. (VALIDATION_ERROR)

  (Scenario 7-ext take 3, against PR #32 hydration)
  Guardrail blocked: PR context blocked by content policy:
  CONTENT/PROMPT_ATTACK (LOW)

``inputStrength: HIGH`` blocks on LOW, MEDIUM, and HIGH confidence. The
PROMPT_ATTACK classifier is stochastic at LOW confidence and flags
ordinary imperative language (task descriptions written in direct,
second-person voice) as well as ordinary PR bodies containing
imperative documentation (Troubleshooting sections, build-and-test
imperatives, etc.). The Bedrock Guardrails documentation recommends
MEDIUM as the default for non-adversarial user input.

Fix: ``inputStrength: HIGH → MEDIUM`` — blocks only MEDIUM and HIGH
confidence detections, ignores LOW. Preserves the prompt-injection
defense against the clear-case patterns the classifier is confident
about, and unblocks the long tail of natural-language user input that
motivated the feature.

Handler-side behavior is unchanged: the screening paths in
``shared/create-task-core.ts`` (submit) and
``shared/context-hydration.ts`` (pr_iteration hydration) still
fail-closed on a ``GUARDRAIL_INTERVENED`` result. Only the trigger
threshold moves.
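
For reference, the change amounts to a one-field move in the guardrail's content-filter config. This is a hypothetical fragment — property names follow the CloudFormation ``AWS::Bedrock::Guardrail`` content-filter shape, and the actual construct wiring in the stack may differ:

```typescript
// Hypothetical fragment of the guardrail's content-filter config.
// PROMPT_ATTACK is an input-side filter, so outputStrength stays NONE.
const promptAttackFilter = {
  type: "PROMPT_ATTACK",
  inputStrength: "MEDIUM", // was HIGH: also blocked LOW-confidence hits
  outputStrength: "NONE",
};
```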

CDK suite: 996 passing (unchanged; no stack-level test pinned the
previous threshold).

Refs: PR aws-samples#52 Scenario 7-extended deploy-validation, Subagent B
investigation (guardrail over-triggering on pr_iteration).
…triage

Hands-on testing against the deployed stack revealed two UX gaps in the
default ``bgagent status`` compact snapshot:

  1. A ``pr_review`` task renders indistinguishably from a ``new_task``,
     so a user checking status on a review submission has no default-
     view confirmation the task_type was set as intended.
  2. A FAILED / CANCELLED / TIMED_OUT task shows the terminal status
     but not *why*. Users had to either re-run with ``--wait`` (which
     blocks), switch to ``--output json``, or grep ``bgagent events``
     to recover the error classification that the API already returns.

Both fields are already present on the ``TaskDetail`` payload that
``formatStatusSnapshot`` receives — the previous template just did not
render them. This is pure render-layer plumbing; no new API calls, no
backend changes.

Changes in ``cli/src/format.ts::formatStatusSnapshot``:

  - Renders ``Type: pr_iteration (PR #42)`` / ``Type: pr_review (PR #7)``
    after ``Repo:`` when ``task_type !== 'new_task'``. Matches the
    ``formatTaskDetail`` treatment (``Type:`` line at format.ts:29-31)
    so the compact and detailed views agree. Omits the ``(PR #N)``
    suffix defensively if ``pr_number`` is null.
  - Renders ``Reason: <category>: <title>`` after ``Cost:`` for
    non-COMPLETED terminal statuses, preferring
    ``error_classification.{category, title}`` from the server's
    classifier and falling back to a trimmed ``error_message`` if no
    structured classification is available. Emits nothing when neither
    is populated, or when the task is still running or COMPLETED —
    so the line never appears as a dangling ``Reason:`` with no value.

Example (before → after):

  Before (FAILED task, user had to chase error in events):
    Task 01KQ… — FAILED (3m 12s total)
      Repo:          scoropeza/agent-plugins
      Turn:          8 / 20
      Last milestone: agent_execution_complete (2s ago)
      Current:       task failed
      Cost:          $0.18 / budget $0.05
      Last event:    2026-05-04T21:25:04Z

  After:
    Task 01KQ… — FAILED (3m 12s total)
      Repo:          scoropeza/agent-plugins
      Turn:          8 / 20
      Last milestone: agent_execution_complete (2s ago)
      Current:       task failed
      Cost:          $0.18 / budget $0.05
      Reason:        compute: Exceeded max budget
      Last event:    2026-05-04T21:25:04Z
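
The two conditional rows can be sketched as small pure helpers — names, exact padding, and types are illustrative; the real logic lives in ``formatStatusSnapshot``:

```typescript
// Sketch of the two conditional snapshot rows. Labels are padded so the
// value column starts at column 17, matching the other rows.
function typeRow(taskType: string, prNumber: number | null): string | null {
  if (taskType === "new_task") return null; // default type renders no row
  const suffix = prNumber === null ? "" : ` (PR #${prNumber})`;
  return `  Type:          ${taskType}${suffix}`;
}

const TERMINAL_NON_COMPLETED = new Set(["FAILED", "CANCELLED", "TIMED_OUT"]);

function reasonRow(
  status: string,
  classification: { category: string; title: string } | null,
  errorMessage: string | null,
): string | null {
  // Running and COMPLETED tasks never get a Reason row.
  if (!TERMINAL_NON_COMPLETED.has(status)) return null;
  const value = classification
    ? `${classification.category}: ${classification.title}`
    : errorMessage?.trim() || null;
  // Never a dangling "Reason:" label with no value.
  return value ? `  Reason:        ${value}` : null;
}
```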

Tests: +10 regression tests covering Type line for pr_iteration /
pr_review / new_task / missing-pr-number, Reason line for
classification / fallback-message / neither / CANCELLED / TIMED_OUT /
COMPLETED-never-emits / RUNNING-never-emits. CLI suite: 181 passing
(was 171).

Refs: PR aws-samples#52 hands-on testing feedback
…_usd specifically

Pre-fix, a task that hit ``--max-turns`` or ``--max-budget`` showed the
generic "Agent task did not succeed" reason in ``bgagent status``:

  Task 01KQ… — FAILED (20s total)
    Reason: agent: Agent task did not succeed

… even though the agent's raw error_message already carried the
specific ``agent_status='error_max_budget_usd'`` signal. The classifier
at ``error-classifier.ts:205`` matched the outer ``Task did not
succeed.*agent_status=`` pattern and swallowed the specific literal.

Fix: add specific classifier patterns BEFORE the generic catch-all for
the three concrete ``agent_status`` literals the agent emits (see
``agent/src/pipeline.py::_resolve_overall_task_status``):

  - ``agent_status=error_max_turns``        → TIMEOUT: Exceeded max turns
  - ``agent_status=error_max_budget_usd``   → TIMEOUT: Exceeded max budget
  - ``agent_status=error_during_execution`` → AGENT: Agent errored during execution

The new titles surface the cap hit directly in ``bgagent status``'s
Reason line and point users at the concrete remedy (``--max-turns`` /
``--max-budget`` on the submit call). Category is TIMEOUT for the caps
(consistent with the existing TIMEOUT classifier for orchestrator poll
timeouts) and AGENT for mid-turn errors.
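
The ordering fix can be sketched as a first-match-wins pattern table — patterns and titles follow the mapping above, but the real classifier in ``error-classifier.ts`` carries many more entries:

```typescript
// Sketch of the first-match-wins classifier table: specific agent_status
// literals sit BEFORE the generic catch-all so they are never swallowed.
// Patterns tolerate the agent's single-quoted repr values.
interface Classification { category: string; title: string }

const CLASSIFIERS: Array<[RegExp, Classification]> = [
  [/agent_status='?error_max_turns'?/, { category: "TIMEOUT", title: "Exceeded max turns" }],
  [/agent_status='?error_max_budget_usd'?/, { category: "TIMEOUT", title: "Exceeded max budget" }],
  [/agent_status='?error_during_execution'?/, { category: "AGENT", title: "Agent errored during execution" }],
  // Generic catch-all stays last.
  [/Task did not succeed.*agent_status=/, { category: "AGENT", title: "Agent task did not succeed" }],
];

function classify(errorMessage: string): Classification | null {
  for (const [pattern, result] of CLASSIFIERS) {
    if (pattern.test(errorMessage)) return result;
  }
  return null;
}
```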

Tests: +4 regression tests covering each specific literal, the
ordering guarantee vs. the generic catch-all, and quoted-vs-unquoted
variants (the agent currently emits single-quoted repr values but a
future refactor could drop them). CDK suite: 1000 passing.

Refs: PR aws-samples#52 hands-on testing (Scenario 11 generic-reason feedback)
…king flag

Pre-fix, ``bgagent status <id>`` and ``bgagent status <id> --wait``
rendered fundamentally different views of the same task:

  - Default: the compact ``formatStatusSnapshot`` template
  - ``--wait``: the verbose ``formatTaskDetail`` key-value dump

Hands-on testing surfaced the UX trap: users asking "why does the
terse snapshot not show me the Type / error Reason?" had to learn
that adding a blocking flag was the only way to unlock a richer view.
The previous rationale ("snapshot is recency-biased and less useful
once the task has landed") is weaker now that commit 78f6cda added
Type + Reason to the snapshot itself — the snapshot is already the
complete human view.

New contract: ``--wait`` is a pure blocking flag. Both paths render
the SAME ``formatStatusSnapshot`` output. Only the blocking behavior
(and the status-derived exit code) differ.

Behavior matrix:

  status <id>                     → snapshot now
  status <id> --wait              → snapshot at terminal
  status <id> --output json       → TaskDetail JSON now
  status <id> --wait --output json → TaskDetail JSON at terminal

JSON mode is unchanged so scripting consumers see no contract break.

Changes:
  - ``cli/src/commands/status.ts``: drop the ``formatTaskDetail``
    branch in the ``--wait`` path. When ``--wait`` lands a terminal
    task, fetch the recent-events window and pass it through
    ``formatStatusSnapshot`` (cheap follow-up call; the task has
    already terminated so the window is stable).
  - ``--wait`` flag description clarified to state it's a blocking +
    exit-code flag, not a layout switch.

Tests: +2 regression tests covering (a) ``--wait`` renders the same
snapshot layout as the default path, and (b) ``--wait --output json``
still returns raw ``TaskDetail`` unchanged for scripting. CLI suite:
183 passing (was 181).

Carry-forward: ``bgagent submit --wait`` uses the same
``waitForTask`` helper but its final render path is simpler (already
just a single-task dump). Left unchanged for this commit — its
``--wait`` semantics are already honest (block-until-terminal with a
progress indicator).

Refs: PR aws-samples#52 hands-on testing (status ``--wait`` naming feedback)
Hands-on testing uncovered a data-visibility bug: ``channel_source`` is
computed by ``create-task-core.ts`` (``api`` for CLI / Cognito submits,
``webhook`` for HMAC-signed webhook submits) and persisted on the DDB
TaskRecord, but ``toTaskDetail`` never mapped it into the API response
and neither the CDK ``TaskDetail`` type nor the CLI mirror declared the
field. Every ``GET /tasks/{id}`` response silently dropped the
provenance signal.

Observed impact: Scenario 12 of PR aws-samples#52 deploy-validation ran ``bgagent
status <webhook-created-task> --output json | jq .channel_source`` to
verify the webhook path. It returned ``null`` even though the
CloudWatch log for the handler clearly showed ``channel_source:
"webhook"`` at creation.

Fix:

  - ``cdk/src/handlers/shared/types.ts::TaskDetail``: declare
    ``channel_source: string``. Non-optional — every TaskRecord has
    the field set at creation time (the ``create-task-core`` writer
    has required it since day one), so the API guarantee is honest.
  - ``cdk/src/handlers/shared/types.ts::toTaskDetail``: map
    ``record.channel_source`` through verbatim.
  - ``cli/src/types.ts::TaskDetail``: mirror the field per the
    documented CLI types-sync contract.
  - Test fixtures in ``cli/test/format.test.ts`` and
    ``cli/test/format-status-snapshot.test.ts`` get ``channel_source:
    'api'`` to satisfy the non-optional declaration.
  - ``cdk/test/handlers/get-task.test.ts`` grows two regression
    guards: (a) the default response carries ``channel_source``
    through, and (b) a webhook-flagged TaskRecord surfaces
    ``channel_source=webhook``. Pins the
    previously-silently-dropped field.

Not in scope for this commit:
  - No change to the compact ``formatStatusSnapshot`` rendering.
    Channel provenance is scripting / audit metadata, not a
    status-line concern. Available via ``bgagent status <id>
    --output json | jq .channel_source``.
  - ``TaskSummary`` (the list-response shape) intentionally omits
    ``channel_source`` — the list view is already terse.

CDK suite: 1001 passing (was 1000; +1 from the webhook-flagged
regression test, default-case assertion folded into the existing
``returns task detail successfully`` test). CLI suite: 183 passing.

Refs: PR aws-samples#52 hands-on testing (Scenario 12 webhook verification)
Hands-on testing feedback: after earlier PR aws-samples#52 polish added Type and
Reason to the default status snapshot (commit 78f6cda), the two
remaining fields users had to chase through ``--output json`` were
``channel_source`` (useful to tell webhook-submitted tasks apart from
CLI-submitted tasks at a glance) and ``task_description`` (the prompt
the user actually typed — the highest-signal reminder of what this
task is about).

Changes in ``cli/src/format.ts::formatStatusSnapshot``:

  - ``Channel:`` — always shown (``api`` / ``webhook``), immediately
    after ``Repo:``. Unlike ``Type:``, which is conditional on a
    non-default ``task_type``, provenance is always meaningful and
    always one word, so the line is cheap and uniform.
  - ``Description:`` — shown when ``task_description`` is populated,
    between ``Type:`` and ``Turn:``. Word-wrapped to 60-char columns
    with a 17-char continuation indent so a long prompt stays
    readable at 80-col terminals without truncating the user's input.
    Whitespace inside the description collapses to single spaces; no
    other reflow.

Intentionally NOT adding a ``Duration:`` row: the header already
shows ``(<elapsed>)`` for running tasks and ``(<total>)`` for
terminal tasks via ``elapsedDescription``. A dedicated Duration
line would duplicate that signal.

Examples:

  Running task, api channel, short description (fits on one line):

    Task 01KQ… — RUNNING (47s elapsed)
      Repo:          scoropeza/agent-plugins
      Channel:       api
      Description:   Make a small tweak to README.md
      Turn:          7 / 20
      ...

  Running task, webhook channel, long description (wraps):

    Task 01KQ… — RUNNING (12s elapsed)
      Repo:          scoropeza/agent-plugins
      Channel:       webhook
      Description:   This is a much longer description of a task that
                     the user submitted and needs to be wrapped across
                     multiple lines rather than truncated
      Turn:          2 / 20
      ...
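
The wrapping behavior shown above can be sketched as a minimal greedy word-wrap — the 60-char width and 17-char indent come from the text; everything else is illustrative and the real implementation in ``format.ts`` may differ:

```typescript
// Greedy word-wrap sketch: collapse internal whitespace, wrap at 60 chars,
// indent continuation lines 17 chars so the value column stays aligned.
function wrapDescription(raw: string, width = 60, indent = 17): string[] {
  const words = raw.trim().split(/\s+/).filter(Boolean);
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length > width && current) {
      lines.push(current); // start a new line rather than truncate
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) lines.push(current);
  return lines.map((line, i) => (i === 0 ? line : " ".repeat(indent) + line));
}
```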

Tests: +7 regression tests covering Channel rendering for api /
webhook / empty-degrade-to-placeholder, Description short /
wrapped-multi-line / null-omits-line / whitespace-trimming. The
golden ``happy path renders the full template`` snapshot updated to
include both new rows. CLI suite: 190 passing (was 183).

Refs: PR aws-samples#52 hands-on testing (snapshot visibility follow-up)
… async-only

Two combined changes in one file, committed together because pages 1-4
and page 5 lived in the same drawio multi-page file:

1. Full regeneration of pages 1-4 for the rev-6 async-only shape.

   The previous 11-page file predated PR aws-samples#52's rev-6 rework. It showed
   SSE streaming, two-runtime splits, a mid-turn interrupt proposal,
   and a WebSocket upgrade proposal that were all removed or
   superseded during this PR. Pages 1-4 now render:

     1. Rev-6 async-only architecture (full topology — CLI + webhook
        → API Gateway → 7 Lambda handlers → OrchestratorFn →
        AgentCore Runtime → TaskEventsTable spine → FanOutConsumer →
        per-channel dispatchers, with an explicit "Removed in rev-6"
        callout).
     2. ``bgagent watch`` adaptive polling (500 ms active, 1/2/5 s
        idle ladder, cursor advancement).
     3. Nudge flow (combined-turn ack per AD-5, with SDK-constraint
        rationale box).
     4. ``bgagent status`` + ``bgagent trace download`` (deterministic
        StatusFn, presigned S3 trace URL).

2. Page 5 rewrite for the ``If-Match`` removal (commit 2c2eda0).

   Removed: ``PATCH ... If-Match: <etag>``, 412 retry branch, GET-
   before-PATCH cold start, ``github_comment_etag`` persistence,
   ``last_etag`` UpdateItem.

   Added: single-shot PATCH (no conditional headers), ``agent_milestone``
   unwrap with ``ROUTABLE_MILESTONES`` allowlist note, DDB numeric
   coercion note, ``ConditionExpression`` guards on the persist path
   (``attribute_not_exists(github_comment_id)`` for first POST,
   ``github_comment_id = :prev`` for 404 re-POST), explicit
   ``fanout.dispatcher.rejected`` error-isolation callout, and an
   "AD-9 rationale" box citing the HTTP 400 finding that motivated
   the rewrite.

File size: 277 KB → 107 KB (−61%). Page count: 11 → 5. The dropped
pages were either obsolete (SSE / WebSocket / mid-turn-interrupt) or
covered in dedicated files (Phase 3 Cedar HITL stays in
``phase3-cedar-hitl.drawio``).

Not touched:
  - ``docs/src/content/docs/`` (generated; not affected by drawio).
  - ``phase3-cedar-hitl.drawio`` (separate file, still accurate).

Refs: PR aws-samples#52 rev-6 async-only rework + Scenario 7-extended If-Match
finding.
The upstream blueprint at ``cdk/src/stacks/agent.ts`` points at
``krokoko/agent-plugins``. During PR aws-samples#52 deploy-validation it was
flipped to ``scoropeza/agent-plugins`` so the test stack would onboard
Sam's own fork for hands-on testing. That override leaked into an
earlier PR commit (``1c87094``) and has to come back off before push.

This commit is a pure revert of the single ``repo:`` line. Everything
else on the branch stays as-is.
@scoropeza scoropeza changed the title feat(interactive-agents): Phase 1a/1b/2 (Phase 3 design only) feat(interactive-agents): async-only background task UX + Phase 3 Cedar HITL design May 4, 2026
@scoropeza scoropeza changed the title feat(interactive-agents): async-only background task UX + Phase 3 Cedar HITL design feat(interactive-agents): async-only background task UX + Cedar HITL design May 4, 2026
@scoropeza

Hi @krokoko — this PR is ready for another look.

Since we locked the rev-6 async-only architecture on 2026-04-30, the branch has been fully implemented, deployed to a dev stack, and E2E-validated. A few notable things happened along the way:

Deploy validation caught three real runtime bugs (all fixed on-branch)

All three passed 100% of the unit tests but failed on live AWS. Documented in the updated PR description:

  1. costUsd.toFixed is not a function — the DDB Document client can return numerics as strings; the fanout dispatcher was not coercing. Every terminal event on pr_iteration / pr_review was silently rejected in production.
  2. agent_milestone wrapper events never routed — fanout filter matched event_type literally; missed pr_created wrapped in an agent_milestone event. Zero comments ever posted on any PR.
  3. GitHub rejects If-Match on PATCH /issues/comments/{id} with HTTP 400. The ETag optimistic-concurrency scheme from the original §6.4 design never worked in production. AD-9 and §6.4 have been rewritten to rely on DDB-Stream ParallelizationFactor: 1 + single-writer invariant instead. The "rejected alternative" is documented in the design doc with the verbatim GitHub error string so the next person to reach for ETag hits the explanation immediately.

Hands-on testing — 15 scenarios against scoropeza/agent-plugins

All CLI surfaces validated end-to-end: bgagent submit / list / status / cancel / nudge / events / watch / trace download / webhook, plus --trace / --wait / --force / --idempotency-key / --verbose / --max-turns / --max-budget / --pr / --review-pr. Each hands-on session also surfaced UX gaps that are now fixed on-branch (snapshot missing Type / Channel / Description / Reason; generic error classifications; channel_source silently dropped from API responses; --wait formatter bifurcation; stale watch terminal message).

Test baselines updated

  • CDK: 1001 passing (was 879)
  • CLI: 190 passing (was 215 — count dropped because we deleted SSE tests during rev-6, then rose again with the hardening)
  • Agent: 468 passing

Phase 3 Cedar HITL — design + plumbing only, no engine

docs/design/PHASE3_CEDAR_HITL.md (1,883 lines) and the 12-page companion diagram are fully locked. Orchestrator + agent payload shape threads cedar_policies through end-to-end so the next PR that implements the engine doesn't need to touch the transport layer. No Cedar evaluation; no approval flow.

Follow-ups

6 carry-forward issues drafted to .l4-issue-drafts/ on the branch. Notable ones for post-merge:

  • failTask double-emits task_failed + over-decrements concurrency on Durable Execution step retry (pre-existing bug this PR surfaced; documented with root cause + fix plan).
  • Bedrock Guardrail PROMPT_ATTACK over-triggers on imperative-mood prompts even at MEDIUM (several example prompts captured).
  • Idempotent submit returns DUPLICATE_TASK error instead of the original task_id — non-standard semantics.
  • webhook_url missing from CreateWebhookResponse.

Ready for review whenever you have time. Happy to walk through any section or specific commit in a sync if that's faster.

@krokoko

krokoko commented May 5, 2026

Awesome, just a few findings from automated review, after that let's merge

  1. Fanout handler never returns DynamoDBBatchResponse despite reportBatchItemFailures: true

[CDK Code Review] cdk/src/constructs/fanout-consumer.ts:146 + cdk/src/handlers/fanout-task-events.ts:697

The construct sets reportBatchItemFailures: true but the handler returns void. When any record fails (e.g., resolveTokenSecretArn throws
AccessDeniedException), Lambda retries the entire batch instead of isolating the poisonous record. This defeats the per-task ordering guarantee and can cause
cascading retries.

Fix: Change the handler return type to Promise<DynamoDBBatchResponse>, wrap per-record processing in try/catch, and return
{ batchItemFailures: [{ itemIdentifier: record.eventID }] } for failed records.
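
A minimal sketch of that shape — types mirror aws-lambda's DynamoDBBatchResponse, and processRecord stands in for routeEvent:

```typescript
// Sketch of the partial-batch-failure contract: with
// reportBatchItemFailures enabled, only the listed eventIDs are retried.
interface StreamRecord { eventID?: string }
interface BatchResponse { batchItemFailures: Array<{ itemIdentifier: string }> }

async function handler(
  records: StreamRecord[],
  processRecord: (record: StreamRecord) => Promise<void>, // stand-in for routeEvent
): Promise<BatchResponse> {
  const batchItemFailures: Array<{ itemIdentifier: string }> = [];
  for (const record of records) {
    try {
      await processRecord(record); // per-record isolation
    } catch {
      // Only the poisonous record retries; the rest of the batch commits.
      if (record.eventID) batchItemFailures.push({ itemIdentifier: record.eventID });
    }
  }
  return { batchItemFailures };
}
```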

  2. turns_attempted / turns_completed bypass coerceNumericOrNull — same bug class as the fixed costUsd.toFixed crash

[Type Design + Error Handling] cdk/src/handlers/shared/types.ts:304-305

The toTaskDetail mapper passes these fields through with only ?? null:
turns_attempted: record.turns_attempted ?? null,
turns_completed: record.turns_completed ?? null,
These are written via the same DDB Document-client path that caused Bug #1 (costUsd.toFixed is not a function). Any downstream caller doing arithmetic on
these will crash at runtime.

Fix: Apply coerceNumericOrNull consistently:
turns_attempted: coerceNumericOrNull(record.turns_attempted, 'turns_attempted', logger),
turns_completed: coerceNumericOrNull(record.turns_completed, 'turns_completed', logger),

  3. Cancel hook doesn't short-circuit nudge processing — nudges consumed but never delivered

[Agent Runtime] agent/src/hooks.py:374-385

When the cancel hook fires and sets ctx["_cancel_requested"] = True, the loop continues to _nudge_between_turns_hook, which reads DDB, marks nudges as
consumed, and adds them to _INJECTED_NUDGES. Since cancel takes precedence, these nudges are consumed (DDB state changed) but never actually injected into
the agent's context. The user sees 202 Accepted for their nudge but it silently disappears.

Fix: add an `if ctx.get("_cancel_requested"): break` after the cancel hook, or have _nudge_between_turns_hook early-return when cancel is set.

  4. CLI/CDK type drift: turns_attempted and turns_completed optionality mismatch

[Type Design + CLI Review] cli/src/types.ts:66-69

CDK TaskDetail declares these as required (number | null), but CLI marks them optional (?). This violates the documented contract that these files must stay
in sync.

Fix: Remove the ? from both fields in cli/src/types.ts.


Important Issues (8 found — should fix)

  1. Unhandled exception in routeEvent crashes entire fanout batch

[CDK Code Review] cdk/src/handlers/fanout-task-events.ts:734

routeEvent is called with no try/catch inside the record loop. If resolveTokenSecretArn throws (intentional for AccessDeniedException), the entire handler
crashes — combined with Issue #1, all records in the batch retry.

  2. progress_writer.py circuit breaker trips on ValidationException — kills all progress writes permanently

[Error Handling] agent/src/progress_writer.py:138-152

The bare except Exception feeds all errors into the same circuit breaker. A persistent schema error (e.g., item > 400KB) trips the breaker for all event
types, silencing the entire progress stream. Transient vs. permanent errors should be distinguished.

  3. Reconciler silently succeeds when ALL transitions fail (e.g., IAM/DDB outage)

[Error Handling] cdk/src/handlers/reconcile-stranded-tasks.ts:261-270

Per-task failures are caught and logged at WARN, but the Lambda returns success. A systemic DDB failure strands all tasks indefinitely with only WARN-level
logs as signal.
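
A sketch of the failure accounting this calls for (function names and shapes are hypothetical, not from the handler):

```typescript
// Sketch: keep per-task failures at WARN, but escalate when nothing
// succeeded, so a systemic DDB/IAM outage fails the invocation loudly
// instead of hiding behind WARN logs.
async function reconcile(
  tasks: string[],
  transition: (id: string) => Promise<void>,
): Promise<{ attempted: number; failed: number }> {
  let failed = 0;
  for (const id of tasks) {
    try {
      await transition(id);
    } catch (err) {
      failed += 1;
      console.warn('transition failed', { id, err });
    }
  }
  if (tasks.length > 0 && failed === tasks.length) {
    // All-fail is systemic: throw so the Lambda is marked failed and
    // Errors-metric alarms can fire.
    throw new Error(`reconciler: all ${failed} task transitions failed`);
  }
  return { attempted: tasks.length, failed };
}
```

Partial failures still return success (individual tasks can legitimately fail), but the all-fail case becomes an alarmable signal.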

  4. Dual _ProgressWriter instances with independent circuit breakers

[Agent Runtime] agent/src/runner.py:240 + agent/src/pipeline.py:303

Two independent writers (runner-level for turn events, pipeline-level for milestones) each have their own failure counter. One can silently disable while the
other continues, creating invisible event gaps.

  5. renderCommentBody not self-defending against uncoerced DDB strings

[Error Handling] cdk/src/handlers/shared/github-comment.ts:329

Defense is entirely at the single fanout call site. A future caller passing raw DDB values will hit the same toFixed is not a function crash.
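
One way to make the renderer self-defending, sketched with an assumed cost-cell helper (the real renderCommentBody signature may differ):

```typescript
// Sketch: coerce inside the renderer so every caller gets the defense,
// not just the fanout call site. Raw DDB reads can surface numbers as
// strings, so coerce before calling .toFixed.
function renderCostCell(costUsd: unknown): string {
  const n = typeof costUsd === 'number' ? costUsd : Number(costUsd);
  return Number.isFinite(n) ? `$${n.toFixed(4)}` : 'n/a';
}
```

A future caller passing a raw DDB value then degrades to "n/a" instead of crashing with toFixed is not a function.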

  6. channel_source typed as string instead of 'api' | 'webhook' literal union

[Type Design] cdk/src/handlers/shared/types.ts:54,158

Loses exhaustiveness checking in downstream switches. The internal CreateTaskContext correctly narrows it but the external types don't.
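
A minimal sketch of the narrowed type and the exhaustiveness guard it restores (the switch body is illustrative, not from the handlers):

```typescript
// Sketch: a literal union makes downstream switches exhaustive.
type ChannelSource = 'api' | 'webhook';

function describeSource(source: ChannelSource): string {
  switch (source) {
    case 'api':
      return 'created via REST API';
    case 'webhook':
      return 'created via GitHub webhook';
    default: {
      // Compile-time guard: adding a new ChannelSource variant without
      // handling it here becomes a type error, not a silent fallthrough.
      const unreachable: never = source;
      return unreachable;
    }
  }
}
```

With channel_source typed as plain string, the never-assignment trick is unavailable and new variants fall through silently.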

  7. Python TaskConfig missing @model_validator for trace=True + user_id=""

[Type Design] agent/src/models.py:111-124

This combination causes a runtime failure at S3 upload time rather than at construction. A validator would surface it immediately.

  8. Markdown injection possible via prUrl in GitHub comment body

[CDK Code Review] cdk/src/handlers/shared/github-comment.ts:315-325

While sanitizeEventType strips dangerous characters, prUrl goes unsanitized into a markdown link. A compromised agent writing a crafted URL could break the
table layout or inject content.
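
A hedged sketch of one mitigation (the https/github.com allow-list and the helper name are assumptions about the deployment, not the existing code):

```typescript
// Sketch: validate prUrl before embedding it in a markdown link.
// A crafted value containing ')', '|' or angle brackets can terminate
// the link or break the table row; allow-listing the expected host and
// escaping link-breaking characters closes that off.
function safePrLink(prUrl: string): string {
  let parsed: URL;
  try {
    parsed = new URL(prUrl);
  } catch {
    return '(invalid PR URL)';
  }
  if (parsed.protocol !== 'https:' || parsed.hostname !== 'github.com') {
    return '(invalid PR URL)';
  }
  // Percent-encode characters that would end the markdown link or cell.
  const escaped = prUrl.replace(/[()|<>]/g, (c) => encodeURIComponent(c));
  return `[PR](${escaped})`;
}
```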
