Skip to content

Quarantine a saga whose branch-deletion compensation fails #434

Description

@hadamrd

Problem

The epic (#431) adds a new DELETE_BRANCH compensation kind and a reap_branch
handler so recovery can reclaim the orphan loop/<n> branch a dead worker left
behind. Once DELETE_BRANCH is a handled kind (registered in
_HANDLED_COMPENSATION_KINDS), reconcile_stale_sagas will actually call
reap_branch during the sweep — and reap_branch can fail (branch protected,
network/API error, ref already gone-but-erroring, git lock).

When that call raises, the saga must NOT silently reach a "clean" terminal
state. The never-half-compensated invariant (established for unhandled kinds in
#360) must also hold for a handled kind whose handler fails: a branch we
could not actually delete must surface as quarantined work needing a human, not
be recorded as COMPENSATED.

Concrete example: a stale saga carries a single DELETE_BRANCH compensation for
loop/42. The branch is protected, so reap_branch(repo, "loop/42") raises
RuntimeError("protected branch"). Today the generic
except Exception arm in the per-saga loop catches it, appends a line to
RecoveryReport.errors, and continues to the next saga — leaving this saga in
its non-terminal RUNNING state. It is therefore still stale, gets re-swept
on every subsequent boot/tick, and never drains from the in-flight view: the
same immortal-saga bug #360 fixed, reborn for a handler that raises. (The worker
must first confirm the exact current outcome — RUNNING-and-re-swept vs.
COMPENSATED-with-error — and write the test against whatever it actually is.)
The correct terminal state is QUARANTINED: it ends the saga (liveness) without
asserting the branch was deleted (integrity).

Acceptance criteria

  • When a stale saga's reap_branch (the DELETE_BRANCH handler) raises
    during reconcile_stale_sagas, the saga is driven to terminal
    TaskState.QUARANTINED — never COMPENSATED, never left RUNNING.
  • The failure is captured in RecoveryReport.errors: an entry naming the
    saga id and identifying the branch-deletion failure (kind and/or
    exception type/message). The saga does not appear in
    report.recovered.
  • The saga's terminal_reason explains it was parked because branch
    deletion failed (operator-readable; references the branch/kind).
  • The quarantined saga is terminal (is_terminal is True) and drains from
    store.list_in_flight(); a second reconcile_stale_sagas sweep is a
    no-op for it (not re-reported, not re-mutated).
  • One bad saga does not abort the sweep: a sibling stale saga with a
    handled, succeeding compensation in the same sweep still reaches
    COMPENSATED.
  • If reap_branch raises, no partial success is claimed — the saga is not
    recorded as having reaped that branch.

Test matrix

Unit (tests/test_recovery.py, reuse _store / _stale_with_compensations
helpers and inject a reap_branch that raises):

  • Adversarial / sad-path (primary):
    test_reconcile_quarantines_saga_when_branch_deletion_raises — stale saga with
    a DELETE_BRANCH compensation; injected reap_branch raises. Assert: state is
    QUARANTINED, is_terminal, report.recovered == (), exactly one
    report.errors entry naming the saga id, terminal_reason references the
    branch/kind, store.list_in_flight() == ().
  • Re-sweep idempotence: after the failing sweep, a second
    reconcile_stale_sagas returns errors == () and recovered == () for that
    saga (proves it is no longer stale / not immortal).
  • Sweep continues: mixed sweep — one saga whose reap_branch raises
    (→ QUARANTINED) and one healthy REMOVE_WORKTREE saga (→ COMPENSATED);
    assert both reach terminal and list_in_flight() == ().
  • Happy path unaffected: a DELETE_BRANCH saga whose reap_branch succeeds
    still reaches COMPENSATED and appears in report.recovered (guards against
    over-quarantining).
  • Exhaustiveness guard stays green: the existing
    _HANDLED_COMPENSATION_KINDS == set(CompensationKind) contract test still
    passes with DELETE_BRANCH registered.

Integration: _lease_reconcile_sweep (src/forge_loop/runner/boot.py) /
_cmd_recover surface the quarantine — confirm a failing branch reap yields a
non-zero report.errors count in the emitted event / CLI exit code (extend an
existing boot-recovery test rather than adding a new harness).

E2E: not required for this ticket (no live GitHub). If a real-git path already
exists for recovery, a single end-to-end "protected branch → quarantined" case
is a bonus, not a gate.

Out of scope

  • Adding the DELETE_BRANCH enum member, the reap_branch helper, and its
    registration in _HANDLED_COMPENSATION_KINDS — those are sibling tickets in
    Compensate the branch a failed worker abandons #431. If they are not yet merged on your base, stub/inject reap_branch in
    the test and make the minimal recovery.py change so the invariant holds
    regardless; do not block on the handler PR.
  • Enqueuing DELETE_BRANCH at dispatch time (separate sub-ticket).
  • Changing how REMOVE_WORKTREE is compensated, or the worktree reaper.
  • Auto-retrying / auto-unquarantining failed branch deletions — QUARANTINED is
    the intended human-parking terminal state; no retry logic here.
  • New RecoveryReport fields or a typed error model — reuse the existing
    errors: tuple[str, ...].
  • Any change to TaskState / CompensationKind beyond what siblings add.

File pointers

  • src/forge_loop/control/recovery.pyreconcile_stale_sagas; the per-saga
    loop and its except arm. Make the minimal change so a raising handler drives
    mark_quarantined(...) (with a branch-deletion reason) + an errors append,
    instead of leaving the saga non-terminal or marking it COMPENSATED. This is
    the only production file expected to change.
  • tests/test_recovery.py — add the adversarial + supporting tests; mirror the
    existing test_reconcile_quarantines_saga_with_* patterns.
  • src/forge_loop/tasks/saga.pyCompensationKind / TaskState (reference;
    DELETE_BRANCH added by a sibling).
  • src/forge_loop/tasks/store.pymark_quarantined(task_id, *, reason) /
    mark_compensated (the transitions you call; reference).
  • src/forge_loop/runner/_helpers.pyreap_worktree (the handler pattern
    reap_branch follows; reference) and src/forge_loop/runner/boot.py
    (_lease_reconcile_sweep wiring; reference).

Original report

Lock in the never-half-compensated invariant for the new DELETE_BRANCH kind: if
reap_branch raises during reconcile_stale_sagas, the saga must be driven to
QUARANTINED (parked for a human) and recorded as an error, and must NOT be marked
COMPENSATED. Add the adversarial test; make the minimal recovery.py change if
current behavior marks it COMPENSATED-with-error. Primary falsifiable acceptance:
when reap_branch raises, the saga ends in QUARANTINED with the error captured in
the RecoveryReport, and is never COMPENSATED.

Parent: #431 · Epic: "Compensate the branch a failed worker abandons"

Customer story: As an operator who relies on terminal saga state to mean what it
says: a branch that could not actually be deleted must surface as quarantined
work needing attention, not be silently marked compensated.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions