You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The epic (#431) adds a new DELETE_BRANCH compensation kind and a reap_branch
handler so recovery can reclaim the orphan loop/<n> branch a dead worker left
behind. Once DELETE_BRANCH is a handled kind (registered in _HANDLED_COMPENSATION_KINDS), reconcile_stale_sagas will actually call reap_branch during the sweep — and reap_branch can fail (branch protected,
network/API error, ref already gone-but-erroring, git lock).
When that call raises, the saga must NOT silently reach a "clean" terminal
state. The never-half-compensated invariant (established for unhandled kinds in #360) must also hold for a handled kind whose handler fails: a branch we
could not actually delete must surface as quarantined work needing a human, not
be recorded as COMPENSATED.
Concrete example: a stale saga carries a single DELETE_BRANCH compensation for loop/42. The branch is protected, so reap_branch(repo, "loop/42") raises RuntimeError("protected branch"). Today the generic except Exception arm in the per-saga loop catches it, appends a line to RecoveryReport.errors, and continues to the next saga — leaving this saga in
its non-terminalRUNNING state. It is therefore still stale, gets re-swept
on every subsequent boot/tick, and never drains from the in-flight view: the
same immortal-saga bug #360 fixed, reborn for a handler that raises. (The worker
must first confirm the exact current outcome — RUNNING-and-re-swept vs.
COMPENSATED-with-error — and write the test against whatever it actually is.)
The correct terminal state is QUARANTINED: it ends the saga (liveness) without
asserting the branch was deleted (integrity).
Acceptance criteria
When a stale saga's reap_branch (the DELETE_BRANCH handler) raises
during reconcile_stale_sagas, the saga is driven to terminal TaskState.QUARANTINED — never COMPENSATED, never left RUNNING.
The failure is captured in RecoveryReport.errors: an entry naming the
saga id and identifying the branch-deletion failure (kind and/or
exception type/message). The saga does not appear in report.recovered.
The saga's terminal_reason explains it was parked because branch
deletion failed (operator-readable; references the branch/kind).
The quarantined saga is terminal (is_terminal is True) and drains from store.list_in_flight(); a second reconcile_stale_sagas sweep is a
no-op for it (not re-reported, not re-mutated).
One bad saga does not abort the sweep: a sibling stale saga with a
handled, succeeding compensation in the same sweep still reaches COMPENSATED.
If reap_branch raises, no partial success is claimed — the saga is not
recorded as having reaped that branch.
Test matrix
Unit (tests/test_recovery.py, reuse _store / _stale_with_compensations
helpers and inject a reap_branch that raises):
Adversarial / sad-path (primary): test_reconcile_quarantines_saga_when_branch_deletion_raises — stale saga with
a DELETE_BRANCH compensation; injected reap_branch raises. Assert: state is QUARANTINED, is_terminal, report.recovered == (), exactly one report.errors entry naming the saga id, terminal_reason references the
branch/kind, store.list_in_flight() == ().
Re-sweep idempotence: after the failing sweep, a second reconcile_stale_sagas returns errors == () and recovered == () for that
saga (proves it is no longer stale / not immortal).
Sweep continues: mixed sweep — one saga whose reap_branch raises
(→ QUARANTINED) and one healthy REMOVE_WORKTREE saga (→ COMPENSATED);
assert both reach terminal and list_in_flight() == ().
Happy path unaffected: a DELETE_BRANCH saga whose reap_branch succeeds
still reaches COMPENSATED and appears in report.recovered (guards against
over-quarantining).
Exhaustiveness guard stays green: the existing _HANDLED_COMPENSATION_KINDS == set(CompensationKind) contract test still
passes with DELETE_BRANCH registered.
Integration: _lease_reconcile_sweep (src/forge_loop/runner/boot.py) / _cmd_recover surface the quarantine — confirm a failing branch reap yields a
non-zero report.errors count in the emitted event / CLI exit code (extend an
existing boot-recovery test rather than adding a new harness).
E2E: not required for this ticket (no live GitHub). If a real-git path already
exists for recovery, a single end-to-end "protected branch → quarantined" case
is a bonus, not a gate.
Out of scope
Adding the DELETE_BRANCH enum member, the reap_branch helper, and its
registration in _HANDLED_COMPENSATION_KINDS — those are sibling tickets in Compensate the branch a failed worker abandons #431. If they are not yet merged on your base, stub/inject reap_branch in
the test and make the minimalrecovery.py change so the invariant holds
regardless; do not block on the handler PR.
Enqueuing DELETE_BRANCH at dispatch time (separate sub-ticket).
Changing how REMOVE_WORKTREE is compensated, or the worktree reaper.
Auto-retrying / auto-unquarantining failed branch deletions — QUARANTINED is
the intended human-parking terminal state; no retry logic here.
New RecoveryReport fields or a typed error model — reuse the existing errors: tuple[str, ...].
Any change to TaskState / CompensationKind beyond what siblings add.
File pointers
src/forge_loop/control/recovery.py — reconcile_stale_sagas; the per-saga
loop and its except arm. Make the minimal change so a raising handler drives mark_quarantined(...) (with a branch-deletion reason) + an errors append,
instead of leaving the saga non-terminal or marking it COMPENSATED. This is
the only production file expected to change.
tests/test_recovery.py — add the adversarial + supporting tests; mirror the
existing test_reconcile_quarantines_saga_with_* patterns.
src/forge_loop/tasks/saga.py — CompensationKind / TaskState (reference; DELETE_BRANCH added by a sibling).
src/forge_loop/tasks/store.py — mark_quarantined(task_id, *, reason) / mark_compensated (the transitions you call; reference).
src/forge_loop/runner/_helpers.py — reap_worktree (the handler pattern reap_branch follows; reference) and src/forge_loop/runner/boot.py
(_lease_reconcile_sweep wiring; reference).
Original report
Lock in the never-half-compensated invariant for the new DELETE_BRANCH kind: if
reap_branch raises during reconcile_stale_sagas, the saga must be driven to
QUARANTINED (parked for a human) and recorded as an error, and must NOT be marked
COMPENSATED. Add the adversarial test; make the minimal recovery.py change if
current behavior marks it COMPENSATED-with-error. Primary falsifiable acceptance:
when reap_branch raises, the saga ends in QUARANTINED with the error captured in
the RecoveryReport, and is never COMPENSATED.
Parent: #431 · Epic: "Compensate the branch a failed worker abandons"
Customer story: As an operator who relies on terminal saga state to mean what it
says: a branch that could not actually be deleted must surface as quarantined
work needing attention, not be silently marked compensated.
Problem
The epic (#431) adds a new
DELETE_BRANCHcompensation kind and areap_branchhandler so recovery can reclaim the orphan
loop/<n>branch a dead worker leftbehind. Once
DELETE_BRANCHis a handled kind (registered in_HANDLED_COMPENSATION_KINDS),reconcile_stale_sagaswill actually callreap_branchduring the sweep — andreap_branchcan fail (branch protected,network/API error, ref already gone-but-erroring, git lock).
When that call raises, the saga must NOT silently reach a "clean" terminal
state. The never-half-compensated invariant (established for unhandled kinds in
#360) must also hold for a handled kind whose handler fails: a branch we
could not actually delete must surface as quarantined work needing a human, not
be recorded as
COMPENSATED.Concrete example: a stale saga carries a single
DELETE_BRANCHcompensation forloop/42. The branch is protected, soreap_branch(repo, "loop/42")raisesRuntimeError("protected branch"). Today the genericexcept Exceptionarm in the per-saga loop catches it, appends a line toRecoveryReport.errors, andcontinues to the next saga — leaving this saga inits non-terminal
RUNNINGstate. It is therefore still stale, gets re-swepton every subsequent boot/tick, and never drains from the in-flight view: the
same immortal-saga bug #360 fixed, reborn for a handler that raises. (The worker
must first confirm the exact current outcome — RUNNING-and-re-swept vs.
COMPENSATED-with-error — and write the test against whatever it actually is.)
The correct terminal state is
QUARANTINED: it ends the saga (liveness) withoutasserting the branch was deleted (integrity).
Acceptance criteria
reap_branch(theDELETE_BRANCHhandler) raisesduring
reconcile_stale_sagas, the saga is driven to terminalTaskState.QUARANTINED— neverCOMPENSATED, never leftRUNNING.RecoveryReport.errors: an entry naming thesaga id and identifying the branch-deletion failure (kind and/or
exception type/message). The saga does not appear in
report.recovered.terminal_reasonexplains it was parked because branchdeletion failed (operator-readable; references the branch/kind).
is_terminal is True) and drains fromstore.list_in_flight(); a secondreconcile_stale_sagassweep is ano-op for it (not re-reported, not re-mutated).
handled, succeeding compensation in the same sweep still reaches
COMPENSATED.reap_branchraises, no partial success is claimed — the saga is notrecorded as having reaped that branch.
Test matrix
Unit (
tests/test_recovery.py, reuse_store/_stale_with_compensationshelpers and inject a
reap_branchthat raises):test_reconcile_quarantines_saga_when_branch_deletion_raises— stale saga witha
DELETE_BRANCHcompensation; injectedreap_branchraises. Assert: state isQUARANTINED,is_terminal,report.recovered == (), exactly onereport.errorsentry naming the saga id,terminal_reasonreferences thebranch/kind,
store.list_in_flight() == ().reconcile_stale_sagasreturnserrors == ()andrecovered == ()for thatsaga (proves it is no longer stale / not immortal).
reap_branchraises(→
QUARANTINED) and one healthyREMOVE_WORKTREEsaga (→COMPENSATED);assert both reach terminal and
list_in_flight() == ().DELETE_BRANCHsaga whosereap_branchsucceedsstill reaches
COMPENSATEDand appears inreport.recovered(guards againstover-quarantining).
_HANDLED_COMPENSATION_KINDS == set(CompensationKind)contract test stillpasses with
DELETE_BRANCHregistered.Integration:
_lease_reconcile_sweep(src/forge_loop/runner/boot.py) /_cmd_recoversurface the quarantine — confirm a failing branch reap yields anon-zero
report.errorscount in the emitted event / CLI exit code (extend anexisting boot-recovery test rather than adding a new harness).
E2E: not required for this ticket (no live GitHub). If a real-git path already
exists for recovery, a single end-to-end "protected branch → quarantined" case
is a bonus, not a gate.
Out of scope
DELETE_BRANCHenum member, thereap_branchhelper, and itsregistration in
_HANDLED_COMPENSATION_KINDS— those are sibling tickets inCompensate the branch a failed worker abandons #431. If they are not yet merged on your base, stub/inject
reap_branchinthe test and make the minimal
recovery.pychange so the invariant holdsregardless; do not block on the handler PR.
DELETE_BRANCHat dispatch time (separate sub-ticket).REMOVE_WORKTREEis compensated, or the worktree reaper.the intended human-parking terminal state; no retry logic here.
RecoveryReportfields or a typed error model — reuse the existingerrors: tuple[str, ...].TaskState/CompensationKindbeyond what siblings add.File pointers
src/forge_loop/control/recovery.py—reconcile_stale_sagas; the per-sagaloop and its
exceptarm. Make the minimal change so a raising handler drivesmark_quarantined(...)(with a branch-deletion reason) + anerrorsappend,instead of leaving the saga non-terminal or marking it
COMPENSATED. This isthe only production file expected to change.
tests/test_recovery.py— add the adversarial + supporting tests; mirror theexisting
test_reconcile_quarantines_saga_with_*patterns.src/forge_loop/tasks/saga.py—CompensationKind/TaskState(reference;DELETE_BRANCHadded by a sibling).src/forge_loop/tasks/store.py—mark_quarantined(task_id, *, reason)/mark_compensated(the transitions you call; reference).src/forge_loop/runner/_helpers.py—reap_worktree(the handler patternreap_branchfollows; reference) andsrc/forge_loop/runner/boot.py(
_lease_reconcile_sweepwiring; reference).Original report
Lock in the never-half-compensated invariant for the new DELETE_BRANCH kind: if
reap_branch raises during reconcile_stale_sagas, the saga must be driven to
QUARANTINED (parked for a human) and recorded as an error, and must NOT be marked
COMPENSATED. Add the adversarial test; make the minimal recovery.py change if
current behavior marks it COMPENSATED-with-error. Primary falsifiable acceptance:
when reap_branch raises, the saga ends in QUARANTINED with the error captured in
the RecoveryReport, and is never COMPENSATED.
Parent: #431 · Epic: "Compensate the branch a failed worker abandons"
Customer story: As an operator who relies on terminal saga state to mean what it
says: a branch that could not actually be deleted must surface as quarantined
work needing attention, not be silently marked compensated.