fix(sandbox): bound ACP sandbox lifetime + self-heal vanished attach#153

Merged
DIodide merged 1 commit into staging from fix/persistent-sandbox-lifecycle
Jun 23, 2026
DIodide (Owner) commented Jun 23, 2026

Why

The per-workspace sandbox unification (#141) creates persistent Daytona boxes with auto_stop_interval but no auto_delete_interval, so Daytona stops → archives → keeps them forever. Two leak sources pile up as archived sandboxes:

  1. Leaked session-owned scratch boxes — teardown is missed on a gateway restart/crash, so the box is never deleted.
  2. Abandoned workspace boxes — a workspace nobody returns to.

Archived boxes take ~3 minutes to wake (filesystem restored from object storage), so once enough accumulate the whole Daytona account crawls — Claude Code sessions "start indefinitely" and sandboxes won't open. This is exactly what took staging down (82 sandboxes, 60 of them archived). The backlog was cleaned up manually; this PR makes it self-correcting.

What

1. Bound lifetime at the source — set auto_delete_interval on both create paths (provision_agent_sandbox + DaytonaService.create_sandbox):

  • Scratch (session-owned) boxes: 1 day — reclaimed before they even archive (default auto-archive is 7 days). They hold nothing durable (deleted on teardown anyway).
  • Persistent workspace / code-exec boxes: 14 days — they hold the user's files, so a long grace period.
  • The auto-delete clock is the same "continuously stopped" clock that drives auto-archive, so it spans the archived period — archived boxes do get reclaimed (verified against the SDK docstring).
  • Both intervals are env-tunable. Non-positive clamps to "disabled" because Daytona reads 0 as delete-immediately-on-stop (a data-loss footgun, not "off"). ephemeral=False on both sites avoids the SDK validator that would force the interval to 0.
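
The interval selection and clamp described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name, constants, and keyword arguments are hypothetical, and the `-1` "disabled" sentinel assumes that Daytona treats a negative `auto_delete_interval` as "off" (the description only states that `0` means delete-immediately-on-stop).

```python
MINUTES_PER_DAY = 24 * 60

# Hypothetical defaults mirroring the intervals in the description.
SCRATCH_AUTO_DELETE_MINUTES = 1 * MINUTES_PER_DAY      # 1 day = 1440 minutes
PERSISTENT_AUTO_DELETE_MINUTES = 14 * MINUTES_PER_DAY  # 14 days = 20160 minutes

# Assumption: a negative interval disables auto-delete, while 0 would
# delete the sandbox as soon as it stops.
AUTO_DELETE_DISABLED = -1


def auto_delete_minutes(
    persist: bool,
    scratch_minutes: int = SCRATCH_AUTO_DELETE_MINUTES,
    persistent_minutes: int = PERSISTENT_AUTO_DELETE_MINUTES,
) -> int:
    """Pick the auto-delete interval for a new sandbox.

    Persistent boxes get the long grace period, scratch boxes the short one.
    Any non-positive configured value is clamped to "disabled" so that a
    misconfigured 0 never turns into delete-on-stop.
    """
    configured = persistent_minutes if persist else scratch_minutes
    if configured <= 0:
        return AUTO_DELETE_DISABLED
    return configured
```

With these defaults, a scratch box is reclaimed after 1440 minutes, well inside the 7-day default auto-archive window, and an operator who sets either interval to `0` gets "disabled" rather than immediate deletion.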

2. Self-heal a vanished attach — a box Daytona auto-deletes still looks "owned" to verify_sandbox_owner (which checks the Convex record, not Daytona), so the next session would attach a ghost and hard-error. On DaytonaNotFoundError during an attach, _provision_once now:

  • drops the stale Convex link (sandboxes:removeByDaytonaId), then
  • for a workspace-unification box, creates a fresh persistent box and relinks it (transparent recovery);
  • for an explicit, user-chosen harness sandbox, surfaces the loss (can't fabricate the box the user pointed at) — but the dead link is still cleared so the next session doesn't re-attach it.

This mirrors the existing revive-path heal (which already handles a vanished owned box) and closes the gap on the initial-provision/attach path.
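
The attach-recovery decision above can be pictured as a small pure function. The sketch below is an assumption-laden illustration rather than the actual `_recover_from_missing_attach`: the dataclass, function name, and parameters are hypothetical, and it assumes that a box counts as "explicit" only when the user-chosen harness sandbox id matches the id that went missing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MissingAttachDecision:
    unlink: bool    # drop the stale Convex link
    recreate: bool  # create a fresh persistent box and relink it
    surface: bool   # report the loss to the caller instead of recovering


def decide_missing_attach(
    missing_id: str,
    explicit_harness_sandbox_id: Optional[str],
) -> MissingAttachDecision:
    """Decide what to do after a not-found error during an attach.

    The stale link is cleared in every case. Only a box the user explicitly
    pointed at is surfaced as lost; anything else (including an id mismatch)
    is treated as a workspace box and transparently recreated.
    """
    is_explicit = (
        explicit_harness_sandbox_id is not None
        and explicit_harness_sandbox_id == missing_id
    )
    if is_explicit:
        return MissingAttachDecision(unlink=True, recreate=False, surface=True)
    return MissingAttachDecision(unlink=True, recreate=True, surface=False)
```

Keeping the decision separate from the Daytona and Convex calls makes it testable without network access, which appears to be how the tests listed below exercise it.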

Tests

tests/test_sandbox_lifecycle.py:

  • _auto_delete_minutes interval selection (persist vs scratch, scratch < persistent, non-positive → disabled clamp).
  • _recover_from_missing_attach decision (workspace box → recover + unlink; explicit harness sandbox → surface but still unlink; id mismatch → treated as workspace).
  • _unlink_dead_sandbox calls removeByDaytonaId and swallows ConvexMutationError.
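
As a rough picture of the last bullet, a best-effort unlink might look like the sketch below. It is synchronous for brevity and uses stand-ins throughout: the local `ConvexMutationError` class, the `run_mutation` callable, and the `daytonaSandboxId` argument name are all hypothetical; only the mutation name `sandboxes:removeByDaytonaId` comes from the description.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


class ConvexMutationError(Exception):
    """Stand-in for the project's exception of the same name."""


def unlink_dead_sandbox(
    run_mutation: Callable[[str, dict], None],
    daytona_id: str,
) -> bool:
    """Remove the Convex record for a sandbox that no longer exists.

    Returns True if the mutation succeeded. A failed mutation is logged and
    swallowed so that cleanup never masks the original not-found error.
    """
    try:
        # Argument name is a guess; the real mutation's schema may differ.
        run_mutation("sandboxes:removeByDaytonaId", {"daytonaSandboxId": daytona_id})
    except ConvexMutationError:
        logger.warning("failed to unlink dead sandbox %s", daytona_id)
        return False
    return True
```

Swallowing the error here is a deliberate trade-off: a link that fails to clear is retried on the next attach, whereas raising would replace the more useful not-found error.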

Full fastapi suite: 370 passed, ruff clean.

Notes

  • Behavior change for DaytonaService.create_sandbox (Manage Sandboxes / code-exec): a user box untouched for 14 continuous days is now reclaimed by Daytona. Capped at 20/user and env-tunable; raise acp_persistent_sandbox_auto_delete_minutes (or set ≤0 to disable) if a longer-lived box is desired.

Persistent-sandbox unification (per-workspace boxes) created sandboxes with
auto_stop but no auto_delete, so Daytona stopped → archived → kept them
forever. Leaked session-owned boxes (teardown missed on a gateway restart)
and abandoned workspace boxes both piled up as archived sandboxes; archived
boxes take ~3 min to wake, so once enough accumulate the whole Daytona
account crawls and Claude Code sessions hang on cold start.

- Set auto_delete_interval on creation (both ACP + code-exec create paths).
  Scratch (session-owned) boxes hold nothing durable → reclaimed in 1 day
  (before they even archive); persistent workspace/code-exec boxes hold the
  user's files → 14-day grace. The "continuously stopped" clock spans the
  archived period, so archived boxes do get reclaimed. Both intervals are
  env-tunable; non-positive clamps to "disabled" (Daytona reads 0 as
  "delete immediately on stop", a data-loss footgun).
- Self-heal in _provision_once: a box Daytona auto-deletes still looks
  "owned" to verify_sandbox_owner (it checks Convex, not Daytona), so the
  next session would attach a ghost and error. On DaytonaNotFoundError for an
  attach, drop the stale Convex link and — for a workspace-unification box —
  create a fresh persistent one and relink. An explicit, user-chosen harness
  sandbox can't be fabricated, so that surfaces (link still cleared).
- Tests: interval selection + clamp, and the missing-attach heal decision.

DIodide merged commit 8975d60 into staging Jun 23, 2026
4 checks passed
DIodide deleted the fix/persistent-sandbox-lifecycle branch June 23, 2026 01:21
DIodide added a commit that referenced this pull request Jun 23, 2026
Release notes for the 1.0.0 major release covering the full unreleased
span since v0.2.1 (PRs #81–#153): live session following, rewind/fork,
chat + harness sharing & collaboration, Skill Packs, per-workspace agent
sandboxes, workspace credentials, per-credential usage, Claude Code
config, and reliability/integrity hardening. Devops/infra setup
(Redis Streams prod provisioning, CI, deploy plumbing) intentionally
excluded — user-facing changes only.