feat(cursor): operator-driven legacy stream-json -> ACP session migrator (transplant)#34
feat(cursor): operator-driven legacy stream-json -> ACP session migrator (transplant)#34heavygee wants to merge 29 commits into
Conversation
…tor (transplant) Adds a per-session and bulk migrator that moves pre-ACP Cursor sessions (left over from tiann#799's cutover) into the on-disk location `agent acp` reads from, then flips `cursorSessionProtocol = 'acp'` so the existing cursorAcpRemoteLauncher picks the session up on next resume. The migration is a transplant, not a replay: the legacy `store.db` is byte-identical in schema to the ACP one; only the directory layout differs. The migrator copies the legacy `~/.cursor/chats/<wsh>/<uuid>/store.db` into `~/.cursor/acp-sessions/<uuid>/`, synthesizes `meta.json`, drives `initialize + session/load + session/prompt` against a real `agent acp` in an isolated `$HOME` (staging - no pollution of the operator's real store), preserves `lastUsedModel` from the legacy meta record onto the HAPI sessions.model column, then removes the legacy source. Each step has a rollback path. Surfaces: - POST /api/sessions/:id/migrate-to-acp (web button) - POST /api/cursor/migrate-legacy + GET /api/cursor/legacy-candidates - Mirror endpoints under /cli/* (Bearer cliApiToken) for the CLI - `hapi cursor migrate <id> | --all | --dry-run | --keep-source | --force-archive-running | --list` - Web UI menu item on legacy session rows - Auto hapi.db backup before any bulk run Closes tiann#824 Refused-by-default cases: - non-cursor flavor; already on ACP; missing on-disk store; pre-existing ACP target dir; lifecycleState=running (override with --force-archive-running, which archives the session first and waits for the launcher to release the store.db lock). Verified end-to-end: - 25 unit tests covering every refusal and rollback branch (mocked `agent acp` probe + mocked metadata update) - 3 integration tests spawning real `agent acp` (opt-in via `CURSOR_AGENT_INTEGRATION=1`); one drives a synthetic store, two accept operator-supplied fodder uuids via env so real on-disk stores can be exercised without committing private fixtures.
…rict TS
CI typecheck (stricter than the local Bun-resolved tsc) flagged accesses
to ok-success fields (acpSessionId, sourceRemoved, etc.) as unsafe
because the success branch only had `dryRun?: false` while the dry-run
branch had `dryRun: true`. Making `dryRun: false` REQUIRED on the
production-success branch turns the union into a hard discriminator and
lets call sites narrow with `if (out.ok && !out.dryRun) {...}`.
No runtime change to the migrator's behavior - production-success
already returned the full payload; the test assertions just couldn't
prove it to tsc.
Also wires the explicit `dryRun: false` into the migrator's success
return and updates the tests to use the now-discriminator-friendly
narrowing.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2e3d0a18fc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Self-cold-review pass: tryRm(dirname(legacy.storeDbPath)) with recursive=true would silently destroy any non-store.db artifact a future cursor-agent version might drop into the chat dir. Use rmdirSync (only succeeds on empty dirs) so unknown files survive. In practice today only store.db, store.db-wal, store.db-shm ever live there (verified against operator's fodder), so behavior is unchanged on current cursor-agent; this is a defense-in-depth tweak.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a731c1ec55
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…, P2 401-refresh) P1 (WAL preservation) — Without checkpoint, un-checkpointed transactions in the legacy store.db-wal sibling would either appear stale in the ACP target or be lost outright when the cleanup step removes the WAL file. Added a PRAGMA wal_checkpoint(TRUNCATE) flush of the legacy store before any cp, inside the same pre-flight window where the launcher has been confirmed quiescent. Idempotent on non-WAL stores (no-op). P2 (skipVerify must still run session/load) — The schema docstring described skipVerify as skipping the prompt step. The implementation was bypassing the entire verify block, which would have let a corrupted transplant flip HAPI metadata. Split the verify into 'always-load, optionally-prompt'; skipVerify now skips ONLY the prompt. Tests updated: the existing skipVerify-happy-path expectation rewritten + a new test verifies skipVerify still refuses on session/load failure. P2 (web client 401 refresh) — migrateCursorSessionToAcp now triggers the same onUnauthorized refresh dance as the shared request() helper instead of hard-failing on an expired JWT. Matches existing session-action UX.
…ues)
P1 — Lock release must do a real SQLite busy-probe, not just stat-poll.
killSession kicks off cleanup asynchronously; an idle-but-not-yet-closed
runner can let the stat-based size-stability heuristic report 'released'
while still owning the file lock. Added sqliteLockHeldByOtherProcess()
which opens the legacy store with BEGIN IMMEDIATE; SQLITE_BUSY is the
authoritative signal that another process holds a write lock. The
size-poll is retained as a secondary signal for partial writes.
P2 — preflightSession now refuses on session.active=true as well as
lifecycleState==='running'. Some legacy rows may lack lifecycleState
but still be active in the hub cache; the CLI/API paths relied on the
lifecycle-only check, while the web UI already hid the button for
active sessions. Now uniform across all entry points.
P2 — Atomic ACP target creation. Replaced the existsSync + mkdirSync
(recursive: true) TOCTOU with an exclusive mkdirSync(recursive: false);
EEXIST is caught and converted to a target_already_exists refusal. The
rollback path only fires for the dir we exclusively created.
P2 — cursorSessionId basename validation. Reject anything that isn't
[A-Za-z0-9_.-]+ before any path join, with explicit rejection of '.'
and '..'. Prevents path-traversal in either the chats lookup or the
acp-sessions target.
P2 — hapi.db backup now uses SQLite's serialize() so committed pages in
hapi.db-wal are included. The raw cp from before could omit recent
sessions or pre-migration metadata. Added Store.backupTo(destPath)
exposing the consistent-snapshot path.
Tests:
- session.active=true refusal
- cursorSessionId path-traversal refusal (../escape, has/slash, .,
.., has\backslash)
- existing tests unchanged; full hub suite 299 pass.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2cbe6edb41
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
P1 — Lock release uses session.active poll as the primary signal. An archive() kicks off async cleanup; the closest hub-side proxy for "runner process has exited" is sessionCache.active flipping false (set when the runner socket disconnects). The migrator now polls that before falling back to the SQLite busy-probe + size-poll, then sleeps a 500ms grace after all three pass so any in-flight kernel syscall on the WAL/SHM can complete. SyncEngine injects the cache-poll fn. P2 — Abort bulk migration when the hapi.db backup fails. The previous code logged + continued; without the backup, operators have no rescue handle if a metadata flip lands wrong. Now throws unless the DB is in-memory (test fixtures). API handlers wrap in try/catch and surface the error as HTTP 500 with the backup-failure message. P2 — Windows agent-binary spawn. AcpVerifyProbe.start() now matches cursorAcpRemoteLauncher's shape with shell:true + windowsHide:true on win32, so the `agent.cmd` shim is reachable. Without this the verify probe fails with ENOENT even though normal Cursor ACP sessions work. P2 — Bulk CLI timeout removed. A finite 30-minute axios timeout could cut off a slow operator run mid-flight (verifyTimeoutMs is 120s per session by default; ~15 sessions hits the cap). The hub-side bulk operation is idempotent per session, so the request can safely block until completion. Operators that want to cap a run can split it with a smaller --lifecycle filter. Tests: full hub suite 299 pass + integration (real agent acp) 3 pass on the 34-msg fodder fixture. Typecheck clean across cli/hub/web. Co-authored-by: Cursor <cursoragent@cursor.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
@codex review Operator has topped up Codex credits after hitting the 5h Code Review cap. Current HEAD is
Last automatic review verdict was on commit |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96578a55b9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
P1 — Wait for real runner exit before transplanting. The previous awaitSessionInactive() poll was bogus: archiveSession() synchronously calls handleSessionEnd() which sets sessionCache.active=false, so any cache-based wait returns immediately. An idle Cursor process with no SQLite write lock could also pass the busy-probe instantly while still mid-SIGTERM-cleanup. Replaced with a real minimum dwell time (2s default) inside awaitLockRelease, gated on SQLite busy-probe + size-stability. Removed the SyncEngine-side awaitSessionInactive injection (it was the source of the false reassurance). P1 — Honor the agent ACP active-process lock. Cursor's `agent` binary enforces single-instance semantics (see cli/src/agent/backends/acp/agentCliGuard.ts); spawning the verify probe while another `agent acp` is live can SIGTERM the live one and crash the operator's Cursor session. The migrator now reads the $HAPI_HOME/locks/agent-acp-active/pid file and refuses with a new 'acp_transport_active' reason if the lock points to a live PID. The check is injectable for tests. P2 — Exclude active sessions from archived bulk candidates. The default archived bulk path filtered only on lifecycleState==='running', but preflightSession also treats session.active===true as running for legacy rows that lack lifecycle metadata. Without mirroring that here, `hapi cursor migrate --all` would include an active session, migrateOne would refuse it 'running_refused', and the bulk run would exit as failed. Extracted the filter to a pure exported predicate `isLegacyCursorMigrationCandidate()` and added unit tests. Tests: 35 pass (was 28) + integration 3 pass on the 34-msg fodder. Typecheck clean across cli/hub/web. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Round 4: addresses the 3 P1/P2 findings from the previous Codex pass on commit 96578a5:
HEAD is now eb446b3. 35 unit pass + 3 integration pass on the 34-msg fodder fixture. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb446b391b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Co-authored-by: Cursor <cursoragent@cursor.com>
P1 — Block resumes while mutating the store. Between preflight and the destructive flip+rm, a session could be resumed via the existing hub paths. The resume reads cursorSessionProtocol from hapi.db (still 'legacy' until the flip), spawns a stream-json runner against the legacy source — which we are about to remove. Added getCurrentSession dependency that the migrator calls right before the metadata flip; if it sees active=true / lifecycleState=running / a concurrent flip to acp, it rolls back the ACP placement and refuses with new reason 'session_resumed_during_migrate' (or 'already_acp' for the concurrent flip case). SyncEngine injects a real impl reading the session cache. P2 — Register ACP verify probes in the active lock. The probe now acquires the same $HAPI_HOME/locks/agent-acp-active/ lock the CLI guard uses, just before spawning agent acp, and releases it AFTER the SIGTERM in stop(). Without this, a concurrent `hapi cursor migrate` or model-list request could see no active ACP lock, spawn another agent acp, and the second spawn would SIGTERM this verify per Cursor's single-instance enforcement (cli/src/agent/backends/acp/agentCliGuard.ts). New `hapiHome` option on AcpVerifyProbe lets tests pin the lock to an isolated dir so they don't clobber the operator's real lock. P2 — Override Windows home variables for the verify spawn. HOME alone is insufficient on Windows; the agent binary may resolve via USERPROFILE / HOMEDRIVE / HOMEPATH and end up reading the operator's real .cursor tree. The probe env now sets all three on win32. P2 — Reject single-session --dry-run. The CLI accepted --dry-run for per-session use but the per-session API has no dryRun field, so the flag was silently dropped and the real migration ran. Now throws a clear error instructing the operator to use --keep-source or the bulk path. P2 — Preserve private permissions on hapi.db backups. Store.backupTo() now passes mode 0o600 to writeFileSync and chmods after as belt-and-braces. Without this the rescue copy (which contains session transcripts) could land world-readable under a permissive umask, even though the live hapi.db is locked down to 0600. Tests: 37 pass (was 35) — added rollback test for resume-mid-migration and concurrent-flip-to-acp scenarios. Integration 3 pass on the 34-msg fodder fixture; integration tests now use isolated $HAPI_HOME to avoid touching the operator's real ACP lock. Typecheck clean. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Round 5: addresses the 5 P1/P2 findings from the previous Codex pass on commit eb446b3:
HEAD is now be7696a. 37 unit pass (+2 rollback tests) + 3 integration pass. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be7696a857
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
P2 v2 — Atomic agent-acp-active lock acquisition. The previous mkdirSync(recursive:true) + writeFileSync was not race-safe: two concurrent migrations could both pass isAgentAcpTransportActive() and then BOTH 'acquire' the lock by writing their pid in succession. Now uses mkdirSync(recursive:false) which is atomic — exactly one caller wins on EEXIST contention. Stale locks (dead pid) are cleared and retried once. start() throws a typed error on race loss; verifyInTempHome catches it and the migrator surfaces it as the acp_transport_active refusal so the operator sees the authoritative reason. Added 4 unit tests for the lock primitive (acquire / refuse-when-held / stale-cleanup / no-clobber-on-stop). P1 v2 — Authoritative active-check inside the metadata flip. The migrator's getCurrentSession recheck (round 5) was best-effort: a resume landing AFTER the recheck but BEFORE the DB update would still slip through. Moved the same check INSIDE flipCursorSessionProtocolToAcp so the read-active and write-protocol pair are within a single synchronous Bun call (no JS interleaving). Added new flip result 'session-active' + corresponding UpdateAfterMigrateResult variant + corresponding refusal mapping back to 'session_resumed_during_migrate'. Added a unit test that drives the path with updateOverride returning session_active. Tests: 313 unit pass (was 308; +4 probe lock, +1 atomic flip) plus 3 integration on the 34-msg fodder fixture. Typecheck clean across cli/hub/web. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Round 6: addresses the 2 P1/P2 findings from previous pass on be7696a:
HEAD is now a6908df. 313 unit pass + 3 integration pass on 34-msg fodder fixture. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a6908df576
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
P1 v3 — Detect resume-then-exit during the migration window. The active-flag rechecks (both the migrator's getCurrentSession and the flip-time atomic check) only catch resumes that are STILL active at the moment of the check. A brief resume that came in, wrote turns to store.db/store.db-wal, and exited before the check would slip through — the store.db would have diverged from the snapshot we copied to the ACP target. Added an mtime+size fingerprint of the legacy store.db captured right after the WAL checkpoint; compared immediately before the metadata flip and rolled back with new reason 'legacy_store_modified_during_migrate' on any divergence. P2 — Respect metadata.homeDir + metadata.host. Hub processes can run under a different account than the user who created the Cursor session (service-account / multi-machine setups). The legacy chats live under the recorded session-owner home, not the hub's. The migrator now prefers metadata.homeDir for resolving both the legacy source (~/.cursor/chats) and the ACP target (~/.cursor/acp-sessions), falling back to homeDir() when the field is absent. Cross-host sessions (metadata.host != local hostName) are refused with new reason 'cross_host_session' so we never touch a same-id store under the wrong user account. hostName is a new injectable dep with default process.env.HAPI_HOSTNAME || os.hostname(). P2 v3 — Don't clobber freshly-starting CLI ACP locks. The previous retry path treated 'lock dir exists but no pid file' as stale and cleared it — but the CLI guard's registerActiveAcpTransport does mkdir BEFORE writeFileSync, so a probe landing in that small window would delete the live caller's lock. probeLockHolder now returns 'starting' for the pidless case, which falls through to the refuse- path. Only a present-but-dead pid file is treated as stale. Tests: 317 hub pass (was 313; +cross-host refusal, +fingerprint divergence rollback, +homeDir resolution, +mid-startup probe lock refusal). Integration 3 pass on the 34-msg fodder fixture (with injected hostName). Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Round 7: addresses the 3 P1/P2 findings from pass on a6908df:
HEAD is now 8145b73. 317 hub pass (+4 new tests) + 3 integration pass. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8145b73551
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5010bd77b9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Six P2/P3 findings from chatgpt-codex-connector on commit 5010bd7: 1. P3 — Treat partially-written pid files as active (cli/src/agent/backends/acp/agentCliGuard.ts) The previous round's pid-less check only protected the "no file" case. An empty/malformed pid file (mid-write window between mkdirSync and writeFileSync flush) would fall through to readLockPid() returning null and removeAcpLockDir() would clobber the live holder. Add probeLockPidFile() that returns missing | starting | live | dead. Only the 'dead' state clears the lock. 2. P2 — Clear stale running lifecycle during ACP flip (hub/src/sync/syncEngine.ts) When migrating a stale-running row (active=false, lifecycleState= running) the flip used to preserve all old metadata and only add cursorSessionProtocol:'acp'. Stale lifecycle survived and downstream code keyed on metadata.lifecycleState would keep treating the archived ACP session as live. Set lifecycleState:'archived' when the old value was 'running'. 3. P2 — Kill Windows process tree before releasing ACP lock (hub/src/cursor/acpVerifyProbe.ts) On Windows we spawn 'agent acp' through a shell (shell: true), so proc.kill only signals the wrapper. The agent child can survive and the next verifier acquires the freed lock + spawns a second agent while ours is still running. Use 'taskkill /F /T /PID' to kill the tree on win32 with SIGTERM fallback. 4. P2 — Reserve ACP lock before archiving (hub/src/cursor/cursorLegacyMigrator.ts) isAgentAcpTransportActive() was a point-in-time check; the actual lock acquisition didn't happen until verifyInTempHome much later. In the gap another agent acp could start and a --force-archive migration would kill the legacy session before refusing. Extract tryAcquireAcpActiveLock() as a free helper, acquire it in migrateOne BEFORE archive, hold across the entire mutation window, release in finally. Verify probe gets skipLockAcquire:true so it inherits the caller's hold instead of double-acquiring. 5. P2 — Refuse when ACP lock cannot be claimed (hub/src/cursor/acpVerifyProbe.ts) Non-EEXIST mkdir errors (permission denied, read-only fs, disk full) used to fall through to lockHeld=false and start() spawned agent acp UNGUARDED. Throw instead so verifyInTempHome surfaces the refusal cleanly. 6. P2 — Remove pid-less locks owned by the probe (hub/src/cursor/acpVerifyProbe.ts) If we successfully mkdir'd the lock dir but pid write failed, the old releaseLock() skipped removal because the pid file was absent. With the CLI guard now treating pid-less dirs as 'starting' (active), that left a permanently-stuck lock. We own the mkdir; release() must rmdir even when pidless. Only skip removal when the pid file contains a DIFFERENT valid pid (another holder somehow took over). Tests: - hub/cursor: 50 unit + 2 new lifecycle tests (lock acquired/released on success and on mid-flow refusal) - hub/probe: 12 unit tests including new tryAcquireAcpActiveLock helper, skipLockAcquire bypass behavior, pidless-owned release - cli/guard: 7 tests including empty pid file + malformed pid file treated as starting (mid-write race) - integration: 3 pass with injected acquireAcpActiveLock pointing at tmp dir (isolates from operator's live HAPI_HOME lock) Typecheck clean across hub/cli/web. Total: 335 hub tests + 7 cli + 3 integration pass; 0 unresolved threads. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review HEAD is now 5899a00. Addressed all 6 round-11 findings:
335 hub + 7 cli guard + 3 integration tests pass. Typecheck clean across hub/cli/web. 0 unresolved threads. Looking for a clean pass. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5899a00870
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc7b2e15af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Two P2 findings from chatgpt-codex-connector on commit fc7b2e1: 1. P2 — Make real ACP spawns honor the migration lock (cli/src/agent/backends/acp/AcpStdioTransport.ts + agentCliGuard.ts) When the hub-side migrator held the agent-acp-active lock, a normal Cursor ACP resume STILL spawned `agent acp`: registerActiveAcpTransport() used mkdirSync(recursive:true), which silently succeeds on an existing dir and overwrote the migrator's pid file. Worse, AcpStdioTransport spawned agent FIRST and called register SECOND, so even an atomic register couldn't prevent the racy spawn. - registerActiveAcpTransport() now uses mkdirSync(recursive:false) and throws AcpLockHeldError when another live (or mid-startup) process holds the lock. Stale (dead-pid) locks are still cleared and retried once. In-process refcount is preserved so the same CLI process can re-enter without re-claiming the on-disk lock. - AcpStdioTransport reordered: register FIRST, spawn SECOND. On spawn failure the register is released. AcpLockHeldError now propagates to the caller so resume/start refuses cleanly instead of spawning an agent that will get killed by Cursor's single- instance enforcement. - unregisterActiveAcpTransport() now only removes the lock dir when the pid file confirms we own it (or no pid file exists, which means we own the mkdir-but-pid-write-failed case). 2. P2 — Reject WAL reappearance before setting the baseline (hub/src/cursor/cursorLegacyMigrator.ts) PRAGMA wal_checkpoint(TRUNCATE) zeros store.db-wal. The post- checkpoint fingerprint captured whatever was there at the time; if a legacy process resumed and wrote a turn between checkpoint return and our stat(), the WAL grew above zero and became the baseline. Since the verify-copy and ACP-placement copy only store.db, those frames would never transplant — and the post- fingerprint mtime/size comparison would still match, allowing cleanup to delete the legacy WAL with frames lost. Added a post-checkpoint stat: if store.db-wal exists with size > 0, refuse the migration with legacy_store_modified_during_migrate ('between checkpoint and fingerprint capture'). Extracted checkpointLegacyStore as an injectable dep so the unit test can simulate a writer landing in the gap. Tests: - hub/cursor: 53 unit tests, including new 'refuses early when WAL has content immediately after the checkpoint' test that injects a checkpoint stub which leaves 4 bytes in -wal. - cli/guard: 11 tests, including 4 new ones for registerActiveAcpTransport throws-on-held, refuses-on-pidless, in-process-refcount-reuse, and stale-lock-clearing. - cli/vitest: 944 tests (full suite), all green. - integration: 3 pass with real agent acp + injected lock target. Typecheck clean across hub/cli/web. 0 unresolved threads. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review HEAD is now cb614b6. Round 12 addressed both P2 findings:
336 hub + 944 cli (vitest) + 11 cli (bun:test) + 3 integration tests pass. Typecheck clean. 0 unresolved threads. Looking for a clean pass. |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Operator feedback: the operator-driven "Migrate to ACP" button and its acp_transport_active error dialog leak implementation details to users who never agreed to the cursor-json->ACP protocol switch in the first place. Migration is the debt WE owe for changing the underlying protocol; the user must not be made aware of cursorSessionProtocol, single-instance agent-acp locks, or any of the rest of it. Pivot to **invisible, automatic, per-session migration on first open** of a legacy stream-json session, with three coordinated changes: 1. Rip the UI surface. Remove the "Migrate to ACP" menu entry, ConfirmDialog flow, MigrateIcon, the matching state + handler in SessionHeader, and the en/zh-CN locale strings. The REST endpoint POST /api/sessions/:id/migrate-to-acp stays as an operator escape hatch; the API client method also stays. 2. Isolate the verify probe's HAPI_HOME (tiann#824 follow-up). verifyInTempHome now overrides HAPI_HOME (in addition to HOME) so the spawned `agent acp` child registers its agent-acp-active lock under the per-verify temp dir, NOT under the host's real HAPI_HOME where other live transports (peer agents, IDE chats) already hold the lock. Without this isolation the auto-migration path on a busy machine would always refuse with acp_transport_active, defeating its own purpose. The default `acquireAcpActiveLock` is now a no-op grant; tests that need the conservative refusal can still inject `() => null`, and operator CLI/REST callers that want the pre- isolation behavior can inject `tryAcquireAcpActiveLock(home)`. 3. Wire sync-on-open in `SyncEngine.resumeSession`. New helper `maybeAutoMigrateLegacyCursorSession` runs the existing transplant migrator synchronously when a user opens an inactive legacy stream-json cursor session. The user sees the regular session- loading state for ~3-5s longer; the session opens as ACP and the next open is instant. Guards: env flag `HAPI_CURSOR_LEGACY_AUTO_ MIGRATE` defaults on (set to `0`/`false`/`no`/`off` to disable); non-cursor + already-ACP + missing-cursorSessionId all bypass; if the host's agent-acp-active lock is held by a live PID the migration is also skipped (the post-migration ACP launcher would fail to spawn anyway, so we spare the user the wasted wait). All failures are soft — the legacy launcher continues to handle the open. Tests: - 10 new SyncEngine.maybeAutoMigrateLegacyCursorSession tests covering the full guard matrix (non-cursor, already-acp, no-id, env-flag-off, env-flag=false, live-lock-held, dead-lock, no-lock, soft-fail, unexpected-throw). - 1 new migrator test asserting HOME==HAPI_HOME==fakeHome and HAPI_HOME != real HAPI_HOME on the verify probe's spawn env. - All 54 existing migrator tests still green. - All 350 hub tests + 944 cli + 841 web + 77 shared still green. Closes-via-replace: the operator-facing manual button flow from earlier revisions of this PR. tiann#824 issue body to be rewritten to match. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0058220731
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…edback swear01 pushback on tiann#824 flagged "three surfaces is too much for a remediation PR". This commit concedes the bulk path entirely so the PR focuses on the load-bearing core for tiann#784 safety remediation. Removed: - CLI: `hapi cursor migrate` subcommand (cursorMigrate.ts + dispatcher entry) - Hub: `POST /api/cursor/migrate-legacy` (createCursorRoutes) + the CLI-mirror `POST /cli/cursor/migrate-legacy` + `GET /cli/cursor/legacy-candidates` routes - SyncEngine: `bulkMigrateLegacyCursorSessions`, `backupHapiDb`, `getLegacyCursorSessionCandidates`, and the now-unused `isLegacyCursorMigrationCandidate` export - Store: `backupTo()` (only used by the bulk pre-flight backup) - Migrator: `dryRun` option (only the bulk path used it for previews) - Types: `CursorMigrateBulkRequest{,Schema}`, `CursorMigrateBulkResult`, and the `dryRun: true` variant of `CursorMigrateOutcome` Retained (the safety-critical core): - Transplant migrator + verify + rollback machinery - SyncEngine.maybeAutoMigrateLegacyCursorSession sync-on-open trigger (invisible auto-migration the first time a user opens a legacy session) - `POST /api/sessions/:id/migrate-to-acp` (per-session operator escape hatch) - HAPI_CURSOR_LEGACY_AUTO_MIGRATE env kill-switch Tests: 338 hub tests pass (3 integration skipped without LEGACY_FODDER_UUID). Typecheck clean on hub + cli. ~650-line reduction in PR surface. Hold for swear01 review before pushing. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Scope-strip pass: dropped the bulk migration surface (CLI subcommand, bulk REST endpoints, Rationale: upstream maintainer pushback on tiann#824 flagged the 3-surface scope as too costly for a remediation PR. This narrows the diff to the load-bearing path. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d26739a7d3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…nn#835 Revert agentCliGuard.ts + AcpStdioTransport.ts + agentCliGuard.test.ts to upstream/main. Strip the agent-acp-active lock pre-check from maybeAutoMigrateLegacyCursorSession + simplify its test suite to match. Per tiann#832 (filed by swear01), the assumption underpinning my v8 strict-mkdirSync refusal — that Cursor's `agent` binary SIGTERMs a second `agent acp` process on the same host — is incorrect. Two `agent acp` processes coexist fine. The lock's actual job is narrower: prevent `agent --list-models` from killing an active `agent acp` child. swear01's open PR tiann#835 ("fix(cursor): merge SKU catalog under ACP lock and refcount agent guard") refactors agent-acp-active from a single-PID file to a cross-process refcount with live-pid reconciliation. We expect tiann#835 to land before tiann#824 (this PR) and have added swear01's three open ACP mop-up PRs (tiann#835/tiann#836/tiann#837) to the local driver-manifest soup so the migrator is exercised against the eventual upstream shape. Net effect on tiann#824: * agent-acp-active lock concerns are someone else's problem now (the refcount lock in tiann#835 just works for our migrator flow). * Verify probe still runs in isolated HAPI_HOME (unchanged) — that isolation is independent of the host-side lock semantics and remains the correct safety boundary for the probe. * Auto-migrate trigger in resumeSession is unchanged behaviourally; only the now-obsolete lock pre-check is gone. 53/53 unit tests pass (cursorLegacyMigrator + syncEngineAutoMigrate). Co-authored-by: Cursor <cursoragent@cursor.com>
Orchestrator caught a gap: the helper had zero observability. Every early-return path AND the success path emitted nothing. Only soft-fail and unexpected-throw logged. End result: operator couldn't distinguish "helper fired and skipped at guard X" from "helper fired and migrated successfully" from "helper never called at all". Add five log points per session: [auto-migrate] considering — unconditional entry log [auto-migrate] skipped: <reason> — one per guard (env, flavor, proto, cursorId) [auto-migrate] starting transplant — all guards passed [auto-migrate] success — migration completed (was previously silent) [auto-migrate] legacy ... left as stream-json — soft fail (existed) [auto-migrate] unexpected error — throw (existed) The 8 syncEngineAutoMigrate tests still pass. Surfacing this gap is also a process lesson: I declared "soup live, go dogfood" without confirming the helper actually fires on the live hub resume path. Receipts now possible the moment the next hub restart loads this commit. Co-authored-by: Cursor <cursoragent@cursor.com>
Live dogfood on hapi-hub.service surfaced verify_load_failed soft-fail:
"initialize failed: Executable not found in $PATH: \"agent\"".
hapi-runner.service ships an explicit
Environment=PATH=$HOME/.local/bin:$HOME/.npm-global/bin:/usr/local/bin:
/usr/bin:/bin in its unit drop-in so its spawned cursor-agents find
the Cursor binary at ~/.local/bin/agent (the standard install location).
hapi-hub.service has no such drop-in and inherits systemd's minimal
PATH (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin),
so every hub-side `spawn('agent', ...)` ENOENTs.
Move the PATH augmentation into the verify probe itself: prepend
$HOME/.local/bin and $HOME/.npm-global/bin to baseEnv.PATH before
spawning. Now hub installs work out of the box without the operator
hand-tuning a systemd dropin.
Pin the behaviour with a regression test: stub `agent` binary in a
fake $HOME/.local/bin, baseline PATH=/usr/bin:/bin (no ~/.local/bin),
confirm the probe still finds and runs the stub (exit 99).
The migrator's safety contract was preserved end-to-end during the
dogfood: cp succeeded, verify failed, the soft-fail path fired and
the legacy session stayed exactly where it was. The fix unblocks
the verify step so migration can complete.
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d61dcf92f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…nv.HOME Second dogfood attempt (same hub session, fresh fodder b52e3f51-...) reproduced the same `Executable not found in $PATH: "agent"` failure even with the previous PATH-augmentation commit (5d61dcf) loaded. Root cause: my own fix had a bug. cursorLegacyMigrator passes a HOME-overridden env to the probe for HAPI_HOME isolation (HOME=fakeHome, HAPI_HOME=fakeHome, both pointing to /tmp/hapi-acp-verify-<token>). The first patch derived bin paths from baseEnv.HOME — so on dogfood runs it augmented PATH with /tmp/.../local/bin and /tmp/.../npm-global/bin, both nonexistent. The real ~/.local/bin/agent stayed unfindable. Fix: resolve the binary lookup against process.env.HOME (the hub process's own home, set by systemd User= to the operator account). The HOME override is for cache/state isolation — it doesn't move the agent binary install location. Test updates: pin both the systemd-PATH dogfood AND the HOME-override isolation case. The new test spreads them: stub agent at process.env.HOME/.local/bin/agent, override env HOME to a separate tmpdir without .local/bin, assert spawn exit code = stub's 99. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Three new commits since the bot's last pass at
Live dogfood receipts (4 successful auto-migrations against real archived legacy cursor sessions on the operator's box, including the operator's own 374MB jellybot-dev session) are in the updated PR body under 'Live dogfood on the soup (2026-06-07)'. The post-mortem on the Please focus the new pass on the verify-probe spawn changes in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c682facc1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
F1 — acpVerifyProbe.ts: use path.delimiter for PATH joining
Windows PATH entries are `;`-separated. Previous `:` join produced
invalid combined entries on win32 (`C:\...\.local\bin:C:\Windows\...`)
that cmd.exe treated as one segment.
F2 — acpVerifyProbe.ts: accept session-owner home via agentLookupHome
Deployment shape where hub runs as a service account whose
process.env.HOME differs from the human user who installed Cursor
(recorded in `metadata.homeDir`). Migrator threads `this.deps.homeDir()`
into the probe so the binary lookup uses the right home. Falls back
to process.env.HOME when no explicit value supplied.
F3 — acpVerifyProbe.ts: preserve explicit options.env.PATH precedence
Previous fix PREPENDED cursor-bin fallback — overrode a caller's
intentionally pinned `agent`. Switch to APPEND so the existing PATH
wins; fallback only kicks in when the existing PATH doesn't already
contain `agent`. Still fixes the minimal-systemd-PATH dogfood case.
F4 — sessions.ts + cli.ts: reject malformed JSON on migrate-to-acp
`c.req.json().catch(() => ({}))` silently converted parse errors
into `{}` which passed Zod (all-optional schema) and ran the
migration with destructive defaults. Distinguish "no body" from
"malformed JSON" (return 400 on the latter).
F5 — syncEngine.ts auto-migrate helper: pass forceArchiveRunning: true
Inactive rows with stale `lifecycleState=running` (orphaned by hub
crashes that didn't finish the lifecycle transition) were being
refused with `running_refused` and silently fell back to the legacy
launcher forever. resumeSession's caller-side `if (active) return`
guard already ensured no live runner — the stale lifecycle is
exactly the kind of stuck row sync-on-open is meant to clean up.
Tests: 2 new tests pin precedence (F3) + path.delimiter (F1) via stubbed
competing agent binaries in priority vs fallback dirs. Existing live-
dogfood regression test extended to cover both explicit agentLookupHome
AND fallback to process.env.HOME. 339 hub tests pass, typecheck clean.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Codex review #34 P2 (round 13, finding 3369444056): on a multi-user host with a default 022 umask and a non-private ~/.cursor tree, the new ACP session directory was being created with the default 0o755 mode and the transplanted store.db inherited the source file's mode bits (typically 0o644). Other local users could traverse the session dir and read the transplanted transcript — even when the source store and HAPI DB were private. - mkdirSync(acpSessionDir, { recursive: false, mode: 0o700 }) - chmodSync(store.db, 0o600) after copyFileSync (copy inherits source) - writeFileSync(meta.json, ..., { mode: 0o600 }) 339 hub tests pass, typecheck clean. Co-authored-by: Cursor <cursoragent@cursor.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
@codex review Round 13 (HEAD
Two new tests pin the PATH-precedence (F3) and Please review the latest commits:
|
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Without an in-flight progress indicator the 15-20s dark wait while the
migrator transplants store.db, spawns agent acp, replays notifications,
and tears down the verify probe reads as "broken / nothing's happening"
to a fresh reviewer. Operator dogfood quote: "it looks like it's broken
and nothing is working and then suddenly it just refreshes the entire
session... it's not fantastic but it absolutely works". Shipping the
migrator without this banner means every first-time reviewer forms a
sticky bad first impression — the polish IS the feature surface.
Design (UX A++):
- Hub adds metadata.cursorMigrationState='in_progress' BEFORE the
long-running transplant. The session-cache refresh emits the existing
session-updated SSE event so the web client picks it up in
milliseconds. No client-side polling needed; SSE is already real-time.
- Hub clears the flag in the SAME metadata write that flips
cursorSessionProtocol to 'acp' on success. A single SSE event carries
both transitions, so the banner disappears in the same render tick
the chat re-renders as ACP — no flicker window.
- Hub clears the flag explicitly in the auto-migrate helper's finally
on failure/exception so the banner never gets stuck if migration
falls back to the legacy launcher.
- Web renders a thin accessible (role=status, aria-live=polite) banner
with an indeterminate spinner and short explanation when the flag is
set. No fake percentage — we don't have phase data and a fake
progress bar would lie.
Files:
- shared/src/schemas.ts: add cursorMigrationState to MetadataSchema.
- hub/src/sync/syncEngine.ts:
+ flipCursorSessionProtocolToAcp drops cursorMigrationState in the
same write that flips protocol to 'acp'.
+ maybeAutoMigrateLegacyCursorSession sets the flag before transplant;
finally-clears on failure/exception only (success path delegates to
the atomic flip writer).
+ setCursorMigrationStateInProgress / clearCursorMigrationState
helpers handle the version-mismatch retry.
- hub/src/sync/syncEngineAutoMigrate.test.ts: 4 new tests pin flag
transitions across failure / exception / success-no-clear /
flip-atomic-clears paths.
- web/src/components/CursorMigrationBanner.tsx: new component.
- web/src/components/CursorMigrationBanner.test.tsx: 10 tests covering
all visibility paths + a11y role/aria-live.
- web/src/components/SessionChat.tsx: render banner between
SessionHeader and TeamPanel.
- web/src/lib/locales/en.ts + zh-CN.ts: copy strings.
- web/src/types/api.ts: re-export Metadata for component typing.
Hub 343 pass (was 339; +4), web banner 10/10, typecheck clean across
all three workspaces.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review Round 14 (HEAD
New tests: hub 4 new (flag transitions across failure/exception/success-no-clear/flip-atomic-clears), web banner 10 (visibility paths + a11y). Hub 343 pass (was 339; +4), web banner 10/10, typecheck clean across cli/web/hub. Note: bot may be at quota cap; if so this triggers a queued pass for when credits refill. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 35daa9f2eb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // host's live ACP transports cannot block migration. Operator | ||
| // CLI/REST callers that want the pre-isolation conservative | ||
| // behavior can inject `() => tryAcquireAcpActiveLock(home)`. | ||
| acquireAcpActiveLock: deps.acquireAcpActiveLock ?? (() => ({ release() {} })), |
There was a problem hiding this comment.
Serialize production verify probes with a real ACP lock
In the SyncEngine path no acquireAcpActiveLock override is passed, so production uses this no-op handle while the default probe is also constructed with skipLockAcquire: true. If two legacy Cursor sessions are opened/migrated at about the same time, both verifiers can spawn agent acp concurrently; the existing ACP guard documents that Cursor's agent CLI is single-process and a second spawn can terminate the first, so one or both migrations can fail and fall back to legacy despite being otherwise migratable. Use a real shared lock here, or only skip the probe's internal lock when the caller actually acquired one.
Useful? React with 👍 / 👎.
| if (latest && !wasActive && latest.lifecycleState === 'running') { | ||
| // Lifecycle says running but active=false and we did NOT | ||
| // archive this session ourselves. Something external lifted | ||
| // it back to running between preflight and now (rare but | ||
| // possible if the session was preflight-archived and then | ||
| // a peer un-archived it via the lifecycle API). | ||
| tryRm(acpSessionDir) | ||
| return refusal(session.id, 'session_resumed_during_migrate', `session lifecycle became running during migration (lifecycleState=running, active=false, wasActive=false); rolled back ACP placement`, start, this.deps.now) |
There was a problem hiding this comment.
Let stale-running inactive rows reach the ACP flip
When an inactive legacy Cursor row still has metadata.lifecycleState === 'running' after a hub crash, the auto-migrate path passes forceArchiveRunning: true and intentionally skips the archive RPC because active is false, but this recheck later treats the same stale lifecycle value as a resume race and rolls back the ACP placement. In that stale-row case the session will silently fall back to the legacy launcher every time instead of being migrated and having the lifecycle cleared by flipCursorSessionProtocolToAcp.
Useful? React with 👍 / 👎.
| // PATH. Thread the migrator's homeDir dep through so the | ||
| // probe doesn't fall back to process.env.HOME when a more | ||
| // specific lookup home is available. | ||
| agentLookupHome: this.deps.homeDir() |
There was a problem hiding this comment.
Use the recorded owner home for agent binary lookup
For sessions with metadata.homeDir different from the hub process home, migrateOneWithLock correctly reads the legacy store under the recorded owner home, but the default verifier still augments PATH from this.deps.homeDir() rather than that recorded home. In service-account hub deployments where Cursor's agent binary is installed in the human user's ~/.local/bin, verification will fail with agent not found even though the store was located successfully; thread the chosen source home into the probe's agentLookupHome.
Useful? React with 👍 / 👎.
| export type RenameSessionRequest = z.infer<typeof RenameSessionRequestSchema> | ||
|
|
||
| /** Per-session legacy stream-json → ACP migrator request. See tiann/hapi#824. */ | ||
| export const CursorMigrateToAcpRequestSchema = z.object({ |
There was a problem hiding this comment.
Reject unknown migration options before destructive defaults
Because this Zod object is not strict, unknown keys are silently stripped; with all fields optional, an operator typo such as {"keep_source": true} validates as {} and the migration proceeds with the destructive default that removes the legacy source after a successful flip. For this destructive endpoint, reject unrecognized keys so malformed-but-valid JSON cannot accidentally drop the safety option.
Useful? React with 👍 / 👎.
Summary
Adds invisible, automatic, per-session ACP migration triggered on first
open of a legacy stream-json Cursor session. The hub transplants the legacy
~/.cursor/chats/<wsh>/<uuid>/store.dbinto~/.cursor/acp-sessions/<uuid>/,verifies it loads via
agent acp, flipscursorSessionProtocol = 'acp',removes the legacy source — all before
resumeSessionreturns its target.The user sees one upgrade pause; subsequent opens are pure ACP.
Sees
tiann/hapi#824for the design and the spike report atdocs/plans/2026-06-06-cursor-legacy-to-acp-spike.mdfor the underlyingevidence (identical SQLite schema, only the directory layout differs
between legacy and ACP).
Why this is safety, not feature-parity
This PR's primary justification is remediation for
#784(cursor-agentfabricates
Questions skipped by the userresponses in legacy stream-jsonmode). Despite
#801's mitigation, the bug still fires regularly indogfood: the agent ships destructive side effects against fabricated
consent. Migration to ACP closes the protocol-level door — the
AskQuestiontool doesn't exist on the ACP side, so there's nothing tofabricate.
#799deliberately left existing sessions on legacy "because they keepworking." That tradeoff was reasonable at the time. The accumulated
#784evidence makes legacy sessions actively unsafe; this PR makes the upgrade
path invisible enough that operators stop avoiding it.
Coordination with upstream ACP mop-up
This PR stacks on top of swear01's three open ACP mop-up PRs:
#835fix(cursor): merge SKU catalog under ACP lock and refcount agent guard(fixes fix(cursor): incomplete cliModelSkus during ACP lock — variant picker shows raw wire params tiann/hapi#831 + fix(cursor): agent-acp-active lock clears while another ACP session still runs tiann/hapi#832) — cross-process refcount + live-pid reconciliation for
the
agent-acp-activelock. The migrator's post-flip ACP launcher claimsthis lock through the standard path; this PR does not modify
cli/src/agent/backends/acp/agentCliGuard.ts.#836fix(web): hide sidebar fake sessions for Cursor resume/archive(fixes
#833) — sidebar dedup. Post-migration the same session row getsa new
cursorSessionProtocolvalue; fix(web): hide sidebar fake sessions for Cursor resume/archive tiann/hapi#836'sdeduplicateSessionsByAgentIdensures the upgraded session doesn't show as two rows during the
transition.
#837fix(cursor): register cursorSessionId before ACP session/load(fixes
#834) — pre-writescursorSessionId+cursorSessionProtocolbefore awaiting
session/load, the same flip-before-resume orderingthis PR's migration helper observes.
Local soup is already running this PR's branch stacked on top of tiann#835/tiann#836/tiann#837
so the migrator is exercised against the eventual upstream shape.
Live dogfood on the soup (2026-06-07)
After the original PR draft landed I dogfooded against the live hub
(
hapi-hub.serviceon the operator's box) instead of trusting the unitsuite alone. Four migrations on real sessions:
5a2fba34-...(jellybot-dev, active operator session)jellybot-dev6c63c1f4-...androidsideloader3dd30dc4-...sparlingAll four triggered via the existing Reopen path in the Web UI (operator
clicks; hub auto-fires the helper inside
resumeSession). All four:cursorSessionProtocol=acpinhapi.db~/.cursor/acp-sessions/<curId>/{store.db,meta.json,WAL}~/.cursor/chats/<wsh>/<curId>/store.db(except wherean active runner held an open fd — see "Known wart" below)
The active-session migration (jellybot-dev: operator's own working
session, 374MB store.db, 8650 messages of history) succeeded mid-flight
— the conversation continued in the upgraded ACP session without the
operator having to do anything beyond clicking Reopen.
Post-mortem: three follow-up commits after the original draft
The first live dogfood surfaced an environmental gap the unit suite did
not catch. Three follow-up commits:
3cc4c037— observabilityThe helper's success path and all four early-return guards were silent.
On the first dogfood run, zero log lines meant I could not distinguish
"helper bailed at guard X" from "helper succeeded silently" from "helper
was never called." Added five log points:
[auto-migrate] considering(unconditional entry)[auto-migrate] skipped: <reason>per guard[auto-migrate] starting transplant[auto-migrate] success(structured: durationMs, sourceRemoved,replayNotifications, lastUsedModelPreserved)
[auto-migrate] legacy ... left as stream-json(soft fail; existed)This is now the source of the receipts in the table above.
5d61dcf9(attempt 1) →6c682fac(fix): PATH augmentation for verify-probeSecond dogfood run produced:
Root cause:
hapi-hub.serviceships systemd's minimal PATH(
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin) andnever sees Cursor's standard install location at
$HOME/.local/bin/agent.hapi-runner.servicehand-fixes this in its drop-in viaEnvironment=PATH=$HOME/.local/bin:.... The hub had no equivalent,so every
spawn('agent', ...)inside the verify probe returned ENOENT.5d61dcf9prepended$HOME/.local/binand$HOME/.npm-global/bintobaseEnv.PATHbefore spawning. Wrong fix: the migrator passesHOME=<tmpdir>to the probe for HAPI_HOME isolation, sobaseEnv.HOMEpointed at a tmpdir that doesn't contain.local/bin.6c682facswitched the lookup toprocess.env.HOME(the hub process'sown home, set by systemd
User=) and added a regression test thatexplicitly distinguishes
process.env.HOMEfrom a per-spawnHOMEoverride. Right fix: the agent binary lives at the operator's real
home regardless of any per-verify
HOMEoverride (Cursor installs atthe user's
$HOME/.local/binat install time, not at spawn time).The safety contract held across all of this: every failed verify left
the legacy session exactly as it was; no
rmfired without a verifysuccess.
In-flight upgrade banner (UX A++, ships in this PR)
The migration takes 15-20 seconds on long sessions (copy a multi-hundred-MB
store, spawn
agent acp, replay thousands of notifications, tear down theprobe). The first dogfood pass shipped this window with no progress
indicator — and the operator's lived feedback made clear that was a
non-starter:
Shipping the migrator without the banner means every first-time reviewer
forms a sticky bad first impression — "looks broken." The polish IS the
feature surface here, not an optional follow-up.
Design (intentionally minimal, no SSE event-type expansion):
Hub sets
metadata.cursorMigrationState='in_progress'BEFORE thelong-running transplant call. The session-cache refresh emits the
existing
session-updatedSSE event (no new event type), so the webclient picks it up in milliseconds. No client-side polling change
needed; the SSE channel is already real-time.
Hub clears the flag in the SAME metadata write that flips
cursorSessionProtocolto'acp'on success. A single SSE eventcarries both transitions, so the banner disappears in the same render
tick the chat re-renders as ACP — no flicker window between
"banner gone" and "ACP content rendered".
Hub clears the flag explicitly in the auto-migrate helper's
finallyon failure / exception, so the banner never gets stuck if migration
soft-fails and the legacy launcher takes over.
Web renders an accessible (
role=status,aria-live=polite) bannerwith an indeterminate spinner and short explanation when the flag is
set. Deliberately no fake percentage — we don't have phase data and a
fake progress bar would lie. The minimal-information design beats
engineering-brain over-design (per-phase SSE events, progress
percentages) which doesn't actually solve the lived rendering-timing
pain.
Files added (UX A++ slice on top of the migrator core):
shared/src/schemas.ts: addcursorMigrationState: z.enum(['in_progress'])toMetadataSchemahub/src/sync/syncEngine.ts:flipCursorSessionProtocolToAcpdrops the flag in the atomic write;setCursorMigrationStateInProgress/clearCursorMigrationStatehelpers with version-mismatch retry;maybeAutoMigrateLegacyCursorSessionsets the flag before transplant and finally-clears on failure/exception onlyhub/src/sync/syncEngineAutoMigrate.test.ts: 4 new tests pin the flag transitions (failure / exception / success-no-clear / flip-atomic-clears)web/src/components/CursorMigrationBanner.tsx: thin banner component (50 lines)web/src/components/CursorMigrationBanner.test.tsx: 10 tests covering all visibility paths + a11yweb/src/components/SessionChat.tsx: render banner betweenSessionHeaderandTeamPanelweb/src/lib/locales/{en,zh-CN}.ts: copy strings (English + Simplified Chinese)web/src/types/api.ts: re-exportMetadatafor component typingSurfaces (drastically reduced from earlier draft)
Per the cost critique on
#824, the bulk-migration surface area isstripped. No CLI subcommand, no web button, no REST bulk endpoint, no
Store.backupTo, nodryRun, no candidate-listing API.SyncEngine.resumeSessioncallsmaybeAutoMigrateLegacyCursorSessionbefore resolving the target. Migration is transparent.
HAPI_CURSOR_LEGACY_AUTO_MIGRATE=0|false|no|offdisables auto-migration (default = ON).
POST /api/sessions/:id/migrate-to-acpformanual one-off triggers (used by tests + operator scripts).
Verify-probe isolation
The probe spawn uses an isolated
HOMEandHAPI_HOME(both pointed ata per-probe
mkdtempSyncdirectory). This means:agent acpcannot see or mutate the user's real~/.cursor/acp-sessions/agent-acp-activelockThe real
cpof the legacy store into~/.cursor/acp-sessions/<uuid>/happens after the probe succeeds in the temp
HOME. The legacy sourceis
rmd only after thehapi.dbprotocol flip lands cleanly.Refusal contract (silent — caller falls back to legacy launcher)
cursorSessionIdThe session is never left in a half-migrated state. Verify failure rolls
back the placement before the protocol flip; WAL checkpoint + size
checks ensure store.db consistency across the copy.
Verification
cursorLegacyMigrator.test.ts(45 tests coveringrefusals, rollback, telemetry, force-archive, ACP-lock dependency
injection, fingerprint protection) +
syncEngineAutoMigrate.test.ts(8 tests covering the guard matrix + soft-fail behaviour).
agent acp(opt-in viaCURSOR_AGENT_INTEGRATION=1):buildSyntheticLegacyStoreLEGACY_FODDER_WSH+LEGACY_FODDER_UUIDbun run typecheckclean.Out of scope (mentioned in tiann#824 for future work)
hapi cursor import --from-cursor-diskfor the ~329 chats in~/.cursor/chats/that HAPI has no metadata row for. Separate discoverycleanly when
active=true; the next resume on an inactive row triggersthe migration invisibly.
Test plan (reviewer)
maybeAutoMigrateLegacyCursorSessionruns before thelegacy launcher resolves its target (otherwise the user gets a stale
legacy spawn).
cursorLegacyMigrator.test.tsmatch thecontract in
apiTypes.ts(CursorMigrateRefusalReason).HOME+HAPI_HOMEisolation preventspollution of the user's real
~/.cursor/acp-sessions/.HAPI_CURSOR_LEGACY_AUTO_MIGRATE=0cleanly disables auto-migration.