fix(transport): prevent SSE replay storms and silent message loss#396
fix(transport): prevent SSE replay storms and silent message loss#396dimakis wants to merge 9 commits into
Conversation
Three fixes for the SSE transport regression causing 8+ minute message
delivery delays and silent message drops:
1. Periodic sync replay storm prevention (connection-registry):
- Replace isSessionActive with shouldSync that checks session STATE
(ACTIVE/STARTING/CREATED only), not just the isActive boolean.
SUSPENDED/DETACHED/ENDED sessions no longer trigger sync.
- Add MAX_REPLAY_GAP (200 events) cap — when cursor is far behind
head (e.g. after server restart), skip ahead instead of replaying
thousands of events at 50/5s. Client lazy-loads history via API.
- Add getHeadSeq to EventStore for gap detection.
2. POST retry for failed message sends (sse-connection):
- Re-queue send POSTs that get HTTP errors (e.g. 404 from stale
connectionId after reconnect) or network errors. Messages flush
on next successful reconnect instead of being silently dropped.
- Only send endpoints retry — stop/interrupt are fire-and-forget.
3. Guard ProcessTransport.interrupt() crash (chat.ts):
- Wrap queryInstance.interrupt() in try/catch to prevent unhandled
errors from killing the server process. This was the trigger that
caused server restart → in-memory state loss → replay storms.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 5 issue(s) (2 warning).
packages/client/src/sse-connection.ts
Solid fix for a real production issue — replay storm prevention and retry-on-failure are well-designed. Main concern is the missing MAX_PENDING_SENDS cap on the re-queue path in doPost, and EventStore.getHeadSeq lacks direct unit tests against SQLite.
- 🟡 bugs (L340): Re-queue in doPost bypasses the MAX_PENDING_SENDS cap enforced in send() (line 97). If multiple flushPendingSends rounds fail repeatedly, pendingSends can grow beyond 100 without the shift() eviction. Consider applying the same cap here.
[fixable]
packages/protocol/src/event-store.ts
Solid fix for a real production issue — replay storm prevention and retry-on-failure are well-designed. Main concern is the missing MAX_PENDING_SENDS cap on the re-queue path in doPost, and EventStore.getHeadSeq lacks direct unit tests against SQLite.
- 🟡 missing_tests (L617): EventStore.getHeadSeq() has no unit tests. It's exercised indirectly via connection-registry tests (which mock it), but the actual SQLite query (MAX(seq)) is never tested against a real database — e.g., no events returns 0, single event returns its seq, multiple events returns the max.
[fixable]
server/index.ts
Solid fix for a real production issue — replay storm prevention and retry-on-failure are well-designed. Main concern is the missing MAX_PENDING_SENDS cap on the re-queue path in doPost, and EventStore.getHeadSeq lacks direct unit tests against SQLite.
- 🔵 unsafe_assumptions (L105): shouldSync returns false for null state (unknown session ID), which is correct but implicit. If a session is watched but hasn't been registered in the EventStore yet (race at startup), periodic sync silently skips it. This is probably fine since events wouldn't exist yet either, but worth noting.
server/chat.ts
Solid fix for a real production issue — replay storm prevention and retry-on-failure are well-designed. Main concern is the missing MAX_PENDING_SENDS cap on the re-queue path in doPost, and EventStore.getHeadSeq lacks direct unit tests against SQLite.
- 🔵 missing_tests (L1269): The new try/catch around queryInstance.interrupt() has no test verifying that the server survives a ProcessTransport crash. The existing interruptChat tests in send-to-chat.test.ts don't cover the case where interrupt() throws.
[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid fix for a real production issue — replay storm prevention and retry-on-failure are well-designed. Main concern is the missing MAX_PENDING_SENDS cap on the re-queue path in doPost, and EventStore.getHeadSeq lacks direct unit tests against SQLite.
- 🔵 style (L876): The callCount tracking variable in the first two retry tests shadows mockFetch.mock.calls.length. Using mockFetch.mockImplementationOnce() for the first call and a default for subsequent calls would be cleaner and avoid the manual counter.
[fixable]
| // Message delivery failed (e.g. 404 from stale connectionId). | ||
| // Re-queue so it flushes on the next successful reconnect. | ||
| console.warn(`[SseConnection] POST /${endpoint} returned ${res.status}, re-queuing`); | ||
| this.pendingSends.push({ endpoint, body }); |
There was a problem hiding this comment.
🟡 bugs: Re-queue in doPost bypasses the MAX_PENDING_SENDS cap enforced in send() (line 97). If multiple flushPendingSends rounds fail repeatedly, pendingSends can grow beyond 100 without the shift() eviction. Consider applying the same cap here. [fixable]
| return (row?.state as SessionState) ?? null; | ||
| } | ||
|
|
||
| /** Return the highest seq number for a session, or 0 if no events exist. */ |
There was a problem hiding this comment.
🟡 missing_tests: EventStore.getHeadSeq() has no unit tests. It's exercised indirectly via connection-registry tests (which mock it), but the actual SQLite query (MAX(seq)) is never tested against a real database — e.g., no events returns 0, single event returns its seq, multiple events returns the max. [fixable]
| // shouldSync skips ENDED/SUSPENDED/DETACHED sessions to prevent replay storms. | ||
| // getHeadSeq caps replay depth so post-crash reconnects don't flood clients. | ||
| connRegistry.setEventStore({ | ||
| getEventsAfter: (sessionId, afterSeq, limit) => |
There was a problem hiding this comment.
🔵 unsafe_assumptions: shouldSync returns false for null state (unknown session ID), which is correct but implicit. If a session is watched but hasn't been registered in the EventStore yet (race at startup), periodic sync silently skips it. This is probably fine since events wouldn't exist yet either, but worth noting.
| await Promise.allSettled(stops); | ||
| } | ||
| await session.queryInstance.interrupt(); | ||
| try { |
There was a problem hiding this comment.
🔵 missing_tests: The new try/catch around queryInstance.interrupt() has no test verifying that the server survives a ProcessTransport crash. The existing interruptChat tests in send-to-chat.test.ts don't cover the case where interrupt() throws. [fixable]
| expect(mockFetch).not.toHaveBeenCalled(); | ||
| }); | ||
|
|
||
| // ─── POST retry on failure ─────────────────────────────────────────────── |
There was a problem hiding this comment.
🔵 style: The callCount tracking variable in the first two retry tests shadows mockFetch.mock.calls.length. Using mockFetch.mockImplementationOnce() for the first call and a default for subsequent calls would be cleaner and avoid the manual counter. [fixable]
- Add MAX_PENDING_SENDS cap to re-queue path in doPost via enqueuePending() helper, preventing unbounded queue growth on repeated flush failures - Add EventStore.getHeadSeq unit tests against real SQLite (0 events, single event, max of multiple events) - Add interrupt() crash survival test — verifies ProcessTransport error doesn't propagate and message still reaches inputQueue - Use mockImplementationOnce in retry tests instead of manual call counters Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 5 issue(s) (1 warning).
|
Centaur ReviewFound 4 issue(s) (1 warning).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (3 warning).
packages/client/src/sse-connection.ts
Solid defensive fix addressing real failure modes (replay storms, interrupt crashes, silent message loss). The main concern is that client-side send re-queuing doesn't distinguish transient from permanent POST failures and lacks an immediate retry path when the SSE stream stays connected. The MAX_REPLAY_GAP references client lazy-loading that doesn't exist yet.
- 🟡 bugs (L336): Re-queues on ALL non-OK statuses including permanent errors (400, 422). A malformed request will be retried on every reconnect indefinitely, evicted only when the queue fills to MAX_PENDING_SENDS. Consider excluding 4xx client errors (except 404/408/429) from re-queue.
[fixable] - 🟡 bugs (L340): When doPost fails for a 'send' during normal connected operation (not during reconnect), the message is re-queued to pendingSends but only flushed on the next SSE reconnect. If the POST failure was transient (e.g. 500) but the SSE stream stays alive, the re-queued message sits in the queue indefinitely. Consider triggering a reconnect cycle when a send POST fails, or implementing an immediate retry with backoff.
[fixable] - 🔵 style (L339): The console.warn interpolation
POST /${endpoint}will always bePOST /sendsince the re-queue is guarded byendpoint === 'send'. The template is misleadingly generic.[fixable]
packages/harness/src/connection-registry.ts
Solid defensive fix addressing real failure modes (replay storms, interrupt crashes, silent message loss). The main concern is that client-side send re-queuing doesn't distinguish transient from permanent POST failures and lacks an immediate retry path when the SSE stream stays connected. The MAX_REPLAY_GAP references client lazy-loading that doesn't exist yet.
- 🟡 unsafe_assumptions (L55): MAX_REPLAY_GAP comment says 'Client can lazy-load older events via the /api/sessions/:id/events API', but the @mitzo/client package has no lazy-loading implementation. Events skipped by the gap cap are silently lost to the client. Either implement client-side lazy-loading or document this as a known trade-off.
server/index.ts
Solid defensive fix addressing real failure modes (replay storms, interrupt crashes, silent message loss). The main concern is that client-side send re-queuing doesn't distinguish transient from permanent POST failures and lacks an immediate retry path when the SSE stream stays connected. The MAX_REPLAY_GAP references client lazy-loading that doesn't exist yet.
- 🔵 missing_tests (L108): shouldSync returns false when getSessionState returns null (unknown session). This is correct behavior but the edge case is untested — the connection-registry tests mock shouldSync directly rather than exercising the actual null-state path wired in index.ts.
[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid defensive fix addressing real failure modes (replay storms, interrupt crashes, silent message loss). The main concern is that client-side send re-queuing doesn't distinguish transient from permanent POST failures and lacks an immediate retry path when the SSE stream stays connected. The MAX_REPLAY_GAP references client lazy-loading that doesn't exist yet.
- 🔵 missing_tests (L875): No test for the scenario where doPost fails during flushPendingSends (reconnect flush). If a re-queued send fails again during flush, it should be re-queued once more. This interaction between flushPendingSends and enqueuePending is the most complex path and deserves coverage.
[fixable]
| }, | ||
| body: JSON.stringify(body), | ||
| }); | ||
| if (!res.ok && endpoint === 'send') { |
There was a problem hiding this comment.
🟡 bugs: Re-queues on ALL non-OK statuses including permanent errors (400, 422). A malformed request will be retried on every reconnect indefinitely, evicted only when the queue fills to MAX_PENDING_SENDS. Consider excluding 4xx client errors (except 404/408/429) from re-queue. [fixable]
| // Message delivery failed (e.g. 404 from stale connectionId). | ||
| // Re-queue so it flushes on the next successful reconnect. | ||
| console.warn(`[SseConnection] POST /${endpoint} returned ${res.status}, re-queuing`); | ||
| this.enqueuePending(endpoint, body); |
There was a problem hiding this comment.
🟡 bugs: When doPost fails for a 'send' during normal connected operation (not during reconnect), the message is re-queued to pendingSends but only flushed on the next SSE reconnect. If the POST failure was transient (e.g. 500) but the SSE stream stays alive, the re-queued message sits in the queue indefinitely. Consider triggering a reconnect cycle when a send POST fails, or implementing an immediate retry with backoff. [fixable]
| if (!res.ok && endpoint === 'send') { | ||
| // Message delivery failed (e.g. 404 from stale connectionId). | ||
| // Re-queue so it flushes on the next successful reconnect. | ||
| console.warn(`[SseConnection] POST /${endpoint} returned ${res.status}, re-queuing`); |
There was a problem hiding this comment.
🔵 style: The console.warn interpolation POST /${endpoint} will always be POST /send since the re-queue is guarded by endpoint === 'send'. The template is misleadingly generic. [fixable]
| const SYNC_BATCH_LIMIT = 50; | ||
| // Maximum gap between cursor and head before we skip ahead. | ||
| // Prevents replay storms after server restart or long disconnects. | ||
| // Client can lazy-load older events via the /api/sessions/:id/events API. |
There was a problem hiding this comment.
🟡 unsafe_assumptions: MAX_REPLAY_GAP comment says 'Client can lazy-load older events via the /api/sessions/:id/events API', but the @mitzo/client package has no lazy-loading implementation. Events skipped by the gap cap are silently lost to the client. Either implement client-side lazy-loading or document this as a known trade-off.
| eventStore.getEventsAfter(sessionId, afterSeq, limit), | ||
| isSessionActive: (sessionId) => eventStore.getSession(sessionId)?.isActive ?? false, | ||
| shouldSync: (sessionId) => { | ||
| const state = eventStore.getSessionState(sessionId); |
There was a problem hiding this comment.
🔵 missing_tests: shouldSync returns false when getSessionState returns null (unknown session). This is correct behavior but the edge case is untested — the connection-registry tests mock shouldSync directly rather than exercising the actual null-state path wired in index.ts. [fixable]
|
|
||
| expect(mockFetch).not.toHaveBeenCalled(); | ||
| }); | ||
|
|
There was a problem hiding this comment.
🔵 missing_tests: No test for the scenario where doPost fails during flushPendingSends (reconnect flush). If a re-queued send fails again during flush, it should be re-queued once more. This interaction between flushPendingSends and enqueuePending is the most complex path and deserves coverage. [fixable]
…queuePending - Restrict POST retries to transient errors (404 + 5xx), skip 400/4xx to avoid re-queuing permanently invalid messages - Use enqueuePending() in send() to consolidate cap logic in one place - Add tests: 400 not retried, re-queued sends don't infinite loop in one flush cycle (atomic capture invariant) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 5 issue(s) (1 warning).
|
…t test - Add MAX_SEND_RETRIES (3) budget to pending sends — messages dropped after 3 failed attempts instead of cycling indefinitely - Include CLOSING state in shouldSync — agent still produces events during closeout (summary, commits), observers need delivery retries - Add backward compat test for missing getHeadSeq in EventStoreAdapter - Update comment to reflect CLOSING inclusion Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 2 issue(s).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 3 issue(s).
packages/harness/__tests__/connection-registry.test.ts
Well-structured fix addressing real failure modes (replay storms, silent message loss, ProcessTransport crashes). The retry logic is correctly bounded with atomic queue capture, the MAX_REPLAY_GAP cap is sound, and test coverage is thorough. Three minor suggestions, no bugs or critical issues.
- 🔵 missing_tests (L535): The 'caps replay depth' test is missing
registry.stopPeriodicSync()at the end, unlike the adjacent 'does not cap replay' test. Not a functional bug (beforeEach creates a fresh registry and afterEach restores real timers), but inconsistent cleanup within the same describe block.[fixable]
server/index.ts
Well-structured fix addressing real failure modes (replay storms, silent message loss, ProcessTransport crashes). The retry logic is correctly bounded with atomic queue capture, the MAX_REPLAY_GAP cap is sound, and test coverage is thorough. Three minor suggestions, no bugs or critical issues.
- 🔵 regressions (L108):
shouldSyncexcludes DETACHED sessions from periodic sync, whereas the oldisSessionActive(which checked theis_activeboolean column) would have included them — DETACHED sessions haveis_active = true. This is likely intentional to prevent replay storms, but it means a secondary watcher (e.g. dashboard) could miss events from a DETACHED session if the primary broadcast failed. Worth documenting the tradeoff.
packages/client/src/sse-connection.ts
Well-structured fix addressing real failure modes (replay storms, silent message loss, ProcessTransport crashes). The retry logic is correctly bounded with atomic queue capture, the MAX_REPLAY_GAP cap is sound, and test coverage is thorough. Three minor suggestions, no bugs or critical issues.
- 🔵 unsafe_assumptions (L333): HTTP 429 (Too Many Requests) is a transient error but won't trigger re-queue — only 404 and >=500 do. If the server ever rate-limits, user messages would be silently dropped. Consider
res.status === 404 || res.status === 429 || res.status >= 500.[fixable]
| // Should have fetched from 9800 (head - 200), not from 0 | ||
| const calls = (store.getEventsAfter as ReturnType<typeof vi.fn>).mock.calls; | ||
| expect(calls[0][1]).toBe(9800); // afterSeq should be head - MAX_REPLAY_GAP | ||
| }); |
There was a problem hiding this comment.
🔵 missing_tests: The 'caps replay depth' test is missing registry.stopPeriodicSync() at the end, unlike the adjacent 'does not cap replay' test. Not a functional bug (beforeEach creates a fresh registry and afterEach restores real timers), but inconsistent cleanup within the same describe block. [fixable]
| eventStore.getEventsAfter(sessionId, afterSeq, limit), | ||
| isSessionActive: (sessionId) => eventStore.getSession(sessionId)?.isActive ?? false, | ||
| shouldSync: (sessionId) => { | ||
| const state = eventStore.getSessionState(sessionId); |
There was a problem hiding this comment.
🔵 regressions: shouldSync excludes DETACHED sessions from periodic sync, whereas the old isSessionActive (which checked the is_active boolean column) would have included them — DETACHED sessions have is_active = true. This is likely intentional to prevent replay storms, but it means a secondary watcher (e.g. dashboard) could miss events from a DETACHED session if the primary broadcast failed. Worth documenting the tradeoff.
| }, | ||
| body: JSON.stringify(body), | ||
| }); | ||
| if (!res.ok && endpoint === 'send' && (res.status === 404 || res.status >= 500)) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: HTTP 429 (Too Many Requests) is a transient error but won't trigger re-queue — only 404 and >=500 do. If the server ever rate-limits, user messages would be silently dropped. Consider res.status === 404 || res.status === 429 || res.status >= 500. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 2 issue(s) (1 warning).
packages/client/src/__tests__/sse-connection.test.ts
Well-structured fix with strong test coverage for retry semantics and replay storm prevention; only missing a test for the MAX_SEND_RETRIES exhaustion path.
- 🟡 missing_tests: No test for MAX_SEND_RETRIES budget exhaustion. The enqueuePending method drops messages after 3 retries with a console.warn, but no test validates this path. A test should send a message, simulate 3 consecutive reconnect+flush cycles with persistent 404s, and assert the message is no longer in the pending queue on the 4th cycle. Without this, the constant could be changed or the guard removed without any test failing.
[fixable]
packages/harness/__tests__/connection-registry.test.ts
Well-structured fix with strong test coverage for retry semantics and replay storm prevention; only missing a test for the MAX_SEND_RETRIES exhaustion path.
- 🔵 style (L535): Missing
registry.stopPeriodicSync()call at the end of 'caps replay depth when cursor is far behind head' test. The sibling tests ('does not cap replay...', 'replays from cursor...') all call stopPeriodicSync. Not a correctness issue since afterEach calls vi.useRealTimers() which discards fake timers, but inconsistent with the pattern in adjacent tests.[fixable]
| // Should have fetched from 9800 (head - 200), not from 0 | ||
| const calls = (store.getEventsAfter as ReturnType<typeof vi.fn>).mock.calls; | ||
| expect(calls[0][1]).toBe(9800); // afterSeq should be head - MAX_REPLAY_GAP | ||
| }); |
There was a problem hiding this comment.
🔵 style: Missing registry.stopPeriodicSync() call at the end of 'caps replay depth when cursor is far behind head' test. The sibling tests ('does not cap replay...', 'replays from cursor...') all call stopPeriodicSync. Not a correctness issue since afterEach calls vi.useRealTimers() which discards fake timers, but inconsistent with the pattern in adjacent tests. [fixable]
Address fourth Centaur review: retry 429 (rate limited) as transient, flush re-queued sends sequentially to preserve message ordering. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 2 issue(s) (1 warning).
packages/client/src/__tests__/sse-connection.test.ts
Solid defensive PR — retry budgets, replay gap capping, and interrupt crash guards are correctly implemented with good test coverage. The only meaningful gap is a missing test for the MAX_SEND_RETRIES drop path.
- 🟡 missing_tests: No test exercises the MAX_SEND_RETRIES (3) drop behavior. The constant and the
if (retries >= MAX_SEND_RETRIES)guard inenqueuePendinghave zero coverage. A test should send a message, force it to fail across 3 reconnect cycles, and verify it is dropped on the 4th flush attempt (with the console.warn). Without this, a future change to the constant or the >= check could silently regress.[fixable]
packages/harness/__tests__/connection-registry.test.ts
Solid defensive PR — retry budgets, replay gap capping, and interrupt crash guards are correctly implemented with good test coverage. The only meaningful gap is a missing test for the MAX_SEND_RETRIES drop path.
- 🔵 style (L535): The 'caps replay depth when cursor is far behind head' test does not call
registry.stopPeriodicSync()at the end, unlike its sibling tests ('does not cap replay…' at line 556, 'replays from cursor…' at line 584). Inconsistent cleanup — harmless under fake timers but could leak intervals if test isolation changes.[fixable]
| // Should have fetched from 9800 (head - 200), not from 0 | ||
| const calls = (store.getEventsAfter as ReturnType<typeof vi.fn>).mock.calls; | ||
| expect(calls[0][1]).toBe(9800); // afterSeq should be head - MAX_REPLAY_GAP | ||
| }); |
There was a problem hiding this comment.
🔵 style: The 'caps replay depth when cursor is far behind head' test does not call registry.stopPeriodicSync() at the end, unlike its sibling tests ('does not cap replay…' at line 556, 'replays from cursor…' at line 584). Inconsistent cleanup — harmless under fake timers but could leak intervals if test isolation changes. [fixable]
Centaur ReviewFound 4 issue(s) (1 warning).
|
Address fifth Centaur review: await flushPendingSends before firing _open to prevent ordering race, extend transient retry to permission endpoint (lost permission_response blocks agent indefinitely). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 5 issue(s) (2 warning).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (2 warning).
packages/client/src/sse-connection.ts
Solid defensive PR. The core mechanisms (replay gap cap, retry budget, sequential flush, interrupt try/catch) are well-designed. Main concern: _connected = true before flush completes on the non-reconnect welcome path creates a message ordering race — the reconnect path already does this correctly, so applying the same pattern would close the gap.
- 🟡 bugs (L232):
_connectedis set totrueBEFOREflushPendingSends()completes (line 232, fire-and-forget viavoid). If external code callssend()before the flush finishes, the new message bypasses the pending queue viadoPost()directly (line 97-98), violating the ordering guarantee the comment on line 234 promises. The reconnect path indoReconnectPosthandles this correctly by awaiting flush before setting_connected(line 307-309). Fix: movethis._connected = trueinto the.then()callback, orawaitthe flush (matching thedoReconnectPostpattern).[fixable] - 🟡 bugs (L356): When
doPostis called directly fromsend()(connected path, line 98) and fails with a transient error (404/429/5xx), it re-queues viaenqueuePending. However,pendingSendsare only flushed during reconnect (flushPendingSendsis only called from welcome handlers). If the SSE stream stays healthy but the POST endpoint has a transient issue, the re-queued message sits indefinitely until the next reconnect — which may never happen. Consider drainingpendingSendson a timer or after a short delay, not only on reconnect.[fixable] - 🔵 unsafe_assumptions (L233): No guard against concurrent flushes. If
checkAndReconnect(true)is called while aflushPendingSends()from a prior welcome is still awaiting itsdoPostpromises, a secondflushPendingSends()can run concurrently with a different_connectionId. Both will issue POSTs in parallel, potentially duplicating sends. Low risk in practice (the browser event loop makes rapid double-welcome unlikely), but aflushInProgressflag or chained promise would make it safe by construction.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid defensive PR. The core mechanisms (replay gap cap, retry budget, sequential flush, interrupt try/catch) are well-designed. Main concern: _connected = true before flush completes on the non-reconnect welcome path creates a message ordering race — the reconnect path already does this correctly, so applying the same pattern would close the gap.
- 🔵 missing_tests: No test verifies that
send()called while a flush is in-progress preserves ordering relative to the queued messages. The_openordering test (line 1072) checks listener event order but not interleavedsend()calls. Add a test: queue messages, trigger welcome, callsend()before flush completes, verify the queued messages are delivered before the new one.[fixable]
packages/harness/src/connection-registry.ts
Solid defensive PR. The core mechanisms (replay gap cap, retry budget, sequential flush, interrupt try/catch) are well-designed. Main concern: _connected = true before flush completes on the non-reconnect welcome path creates a message ordering race — the reconnect path already does this correctly, so applying the same pattern would close the gap.
- 🔵 style (L55):
MAX_REPLAY_GAP = 200is a magic number. The comment says clients can lazy-load older events via/api/sessions/:id/events, but the client code doesn't implement automatic lazy-loading after a gap skip — the skipped events are silently dropped from the client's perspective. Consider documenting this trade-off more explicitly (e.g.,// IMPORTANT: skipped events are NOT auto-fetched by the client).[fixable] - 🔵 regressions (L44): The
isSessionActive→shouldSyncrename is a breaking change to theEventStoreAdapterinterface. It's optional and the only consumer isserver/index.ts(confirmed), so no runtime breakage. But any out-of-tree consumers (tests, forks) referencingisSessionActivewill silently stop filtering because the old property is simply ignored. This is acceptable since the interface is internal, but worth noting.
| @@ -226,8 +230,14 @@ export class SseConnection implements ChatConnection { | |||
| this.doReconnectPost(welcomeConnectionId, welcomeEs); | |||
| } else { | |||
| this._connected = true; | |||
There was a problem hiding this comment.
🟡 bugs: _connected is set to true BEFORE flushPendingSends() completes (line 232, fire-and-forget via void). If external code calls send() before the flush finishes, the new message bypasses the pending queue via doPost() directly (line 97-98), violating the ordering guarantee the comment on line 234 promises. The reconnect path in doReconnectPost handles this correctly by awaiting flush before setting _connected (line 307-309). Fix: move this._connected = true into the .then() callback, or await the flush (matching the doReconnectPost pattern). [fixable]
| // 5xx = server error). Re-queue so it flushes on the next successful | ||
| // reconnect. Non-transient errors (400 = invalid body) are not retried. | ||
| console.warn(`[SseConnection] POST /${endpoint} returned ${res.status}, re-queuing`); | ||
| this.enqueuePending(endpoint, body, retries + 1); |
There was a problem hiding this comment.
🟡 bugs: When doPost is called directly from send() (connected path, line 98) and fails with a transient error (404/429/5xx), it re-queues via enqueuePending. However, pendingSends are only flushed during reconnect (flushPendingSends is only called from welcome handlers). If the SSE stream stays healthy but the POST endpoint has a transient issue, the re-queued message sits indefinitely until the next reconnect — which may never happen. Consider draining pendingSends on a timer or after a short delay, not only on reconnect. [fixable]
| this._connected = true; | ||
| this.flushPendingSends(); | ||
| this.listener?.({ type: '_open' }); | ||
| if (this.pendingSends.length > 0) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: No guard against concurrent flushes. If checkAndReconnect(true) is called while a flushPendingSends() from a prior welcome is still awaiting its doPost promises, a second flushPendingSends() can run concurrently with a different _connectionId. Both will issue POSTs in parallel, potentially duplicating sends. Low risk in practice (the browser event loop makes rapid double-welcome unlikely), but a flushInProgress flag or chained promise would make it safe by construction. [fixable]
| const SYNC_BATCH_LIMIT = 50; | ||
| // Maximum gap between cursor and head before we skip ahead. | ||
| // Prevents replay storms after server restart or long disconnects. | ||
| // Client can lazy-load older events via the /api/sessions/:id/events API. |
There was a problem hiding this comment.
🔵 style: MAX_REPLAY_GAP = 200 is a magic number. The comment says clients can lazy-load older events via /api/sessions/:id/events, but the client code doesn't implement automatic lazy-loading after a gap skip — the skipped events are silently dropped from the client's perspective. Consider documenting this trade-off more explicitly (e.g., // IMPORTANT: skipped events are NOT auto-fetched by the client). [fixable]
| /** Check if a session should receive periodic sync delivery. | ||
| * When provided, periodic sync skips sessions where this returns false | ||
| * (e.g. ENDED, SUSPENDED, DETACHED). */ | ||
| shouldSync?(sessionId: string): boolean; |
There was a problem hiding this comment.
🔵 regressions: The isSessionActive → shouldSync rename is a breaking change to the EventStoreAdapter interface. It's optional and the only consumer is server/index.ts (confirmed), so no runtime breakage. But any out-of-tree consumers (tests, forks) referencing isSessionActive will silently stop filtering because the old property is simply ignored. This is acceptable since the interface is internal, but worth noting.
Address sixth Centaur review: add _flushing guard to prevent concurrent flushPendingSends when rapid reconnects fire overlapping welcomes, add test proving MAX_SEND_RETRIES=3 drops messages after exhaustion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 6 issue(s) (2 warning).
|
…sh race Address seventh Centaur review: concurrent flushPendingSends calls now wait for the in-progress flush instead of skipping, ensuring _open fires only after all queued sends complete. Added .catch() safety net and concurrent flush race test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 5 issue(s) (3 warning).
|
Address eighth Centaur review: flush loop re-checks pendingSends after each batch so failures re-queued mid-flush aren't stranded until next reconnect. Rename MAX_SEND_RETRIES → MAX_SEND_ATTEMPTS for clarity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
LGTM — no issues found.
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 4 issue(s) (2 warning).
packages/client/src/sse-connection.ts
The while loop in flushPendingSends defeats the cross-reconnect retry strategy by exhausting all MAX_SEND_ATTEMPTS in a single flush cycle; otherwise solid defensive changes with good test coverage for the happy paths.
- 🟡 bugs (L383): The
while (this.pendingSends.length > 0)loop influshPendingSendscauses re-queued failures to be retried immediately within the same flush cycle, burning all MAX_SEND_ATTEMPTS (3) in rapid succession rather than spreading retries across reconnects. WhendoPostfails (e.g. 404 from stale connectionId) and callsenqueuePending, the item is added tothis.pendingSends. The innerforloop finishes, thewhilecondition re-evaluates to true, and the item is retried immediately — still using the same stale_connectionId. For 404 errors this always fails; for 429 it retries without backoff. ThedoPostcomment on line 356 says 'Re-queue so it flushes on the next successful reconnect' but the while loop contradicts that intent. Fix: remove thewhileloop and do a single pass — re-queued items will naturally flush on the next reconnect with a fresh connectionId.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
The while loop in flushPendingSends defeats the cross-reconnect retry strategy by exhausting all MAX_SEND_ATTEMPTS in a single flush cycle; otherwise solid defensive changes with good test coverage for the happy paths.
- 🟡 missing_tests (L1023): The test 're-queued sends do not cause infinite retry within one flush cycle' does not exercise
flushPendingSendsat all — it callsconn.send()on an already-connected instance, which goes throughdoPostdirectly (fire-and-forget fromsend()), never triggering a flush. The test passes because only 1 direct send occurs, but the comment 'flushPendingSends atomically captures the queue before iterating' describes behavior that is never exercised. A test that queues a persistently-failing message offline, reconnects (triggering flush), and verifies it is NOT retried more than once within that single flush would actually cover the stated invariant.[fixable] - 🔵 style (L1050): The test 'drops message after MAX_SEND_ATTEMPTS (3) exhausted across reconnects' has comments claiming retries are spread across 3 separate reconnect cycles (conn-1, conn-2, conn-3). In reality, due to the while loop in
flushPendingSends, all 3 attempts are exhausted during the conn-2 flush — the conn-3 and conn-4 reconnects are no-ops. The test passes because its assertions happen to be correct regardless, but the step-by-step comments are misleading about the actual retry timing.[fixable]
server/index.ts
The while loop in flushPendingSends defeats the cross-reconnect retry strategy by exhausting all MAX_SEND_ATTEMPTS in a single flush cycle; otherwise solid defensive changes with good test coverage for the happy paths.
- 🔵 regressions (L107): The old
isSessionActivecheckedeventStore.getSession(sessionId)?.isActive, which returned true for DETACHED sessions (agent still running, client disconnected). The newshouldSyncexcludes DETACHED. During DETACHED state the agent is still producing events (per the state machine: DETACHED can transition to ACTIVE, CLOSING, or ENDED). If a second client connection (e.g. another browser tab) is watching a DETACHED session via the ConnectionRegistry, it will stop receiving periodic sync delivery. This may be intentional, but the behavioral change from the old code is worth verifying.[fixable]
| try { | ||
| // Loop: re-check after each batch in case doPost failures re-queued | ||
| // new items during the flush (e.g. concurrent welcome + partial failure). | ||
| while (this.pendingSends.length > 0) { |
There was a problem hiding this comment.
🟡 bugs: The while (this.pendingSends.length > 0) loop in flushPendingSends causes re-queued failures to be retried immediately within the same flush cycle, burning all MAX_SEND_ATTEMPTS (3) in rapid succession rather than spreading retries across reconnects. When doPost fails (e.g. 404 from stale connectionId) and calls enqueuePending, the item is added to this.pendingSends. The inner for loop finishes, the while condition re-evaluates to true, and the item is retried immediately — still using the same stale _connectionId. For 404 errors this always fails; for 429 it retries without backoff. The doPost comment on line 356 says 'Re-queue so it flushes on the next successful reconnect' but the while loop contradicts that intent. Fix: remove the while loop and do a single pass — re-queued items will naturally flush on the next reconnect with a fresh connectionId. [fixable]
| const warnSpy = vi.spyOn(console, 'warn').mockImplementation(() => {}); | ||
| let sendCallCount = 0; | ||
| const mockFetch = vi.fn().mockImplementation((url: string) => { | ||
| if (typeof url === 'string' && url.includes('/send')) { |
There was a problem hiding this comment.
🟡 missing_tests: The test 're-queued sends do not cause infinite retry within one flush cycle' does not exercise flushPendingSends at all — it calls conn.send() on an already-connected instance, which goes through doPost directly (fire-and-forget from send()), never triggering a flush. The test passes because only 1 direct send occurs, but the comment 'flushPendingSends atomically captures the queue before iterating' describes behavior that is never exercised. A test that queues a persistently-failing message offline, reconnects (triggering flush), and verifies it is NOT retried more than once within that single flush would actually cover the stated invariant. [fixable]
| if (typeof url === 'string' && url.includes('/send')) { | ||
| return Promise.resolve({ ok: false, status: 500 }); | ||
| } | ||
| return Promise.resolve({ ok: true }); |
There was a problem hiding this comment.
🔵 style: The test 'drops message after MAX_SEND_ATTEMPTS (3) exhausted across reconnects' has comments claiming retries are spread across 3 separate reconnect cycles (conn-1, conn-2, conn-3). In reality, due to the while loop in flushPendingSends, all 3 attempts are exhausted during the conn-2 flush — the conn-3 and conn-4 reconnects are no-ops. The test passes because its assertions happen to be correct regardless, but the step-by-step comments are misleading about the actual retry timing. [fixable]
| getEventsAfter: (sessionId, afterSeq, limit) => | ||
| eventStore.getEventsAfter(sessionId, afterSeq, limit), | ||
| isSessionActive: (sessionId) => eventStore.getSession(sessionId)?.isActive ?? false, | ||
| shouldSync: (sessionId) => { |
There was a problem hiding this comment.
🔵 regressions: The old isSessionActive checked eventStore.getSession(sessionId)?.isActive, which returned true for DETACHED sessions (agent still running, client disconnected). The new shouldSync excludes DETACHED. During DETACHED state the agent is still producing events (per the state machine: DETACHED can transition to ACTIVE, CLOSING, or ENDED). If a second client connection (e.g. another browser tab) is watching a DETACHED session via the ConnectionRegistry, it will stop receiving periodic sync delivery. This may be intentional, but the behavioral change from the old code is worth verifying. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 4 issue(s) (2 warning).
packages/client/src/sse-connection.ts
Solid defensive improvements (retry logic, replay gap cap, interrupt crash guard), but the while loop in flushPendingSends eagerly exhausts all retry attempts within a single flush cycle rather than spreading them across reconnects — undermining the retry design for 429/5xx errors. Two tests pass but don't validate the behavior their names describe.
- 🟡 bugs (L383): The
while (this.pendingSends.length > 0)loop influshPendingSendscauses failed-and-re-queued messages to be immediately retried within the same flush cycle, exhausting all MAX_SEND_ATTEMPTS in one go rather than spreading retries across reconnects. This is counterproductive for 429 (rate-limited) responses — immediate retry worsens rate limiting. For 500s it's also wasteful since the same server state likely persists. The retry mechanism would be more effective if the while loop were removed (just process one captured batch per flush), deferring re-queued items to the next reconnect.[fixable] - 🔵 style (L236): The
.catch(() => {}).then(...)chain onflushPendingSends()swallows all errors before emitting_open. This is fine defensively, but the empty catch with no logging makes flush failures invisible on the initial-connect path. Consider logging a warning in the catch, matching the error visibility of thedoReconnectPostpath which propagates flush errors to its own try/catch and callsscheduleReconnect().[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid defensive improvements (retry logic, replay gap cap, interrupt crash guard), but the while loop in flushPendingSends eagerly exhausts all retry attempts within a single flush cycle rather than spreading them across reconnects — undermining the retry design for 429/5xx errors. Two tests pass but don't validate the behavior their names describe.
- 🔵 missing_tests (L1019): The test 'does not cause infinite retry within one flush cycle' never exercises the flush path —
send()goes throughdoPost()directly (connection is alive) and the re-queued message sits inpendingSends. NoflushPendingSendswhile loop is invoked. The test proves that a direct doPost failure doesn't trigger a flush, but doesn't verify the while loop's behavior during an actual flush cycle.[fixable] - 🟡 missing_tests (L1044): Test 'drops message after MAX_SEND_ATTEMPTS (3) exhausted across reconnects' is misleading. Because
seqBySessionis empty (no session messages emitted), every welcome takes the non-reconnect path and callsflushPendingSends(). The while loop inside flush exhausts all remaining retries in conn-2's flush cycle alone (attempt 2 fails → re-queues with retries=2 → while loop retries → attempt 3 fails → retries=3 → dropped). Conn-3 and conn-4 are no-ops. The test passes but doesn't validate the 'across reconnects' behavior its name and comments describe.[fixable]
| try { | ||
| // Loop: re-check after each batch in case doPost failures re-queued | ||
| // new items during the flush (e.g. concurrent welcome + partial failure). | ||
| while (this.pendingSends.length > 0) { |
There was a problem hiding this comment.
🟡 bugs: The while (this.pendingSends.length > 0) loop in flushPendingSends causes failed-and-re-queued messages to be immediately retried within the same flush cycle, exhausting all MAX_SEND_ATTEMPTS in one go rather than spreading retries across reconnects. This is counterproductive for 429 (rate-limited) responses — immediate retry worsens rate limiting. For 500s it's also wasteful since the same server state likely persists. The retry mechanism would be more effective if the while loop were removed (just process one captured batch per flush), deferring re-queued items to the next reconnect. [fixable]
| this.listener?.({ type: '_open' }); | ||
| if (this.pendingSends.length > 0 || this._flushPromise) { | ||
| // Flush must complete before _open fires to preserve message ordering. | ||
| void this.flushPendingSends() |
There was a problem hiding this comment.
🔵 style: The .catch(() => {}).then(...) chain on flushPendingSends() swallows all errors before emitting _open. This is fine defensively, but the empty catch with no logging makes flush failures invisible on the initial-connect path. Consider logging a warning in the catch, matching the error visibility of the doReconnectPost path which propagates flush errors to its own try/catch and calls scheduleReconnect(). [fixable]
| warnSpy.mockRestore(); | ||
| }); | ||
|
|
||
| it('re-queued sends do not cause infinite retry within one flush cycle', async () => { |
There was a problem hiding this comment.
🔵 missing_tests: The test 'does not cause infinite retry within one flush cycle' never exercises the flush path — send() goes through doPost() directly (connection is alive) and the re-queued message sits in pendingSends. No flushPendingSends while loop is invoked. The test proves that a direct doPost failure doesn't trigger a flush, but doesn't verify the while loop's behavior during an actual flush cycle. [fixable]
| warnSpy.mockRestore(); | ||
| }); | ||
|
|
||
| it('drops message after MAX_SEND_ATTEMPTS (3) exhausted across reconnects', async () => { |
There was a problem hiding this comment.
🟡 missing_tests: Test 'drops message after MAX_SEND_ATTEMPTS (3) exhausted across reconnects' is misleading. Because seqBySession is empty (no session messages emitted), every welcome takes the non-reconnect path and calls flushPendingSends(). The while loop inside flush exhausts all remaining retries in conn-2's flush cycle alone (attempt 2 fails → re-queues with retries=2 → while loop retries → attempt 3 fails → retries=3 → dropped). Conn-3 and conn-4 are no-ops. The test passes but doesn't validate the 'across reconnects' behavior its name and comments describe. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 5 issue(s) (2 warning).
packages/client/src/sse-connection.ts
Solid defensive improvements for SSE delivery reliability. The main concern is a while-loop in flushPendingSends that burns through all retry attempts within a single flush cycle rather than distributing them across reconnects as comments and tests describe — particularly counterproductive for 429 rate limiting.
- 🟡 bugs (L383): The
while (this.pendingSends.length > 0)loop influshPendingSendswill exhaust all remaining retries within a single flush cycle, not across reconnects. WhendoPostfails at line 359 it callsenqueuePendingwhich pushes tothis.pendingSends(reset to[]at line 385). The while loop re-checks and immediately retries. With MAX_SEND_ATTEMPTS=3, a fresh message gets 3 rapid-fire attempts in one flush, then is dropped — contradicting thedoPostcomment at line 356 ("Re-queue so it flushes on the next successful reconnect"). For 429 rate-limiting this is counterproductive: the client burns through all retries instantly instead of backing off. Consider breaking out of the while loop after the inner for-loop to defer re-queued items to the next reconnect, or at minimum adjust the comment.[fixable] - 🔵 unsafe_assumptions (L354): 429 (rate limited) is treated the same as 404 and 5xx for retry purposes, but 429 typically includes a
Retry-Afterheader indicating how long to wait. Immediately re-queuing (and with the while loop, immediately retrying) defeats the rate limit signal. Consider either readingRetry-Afteror at least not re-queuing 429s for immediate retry within the same flush cycle.[fixable] - 🔵 style (L356): Comment says "Re-queue so it flushes on the next successful reconnect" but with the while loop in
flushPendingSends, re-queued items are retried immediately within the same flush cycle. Either the comment or the behavior should be updated to match the other.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid defensive improvements for SSE delivery reliability. The main concern is a while-loop in flushPendingSends that burns through all retry attempts within a single flush cycle rather than distributing them across reconnects as comments and tests describe — particularly counterproductive for 429 rate limiting.
- 🟡 missing_tests (L1019): The test 're-queued sends do not cause infinite retry within one flush cycle' sends via
conn.send()while connected (directdoPost, not the flush path). The re-queued message just sits inpendingSends— the while loop influshPendingSendsis never exercised. A test that queues a message while disconnected and then reconnects (triggeringflushPendingSends) would reveal the while-loop retry burn-through behavior. Similarly, the 'drops message after MAX_SEND_ATTEMPTS' test comments describe retries spread across reconnects 1–3, but all retries actually happen during reconnect 1's flush cycle — the test passes by accident because it only checks final state.[fixable]
packages/harness/src/connection-registry.ts
Solid defensive improvements for SSE delivery reliability. The main concern is a while-loop in flushPendingSends that burns through all retry attempts within a single flush cycle rather than distributing them across reconnects as comments and tests describe — particularly counterproductive for 429 rate limiting.
- 🔵 regressions (L41):
EventStoreAdapteris exported from@mitzo/harness(index.ts:16). RenamingisSessionActivetoshouldSyncis a breaking change to the public interface. The only implementation is inserver/index.ts(updated), and no stale references remain, so this is safe within the monorepo. If any external consumers exist, they would break. AddinggetHeadSeqis additive and non-breaking since it's optional.
| try { | ||
| // Loop: re-check after each batch in case doPost failures re-queued | ||
| // new items during the flush (e.g. concurrent welcome + partial failure). | ||
| while (this.pendingSends.length > 0) { |
There was a problem hiding this comment.
🟡 bugs: The while (this.pendingSends.length > 0) loop in flushPendingSends will exhaust all remaining retries within a single flush cycle, not across reconnects. When doPost fails at line 359 it calls enqueuePending which pushes to this.pendingSends (reset to [] at line 385). The while loop re-checks and immediately retries. With MAX_SEND_ATTEMPTS=3, a fresh message gets 3 rapid-fire attempts in one flush, then is dropped — contradicting the doPost comment at line 356 ("Re-queue so it flushes on the next successful reconnect"). For 429 rate-limiting this is counterproductive: the client burns through all retries instantly instead of backing off. Consider breaking out of the while loop after the inner for-loop to defer re-queued items to the next reconnect, or at minimum adjust the comment. [fixable]
| body: JSON.stringify(body), | ||
| }); | ||
| const retryable = endpoint === 'send' || endpoint === 'permission'; | ||
| if (!res.ok && retryable && (res.status === 404 || res.status === 429 || res.status >= 500)) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: 429 (rate limited) is treated the same as 404 and 5xx for retry purposes, but 429 typically includes a Retry-After header indicating how long to wait. Immediately re-queuing (and with the while loop, immediately retrying) defeats the rate limit signal. Consider either reading Retry-After or at least not re-queuing 429s for immediate retry within the same flush cycle. [fixable]
| const retryable = endpoint === 'send' || endpoint === 'permission'; | ||
| if (!res.ok && retryable && (res.status === 404 || res.status === 429 || res.status >= 500)) { | ||
| // Transient failure (404 = stale connectionId, 429 = rate limited, | ||
| // 5xx = server error). Re-queue so it flushes on the next successful |
There was a problem hiding this comment.
🔵 style: Comment says "Re-queue so it flushes on the next successful reconnect" but with the while loop in flushPendingSends, re-queued items are retried immediately within the same flush cycle. Either the comment or the behavior should be updated to match the other. [fixable]
| warnSpy.mockRestore(); | ||
| }); | ||
|
|
||
| it('re-queued sends do not cause infinite retry within one flush cycle', async () => { |
There was a problem hiding this comment.
🟡 missing_tests: The test 're-queued sends do not cause infinite retry within one flush cycle' sends via conn.send() while connected (direct doPost, not the flush path). The re-queued message just sits in pendingSends — the while loop in flushPendingSends is never exercised. A test that queues a message while disconnected and then reconnects (triggering flushPendingSends) would reveal the while-loop retry burn-through behavior. Similarly, the 'drops message after MAX_SEND_ATTEMPTS' test comments describe retries spread across reconnects 1–3, but all retries actually happen during reconnect 1's flush cycle — the test passes by accident because it only checks final state. [fixable]
| /** Optional: check if a session is still active. When provided, periodic sync | ||
| * skips ended sessions to avoid unnecessary EventStore queries. */ | ||
| isSessionActive?(sessionId: string): boolean; | ||
| /** Check if a session should receive periodic sync delivery. |
There was a problem hiding this comment.
🔵 regressions: EventStoreAdapter is exported from @mitzo/harness (index.ts:16). Renaming isSessionActive to shouldSync is a breaking change to the public interface. The only implementation is in server/index.ts (updated), and no stale references remain, so this is safe within the monorepo. If any external consumers exist, they would break. Adding getHeadSeq is additive and non-breaking since it's optional.
Summary
Fixes the SSE transport regression causing 8+ minute message delivery delays and silent message drops after the SSE migration (#381).
Root cause chain:
ProcessTransport.interrupt()throws unhandled → server crashes → loses all in-memory stateisSessionActiveonly checked a boolean, not session STATE → SUSPENDED/DETACHED sessions replayed indefinitelysendPOSTs (404 from stale connectionId) were silently dropped → user messages lostThree fixes:
Periodic sync replay storm prevention (
connection-registry.ts):isSessionActivewithshouldSync— checks session STATE (only ACTIVE/STARTING/CREATED sync)MAX_REPLAY_GAP(200 events) — skip ahead when cursor is far behind head instead of replaying 6K+ eventsgetHeadSeqto EventStore for gap detectionPOST retry for failed message sends (
sse-connection.ts):sendPOSTs that get HTTP errors or network errorsGuard ProcessTransport.interrupt() crash (
chat.ts):queryInstance.interrupt()in try/catch to prevent server process deathTest plan
🤖 Generated with Claude Code