feat(transport): stable client connectionId + POST failure resilience#401
Conversation
Client generates a UUID per tab, sends it via ?cid= on SSE GET. Server reuses it across reconnects, eliminating the ownership race in handleReconnect that caused the "send twice to wake agent" bug. - SseConnection: stable cid in URL, welcome validation, backward compat - ConnectionRegistry: reconnect() swaps transport preserving state, markInactive() with 60s TTL + onExpired callback for deferred cleanup - Server: accept ?cid=, validate format, connRegistry.reconnect() on existing cid, deferred close with onExpired → detachChat() - Frontend: sessionStorage-persisted connectionId per tab - Store: optimistic SET_RUNNING after send, _send_failed handler - doPost: re-queue failed sends (404/429/5xx) with _send_failed emission - Tests: 17 new tests for stable cid, reconnect, TTL, POST failures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 6 issue(s) (3 warning).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (1 critical) (3 warning).
server/index.ts
Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.
- 🔴 bugs (L118): onExpired callback is missing the
session.transport === transportownership check that the legacy SSE close handler has (line 417). If another tab/connection reattaches to the same session during the 60s grace period, onExpired will incorrectly detach it. The callback only receives (connectionId, watchedSessions) — it needs access to the original transport, or a connectionId-based ownership check, to avoid detaching sessions now driven by a different connection.[fixable] - 🟡 regressions (L438): The stable-cid SSE close path calls
connRegistry.markInactive(connectionId)but never callsregistry.removeObserver(transport). The legacy path (line 450) does call it. Stale transport references will accumulate in the SessionRegistry observer sets across reconnects — each reconnect creates a new SseTransport, adds it as an observer, but the old one is never removed. This leaks memory and causes dead-transport broadcast attempts.[fixable] - 🟡 unsafe_assumptions (L403): The CID validation regex
/^cid-[0-9a-f-]{36}$/assumescrypto.randomUUID()always produces lowercase hex. While the Web Crypto spec mandates lowercase, the regex silently falls back to a server-generated connectionId for any mismatch — which defeats the feature. Consider case-insensitive match or normalizing to lowercase.[fixable]
packages/client/src/sse-connection.ts
Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.
- 🟡 bugs (L348): doPost re-queues failed sends into pendingSends with no retry counter. If a message triggers a persistent 404 or 500 (e.g., malformed body), it will be re-queued and retried on every reconnect forever. Consider adding a maxRetries field to pending items and dropping messages that exceed it, or at minimum emitting
willRetry: falseafter N retries.[fixable] - 🔵 style (L350): The
_send_failedsynthetic event type uses a leading underscore convention shared with_openand_close, but unlike those, it carries structured fields (clientMsgId,willRetry). Consider defining a TypeScript interface for it in the protocol types to catch misuse at compile time, rather than relying on untypedRecord<string, unknown>matching.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.
- 🔵 missing_tests: No test for the retry-limit concern: a persistently failing send message being re-queued and retried across multiple reconnects. Adding a test that verifies behavior after N consecutive failures would validate the retry semantics.
[fixable]
| if (!found) continue; | ||
| if (registry.isSuspended(found.clientId)) continue; | ||
| const session = registry.get(found.clientId); | ||
| if (session && registry.isAttached(found.clientId)) { |
There was a problem hiding this comment.
🔴 bugs: onExpired callback is missing the session.transport === transport ownership check that the legacy SSE close handler has (line 417). If another tab/connection reattaches to the same session during the 60s grace period, onExpired will incorrectly detach it. The callback only receives (connectionId, watchedSessions) — it needs access to the original transport, or a connectionId-based ownership check, to avoid detaching sessions now driven by a different connection. [fixable]
|
|
||
| const conn = connRegistry.get(connectionId); | ||
| const watchedSessions = conn ? [...conn.watchedSessions] : []; | ||
| if (isStableCid) { |
There was a problem hiding this comment.
🟡 regressions: The stable-cid SSE close path calls connRegistry.markInactive(connectionId) but never calls registry.removeObserver(transport). The legacy path (line 450) does call it. Stale transport references will accumulate in the SessionRegistry observer sets across reconnects — each reconnect creates a new SseTransport, adds it as an observer, but the old one is never removed. This leaks memory and causes dead-transport broadcast attempts. [fixable]
| // ownership race. Client sends ?cid=<uuid>; server reuses it across | ||
| // SSE reconnects so the clientId (connectionId:sessionId) never changes. | ||
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = clientCid != null && /^cid-[0-9a-f-]{36}$/.test(clientCid); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: The CID validation regex /^cid-[0-9a-f-]{36}$/ assumes crypto.randomUUID() always produces lowercase hex. While the Web Crypto spec mandates lowercase, the regex silently falls back to a server-generated connectionId for any mismatch — which defeats the feature. Consider case-insensitive match or normalizing to lowercase. [fixable]
| }); | ||
| if (!res.ok && endpoint === 'send') { | ||
| // Transient failure — re-queue for retry on next reconnect | ||
| if (res.status === 404 || res.status === 429 || res.status >= 500) { |
There was a problem hiding this comment.
🟡 bugs: doPost re-queues failed sends into pendingSends with no retry counter. If a message triggers a persistent 404 or 500 (e.g., malformed body), it will be re-queued and retried on every reconnect forever. Consider adding a maxRetries field to pending items and dropping messages that exceed it, or at minimum emitting willRetry: false after N retries. [fixable]
| // Transient failure — re-queue for retry on next reconnect | ||
| if (res.status === 404 || res.status === 429 || res.status >= 500) { | ||
| this.pendingSends.push({ endpoint, body }); | ||
| this.listener?.({ |
There was a problem hiding this comment.
🔵 style: The _send_failed synthetic event type uses a leading underscore convention shared with _open and _close, but unlike those, it carries structured fields (clientMsgId, willRetry). Consider defining a TypeScript interface for it in the protocol types to catch misuse at compile time, rather than relying on untyped Record<string, unknown> matching. [fixable]
…ak, retry limit - onExpired: verify session still owned by expired connectionId before detaching (prevents cross-tab detach race) - Stable-cid close: call removeObserver(transport) to prevent stale transport accumulation in SessionRegistry observer sets - CID regex: case-insensitive match for crypto.randomUUID() compat - doPost: add retry counter (max 3), emit willRetry:false after exhaustion - Export SendFailedEvent interface for type-safe event handling - Test: verify message dropped after max retries with willRetry:false Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (2 warning).
packages/client/src/store.ts
Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.
- 🟡 bugs (L377): Optimistic
SET_RUNNINGis set whensend()returns true, but when_send_failedfires (lines 714-718), onlysendErroris set —runningis never reset tofalse. If the finalwillRetry: falseevent fires, the UI stays in a permanent "thinking" spinner. The_send_failedhandler should also dispatchSET_RUNNING: falsewhenwillRetryis false.[fixable]
server/index.ts
Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.
- 🟡 bugs (L413): The regex
/^cid-[0-9a-f-]{36}$/iallows malformed UUIDs (e.g.cid-------------------------------------— 36 dashes) because[0-9a-f-]{36}doesn't enforce UUID structure. A craftedcidquery param could pass validation while not being a real UUID. Impact is low since the cid is only used as a map key, but consider a stricter pattern like/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.[fixable] - 🔵 style (L117):
getOwnerConnection()is already exported fromws-handler-v2.tsand does this exact logic (clientId.slice(0, colonIdx)). TheonExpiredcallback reimplements it withsubstring(0, indexOf(':')). Import and reuse the existing function to avoid divergence.[fixable] - 🔵 missing_tests (L112): The
setOnExpiredcallback inindex.ts(lines 112-139) has no integration test. It contains ownership checks, suspended-session guards, and detach logic — all of which are only tested indirectly via theConnectionRegistryunit tests. A test covering the wiring (TTL expires → sessions actually detached) would catch regressions in the glue code.
packages/client/src/__tests__/sse-connection.test.ts
Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.
- 🔵 missing_tests: No test covers what happens when
doPostfails for a 4xx status OTHER than 404/429 (e.g. 400, 403, 422). These are silently swallowed — no_send_failedevent, no retry. This is likely intentional (client errors shouldn't retry), but a test documenting this behavior would prevent accidental regressions.[fixable]
packages/client/src/sse-connection.ts
Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.
- 🔵 unsafe_assumptions (L363): 404 is treated as retryable alongside 429 and 5xx. A 404 on the send endpoint typically means the session or route doesn't exist — retrying won't help. Consider treating 404 as non-retryable (emit
_send_failedwithwillRetry: false) to avoid wasting 3 retry cycles on a permanently missing resource.[fixable]
| if (!sent) { | ||
| set({ sendError: 'Not connected. Message will be sent when reconnected.' }); | ||
| } else { | ||
| // Optimistic: show "thinking" immediately. The server will confirm |
There was a problem hiding this comment.
🟡 bugs: Optimistic SET_RUNNING is set when send() returns true, but when _send_failed fires (lines 714-718), only sendError is set — running is never reset to false. If the final willRetry: false event fires, the UI stays in a permanent "thinking" spinner. The _send_failed handler should also dispatch SET_RUNNING: false when willRetry is false. [fixable]
| // ownership race. Client sends ?cid=<uuid>; server reuses it across | ||
| // SSE reconnects so the clientId (connectionId:sessionId) never changes. | ||
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = clientCid != null && /^cid-[0-9a-f-]{36}$/i.test(clientCid); |
There was a problem hiding this comment.
🟡 bugs: The regex /^cid-[0-9a-f-]{36}$/i allows malformed UUIDs (e.g. cid------------------------------------- — 36 dashes) because [0-9a-f-]{36} doesn't enforce UUID structure. A crafted cid query param could pass validation while not being a real UUID. Impact is low since the cid is only used as a map key, but consider a stricter pattern like /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i. [fixable]
| const found = registry.findBySessionId(sessionId); | ||
| if (!found) continue; | ||
| // Skip if another connection has taken over this session | ||
| const ownerCid = found.clientId.substring(0, found.clientId.indexOf(':')); |
There was a problem hiding this comment.
🔵 style: getOwnerConnection() is already exported from ws-handler-v2.ts and does this exact logic (clientId.slice(0, colonIdx)). The onExpired callback reimplements it with substring(0, indexOf(':')). Import and reuse the existing function to avoid divergence. [fixable]
| // When a stable-cid connection's inactive TTL expires (no reconnect within | ||
| // 60s), detach any sessions it was driving — same logic as the SSE close | ||
| // handler for legacy connections. | ||
| connRegistry.setOnExpired((connectionId, watchedSessions) => { |
There was a problem hiding this comment.
🔵 missing_tests: The setOnExpired callback in index.ts (lines 112-139) has no integration test. It contains ownership checks, suspended-session guards, and detach logic — all of which are only tested indirectly via the ConnectionRegistry unit tests. A test covering the wiring (TTL expires → sessions actually detached) would catch regressions in the glue code.
| body: JSON.stringify(body), | ||
| }); | ||
| if (!res.ok && endpoint === 'send') { | ||
| if (res.status === 404 || res.status === 429 || res.status >= 500) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: 404 is treated as retryable alongside 429 and 5xx. A 404 on the send endpoint typically means the session or route doesn't exist — retrying won't help. Consider treating 404 as non-retryable (emit _send_failed with willRetry: false) to avoid wasting 3 retry cycles on a permanently missing resource. [fixable]
…ndling, regex - store: clear SET_RUNNING when _send_failed fires with willRetry:false (prevents stuck "thinking" spinner on permanent failure) - sse-connection: treat 404 as permanent failure (no retry), separate from transient 429/5xx retries - server: strict UUID regex, import getOwnerConnection() from ws-handler-v2 instead of reimplementing - tests: 404 permanent failure, 4xx silently swallowed (400/403) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Centaur ReviewFound 7 issue(s) (3 warning).
|
Centaur ReviewFound 6 issue(s) (3 warning).
|
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 7 issue(s) (3 warning).
packages/client/src/store.ts
Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.
- 🟡 bugs (L717): When a retried send eventually succeeds (e.g.
willRetry: truefollowed by successfuldoPoston reconnect),sendErroris never cleared. The user sees "Message delivery delayed — retrying on reconnect..." indefinitely until they send another message (which clears it at line 345). Consider clearingsendErroron the_opensynthetic event emitted after successful reconnect.[fixable] - 🟡 bugs (L380): The optimistic
SET_RUNNINGfires onsend()returningtrue, butsend()returns true even when the message is just queued (not yet delivered). If the POST then fails withwillRetry: true, the spinner stays showing becauseSET_RUNNING: falseis only dispatched forwillRetry: false. This is probably intentional (server events will drive running state on success), but if all retries also fail, there's a window whererunningis true with no active agent — the user sees an eternal spinner until the finalwillRetry: falseevent arrives after 3 reconnect cycles. - 🔵 style (L715): Redundant type cast:
(msg as Record<string, unknown>).willRetry—msgis already typed asRecord<string, unknown>from the function parameter on line 704. Just usemsg.willRetry !== false.[fixable] - 🔵 missing_tests (L714): The store-level
_send_failedhandler and the optimisticSET_RUNNINGon successfulsend()are new behaviors without store-level unit tests. The SSE connection tests cover the transport layer well, but the Zustand store's reaction to these events (sendError state, running state transitions) is untested.[fixable]
server/index.ts
Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.
- 🟡 unsafe_assumptions (L416): The
cidregex/^cid-[0-9a-f]{8}-...$/ivalidates only the exact formatcid-{uuid}. If the client code or a future browser produces a non-standard UUID format (e.g., uppercase from a polyfill, or a non-v4 UUID), the cid silently falls back to a server-generated one, breaking the stable-cid reconnect flow with no error signal. The case-insensitive/iflag partially mitigates this, but consider logging when a non-nullcidquery param fails validation, so the mismatch is diagnosable.[fixable] - 🔵 regressions (L456): In the stable-cid close handler,
registry.removeObserver(transport)is called immediately but session detach is deferred viamarkInactive. During the 60s grace period, if the agent emits events, they won't reach this transport (removed as observer) but the connection entry still exists inconnRegistry. This is likely fine because the connection's transportisOpen()returns false, and the periodic sync in ConnectionRegistry filters byisOpen()— but it's worth noting the transport is stale until reconnect swaps it.
packages/client/src/__tests__/sse-connection.test.ts
Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.
- 🔵 missing_tests: No test for concurrent pending sends during flush. When
flushPendingSendsfires multipledoPostcalls without awaiting, and several fail, they each push back topendingSends. A test verifying that multiple concurrent failures during flush correctly accumulate in the retry queue (and don't exceed MAX_SEND_RETRIES per message) would increase confidence.[fixable]
| if (msg.type === '_send_failed') { | ||
| const willRetry = (msg as Record<string, unknown>).willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry |
There was a problem hiding this comment.
🟡 bugs: When a retried send eventually succeeds (e.g. willRetry: true followed by successful doPost on reconnect), sendError is never cleared. The user sees "Message delivery delayed — retrying on reconnect..." indefinitely until they send another message (which clears it at line 345). Consider clearing sendError on the _open synthetic event emitted after successful reconnect. [fixable]
| // Optimistic: show "thinking" immediately. The server will confirm | ||
| // via events, or the send-timeout safety net will reset on failure. | ||
| set((s) => ({ | ||
| messages: messagesReducer(s.messages, { type: 'SET_RUNNING', running: true }), |
There was a problem hiding this comment.
🟡 bugs: The optimistic SET_RUNNING fires on send() returning true, but send() returns true even when the message is just queued (not yet delivered). If the POST then fails with willRetry: true, the spinner stays showing because SET_RUNNING: false is only dispatched for willRetry: false. This is probably intentional (server events will drive running state on success), but if all retries also fail, there's a window where running is true with no active agent — the user sees an eternal spinner until the final willRetry: false event arrives after 3 reconnect cycles.
| } | ||
|
|
||
| if (msg.type === '_send_failed') { | ||
| const willRetry = (msg as Record<string, unknown>).willRetry !== false; |
There was a problem hiding this comment.
🔵 style: Redundant type cast: (msg as Record<string, unknown>).willRetry — msg is already typed as Record<string, unknown> from the function parameter on line 704. Just use msg.willRetry !== false. [fixable]
| return; | ||
| } | ||
|
|
||
| if (msg.type === '_send_failed') { |
There was a problem hiding this comment.
🔵 missing_tests: The store-level _send_failed handler and the optimistic SET_RUNNING on successful send() are new behaviors without store-level unit tests. The SSE connection tests cover the transport layer well, but the Zustand store's reaction to these events (sendError state, running state transitions) is untested. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: The cid regex /^cid-[0-9a-f]{8}-...$/i validates only the exact format cid-{uuid}. If the client code or a future browser produces a non-standard UUID format (e.g., uppercase from a polyfill, or a non-v4 UUID), the cid silently falls back to a server-generated one, breaking the stable-cid reconnect flow with no error signal. The case-insensitive /i flag partially mitigates this, but consider logging when a non-null cid query param fails validation, so the mismatch is diagnosable. [fixable]
| // connectionId — markInactive starts a 60s TTL. If the TTL expires, | ||
| // the onExpired callback triggers session detach. | ||
| connRegistry.markInactive(connectionId); | ||
| registry.removeObserver(transport); |
There was a problem hiding this comment.
🔵 regressions: In the stable-cid close handler, registry.removeObserver(transport) is called immediately but session detach is deferred via markInactive. During the 60s grace period, if the agent emits events, they won't reach this transport (removed as observer) but the connection entry still exists in connRegistry. This is likely fine because the connection's transport isOpen() returns false, and the periodic sync in ConnectionRegistry filters by isOpen() — but it's worth noting the transport is stale until reconnect swaps it.
… logging - Clear sendError on _open (reconnect success) to prevent stale error banners - Remove redundant type cast on msg.willRetry in _send_failed handler - Log warning when client sends invalid cid format on SSE connect Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (2 warning).
packages/client/src/sse-connection.ts
Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.
- 🟡 bugs (L362): When a
sendPOST returns 400 (bad request), it falls through silently — no_send_failedemitted, no retry. Combined with the new optimisticSET_RUNNING(true)instore.ts:380, this leaves the UI stuck in 'thinking' state indefinitely. The comment references a 'send-timeout safety net' but no such timer exists for this code path. Either emit_send_failedwithwillRetry: falsefor all non-2xx send responses that don't already have handling, or treat 400 the same as 404 (permanent failure).[fixable] - 🔵 style (L43):
SendFailedEvent.clientMsgIdis typed asunknownbut is always astring(generated by the store asuser-${Date.now()}-...). Usingstringwould improve type safety in the store handler atstore.ts:722where it's consumed without narrowing.[fixable] - 🔵 regressions (L113): The
send()guard changed fromthis._connected && this._connectionIdto justthis._connected. Previously,_connectionIdbeing null was an extra safety check that send wouldn't fire before the welcome handshake. Now_connectionIdis always set (from constructor), so the guard change is semantically correct — butdoPostwill now include the client-generated cid inX-Connection-IDeven if the server hasn't acknowledged it yet. This is safe because the_connectedflag is still gated on welcome receipt, but worth noting.
frontend/src/client-store.ts
Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.
- 🟡 unsafe_assumptions (L31):
sessionStorage.getItem()andsetItem()can throwSecurityErrorin Safari private browsing, some Firefox configurations, and restrictive WebViews. SincegetOrCreateConnectionId()is called at module scope (during SSE config construction), an exception here would break the entire client import. Wrap in try-catch with a fallback to an in-memory UUID.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.
- 🔵 missing_tests: No test verifies behavior when the same message is re-queued and then the connection succeeds — specifically, that
retriesis correctly incremented through multiple flush cycles and thatclientMsgIdis preserved across retries (not duplicated in the queue).[fixable]
server/index.ts
Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.
- 🔵 unsafe_assumptions (L416): The cid validation regex
/^cid-[0-9a-f]{8}-...$/icorrectly matches UUID v4, butcrypto.randomUUID()is spec'd to return lowercase. Theiflag is fine for robustness, but note that if a non-UUID-shaped cid is rejected, the server silently falls back to a server-generatedconn-id with no feedback to the client. The client's_connectionIdwill be its UUID, while the server uses a different id — the welcome echo will trigger the 'old server' fallback path in the client, which works but means the 'stable cid' feature silently degrades. A log.warn exists but the client has no way to know its cid was rejected.
| }, | ||
| body: JSON.stringify(body), | ||
| }); | ||
| if (!res.ok && endpoint === 'send') { |
There was a problem hiding this comment.
🟡 bugs: When a send POST returns 400 (bad request), it falls through silently — no _send_failed emitted, no retry. Combined with the new optimistic SET_RUNNING(true) in store.ts:380, this leaves the UI stuck in 'thinking' state indefinitely. The comment references a 'send-timeout safety net' but no such timer exists for this code path. Either emit _send_failed with willRetry: false for all non-2xx send responses that don't already have handling, or treat 400 the same as 404 (permanent failure). [fixable]
|
|
||
| /** Synthetic event emitted when a POST to the send endpoint fails. */ | ||
| export interface SendFailedEvent { | ||
| type: '_send_failed'; |
There was a problem hiding this comment.
🔵 style: SendFailedEvent.clientMsgId is typed as unknown but is always a string (generated by the store as user-${Date.now()}-...). Using string would improve type safety in the store handler at store.ts:722 where it's consumed without narrowing. [fixable]
| if (!endpoint) return false; | ||
|
|
||
| if (this._connected && this._connectionId) { | ||
| if (this._connected) { |
There was a problem hiding this comment.
🔵 regressions: The send() guard changed from this._connected && this._connectionId to just this._connected. Previously, _connectionId being null was an extra safety check that send wouldn't fire before the welcome handshake. Now _connectionId is always set (from constructor), so the guard change is semantically correct — but doPost will now include the client-generated cid in X-Connection-ID even if the server hasn't acknowledged it yet. This is safe because the _connected flag is still gated on welcome receipt, but worth noting.
| /** Stable connectionId per browser tab — survives SSE reconnects. */ | ||
| function getOrCreateConnectionId(): string { | ||
| const KEY = 'mitzo:connectionId'; | ||
| let cid = sessionStorage.getItem(KEY); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: sessionStorage.getItem() and setItem() can throw SecurityError in Safari private browsing, some Firefox configurations, and restrictive WebViews. Since getOrCreateConnectionId() is called at module scope (during SSE config construction), an exception here would break the entire client import. Wrap in try-catch with a fallback to an in-memory UUID. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🔵 unsafe_assumptions: The cid validation regex /^cid-[0-9a-f]{8}-...$/i correctly matches UUID v4, but crypto.randomUUID() is spec'd to return lowercase. The i flag is fine for robustness, but note that if a non-UUID-shaped cid is rejected, the server silently falls back to a server-generated conn- id with no feedback to the client. The client's _connectionId will be its UUID, while the server uses a different id — the welcome echo will trigger the 'old server' fallback path in the client, which works but means the 'stable cid' feature silently degrades. A log.warn exists but the client has no way to know its cid was rejected.
Centaur ReviewFound 4 issue(s) (1 critical) (2 warning).
|
… timer - Fix critical race: stale close handler removing new SSE stream. Added removeIfCurrent() to SessionSseRegistry — only removes if the response being closed is still the current entry for that connectionId - Add timer-based retry flush (3s) for failed POSTs when SSE stays healthy, so messages don't sit in the queue waiting for a reconnect that never comes - Update ChatConnection jsdoc — connectionId is now client-generated - Clean up retry timer on disconnect Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 7 issue(s) (3 warning).
packages/client/src/store.ts
Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.
- 🔵 style (L725): The sendError message says "retrying on reconnect..." but the actual retry mechanism (scheduleRetryFlush) fires after a 3s timer even without a reconnect. Consider "Message delivery delayed — retrying..." to avoid misleading the user when the SSE stream stays connected.
[fixable]
server/index.ts
Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.
- 🟡 bugs (L464): In the stable-cid close handler,
registry.removeObserver(transport)removes the OLD transport from SessionRegistry observer lists immediately on stream close. If the same client had been added as an observer (e.g., via WS subscribe flow) and reconnects with a new transport, the new transport is never re-added as an observer — only the ConnectionRegistry watched-session list survives viareconnect(). This is likely a non-issue today since SSE connections use the driver/ConnectionRegistry event store delivery path rather than SessionRegistry observers, but it creates a subtle inconsistency if SSE is ever used as an observer.[fixable] - 🟡 bugs (L416): The cid regex
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/ivalidates strict UUID v4 format, butcrypto.randomUUID()can produce any UUID version (v4 today, but the spec doesn't guarantee it). If a future browser/runtime returns a different UUID version with uppercase or variant bits outside [0-9a-f], the regex would reject it. The/iflag handles case, but the format assumption is fragile. Consider a more permissive pattern like/^cid-[0-9a-f-]{36}$/ito allow any UUID format.[fixable]
packages/client/src/sse-connection.ts
Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.
- 🟡 unsafe_assumptions (L373): When doPost re-queues a failed send for retry (line 373, 394), it pushes to pendingSends without checking MAX_PENDING_SENDS. The initial send() path (line 126) enforces this cap, but the retry path bypasses it. In practice the MAX_SEND_RETRIES=3 limit bounds this to a small number of re-queued items per message, but a burst of many concurrent send failures could temporarily exceed the 100-item cap.
[fixable] - 🔵 style (L30): The connectionId JSDoc comment is 4 lines long — consider trimming to a single line per the project's "default to no comments, one short line max" convention.
[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.
- 🔵 missing_tests: No test covers the case where the stable-cid SSE close handler fires AFTER a new SSE request with the same cid has connected (the race that removeIfCurrent is designed to prevent). This is the core scenario the stable-cid feature solves — a test confirming the old close handler does NOT remove the new stream would strengthen coverage.
[fixable]
packages/harness/__tests__/connection-registry.test.ts
Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.
- 🔵 missing_tests: No test verifies that the onExpired callback receives the correct watchedSessions after a partial unwatch. E.g., connection watches sessions A and B, unwatches B, then TTL expires — callback should get only {A}. The current tests only verify the happy path with sessions still watched at expiry time.
[fixable]
| if (msg.type === '_send_failed') { | ||
| const willRetry = msg.willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry |
There was a problem hiding this comment.
🔵 style: The sendError message says "retrying on reconnect..." but the actual retry mechanism (scheduleRetryFlush) fires after a 3s timer even without a reconnect. Consider "Message delivery delayed — retrying..." to avoid misleading the user when the SSE stream stays connected. [fixable]
| // connectionId — markInactive starts a 60s TTL. If the TTL expires, | ||
| // the onExpired callback triggers session detach. | ||
| connRegistry.markInactive(connectionId); | ||
| registry.removeObserver(transport); |
There was a problem hiding this comment.
🟡 bugs: In the stable-cid close handler, registry.removeObserver(transport) removes the OLD transport from SessionRegistry observer lists immediately on stream close. If the same client had been added as an observer (e.g., via WS subscribe flow) and reconnects with a new transport, the new transport is never re-added as an observer — only the ConnectionRegistry watched-session list survives via reconnect(). This is likely a non-issue today since SSE connections use the driver/ConnectionRegistry event store delivery path rather than SessionRegistry observers, but it creates a subtle inconsistency if SSE is ever used as an observer. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🟡 bugs: The cid regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates strict UUID v4 format, but crypto.randomUUID() can produce any UUID version (v4 today, but the spec doesn't guarantee it). If a future browser/runtime returns a different UUID version with uppercase or variant bits outside [0-9a-f], the regex would reject it. The /i flag handles case, but the format assumption is fragile. Consider a more permissive pattern like /^cid-[0-9a-f-]{36}$/i to allow any UUID format. [fixable]
| // Transient failure — retry up to MAX_SEND_RETRIES | ||
| const willRetry = retries < MAX_SEND_RETRIES; | ||
| if (willRetry) { | ||
| this.pendingSends.push({ endpoint, body, retries: retries + 1 }); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: When doPost re-queues a failed send for retry (line 373, 394), it pushes to pendingSends without checking MAX_PENDING_SENDS. The initial send() path (line 126) enforces this cap, but the retry path bypasses it. In practice the MAX_SEND_RETRIES=3 limit bounds this to a small number of re-queued items per message, but a burst of many concurrent send failures could temporarily exceed the 100-item cap. [fixable]
| /** URL for the sendBeacon suspend fallback. */ | ||
| suspendUrl?: string; | ||
| /** | ||
| * Client-generated stable connectionId. Survives SSE reconnects so the |
There was a problem hiding this comment.
🔵 style: The connectionId JSDoc comment is 4 lines long — consider trimming to a single line per the project's "default to no comments, one short line max" convention. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 5 issue(s) (1 warning).
packages/client/src/store.ts
Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.
- 🟡 bugs (L725): sendError is never cleared after a successful retry. When doPost succeeds on a retried send (via scheduleRetryFlush), no event is emitted — sendError remains set to 'Message delivery delayed' even as the server responds and the agent's streaming reply appears. The user sees a contradictory state: an error banner alongside an active response. sendError is only cleared on _open (reconnect) or USER_SEND (next message), but if the SSE stream stays healthy and the user is waiting for a reply, neither happens. Consider emitting a _send_succeeded synthetic event on a successful retried POST, or clearing sendError on MESSAGE_START.
[fixable] - 🔵 style (L726): The sendError message 'Message delivery delayed — retrying on reconnect...' is inaccurate. Since the scheduleRetryFlush mechanism was added, retries happen proactively after 3s (RETRY_FLUSH_DELAY_MS) while the SSE stream stays connected — no reconnect is needed. Consider 'Message delivery delayed — retrying shortly...' to match the actual behavior.
[fixable]
server/index.ts
Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.
- 🔵 unsafe_assumptions (L463): When a stable-cid reconnects, the old close handler fires asynchronously AFTER connRegistry.reconnect() has already swapped the transport. It calls markInactive(connectionId) which starts a 60s TTL timer on the already-reconnected connection. The timer is harmless (the isOpen() guard at connection-registry.ts:117 sees the new transport is open and bails), but it creates unnecessary timer churn. Consider guarding markInactive with a transport identity check: only call it if the transport in the connection registry is still the old one.
[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.
- 🔵 missing_tests (L1050): The 'drops message after max retries' test comment says '3 reconnect cycles — each flush retries' but the retries actually exhaust via scheduleRetryFlush during the initial vi.runAllTimersAsync() call (chained 3s timers). The reconnect cycles in the for-loop are no-ops for this test case because pendingSends is already empty by then. The test passes correctly, but the comments are misleading about the retry mechanism.
[fixable] - 🔵 missing_tests (L1164): The 'retries failed send after delay even without reconnect' test verifies the retry fires but doesn't check that _send_failed is NOT emitted on the successful retry (i.e., no spurious willRetry:false event). Adding an assertion like
expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ willRetry: false }))would strengthen coverage.[fixable]
| if (msg.type === '_send_failed') { | ||
| const willRetry = msg.willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry |
There was a problem hiding this comment.
🟡 bugs: sendError is never cleared after a successful retry. When doPost succeeds on a retried send (via scheduleRetryFlush), no event is emitted — sendError remains set to 'Message delivery delayed' even as the server responds and the agent's streaming reply appears. The user sees a contradictory state: an error banner alongside an active response. sendError is only cleared on _open (reconnect) or USER_SEND (next message), but if the SSE stream stays healthy and the user is waiting for a reply, neither happens. Consider emitting a _send_succeeded synthetic event on a successful retried POST, or clearing sendError on MESSAGE_START. [fixable]
| const willRetry = msg.willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry | ||
| ? 'Message delivery delayed — retrying on reconnect...' |
There was a problem hiding this comment.
🔵 style: The sendError message 'Message delivery delayed — retrying on reconnect...' is inaccurate. Since the scheduleRetryFlush mechanism was added, retries happen proactively after 3s (RETRY_FLUSH_DELAY_MS) while the SSE stream stays connected — no reconnect is needed. Consider 'Message delivery delayed — retrying shortly...' to match the actual behavior. [fixable]
| log.info('SSE closed for stable-cid connection, deferring cleanup', { connectionId }); | ||
| } else { | ||
| // Legacy server-generated connectionId: immediate cleanup | ||
| const conn = connRegistry.get(connectionId); |
There was a problem hiding this comment.
🔵 unsafe_assumptions: When a stable-cid reconnects, the old close handler fires asynchronously AFTER connRegistry.reconnect() has already swapped the transport. It calls markInactive(connectionId) which starts a 60s TTL timer on the already-reconnected connection. The timer is harmless (the isOpen() guard at connection-registry.ts:117 sees the new transport is open and bails), but it creates unnecessary timer churn. Consider guarding markInactive with a transport identity check: only call it if the transport in the connection registry is still the old one. [fixable]
| expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ type: '_send_failed' })); | ||
| }); | ||
|
|
||
| it('drops message after max retries and emits willRetry: false', async () => { |
There was a problem hiding this comment.
🔵 missing_tests: The 'drops message after max retries' test comment says '3 reconnect cycles — each flush retries' but the retries actually exhaust via scheduleRetryFlush during the initial vi.runAllTimersAsync() call (chained 3s timers). The reconnect cycles in the for-loop are no-ops for this test case because pendingSends is already empty by then. The test passes correctly, but the comments are misleading about the retry mechanism. [fixable]
| ); | ||
| }); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
🔵 missing_tests: The 'retries failed send after delay even without reconnect' test verifies the retry fires but doesn't check that _send_failed is NOT emitted on the successful retry (i.e., no spurious willRetry:false event). Adding an assertion like expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ willRetry: false })) would strengthen coverage. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (3 warning).
server/index.ts
Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.
- 🟡 bugs (L459): When a stable-CID SSE reconnects quickly (new request arrives before old
closefires), the old close handler callsconnRegistry.markInactive(connectionId). Butreconnect()already ran and cleared the inactive timer. ThemarkInactivecall here starts a new 60s TTL timer on the freshly-reconnected connection. If the new SSE stream stays healthy for 60s, the TTL callback checks!transport.isOpen()— which is true (new transport), so it's a no-op. But if the new transport closes within 60s and another reconnect restores it, the stale timer from the first close handler could fire at the wrong time. The fix: check whether the closingreswas actually removed byremoveIfCurrent(it returns void — no way to tell) before callingmarkInactive.[fixable] - 🟡 unsafe_assumptions (L416): The CID format validation regex on the SSE endpoint (
/^cid-[0-9a-f]{8}-...$/i) is not applied to theX-Connection-IDheader on POST endpoints (extracted inchat-rest-handler.tsviagetConnectionId()). A client could POST with a malformed header value. Since theclientIdcomposite key is${connectionId}:${sessionId}, this is low-risk (the POST just won't match any SSE connection), but the asymmetric validation is surprising.[fixable]
server/session-sse-registry.ts
Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.
- 🟡 missing_tests (L52):
removeIfCurrent()is a critical guard that prevents stale close handlers from removing the active SSE stream. It has no unit tests. The correctness of the entire stable-CID reconnect flow depends on this method working correctly — a regression here would silently break event delivery.[fixable]
packages/client/src/sse-connection.ts
Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.
- 🔵 style (L43):
SendFailedEventis exported but never imported anywhere — the store handler uses untypedmsg.type === '_send_failed'checks. Either import and use the type in the store listener, or don't export it to avoid suggesting a type-safe contract that isn't enforced.[fixable] - 🔵 bugs (L373): When
doPostfails for asendendpoint with 429/5xx, the message is pushed back topendingSendswithretries + 1. ButpendingSendshas no deduplication — if the same message fails duringflushPendingSendsand during a concurrentscheduleRetryFlushtimer, it could be queued twice. This is mitigated byflushPendingSendsswapping the array (line 412-413), but thedoPost→push→scheduleRetryFlush→flushPendingSends→doPostcycle means a message could accumulate duplicate entries if two retry paths converge. The MAX_SEND_RETRIES cap limits the blast radius.[fixable]
packages/client/src/store.ts
Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.
- 🔵 missing_tests (L708): The store's handling of
_open(clearing sendError) and_send_failed(setting sendError, conditionally resetting running state) are new behaviors with no store-level tests. The SSE connection tests cover emission of these events, but the store reducer integration — especially thewillRetry: falsepath that resetsSET_RUNNING— is untested.[fixable]
| // response that replaced it, breaking event delivery. | ||
| chatSseRegistry.removeIfCurrent(connectionId, res); | ||
|
|
||
| if (isStableCid) { |
There was a problem hiding this comment.
🟡 bugs: When a stable-CID SSE reconnects quickly (new request arrives before old close fires), the old close handler calls connRegistry.markInactive(connectionId). But reconnect() already ran and cleared the inactive timer. The markInactive call here starts a new 60s TTL timer on the freshly-reconnected connection. If the new SSE stream stays healthy for 60s, the TTL callback checks !transport.isOpen() — which is true (new transport), so it's a no-op. But if the new transport closes within 60s and another reconnect restores it, the stale timer from the first close handler could fire at the wrong time. The fix: check whether the closing res was actually removed by removeIfCurrent (it returns void — no way to tell) before calling markInactive. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: The CID format validation regex on the SSE endpoint (/^cid-[0-9a-f]{8}-...$/i) is not applied to the X-Connection-ID header on POST endpoints (extracted in chat-rest-handler.ts via getConnectionId()). A client could POST with a malformed header value. Since the clientId composite key is ${connectionId}:${sessionId}, this is low-risk (the POST just won't match any SSE connection), but the asymmetric validation is surprising. [fixable]
| * the given response. Prevents a stale close handler from removing a | ||
| * newer replacement stream after chatSseRegistry.add() swapped it. | ||
| */ | ||
| removeIfCurrent(connectionId: string, res: Response): void { |
There was a problem hiding this comment.
🟡 missing_tests: removeIfCurrent() is a critical guard that prevents stale close handlers from removing the active SSE stream. It has no unit tests. The correctness of the entire stable-CID reconnect flow depends on this method working correctly — a regression here would silently break event delivery. [fixable]
| const RETRY_FLUSH_DELAY_MS = 3000; | ||
|
|
||
| /** Synthetic event emitted when a POST to the send endpoint fails. */ | ||
| export interface SendFailedEvent { |
There was a problem hiding this comment.
🔵 style: SendFailedEvent is exported but never imported anywhere — the store handler uses untyped msg.type === '_send_failed' checks. Either import and use the type in the store listener, or don't export it to avoid suggesting a type-safe contract that isn't enforced. [fixable]
| // Transient failure — retry up to MAX_SEND_RETRIES | ||
| const willRetry = retries < MAX_SEND_RETRIES; | ||
| if (willRetry) { | ||
| this.pendingSends.push({ endpoint, body, retries: retries + 1 }); |
There was a problem hiding this comment.
🔵 bugs: When doPost fails for a send endpoint with 429/5xx, the message is pushed back to pendingSends with retries + 1. But pendingSends has no deduplication — if the same message fails during flushPendingSends and during a concurrent scheduleRetryFlush timer, it could be queued twice. This is mitigated by flushPendingSends swapping the array (line 412-413), but the doPost→push→scheduleRetryFlush→flushPendingSends→doPost cycle means a message could accumulate duplicate entries if two retry paths converge. The MAX_SEND_RETRIES cap limits the blast radius. [fixable]
| // Foreground recovery: when the page becomes visible again (iOS may have | ||
| // evicted it from memory, losing in-memory state), re-fetch messages from | ||
| // the REST API if we have an active session but no messages in the store. | ||
| if (msg.type === '_open') { |
There was a problem hiding this comment.
🔵 missing_tests: The store's handling of _open (clearing sendError) and _send_failed (setting sendError, conditionally resetting running state) are new behaviors with no store-level tests. The SSE connection tests cover emission of these events, but the store reducer integration — especially the willRetry: false path that resets SET_RUNNING — is untested. [fixable]
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (3 warning).
server/index.ts
Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.
- 🟡 bugs (L463): Stale close handler creates phantom inactive timer after rapid reconnect. When a new SSE request arrives with the same stable cid,
chatSseRegistry.add()closes the old response.connRegistry.reconnect()runs and clears any existing inactive timer. But when the old response's close handler fires asynchronously AFTERreconnect(), it callsmarkInactive(connectionId)— starting a new 60s timer. The timer is a no-op (transport.isOpen() check saves it), but it's unnecessary work and produces a misleading 'inactive connection TTL expired' log if the connection closes again within the grace window. Consider checking if the transport is already open insidemarkInactive(), or having the close handler skipmarkInactive()whenconnRegistry.get(connectionId)?.transportis a different (newer) object than the closing one.[fixable] - 🔵 missing_tests (L416): The server-side cid format validation regex (
/^cid-[0-9a-f]{8}-...$/i) and the fallback-to-server-generated-id logic have no test coverage. An integration or unit test confirming that invalid cid values are rejected (fallback toconn-*) and valid ones are accepted would prevent regression if the format evolves.[fixable]
packages/client/src/store.ts
Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.
- 🟡 unsafe_assumptions (L723):
willRetrydefaults totrueviamsg.willRetry !== false, meaning any truthy or undefined value is treated as retryable. If a future change emits_send_failedwithoutwillRetry(e.g., from a different code path), the store will show 'retrying...' and keep the running spinner indefinitely. Prefer explicitmsg.willRetry === trueto fail-safe toward the non-retrying case.[fixable]
server/session-sse-registry.ts
Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.
- 🟡 missing_tests (L52):
removeIfCurrent()is a new method critical to preventing the stale-close-handler race, but has no unit test. It's only exercised implicitly through the SSE endpoint integration path. A dedicated test verifying it no-ops whenresdoesn't match (and removes when it does) would guard this invariant.[fixable]
packages/client/src/sse-connection.ts
Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.
- 🔵 style (L41):
SendFailedEventis exported but never imported or used in the type system — the listener is called with untyped object literals matching the same shape. Either use the type indoPost(e.g.,const evt: SendFailedEvent = { ... }; this.listener?.(evt)) or remove the export to avoid implying a typed contract that isn't enforced.[fixable]
packages/client/src/chat-connection.ts
Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.
- 🔵 style (L20): The
getConnectionId()return type is stillstring | null, but after the constructor change,_connectionIdis always initialized to a string — it can never be null. The docstring was updated to say 'Available immediately after construction' but the type doesn't reflect this. Consider updating the interface return type tostring(or document whynullis retained for other implementations).[fixable]
| // Stable cid: defer cleanup. The client will reconnect with the same | ||
| // connectionId — markInactive starts a 60s TTL. If the TTL expires, | ||
| // the onExpired callback triggers session detach. | ||
| connRegistry.markInactive(connectionId); |
There was a problem hiding this comment.
🟡 bugs: Stale close handler creates phantom inactive timer after rapid reconnect. When a new SSE request arrives with the same stable cid, chatSseRegistry.add() closes the old response. connRegistry.reconnect() runs and clears any existing inactive timer. But when the old response's close handler fires asynchronously AFTER reconnect(), it calls markInactive(connectionId) — starting a new 60s timer. The timer is a no-op (transport.isOpen() check saves it), but it's unnecessary work and produces a misleading 'inactive connection TTL expired' log if the connection closes again within the grace window. Consider checking if the transport is already open inside markInactive(), or having the close handler skip markInactive() when connRegistry.get(connectionId)?.transport is a different (newer) object than the closing one. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🔵 missing_tests: The server-side cid format validation regex (/^cid-[0-9a-f]{8}-...$/i) and the fallback-to-server-generated-id logic have no test coverage. An integration or unit test confirming that invalid cid values are rejected (fallback to conn-*) and valid ones are accepted would prevent regression if the format evolves. [fixable]
| } | ||
|
|
||
| if (msg.type === '_send_failed') { | ||
| const willRetry = msg.willRetry !== false; |
There was a problem hiding this comment.
🟡 unsafe_assumptions: willRetry defaults to true via msg.willRetry !== false, meaning any truthy or undefined value is treated as retryable. If a future change emits _send_failed without willRetry (e.g., from a different code path), the store will show 'retrying...' and keep the running spinner indefinitely. Prefer explicit msg.willRetry === true to fail-safe toward the non-retrying case. [fixable]
| * the given response. Prevents a stale close handler from removing a | ||
| * newer replacement stream after chatSseRegistry.add() swapped it. | ||
| */ | ||
| removeIfCurrent(connectionId: string, res: Response): void { |
There was a problem hiding this comment.
🟡 missing_tests: removeIfCurrent() is a new method critical to preventing the stale-close-handler race, but has no unit test. It's only exercised implicitly through the SSE endpoint integration path. A dedicated test verifying it no-ops when res doesn't match (and removes when it does) would guard this invariant. [fixable]
| const MAX_PENDING_SENDS = 100; | ||
| const MAX_SEND_RETRIES = 3; | ||
| const RETRY_FLUSH_DELAY_MS = 3000; | ||
|
|
There was a problem hiding this comment.
🔵 style: SendFailedEvent is exported but never imported or used in the type system — the listener is called with untyped object literals matching the same shape. Either use the type in doPost (e.g., const evt: SendFailedEvent = { ... }; this.listener?.(evt)) or remove the export to avoid implying a typed contract that isn't enforced. [fixable]
| /** Whether the transport is connected and ready to send. */ | ||
| isConnected(): boolean; | ||
| /** Server-assigned connection ID (null before welcome). */ | ||
| /** Client-generated stable connection ID. Available immediately after construction. */ |
There was a problem hiding this comment.
🔵 style: The getConnectionId() return type is still string | null, but after the constructor change, _connectionId is always initialized to a string — it can never be null. The docstring was updated to say 'Available immediately after construction' but the type doesn't reflect this. Consider updating the interface return type to string (or document why null is retained for other implementations). [fixable]
- Make removeIfCurrent() return boolean so close handler knows if stream was actually removed vs already replaced by reconnect - Only call markInactive() when the closing stream was the current one, preventing stale close handlers from starting TTL on fresh connections - Add SessionSseRegistry unit tests for removeIfCurrent() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 7 issue(s) (3 warning).
packages/client/src/store.ts
_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.
- 🟡 bugs (L380): Optimistic
SET_RUNNING: trueis set whenconnection.send()succeeds, but if the POST fails with a transient error (_send_failedwithwillRetry: true),runningstaystrueand is never reset. The_send_failedhandler at line 722 only resetsrunningtofalsewhenwillRetry === false. If the retry eventually succeeds, the server'smessage_start/session_endevents will naturally manage the state. But if all retries fail and the message is dropped (finalwillRetry: falsefires), therunningstate was stuck attruefor up to 4 retry cycles (~9+ seconds). During this time the user sees "thinking" with an error banner — minor UX confusion but not a data loss bug.[fixable] - 🟡 missing_tests (L722): The store-level
_send_failedand_openhandlers (lines 708-733) have no unit tests. The@mitzo/clientpackage only hassse-connection.test.ts— there's no store test file. The transport-level tests verify that_send_failedevents are emitted, but the store's state transitions (sendErrorset/cleared,SET_RUNNINGreset on permanent failure) are untested. These are the user-facing behaviors that matter most.[fixable] - 🔵 style (L726): The error message
'Message delivery delayed — retrying on reconnect...'is misleading: the retry now happens after a 3-second timer viascheduleRetryFlush()even without a reconnect. Consider'Message delivery delayed — retrying...'to match the actual behavior.[fixable]
packages/client/src/sse-connection.ts
_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.
- 🟡 bugs (L368): When a
sendPOST returns a 4xx status other than 404 or 429 (e.g., 400 Bad Request, 401, 403), the error is completely silent — no_send_failedevent, no retry, no logging. The optimisticSET_RUNNING: truein the store will remain stuck until asession_endevent arrives (which it won't, since the server never started processing). This makes 400/401/403 errors invisible to the user. Consider emitting_send_failedwithwillRetry: falsefor all non-OK send responses that aren't handled, or at least for 401/403 which could indicate auth expiry.[fixable] - 🔵 style (L43):
SendFailedEventis exported but never imported by the store or any consumer — the store usesRecord<string, unknown>and checksmsg.type === '_send_failed'with string literals. Either the export is dead (remove it) or the store should import and use it for type safety.[fixable]
server/index.ts
_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.
- 🔵 unsafe_assumptions (L416): The stable CID regex
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/ivalidates format but not origin — any client can claim any CID. A malicious or buggy tab could send another tab'ssessionStorageCID and hijack its connection state (watched sessions, delivery cursors). Since the server has no auth binding between the CID and the authenticated user, this is limited to same-user scope (both tabs share the same auth cookie), so it's low-severity. But it could cause confusing behavior if two tabs accidentally generate the same CID (practically impossible with UUID, but the format constraint itself is the only protection). - 🔵 missing_tests (L113): The
setOnExpiredcallback (lines 113-140) contains meaningful logic: ownership check viagetOwnerConnection, suspended-session skip, and detach. This is only tested indirectly throughConnectionRegistryunit tests (which test the timer/callback mechanism) but the callback body itself — the detach-on-expiry logic with its ownership guard — has no integration test. A test verifying that expiry of a stable-CID connection correctly detaches (or skips detach when ownership changed) would catch regressions in this critical path.[fixable]
| // Optimistic: show "thinking" immediately. The server will confirm | ||
| // via events, or the send-timeout safety net will reset on failure. | ||
| set((s) => ({ | ||
| messages: messagesReducer(s.messages, { type: 'SET_RUNNING', running: true }), |
There was a problem hiding this comment.
🟡 bugs: Optimistic SET_RUNNING: true is set when connection.send() succeeds, but if the POST fails with a transient error (_send_failed with willRetry: true), running stays true and is never reset. The _send_failed handler at line 722 only resets running to false when willRetry === false. If the retry eventually succeeds, the server's message_start / session_end events will naturally manage the state. But if all retries fail and the message is dropped (final willRetry: false fires), the running state was stuck at true for up to 4 retry cycles (~9+ seconds). During this time the user sees "thinking" with an error banner — minor UX confusion but not a data loss bug. [fixable]
| return; | ||
| } | ||
|
|
||
| if (msg.type === '_send_failed') { |
There was a problem hiding this comment.
🟡 missing_tests: The store-level _send_failed and _open handlers (lines 708-733) have no unit tests. The @mitzo/client package only has sse-connection.test.ts — there's no store test file. The transport-level tests verify that _send_failed events are emitted, but the store's state transitions (sendError set/cleared, SET_RUNNING reset on permanent failure) are untested. These are the user-facing behaviors that matter most. [fixable]
| const willRetry = msg.willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry | ||
| ? 'Message delivery delayed — retrying on reconnect...' |
There was a problem hiding this comment.
🔵 style: The error message 'Message delivery delayed — retrying on reconnect...' is misleading: the retry now happens after a 3-second timer via scheduleRetryFlush() even without a reconnect. Consider 'Message delivery delayed — retrying...' to match the actual behavior. [fixable]
| }, | ||
| body: JSON.stringify(body), | ||
| }); | ||
| if (!res.ok && endpoint === 'send') { |
There was a problem hiding this comment.
🟡 bugs: When a send POST returns a 4xx status other than 404 or 429 (e.g., 400 Bad Request, 401, 403), the error is completely silent — no _send_failed event, no retry, no logging. The optimistic SET_RUNNING: true in the store will remain stuck until a session_end event arrives (which it won't, since the server never started processing). This makes 400/401/403 errors invisible to the user. Consider emitting _send_failed with willRetry: false for all non-OK send responses that aren't handled, or at least for 401/403 which could indicate auth expiry. [fixable]
| const RETRY_FLUSH_DELAY_MS = 3000; | ||
|
|
||
| /** Synthetic event emitted when a POST to the send endpoint fails. */ | ||
| export interface SendFailedEvent { |
There was a problem hiding this comment.
🔵 style: SendFailedEvent is exported but never imported by the store or any consumer — the store uses Record<string, unknown> and checks msg.type === '_send_failed' with string literals. Either the export is dead (remove it) or the store should import and use it for type safety. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🔵 unsafe_assumptions: The stable CID regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates format but not origin — any client can claim any CID. A malicious or buggy tab could send another tab's sessionStorage CID and hijack its connection state (watched sessions, delivery cursors). Since the server has no auth binding between the CID and the authenticated user, this is limited to same-user scope (both tabs share the same auth cookie), so it's low-severity. But it could cause confusing behavior if two tabs accidentally generate the same CID (practically impossible with UUID, but the format constraint itself is the only protection).
| // When a stable-cid connection's inactive TTL expires (no reconnect within | ||
| // 60s), detach any sessions it was driving — same logic as the SSE close | ||
| // handler for legacy connections. | ||
| connRegistry.setOnExpired((connectionId, watchedSessions) => { |
There was a problem hiding this comment.
🔵 missing_tests: The setOnExpired callback (lines 113-140) contains meaningful logic: ownership check via getOwnerConnection, suspended-session skip, and detach. This is only tested indirectly through ConnectionRegistry unit tests (which test the timer/callback mechanism) but the callback body itself — the detach-on-expiry logic with its ownership guard — has no integration test. A test verifying that expiry of a stable-CID connection correctly detaches (or skips detach when ownership changed) would catch regressions in this critical path. [fixable]
- Emit _send_failed for all non-OK send responses, not just 404. Covers 400/401/403 which previously left the UI stuck in 'thinking' state - Fix error message: "retrying..." instead of "retrying on reconnect..." since scheduleRetryFlush now retries without waiting for reconnect Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 6 issue(s) (2 warning).
packages/client/src/store.ts
Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.
- 🟡 bugs (L725): When a retried POST succeeds (transient 500 followed by OK on retry), the 'Message delivery delayed — retrying...' sendError is never cleared. No
_send_succeededevent is emitted fromdoPoston success, and_openonly fires on SSE reconnect. The stale error persists until the nextsendMessage()call (which setssendError: nullat the top). Consider emitting a success synthetic event fromdoPostwhen a retried send succeeds (retries > 0 && res.ok), or clearing sendError on any incomingmessage_startevent.[fixable]
packages/client/src/sse-connection.ts
Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.
- 🟡 bugs (L424):
scheduleRetryFlush()exits early if!this._connected(line 426), butdoPostcan push topendingSendsduring a disconnected state (fromflushPendingSendsduring reconnect when the flush itself fails). If SSE is connected but the POST fails, a retry is correctly scheduled. But if the POST failure triggers duringflushPendingSendson reconnect and the connection drops again before the flush timer fires, the timer is never rescheduled because_connectedgoes false. This is benign — the next reconnect will callflushPendingSendsagain — but worth noting thatscheduleRetryFlushis a no-op in this path. - 🔵 style (L43):
SendFailedEventis exported fromsse-connection.tsbut never imported anywhere. The store listener uses untypedRecord<string, unknown>to accessmsg.willRetry. IfSendFailedEventis meant to be a public contract, it should be consumed in the store's listener type; otherwise the export is dead code.[fixable]
packages/client/src/__tests__/sse-connection.test.ts
Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.
- 🔵 missing_tests: No test covers the case where a retried POST eventually succeeds. The existing retry test ('re-queues failed send for retry on reconnect') verifies the second call happens, but doesn't verify the message was actually processed (no listener assertion). Consider adding a test that the
_send_failedevent is NOT emitted on a successful retry.[fixable]
server/__tests__/session-sse-registry.test.ts
Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.
- 🔵 missing_tests: The
removeIfCurrenttests are good, but there's no test for theadd()method's behavior of closing the existing stream when a reconnect happens with the same connectionId. This is the foundation of the reconnect safety — verifying thatexisting.end()is called would strengthen coverage.[fixable]
server/index.ts
Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.
- 🔵 unsafe_assumptions (L416): The cid format regex
/^cid-[0-9a-f]{8}-...$/icorrectly validates the UUID shape but accepts uppercase hex via theiflag.crypto.randomUUID()always returns lowercase. Theiflag is fine for security but means two cids that differ only in case (e.g.,cid-ABC...vscid-abc...) would both pass validation but be treated as different connectionIds in the registry Map. Not a real-world issue since clients always usecrypto.randomUUID(), but worth a comment if intentional.
| if (msg.type === '_send_failed') { | ||
| const willRetry = msg.willRetry !== false; | ||
| store.setState((s) => ({ | ||
| sendError: willRetry |
There was a problem hiding this comment.
🟡 bugs: When a retried POST succeeds (transient 500 followed by OK on retry), the 'Message delivery delayed — retrying...' sendError is never cleared. No _send_succeeded event is emitted from doPost on success, and _open only fires on SSE reconnect. The stale error persists until the next sendMessage() call (which sets sendError: null at the top). Consider emitting a success synthetic event from doPost when a retried send succeeds (retries > 0 && res.ok), or clearing sendError on any incoming message_start event. [fixable]
| * stream is still connected. Without this, failed POSTs would sit in the | ||
| * queue until a reconnect event — which may never come if SSE stays healthy. | ||
| */ | ||
| private scheduleRetryFlush(): void { |
There was a problem hiding this comment.
🟡 bugs: scheduleRetryFlush() exits early if !this._connected (line 426), but doPost can push to pendingSends during a disconnected state (from flushPendingSends during reconnect when the flush itself fails). If SSE is connected but the POST fails, a retry is correctly scheduled. But if the POST failure triggers during flushPendingSends on reconnect and the connection drops again before the flush timer fires, the timer is never rescheduled because _connected goes false. This is benign — the next reconnect will call flushPendingSends again — but worth noting that scheduleRetryFlush is a no-op in this path.
| const RETRY_FLUSH_DELAY_MS = 3000; | ||
|
|
||
| /** Synthetic event emitted when a POST to the send endpoint fails. */ | ||
| export interface SendFailedEvent { |
There was a problem hiding this comment.
🔵 style: SendFailedEvent is exported from sse-connection.ts but never imported anywhere. The store listener uses untyped Record<string, unknown> to access msg.willRetry. If SendFailedEvent is meant to be a public contract, it should be consumed in the store's listener type; otherwise the export is dead code. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🔵 unsafe_assumptions: The cid format regex /^cid-[0-9a-f]{8}-...$/i correctly validates the UUID shape but accepts uppercase hex via the i flag. crypto.randomUUID() always returns lowercase. The i flag is fine for security but means two cids that differ only in case (e.g., cid-ABC... vs cid-abc...) would both pass validation but be treated as different connectionIds in the registry Map. Not a real-world issue since clients always use crypto.randomUUID(), but worth a comment if intentional.
- Clear sendError on any incoming server event, confirming message delivery even when retry succeeded without SSE reconnect - Add SessionSseRegistry.add() stream-close test for reconnect safety Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use proper Response intersection type instead of `as never` to satisfy tsc --noEmit in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 7 issue(s) (3 warning).
packages/client/src/store.ts
Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.
- 🟡 bugs (L771): sendError is cleared on ANY server event that passes the session filter — including global events with no sessionId (task_state, loop_status, inbox_updated). A background task board update could clear a legitimate 'Message delivery delayed — retrying...' banner while the send retry is still pending. Consider only clearing sendError on session-scoped events (where eventSessionId matches the active session), or on message_start/session_id events that confirm the send was actually delivered.
[fixable] - 🔵 missing_tests (L708): The store-level handling of
_open,_send_failed, and the sendError clearing on server events (lines 708-773) has no unit tests. The SSE connection tests verify the transport emits these events, but the store reducer/listener integration — especially the optimistic SET_RUNNING on send success (line 380) and its interaction with _send_failed — is untested.[fixable]
server/index.ts
Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.
- 🟡 unsafe_assumptions (L416): The cid format regex requires exactly the
cid-prefix followed by a UUID v4. But the client generatescid-${crypto.randomUUID()}— if a browser'scrypto.randomUUID()ever returns uppercase hex or a non-v4 format, it would silently fall back to a server-generated connectionId, defeating the stable-cid feature without any client-side indication. The regex is case-insensitive (good), but the client should ideally validate or normalize as well. This is minor since all current browser implementations return lowercase v4 UUIDs.[fixable] - 🔵 unsafe_assumptions (L129): In the onExpired callback,
registry.isAttached()is checked but there's no transport identity guard (unlike the legacy close handler at line 491 which checkssession.transport === transport). This is intentionally different since the transport was already swapped/removed, and ownership is verified via getOwnerConnection. But if a reconnect race causes a new connection to take ownership after getOwnerConnection runs but before detachChat executes, the session could be detached from under the new owner. The window is small and would require exact timing, but worth noting.
packages/client/src/sse-connection.ts
Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.
- 🟡 bugs (L424): scheduleRetryFlush() early-returns if
!this._connected, but the retry timer that queued the message fires via doPost failure while the connection IS still open (SSE stream healthy, just the POST failed). When doPost pushes to pendingSends and calls scheduleRetryFlush(), the guard passes. But if the SSE stream drops between the push and the timer firing, flushPendingSends() will call doPost() while disconnected — doPost doesn't guard against!this._connected, so it will attempt a fetch that likely fails, re-queue, and schedule another flush, creating a retry loop until max retries exhausts. Not critical since MAX_SEND_RETRIES caps it at 3, but the retry counting may drift (each failed flush counts as a retry even though the connection is down).[fixable] - 🔵 style (L40): The SendFailedEvent interface is exported but the
_send_failedtype string is used as a magic literal across sse-connection.ts and store.ts. Consider reusing the interface type or at least a shared constant for the event type to prevent typo-induced silent failures.[fixable] - 🔵 style (L72): The 4-line comment on the connectionId assignment explains the motivation well for a code review, but this is the kind of 'why this PR exists' context that belongs in the commit message rather than permanently in the source. The field name and config docs already communicate the intent.
[fixable]
| // case where a retried POST succeeded: the server starts processing and | ||
| // sends events, confirming delivery. Without this, the "retrying..." | ||
| // banner would persist until the next sendMessage() call. | ||
| if (store.getState().sendError) { |
There was a problem hiding this comment.
🟡 bugs: sendError is cleared on ANY server event that passes the session filter — including global events with no sessionId (task_state, loop_status, inbox_updated). A background task board update could clear a legitimate 'Message delivery delayed — retrying...' banner while the send retry is still pending. Consider only clearing sendError on session-scoped events (where eventSessionId matches the active session), or on message_start/session_id events that confirm the send was actually delivered. [fixable]
| // Foreground recovery: when the page becomes visible again (iOS may have | ||
| // evicted it from memory, losing in-memory state), re-fetch messages from | ||
| // the REST API if we have an active session but no messages in the store. | ||
| if (msg.type === '_open') { |
There was a problem hiding this comment.
🔵 missing_tests: The store-level handling of _open, _send_failed, and the sendError clearing on server events (lines 708-773) has no unit tests. The SSE connection tests verify the transport emits these events, but the store reducer/listener integration — especially the optimistic SET_RUNNING on send success (line 380) and its interaction with _send_failed — is untested. [fixable]
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); |
There was a problem hiding this comment.
🟡 unsafe_assumptions: The cid format regex requires exactly the cid- prefix followed by a UUID v4. But the client generates cid-${crypto.randomUUID()} — if a browser's crypto.randomUUID() ever returns uppercase hex or a non-v4 format, it would silently fall back to a server-generated connectionId, defeating the stable-cid feature without any client-side indication. The regex is case-insensitive (good), but the client should ideally validate or normalize as well. This is minor since all current browser implementations return lowercase v4 UUIDs. [fixable]
| } | ||
| if (registry.isSuspended(found.clientId)) continue; | ||
| const session = registry.get(found.clientId); | ||
| if (session && registry.isAttached(found.clientId)) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: In the onExpired callback, registry.isAttached() is checked but there's no transport identity guard (unlike the legacy close handler at line 491 which checks session.transport === transport). This is intentionally different since the transport was already swapped/removed, and ownership is verified via getOwnerConnection. But if a reconnect race causes a new connection to take ownership after getOwnerConnection runs but before detachChat executes, the session could be detached from under the new owner. The window is small and would require exact timing, but worth noting.
| * stream is still connected. Without this, failed POSTs would sit in the | ||
| * queue until a reconnect event — which may never come if SSE stays healthy. | ||
| */ | ||
| private scheduleRetryFlush(): void { |
There was a problem hiding this comment.
🟡 bugs: scheduleRetryFlush() early-returns if !this._connected, but the retry timer that queued the message fires via doPost failure while the connection IS still open (SSE stream healthy, just the POST failed). When doPost pushes to pendingSends and calls scheduleRetryFlush(), the guard passes. But if the SSE stream drops between the push and the timer firing, flushPendingSends() will call doPost() while disconnected — doPost doesn't guard against !this._connected, so it will attempt a fetch that likely fails, re-queue, and schedule another flush, creating a retry loop until max retries exhausts. Not critical since MAX_SEND_RETRIES caps it at 3, but the retry counting may drift (each failed flush counts as a retry even though the connection is down). [fixable]
|
|
||
| const MAX_PENDING_SENDS = 100; | ||
| const MAX_SEND_RETRIES = 3; | ||
| const RETRY_FLUSH_DELAY_MS = 3000; |
There was a problem hiding this comment.
🔵 style: The SendFailedEvent interface is exported but the _send_failed type string is used as a magic literal across sse-connection.ts and store.ts. Consider reusing the interface type or at least a shared constant for the event type to prevent typo-induced silent failures. [fixable]
| @@ -51,6 +72,10 @@ export class SseConnection implements ChatConnection { | |||
| suspendUrl: '', | |||
There was a problem hiding this comment.
🔵 style: The 4-line comment on the connectionId assignment explains the motivation well for a code review, but this is the kind of 'why this PR exists' context that belongs in the commit message rather than permanently in the source. The field name and config docs already communicate the intent. [fixable]
Global events (task_state, inbox_updated) no longer clear the retry banner — only session-scoped events confirm message delivery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 8 issue(s) (4 warning).
packages/client/src/store.ts
Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.
- 🟡 bugs (L708): The
_openhandler clearssendErrorunconditionally on reconnect, butflushPendingSends()fires async POSTs that haven't resolved yet. Sequence: SSE reconnects →_openclears sendError → flushed retry POST fails →_send_failedsets sendError again. The user sees a brief flash of 'no error' then the error reappears. This is benign (the error eventually shows), but the order is surprising. A cleaner approach: don't clear sendError on_open— let the session-scoped event clearing at line 770 handle it, which only fires once the server actually processes the message.[fixable] - 🟡 bugs (L770): Session-scoped
sendErrorclearing triggers on ANY session event (including events from a different session's background activity that happen to have a sessionId). If session A's send fails and session B emits ablock_delta, the sendError from session A is silently cleared. The guard should also checkeventSessionId === store.getState().sessions.active.[fixable] - 🔵 missing_tests (L708): No test coverage for the
_open→ sendError clearing interaction in the store. The store tests should verify: (1)_openclears a stale sendError, (2)_send_failedafter_openre-sets the error, (3) session-scoped events clear sendError. The optimisticSET_RUNNINGdispatch insendMessage(line 379) also lacks a dedicated test.
server/index.ts
Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.
- 🟡 unsafe_assumptions (L414): The cid validation regex only accepts UUID v4 format (
cid-{uuid}), butcrypto.randomUUID()in some environments produces uppercase hex. The regex uses the/iflag so that's fine. However, the docstring ingetOwnerConnection(ws-handler-v2.ts:89) says connectionIds useconn-${timestamp}-${random}format — this is now stale. Stable cids usecid-${uuid}. The comment should be updated to document both formats, or a future maintainer may assume colons can't appear and miss edge cases.[fixable] - 🔵 missing_tests (L407): The SSE
/api/chat/eventshandler's stable-cid path (reconnect vs register, close handler branching,removeIfCurrentguard) is complex but only has unit tests forSessionSseRegistryandConnectionRegistryin isolation. No integration test verifies the full reconnect flow: client sends?cid=...→ reconnects → close handler defers cleanup → TTL expires → sessions detach. - 🔵 unsafe_assumptions (L417): Invalid cid format is logged as a warning but silently falls through to generate a legacy
conn-ID. This means a typo like?cid=cid-not-a-uuidsilently degrades to legacy behavior with no client feedback. Consider returning a 400 response or including a flag in the welcome event so the client knows its cid was rejected.[fixable]
packages/client/src/sse-connection.ts
Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.
- 🟡 regressions (L119): The
send()method removed thethis._connectionIdguard from the connected-path check (wasthis._connected && this._connectionId, now justthis._connected). Since_connectionIdis always set in the constructor, this is safe. ButdoPostpreviously had an early-returnif (!this._connectionId) return;which was also removed. Verify no caller depends ondoPostbeing a no-op before welcome — nowdoPostalways fires if_connectedis true, which only becomes true after welcome. This looks correct. - 🔵 style (L42):
SendFailedEventis exported but never imported anywhere — consumers handle it via the genericRecord<string, unknown>message type. Consider either using it in the store'swsListenersignature for type safety, or making it non-exported to avoid dead API surface.[fixable]
| // Foreground recovery: when the page becomes visible again (iOS may have | ||
| // evicted it from memory, losing in-memory state), re-fetch messages from | ||
| // the REST API if we have an active session but no messages in the store. | ||
| if (msg.type === '_open') { |
There was a problem hiding this comment.
🟡 bugs: The _open handler clears sendError unconditionally on reconnect, but flushPendingSends() fires async POSTs that haven't resolved yet. Sequence: SSE reconnects → _open clears sendError → flushed retry POST fails → _send_failed sets sendError again. The user sees a brief flash of 'no error' then the error reappears. This is benign (the error eventually shows), but the order is surprising. A cleaner approach: don't clear sendError on _open — let the session-scoped event clearing at line 770 handle it, which only fires once the server actually processes the message. [fixable]
| // Session event arrived — clear any stale send error. Only clear on | ||
| // session-scoped events (eventSessionId present), not global events | ||
| // like task_state or inbox_updated which don't confirm message delivery. | ||
| if (eventSessionId && store.getState().sendError) { |
There was a problem hiding this comment.
🟡 bugs: Session-scoped sendError clearing triggers on ANY session event (including events from a different session's background activity that happen to have a sessionId). If session A's send fails and session B emits a block_delta, the sendError from session A is silently cleared. The guard should also check eventSessionId === store.getState().sessions.active. [fixable]
| // Foreground recovery: when the page becomes visible again (iOS may have | ||
| // evicted it from memory, losing in-memory state), re-fetch messages from | ||
| // the REST API if we have an active session but no messages in the store. | ||
| if (msg.type === '_open') { |
There was a problem hiding this comment.
🔵 missing_tests: No test coverage for the _open → sendError clearing interaction in the store. The store tests should verify: (1) _open clears a stale sendError, (2) _send_failed after _open re-sets the error, (3) session-scoped events clear sendError. The optimistic SET_RUNNING dispatch in sendMessage (line 379) also lacks a dedicated test.
| // ownership race. Client sends ?cid=<uuid>; server reuses it across | ||
| // SSE reconnects so the clientId (connectionId:sessionId) never changes. | ||
| const clientCid = req.query.cid as string | undefined; | ||
| const isStableCid = |
There was a problem hiding this comment.
🟡 unsafe_assumptions: The cid validation regex only accepts UUID v4 format (cid-{uuid}), but crypto.randomUUID() in some environments produces uppercase hex. The regex uses the /i flag so that's fine. However, the docstring in getOwnerConnection (ws-handler-v2.ts:89) says connectionIds use conn-${timestamp}-${random} format — this is now stale. Stable cids use cid-${uuid}. The comment should be updated to document both formats, or a future maintainer may assume colons can't appear and miss edge cases. [fixable]
| @@ -374,54 +407,96 @@ app.get('/api/chat/events', (req, res) => { | |||
| 'X-Accel-Buffering': 'no', | |||
There was a problem hiding this comment.
🔵 missing_tests: The SSE /api/chat/events handler's stable-cid path (reconnect vs register, close handler branching, removeIfCurrent guard) is complex but only has unit tests for SessionSseRegistry and ConnectionRegistry in isolation. No integration test verifies the full reconnect flow: client sends ?cid=... → reconnects → close handler defers cleanup → TTL expires → sessions detach.
| const isStableCid = | ||
| clientCid != null && | ||
| /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid); | ||
| if (clientCid != null && !isStableCid) { |
There was a problem hiding this comment.
🔵 unsafe_assumptions: Invalid cid format is logged as a warning but silently falls through to generate a legacy conn- ID. This means a typo like ?cid=cid-not-a-uuid silently degrades to legacy behavior with no client feedback. Consider returning a 400 response or including a flag in the welcome event so the client knows its cid was rejected. [fixable]
| if (!endpoint) return false; | ||
|
|
||
| if (this._connected && this._connectionId) { | ||
| if (this._connected) { |
There was a problem hiding this comment.
🟡 regressions: The send() method removed the this._connectionId guard from the connected-path check (was this._connected && this._connectionId, now just this._connected). Since _connectionId is always set in the constructor, this is safe. But doPost previously had an early-return if (!this._connectionId) return; which was also removed. Verify no caller depends on doPost being a no-op before welcome — now doPost always fires if _connected is true, which only becomes true after welcome. This looks correct.
| const MAX_SEND_RETRIES = 3; | ||
| const RETRY_FLUSH_DELAY_MS = 3000; | ||
|
|
||
| /** Synthetic event emitted when a POST to the send endpoint fails. */ |
There was a problem hiding this comment.
🔵 style: SendFailedEvent is exported but never imported anywhere — consumers handle it via the generic Record<string, unknown> message type. Consider either using it in the store's wsListener signature for type safety, or making it non-exported to avoid dead API surface. [fixable]
- Remove premature sendError clear on _open (flushPendingSends hasn't resolved yet — let session events confirm delivery instead) - Scope sendError clear to active session events only, preventing background session activity from hiding retry state Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onid Revert: stable client connectionId (PR #401)
Summary
Eliminates the "send message twice to wake agent" bug after iOS kills the SSE connection.
connectionId, breaking theclientId = ${connectionId}:${sessionId}ownership chain.handleReconnectskipped reattach when the old transport appeared attached but was actually dead.cid-{uuid}), sends it via?cid=on SSE GET. Server reuses it across reconnects —getOwnerConnection(clientId)always returns the same connectionId, soisOwneris always true. NohandleReconnectcode changes needed._send_failedemission. OptimisticSET_RUNNINGafter send. Deferred connection cleanup with 60s TTL.Changes
packages/client/src/sse-connection.ts_send_failedemissionpackages/harness/src/connection-registry.tsreconnect()swaps transport preserving state,markInactive()+onExpiredTTLserver/index.ts?cid=, validate,connRegistry.reconnect(), deferred close,onExpiredwiringfrontend/src/client-store.tssessionStorage-persisted connectionId per tabpackages/client/src/store.tsSET_RUNNING,_send_failedhandlerBackward Compatibility
?cid=)?cid=)conn-*as beforeTest plan
Supersedes #396.
🤖 Generated with Claude Code