Skip to content

feat(transport): stable client connectionId + POST failure resilience#401

Merged
dimakis merged 11 commits into
mainfrom
feat/stable-client-cid
Jun 26, 2026
Merged

feat(transport): stable client connectionId + POST failure resilience#401
dimakis merged 11 commits into
mainfrom
feat/stable-client-cid

Conversation

@dimakis

@dimakis dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner

Summary

Eliminates the "send message twice to wake agent" bug after iOS kills the SSE connection.

  • Root cause: Every SSE reconnect minted a new server-generated connectionId, breaking the clientId = ${connectionId}:${sessionId} ownership chain. handleReconnect skipped reattach when the old transport appeared attached but was actually dead.
  • Fix: Client generates a stable UUID per tab (cid-{uuid}), sends it via ?cid= on SSE GET. Server reuses it across reconnects — getOwnerConnection(clientId) always returns the same connectionId, so isOwner is always true. No handleReconnect code changes needed.
  • Resilience: POST failures (404/429/5xx/network) are re-queued with _send_failed emission. Optimistic SET_RUNNING after send. Deferred connection cleanup with 60s TTL.

Changes

File What
packages/client/src/sse-connection.ts Stable cid in URL, welcome validation, backward compat, _send_failed emission
packages/harness/src/connection-registry.ts reconnect() swaps transport preserving state, markInactive() + onExpired TTL
server/index.ts Accept ?cid=, validate, connRegistry.reconnect(), deferred close, onExpired wiring
frontend/src/client-store.ts sessionStorage-persisted connectionId per tab
packages/client/src/store.ts Optimistic SET_RUNNING, _send_failed handler

Backward Compatibility

Client Server Behavior
New (sends ?cid=) New Stable connectionId, full fix
New (sends ?cid=) Old (ignores cid) Falls back to server-assigned ID
Old (no cid) New Server generates conn-* as before

Test plan

  • 17 new tests: stable cid generation/persistence/URL, reconnect transport swap, markInactive TTL, onExpired callback, POST failure surfacing (500/429/network), re-queue + retry
  • All 90 connection-registry + sse-connection tests pass
  • Lint + gitleaks clean
  • Deploy server, rebuild iOS app
  • Test: open session → send → force-kill app → reopen → send → should respond immediately (no double-send)
  • Test: airplane mode 5s → disable → send → should work on first try

Supersedes #396.

🤖 Generated with Claude Code

Client generates a UUID per tab, sends it via ?cid= on SSE GET. Server
reuses it across reconnects, eliminating the ownership race in
handleReconnect that caused the "send twice to wake agent" bug.

- SseConnection: stable cid in URL, welcome validation, backward compat
- ConnectionRegistry: reconnect() swaps transport preserving state,
  markInactive() with 60s TTL + onExpired callback for deferred cleanup
- Server: accept ?cid=, validate format, connRegistry.reconnect() on
  existing cid, deferred close with onExpired → detachChat()
- Frontend: sessionStorage-persisted connectionId per tab
- Store: optimistic SET_RUNNING after send, _send_failed handler
- doPost: re-queue failed sends (404/429/5xx) with _send_failed emission
- Tests: 17 new tests for stable cid, reconnect, TTL, POST failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis

dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 6 issue(s) (3 warning).

server/index.ts

Well-designed transport resilience feature with thorough client-side tests. Main concern: the optimistic SET_RUNNING in the store can leave the UI stuck if a send POST fails while the SSE stream remains open, and the onExpired callback lacks the transport ownership guard present in the legacy close handler.

  • 🟡 bugs: The onExpired callback detaches sessions without checking transport ownership. The legacy SSE close handler has session.transport === transport guard to avoid detaching sessions already reattached by another connection; onExpired has no equivalent. Currently safe because remove() and the callback run synchronously in the same setTimeout, but fragile — an ownership check like getOwnerConnection(found.clientId) === connectionId would match the legacy handler's safety guarantees. [fixable]
  • 🔵 unsafe_assumptions: CID validation regex /^cid-[0-9a-f-]{36}$/ is looser than UUID format — it accepts patterns like 36 dashes or 36 hex digits with no structure. A stricter pattern like /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/ would reject malformed cids. Not a security issue (the cid is just a lookup key) but tighter validation prevents confusion. [fixable]
  • 🔵 missing_tests: No server-side tests cover the new stable CID flow: the ?cid= query param validation/branching in the SSE handler, the stable-cid vs legacy close handler paths, or the onExpired callback that triggers session detach on TTL expiry. All new tests are client-side (SseConnection) or harness-level (ConnectionRegistry). [fixable]

packages/client/src/store.ts

Well-designed transport resilience feature with thorough client-side tests. Main concern: the optimistic SET_RUNNING in the store can leave the UI stuck if a send POST fails while the SSE stream remains open, and the onExpired callback lacks the transport ownership guard present in the legacy close handler.

  • 🟡 bugs: The optimistic SET_RUNNING dispatch (new in this PR) is not cleared when _send_failed arrives. If the POST to /api/chat/send fails but the SSE stream stays open (no reconnect), running stays true indefinitely — the user sees both a spinner and the error banner, and cannot send new messages. The inline comment references a 'send-timeout safety net' that does not appear to exist. [fixable]

packages/client/src/sse-connection.ts

Well-designed transport resilience feature with thorough client-side tests. Main concern: the optimistic SET_RUNNING in the store can leave the UI stuck if a send POST fails while the SSE stream remains open, and the onExpired callback lacks the transport ownership guard present in the legacy close handler.

  • 🟡 unsafe_assumptions: doPost re-queues failed sends into pendingSends, and flushPendingSends calls doPost without awaiting. If multiple pending sends all fail concurrently during a flush (e.g., server returns 500), each fires a separate _send_failed event and re-queues. There is no max-retry counter or exponential backoff — messages will retry on every reconnect indefinitely. This is bounded by MAX_PENDING_SENDS (100) but could saturate the queue with retries of permanently failing messages. [fixable]

server/ws-handler-v2.ts

Well-designed transport resilience feature with thorough client-side tests. Main concern: the optimistic SET_RUNNING in the store can leave the UI stuck if a send POST fails while the SSE stream remains open, and the onExpired callback lacks the transport ownership guard present in the legacy close handler.

  • 🔵 style (L89): getOwnerConnection JSDoc (lines 88-90) says 'ConnectionIds use the format conn-${timestamp}-${random}' but stable CIDs now use cid-${uuid}. The function works correctly for both formats (neither contains colons), but the comment is stale and could mislead future readers. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (1 critical) (3 warning).

server/index.ts

Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.

  • 🔴 bugs (L118): onExpired callback is missing the session.transport === transport ownership check that the legacy SSE close handler has (line 417). If another tab/connection reattaches to the same session during the 60s grace period, onExpired will incorrectly detach it. The callback only receives (connectionId, watchedSessions) — it needs access to the original transport, or a connectionId-based ownership check, to avoid detaching sessions now driven by a different connection. [fixable]
  • 🟡 regressions (L438): The stable-cid SSE close path calls connRegistry.markInactive(connectionId) but never calls registry.removeObserver(transport). The legacy path (line 450) does call it. Stale transport references will accumulate in the SessionRegistry observer sets across reconnects — each reconnect creates a new SseTransport, adds it as an observer, but the old one is never removed. This leaks memory and causes dead-transport broadcast attempts. [fixable]
  • 🟡 unsafe_assumptions (L403): The CID validation regex /^cid-[0-9a-f-]{36}$/ assumes crypto.randomUUID() always produces lowercase hex. While the Web Crypto spec mandates lowercase, the regex silently falls back to a server-generated connectionId for any mismatch — which defeats the feature. Consider case-insensitive match or normalizing to lowercase. [fixable]

packages/client/src/sse-connection.ts

Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.

  • 🟡 bugs (L348): doPost re-queues failed sends into pendingSends with no retry counter. If a message triggers a persistent 404 or 500 (e.g., malformed body), it will be re-queued and retried on every reconnect forever. Consider adding a maxRetries field to pending items and dropping messages that exceed it, or at minimum emitting willRetry: false after N retries. [fixable]
  • 🔵 style (L350): The _send_failed synthetic event type uses a leading underscore convention shared with _open and _close, but unlike those, it carries structured fields (clientMsgId, willRetry). Consider defining a TypeScript interface for it in the protocol types to catch misuse at compile time, rather than relying on untyped Record<string, unknown> matching. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Solid transport resilience feature, but the onExpired callback has a critical missing transport-ownership check that can detach sessions driven by other connections, and the stable-cid close path leaks observer transport references.

  • 🔵 missing_tests: No test for the retry-limit concern: a persistently failing send message being re-queued and retried across multiple reconnects. Adding a test that verifies behavior after N consecutive failures would validate the retry semantics. [fixable]

Comment thread server/index.ts
if (!found) continue;
if (registry.isSuspended(found.clientId)) continue;
const session = registry.get(found.clientId);
if (session && registry.isAttached(found.clientId)) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 bugs: onExpired callback is missing the session.transport === transport ownership check that the legacy SSE close handler has (line 417). If another tab/connection reattaches to the same session during the 60s grace period, onExpired will incorrectly detach it. The callback only receives (connectionId, watchedSessions) — it needs access to the original transport, or a connectionId-based ownership check, to avoid detaching sessions now driven by a different connection. [fixable]

Comment thread server/index.ts

const conn = connRegistry.get(connectionId);
const watchedSessions = conn ? [...conn.watchedSessions] : [];
if (isStableCid) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 regressions: The stable-cid SSE close path calls connRegistry.markInactive(connectionId) but never calls registry.removeObserver(transport). The legacy path (line 450) does call it. Stale transport references will accumulate in the SessionRegistry observer sets across reconnects — each reconnect creates a new SseTransport, adds it as an observer, but the old one is never removed. This leaks memory and causes dead-transport broadcast attempts. [fixable]

Comment thread server/index.ts Outdated
// ownership race. Client sends ?cid=<uuid>; server reuses it across
// SSE reconnects so the clientId (connectionId:sessionId) never changes.
const clientCid = req.query.cid as string | undefined;
const isStableCid = clientCid != null && /^cid-[0-9a-f-]{36}$/.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The CID validation regex /^cid-[0-9a-f-]{36}$/ assumes crypto.randomUUID() always produces lowercase hex. While the Web Crypto spec mandates lowercase, the regex silently falls back to a server-generated connectionId for any mismatch — which defeats the feature. Consider case-insensitive match or normalizing to lowercase. [fixable]

Comment thread packages/client/src/sse-connection.ts Outdated
});
if (!res.ok && endpoint === 'send') {
// Transient failure — re-queue for retry on next reconnect
if (res.status === 404 || res.status === 429 || res.status >= 500) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: doPost re-queues failed sends into pendingSends with no retry counter. If a message triggers a persistent 404 or 500 (e.g., malformed body), it will be re-queued and retried on every reconnect forever. Consider adding a maxRetries field to pending items and dropping messages that exceed it, or at minimum emitting willRetry: false after N retries. [fixable]

// Transient failure — re-queue for retry on next reconnect
if (res.status === 404 || res.status === 429 || res.status >= 500) {
this.pendingSends.push({ endpoint, body });
this.listener?.({

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The _send_failed synthetic event type uses a leading underscore convention shared with _open and _close, but unlike those, it carries structured fields (clientMsgId, willRetry). Consider defining a TypeScript interface for it in the protocol types to catch misuse at compile time, rather than relying on untyped Record<string, unknown> matching. [fixable]

…ak, retry limit

- onExpired: verify session still owned by expired connectionId before
  detaching (prevents cross-tab detach race)
- Stable-cid close: call removeObserver(transport) to prevent stale
  transport accumulation in SessionRegistry observer sets
- CID regex: case-insensitive match for crypto.randomUUID() compat
- doPost: add retry counter (max 3), emit willRetry:false after exhaustion
- Export SendFailedEvent interface for type-safe event handling
- Test: verify message dropped after max retries with willRetry:false

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (2 warning).

packages/client/src/store.ts

Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.

  • 🟡 bugs (L377): Optimistic SET_RUNNING is set when send() returns true, but when _send_failed fires (lines 714-718), only sendError is set — running is never reset to false. If the final willRetry: false event fires, the UI stays in a permanent "thinking" spinner. The _send_failed handler should also dispatch SET_RUNNING: false when willRetry is false. [fixable]

server/index.ts

Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.

  • 🟡 bugs (L413): The regex /^cid-[0-9a-f-]{36}$/i allows malformed UUIDs (e.g. cid------------------------------------- — 36 dashes) because [0-9a-f-]{36} doesn't enforce UUID structure. A crafted cid query param could pass validation while not being a real UUID. Impact is low since the cid is only used as a map key, but consider a stricter pattern like /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i. [fixable]
  • 🔵 style (L117): getOwnerConnection() is already exported from ws-handler-v2.ts and does this exact logic (clientId.slice(0, colonIdx)). The onExpired callback reimplements it with substring(0, indexOf(':')). Import and reuse the existing function to avoid divergence. [fixable]
  • 🔵 missing_tests (L112): The setOnExpired callback in index.ts (lines 112-139) has no integration test. It contains ownership checks, suspended-session guards, and detach logic — all of which are only tested indirectly via the ConnectionRegistry unit tests. A test covering the wiring (TTL expires → sessions actually detached) would catch regressions in the glue code.

packages/client/src/__tests__/sse-connection.test.ts

Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.

  • 🔵 missing_tests: No test covers what happens when doPost fails for a 4xx status OTHER than 404/429 (e.g. 400, 403, 422). These are silently swallowed — no _send_failed event, no retry. This is likely intentional (client errors shouldn't retry), but a test documenting this behavior would prevent accidental regressions. [fixable]

packages/client/src/sse-connection.ts

Solid transport resilience feature with good test coverage. Main concern: optimistic running state set on send is never cleared when _send_failed fires with willRetry: false, leaving the UI stuck in a thinking spinner.

  • 🔵 unsafe_assumptions (L363): 404 is treated as retryable alongside 429 and 5xx. A 404 on the send endpoint typically means the session or route doesn't exist — retrying won't help. Consider treating 404 as non-retryable (emit _send_failed with willRetry: false) to avoid wasting 3 retry cycles on a permanently missing resource. [fixable]

if (!sent) {
set({ sendError: 'Not connected. Message will be sent when reconnected.' });
} else {
// Optimistic: show "thinking" immediately. The server will confirm

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: Optimistic SET_RUNNING is set when send() returns true, but when _send_failed fires (lines 714-718), only sendError is set — running is never reset to false. If the final willRetry: false event fires, the UI stays in a permanent "thinking" spinner. The _send_failed handler should also dispatch SET_RUNNING: false when willRetry is false. [fixable]

Comment thread server/index.ts Outdated
// ownership race. Client sends ?cid=<uuid>; server reuses it across
// SSE reconnects so the clientId (connectionId:sessionId) never changes.
const clientCid = req.query.cid as string | undefined;
const isStableCid = clientCid != null && /^cid-[0-9a-f-]{36}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: The regex /^cid-[0-9a-f-]{36}$/i allows malformed UUIDs (e.g. cid------------------------------------- — 36 dashes) because [0-9a-f-]{36} doesn't enforce UUID structure. A crafted cid query param could pass validation while not being a real UUID. Impact is low since the cid is only used as a map key, but consider a stricter pattern like /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i. [fixable]

Comment thread server/index.ts Outdated
const found = registry.findBySessionId(sessionId);
if (!found) continue;
// Skip if another connection has taken over this session
const ownerCid = found.clientId.substring(0, found.clientId.indexOf(':'));

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: getOwnerConnection() is already exported from ws-handler-v2.ts and does this exact logic (clientId.slice(0, colonIdx)). The onExpired callback reimplements it with substring(0, indexOf(':')). Import and reuse the existing function to avoid divergence. [fixable]

Comment thread server/index.ts
// When a stable-cid connection's inactive TTL expires (no reconnect within
// 60s), detach any sessions it was driving — same logic as the SSE close
// handler for legacy connections.
connRegistry.setOnExpired((connectionId, watchedSessions) => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The setOnExpired callback in index.ts (lines 112-139) has no integration test. It contains ownership checks, suspended-session guards, and detach logic — all of which are only tested indirectly via the ConnectionRegistry unit tests. A test covering the wiring (TTL expires → sessions actually detached) would catch regressions in the glue code.

Comment thread packages/client/src/sse-connection.ts Outdated
body: JSON.stringify(body),
});
if (!res.ok && endpoint === 'send') {
if (res.status === 404 || res.status === 429 || res.status >= 500) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: 404 is treated as retryable alongside 429 and 5xx. A 404 on the send endpoint typically means the session or route doesn't exist — retrying won't help. Consider treating 404 as non-retryable (emit _send_failed with willRetry: false) to avoid wasting 3 retry cycles on a permanently missing resource. [fixable]

…ndling, regex

- store: clear SET_RUNNING when _send_failed fires with willRetry:false
  (prevents stuck "thinking" spinner on permanent failure)
- sse-connection: treat 404 as permanent failure (no retry), separate
  from transient 429/5xx retries
- server: strict UUID regex, import getOwnerConnection() from
  ws-handler-v2 instead of reimplementing
- tests: 404 permanent failure, 4xx silently swallowed (400/403)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis

dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 7 issue(s) (3 warning).

packages/client/src/sse-connection.ts

Solid architecture for stable connectionId and POST failure surfacing, but the retry mechanism only triggers on SSE reconnect — if SSE stays healthy while POSTs fail, the spinner gets stuck and retries never fire.

  • 🟡 bugs (L366): Failed sends are re-queued into pendingSends, but flushPendingSends() is only called during SSE reconnect (welcome handler). If the SSE stream stays healthy while the POST endpoint returns 5xx (transient API issue), retries never fire — the message sits in the queue indefinitely. The user sees "retrying on reconnect..." but no retry actually happens until an SSE disconnect/reconnect cycle. Consider adding a delayed retry timer (e.g., exponential backoff) for the connected-but-POST-failed case. [fixable]
  • 🔵 style (L138): getConnectionId() still returns string | null even though _connectionId is now always a string (initialized in the constructor, never set to null). The return type could be tightened to string, though this would require updating the ChatConnection interface — so may be intentionally left for backward compat. [fixable]

packages/client/src/store.ts

Solid architecture for stable connectionId and POST failure surfacing, but the retry mechanism only triggers on SSE reconnect — if SSE stays healthy while POSTs fail, the spinner gets stuck and retries never fire.

  • 🟡 bugs (L717): sendError is set to "Message delivery delayed — retrying on reconnect..." on transient failure, but never cleared when the retry eventually succeeds. sendError is only reset to null when the user sends the next message (line 345). If the retry succeeds silently, the error banner persists until the user takes another action. Consider clearing sendError in the message_start handler or when SET_RUNNING transitions from server events. [fixable]
  • 🟡 bugs (L720): When _send_failed arrives with willRetry: true, SET_RUNNING is NOT reset — the UI stays in "thinking" state while the message sits in the pending retry queue. Combined with the issue that retries only fire on SSE reconnect, the spinner can be stuck indefinitely if SSE stays connected. The user has no way to interact (the input is blocked while running is true). Consider resetting SET_RUNNING to false on willRetry: true as well, and re-dispatching SET_RUNNING: true only when the retry actually fires. [fixable]

server/index.ts

Solid architecture for stable connectionId and POST failure surfacing, but the retry mechanism only triggers on SSE reconnect — if SSE stays healthy while POSTs fail, the spinner gets stuck and retries never fire.

  • 🔵 unsafe_assumptions (L416): The stable CID regex /^cid-[0-9a-f]{8}-...$/i only accepts the cid- prefix with UUID format. If a future client implementation changes the ID format (or a non-browser client uses a different scheme), it silently falls back to a server-generated conn- ID. This is safe but could be confusing to debug. A log line when clientCid is provided but rejected by the regex would help diagnose integration issues. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Solid architecture for stable connectionId and POST failure surfacing, but the retry mechanism only triggers on SSE reconnect — if SSE stays healthy while POSTs fail, the spinner gets stuck and retries never fire.

  • 🔵 missing_tests: No test covers the scenario where a POST fails while SSE is connected (no reconnect), verifying that the retry sits in pendingSends and is eventually flushed on the next reconnect. The existing 're-queues failed send for retry on reconnect' test forces a reconnect immediately — it doesn't prove the message survives a period of connectivity. [fixable]
  • 🔵 missing_tests: No test verifies the sendError lifecycle in the store when retry succeeds: set on _send_failed with willRetry: true, then cleared when the server eventually responds. This is a store-level test, but it validates the end-to-end UX contract. [fixable]

@dimakis

dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 6 issue(s) (3 warning).

packages/client/src/store.ts

Well-structured transport resilience feature with solid test coverage at the connection layer. Main gaps are missing store-level tests for the new _send_failed handler and optimistic running state, a missing barrel export for SendFailedEvent, and a potential message ordering issue in flushPendingSends.

  • 🟡 missing_tests (L714): The store-level _send_failed handler (lines 714-725) and the optimistic SET_RUNNING dispatch (lines 379-381) have no unit tests. The _send_failed event is only tested at the SseConnection level. Store behavior — particularly the conditional SET_RUNNING: false when willRetry is false — should be validated by store tests. [fixable]
  • 🔵 style (L377): The comment says "the send-timeout safety net will reset on failure" but the safety-net timer (line 355) only runs when wasRunning is true — it doesn't cover fresh sends. For a fresh send that fails transiently (willRetry: true), the spinner stays on until either the retry succeeds or all retries are exhausted. The comment should clarify this: the safety net is the _send_failed { willRetry: false } event, not the timer. [fixable]

packages/client/src/index.ts

Well-structured transport resilience feature with solid test coverage at the connection layer. Main gaps are missing store-level tests for the new _send_failed handler and optimistic running state, a missing barrel export for SendFailedEvent, and a potential message ordering issue in flushPendingSends.

  • 🟡 regressions (L74): SendFailedEvent is exported from sse-connection.ts (line 42) but is NOT re-exported from the @mitzo/client barrel (index.ts). Consumers who need to type-narrow the listener callback can't import it without reaching into the internal module path, breaking the package's public API contract. [fixable]

server/ws-handler-v2.ts

Well-structured transport resilience feature with solid test coverage at the connection layer. Main gaps are missing store-level tests for the new _send_failed handler and optimistic running state, a missing barrel export for SendFailedEvent, and a potential message ordering issue in flushPendingSends.

  • 🔵 style (L88): The JSDoc on getOwnerConnection says "ConnectionIds use the format conn-${timestamp}-${random} (see server/index.ts L196)". With stable cids the format is now also cid-${uuid}. The function works correctly (neither format contains colons), but the comment is misleading and the line reference is stale. [fixable]

server/index.ts

Well-structured transport resilience feature with solid test coverage at the connection layer. Main gaps are missing store-level tests for the new _send_failed handler and optimistic running state, a missing barrel export for SendFailedEvent, and a potential message ordering issue in flushPendingSends.

  • 🔵 unsafe_assumptions (L416): The cid regex /^cid-[0-9a-f]{8}-...$/i validates UUID format strictly, which is good. However, this means a legitimate client that uses a non-standard UUID generator (e.g., shorter random string) will silently fall back to a server-generated conn- id, losing stable-cid benefits with no error or log line. Consider logging when a cid param is present but doesn't match the pattern, to aid debugging. [fixable]

packages/client/src/sse-connection.ts

Well-structured transport resilience feature with solid test coverage at the connection layer. Main gaps are missing store-level tests for the new _send_failed handler and optimistic running state, a missing barrel export for SendFailedEvent, and a potential message ordering issue in flushPendingSends.

  • 🟡 bugs (L398): flushPendingSends() fires all queued doPost() calls without awaiting — they execute concurrently. If multiple user messages were queued during a disconnect, they may arrive at the server out of order. Since doPost is async and HTTP/2 multiplexes requests, delivery order is not guaranteed. For the send endpoint (which starts agent work), misordered prompts could produce confusing behavior. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 7 issue(s) (3 warning).

packages/client/src/store.ts

Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.

  • 🟡 bugs (L717): When a retried send eventually succeeds (e.g. willRetry: true followed by successful doPost on reconnect), sendError is never cleared. The user sees "Message delivery delayed — retrying on reconnect..." indefinitely until they send another message (which clears it at line 345). Consider clearing sendError on the _open synthetic event emitted after successful reconnect. [fixable]
  • 🟡 bugs (L380): The optimistic SET_RUNNING fires on send() returning true, but send() returns true even when the message is just queued (not yet delivered). If the POST then fails with willRetry: true, the spinner stays showing because SET_RUNNING: false is only dispatched for willRetry: false. This is probably intentional (server events will drive running state on success), but if all retries also fail, there's a window where running is true with no active agent — the user sees an eternal spinner until the final willRetry: false event arrives after 3 reconnect cycles.
  • 🔵 style (L715): Redundant type cast: (msg as Record<string, unknown>).willRetrymsg is already typed as Record<string, unknown> from the function parameter on line 704. Just use msg.willRetry !== false. [fixable]
  • 🔵 missing_tests (L714): The store-level _send_failed handler and the optimistic SET_RUNNING on successful send() are new behaviors without store-level unit tests. The SSE connection tests cover the transport layer well, but the Zustand store's reaction to these events (sendError state, running state transitions) is untested. [fixable]

server/index.ts

Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.

  • 🟡 unsafe_assumptions (L416): The cid regex /^cid-[0-9a-f]{8}-...$/i validates only the exact format cid-{uuid}. If the client code or a future browser produces a non-standard UUID format (e.g., uppercase from a polyfill, or a non-v4 UUID), the cid silently falls back to a server-generated one, breaking the stable-cid reconnect flow with no error signal. The case-insensitive /i flag partially mitigates this, but consider logging when a non-null cid query param fails validation, so the mismatch is diagnosable. [fixable]
  • 🔵 regressions (L456): In the stable-cid close handler, registry.removeObserver(transport) is called immediately but session detach is deferred via markInactive. During the 60s grace period, if the agent emits events, they won't reach this transport (removed as observer) but the connection entry still exists in connRegistry. This is likely fine because the connection's transport isOpen() returns false, and the periodic sync in ConnectionRegistry filters by isOpen() — but it's worth noting the transport is stale until reconnect swaps it.

packages/client/src/__tests__/sse-connection.test.ts

Well-structured transport resilience feature with solid test coverage. Main concern is the sendError state not being cleared when a retried send succeeds — users may see a stale error banner. The SSE-level implementation (stable cid, retry queue, connection registry TTL) is sound.

  • 🔵 missing_tests: No test for concurrent pending sends during flush. When flushPendingSends fires multiple doPost calls without awaiting, and several fail, they each push back to pendingSends. A test verifying that multiple concurrent failures during flush correctly accumulate in the retry queue (and don't exceed MAX_SEND_RETRIES per message) would increase confidence. [fixable]

if (msg.type === '_send_failed') {
const willRetry = (msg as Record<string, unknown>).willRetry !== false;
store.setState((s) => ({
sendError: willRetry

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: When a retried send eventually succeeds (e.g. willRetry: true followed by successful doPost on reconnect), sendError is never cleared. The user sees "Message delivery delayed — retrying on reconnect..." indefinitely until they send another message (which clears it at line 345). Consider clearing sendError on the _open synthetic event emitted after successful reconnect. [fixable]

// Optimistic: show "thinking" immediately. The server will confirm
// via events, or the send-timeout safety net will reset on failure.
set((s) => ({
messages: messagesReducer(s.messages, { type: 'SET_RUNNING', running: true }),

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: The optimistic SET_RUNNING fires on send() returning true, but send() returns true even when the message is just queued (not yet delivered). If the POST then fails with willRetry: true, the spinner stays showing because SET_RUNNING: false is only dispatched for willRetry: false. This is probably intentional (server events will drive running state on success), but if all retries also fail, there's a window where running is true with no active agent — the user sees an eternal spinner until the final willRetry: false event arrives after 3 reconnect cycles.

Comment thread packages/client/src/store.ts Outdated
}

if (msg.type === '_send_failed') {
const willRetry = (msg as Record<string, unknown>).willRetry !== false;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: Redundant type cast: (msg as Record<string, unknown>).willRetrymsg is already typed as Record<string, unknown> from the function parameter on line 704. Just use msg.willRetry !== false. [fixable]

return;
}

if (msg.type === '_send_failed') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The store-level _send_failed handler and the optimistic SET_RUNNING on successful send() are new behaviors without store-level unit tests. The SSE connection tests cover the transport layer well, but the Zustand store's reaction to these events (sendError state, running state transitions) is untested. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The cid regex /^cid-[0-9a-f]{8}-...$/i validates only the exact format cid-{uuid}. If the client code or a future browser produces a non-standard UUID format (e.g., uppercase from a polyfill, or a non-v4 UUID), the cid silently falls back to a server-generated one, breaking the stable-cid reconnect flow with no error signal. The case-insensitive /i flag partially mitigates this, but consider logging when a non-null cid query param fails validation, so the mismatch is diagnosable. [fixable]

Comment thread server/index.ts
// connectionId — markInactive starts a 60s TTL. If the TTL expires,
// the onExpired callback triggers session detach.
connRegistry.markInactive(connectionId);
registry.removeObserver(transport);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 regressions: In the stable-cid close handler, registry.removeObserver(transport) is called immediately but session detach is deferred via markInactive. During the 60s grace period, if the agent emits events, they won't reach this transport (removed as observer) but the connection entry still exists in connRegistry. This is likely fine because the connection's transport isOpen() returns false, and the periodic sync in ConnectionRegistry filters by isOpen() — but it's worth noting the transport is stale until reconnect swaps it.

… logging

- Clear sendError on _open (reconnect success) to prevent stale error banners
- Remove redundant type cast on msg.willRetry in _send_failed handler
- Log warning when client sends invalid cid format on SSE connect

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (2 warning).

packages/client/src/sse-connection.ts

Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.

  • 🟡 bugs (L362): When a send POST returns 400 (bad request), it falls through silently — no _send_failed emitted, no retry. Combined with the new optimistic SET_RUNNING(true) in store.ts:380, this leaves the UI stuck in 'thinking' state indefinitely. The comment references a 'send-timeout safety net' but no such timer exists for this code path. Either emit _send_failed with willRetry: false for all non-2xx send responses that don't already have handling, or treat 400 the same as 404 (permanent failure). [fixable]
  • 🔵 style (L43): SendFailedEvent.clientMsgId is typed as unknown but is always a string (generated by the store as user-${Date.now()}-...). Using string would improve type safety in the store handler at store.ts:722 where it's consumed without narrowing. [fixable]
  • 🔵 regressions (L113): The send() guard changed from this._connected && this._connectionId to just this._connected. Previously, _connectionId being null was an extra safety check that send wouldn't fire before the welcome handshake. Now _connectionId is always set (from constructor), so the guard change is semantically correct — but doPost will now include the client-generated cid in X-Connection-ID even if the server hasn't acknowledged it yet. This is safe because the _connected flag is still gated on welcome receipt, but worth noting.

frontend/src/client-store.ts

Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.

  • 🟡 unsafe_assumptions (L31): sessionStorage.getItem() and setItem() can throw SecurityError in Safari private browsing, some Firefox configurations, and restrictive WebViews. Since getOrCreateConnectionId() is called at module scope (during SSE config construction), an exception here would break the entire client import. Wrap in try-catch with a fallback to an in-memory UUID. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.

  • 🔵 missing_tests: No test verifies behavior when the same message is re-queued and then the connection succeeds — specifically, that retries is correctly incremented through multiple flush cycles and that clientMsgId is preserved across retries (not duplicated in the queue). [fixable]

server/index.ts

Solid transport resilience feature with good test coverage. Two actionable issues: (1) 400 responses on send silently leave the UI stuck in 'thinking' state due to interaction with the new optimistic SET_RUNNING, and (2) sessionStorage access at module scope can throw in private browsing mode.

  • 🔵 unsafe_assumptions (L416): The cid validation regex /^cid-[0-9a-f]{8}-...$/i correctly matches UUID v4, but crypto.randomUUID() is spec'd to return lowercase. The i flag is fine for robustness, but note that if a non-UUID-shaped cid is rejected, the server silently falls back to a server-generated conn- id with no feedback to the client. The client's _connectionId will be its UUID, while the server uses a different id — the welcome echo will trigger the 'old server' fallback path in the client, which works but means the 'stable cid' feature silently degrades. A log.warn exists but the client has no way to know its cid was rejected.

},
body: JSON.stringify(body),
});
if (!res.ok && endpoint === 'send') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: When a send POST returns 400 (bad request), it falls through silently — no _send_failed emitted, no retry. Combined with the new optimistic SET_RUNNING(true) in store.ts:380, this leaves the UI stuck in 'thinking' state indefinitely. The comment references a 'send-timeout safety net' but no such timer exists for this code path. Either emit _send_failed with willRetry: false for all non-2xx send responses that don't already have handling, or treat 400 the same as 404 (permanent failure). [fixable]


/** Synthetic event emitted when a POST to the send endpoint fails. */
export interface SendFailedEvent {
type: '_send_failed';

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent.clientMsgId is typed as unknown but is always a string (generated by the store as user-${Date.now()}-...). Using string would improve type safety in the store handler at store.ts:722 where it's consumed without narrowing. [fixable]

if (!endpoint) return false;

if (this._connected && this._connectionId) {
if (this._connected) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 regressions: The send() guard changed from this._connected && this._connectionId to just this._connected. Previously, _connectionId being null was an extra safety check that send wouldn't fire before the welcome handshake. Now _connectionId is always set (from constructor), so the guard change is semantically correct — but doPost will now include the client-generated cid in X-Connection-ID even if the server hasn't acknowledged it yet. This is safe because the _connected flag is still gated on welcome receipt, but worth noting.

/** Stable connectionId per browser tab — survives SSE reconnects. */
function getOrCreateConnectionId(): string {
const KEY = 'mitzo:connectionId';
let cid = sessionStorage.getItem(KEY);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: sessionStorage.getItem() and setItem() can throw SecurityError in Safari private browsing, some Firefox configurations, and restrictive WebViews. Since getOrCreateConnectionId() is called at module scope (during SSE config construction), an exception here would break the entire client import. Wrap in try-catch with a fallback to an in-memory UUID. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: The cid validation regex /^cid-[0-9a-f]{8}-...$/i correctly matches UUID v4, but crypto.randomUUID() is spec'd to return lowercase. The i flag is fine for robustness, but note that if a non-UUID-shaped cid is rejected, the server silently falls back to a server-generated conn- id with no feedback to the client. The client's _connectionId will be its UUID, while the server uses a different id — the welcome echo will trigger the 'old server' fallback path in the client, which works but means the 'stable cid' feature silently degrades. A log.warn exists but the client has no way to know its cid was rejected.

@dimakis

dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 4 issue(s) (1 critical) (2 warning).

server/index.ts

The stable-cid and POST retry features are well-designed, but there is a critical race condition in the server's SSE close handler: when a new request replaces an old one with the same connectionId, the old request's async close handler removes the new response from the SSE registry, breaking event delivery and eventually triggering session detach.

  • 🔴 bugs (L452): Race condition on same-cid SSE reconnect: when a new request arrives with the same stable cid, chatSseRegistry.add() calls existing.end() on the old response, which asynchronously fires the old request's close handler. That handler unconditionally calls chatSseRegistry.remove(connectionId), which removes the NEW response from the registry (since they share the same key). This breaks event delivery — the server can no longer send SSE events to the reconnected client. Additionally, connRegistry.markInactive() starts a 60s TTL; since transport.isOpen() delegates to sseRegistry.isOpen() which checks the now-missing registry entry, the TTL fires and detaches sessions. Fix: guard the close handler with a check that the current stream in the registry is still res (this response) before removing, or add a removeIfCurrent(connectionId, res) method to SessionSseRegistry. [fixable]
  • 🟡 missing_tests (L407): No integration test covers the server-side SSE reconnect flow where a new request with the same stable cid arrives while the old request's close handler is pending. This is where the race condition in the previous finding lives. The existing tests cover ConnectionRegistry.reconnect() and client-side reconnection behavior, but not the interplay between chatSseRegistry.add() ending the old response and the old req.on('close') handler firing afterward. [fixable]

packages/client/src/sse-connection.ts

The stable-cid and POST retry features are well-designed, but there is a critical race condition in the server's SSE close handler: when a new request replaces an old one with the same connectionId, the old request's async close handler removes the new response from the SSE registry, breaking event delivery and eventually triggering session detach.

  • 🟡 unsafe_assumptions (L367): Failed send POSTs are queued in pendingSends for retry, but flushPendingSends() is only called after a reconnect (welcome event). If the SSE stream stays healthy while a POST fails with 5xx, the message sits in the queue indefinitely — the user sees "retrying on reconnect..." but no reconnect occurs. Consider adding a timer-based retry (e.g., exponential backoff within doPost) for cases where the SSE stream is alive but the POST endpoint is temporarily unavailable. [fixable]
  • 🔵 style (L138): _connectionId is now always initialized to a string in the constructor (never null), but getConnectionId() still returns string | null per the ChatConnection interface. The interface jsdoc "Server-assigned connection ID (null before welcome)" is also outdated — the ID is now client-generated at construction time. Consider updating the interface and jsdoc to reflect the new semantics. [fixable]

… timer

- Fix critical race: stale close handler removing new SSE stream. Added
  removeIfCurrent() to SessionSseRegistry — only removes if the response
  being closed is still the current entry for that connectionId
- Add timer-based retry flush (3s) for failed POSTs when SSE stays healthy,
  so messages don't sit in the queue waiting for a reconnect that never comes
- Update ChatConnection jsdoc — connectionId is now client-generated
- Clean up retry timer on disconnect

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 7 issue(s) (3 warning).

packages/client/src/store.ts

Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.

  • 🔵 style (L725): The sendError message says "retrying on reconnect..." but the actual retry mechanism (scheduleRetryFlush) fires after a 3s timer even without a reconnect. Consider "Message delivery delayed — retrying..." to avoid misleading the user when the SSE stream stays connected. [fixable]

server/index.ts

Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.

  • 🟡 bugs (L464): In the stable-cid close handler, registry.removeObserver(transport) removes the OLD transport from SessionRegistry observer lists immediately on stream close. If the same client had been added as an observer (e.g., via WS subscribe flow) and reconnects with a new transport, the new transport is never re-added as an observer — only the ConnectionRegistry watched-session list survives via reconnect(). This is likely a non-issue today since SSE connections use the driver/ConnectionRegistry event store delivery path rather than SessionRegistry observers, but it creates a subtle inconsistency if SSE is ever used as an observer. [fixable]
  • 🟡 bugs (L416): The cid regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates strict UUID v4 format, but crypto.randomUUID() can produce any UUID version (v4 today, but the spec doesn't guarantee it). If a future browser/runtime returns a different UUID version with uppercase or variant bits outside [0-9a-f], the regex would reject it. The /i flag handles case, but the format assumption is fragile. Consider a more permissive pattern like /^cid-[0-9a-f-]{36}$/i to allow any UUID format. [fixable]

packages/client/src/sse-connection.ts

Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.

  • 🟡 unsafe_assumptions (L373): When doPost re-queues a failed send for retry (line 373, 394), it pushes to pendingSends without checking MAX_PENDING_SENDS. The initial send() path (line 126) enforces this cap, but the retry path bypasses it. In practice the MAX_SEND_RETRIES=3 limit bounds this to a small number of re-queued items per message, but a burst of many concurrent send failures could temporarily exceed the 100-item cap. [fixable]
  • 🔵 style (L30): The connectionId JSDoc comment is 4 lines long — consider trimming to a single line per the project's "default to no comments, one short line max" convention. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.

  • 🔵 missing_tests: No test covers the case where the stable-cid SSE close handler fires AFTER a new SSE request with the same cid has connected (the race that removeIfCurrent is designed to prevent). This is the core scenario the stable-cid feature solves — a test confirming the old close handler does NOT remove the new stream would strengthen coverage. [fixable]

packages/harness/__tests__/connection-registry.test.ts

Solid transport resilience feature with correct core logic. Main concerns: the retry re-queue path bypasses the MAX_PENDING_SENDS cap, the cid regex is stricter than necessary, and the user-facing error message says "retrying on reconnect" when retries also happen on a timer without reconnect.

  • 🔵 missing_tests: No test verifies that the onExpired callback receives the correct watchedSessions after a partial unwatch. E.g., connection watches sessions A and B, unwatches B, then TTL expires — callback should get only {A}. The current tests only verify the happy path with sessions still watched at expiry time. [fixable]

if (msg.type === '_send_failed') {
const willRetry = msg.willRetry !== false;
store.setState((s) => ({
sendError: willRetry

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The sendError message says "retrying on reconnect..." but the actual retry mechanism (scheduleRetryFlush) fires after a 3s timer even without a reconnect. Consider "Message delivery delayed — retrying..." to avoid misleading the user when the SSE stream stays connected. [fixable]

Comment thread server/index.ts
// connectionId — markInactive starts a 60s TTL. If the TTL expires,
// the onExpired callback triggers session detach.
connRegistry.markInactive(connectionId);
registry.removeObserver(transport);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: In the stable-cid close handler, registry.removeObserver(transport) removes the OLD transport from SessionRegistry observer lists immediately on stream close. If the same client had been added as an observer (e.g., via WS subscribe flow) and reconnects with a new transport, the new transport is never re-added as an observer — only the ConnectionRegistry watched-session list survives via reconnect(). This is likely a non-issue today since SSE connections use the driver/ConnectionRegistry event store delivery path rather than SessionRegistry observers, but it creates a subtle inconsistency if SSE is ever used as an observer. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: The cid regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates strict UUID v4 format, but crypto.randomUUID() can produce any UUID version (v4 today, but the spec doesn't guarantee it). If a future browser/runtime returns a different UUID version with uppercase or variant bits outside [0-9a-f], the regex would reject it. The /i flag handles case, but the format assumption is fragile. Consider a more permissive pattern like /^cid-[0-9a-f-]{36}$/i to allow any UUID format. [fixable]

// Transient failure — retry up to MAX_SEND_RETRIES
const willRetry = retries < MAX_SEND_RETRIES;
if (willRetry) {
this.pendingSends.push({ endpoint, body, retries: retries + 1 });

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: When doPost re-queues a failed send for retry (line 373, 394), it pushes to pendingSends without checking MAX_PENDING_SENDS. The initial send() path (line 126) enforces this cap, but the retry path bypasses it. In practice the MAX_SEND_RETRIES=3 limit bounds this to a small number of re-queued items per message, but a burst of many concurrent send failures could temporarily exceed the 100-item cap. [fixable]

/** URL for the sendBeacon suspend fallback. */
suspendUrl?: string;
/**
* Client-generated stable connectionId. Survives SSE reconnects so the

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The connectionId JSDoc comment is 4 lines long — consider trimming to a single line per the project's "default to no comments, one short line max" convention. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 5 issue(s) (1 warning).

packages/client/src/store.ts

Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.

  • 🟡 bugs (L725): sendError is never cleared after a successful retry. When doPost succeeds on a retried send (via scheduleRetryFlush), no event is emitted — sendError remains set to 'Message delivery delayed' even as the server responds and the agent's streaming reply appears. The user sees a contradictory state: an error banner alongside an active response. sendError is only cleared on _open (reconnect) or USER_SEND (next message), but if the SSE stream stays healthy and the user is waiting for a reply, neither happens. Consider emitting a _send_succeeded synthetic event on a successful retried POST, or clearing sendError on MESSAGE_START. [fixable]
  • 🔵 style (L726): The sendError message 'Message delivery delayed — retrying on reconnect...' is inaccurate. Since the scheduleRetryFlush mechanism was added, retries happen proactively after 3s (RETRY_FLUSH_DELAY_MS) while the SSE stream stays connected — no reconnect is needed. Consider 'Message delivery delayed — retrying shortly...' to match the actual behavior. [fixable]

server/index.ts

Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.

  • 🔵 unsafe_assumptions (L463): When a stable-cid reconnects, the old close handler fires asynchronously AFTER connRegistry.reconnect() has already swapped the transport. It calls markInactive(connectionId) which starts a 60s TTL timer on the already-reconnected connection. The timer is harmless (the isOpen() guard at connection-registry.ts:117 sees the new transport is open and bails), but it creates unnecessary timer churn. Consider guarding markInactive with a transport identity check: only call it if the transport in the connection registry is still the old one. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Well-designed feature with solid test coverage. The critical issues from earlier reviews (SSE registry race, ownership check, observer leak) are all fixed via removeIfCurrent and getOwnerConnection. The main remaining gap is that sendError is never cleared after a successful retry via scheduleRetryFlush, leaving the user with a stale error banner while the agent responds.

  • 🔵 missing_tests (L1050): The 'drops message after max retries' test comment says '3 reconnect cycles — each flush retries' but the retries actually exhaust via scheduleRetryFlush during the initial vi.runAllTimersAsync() call (chained 3s timers). The reconnect cycles in the for-loop are no-ops for this test case because pendingSends is already empty by then. The test passes correctly, but the comments are misleading about the retry mechanism. [fixable]
  • 🔵 missing_tests (L1164): The 'retries failed send after delay even without reconnect' test verifies the retry fires but doesn't check that _send_failed is NOT emitted on the successful retry (i.e., no spurious willRetry:false event). Adding an assertion like expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ willRetry: false })) would strengthen coverage. [fixable]

if (msg.type === '_send_failed') {
const willRetry = msg.willRetry !== false;
store.setState((s) => ({
sendError: willRetry

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: sendError is never cleared after a successful retry. When doPost succeeds on a retried send (via scheduleRetryFlush), no event is emitted — sendError remains set to 'Message delivery delayed' even as the server responds and the agent's streaming reply appears. The user sees a contradictory state: an error banner alongside an active response. sendError is only cleared on _open (reconnect) or USER_SEND (next message), but if the SSE stream stays healthy and the user is waiting for a reply, neither happens. Consider emitting a _send_succeeded synthetic event on a successful retried POST, or clearing sendError on MESSAGE_START. [fixable]

Comment thread packages/client/src/store.ts Outdated
const willRetry = msg.willRetry !== false;
store.setState((s) => ({
sendError: willRetry
? 'Message delivery delayed — retrying on reconnect...'

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The sendError message 'Message delivery delayed — retrying on reconnect...' is inaccurate. Since the scheduleRetryFlush mechanism was added, retries happen proactively after 3s (RETRY_FLUSH_DELAY_MS) while the SSE stream stays connected — no reconnect is needed. Consider 'Message delivery delayed — retrying shortly...' to match the actual behavior. [fixable]

Comment thread server/index.ts
log.info('SSE closed for stable-cid connection, deferring cleanup', { connectionId });
} else {
// Legacy server-generated connectionId: immediate cleanup
const conn = connRegistry.get(connectionId);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: When a stable-cid reconnects, the old close handler fires asynchronously AFTER connRegistry.reconnect() has already swapped the transport. It calls markInactive(connectionId) which starts a 60s TTL timer on the already-reconnected connection. The timer is harmless (the isOpen() guard at connection-registry.ts:117 sees the new transport is open and bails), but it creates unnecessary timer churn. Consider guarding markInactive with a transport identity check: only call it if the transport in the connection registry is still the old one. [fixable]

expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ type: '_send_failed' }));
});

it('drops message after max retries and emits willRetry: false', async () => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The 'drops message after max retries' test comment says '3 reconnect cycles — each flush retries' but the retries actually exhaust via scheduleRetryFlush during the initial vi.runAllTimersAsync() call (chained 3s timers). The reconnect cycles in the for-loop are no-ops for this test case because pendingSends is already empty by then. The test passes correctly, but the comments are misleading about the retry mechanism. [fixable]

);
});
});
});

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The 'retries failed send after delay even without reconnect' test verifies the retry fires but doesn't check that _send_failed is NOT emitted on the successful retry (i.e., no spurious willRetry:false event). Adding an assertion like expect(listener).not.toHaveBeenCalledWith(expect.objectContaining({ willRetry: false })) would strengthen coverage. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (3 warning).

server/index.ts

Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.

  • 🟡 bugs (L459): When a stable-CID SSE reconnects quickly (new request arrives before old close fires), the old close handler calls connRegistry.markInactive(connectionId). But reconnect() already ran and cleared the inactive timer. The markInactive call here starts a new 60s TTL timer on the freshly-reconnected connection. If the new SSE stream stays healthy for 60s, the TTL callback checks !transport.isOpen() — which is true (new transport), so it's a no-op. But if the new transport closes within 60s and another reconnect restores it, the stale timer from the first close handler could fire at the wrong time. The fix: check whether the closing res was actually removed by removeIfCurrent (it returns void — no way to tell) before calling markInactive. [fixable]
  • 🟡 unsafe_assumptions (L416): The CID format validation regex on the SSE endpoint (/^cid-[0-9a-f]{8}-...$/i) is not applied to the X-Connection-ID header on POST endpoints (extracted in chat-rest-handler.ts via getConnectionId()). A client could POST with a malformed header value. Since the clientId composite key is ${connectionId}:${sessionId}, this is low-risk (the POST just won't match any SSE connection), but the asymmetric validation is surprising. [fixable]

server/session-sse-registry.ts

Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.

  • 🟡 missing_tests (L52): removeIfCurrent() is a critical guard that prevents stale close handlers from removing the active SSE stream. It has no unit tests. The correctness of the entire stable-CID reconnect flow depends on this method working correctly — a regression here would silently break event delivery. [fixable]

packages/client/src/sse-connection.ts

Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.

  • 🔵 style (L43): SendFailedEvent is exported but never imported anywhere — the store handler uses untyped msg.type === '_send_failed' checks. Either import and use the type in the store listener, or don't export it to avoid suggesting a type-safe contract that isn't enforced. [fixable]
  • 🔵 bugs (L373): When doPost fails for a send endpoint with 429/5xx, the message is pushed back to pendingSends with retries + 1. But pendingSends has no deduplication — if the same message fails during flushPendingSends and during a concurrent scheduleRetryFlush timer, it could be queued twice. This is mitigated by flushPendingSends swapping the array (line 412-413), but the doPost→push→scheduleRetryFlushflushPendingSendsdoPost cycle means a message could accumulate duplicate entries if two retry paths converge. The MAX_SEND_RETRIES cap limits the blast radius. [fixable]

packages/client/src/store.ts

Well-structured transport resilience feature with correct race-condition guards. Key concern: the stale close handler can call markInactive on a freshly-reconnected connection, and removeIfCurrent — the linchpin of the reconnect safety model — lacks unit tests.

  • 🔵 missing_tests (L708): The store's handling of _open (clearing sendError) and _send_failed (setting sendError, conditionally resetting running state) are new behaviors with no store-level tests. The SSE connection tests cover emission of these events, but the store reducer integration — especially the willRetry: false path that resets SET_RUNNING — is untested. [fixable]

Comment thread server/index.ts
// response that replaced it, breaking event delivery.
chatSseRegistry.removeIfCurrent(connectionId, res);

if (isStableCid) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: When a stable-CID SSE reconnects quickly (new request arrives before old close fires), the old close handler calls connRegistry.markInactive(connectionId). But reconnect() already ran and cleared the inactive timer. The markInactive call here starts a new 60s TTL timer on the freshly-reconnected connection. If the new SSE stream stays healthy for 60s, the TTL callback checks !transport.isOpen() — which is true (new transport), so it's a no-op. But if the new transport closes within 60s and another reconnect restores it, the stale timer from the first close handler could fire at the wrong time. The fix: check whether the closing res was actually removed by removeIfCurrent (it returns void — no way to tell) before calling markInactive. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The CID format validation regex on the SSE endpoint (/^cid-[0-9a-f]{8}-...$/i) is not applied to the X-Connection-ID header on POST endpoints (extracted in chat-rest-handler.ts via getConnectionId()). A client could POST with a malformed header value. Since the clientId composite key is ${connectionId}:${sessionId}, this is low-risk (the POST just won't match any SSE connection), but the asymmetric validation is surprising. [fixable]

Comment thread server/session-sse-registry.ts Outdated
* the given response. Prevents a stale close handler from removing a
* newer replacement stream after chatSseRegistry.add() swapped it.
*/
removeIfCurrent(connectionId: string, res: Response): void {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 missing_tests: removeIfCurrent() is a critical guard that prevents stale close handlers from removing the active SSE stream. It has no unit tests. The correctness of the entire stable-CID reconnect flow depends on this method working correctly — a regression here would silently break event delivery. [fixable]

const RETRY_FLUSH_DELAY_MS = 3000;

/** Synthetic event emitted when a POST to the send endpoint fails. */
export interface SendFailedEvent {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent is exported but never imported anywhere — the store handler uses untyped msg.type === '_send_failed' checks. Either import and use the type in the store listener, or don't export it to avoid suggesting a type-safe contract that isn't enforced. [fixable]

// Transient failure — retry up to MAX_SEND_RETRIES
const willRetry = retries < MAX_SEND_RETRIES;
if (willRetry) {
this.pendingSends.push({ endpoint, body, retries: retries + 1 });

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 bugs: When doPost fails for a send endpoint with 429/5xx, the message is pushed back to pendingSends with retries + 1. But pendingSends has no deduplication — if the same message fails during flushPendingSends and during a concurrent scheduleRetryFlush timer, it could be queued twice. This is mitigated by flushPendingSends swapping the array (line 412-413), but the doPost→push→scheduleRetryFlushflushPendingSendsdoPost cycle means a message could accumulate duplicate entries if two retry paths converge. The MAX_SEND_RETRIES cap limits the blast radius. [fixable]

// Foreground recovery: when the page becomes visible again (iOS may have
// evicted it from memory, losing in-memory state), re-fetch messages from
// the REST API if we have an active session but no messages in the store.
if (msg.type === '_open') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The store's handling of _open (clearing sendError) and _send_failed (setting sendError, conditionally resetting running state) are new behaviors with no store-level tests. The SSE connection tests cover emission of these events, but the store reducer integration — especially the willRetry: false path that resets SET_RUNNING — is untested. [fixable]

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (3 warning).

server/index.ts

Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.

  • 🟡 bugs (L463): Stale close handler creates phantom inactive timer after rapid reconnect. When a new SSE request arrives with the same stable cid, chatSseRegistry.add() closes the old response. connRegistry.reconnect() runs and clears any existing inactive timer. But when the old response's close handler fires asynchronously AFTER reconnect(), it calls markInactive(connectionId) — starting a new 60s timer. The timer is a no-op (transport.isOpen() check saves it), but it's unnecessary work and produces a misleading 'inactive connection TTL expired' log if the connection closes again within the grace window. Consider checking if the transport is already open inside markInactive(), or having the close handler skip markInactive() when connRegistry.get(connectionId)?.transport is a different (newer) object than the closing one. [fixable]
  • 🔵 missing_tests (L416): The server-side cid format validation regex (/^cid-[0-9a-f]{8}-...$/i) and the fallback-to-server-generated-id logic have no test coverage. An integration or unit test confirming that invalid cid values are rejected (fallback to conn-*) and valid ones are accepted would prevent regression if the format evolves. [fixable]

packages/client/src/store.ts

Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.

  • 🟡 unsafe_assumptions (L723): willRetry defaults to true via msg.willRetry !== false, meaning any truthy or undefined value is treated as retryable. If a future change emits _send_failed without willRetry (e.g., from a different code path), the store will show 'retrying...' and keep the running spinner indefinitely. Prefer explicit msg.willRetry === true to fail-safe toward the non-retrying case. [fixable]

server/session-sse-registry.ts

Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.

  • 🟡 missing_tests (L52): removeIfCurrent() is a new method critical to preventing the stale-close-handler race, but has no unit test. It's only exercised implicitly through the SSE endpoint integration path. A dedicated test verifying it no-ops when res doesn't match (and removes when it does) would guard this invariant. [fixable]

packages/client/src/sse-connection.ts

Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.

  • 🔵 style (L41): SendFailedEvent is exported but never imported or used in the type system — the listener is called with untyped object literals matching the same shape. Either use the type in doPost (e.g., const evt: SendFailedEvent = { ... }; this.listener?.(evt)) or remove the export to avoid implying a typed contract that isn't enforced. [fixable]

packages/client/src/chat-connection.ts

Solid transport resilience feature with well-designed race condition guards (removeIfCurrent, staleness checks in doReconnectPost, atomic queue swap). Main concern is the phantom markInactive timer on rapid reconnect — harmless but noisy. Test coverage for the client-side changes is thorough; server-side additions (removeIfCurrent, cid validation) lack dedicated tests.

  • 🔵 style (L20): The getConnectionId() return type is still string | null, but after the constructor change, _connectionId is always initialized to a string — it can never be null. The docstring was updated to say 'Available immediately after construction' but the type doesn't reflect this. Consider updating the interface return type to string (or document why null is retained for other implementations). [fixable]

Comment thread server/index.ts Outdated
// Stable cid: defer cleanup. The client will reconnect with the same
// connectionId — markInactive starts a 60s TTL. If the TTL expires,
// the onExpired callback triggers session detach.
connRegistry.markInactive(connectionId);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: Stale close handler creates phantom inactive timer after rapid reconnect. When a new SSE request arrives with the same stable cid, chatSseRegistry.add() closes the old response. connRegistry.reconnect() runs and clears any existing inactive timer. But when the old response's close handler fires asynchronously AFTER reconnect(), it calls markInactive(connectionId) — starting a new 60s timer. The timer is a no-op (transport.isOpen() check saves it), but it's unnecessary work and produces a misleading 'inactive connection TTL expired' log if the connection closes again within the grace window. Consider checking if the transport is already open inside markInactive(), or having the close handler skip markInactive() when connRegistry.get(connectionId)?.transport is a different (newer) object than the closing one. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The server-side cid format validation regex (/^cid-[0-9a-f]{8}-...$/i) and the fallback-to-server-generated-id logic have no test coverage. An integration or unit test confirming that invalid cid values are rejected (fallback to conn-*) and valid ones are accepted would prevent regression if the format evolves. [fixable]

}

if (msg.type === '_send_failed') {
const willRetry = msg.willRetry !== false;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: willRetry defaults to true via msg.willRetry !== false, meaning any truthy or undefined value is treated as retryable. If a future change emits _send_failed without willRetry (e.g., from a different code path), the store will show 'retrying...' and keep the running spinner indefinitely. Prefer explicit msg.willRetry === true to fail-safe toward the non-retrying case. [fixable]

Comment thread server/session-sse-registry.ts Outdated
* the given response. Prevents a stale close handler from removing a
* newer replacement stream after chatSseRegistry.add() swapped it.
*/
removeIfCurrent(connectionId: string, res: Response): void {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 missing_tests: removeIfCurrent() is a new method critical to preventing the stale-close-handler race, but has no unit test. It's only exercised implicitly through the SSE endpoint integration path. A dedicated test verifying it no-ops when res doesn't match (and removes when it does) would guard this invariant. [fixable]

const MAX_PENDING_SENDS = 100;
const MAX_SEND_RETRIES = 3;
const RETRY_FLUSH_DELAY_MS = 3000;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent is exported but never imported or used in the type system — the listener is called with untyped object literals matching the same shape. Either use the type in doPost (e.g., const evt: SendFailedEvent = { ... }; this.listener?.(evt)) or remove the export to avoid implying a typed contract that isn't enforced. [fixable]

/** Whether the transport is connected and ready to send. */
isConnected(): boolean;
/** Server-assigned connection ID (null before welcome). */
/** Client-generated stable connection ID. Available immediately after construction. */

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The getConnectionId() return type is still string | null, but after the constructor change, _connectionId is always initialized to a string — it can never be null. The docstring was updated to say 'Available immediately after construction' but the type doesn't reflect this. Consider updating the interface return type to string (or document why null is retained for other implementations). [fixable]

- Make removeIfCurrent() return boolean so close handler knows if stream
  was actually removed vs already replaced by reconnect
- Only call markInactive() when the closing stream was the current one,
  preventing stale close handlers from starting TTL on fresh connections
- Add SessionSseRegistry unit tests for removeIfCurrent()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 7 issue(s) (3 warning).

packages/client/src/store.ts

_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.

  • 🟡 bugs (L380): Optimistic SET_RUNNING: true is set when connection.send() succeeds, but if the POST fails with a transient error (_send_failed with willRetry: true), running stays true and is never reset. The _send_failed handler at line 722 only resets running to false when willRetry === false. If the retry eventually succeeds, the server's message_start / session_end events will naturally manage the state. But if all retries fail and the message is dropped (final willRetry: false fires), the running state was stuck at true for up to 4 retry cycles (~9+ seconds). During this time the user sees "thinking" with an error banner — minor UX confusion but not a data loss bug. [fixable]
  • 🟡 missing_tests (L722): The store-level _send_failed and _open handlers (lines 708-733) have no unit tests. The @mitzo/client package only has sse-connection.test.ts — there's no store test file. The transport-level tests verify that _send_failed events are emitted, but the store's state transitions (sendError set/cleared, SET_RUNNING reset on permanent failure) are untested. These are the user-facing behaviors that matter most. [fixable]
  • 🔵 style (L726): The error message 'Message delivery delayed — retrying on reconnect...' is misleading: the retry now happens after a 3-second timer via scheduleRetryFlush() even without a reconnect. Consider 'Message delivery delayed — retrying...' to match the actual behavior. [fixable]

packages/client/src/sse-connection.ts

_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.

  • 🟡 bugs (L368): When a send POST returns a 4xx status other than 404 or 429 (e.g., 400 Bad Request, 401, 403), the error is completely silent — no _send_failed event, no retry, no logging. The optimistic SET_RUNNING: true in the store will remain stuck until a session_end event arrives (which it won't, since the server never started processing). This makes 400/401/403 errors invisible to the user. Consider emitting _send_failed with willRetry: false for all non-OK send responses that aren't handled, or at least for 401/403 which could indicate auth expiry. [fixable]
  • 🔵 style (L43): SendFailedEvent is exported but never imported by the store or any consumer — the store uses Record<string, unknown> and checks msg.type === '_send_failed' with string literals. Either the export is dead (remove it) or the store should import and use it for type safety. [fixable]

server/index.ts

_Well-structured transport resilience feature with solid race-condition guards (removeIfCurrent, markInactive/reconnect interplay). Main concern is the silent swallowing of 4xx send errors leaving the UI stuck in 'thinking' state, and the lack of store-level tests for the new send_failed flow.

  • 🔵 unsafe_assumptions (L416): The stable CID regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates format but not origin — any client can claim any CID. A malicious or buggy tab could send another tab's sessionStorage CID and hijack its connection state (watched sessions, delivery cursors). Since the server has no auth binding between the CID and the authenticated user, this is limited to same-user scope (both tabs share the same auth cookie), so it's low-severity. But it could cause confusing behavior if two tabs accidentally generate the same CID (practically impossible with UUID, but the format constraint itself is the only protection).
  • 🔵 missing_tests (L113): The setOnExpired callback (lines 113-140) contains meaningful logic: ownership check via getOwnerConnection, suspended-session skip, and detach. This is only tested indirectly through ConnectionRegistry unit tests (which test the timer/callback mechanism) but the callback body itself — the detach-on-expiry logic with its ownership guard — has no integration test. A test verifying that expiry of a stable-CID connection correctly detaches (or skips detach when ownership changed) would catch regressions in this critical path. [fixable]

// Optimistic: show "thinking" immediately. The server will confirm
// via events, or the send-timeout safety net will reset on failure.
set((s) => ({
messages: messagesReducer(s.messages, { type: 'SET_RUNNING', running: true }),

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: Optimistic SET_RUNNING: true is set when connection.send() succeeds, but if the POST fails with a transient error (_send_failed with willRetry: true), running stays true and is never reset. The _send_failed handler at line 722 only resets running to false when willRetry === false. If the retry eventually succeeds, the server's message_start / session_end events will naturally manage the state. But if all retries fail and the message is dropped (final willRetry: false fires), the running state was stuck at true for up to 4 retry cycles (~9+ seconds). During this time the user sees "thinking" with an error banner — minor UX confusion but not a data loss bug. [fixable]

return;
}

if (msg.type === '_send_failed') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 missing_tests: The store-level _send_failed and _open handlers (lines 708-733) have no unit tests. The @mitzo/client package only has sse-connection.test.ts — there's no store test file. The transport-level tests verify that _send_failed events are emitted, but the store's state transitions (sendError set/cleared, SET_RUNNING reset on permanent failure) are untested. These are the user-facing behaviors that matter most. [fixable]

Comment thread packages/client/src/store.ts Outdated
const willRetry = msg.willRetry !== false;
store.setState((s) => ({
sendError: willRetry
? 'Message delivery delayed — retrying on reconnect...'

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The error message 'Message delivery delayed — retrying on reconnect...' is misleading: the retry now happens after a 3-second timer via scheduleRetryFlush() even without a reconnect. Consider 'Message delivery delayed — retrying...' to match the actual behavior. [fixable]

},
body: JSON.stringify(body),
});
if (!res.ok && endpoint === 'send') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: When a send POST returns a 4xx status other than 404 or 429 (e.g., 400 Bad Request, 401, 403), the error is completely silent — no _send_failed event, no retry, no logging. The optimistic SET_RUNNING: true in the store will remain stuck until a session_end event arrives (which it won't, since the server never started processing). This makes 400/401/403 errors invisible to the user. Consider emitting _send_failed with willRetry: false for all non-OK send responses that aren't handled, or at least for 401/403 which could indicate auth expiry. [fixable]

const RETRY_FLUSH_DELAY_MS = 3000;

/** Synthetic event emitted when a POST to the send endpoint fails. */
export interface SendFailedEvent {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent is exported but never imported by the store or any consumer — the store uses Record<string, unknown> and checks msg.type === '_send_failed' with string literals. Either the export is dead (remove it) or the store should import and use it for type safety. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: The stable CID regex /^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i validates format but not origin — any client can claim any CID. A malicious or buggy tab could send another tab's sessionStorage CID and hijack its connection state (watched sessions, delivery cursors). Since the server has no auth binding between the CID and the authenticated user, this is limited to same-user scope (both tabs share the same auth cookie), so it's low-severity. But it could cause confusing behavior if two tabs accidentally generate the same CID (practically impossible with UUID, but the format constraint itself is the only protection).

Comment thread server/index.ts
// When a stable-cid connection's inactive TTL expires (no reconnect within
// 60s), detach any sessions it was driving — same logic as the SSE close
// handler for legacy connections.
connRegistry.setOnExpired((connectionId, watchedSessions) => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The setOnExpired callback (lines 113-140) contains meaningful logic: ownership check via getOwnerConnection, suspended-session skip, and detach. This is only tested indirectly through ConnectionRegistry unit tests (which test the timer/callback mechanism) but the callback body itself — the detach-on-expiry logic with its ownership guard — has no integration test. A test verifying that expiry of a stable-CID connection correctly detaches (or skips detach when ownership changed) would catch regressions in this critical path. [fixable]

- Emit _send_failed for all non-OK send responses, not just 404. Covers
  400/401/403 which previously left the UI stuck in 'thinking' state
- Fix error message: "retrying..." instead of "retrying on reconnect..."
  since scheduleRetryFlush now retries without waiting for reconnect

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 6 issue(s) (2 warning).

packages/client/src/store.ts

Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.

  • 🟡 bugs (L725): When a retried POST succeeds (transient 500 followed by OK on retry), the 'Message delivery delayed — retrying...' sendError is never cleared. No _send_succeeded event is emitted from doPost on success, and _open only fires on SSE reconnect. The stale error persists until the next sendMessage() call (which sets sendError: null at the top). Consider emitting a success synthetic event from doPost when a retried send succeeds (retries > 0 && res.ok), or clearing sendError on any incoming message_start event. [fixable]

packages/client/src/sse-connection.ts

Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.

  • 🟡 bugs (L424): scheduleRetryFlush() exits early if !this._connected (line 426), but doPost can push to pendingSends during a disconnected state (from flushPendingSends during reconnect when the flush itself fails). If SSE is connected but the POST fails, a retry is correctly scheduled. But if the POST failure triggers during flushPendingSends on reconnect and the connection drops again before the flush timer fires, the timer is never rescheduled because _connected goes false. This is benign — the next reconnect will call flushPendingSends again — but worth noting that scheduleRetryFlush is a no-op in this path.
  • 🔵 style (L43): SendFailedEvent is exported from sse-connection.ts but never imported anywhere. The store listener uses untyped Record<string, unknown> to access msg.willRetry. If SendFailedEvent is meant to be a public contract, it should be consumed in the store's listener type; otherwise the export is dead code. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.

  • 🔵 missing_tests: No test covers the case where a retried POST eventually succeeds. The existing retry test ('re-queues failed send for retry on reconnect') verifies the second call happens, but doesn't verify the message was actually processed (no listener assertion). Consider adding a test that the _send_failed event is NOT emitted on a successful retry. [fixable]

server/__tests__/session-sse-registry.test.ts

Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.

  • 🔵 missing_tests: The removeIfCurrent tests are good, but there's no test for the add() method's behavior of closing the existing stream when a reconnect happens with the same connectionId. This is the foundation of the reconnect safety — verifying that existing.end() is called would strengthen coverage. [fixable]

server/index.ts

Well-structured feature with solid race condition handling (removeIfCurrent, markInactive guards, ownership checks). Main concern: stale 'retrying...' sendError when a retry succeeds without SSE reconnect — minor UX polish issue, not a correctness bug.

  • 🔵 unsafe_assumptions (L416): The cid format regex /^cid-[0-9a-f]{8}-...$/i correctly validates the UUID shape but accepts uppercase hex via the i flag. crypto.randomUUID() always returns lowercase. The i flag is fine for security but means two cids that differ only in case (e.g., cid-ABC... vs cid-abc...) would both pass validation but be treated as different connectionIds in the registry Map. Not a real-world issue since clients always use crypto.randomUUID(), but worth a comment if intentional.

if (msg.type === '_send_failed') {
const willRetry = msg.willRetry !== false;
store.setState((s) => ({
sendError: willRetry

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: When a retried POST succeeds (transient 500 followed by OK on retry), the 'Message delivery delayed — retrying...' sendError is never cleared. No _send_succeeded event is emitted from doPost on success, and _open only fires on SSE reconnect. The stale error persists until the next sendMessage() call (which sets sendError: null at the top). Consider emitting a success synthetic event from doPost when a retried send succeeds (retries > 0 && res.ok), or clearing sendError on any incoming message_start event. [fixable]

* stream is still connected. Without this, failed POSTs would sit in the
* queue until a reconnect event — which may never come if SSE stays healthy.
*/
private scheduleRetryFlush(): void {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: scheduleRetryFlush() exits early if !this._connected (line 426), but doPost can push to pendingSends during a disconnected state (from flushPendingSends during reconnect when the flush itself fails). If SSE is connected but the POST fails, a retry is correctly scheduled. But if the POST failure triggers during flushPendingSends on reconnect and the connection drops again before the flush timer fires, the timer is never rescheduled because _connected goes false. This is benign — the next reconnect will call flushPendingSends again — but worth noting that scheduleRetryFlush is a no-op in this path.

const RETRY_FLUSH_DELAY_MS = 3000;

/** Synthetic event emitted when a POST to the send endpoint fails. */
export interface SendFailedEvent {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent is exported from sse-connection.ts but never imported anywhere. The store listener uses untyped Record<string, unknown> to access msg.willRetry. If SendFailedEvent is meant to be a public contract, it should be consumed in the store's listener type; otherwise the export is dead code. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: The cid format regex /^cid-[0-9a-f]{8}-...$/i correctly validates the UUID shape but accepts uppercase hex via the i flag. crypto.randomUUID() always returns lowercase. The i flag is fine for security but means two cids that differ only in case (e.g., cid-ABC... vs cid-abc...) would both pass validation but be treated as different connectionIds in the registry Map. Not a real-world issue since clients always use crypto.randomUUID(), but worth a comment if intentional.

dimakis and others added 2 commits June 26, 2026 16:53
- Clear sendError on any incoming server event, confirming message delivery
  even when retry succeeded without SSE reconnect
- Add SessionSseRegistry.add() stream-close test for reconnect safety

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use proper Response intersection type instead of `as never` to satisfy
tsc --noEmit in CI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 7 issue(s) (3 warning).

packages/client/src/store.ts

Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.

  • 🟡 bugs (L771): sendError is cleared on ANY server event that passes the session filter — including global events with no sessionId (task_state, loop_status, inbox_updated). A background task board update could clear a legitimate 'Message delivery delayed — retrying...' banner while the send retry is still pending. Consider only clearing sendError on session-scoped events (where eventSessionId matches the active session), or on message_start/session_id events that confirm the send was actually delivered. [fixable]
  • 🔵 missing_tests (L708): The store-level handling of _open, _send_failed, and the sendError clearing on server events (lines 708-773) has no unit tests. The SSE connection tests verify the transport emits these events, but the store reducer/listener integration — especially the optimistic SET_RUNNING on send success (line 380) and its interaction with _send_failed — is untested. [fixable]

server/index.ts

Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.

  • 🟡 unsafe_assumptions (L416): The cid format regex requires exactly the cid- prefix followed by a UUID v4. But the client generates cid-${crypto.randomUUID()} — if a browser's crypto.randomUUID() ever returns uppercase hex or a non-v4 format, it would silently fall back to a server-generated connectionId, defeating the stable-cid feature without any client-side indication. The regex is case-insensitive (good), but the client should ideally validate or normalize as well. This is minor since all current browser implementations return lowercase v4 UUIDs. [fixable]
  • 🔵 unsafe_assumptions (L129): In the onExpired callback, registry.isAttached() is checked but there's no transport identity guard (unlike the legacy close handler at line 491 which checks session.transport === transport). This is intentionally different since the transport was already swapped/removed, and ownership is verified via getOwnerConnection. But if a reconnect race causes a new connection to take ownership after getOwnerConnection runs but before detachChat executes, the session could be detached from under the new owner. The window is small and would require exact timing, but worth noting.

packages/client/src/sse-connection.ts

Well-structured feature with thorough transport-layer tests. Main concern is sendError being cleared by unrelated global events (task_state, etc.) which could hide retry-in-progress state from the user. Store-level integration tests for the new synthetic events would strengthen confidence.

  • 🟡 bugs (L424): scheduleRetryFlush() early-returns if !this._connected, but the retry timer that queued the message fires via doPost failure while the connection IS still open (SSE stream healthy, just the POST failed). When doPost pushes to pendingSends and calls scheduleRetryFlush(), the guard passes. But if the SSE stream drops between the push and the timer firing, flushPendingSends() will call doPost() while disconnected — doPost doesn't guard against !this._connected, so it will attempt a fetch that likely fails, re-queue, and schedule another flush, creating a retry loop until max retries exhausts. Not critical since MAX_SEND_RETRIES caps it at 3, but the retry counting may drift (each failed flush counts as a retry even though the connection is down). [fixable]
  • 🔵 style (L40): The SendFailedEvent interface is exported but the _send_failed type string is used as a magic literal across sse-connection.ts and store.ts. Consider reusing the interface type or at least a shared constant for the event type to prevent typo-induced silent failures. [fixable]
  • 🔵 style (L72): The 4-line comment on the connectionId assignment explains the motivation well for a code review, but this is the kind of 'why this PR exists' context that belongs in the commit message rather than permanently in the source. The field name and config docs already communicate the intent. [fixable]

Comment thread packages/client/src/store.ts Outdated
// case where a retried POST succeeded: the server starts processing and
// sends events, confirming delivery. Without this, the "retrying..."
// banner would persist until the next sendMessage() call.
if (store.getState().sendError) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: sendError is cleared on ANY server event that passes the session filter — including global events with no sessionId (task_state, loop_status, inbox_updated). A background task board update could clear a legitimate 'Message delivery delayed — retrying...' banner while the send retry is still pending. Consider only clearing sendError on session-scoped events (where eventSessionId matches the active session), or on message_start/session_id events that confirm the send was actually delivered. [fixable]

// Foreground recovery: when the page becomes visible again (iOS may have
// evicted it from memory, losing in-memory state), re-fetch messages from
// the REST API if we have an active session but no messages in the store.
if (msg.type === '_open') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The store-level handling of _open, _send_failed, and the sendError clearing on server events (lines 708-773) has no unit tests. The SSE connection tests verify the transport emits these events, but the store reducer/listener integration — especially the optimistic SET_RUNNING on send success (line 380) and its interaction with _send_failed — is untested. [fixable]

Comment thread server/index.ts
const clientCid = req.query.cid as string | undefined;
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The cid format regex requires exactly the cid- prefix followed by a UUID v4. But the client generates cid-${crypto.randomUUID()} — if a browser's crypto.randomUUID() ever returns uppercase hex or a non-v4 format, it would silently fall back to a server-generated connectionId, defeating the stable-cid feature without any client-side indication. The regex is case-insensitive (good), but the client should ideally validate or normalize as well. This is minor since all current browser implementations return lowercase v4 UUIDs. [fixable]

Comment thread server/index.ts
}
if (registry.isSuspended(found.clientId)) continue;
const session = registry.get(found.clientId);
if (session && registry.isAttached(found.clientId)) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: In the onExpired callback, registry.isAttached() is checked but there's no transport identity guard (unlike the legacy close handler at line 491 which checks session.transport === transport). This is intentionally different since the transport was already swapped/removed, and ownership is verified via getOwnerConnection. But if a reconnect race causes a new connection to take ownership after getOwnerConnection runs but before detachChat executes, the session could be detached from under the new owner. The window is small and would require exact timing, but worth noting.

* stream is still connected. Without this, failed POSTs would sit in the
* queue until a reconnect event — which may never come if SSE stays healthy.
*/
private scheduleRetryFlush(): void {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: scheduleRetryFlush() early-returns if !this._connected, but the retry timer that queued the message fires via doPost failure while the connection IS still open (SSE stream healthy, just the POST failed). When doPost pushes to pendingSends and calls scheduleRetryFlush(), the guard passes. But if the SSE stream drops between the push and the timer firing, flushPendingSends() will call doPost() while disconnected — doPost doesn't guard against !this._connected, so it will attempt a fetch that likely fails, re-queue, and schedule another flush, creating a retry loop until max retries exhausts. Not critical since MAX_SEND_RETRIES caps it at 3, but the retry counting may drift (each failed flush counts as a retry even though the connection is down). [fixable]


const MAX_PENDING_SENDS = 100;
const MAX_SEND_RETRIES = 3;
const RETRY_FLUSH_DELAY_MS = 3000;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The SendFailedEvent interface is exported but the _send_failed type string is used as a magic literal across sse-connection.ts and store.ts. Consider reusing the interface type or at least a shared constant for the event type to prevent typo-induced silent failures. [fixable]

@@ -51,6 +72,10 @@ export class SseConnection implements ChatConnection {
suspendUrl: '',

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: The 4-line comment on the connectionId assignment explains the motivation well for a code review, but this is the kind of 'why this PR exists' context that belongs in the commit message rather than permanently in the source. The field name and config docs already communicate the intent. [fixable]

Global events (task_state, inbox_updated) no longer clear the retry
banner — only session-scoped events confirm message delivery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 8 issue(s) (4 warning).

packages/client/src/store.ts

Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.

  • 🟡 bugs (L708): The _open handler clears sendError unconditionally on reconnect, but flushPendingSends() fires async POSTs that haven't resolved yet. Sequence: SSE reconnects → _open clears sendError → flushed retry POST fails → _send_failed sets sendError again. The user sees a brief flash of 'no error' then the error reappears. This is benign (the error eventually shows), but the order is surprising. A cleaner approach: don't clear sendError on _open — let the session-scoped event clearing at line 770 handle it, which only fires once the server actually processes the message. [fixable]
  • 🟡 bugs (L770): Session-scoped sendError clearing triggers on ANY session event (including events from a different session's background activity that happen to have a sessionId). If session A's send fails and session B emits a block_delta, the sendError from session A is silently cleared. The guard should also check eventSessionId === store.getState().sessions.active. [fixable]
  • 🔵 missing_tests (L708): No test coverage for the _open → sendError clearing interaction in the store. The store tests should verify: (1) _open clears a stale sendError, (2) _send_failed after _open re-sets the error, (3) session-scoped events clear sendError. The optimistic SET_RUNNING dispatch in sendMessage (line 379) also lacks a dedicated test.

server/index.ts

Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.

  • 🟡 unsafe_assumptions (L414): The cid validation regex only accepts UUID v4 format (cid-{uuid}), but crypto.randomUUID() in some environments produces uppercase hex. The regex uses the /i flag so that's fine. However, the docstring in getOwnerConnection (ws-handler-v2.ts:89) says connectionIds use conn-${timestamp}-${random} format — this is now stale. Stable cids use cid-${uuid}. The comment should be updated to document both formats, or a future maintainer may assume colons can't appear and miss edge cases. [fixable]
  • 🔵 missing_tests (L407): The SSE /api/chat/events handler's stable-cid path (reconnect vs register, close handler branching, removeIfCurrent guard) is complex but only has unit tests for SessionSseRegistry and ConnectionRegistry in isolation. No integration test verifies the full reconnect flow: client sends ?cid=... → reconnects → close handler defers cleanup → TTL expires → sessions detach.
  • 🔵 unsafe_assumptions (L417): Invalid cid format is logged as a warning but silently falls through to generate a legacy conn- ID. This means a typo like ?cid=cid-not-a-uuid silently degrades to legacy behavior with no client feedback. Consider returning a 400 response or including a flag in the welcome event so the client knows its cid was rejected. [fixable]

packages/client/src/sse-connection.ts

Well-structured reconnect resilience feature with good test coverage for the new SseConnection behaviors. Main concerns: premature sendError clearing on _open before retried POSTs resolve, and sendError clearing not scoped to the active session — both are UX glitches rather than data-loss bugs.

  • 🟡 regressions (L119): The send() method removed the this._connectionId guard from the connected-path check (was this._connected && this._connectionId, now just this._connected). Since _connectionId is always set in the constructor, this is safe. But doPost previously had an early-return if (!this._connectionId) return; which was also removed. Verify no caller depends on doPost being a no-op before welcome — now doPost always fires if _connected is true, which only becomes true after welcome. This looks correct.
  • 🔵 style (L42): SendFailedEvent is exported but never imported anywhere — consumers handle it via the generic Record<string, unknown> message type. Consider either using it in the store's wsListener signature for type safety, or making it non-exported to avoid dead API surface. [fixable]

// Foreground recovery: when the page becomes visible again (iOS may have
// evicted it from memory, losing in-memory state), re-fetch messages from
// the REST API if we have an active session but no messages in the store.
if (msg.type === '_open') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: The _open handler clears sendError unconditionally on reconnect, but flushPendingSends() fires async POSTs that haven't resolved yet. Sequence: SSE reconnects → _open clears sendError → flushed retry POST fails → _send_failed sets sendError again. The user sees a brief flash of 'no error' then the error reappears. This is benign (the error eventually shows), but the order is surprising. A cleaner approach: don't clear sendError on _open — let the session-scoped event clearing at line 770 handle it, which only fires once the server actually processes the message. [fixable]

Comment thread packages/client/src/store.ts Outdated
// Session event arrived — clear any stale send error. Only clear on
// session-scoped events (eventSessionId present), not global events
// like task_state or inbox_updated which don't confirm message delivery.
if (eventSessionId && store.getState().sendError) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: Session-scoped sendError clearing triggers on ANY session event (including events from a different session's background activity that happen to have a sessionId). If session A's send fails and session B emits a block_delta, the sendError from session A is silently cleared. The guard should also check eventSessionId === store.getState().sessions.active. [fixable]

// Foreground recovery: when the page becomes visible again (iOS may have
// evicted it from memory, losing in-memory state), re-fetch messages from
// the REST API if we have an active session but no messages in the store.
if (msg.type === '_open') {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: No test coverage for the _open → sendError clearing interaction in the store. The store tests should verify: (1) _open clears a stale sendError, (2) _send_failed after _open re-sets the error, (3) session-scoped events clear sendError. The optimistic SET_RUNNING dispatch in sendMessage (line 379) also lacks a dedicated test.

Comment thread server/index.ts
// ownership race. Client sends ?cid=<uuid>; server reuses it across
// SSE reconnects so the clientId (connectionId:sessionId) never changes.
const clientCid = req.query.cid as string | undefined;
const isStableCid =

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 unsafe_assumptions: The cid validation regex only accepts UUID v4 format (cid-{uuid}), but crypto.randomUUID() in some environments produces uppercase hex. The regex uses the /i flag so that's fine. However, the docstring in getOwnerConnection (ws-handler-v2.ts:89) says connectionIds use conn-${timestamp}-${random} format — this is now stale. Stable cids use cid-${uuid}. The comment should be updated to document both formats, or a future maintainer may assume colons can't appear and miss edge cases. [fixable]

Comment thread server/index.ts
@@ -374,54 +407,96 @@ app.get('/api/chat/events', (req, res) => {
'X-Accel-Buffering': 'no',

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The SSE /api/chat/events handler's stable-cid path (reconnect vs register, close handler branching, removeIfCurrent guard) is complex but only has unit tests for SessionSseRegistry and ConnectionRegistry in isolation. No integration test verifies the full reconnect flow: client sends ?cid=... → reconnects → close handler defers cleanup → TTL expires → sessions detach.

Comment thread server/index.ts
const isStableCid =
clientCid != null &&
/^cid-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(clientCid);
if (clientCid != null && !isStableCid) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: Invalid cid format is logged as a warning but silently falls through to generate a legacy conn- ID. This means a typo like ?cid=cid-not-a-uuid silently degrades to legacy behavior with no client feedback. Consider returning a 400 response or including a flag in the welcome event so the client knows its cid was rejected. [fixable]

if (!endpoint) return false;

if (this._connected && this._connectionId) {
if (this._connected) {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 regressions: The send() method removed the this._connectionId guard from the connected-path check (was this._connected && this._connectionId, now just this._connected). Since _connectionId is always set in the constructor, this is safe. But doPost previously had an early-return if (!this._connectionId) return; which was also removed. Verify no caller depends on doPost being a no-op before welcome — now doPost always fires if _connected is true, which only becomes true after welcome. This looks correct.

const MAX_SEND_RETRIES = 3;
const RETRY_FLUSH_DELAY_MS = 3000;

/** Synthetic event emitted when a POST to the send endpoint fails. */

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 style: SendFailedEvent is exported but never imported anywhere — consumers handle it via the generic Record<string, unknown> message type. Consider either using it in the store's wsListener signature for type safety, or making it non-exported to avoid dead API surface. [fixable]

- Remove premature sendError clear on _open (flushPendingSends hasn't
  resolved yet — let session events confirm delivery instead)
- Scope sendError clear to active session events only, preventing
  background session activity from hiding retry state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis dimakis merged commit 4b38750 into main Jun 26, 2026
1 check passed
dimakis added a commit that referenced this pull request Jun 27, 2026
…onid

Revert: stable client connectionId (PR #401)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant