Skip to content

fix(client): reconnect and retry on stale connection 404#406

Open
dimakis wants to merge 1 commit into
mainfrom
fix/reconnect-on-stale-post
Open

fix(client): reconnect and retry on stale connection 404#406
dimakis wants to merge 1 commit into
mainfrom
fix/reconnect-on-stale-post

Conversation

@dimakis

@dimakis dimakis commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Summary

  • When doPost() gets a 404 (server dropped the connection), re-queue the message and trigger checkAndReconnect() so it's retried on a fresh connection
  • Previously, 404s were silently swallowed, causing the user's first message after a stale connection to vanish
  • Root cause of the "two messages to wake agent" bug
  • Includes design doc for the delivery guarantee architecture

Root Cause

SSE stream dies server-side (timeout, cleanup, restart) but the client still has _connected = true because EventSource hasn't fired an error yet (~30s heartbeat gap). User sends a message, POST returns 404, doPost() catches and ignores it. Message lost.

What Changed

  • sse-connection.ts: doPost() now checks res.status === 404, re-queues the message, and calls checkAndReconnect(true)
  • 3 new tests covering: 404 triggers reconnect + retry, 500 does NOT trigger reconnect, network error does NOT trigger reconnect
  • Design doc: docs/design/delivery-guarantee-redesign.md

Test plan

  • 37/37 SSE connection tests pass (34 existing + 3 new)
  • Full suite: 3017 pass, 39 fail (same 39 pre-existing failures on main)
  • Browser: send message after idle period, verify no lost messages
  • iOS: confirm no regression

🤖 Generated with Claude Code

When the SSE stream dies server-side but the client hasn't detected it
yet, POST requests return 404. Previously doPost() silently swallowed
these, causing the user's first message to vanish. Now a 404 re-queues
the message and triggers checkAndReconnect(), so it's retried on a
fresh connection.

Root cause of the "two messages to wake agent" bug: the ~30s gap
between server-side cleanup and client-side detection (SSE heartbeat).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis

dimakis commented Jun 27, 2026

Copy link
Copy Markdown
Owner Author

Centaur Review

Found 5 issue(s) (3 warning).

packages/client/src/sse-connection.ts

Core 404 → reconnect logic is sound and well-tested for the happy path, but doPost is a generic method also used by sendSuspend — fire-and-forget lifecycle messages should not be re-queued on 404. Also needs a retry limit and queue-size guard on the re-queue path.

  • 🟡 bugs (L340): sendSuspend() also calls doPost('suspend', ...). If it returns 404, the suspend message is re-queued and replayed on the next connection — but a suspend is a lifecycle signal for the current connection, not something that should be retried later. Consider filtering out fire-and-forget endpoints (suspend, close, unwatch) from the 404 re-queue path, e.g. if (res.status === 404 && RETRYABLE_ENDPOINTS.has(endpoint)). [fixable]
  • 🟡 bugs (L340): No MAX_PENDING_SENDS guard on the 404 re-queue push, unlike the send() path (line 97). If 404s happen repeatedly during flush, the queue grows without bound. Should apply the same eviction logic: if (this.pendingSends.length >= MAX_PENDING_SENDS) this.pendingSends.shift();. [fixable]
  • 🟡 unsafe_assumptions (L341): No retry limit on the 404 → re-queue → reconnect → flush → 404 cycle. If the server consistently returns 404 for a specific endpoint (e.g. routing misconfiguration), this loops indefinitely creating EventSource instances. A simple counter (e.g. max 3 consecutive 404-triggered reconnects before dropping the message and logging) would prevent runaway reconnection. [fixable]
  • 🔵 unsafe_assumptions (L354): flushPendingSends() fires all doPost() calls without awaiting. If N queued sends all return 404, each independently calls checkAndReconnect(true), which closes and recreates the EventSource N times in sequence (only the last survives). Not a correctness bug — it self-heals — but wasteful. In practice the queue is usually 1-2 items, so this is low priority. [fixable]

packages/client/src/__tests__/sse-connection.test.ts

Core 404 → reconnect logic is sound and well-tested for the happy path, but doPost is a generic method also used by sendSuspend — fire-and-forget lifecycle messages should not be re-queued on 404. Also needs a retry limit and queue-size guard on the re-queue path.

  • 🔵 missing_tests: No test for sendSuspend() receiving a 404. This would surface the semantic issue where a stale suspend message gets re-queued and replayed on the next connection. Also no test for multiple pending sends flushing and all returning 404 (amplification scenario). [fixable]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant