Skip to content

fix(server): report idle sessions as not-running on reconnect#402

Open
dimakis wants to merge 1 commit into
mainfrom
fix/reconnect-running-state
Open

fix(server): report idle sessions as not-running on reconnect#402
dimakis wants to merge 1 commit into
mainfrom
fix/reconnect-running-state

Conversation

@dimakis

@dimakis dimakis commented Jun 26, 2026

Copy link
Copy Markdown
Owner

Summary

  • Root cause: handleReconnect used isActive() (session exists in registry) to determine running state, but this returns true for sessions whose query loop is alive but idle between turns. The frontend interprets running=true as "agent is busy" and queues messages behind a 5-second fallback timer instead of sending immediately — causing the persistent "two messages needed after reattach" bug.
  • Fix: After the existing isActive/storeMeta.isActive checks, also check lastSpeaker from EventStore. lastSpeaker=assistant means the agent completed its turn (idle) → running=false. lastSpeaker=user means the agent hasn't answered yet → running=true.
  • Adds 2 test cases covering idle (lastSpeaker=assistant) and mid-turn (lastSpeaker=user) scenarios. All 136 tests pass.

Context

This bug has been the target of 13+ PRs (#320, #348, #353, #360, #362, #367, #368, #371, #372, #381, #384, #386, #392) fixing symptoms at different layers (SSE race conditions, iOS WS stale readyState, idempotency, state machine infra). None addressed the semantic mismatch: "query loop alive" ≠ "agent actively processing." The 5-second PENDING_SEND_TIMEOUT_MS timer in store.ts:355 is literally a band-aid for this — its comment says "stale running state after reconnect."

Test plan

  • All 136 ws-handler-v2 tests pass (134 existing + 2 new)
  • Manual: close app mid-session, reopen, send message — should respond immediately without needing a second "nudge" message
  • Verify agent mid-turn reconnect still shows running state correctly

🤖 Generated with Claude Code

handleReconnect conflated "query loop alive" with "agent actively
processing a turn". isActive() returns true for sessions whose query
loop is alive but idle between turns (waiting in `for await`). This
caused the reconnected message to report running=true, making the
frontend queue the first user message behind a 5-second fallback timer
instead of sending it immediately — the "two messages to get a response
after reattach" bug.

Fix: check lastSpeaker from EventStore. When lastSpeaker=assistant the
agent completed its last turn and is idle, so report running=false.
When lastSpeaker=user the agent hasn't answered yet, so running=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 3 issue(s) (1 warning).

server/ws-handler-v2.ts

Core fix is correct and well-tested, but the same idle-session running-state logic is missing from handleSwitchSession, which sends the same running field to the same client queue mechanism — leaving the 5-second delay bug reachable via the switch_session path.

  • 🟡 regressions: handleSwitchSession (line 429) computes running without the new lastSpeaker idle check. Both reconnected and session_switched drive SET_RUNNING in the client store, which controls whether sendMessage queues behind a 5-second fallback timer. An idle session (lastSpeaker=assistant) will be reported as running=false on reconnect but running=true on switch_session, causing the same queuing delay this PR fixes — just via a different code path. [fixable]

server/__tests__/ws-handler-v2.test.ts

Core fix is correct and well-tested, but the same idle-session running-state logic is missing from handleSwitchSession, which sends the same running field to the same client queue mechanism — leaving the 5-second delay bug reachable via the switch_session path.

  • 🔵 missing_tests (L1148): The new idle-session tests don't verify that reattachChat is NOT called when running is downgraded to false by the lastSpeaker check. The zombie test (line 1117) explicitly asserts expect(reattachChat).not.toHaveBeenCalled(), but the new idle tests skip this. Since the code at line 284 gates reattach on running, and running is now false for idle sessions, it would be good to confirm the reattach is correctly skipped — especially since the session is still alive (unlike zombies where it's removed). [fixable]
  • 🔵 missing_tests (L1148): No test covers the case where storeMeta exists, isActive is true, but lastSpeaker is null (brand-new session that hasn't had any turns yet). The implementation handles it correctly (null !== 'assistant' → running stays true), but an explicit test would document this edge case and guard against future regressions. [fixable]

expect(reattachChat).not.toHaveBeenCalled();
});

it('reports running=false when session is active but idle (lastSpeaker=assistant)', () => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The new idle-session tests don't verify that reattachChat is NOT called when running is downgraded to false by the lastSpeaker check. The zombie test (line 1117) explicitly asserts expect(reattachChat).not.toHaveBeenCalled(), but the new idle tests skip this. Since the code at line 284 gates reattach on running, and running is now false for idle sessions, it would be good to confirm the reattach is correctly skipped — especially since the session is still alive (unlike zombies where it's removed). [fixable]

expect(reattachChat).not.toHaveBeenCalled();
});

it('reports running=false when session is active but idle (lastSpeaker=assistant)', () => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: No test covers the case where storeMeta exists, isActive is true, but lastSpeaker is null (brand-new session that hasn't had any turns yet). The implementation handles it correctly (null !== 'assistant' → running stays true), but an explicit test would document this edge case and guard against future regressions. [fixable]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant