Skip to content

onboard: per-instance asyncio.Lock + typed lifecycle exceptions on ConversationEngine#151

Open
Antawari wants to merge 1 commit into
mainfrom
ishtar/conversation-engine-concurrency-lock
Open

onboard: per-instance asyncio.Lock + typed lifecycle exceptions on ConversationEngine#151
Antawari wants to merge 1 commit into
mainfrom
ishtar/conversation-engine-concurrency-lock

Conversation

@Antawari
Copy link
Copy Markdown
Contributor

Summary

Two defects rooted at the same class, both load-bearing on the host-only Front Door:

1. Back-to-back handle_answer calls corrupt _turn

handle_answer awaits emit(...) (which awaits broadcast(...)asyncio.gather) before incrementing _turn. A second WS message arriving while the first call is suspended inside the first emit reads the same stale _turn — the analyzer fires twice for the same question, _turn double-advances, and the question-emission sequence can skip Q2 or Q3. With three answers required for completion, a single user double-click can push the engine into is_complete=True after Q2; subsequent legitimate answers raise the bare RuntimeError("Conversation is already complete.") and the WS handler silently logs without recovery — the engine is permanently broken for that session.

2. Lifecycle violations raise bare RuntimeError

handle_answer before start() and handle_answer after completion both raised RuntimeError("..."). The WS handler's broad except Exception swallowed them without producing a typed error frame the browser could surface to the user.

Changes

src/bonfire/onboard/conversation.py

  • Added _lock: asyncio.Lock = field(default_factory=asyncio.Lock) to ConversationEngine. Each instance gets its own Lock — default_factory runs per instance, not per class definition, so unrelated conversations don't serialize against each other.
  • handle_answer now wraps its full body in async with self._lock: — the await points inside the body can't be raced by a sibling call.
  • Defined two typed exceptions, both subclassing RuntimeError for backward-compat with existing bare-except RuntimeError catchers upstream in the WS handler:
    • ConversationNotStarted (handle_answer called before start)
    • ConversationAlreadyComplete (handle_answer called after Q3 answered)
  • Updated __all__ to export the new exception types.

tests/unit/test_onboard_conversation_concurrency.py (new · Knight RED)

Seven tests across three classes:

Class Test What it pins
TestLockPresence _lock attribute is asyncio.Lock Contract for the per-instance lock
TestLockPresence Lock per-instance, not shared default_factory discipline
TestHandleAnswerAcquiresLock handle_answer blocks when lock held The acquire pattern fires
TestHandleAnswerAcquiresLock Concurrent calls emit Q1→Q2→Q3 in order Pre-fix this races and skips questions
TestTypedLifecycleExceptions Pre-start raises ConversationNotStarted Typed exception path
TestTypedLifecycleExceptions Post-complete raises ConversationAlreadyComplete Typed exception path
TestTypedLifecycleExceptions Both subclass RuntimeError Back-compat guard for existing catchers

TDD verification

  • Knight RED state confirmed against pre-fix code: collection-error RED (the test imports ConversationAlreadyComplete / ConversationNotStarted which don't exist pre-fix — the strongest form of structural RED).
  • Warrior GREEN state confirmed after fix: 7/7 pass in 1.05s.
  • Existing regression: tests/unit/test_onboard_server.py19/19 still green.
  • ruff check + ruff format --check on changed files: clean.

Out of scope (filed for follow-up PR)

WS handler integration — catching ConversationNotStarted / ConversationAlreadyComplete specifically in src/bonfire/onboard/server.py and emitting a typed error frame to the browser. Deferred to avoid file overlap with the in-flight front-door hardening PR (which also touches server.py).

Test plan

  • pytest tests/unit/test_onboard_conversation_concurrency.py — confirm 7/7 green.
  • pytest tests/unit/test_onboard_server.py — confirm 19/19 still green (no regression).

🤖 Generated with Claude Code

…nversationEngine

Two defects rooted at the same class:

1. handle_answer awaits emit() (which awaits broadcast() → asyncio.gather)
   before incrementing _turn. A second WS message arriving while the first
   handle_answer is suspended inside the first emit reads the same stale
   _turn — the analyzer fires twice for the same question, _turn double-
   advances, and the question-emission sequence can skip Q2 or Q3. With
   three answers required for completion, a double-click can leave the
   engine permanently broken for that session.

2. Lifecycle violations (handle_answer before start, handle_answer after
   completion) raised bare RuntimeError. The WS handler's `except Exception`
   swallowed them without producing a typed error frame the browser
   could surface to the user.

## Fix

src/bonfire/onboard/conversation.py:

  - Added `_lock: asyncio.Lock = field(default_factory=asyncio.Lock)` to
    ConversationEngine. Each instance gets its own Lock — `default_factory`
    runs per instance, not per class definition, so unrelated conversations
    don't serialize against each other.
  - handle_answer now wraps its full body in `async with self._lock:` —
    the await points inside the body can't be raced by a sibling call.
  - Defined two typed exceptions, both subclassing RuntimeError for
    backward-compat with existing bare-`except RuntimeError` catchers
    upstream in the WS handler:
      * ConversationNotStarted (handle_answer called before start)
      * ConversationAlreadyComplete (handle_answer called after Q3 answered)
  - Updated `__all__` to export the new exception types.

## Tests

tests/unit/test_onboard_conversation_concurrency.py (new · Knight RED):

  Seven tests across three classes:
  - TestLockPresence (2): asyncio.Lock attribute exists; per-instance not
    shared.
  - TestHandleAnswerAcquiresLock (2):
      * handle_answer blocks when the lock is held externally (asserts via
        asyncio.wait_for timeout).
      * two concurrent handle_answer calls emit questions in strict Q1→Q2→Q3
        order (pre-fix this races and skips questions).
  - TestTypedLifecycleExceptions (3):
      * handle_answer before start raises ConversationNotStarted.
      * handle_answer after completion raises ConversationAlreadyComplete.
      * both typed exceptions subclass RuntimeError (back-compat guard).

## Out of scope (filed for follow-up PR)

WS handler integration — catching ConversationNotStarted /
ConversationAlreadyComplete specifically in src/bonfire/onboard/server.py
and emitting a typed error frame to the browser. Deferred to avoid file
overlap with the in-flight front-door hardening PR which also touches
server.py.

## Verification

  pytest tests/unit/test_onboard_conversation_concurrency.py
    7 passed (Knight RED → GREEN verified)

  pytest tests/unit/test_onboard_server.py (regression)
    19 passed

  ruff check + format on changed files: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant