Old session artifacts may persist after Session kill #252

Description

@KonradPilch

Summary

I have hit a situation where a phantom graph/session persists in a GraphServer after cleanup and collides with a newly launched pipeline, notably through a collision of component keys.

This requires more troubleshooting, but nonetheless I have looked into the session clean-up code and found that there is a mechanism by which cleanup can fail to finish. This was done in collaboration with an AI agent.

Areas of (possible) concern

  1. The biggest one: backend.py (line 866) sets _cleanup_done = True before cleanup actually succeeds. If _cleanup() is interrupted or raises while waiting for GraphContext.__aexit__(), a later cleanup attempt will return early and skip the graph close path. That could leave the session alive if the Python process/interpreter remains alive.
  2. Another: graphcontext.py (line 746) runs cleanup as:
await self.revert()
await self._close_session()
await self._shutdown_servers()

but not inside a try/finally. If revert() hangs or raises before _close_session() runs, the session socket may never close, so GraphServer has no reason to drop the session metadata. Note: revert() closes publishers/subscribers and awaits them without a timeout before sending SESSION_CLEAR (graphcontext.py, line 765). If a single client close stalls, teardown may never reach SESSION_CLEAR or _close_session().
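To make the first concern concrete, here is a minimal toy reproduction of the early-flag pattern. `ToyBackend`, `_close_graph`, and `fail_next_close` are hypothetical stand-ins invented for illustration; only the `_cleanup_done` name comes from the code discussed above, and this is not the real backend.py logic.

```python
import asyncio


class ToyBackend:
    """Hypothetical stand-in for the backend; not the real class."""

    def __init__(self) -> None:
        self._cleanup_done = False
        self.graph_closed = False
        self.fail_next_close = True  # simulate one interrupted teardown

    async def _close_graph(self) -> None:
        # Stands in for awaiting GraphContext.__aexit__().
        if self.fail_next_close:
            self.fail_next_close = False
            raise ConnectionError("teardown interrupted")
        self.graph_closed = True

    async def cleanup(self) -> None:
        if self._cleanup_done:
            return  # early return: the graph close path is skipped
        self._cleanup_done = True  # flag is set before the work succeeds
        await self._close_graph()


async def demo() -> tuple:
    backend = ToyBackend()
    try:
        await backend.cleanup()  # first attempt fails midway
    except ConnectionError:
        pass
    await backend.cleanup()  # retry silently does nothing
    return backend._cleanup_done, backend.graph_closed


result = asyncio.run(demo())
print(result)  # (True, False): marked done, but the graph was never closed
```

In this toy the retry would have succeeded had it been allowed to run, which is the situation described in item 1.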

Areas of no discernible concern

The server-side cleanup itself looks conceptually right: _handle_session() calls _drop_session() in finally, and _drop_session() removes edges, metadata, and settings (graphserver.py, line 632). But that only happens when the session task exits, which depends on the TCP connection closing. There is no session heartbeat/TTL to reap a stale-but-open connection.
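For reference, a session lease could look roughly like the sketch below. `LeaseTable`, `heartbeat`, and `reap` are hypothetical names, not part of GraphServer, and the sketch assumes clients would send periodic heartbeats; a real server would call its existing drop path (e.g. `_drop_session()`) for each expired id.

```python
import time
from typing import Callable, Dict, List


class LeaseTable:
    """Hypothetical lease registry; not part of GraphServer."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        # Record (or refresh) the time this session was last heard from.
        self._last_seen[session_id] = self._clock()

    def reap(self) -> List[str]:
        # Return and forget every session whose lease is older than the TTL.
        now = self._clock()
        expired = [sid for sid, seen in self._last_seen.items() if now - seen > self._ttl]
        for sid in expired:
            del self._last_seen[sid]  # a real server would drop the session here
        return expired
```

A periodic task on the server side could call `reap()` and tear down whatever it returns, so a stale-but-open connection would no longer keep metadata alive indefinitely.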

Potential fixes

Cleanup is too dependent on best-effort graceful teardown. The most promising fixes would be:

  • Move _cleanup_done = True to the end of successful cleanup, or track “cleanup in progress” separately.
  • Make GraphContext.__aexit__() close the session in a finally.
  • Consider timeouts around client close/revert.
  • (Optionally) add a session heartbeat/lease expiry in GraphServer for long-lived external servers.

Planning to write a first draft of such a fix soon.
