Summary
I have run into a situation where a phantom graph/session persists in a GraphServer after cleanup and collides with a newly launched pipeline, notably through a collision of component keys.
This needs more troubleshooting, but I have looked into the session clean-up code and found a mechanism by which cleanup can fail to finish. This was done in collaboration with an AI agent.
Areas of (possible) concern
- The biggest one: backend.py (line 866) sets _cleanup_done = True before cleanup actually succeeds. If _cleanup() is interrupted or raises while waiting for GraphContext.__aexit__(), a later cleanup attempt will return early and skip the graph close path. That could leave the session alive if the Python process/interpreter remains alive.
- Another: graphcontext.py (line 746) runs cleanup as:
    await self.revert()
    await self._close_session()
    await self._shutdown_servers()
  but not in a try/finally. If revert() hangs or raises before _close_session(), the session socket may not close, so GraphServer has no reason to drop the session metadata. Note: revert() closes publishers/subscribers and waits for them without a timeout before sending SESSION_CLEAR (graphcontext.py, line 765). If one client close stalls, it may never reach SESSION_CLEAR or _close_session().
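The first concern can be reproduced in isolation. The sketch below is not the project's code: Backend, _close_graph and session_open are hypothetical stand-ins, and only the ordering of the _cleanup_done assignment mirrors what is described above. It shows that once the flag is set before the awaited teardown, a failed first attempt turns every retry into a no-op.

```python
import asyncio


class Backend:
    """Hypothetical stand-in for the backend object; not the real class."""

    def __init__(self) -> None:
        self._cleanup_done = False
        self._failed_once = False
        self.session_open = True

    async def _close_graph(self) -> None:
        # Stand-in for awaiting GraphContext.__aexit__(); fails the first time.
        if not self._failed_once:
            self._failed_once = True
            raise RuntimeError("teardown interrupted")
        self.session_open = False

    async def _cleanup(self) -> None:
        if self._cleanup_done:
            return  # every retry exits here
        self._cleanup_done = True  # flag is set before the work has succeeded
        await self._close_graph()


async def demo() -> bool:
    backend = Backend()
    try:
        await backend._cleanup()  # first attempt raises mid-teardown
    except RuntimeError:
        pass
    await backend._cleanup()  # retry returns early and closes nothing
    return backend.session_open


leaked = asyncio.run(demo())
print(leaked)  # True: the session is still open after two cleanup attempts
```

The retry would have succeeded here had it been allowed to run, which is why the placement of the flag matters.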
Areas of no discernible concern
The server-side cleanup itself looks conceptually right: _handle_session() calls _drop_session() in finally, and _drop_session() removes edges, metadata, and settings (graphserver.py, line 632). But that only happens when the session task exits, which depends on the TCP connection closing. There is no session heartbeat/TTL to reap a stale-but-open connection.
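For reference, a heartbeat/TTL mechanism of the kind that is missing could look roughly like the following. This is a sketch under assumptions, not a proposal for the actual GraphServer internals: SessionLeases, heartbeat and reap are hypothetical names, and the clock is injectable only so that the example is deterministic.

```python
import time
from typing import Callable, Dict, List


class SessionLeases:
    """Hypothetical lease table; not part of GraphServer."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        # Record that the session was heard from just now.
        self._last_seen[session_id] = self._clock()

    def reap(self) -> List[str]:
        # Return and forget every session whose lease has expired.
        now = self._clock()
        expired = [sid for sid, seen in self._last_seen.items() if now - seen > self._ttl]
        for sid in expired:
            del self._last_seen[sid]
        return expired


# Deterministic walk-through with a fake clock.
now = [0.0]
leases = SessionLeases(ttl=5.0, clock=lambda: now[0])
leases.heartbeat("a")
leases.heartbeat("b")
now[0] = 4.0
leases.heartbeat("b")  # only "b" keeps its lease fresh
now[0] = 6.0
expired = leases.reap()
print(expired)  # ['a']
```

In a real server the ids returned by reap() would presumably be passed to whatever drops a session, so that a stale-but-open connection no longer pins its metadata.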
Potential fixes
Cleanup is too dependent on best-effort graceful teardown. The most promising fixes would be:
- Move _cleanup_done = True to the end of successful cleanup, or track “cleanup in progress” separately.
- Make GraphContext.__aexit__() close the session in a finally.
- Consider timeouts around client close/revert.
- (Optionally) add a session heartbeat/lease expiry in GraphServer for long-lived external servers.
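A rough sketch of how the first three fixes could fit together is below. It is not a patch: GraphContextSketch, BackendSketch, close() and revert_timeout are hypothetical, and only the method names revert(), _close_session(), _shutdown_servers(), _cleanup() and the _cleanup_done flag are taken from the description above. The revert step is bounded with asyncio.wait_for, the session close and server shutdown run in finally blocks, and the flag is set only after teardown returns.

```python
import asyncio


class GraphContextSketch:
    """Hypothetical stand-in; method names mirror those mentioned above."""

    def __init__(self, revert_delay: float = 0.0) -> None:
        self.revert_delay = revert_delay
        self.session_closed = False
        self.servers_stopped = False

    async def revert(self) -> None:
        # Simulates a client close that may stall.
        await asyncio.sleep(self.revert_delay)

    async def _close_session(self) -> None:
        self.session_closed = True

    async def _shutdown_servers(self) -> None:
        self.servers_stopped = True

    async def close(self, revert_timeout: float = 0.05) -> None:
        try:
            # Bound the graceful part so a stalled client cannot block teardown.
            await asyncio.wait_for(self.revert(), timeout=revert_timeout)
        except asyncio.TimeoutError:
            pass  # give up on graceful revert; still close the socket below
        finally:
            try:
                await self._close_session()
            finally:
                await self._shutdown_servers()


class BackendSketch:
    """Hypothetical stand-in for the backend that owns the context."""

    def __init__(self, ctx: GraphContextSketch) -> None:
        self._ctx = ctx
        self._cleanup_done = False

    async def _cleanup(self) -> None:
        if self._cleanup_done:
            return
        await self._ctx.close()
        self._cleanup_done = True  # set only after teardown has returned


async def demo() -> tuple:
    ctx = GraphContextSketch(revert_delay=10.0)  # revert would stall for 10 s
    backend = BackendSketch(ctx)
    await backend._cleanup()
    return ctx.session_closed, ctx.servers_stopped, backend._cleanup_done


result = asyncio.run(demo())
print(result)  # (True, True, True)
```

With the flag moved to the end, two overlapping cleanup calls could both run the teardown, so a real fix would likely also need a lock or a separate "in progress" state, as the first bullet suggests.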
Planning to write a first draft of such a fix soon.