fix(infra): restore Jaeger trace persistence with v2 config#405
Conversation
Jaeger v2 (latest tag) silently ignores v1 env vars (SPAN_STORAGE_TYPE, BADGER_*) and falls back to in-memory storage. All traces were lost on container restart — .jaeger-data/ was empty despite BADGER_EPHEMERAL=false. Replace env var config with a proper v2 OTel Collector YAML config that explicitly configures Badger persistent storage. Verified traces survive container restarts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 1 issue(s) (1 warning).
infra/jaeger-v2-config.yaml
Clean v1→v2 Jaeger migration; only gap is the dropped 14-day trace TTL which will cause unbounded disk growth.
- 🟡 regressions: The old config had
BADGER_SPAN_STORE_TTL: 336h(14 days) for trace retention. The new v2 config has no equivalentspan_store_ttlsetting under the Badger backend. Without it, Badger will retain traces indefinitely, causing unbounded disk growth in.jaeger-data/. Consider addingspan_store_ttl: 336hunder thebadgerkey to restore the previous retention behavior (Loki already has its ownretention_period: 720hfor parity).[fixable]
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
LGTM — no issues found.
…K result ~46% of sessions end via WS disconnect/abort and never receive the SDK result event, leaving duration_ms, num_turns, and cost as zero. Add a fallback recordUsage call in the finally block using live counters (cumulativeOutputTokens, turnIndex, wall-clock duration) so partial sessions still capture best-effort metrics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 2 issue(s) (1 warning).
server/query-loop.ts
Jaeger v2 config migration and fallback usage recording look correct, but a debug log block explicitly marked for removal ("remove after investigation") is still in the diff and should be stripped before merge.
- 🟡 style (L141): Debug logging left in production code. The comment explicitly says "remove after investigation" — this should not be merged. It logs on every non-token/progress event in sendOrBuffer, which is the hot path for all session events. This will be noisy in production logs and pollute Loki.
[fixable]
server/__tests__/query-loop.test.ts
Jaeger v2 config migration and fallback usage recording look correct, but a debug log block explicitly marked for removal ("remove after investigation") is still in the diff and should be stripped before merge.
- 🔵 missing_tests (L1604): The test only asserts durationMs and state, but doesn't verify the actual usage values written by the fallback path (outputTokens, numTurns, totalCostUsd). Since the stream yields no token events, cumulativeOutputTokens and turnIndex are both 0 — asserting those would document that the fallback records zeros for partial sessions, which is the stated intent.
[fixable]
| enriched = { ...enriched, seq }; | ||
| } | ||
| } | ||
| // DEBUG: trace event persistence decisions (remove after investigation) |
There was a problem hiding this comment.
🟡 style: Debug logging left in production code. The comment explicitly says "remove after investigation" — this should not be merged. It logs on every non-token/progress event in sendOrBuffer, which is the hot path for all session events. This will be noisy in production logs and pollute Loki. [fixable]
| connRegistry, | ||
| }); | ||
|
|
||
| // Fallback usage should be recorded with best-effort data |
There was a problem hiding this comment.
🔵 missing_tests: The test only asserts durationMs and state, but doesn't verify the actual usage values written by the fallback path (outputTokens, numTurns, totalCostUsd). Since the stream yields no token events, cumulativeOutputTokens and turnIndex are both 0 — asserting those would document that the fallback records zeros for partial sessions, which is the stated intent. [fixable]
Strip debug logging block marked "remove after investigation" from the sendOrBuffer hot path. Add assertions for outputTokens, numTurns, and totalCostUsd in the fallback usage test to document zero-value behavior for partial sessions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
There was a problem hiding this comment.
Centaur Review
Found 3 issue(s) (1 warning).
server/query-loop.ts
Clean infrastructure migration (Jaeger v1 env-vars → v2 config YAML) with a well-guarded fallback usage recorder for aborted sessions. No bugs — findings are minor documentation and edge-case test gaps.
- 🟡 bugs (L1392): After
registry.remove(clientId)at line 1380,finalSessionis still a valid reference (Map.delete doesn't invalidate it), but the comment at lines 1357-1360 only documents thecumulativeCostUsdread for the span attributes block. The new fallback code also readsfinalSession?.cumulativeCostUsd— safe due to the same reference semantics, but the existing comment should be updated to mention both read sites to prevent future refactoring from introducing a use-after-remove bug.[fixable] - 🔵 unsafe_assumptions (L1393):
turnIndexis used asnumTurnsin the fallback, butturnIndexincrements on each parentmessage_start(line 801), not on completed turns. For an aborted session, the last turn may be incomplete — this records an in-progress turn as completed. Minor inaccuracy, documented behavior is 'best-effort', but worth noting.
server/__tests__/query-loop.test.ts
Clean infrastructure migration (Jaeger v1 env-vars → v2 config YAML) with a well-guarded fallback usage recorder for aborted sessions. No bugs — findings are minor documentation and edge-case test gaps.
- 🔵 missing_tests (L1591): The test only covers the abort-with-error path (stream throws). Consider adding a test for the abort-controller signal path (external abort without error) — the fallback block would run with
caughtErrorbeing falsy, andsetSessionStatewould record reason'completed'instead of'error', which may be misleading for aborted sessions.[fixable]
| outputTokens: cumulativeOutputTokens, | ||
| cacheReadTokens: 0, | ||
| cacheCreationTokens: 0, | ||
| totalCostUsd: finalSession?.cumulativeCostUsd ?? 0, |
There was a problem hiding this comment.
🟡 bugs: After registry.remove(clientId) at line 1380, finalSession is still a valid reference (Map.delete doesn't invalidate it), but the comment at lines 1357-1360 only documents the cumulativeCostUsd read for the span attributes block. The new fallback code also reads finalSession?.cumulativeCostUsd — safe due to the same reference semantics, but the existing comment should be updated to mention both read sites to prevent future refactoring from introducing a use-after-remove bug. [fixable]
| cacheReadTokens: 0, | ||
| cacheCreationTokens: 0, | ||
| totalCostUsd: finalSession?.cumulativeCostUsd ?? 0, | ||
| numTurns: turnIndex, |
There was a problem hiding this comment.
🔵 unsafe_assumptions: turnIndex is used as numTurns in the fallback, but turnIndex increments on each parent message_start (line 801), not on completed turns. For an aborted session, the last turn may be incomplete — this records an in-progress turn as completed. Minor inaccuracy, documented behavior is 'best-effort', but worth noting.
|
|
||
| // Stream that yields an assistant event (resolves sessionId) then errors | ||
| // before an SDK result event arrives — simulates WS disconnect mid-session. | ||
| async function* abortStream() { |
There was a problem hiding this comment.
🔵 missing_tests: The test only covers the abort-with-error path (stream throws). Consider adding a test for the abort-controller signal path (external abort without error) — the fallback block would run with caughtError being falsy, and setSessionState would record reason 'completed' instead of 'error', which may be misleading for aborted sessions. [fixable]
Summary
latesttag) silently ignores v1 env vars (SPAN_STORAGE_TYPE,BADGER_*) and falls back to in-memory storage — all traces lost on container restart.jaeger-data/was empty (0 bytes) despiteBADGER_EPHEMERAL=falseinfra/jaeger-v2-config.yaml) that explicitly configures Badger persistent storageContext
Part of a larger observability persistence fix. Investigation also found that the event store only persists
user_messageevents (lifecycle events likesession_end,block_endnever reachstore.append()), and all 434 sessions have zero for duration/turns/cost metrics. Those fixes will follow in separate PRs.Test plan
Initializing storage 'badger_main'in logs).jaeger-data/data/and.jaeger-data/key/populated with Badger filespodman restart mitzo_jaeger_1🤖 Generated with Claude Code