Skip to content

fix(infra): restore Jaeger trace persistence with v2 config#405

Merged
dimakis merged 4 commits into
mainfrom
fix/observability-persistence
Jun 27, 2026
Merged

fix(infra): restore Jaeger trace persistence with v2 config#405
dimakis merged 4 commits into
mainfrom
fix/observability-persistence

Conversation

@dimakis

@dimakis dimakis commented Jun 27, 2026

Copy link
Copy Markdown
Owner

Summary

  • Jaeger v2 (latest tag) silently ignores v1 env vars (SPAN_STORAGE_TYPE, BADGER_*) and falls back to in-memory storage — all traces lost on container restart
  • .jaeger-data/ was empty (0 bytes) despite BADGER_EPHEMERAL=false
  • Replace env var config with proper v2 OTel Collector YAML config (infra/jaeger-v2-config.yaml) that explicitly configures Badger persistent storage
  • Verified: traces survive container restarts

Context

Part of a larger observability persistence fix. Investigation also found that the event store only persists user_message events (lifecycle events like session_end, block_end never reach store.append()), and all 434 sessions have zero for duration/turns/cost metrics. Those fixes will follow in separate PRs.

Test plan

  • Jaeger starts with Badger storage (Initializing storage 'badger_main' in logs)
  • .jaeger-data/data/ and .jaeger-data/key/ populated with Badger files
  • Test span sent via OTLP HTTP, visible in Jaeger UI
  • Traces survive podman restart mitzo_jaeger_1

🤖 Generated with Claude Code

Jaeger v2 (latest tag) silently ignores v1 env vars (SPAN_STORAGE_TYPE,
BADGER_*) and falls back to in-memory storage. All traces were lost on
container restart — .jaeger-data/ was empty despite BADGER_EPHEMERAL=false.

Replace env var config with a proper v2 OTel Collector YAML config that
explicitly configures Badger persistent storage. Verified traces survive
container restarts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 1 issue(s) (1 warning).

infra/jaeger-v2-config.yaml

Clean v1→v2 Jaeger migration; only gap is the dropped 14-day trace TTL which will cause unbounded disk growth.

  • 🟡 regressions: The old config had BADGER_SPAN_STORE_TTL: 336h (14 days) for trace retention. The new v2 config has no equivalent span_store_ttl setting under the Badger backend. Without it, Badger will retain traces indefinitely, causing unbounded disk growth in .jaeger-data/. Consider adding span_store_ttl: 336h under the badger key to restore the previous retention behavior (Loki already has its own retention_period: 720h for parity). [fixable]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

LGTM — no issues found.

…K result

~46% of sessions end via WS disconnect/abort and never receive the SDK
result event, leaving duration_ms, num_turns, and cost as zero. Add a
fallback recordUsage call in the finally block using live counters
(cumulativeOutputTokens, turnIndex, wall-clock duration) so partial
sessions still capture best-effort metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 2 issue(s) (1 warning).

server/query-loop.ts

Jaeger v2 config migration and fallback usage recording look correct, but a debug log block explicitly marked for removal ("remove after investigation") is still in the diff and should be stripped before merge.

  • 🟡 style (L141): Debug logging left in production code. The comment explicitly says "remove after investigation" — this should not be merged. It logs on every non-token/progress event in sendOrBuffer, which is the hot path for all session events. This will be noisy in production logs and pollute Loki. [fixable]

server/__tests__/query-loop.test.ts

Jaeger v2 config migration and fallback usage recording look correct, but a debug log block explicitly marked for removal ("remove after investigation") is still in the diff and should be stripped before merge.

  • 🔵 missing_tests (L1604): The test only asserts durationMs and state, but doesn't verify the actual usage values written by the fallback path (outputTokens, numTurns, totalCostUsd). Since the stream yields no token events, cumulativeOutputTokens and turnIndex are both 0 — asserting those would document that the fallback records zeros for partial sessions, which is the stated intent. [fixable]

Comment thread server/query-loop.ts Outdated
enriched = { ...enriched, seq };
}
}
// DEBUG: trace event persistence decisions (remove after investigation)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 style: Debug logging left in production code. The comment explicitly says "remove after investigation" — this should not be merged. It logs on every non-token/progress event in sendOrBuffer, which is the hot path for all session events. This will be noisy in production logs and pollute Loki. [fixable]

connRegistry,
});

// Fallback usage should be recorded with best-effort data

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The test only asserts durationMs and state, but doesn't verify the actual usage values written by the fallback path (outputTokens, numTurns, totalCostUsd). Since the stream yields no token events, cumulativeOutputTokens and turnIndex are both 0 — asserting those would document that the fallback records zeros for partial sessions, which is the stated intent. [fixable]

Strip debug logging block marked "remove after investigation" from the
sendOrBuffer hot path. Add assertions for outputTokens, numTurns, and
totalCostUsd in the fallback usage test to document zero-value behavior
for partial sessions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis dimakis merged commit bcf1abb into main Jun 27, 2026
1 check passed
@dimakis dimakis deleted the fix/observability-persistence branch June 27, 2026 13:28

@dimakis dimakis left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Centaur Review

Found 3 issue(s) (1 warning).

server/query-loop.ts

Clean infrastructure migration (Jaeger v1 env-vars → v2 config YAML) with a well-guarded fallback usage recorder for aborted sessions. No bugs — findings are minor documentation and edge-case test gaps.

  • 🟡 bugs (L1392): After registry.remove(clientId) at line 1380, finalSession is still a valid reference (Map.delete doesn't invalidate it), but the comment at lines 1357-1360 only documents the cumulativeCostUsd read for the span attributes block. The new fallback code also reads finalSession?.cumulativeCostUsd — safe due to the same reference semantics, but the existing comment should be updated to mention both read sites to prevent future refactoring from introducing a use-after-remove bug. [fixable]
  • 🔵 unsafe_assumptions (L1393): turnIndex is used as numTurns in the fallback, but turnIndex increments on each parent message_start (line 801), not on completed turns. For an aborted session, the last turn may be incomplete — this records an in-progress turn as completed. Minor inaccuracy, documented behavior is 'best-effort', but worth noting.

server/__tests__/query-loop.test.ts

Clean infrastructure migration (Jaeger v1 env-vars → v2 config YAML) with a well-guarded fallback usage recorder for aborted sessions. No bugs — findings are minor documentation and edge-case test gaps.

  • 🔵 missing_tests (L1591): The test only covers the abort-with-error path (stream throws). Consider adding a test for the abort-controller signal path (external abort without error) — the fallback block would run with caughtError being falsy, and setSessionState would record reason 'completed' instead of 'error', which may be misleading for aborted sessions. [fixable]

Comment thread server/query-loop.ts
outputTokens: cumulativeOutputTokens,
cacheReadTokens: 0,
cacheCreationTokens: 0,
totalCostUsd: finalSession?.cumulativeCostUsd ?? 0,

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 bugs: After registry.remove(clientId) at line 1380, finalSession is still a valid reference (Map.delete doesn't invalidate it), but the comment at lines 1357-1360 only documents the cumulativeCostUsd read for the span attributes block. The new fallback code also reads finalSession?.cumulativeCostUsd — safe due to the same reference semantics, but the existing comment should be updated to mention both read sites to prevent future refactoring from introducing a use-after-remove bug. [fixable]

Comment thread server/query-loop.ts
cacheReadTokens: 0,
cacheCreationTokens: 0,
totalCostUsd: finalSession?.cumulativeCostUsd ?? 0,
numTurns: turnIndex,

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 unsafe_assumptions: turnIndex is used as numTurns in the fallback, but turnIndex increments on each parent message_start (line 801), not on completed turns. For an aborted session, the last turn may be incomplete — this records an in-progress turn as completed. Minor inaccuracy, documented behavior is 'best-effort', but worth noting.


// Stream that yields an assistant event (resolves sessionId) then errors
// before an SDK result event arrives — simulates WS disconnect mid-session.
async function* abortStream() {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 missing_tests: The test only covers the abort-with-error path (stream throws). Consider adding a test for the abort-controller signal path (external abort without error) — the fallback block would run with caughtError being falsy, and setSessionState would record reason 'completed' instead of 'error', which may be misleading for aborted sessions. [fixable]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant