fix(infra): persist Loki logs and Grafana state #408
Jaeger v2 (latest tag) silently ignores v1 env vars (SPAN_STORAGE_TYPE, BADGER_*) and falls back to in-memory storage. All traces were lost on container restart — .jaeger-data/ was empty despite BADGER_EPHEMERAL=false. Replace env var config with a proper v2 OTel Collector YAML config that explicitly configures Badger persistent storage. Verified traces survive container restarts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
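For orientation, a v2 config of the kind this commit describes might look roughly like the sketch below. It is not the PR's actual infra/jaeger-v2-config.yaml. The backend name badger_store is hypothetical, and the extension and exporter keys follow Jaeger's published v2 sample configs as best understood, so they should be checked against the image version in use. The directory paths, ephemeral: false and the OTLP HTTP endpoint come from the diff context quoted in this review.

```yaml
# Sketch only: not the actual infra/jaeger-v2-config.yaml from this PR.
service:
  extensions: [jaeger_storage, jaeger_query]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger_storage_exporter]

extensions:
  jaeger_storage:
    backends:
      badger_store:            # hypothetical backend name
        badger:
          directories:
            keys: /badger/key
            values: /badger/data
          ephemeral: false     # write to disk instead of falling back to memory
  jaeger_query:
    storage:
      traces: badger_store     # the query UI reads from the same backend

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  jaeger_storage_exporter:
    trace_storage: badger_store
```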
…arts Loki had no data volume — pushed logs were lost on every container restart. Grafana similarly lost any manually created dashboards or preferences. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
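A minimal sketch of the two volume mounts described here, assuming a docker-compose.yml layout. The service names are hypothetical, the image tags are the versions the review mentions for Loki and Grafana, and the explicit ./ prefix is an assumption made so that Compose treats the sources as host paths.

```yaml
# Sketch only: service names are assumptions, not taken from the PR.
services:
  loki:
    image: grafana/loki:3.4.2
    volumes:
      - ./.loki-data:/loki                  # pushed logs persist on the host
  grafana:
    image: grafana/grafana:12.4.1
    volumes:
      - ./.grafana-data:/var/lib/grafana    # dashboards and preferences persist
```

Both host directories would also need to be ignored by git, which the PR handles through .gitignore.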
dimakis
left a comment
Centaur Review
Found 1 issue(s) (1 warning).
infra/jaeger-v2-config.yaml
Volume mounts and .gitignore additions are correct; the Jaeger v2 migration drops the 14-day trace TTL — add span_store_ttl: 336h to the Badger config to avoid unbounded storage growth.
- 🟡 regressions (L21): The old v1 config set BADGER_SPAN_STORE_TTL: 336h (14-day retention). The new v2 config omits any TTL setting, so Badger will retain traces indefinitely and the .jaeger-data directory will grow without bound. Add span_store_ttl: 336h under the badger: key to preserve the 14-day retention from v1. [fixable]
directories:
  keys: /badger/key
  values: /badger/data
ephemeral: false
🟡 regressions: The old v1 config set BADGER_SPAN_STORE_TTL: 336h (14-day retention). The new v2 config omits any TTL setting, so Badger will retain traces indefinitely and the .jaeger-data directory will grow without bound. Add span_store_ttl: 336h under the badger: key to preserve the 14-day retention from v1. [fixable]
Centaur review caught that the v2 migration dropped the BADGER_SPAN_STORE_TTL setting, leaving traces to grow without bound. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
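The resulting Badger block might look like the fragment below. This is a sketch, not the committed file: the key name span_store_ttl is the one given in the review comment, and the exact key may differ between Jaeger v2 releases, so it is worth confirming against the Badger storage documentation for the image version in use.

```yaml
# Fragment of the storage backend config; surrounding keys omitted.
badger:
  directories:
    keys: /badger/key
    values: /badger/data
  ephemeral: false
  span_store_ttl: 336h   # 14 days; key name as given in the review, verify per version
```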
dimakis
left a comment
Centaur Review
Found 2 issue(s) (1 warning).
docker-compose.yml
Clean infrastructure fix — Loki and Grafana volume mounts are correct, Jaeger v2 config properly replaces the silently-ignored v1 env vars, and .gitignore is updated. Only suggestion is to pin the Jaeger image version like the other services.
- 🟡 regressions (L14): Using jaegertracing/jaeger:latest (line 9, unchanged) means the v2 config format could break if the image regresses or changes schema. Consider pinning to a specific version (e.g., jaegertracing/jaeger:2.x.y) like the other services (Loki 3.4.2, Grafana 12.4.1) to avoid silent breakage. [fixable]
infra/jaeger-v2-config.yaml
- 🔵 unsafe_assumptions (L30): The OTLP HTTP receiver binds to 0.0.0.0:4318, which is correct for container use but worth noting: it only works because Docker's port mapping provides the access control. No issue here, just confirming the bind address is intentional for containerized deployment.
  BADGER_DIRECTORY_VALUE: /badger/data
  BADGER_DIRECTORY_KEY: /badger/key
  BADGER_SPAN_STORE_TTL: 336h # 14 days retention
command: ['--config', '/etc/jaeger/config.yaml']
🟡 regressions: Using jaegertracing/jaeger:latest (line 9, unchanged) means the v2 config format could break if the image regresses or changes schema. Consider pinning to a specific version (e.g., jaegertracing/jaeger:2.x.y) like the other services (Loki 3.4.2, Grafana 12.4.1) to avoid silent breakage. [fixable]
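A pinned Jaeger service could be sketched as follows. The service name and the /badger mount point are assumptions, and 2.x.y is the review's own placeholder rather than a real tag, so it has to be replaced with an actual release. The command line and config path are the ones shown in the diff context above.

```yaml
# Sketch only: substitute a real release tag for the 2.x.y placeholder.
services:
  jaeger:                                   # hypothetical service name
    image: jaegertracing/jaeger:2.x.y       # pinned instead of :latest
    command: ['--config', '/etc/jaeger/config.yaml']
    volumes:
      - ./infra/jaeger-v2-config.yaml:/etc/jaeger/config.yaml:ro
      - ./.jaeger-data:/badger              # assumed parent of /badger/key and /badger/data
```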
otlp:
  protocols:
    http:
      endpoint: 0.0.0.0:4318
🔵 unsafe_assumptions: The OTLP HTTP receiver binds to 0.0.0.0:4318, which is correct for container use but worth noting — it only works because Docker's port mapping provides the access control. No issue here, just confirming the bind address is intentional for containerized deployment.
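If that reliance on the port mapping should be made explicit, one option is to publish the port on the host's loopback interface only. Nothing in this conversation indicates the PR does this; the fragment below is an illustrative sketch with a hypothetical service name.

```yaml
# Sketch only: restricts the published OTLP HTTP port to the local machine.
services:
  jaeger:                          # hypothetical service name
    ports:
      - "127.0.0.1:4318:4318"      # host loopback only; the container still binds 0.0.0.0
```

The receiver inside the container keeps listening on 0.0.0.0:4318, which it needs in order to accept traffic forwarded by Docker. Only the host-side exposure changes.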
Centaur review flagged that :latest can silently break the v2 config format. Pin like Loki (3.4.2) and Grafana (12.4.1). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dimakis
left a comment
Centaur Review
LGTM — no issues found.
Summary
- .loki-data:/loki volume mount so pushed logs survive container restarts
- .grafana-data:/var/lib/grafana volume mount for dashboard/preference persistence
- .gitignore
Test plan
- npm run observability:down && npm run observability:up
- {app="mitzo"} in Grafana
- .loki-data/ and .grafana-data/ directories are created on host
🤖 Generated with Claude Code