feat: add opt-in OpenTelemetry tracing, metrics, and browser RUM #24
Merged
Deploying codex with
| Latest commit: | 0c2ecc2 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://54fcfb81.codex-asm.pages.dev |
| Branch Preview URL: | https://otlp-traces.codex-asm.pages.dev |
Wire OpenTelemetry SDK + axum middleware behind a default `observability` Cargo feature and an explicit `observability.enabled` config flag. Default is off so no telemetry leaves the box without operator action.

- Config: new `observability` section (service_name, otlp endpoint / protocol / headers / timeout, per-pipeline enable + sample ratio, browser proxy block) with env overrides and YAML round-trip.
- Providers: new `src/observability` module builds SdkTracerProvider + SdkMeterProvider from config with a ParentBased sampler, batch span processor, and periodic metric reader. Returns a guard the serve and worker commands shut down on SIGTERM/SIGINT to flush pending exports.
- Bridges: `init_tracing` composes the existing fmt + file appender with the `tracing-opentelemetry` layer via a Registry. A new `TraceContextFormat` wrapper prepends `trace_id=` / `span_id=` to each log line for trace ↔ log correlation.
- HTTP: `install_http_layers` wires `OtelAxumLayer` + `OtelInResponseLayer` at the outermost router position; no-op when the feature or config flag is off.
- Build: `--no-default-features` continues to compile via a stub observability module; OTLP/HTTP uses `reqwest-blocking-client` to avoid panics on the SDK's dedicated batch processor thread.

Tests added for config defaults / env overrides / YAML round-trip, provider init + shutdown, and the HTTP layer wiring decisions.
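The log correlation described in the Bridges item can be pictured with a small std-only sketch. It is not the PR's `TraceContextFormat`; the function name `with_trace_context` and the exact line layout are assumptions made for illustration. The only grounded facts are the `trace_id=` / `span_id=` prefixes and the W3C Trace Context sizes (16-byte trace ID, 8-byte span ID, both lowercase hex).

```rust
/// Hypothetical helper (not from the PR): prefix a log line with the active
/// trace and span IDs so a log entry can be looked up in the trace backend.
pub fn with_trace_context(ids: Option<(u128, u64)>, line: &str) -> String {
    match ids {
        // Trace IDs are 16 bytes (32 hex chars); span IDs are 8 bytes (16 hex chars).
        Some((trace_id, span_id)) => {
            format!("trace_id={trace_id:032x} span_id={span_id:016x} {line}")
        }
        // Outside any span there is nothing to correlate, so the line is unchanged.
        None => line.to_string(),
    }
}
```

Zero-padding to a fixed width keeps the IDs greppable and identical to what a collector would display for the same span.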
…nd task worker with OTel spans

Build out the trace tree that the OTLP scaffolding produces so a single API request reads as HTTP server → handler → repository calls → plugin RPCs in the operator's collector, instead of a flat list of unrelated spans.

- Plugin manager entry methods (search_series, get_series_metadata, match_series, search_book, get_book_metadata, match_book, test_plugin, ping) emit client-kind spans named "plugin.<method>" carrying plugin_id, plugin.method, plugin_name, duration_ms, otel.status_code, and error.code.
- Plugin RPC layer adds "plugin.rpc.write" and "plugin.rpc.wait" internal spans so stdio write time is attributable separately from waiting on the plugin process.
- New observability::repo::db_system_str() maps SeaORM backends to the OpenTelemetry db.system attribute. Hot-path repository methods on books, series, libraries, users, and plugins are decorated with #[tracing::instrument] following a "db.<entity>.<operation>" naming convention plus db.system / db.operation / otel.kind fields.
- Scanner entry points (scan_library, analyze_book) and the task worker (task.execute) get root spans so background work no longer appears as children of unrelated HTTP requests. The task span is created only after a task is claimed to avoid empty-poll noise.

Span tests use a small in-test CapturingLayer to assert names and field values without standing up the full OTel SDK.
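As a rough illustration of the backend mapping and the "db.<entity>.<operation>" naming convention, the sketch below uses a local `Backend` enum as a stand-in for SeaORM's backend type so that it compiles on its own. The function body, the enum, and the `db_span_name` helper are assumptions; only the function name `db_system_str` and the naming convention come from the commit message. The string values follow the OpenTelemetry semantic conventions for `db.system`.

```rust
/// Hypothetical stand-in for SeaORM's backend enum, kept local so the sketch
/// has no external dependencies.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Backend {
    Postgres,
    MySql,
    Sqlite,
}

/// Map a backend to the value used for the `db.system` span attribute.
pub fn db_system_str(backend: Backend) -> &'static str {
    match backend {
        Backend::Postgres => "postgresql",
        Backend::MySql => "mysql",
        Backend::Sqlite => "sqlite",
    }
}

/// Hypothetical helper: build a span name following "db.<entity>.<operation>".
pub fn db_span_name(entity: &str, operation: &str) -> String {
    format!("db.{entity}.{operation}")
}
```

Returning `&'static str` keeps the mapping allocation-free, which matters for methods described as hot-path.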
…side in-app metrics

Dual-write plugin and task metrics to OpenTelemetry on top of the existing in-memory and DB-backed stores so operators can read p50/p95/p99 latencies from any OTLP backend without losing the in-app dashboards.

- New `observability::metrics` module exposes stable instrument names (`codex.plugin.*`, `codex.task.*`, `codex.inventory.*`) and typed `PluginInstruments` / `TaskInstruments` wrappers built once against the global meter. A no-op `metrics_stub` mirror keeps call sites cfg-free under `--no-default-features`.
- `PluginMetricsService` and `TaskMetricsService` emit counters and duration histograms at every recording call site; rate-limit rejections and rate-limited task completions get distinct labels so dashboards can filter them out of error rates.
- In-flight task gauge is an observable gauge over an atomic toggled by an RAII `InFlightGuard` in the task worker, which catches every exit path of `process_next_task` (success, failure, error propagation) without scattering inc/dec across early returns.
- Inventory observable gauges (libraries, series, books, users, pages) are fed by a 30s background poller that refreshes a shared snapshot and exits on the existing background-task cancellation token.
- New axum middleware records `http.server.request.duration` in seconds with `http.request.method`, `http.route` (from `MatchedPath`), and `http.response.status_code` attributes; layered just inside the OTel server-span layer.
- Process CPU and memory gauges via `sysinfo` (added as an optional dependency gated on the `observability` feature). Rolled in-house because `opentelemetry-system-metrics` is pinned to opentelemetry 0.31 while we run 0.32.

Tests cover metric-name stability, no-op safety when no meter provider is installed, end-to-end emission through an in-memory exporter for plugin and task instruments, the in-flight saturation behaviour, and the inventory snapshot refresh path. Clippy is clean with and without the `observability` feature.
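The RAII pattern behind the in-flight gauge can be sketched with only the standard library. The type name `InFlightGuard` comes from the commit message; everything else here (the `enter` constructor, the borrowed counter, and the `process` function standing in for `process_next_task`) is a hypothetical shape chosen for illustration, not the PR's code.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Increments the counter on creation and decrements it on drop, so every
/// exit path of the enclosing scope is covered without manual bookkeeping.
pub struct InFlightGuard<'a> {
    counter: &'a AtomicI64,
}

impl<'a> InFlightGuard<'a> {
    pub fn enter(counter: &'a AtomicI64) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        Self { counter }
    }
}

impl Drop for InFlightGuard<'_> {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical task body with an early error return via `?`.
pub fn process(counter: &AtomicI64, input: &str) -> Result<u32, std::num::ParseIntError> {
    let _guard = InFlightGuard::enter(counter);
    // If parsing fails, `?` returns early and `_guard` is still dropped.
    let n: u32 = input.parse()?;
    Ok(n * 2)
}
```

An observable gauge callback would then only need to load the atomic; the worker code never calls a decrement explicitly.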
…ing proxy
Expose POST /api/v1/observability/otlp/v1/{traces,metrics} and forward the
raw OTLP body to the operator-configured upstream collector with the
configured auth headers stamped in. Inbound Content-Type is preserved;
operator headers always win over anything supplied by the browser, so
collector tokens stay server-side and no CORS hop is needed. Body is
capped at 4 MiB and the upstream reqwest client lives in a OnceCell so
the connection pool survives across requests. A companion
GET /api/v1/observability/config returns a redacted bootstrap payload
(enabled flag, service name, proxy path, sample ratio) so the SPA can
decide whether to start the SDK at all.
On the frontend, register the OpenTelemetry web SDK with WebTracerProvider
+ BatchSpanProcessor, an OTLP/HTTP exporter pointing at the proxy,
ZoneContextManager, document-load, and fetch instrumentations. Only inject
traceparent on same-origin requests so third-party CDNs and metadata
sources never see Codex trace context. UserInteractionInstrumentation is
restricted to click and submit to keep span volume in check. The heavy
SDK is loaded via dynamic import only when the server-side config flag is
on, so the default-off path pays nothing beyond a small bootstrap script
and a single fetch.
AppState now carries an Arc<ObservabilityConfig> alongside the existing
config Arcs, which the proxy and bootstrap handlers consume. The browser
proxy uses FlexibleAuthContext so the SPA's existing cookie session
authenticates the SDK's POSTs without any custom header plumbing.
Disabled by default; opt-in via observability.browser.enabled and a
non-empty observability.otlp.endpoint. Integration tests cover the auth
gate, the disabled/enabled bootstrap payloads, the 503 path when RUM is
off, and verbatim body + operator-header-wins behavior against an
in-process capture upstream.
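The header policy of the proxy (inbound Content-Type preserved, operator headers always win) and the 4 MiB cap can be illustrated with a std-only sketch. The function names, the tuple-based header representation, and the choice to drop every other browser header are assumptions; the real handler presumably works with axum and reqwest header types.

```rust
/// Body cap stated in the commit message: 4 MiB.
pub const MAX_BODY_BYTES: usize = 4 * 1024 * 1024;

/// Hypothetical helper: compute the headers sent to the upstream collector.
/// Header names are compared case-insensitively and emitted in lowercase.
pub fn upstream_headers(
    browser: &[(&str, &str)],
    operator: &[(&str, &str)],
) -> Vec<(String, String)> {
    let mut out: Vec<(String, String)> = Vec::new();
    // Assumption: Content-Type is the only browser header carried over.
    for (name, value) in browser {
        if name.eq_ignore_ascii_case("content-type") {
            out.push((name.to_ascii_lowercase(), value.to_string()));
        }
    }
    // Operator-configured headers replace any same-named header already present.
    for (name, value) in operator {
        out.retain(|(existing, _)| !existing.eq_ignore_ascii_case(name));
        out.push((name.to_ascii_lowercase(), value.to_string()));
    }
    out
}

/// True when the body fits within the cap.
pub fn body_within_cap(body: &[u8]) -> bool {
    body.len() <= MAX_BODY_BYTES
}
```

Applying operator headers last is what keeps a browser-supplied `Authorization` value from ever reaching the collector.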
The OTLP scaffolding rewrite switched the subscriber install from `try_init().ok()` to `init()`, which panics via `set_global_default` when called twice in the same process. Tests that drive migrate + wait_for_migrations back to back (or migrate twice) tripped the panic on the second call. Restore `try_init().ok()` in both feature branches so a redundant init is a no-op instead of a panic. The disabled-observability path in `observability::init` was already idempotent, so no other changes are needed.
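The difference between the two install styles can be shown by analogy with a set-once slot from the standard library. This is not the `tracing-subscriber` API; `GlobalSlot` and its methods are hypothetical stand-ins that mimic a global default which can only be set once.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a process-wide "global default" slot.
pub struct GlobalSlot {
    inner: OnceLock<String>,
}

impl GlobalSlot {
    pub const fn new() -> Self {
        Self { inner: OnceLock::new() }
    }

    /// Analogous to `try_init()`: returns Err when something is already installed.
    pub fn try_install(&self, name: &str) -> Result<(), String> {
        self.inner
            .set(name.to_string())
            .map_err(|rejected| format!("already installed; rejected {rejected}"))
    }

    /// Analogous to `init()`: panics when something is already installed.
    pub fn install(&self, name: &str) {
        self.try_install(name).expect("global default already set");
    }

    pub fn current(&self) -> Option<&str> {
        self.inner.get().map(String::as_str)
    }
}
```

Calling `slot.try_install("second").ok()` after a first install discards the error and leaves the first value in place, which is the redundant-init no-op the fix restores; `slot.install("second")` would panic instead.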
…sidecar

Document the opt-in OpenTelemetry pipeline end-to-end and make the dev environment exercise it by default:

- docs/docs/observability.md: full operator guide covering quickstart, backend matrix (SigNoz, Tempo, Honeycomb, Uptrace, DataDog), sampling guidance keyed to workload size, the span/metric inventory, browser RUM design, log-trace correlation, three disable granularities, and a troubleshooting checklist.
- docs/docs/configuration.md: new Observability Configuration section mirroring the Rust schema with defaults, env-override names, and forward links to the operator guide, plus an entry in the common env-var block.
- docker-compose.yml: bundled jaeger sidecar on the dev profile (same pattern as mailhog), accepting OTLP on 4317/4318 and serving the UI on 16686. codex-dev and codex-dev-worker are pre-wired with CODEX_OBSERVABILITY_* env vars pointing at http://jaeger:4317, so `make dev-up` produces a fully working backend-plus-collector loop with no YAML edit.
- config/config.docker.yaml, config.sqlite.yaml, config.kubernetes.yaml: commented-out observability blocks for schema discoverability. The templates stay disabled so the files are safe to reuse outside the dev compose without surprise telemetry export; the dev override lives at the compose layer only.
- src/observability/repo.rs: added an `#[ignore]`d microbench measuring per-call cost of `#[tracing::instrument]` with and without a subscriber attached (~13 ns disabled, ~400 ns enabled). Runs via `cargo test --release -- --ignored bench_instrumentation_overhead`.
…, pin Jaeger tag

Bump the codex-dev healthcheck start_period from 30s to 900s so a cold-cache `cargo build` inside the container (which can take 10+ minutes) does not exhaust the retry budget and mark the service unhealthy. While start_period is in effect, failing checks do not count toward `retries`, so this keeps codex-dev-worker (which depends on `codex-dev: service_healthy`) from also failing on first boot.

Add a `make dev-logs-jaeger` target to match the existing per-service log shortcuts.

Pin the Jaeger all-in-one image to `1.62.0` in both the dev compose file and the observability docs so the published quickstart cannot silently drift onto a different patch release.
Summary
Adds vendor-neutral OpenTelemetry instrumentation to Codex, covering backend traces (HTTP, repository, plugin RPC), metrics with latency histograms, and browser RUM with end-to-end trace propagation. Telemetry is shipped over OTLP to any compatible backend (SigNoz, Tempo, Honeycomb, Uptrace, DataDog, etc.) by changing a single config value. Defaults to fully disabled, so no telemetry leaves the box without explicit operator action.
Motivation
Codex previously exposed only text logs and an in-app dashboard backed by homegrown averages, with no way to see per-endpoint latency distributions, attribute slow requests to a specific SQL call or plugin, or correlate a slow browser interaction to the backend work it triggered. Operators had no path to feed Codex telemetry into the observability stack they already run. This change closes that gap at no vendor cost.
Changes
- … `observability` section with toggles for traces, metrics, and browser RUM, plus OTLP endpoint, protocol (gRPC, HTTP/protobuf, HTTP/JSON), headers, sampling ratios, and metric export interval. Entire feature defaults to disabled.
- … `traceparent` is honored on HTTP requests; one trace tree spans the HTTP handler, repository calls, scanner and background task work, and plugin RPC round-trips, with trace IDs also injected into existing log lines for correlation.
- … `/api/v1/metrics/*` responses continue to work unchanged.
- … `traceparent` so browser interactions join the backend trace.
- … `/api/v1/observability/otlp/` forwards browser telemetry to the operator's configured collector (no CORS or token distribution to clients), and a small config endpoint serves the browser SDK its settings.

Notes
- … `observability.enabled = false`); operators must opt in explicitly. Browser RUM is a separate opt-in on top of that and defaults to 10% sampling.