
feat: add opt-in OpenTelemetry tracing, metrics, and browser RUM #24

Merged: AshDevFr merged 7 commits into main from otlp-traces on May 23, 2026

@AshDevFr
Owner

Summary

Adds vendor-neutral OpenTelemetry instrumentation to Codex, covering backend traces (HTTP, repository, plugin RPC), metrics with latency histograms, and browser RUM with end-to-end trace propagation. Telemetry is shipped over OTLP, so pointing Codex at any compatible backend (SigNoz, Tempo, Honeycomb, Uptrace, Datadog, etc.) takes a single config change. The feature defaults to fully disabled, so no telemetry leaves the box without explicit operator action.

Motivation

Codex previously exposed only text logs and an in-app dashboard backed by homegrown averages. There was no way to see per-endpoint latency distributions, attribute a slow request to a specific SQL call or plugin, or correlate a slow browser interaction with the backend work it triggered. Operators also had no path to feed Codex telemetry into the observability stack they already run. This change closes that gap at no vendor cost.

Changes

  • Configuration: new observability section with toggles for traces, metrics, and browser RUM, plus OTLP endpoint, protocol (gRPC, HTTP/protobuf, HTTP/JSON), headers, sampling ratios, and metric export interval. Entire feature defaults to disabled.
  • Tracing: incoming traceparent is honored on HTTP requests; one trace tree spans the HTTP handler, repository calls, scanner and background task work, and plugin RPC round-trips, with trace IDs also injected into existing log lines for correlation.
  • Metrics: latency histograms (p50/p95/p99) for HTTP routes, plugin calls, and tasks, plus inventory gauges and process/runtime metrics. The existing in-app metrics dashboard and /api/v1/metrics/* responses continue to work unchanged.
  • Web UI: opt-in browser SDK auto-instruments page loads, user clicks, and fetches; outgoing API calls carry traceparent so browser interactions join the backend trace.
  • Operator-facing endpoints: a new authenticated browser-OTLP proxy under /api/v1/observability/otlp/ forwards browser telemetry to the operator's configured collector (no CORS or token distribution to clients), and a small config endpoint serves the browser SDK its settings.
  • Docs and examples: new observability page in the operator docs, config reference updates, and an example compose file for running SigNoz alongside Codex.
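
To make the `traceparent` handling mentioned above concrete, here is a minimal sketch of parsing a W3C Trace Context header (version `00`). The `TraceParent` struct and `parse_traceparent` function are hypothetical names for illustration; in this PR the real extraction is done by the OpenTelemetry layers, not by hand-written code like this.

```rust
/// Parsed W3C `traceparent` header (version 00 layout).
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub sampled: bool,
}

/// Parses `00-<32 hex>-<16 hex>-<2 hex>` and returns `None` on malformed input.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // This sketch only accepts version 00, which has exactly four fields.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    let is_lower_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace or parent IDs are invalid.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        // Bit 0 of the flags byte is the "sampled" flag.
        sampled: flag_bits & 0x01 == 1,
    })
}
```

A browser request carrying a valid header yields a trace ID that the backend spans can reuse, which is what lets a browser interaction and its backend work appear in one trace.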

Notes

  • Default state is fully disabled (observability.enabled = false); operators must opt in explicitly. Browser RUM is a separate opt-in on top of that and defaults to 10% sampling.
  • Logs are intentionally out of scope and continue to use the existing stdout/file appender; trace IDs in log lines make external log shipping sufficient for correlation.
  • Plugin tracing covers the server-side RPC envelope only. Propagating trace context into plugin subprocesses is a planned follow-up that requires plugin SDK changes.
  • Operator-side end-to-end verification against a live collector is expected as part of release validation.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 23, 2026

Deploying codex with Cloudflare Pages

Latest commit: 0c2ecc2
Status: ✅  Deploy successful!
Preview URL: https://54fcfb81.codex-asm.pages.dev
Branch Preview URL: https://otlp-traces.codex-asm.pages.dev


AshDevFr added 2 commits May 22, 2026 19:02
Wire OpenTelemetry SDK + axum middleware behind a default `observability`
Cargo feature and an explicit `observability.enabled` config flag. Default
is off so no telemetry leaves the box without operator action.

- Config: new `observability` section (service_name, otlp endpoint /
  protocol / headers / timeout, per-pipeline enable + sample ratio,
  browser proxy block) with env overrides and YAML round-trip.
- Providers: new `src/observability` module builds SdkTracerProvider +
  SdkMeterProvider from config with a ParentBased sampler, batch span
  processor, and periodic metric reader. Returns a guard the serve and
  worker commands shut down on SIGTERM/SIGINT to flush pending exports.
- Bridges: `init_tracing` composes the existing fmt + file appender with
  the `tracing-opentelemetry` layer via a Registry. A new
  `TraceContextFormat` wrapper prepends `trace_id=` / `span_id=` to each
  log line for trace ↔ log correlation.
- HTTP: `install_http_layers` wires `OtelAxumLayer` + `OtelInResponseLayer`
  at the outermost router position; no-op when the feature or config flag
  is off.
- Build: `--no-default-features` continues to compile via a stub
  observability module; OTLP/HTTP uses `reqwest-blocking-client` to avoid
  panics on the SDK's dedicated batch processor thread.

Tests added for config defaults / env overrides / YAML round-trip,
provider init + shutdown, and the HTTP layer wiring decisions.
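
The ParentBased sampler described above can be pictured roughly as follows. This is a simplified sketch, assuming the root sampler is ratio-based and using an illustrative threshold scheme; `should_sample` and `ratio_sample` are hypothetical names and this is not the SDK's exact algorithm.

```rust
/// Decides whether a new span is sampled.
/// `parent_sampled` is `Some(flag)` when an incoming trace context exists,
/// `None` for a root span. Hypothetical helper, for illustration only.
pub fn should_sample(parent_sampled: Option<bool>, trace_id: u128, ratio: f64) -> bool {
    match parent_sampled {
        // ParentBased: a child always follows its parent's decision.
        Some(sampled) => sampled,
        // Root span: fall back to the configured ratio.
        None => ratio_sample(trace_id, ratio),
    }
}

/// Deterministic ratio sampling keyed on the trace ID, so every service
/// that sees the same trace ID reaches the same decision.
fn ratio_sample(trace_id: u128, ratio: f64) -> bool {
    if ratio >= 1.0 {
        return true;
    }
    if ratio <= 0.0 {
        return false;
    }
    // Take the low 64 bits of the trace ID and drop one bit so the value
    // fits in the range [0, 2^63).
    let low = (trace_id as u64) >> 1;
    let threshold = (ratio * (1u64 << 63) as f64) as u64;
    low < threshold
}
```

Keying the decision on the trace ID rather than a random draw is what keeps a trace either fully present or fully absent across the HTTP, repository, and plugin spans.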
…nd task worker with OTel spans

Build out the trace tree that the OTLP scaffolding produces so a single API
request reads as HTTP server → handler → repository calls → plugin RPCs in
the operator's collector, instead of a flat list of unrelated spans.

- Plugin manager entry methods (search_series, get_series_metadata,
  match_series, search_book, get_book_metadata, match_book, test_plugin,
  ping) emit client-kind spans named "plugin.<method>" carrying plugin_id,
  plugin.method, plugin_name, duration_ms, otel.status_code, and error.code.
- Plugin RPC layer adds "plugin.rpc.write" and "plugin.rpc.wait" internal
  spans so stdio write time is attributable separately from waiting on the
  plugin process.
- New observability::repo::db_system_str() maps SeaORM backends to the
  OpenTelemetry db.system attribute. Hot-path repository methods on books,
  series, libraries, users, and plugins are decorated with
  #[tracing::instrument] following a "db.<entity>.<operation>" naming
  convention plus db.system / db.operation / otel.kind fields.
- Scanner entry points (scan_library, analyze_book) and the task worker
  (task.execute) get root spans so background work no longer appears as
  children of unrelated HTTP requests. The task span is created only after
  a task is claimed to avoid empty-poll noise.

Span tests use a small in-test CapturingLayer to assert names and field
values without standing up the full OTel SDK.
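
As an illustration of the naming convention above, the sketch below maps a database backend to a `db.system` value and builds a `db.<entity>.<operation>` span name. The local `Backend` enum stands in for the SeaORM backend type, and `db_span_name` is a hypothetical helper; the real `db_system_str` in this PR may differ in signature.

```rust
/// Stand-in for the ORM's backend enum (assumption: three variants).
#[derive(Debug, Clone, Copy)]
pub enum Backend {
    Postgres,
    MySql,
    Sqlite,
}

/// Maps a backend to the OpenTelemetry `db.system` attribute value.
pub fn db_system_str(backend: Backend) -> &'static str {
    match backend {
        Backend::Postgres => "postgresql",
        Backend::MySql => "mysql",
        Backend::Sqlite => "sqlite",
    }
}

/// Builds a span name following the "db.<entity>.<operation>" convention.
/// Hypothetical helper, for illustration only.
pub fn db_span_name(entity: &str, operation: &str) -> String {
    format!("db.{entity}.{operation}")
}
```

With names built this way, a collector query such as "all spans starting with `db.books.`" isolates one repository without any extra attributes.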
AshDevFr added 5 commits May 22, 2026 20:45
…side in-app metrics

Dual-write plugin and task metrics to OpenTelemetry on top of the
existing in-memory and DB-backed stores so operators can read p50/p95/p99
latencies from any OTLP backend without losing the in-app dashboards.

- New `observability::metrics` module exposes stable instrument names
  (`codex.plugin.*`, `codex.task.*`, `codex.inventory.*`) and typed
  `PluginInstruments` / `TaskInstruments` wrappers built once against
  the global meter. A no-op `metrics_stub` mirror keeps call sites
  cfg-free under `--no-default-features`.
- `PluginMetricsService` and `TaskMetricsService` emit counters and
  duration histograms at every recording call site; rate-limit
  rejections and rate-limited task completions get distinct labels so
  dashboards can filter them out of error rates.
- In-flight task gauge is an observable gauge over an atomic toggled
  by an RAII `InFlightGuard` in the task worker, which catches every
  exit path of `process_next_task` (success, failure, error
  propagation) without scattering inc/dec across early returns.
- Inventory observable gauges (libraries, series, books, users, pages)
  are fed by a 30s background poller that refreshes a shared snapshot
  and exits on the existing background-task cancellation token.
- New axum middleware records `http.server.request.duration` in seconds
  with `http.request.method`, `http.route` (from `MatchedPath`), and
  `http.response.status_code` attributes; layered just inside the OTel
  server-span layer.
- Process CPU and memory gauges via `sysinfo` (added as an optional
  dependency gated on the `observability` feature). Rolled in-house
  because `opentelemetry-system-metrics` is pinned to opentelemetry
  0.31 while we run 0.32.

Tests cover metric-name stability, no-op safety when no meter provider
is installed, end-to-end emission through an in-memory exporter for
plugin and task instruments, the in-flight saturation behaviour, and
the inventory snapshot refresh path. Clippy is clean with and without
the `observability` feature.
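
The RAII pattern behind `InFlightGuard` can be sketched as below. This is a minimal version under assumptions: it uses a counter (`AtomicUsize`) rather than whatever atomic the real guard toggles, and `process_task` is a hypothetical stand-in for `process_next_task`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Increments the shared counter on creation and decrements it on drop,
/// so every exit path of the enclosing scope is covered.
pub struct InFlightGuard {
    counter: Arc<AtomicUsize>,
}

impl InFlightGuard {
    pub fn enter(counter: Arc<AtomicUsize>) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        Self { counter }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical stand-in for the worker's task function: the early
/// return and the success path both release the guard automatically.
pub fn process_task(counter: &Arc<AtomicUsize>, input: &str) -> Result<usize, String> {
    let _guard = InFlightGuard::enter(Arc::clone(counter));
    if input.is_empty() {
        return Err("empty task".to_string());
    }
    Ok(input.len())
}
```

An observable gauge callback would then only need to read the counter, which keeps the worker's hot path free of metric-specific branching.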
…ing proxy

Expose POST /api/v1/observability/otlp/v1/{traces,metrics} and forward the
raw OTLP body to the operator-configured upstream collector with the
configured auth headers stamped in. Inbound Content-Type is preserved;
operator headers always win over anything supplied by the browser, so
collector tokens stay server-side and no CORS hop is needed. Body is
capped at 4 MiB and the upstream reqwest client lives in a OnceCell so
the connection pool survives across requests. A companion
GET /api/v1/observability/config returns a redacted bootstrap payload
(enabled flag, service name, proxy path, sample ratio) so the SPA can
decide whether to start the SDK at all.

On the frontend, register the OpenTelemetry web SDK with WebTracerProvider
+ BatchSpanProcessor, an OTLP/HTTP exporter pointing at the proxy,
ZoneContextManager, document-load, and fetch instrumentations. Only inject
traceparent on same-origin requests so third-party CDNs and metadata
sources never see Codex trace context. UserInteractionInstrumentation is
restricted to click and submit to keep span volume in check. The heavy
SDK is loaded via dynamic import only when the server-side config flag is
on, so the default-off path pays nothing beyond a small bootstrap script
and a single fetch.

AppState now carries an Arc<ObservabilityConfig> alongside the existing
config Arcs, which the proxy and bootstrap handlers consume. The browser
proxy uses FlexibleAuthContext so the SPA's existing cookie session
authenticates the SDK's POSTs without any custom header plumbing.

Disabled by default; opt-in via observability.browser.enabled and a
non-empty observability.otlp.endpoint. Integration tests cover the auth
gate, the disabled/enabled bootstrap payloads, the 503 path when RUM is
off, and verbatim body + operator-header-wins behavior against an
in-process capture upstream.
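
The "operator headers win" rule and the body cap can be sketched like this. It is an illustration under assumptions: the sketch forwards only Content-Type from the browser request, and `upstream_headers` and `body_within_limit` are hypothetical names, not the handlers added in this PR.

```rust
use std::collections::BTreeMap;

/// 4 MiB cap on the forwarded OTLP body, as stated in the commit message.
pub const MAX_BODY_BYTES: usize = 4 * 1024 * 1024;

/// Builds the header set sent to the upstream collector.
/// Header names are lowercased so lookups are case-insensitive.
pub fn upstream_headers(
    browser: &[(&str, &str)],
    operator: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    // Assumption: only the inbound Content-Type is carried over.
    for (name, value) in browser {
        if name.eq_ignore_ascii_case("content-type") {
            out.insert("content-type".to_string(), value.to_string());
        }
    }
    // Operator headers are applied last, so they override any collision.
    for (name, value) in operator {
        out.insert(name.to_ascii_lowercase(), value.to_string());
    }
    out
}

/// Returns true when the body fits under the proxy's size cap.
pub fn body_within_limit(body: &[u8]) -> bool {
    body.len() <= MAX_BODY_BYTES
}
```

Applying the operator map last is the whole trick: a browser-supplied `Authorization` value can never reach the collector, so the token stays server-side.
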
The OTLP scaffolding rewrite switched the subscriber install from
`try_init().ok()` to `init()`, which panics via `set_global_default`
when called twice in the same process. Tests that drive migrate +
wait_for_migrations back to back (or migrate twice) tripped the
panic on the second call.

Restore `try_init().ok()` in both feature branches so a redundant
init is a no-op instead of a panic. The disabled-observability path
in `observability::init` was already idempotent, so no other changes
are needed.
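
The difference between a panicking and an idempotent install can be shown with a std-only stand-in. This sketch models the global subscriber slot with a `OnceLock`; the functions below only mimic the behaviour described above and are not the `tracing-subscriber` API.

```rust
use std::sync::OnceLock;

/// Stand-in for the process-wide "global default" slot.
static SUBSCRIBER: OnceLock<&'static str> = OnceLock::new();

/// Fallible install: returns an error if a default is already set.
pub fn try_init(name: &'static str) -> Result<(), &'static str> {
    SUBSCRIBER
        .set(name)
        .map_err(|_| "a global default is already set")
}

/// Panicking install: the behaviour that broke back-to-back test setup.
pub fn init(name: &'static str) {
    try_init(name).expect("failed to set global default");
}

/// Idempotent install: a second call is silently ignored.
pub fn init_tracing(name: &'static str) {
    try_init(name).ok();
}
```

Calling `init_tracing` twice leaves the first installation in place, whereas calling `init` twice would panic on the second call.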
…sidecar

Document the opt-in OpenTelemetry pipeline end-to-end and make the dev
environment exercise it by default:

- docs/docs/observability.md: full operator guide covering quickstart,
  backend matrix (SigNoz, Tempo, Honeycomb, Uptrace, DataDog), sampling
  guidance keyed to workload size, the span/metric inventory, browser RUM
  design, log-trace correlation, three disable granularities, and a
  troubleshooting checklist.
- docs/docs/configuration.md: new Observability Configuration section
  mirroring the Rust schema with defaults, env-override names, and forward
  links to the operator guide, plus an entry in the common env-var block.
- docker-compose.yml: bundled jaeger sidecar on the dev profile (same
  pattern as mailhog), accepting OTLP on 4317/4318 and serving the UI on
  16686. codex-dev and codex-dev-worker are pre-wired with
  CODEX_OBSERVABILITY_* env vars pointing at http://jaeger:4317, so
  `make dev-up` produces a fully working backend-plus-collector loop with
  no YAML edit.
- config/config.docker.yaml, config.sqlite.yaml, config.kubernetes.yaml:
  commented-out observability blocks for schema discoverability. The
  templates stay disabled so the files are safe to reuse outside the dev
  compose without surprise telemetry export; the dev override lives at the
  compose layer only.
- src/observability/repo.rs: added an `#[ignore]`d microbench measuring
  per-call cost of `#[tracing::instrument]` with and without a subscriber
  attached (~13 ns disabled, ~400 ns enabled). Runs via
  `cargo test --release -- --ignored bench_instrumentation_overhead`.
…, pin Jaeger tag

Bump the codex-dev healthcheck start_period from 30s to 900s so a
cold-cache `cargo build` inside the container (which can take 10+
minutes) does not exhaust the retry budget and mark the service
unhealthy. While start_period is in effect, failing checks do not
count toward `retries`, so this keeps codex-dev-worker (which depends
on `codex-dev: service_healthy`) from also failing on first boot.

Add a `make dev-logs-jaeger` target to match the existing
per-service log shortcuts.

Pin the Jaeger all-in-one image to `1.62.0` in both the dev compose
file and the observability docs so the published quickstart cannot
silently drift onto a different patch release.
AshDevFr merged commit 3771a12 into main on May 23, 2026
19 checks passed
AshDevFr deleted the otlp-traces branch on May 23, 2026 18:32