feat: add opt-in OpenTelemetry tracing, metrics, and browser RUM by AshDevFr · Pull Request #24 · AshDevFr/codex

AshDevFr · 2026-05-23T01:10:56Z

Summary

Adds vendor-neutral OpenTelemetry instrumentation to Codex, covering backend traces (HTTP, repository, plugin RPC), metrics with latency histograms, and browser RUM with end-to-end trace propagation. Telemetry is shipped over OTLP to any compatible backend (SigNoz, Tempo, Honeycomb, Uptrace, DataDog, etc.) by changing a single config value. Defaults to fully disabled, so no telemetry leaves the box without explicit operator action.

Motivation

Codex previously exposed only text logs and an in-app dashboard backed by homegrown averages, with no way to see per-endpoint latency distributions, attribute slow requests to a specific SQL call or plugin, or correlate a slow browser interaction to the backend work it triggered. Operators had no path to feed Codex telemetry into the observability stack they already run. This change closes that gap at no vendor cost.

Changes

Configuration: new observability section with toggles for traces, metrics, and browser RUM, plus OTLP endpoint, protocol (gRPC, HTTP/protobuf, HTTP/JSON), headers, sampling ratios, and metric export interval. Entire feature defaults to disabled.
Tracing: incoming traceparent is honored on HTTP requests; one trace tree spans the HTTP handler, repository calls, scanner and background task work, and plugin RPC round-trips, with trace IDs also injected into existing log lines for correlation.
Metrics: latency histograms (p50/p95/p99) for HTTP routes, plugin calls, and tasks, plus inventory gauges and process/runtime metrics. The existing in-app metrics dashboard and /api/v1/metrics/* responses continue to work unchanged.
Web UI: opt-in browser SDK auto-instruments page loads, user clicks, and fetches; outgoing API calls carry traceparent so browser interactions join the backend trace.
Operator-facing endpoints: a new authenticated browser-OTLP proxy under /api/v1/observability/otlp/ forwards browser telemetry to the operator's configured collector (no CORS or token distribution to clients), and a small config endpoint serves the browser SDK its settings.
Docs and examples: new observability page in the operator docs, config reference updates, and an example compose file for running SigNoz alongside Codex.
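
To illustrate what honoring an incoming `traceparent` involves, here is a minimal standalone sketch of parsing the W3C Trace Context header (version `00`). It is not the code in this PR, which relies on the OpenTelemetry axum layers; `TraceParent` and `parse_traceparent` are hypothetical names used only for illustration.

```rust
/// Parsed fields of a W3C `traceparent` header (version 00).
/// Hypothetical type for illustration; not part of Codex.
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub sampled: bool,
}

/// Parses `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`.
/// Returns `None` for anything malformed, so the caller can start a new trace.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;

    // Version 00 has exactly four fields with fixed widths.
    if parts.next().is_some() || version != "00" {
        return None;
    }
    if trace_id.len() != 32 || parent_id.len() != 16 || flags.len() != 2 {
        return None;
    }

    // The spec requires lowercase hex digits.
    let is_lower_hex = |s: &str| s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !(is_lower_hex(trace_id) && is_lower_hex(parent_id) && is_lower_hex(flags)) {
        return None;
    }

    let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
    let parent_id = u64::from_str_radix(parent_id, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;

    // All-zero IDs are invalid per the spec.
    if trace_id == 0 || parent_id == 0 {
        return None;
    }

    Some(TraceParent {
        trace_id,
        parent_id,
        sampled: flags & 0x01 == 0x01,
    })
}
```

Returning `None` rather than an error keeps the caller simple: a request with a missing or malformed header just becomes the root of a new trace.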

Notes

Default state is fully disabled (observability.enabled = false); operators must opt in explicitly. Browser RUM is a separate opt-in on top of that and defaults to 10% sampling.
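
As a rough picture of what a sampling ratio such as the 10% default means, the sketch below makes a deterministic keep/drop decision from the trace ID, so every service that sees the same trace reaches the same answer. This is an illustrative approximation of ratio-based sampling under stated assumptions, not the OpenTelemetry SDK's exact code; `should_sample` is a hypothetical name.

```rust
/// Decides whether a trace is kept, given a sampling ratio in [0.0, 1.0].
/// Hypothetical helper: the decision depends only on the trace ID, so it is
/// stable across processes without any coordination.
pub fn should_sample(trace_id: u128, ratio: f64) -> bool {
    // Out-of-range ratios are clamped; NaN ends up sampling nothing.
    let ratio = ratio.clamp(0.0, 1.0);
    // Scale the ratio onto a 63-bit range.
    let bound = (ratio * (1u64 << 63) as f64) as u64;
    // Use the low 64 bits of the trace ID, dropping one bit to match the range.
    let low_bits = (trace_id as u64) >> 1;
    low_bits < bound
}
```

With randomly generated trace IDs, roughly `ratio` of all traces fall below the bound, which is why a ratio of 0.1 keeps about one trace in ten.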
Logs are intentionally out of scope and continue to use the existing stdout/file appender; trace IDs in log lines make external log shipping sufficient for correlation.
Plugin tracing covers the server-side RPC envelope only. Propagating trace context into plugin subprocesses is a planned follow-up that requires plugin SDK changes.
Operator-side end-to-end verification against a live collector is expected as part of release validation.

cloudflare-workers-and-pages · 2026-05-23T01:15:35Z

Deploying codex with Cloudflare Pages

Latest commit:	`0c2ecc2`
Status:	✅ Deploy successful!
Preview URL:	https://54fcfb81.codex-asm.pages.dev
Branch Preview URL:	https://otlp-traces.codex-asm.pages.dev


Wire OpenTelemetry SDK + axum middleware behind a default `observability` Cargo feature and an explicit `observability.enabled` config flag. Default is off so no telemetry leaves the box without operator action.

- Config: new `observability` section (service_name, otlp endpoint / protocol / headers / timeout, per-pipeline enable + sample ratio, browser proxy block) with env overrides and YAML round-trip.
- Providers: new `src/observability` module builds SdkTracerProvider + SdkMeterProvider from config with a ParentBased sampler, batch span processor, and periodic metric reader. Returns a guard the serve and worker commands shut down on SIGTERM/SIGINT to flush pending exports.
- Bridges: `init_tracing` composes the existing fmt + file appender with the `tracing-opentelemetry` layer via a Registry. A new `TraceContextFormat` wrapper prepends `trace_id=` / `span_id=` to each log line for trace ↔ log correlation.
- HTTP: `install_http_layers` wires `OtelAxumLayer` + `OtelInResponseLayer` at the outermost router position; no-op when the feature or config flag is off.
- Build: `--no-default-features` continues to compile via a stub observability module; OTLP/HTTP uses `reqwest-blocking-client` to avoid panics on the SDK's dedicated batch processor thread.

Tests added for config defaults / env overrides / YAML round-trip, provider init + shutdown, and the HTTP layer wiring decisions.
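
The `trace_id=` / `span_id=` log prefix described in this commit can be pictured with a small standalone sketch. The exact output format of `TraceContextFormat` is not shown in this PR, so the layout below (lowercase hex, space-separated, prefix omitted when no span is active) is an assumption, and `with_trace_context` is a hypothetical helper.

```rust
/// Prepends trace context to a log line when a span is active.
/// Hypothetical helper; the field layout is an assumption, not the
/// actual `TraceContextFormat` output.
pub fn with_trace_context(line: &str, ctx: Option<(u128, u64)>) -> String {
    match ctx {
        // 32 hex digits for the trace ID and 16 for the span ID, zero-padded,
        // matching the widths used on the wire by W3C Trace Context.
        Some((trace_id, span_id)) => {
            format!("trace_id={trace_id:032x} span_id={span_id:016x} {line}")
        }
        // Outside any span, the line is left untouched.
        None => line.to_string(),
    }
}
```

Emitting the IDs as plain `key=value` text means an external log shipper can correlate lines with traces using an ordinary pattern match.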

…nd task worker with OTel spans

Build out the trace tree that the OTLP scaffolding produces so a single API request reads as HTTP server → handler → repository calls → plugin RPCs in the operator's collector, instead of a flat list of unrelated spans.

- Plugin manager entry methods (search_series, get_series_metadata, match_series, search_book, get_book_metadata, match_book, test_plugin, ping) emit client-kind spans named "plugin.<method>" carrying plugin_id, plugin.method, plugin_name, duration_ms, otel.status_code, and error.code.
- Plugin RPC layer adds "plugin.rpc.write" and "plugin.rpc.wait" internal spans so stdio write time is attributable separately from waiting on the plugin process.
- New observability::repo::db_system_str() maps SeaORM backends to the OpenTelemetry db.system attribute. Hot-path repository methods on books, series, libraries, users, and plugins are decorated with #[tracing::instrument] following a "db.<entity>.<operation>" naming convention plus db.system / db.operation / otel.kind fields.
- Scanner entry points (scan_library, analyze_book) and the task worker (task.execute) get root spans so background work no longer appears as children of unrelated HTTP requests. The task span is created only after a task is claimed to avoid empty-poll noise.

Span tests use a small in-test CapturingLayer to assert names and field values without standing up the full OTel SDK.
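
The backend-to-`db.system` mapping and the "db.<entity>.<operation>" naming convention from this commit might look roughly like the sketch below. The `Backend` enum is a local stand-in for SeaORM's backend type so the example compiles on its own, and `repo_span_name` is a hypothetical helper; the real `db_system_str()` signature is not shown in this PR.

```rust
/// Local stand-in for the SeaORM database backend enum (assumption:
/// the three variants SeaORM supports).
#[derive(Clone, Copy, Debug)]
pub enum Backend {
    MySql,
    Postgres,
    Sqlite,
}

/// Maps a backend to the value used for the OpenTelemetry `db.system`
/// attribute in the semantic conventions.
pub fn db_system_str(backend: Backend) -> &'static str {
    match backend {
        Backend::MySql => "mysql",
        Backend::Postgres => "postgresql",
        Backend::Sqlite => "sqlite",
    }
}

/// Builds a span name following the "db.<entity>.<operation>" convention.
/// Hypothetical helper for illustration.
pub fn repo_span_name(entity: &str, operation: &str) -> String {
    format!("db.{entity}.{operation}")
}
```

Keeping the entity and operation in the span name makes slow repository calls easy to group in a trace UI without inspecting attributes.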

…side in-app metrics

Dual-write plugin and task metrics to OpenTelemetry on top of the existing in-memory and DB-backed stores so operators can read p50/p95/p99 latencies from any OTLP backend without losing the in-app dashboards.

- New `observability::metrics` module exposes stable instrument names (`codex.plugin.*`, `codex.task.*`, `codex.inventory.*`) and typed `PluginInstruments` / `TaskInstruments` wrappers built once against the global meter. A no-op `metrics_stub` mirror keeps call sites cfg-free under `--no-default-features`.
- `PluginMetricsService` and `TaskMetricsService` emit counters and duration histograms at every recording call site; rate-limit rejections and rate-limited task completions get distinct labels so dashboards can filter them out of error rates.
- In-flight task gauge is an observable gauge over an atomic toggled by an RAII `InFlightGuard` in the task worker, which catches every exit path of `process_next_task` (success, failure, error propagation) without scattering inc/dec across early returns.
- Inventory observable gauges (libraries, series, books, users, pages) are fed by a 30s background poller that refreshes a shared snapshot and exits on the existing background-task cancellation token.
- New axum middleware records `http.server.request.duration` in seconds with `http.request.method`, `http.route` (from `MatchedPath`), and `http.response.status_code` attributes; layered just inside the OTel server-span layer.
- Process CPU and memory gauges via `sysinfo` (added as an optional dependency gated on the `observability` feature). Rolled in-house because `opentelemetry-system-metrics` is pinned to opentelemetry 0.31 while we run 0.32.

Tests cover metric-name stability, no-op safety when no meter provider is installed, end-to-end emission through an in-memory exporter for plugin and task instruments, the in-flight saturation behaviour, and the inventory snapshot refresh path. Clippy is clean with and without the `observability` feature.
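
The RAII approach to the in-flight gauge can be sketched with the standard library alone. The name `InFlightGuard` comes from the commit message, but the fields, the `AtomicI64` counter, and the `process_task` function below are assumptions made to keep the example self-contained; the actual guard may use a different atomic type.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Increments a shared counter on creation and decrements it on drop.
pub struct InFlightGuard {
    counter: Arc<AtomicI64>,
}

impl InFlightGuard {
    pub fn enter(counter: Arc<AtomicI64>) -> Self {
        counter.fetch_add(1, Ordering::Relaxed);
        Self { counter }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        // Runs on every exit path: normal return, early return, `?`, or panic unwind.
        self.counter.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical stand-in for the worker's task-processing function.
/// Returns the in-flight count observed while the task was running.
pub fn process_task(counter: &Arc<AtomicI64>, fail: bool) -> Result<i64, String> {
    let _guard = InFlightGuard::enter(Arc::clone(counter));
    let observed = counter.load(Ordering::Relaxed);
    if fail {
        // Early return: no explicit decrement needed, the guard handles it.
        return Err("task failed".to_string());
    }
    Ok(observed)
}
```

An observable gauge callback would then only need to load the atomic, so the hot path never touches the metrics SDK directly.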

…ing proxy

Expose POST /api/v1/observability/otlp/v1/{traces,metrics} and forward the raw OTLP body to the operator-configured upstream collector with the configured auth headers stamped in. Inbound Content-Type is preserved; operator headers always win over anything supplied by the browser, so collector tokens stay server-side and no CORS hop is needed. Body is capped at 4 MiB and the upstream reqwest client lives in a OnceCell so the connection pool survives across requests.

A companion GET /api/v1/observability/config returns a redacted bootstrap payload (enabled flag, service name, proxy path, sample ratio) so the SPA can decide whether to start the SDK at all.

On the frontend, register the OpenTelemetry web SDK with WebTracerProvider + BatchSpanProcessor, an OTLP/HTTP exporter pointing at the proxy, ZoneContextManager, document-load, and fetch instrumentations. Only inject traceparent on same-origin requests so third-party CDNs and metadata sources never see Codex trace context. UserInteractionInstrumentation is restricted to click and submit to keep span volume in check. The heavy SDK is loaded via dynamic import only when the server-side config flag is on, so the default-off path pays nothing beyond a small bootstrap script and a single fetch.

AppState now carries an Arc<ObservabilityConfig> alongside the existing config Arcs, which the proxy and bootstrap handlers consume. The browser proxy uses FlexibleAuthContext so the SPA's existing cookie session authenticates the SDK's POSTs without any custom header plumbing.

Disabled by default; opt-in via observability.browser.enabled and a non-empty observability.otlp.endpoint.

Integration tests cover the auth gate, the disabled/enabled bootstrap payloads, the 503 path when RUM is off, and verbatim body + operator-header-wins behavior against an in-process capture upstream.
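
The "operator headers always win" rule and the 4 MiB body cap can be illustrated with a small standalone sketch. It does not reproduce the proxy handler from this PR; `merge_headers`, `body_within_cap`, and `MAX_BODY_BYTES` are hypothetical names, and which browser headers the real proxy forwards beyond Content-Type is an assumption.

```rust
use std::collections::BTreeMap;

/// Body size cap described in the commit message (4 MiB).
pub const MAX_BODY_BYTES: usize = 4 * 1024 * 1024;

/// Merges browser-supplied and operator-configured headers.
/// Header names are compared case-insensitively, and operator values
/// overwrite browser values on conflict.
pub fn merge_headers(
    browser: &[(&str, &str)],
    operator: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut merged = BTreeMap::new();
    // Operator entries are inserted last, so they replace any browser entry
    // with the same (lowercased) name.
    for (name, value) in browser.iter().chain(operator.iter()) {
        merged.insert(name.to_ascii_lowercase(), value.to_string());
    }
    merged
}

/// Returns true when the request body fits under the cap.
pub fn body_within_cap(body: &[u8]) -> bool {
    body.len() <= MAX_BODY_BYTES
}
```

Because the override is unconditional, a browser cannot substitute its own `Authorization` value for the collector token, which is what lets the token stay server-side.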

The OTLP scaffolding rewrite switched the subscriber install from `try_init().ok()` to `init()`, which panics via `set_global_default` when called twice in the same process. Tests that drive migrate + wait_for_migrations back to back (or migrate twice) tripped the panic on the second call. Restore `try_init().ok()` in both feature branches so a redundant init is a no-op instead of a panic. The disabled-observability path in `observability::init` was already idempotent, so no other changes are needed.
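
The difference between a strict install and an idempotent one can be shown with a set-once global from the standard library. This is only an analogy for the `init()` versus `try_init().ok()` behavior described above, not the `tracing-subscriber` API itself; `SUBSCRIBER`, `try_install`, `init_idempotent`, and `installed` are hypothetical names.

```rust
use std::sync::OnceLock;

/// Stand-in for a process-wide global subscriber slot.
static SUBSCRIBER: OnceLock<&'static str> = OnceLock::new();

/// Attempts to install a global value; fails if one is already set.
/// A strict `init()` would unwrap this result and panic on the second call.
pub fn try_install(name: &'static str) -> Result<(), &'static str> {
    SUBSCRIBER.set(name)
}

/// Idempotent install: a redundant call is silently ignored,
/// mirroring the `try_init().ok()` pattern.
pub fn init_idempotent(name: &'static str) {
    try_install(name).ok();
}

/// Returns the value that won the first install, if any.
pub fn installed() -> Option<&'static str> {
    SUBSCRIBER.get().copied()
}
```

The first caller wins and later callers become no-ops, which is the behavior tests need when they run migrations more than once in the same process.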

…sidecar

Document the opt-in OpenTelemetry pipeline end-to-end and make the dev environment exercise it by default:

- docs/docs/observability.md: full operator guide covering quickstart, backend matrix (SigNoz, Tempo, Honeycomb, Uptrace, DataDog), sampling guidance keyed to workload size, the span/metric inventory, browser RUM design, log-trace correlation, three disable granularities, and a troubleshooting checklist.
- docs/docs/configuration.md: new Observability Configuration section mirroring the Rust schema with defaults, env-override names, and forward links to the operator guide, plus an entry in the common env-var block.
- docker-compose.yml: bundled jaeger sidecar on the dev profile (same pattern as mailhog), accepting OTLP on 4317/4318 and serving the UI on 16686. codex-dev and codex-dev-worker are pre-wired with CODEX_OBSERVABILITY_* env vars pointing at http://jaeger:4317, so `make dev-up` produces a fully working backend-plus-collector loop with no YAML edit.
- config/config.docker.yaml, config.sqlite.yaml, config.kubernetes.yaml: commented-out observability blocks for schema discoverability. The templates stay disabled so the files are safe to reuse outside the dev compose without surprise telemetry export; the dev override lives at the compose layer only.
- src/observability/repo.rs: added an `#[ignore]`d microbench measuring per-call cost of `#[tracing::instrument]` with and without a subscriber attached (~13 ns disabled, ~400 ns enabled). Runs via `cargo test --release -- --ignored bench_instrumentation_overhead`.

…, pin Jaeger tag

Bump the codex-dev healthcheck start_period from 30s to 900s so a cold-cache `cargo build` inside the container (which can take 10+ minutes) does not exhaust the retry budget and mark the service unhealthy. While start_period is in effect, failing checks do not count toward `retries`, so this keeps codex-dev-worker (which depends on `codex-dev: service_healthy`) from also failing on first boot.

Add a `make dev-logs-jaeger` target to match the existing per-service log shortcuts.

Pin the Jaeger all-in-one image to `1.62.0` in both the dev compose file and the observability docs so the published quickstart cannot silently drift onto a different patch release.

AshDevFr added 2 commits May 22, 2026 19:02

AshDevFr force-pushed the otlp-traces branch from 021f5fb to 4ddd39d Compare May 23, 2026 02:03

AshDevFr added 5 commits May 22, 2026 20:45

AshDevFr merged commit 3771a12 into main May 23, 2026
19 checks passed

AshDevFr deleted the otlp-traces branch May 23, 2026 18:32

AshDevFr merged 7 commits into main from otlp-traces
