Skip to content

realtime(observability): forward diagnostics + stats over the signaling WS#148

Open
nagar-decart wants to merge 6 commits into
mainfrom
eyal/observability-bouncer-ingest
Open

realtime(observability): forward diagnostics + stats over the signaling WS#148
nagar-decart wants to merge 6 commits into
mainfrom
eyal/observability-bouncer-ingest

Conversation

@nagar-decart
Copy link
Copy Markdown
Contributor

@nagar-decart nagar-decart commented May 29, 2026

Summary

Pipes client-side WebRTC / ICE / networking observability events through the existing realtime WebSocket as {type: "observability", data} messages. Bouncer logs them under the session's structured-log context in Datadog (DecartAI/api#1882), so SDK-side observations land correlated with our server-side LiveKit / inference traces without any new endpoints, auth, or transport.

Context: while debugging the 29CM connection failures we had rich server-side LiveKit / ICE logs but no symmetric visibility into what the SDK's own WebRTC engine was producing on the client. This closes the gap.

What flows over the WS now

  • Every diagnostic emitted from RealtimeObservability.diagnostic():
    • client-session-connection-breakdown (with per-phase timings + error)
    • reconnect
    • videoStall
  • Every periodic WebRTC stats snapshot collected by the in-process stats collector (video / outbound / candidate pairs / etc.)

Wiring

  • SignalingChannel.sendObservability(data) — new public, fire-and-forget method that writes {type: "observability", data} (drops silently if socket isn't open, never throws).
  • RealtimeObservability.setObservabilityForwarder(fn | null) — sink set by StreamSession.createTransport() after the signaling channel is constructed, and cleared on tearDown().
  • ObservabilityMessage added to the OutgoingRealtimeMessage union.

Intentionally additive to existing observability: the local onDiagnostic / onStats callbacks and the platform telemetry POSTs keep working as before.

Diff size

4 files, +49 / -1 lines. No new dependencies, no new config.

Test plan

  • pnpm -F @decartai/sdk build — passes
  • pnpm -F @decartai/sdk test — 206 tests pass
  • Verify in webrtc-connect-test that a session with this SDK build produces service:bouncer-realtime "RTAPI: client observability" log lines in Datadog us5

🤖 Generated with Claude Code


Note

Medium Risk
Touches live session transport and relies on LiveKit private APIs for peer-connection hooks; failures are mostly best-effort but could add listener/polling overhead during connect.

Overview
Adds client-side connection observability on the realtime SDK path: diagnostics and ICE/WebRTC debug events are sent on the existing signaling WebSocket as { type: "observability", data }, so bouncer can log them next to server traces (e.g. Datadog).

Wiring: StreamSession registers an observability forwarder on RealtimeObservability that calls SignalingChannel.sendObservability (best-effort, no throw). Typed ObservabilityMessage is added to outgoing WS messages. Forwarder is cleared on teardown.

What gets forwarded: All diagnostic() payloads (connection breakdown, reconnect, video stall, etc.) plus free-form emitInstrumentationEvent streams. Periodic WebRTC stats are not forwarded over the WS (still go to local callbacks and telemetry POST) to avoid noise.

Deep ICE instrumentation: New attachRoomInstrumentation runs after LiveKit room.connect() from MediaChannel, reaches private room.engine.pcManager peer connections, and emits candidates, ICE/PC state, selected pairs, browser online/network-change, and selected room/signaling events (livekit_room_info without token). Routine prompt/generation ack traffic is suppressed on the signaling side.

Reviewed by Cursor Bugbot for commit deeaca7. Bugbot is set up for automated code reviews on this repo. Configure here.

nagar-decart and others added 2 commits May 29, 2026 15:19
…ng WS

Pipes client-side WebRTC / ICE / networking observability events through
the existing realtime WebSocket as `{type: "observability", data}`
messages. Bouncer logs them under the session's structured-log context
in Datadog (DecartAI/api#1882), so SDK-side observations land
correlated with our server-side LiveKit / inference traces without any
new endpoints, auth, or transport.

What flows over the WS now:
- Every diagnostic emitted from `RealtimeObservability.diagnostic()`
  (e.g. `client-session-connection-breakdown`, `reconnect`, `videoStall`)
- Every periodic WebRTC stats snapshot collected by the in-process
  stats collector

Wiring:
- `SignalingChannel.sendObservability(data)` — new public, fire-and-forget
  method that writes `{type: "observability", data}` (drops silently if
  the socket isn't open; never throws).
- `RealtimeObservability.setObservabilityForwarder(fn | null)` — sink
  set by `StreamSession.createTransport()` after the signaling channel
  is constructed, and cleared on `tearDown()`.
- `ObservabilityMessage` added to `OutgoingRealtimeMessage` union.

This is intentionally additive to existing observability: the local
`onDiagnostic`/`onStats` callbacks and the platform telemetry POSTs
keep working as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented May 29, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@decartai/sdk@148

commit: deeaca7

Comment thread packages/sdk/src/realtime/observability/realtime-observability.ts Outdated
The existing observability sink only emitted three diagnostic types
(client-session-connection-breakdown, reconnect, videoStall) plus
periodic WebRTC stats — useful for metrics, useless for actually
debugging an ICE failure. This adds raw event-level instrumentation
that mirrors the iOS PR DecartAI/decart-ios#28's
DecartLiveKitLogger + NetworkPathObserver coverage.

New file: observability/network-instrumentation.ts

Attaches addEventListener-based listeners to:
- The LiveKit Room (Connected/Disconnected/Reconnecting/Reconnected,
  SignalReconnecting, ConnectionStateChanged, ConnectionQualityChanged,
  MediaDevicesError, LocalTrackPublished, ParticipantConnected,
  TrackSubscribed/Muted/Unmuted).
- The underlying publisher + subscriber RTCPeerConnections (private
  access via room.engine.pcManager, with addEventListener so we never
  displace LiveKit's own handlers): icecandidate, icecandidateerror,
  iceconnectionstatechange, connectionstatechange,
  icegatheringstatechange, signalingstatechange, negotiationneeded,
  datachannel, track. ICE candidates carry full address/port/type
  /priority/foundation. Selected candidate pair (with RTT + addresses)
  is snapshotted from getStats() when ICE settles.
- Browser network state (initial snapshot of navigator.connection +
  online/offline + visibilitychange + NetworkInformation 'change').

New methods:
- RealtimeObservability.emitInstrumentationEvent(name, data)
  bypasses the strict DiagnosticEvents type union so per-candidate
  events can flow without bloating the public type surface.

Signaling traffic:
- SignalingChannel.writeMessage emits 'signaling-sent' (skips
  'observability' to avoid recursing).
- SignalingChannel.handleMessage emits 'signaling-received'.

All events flow through the same observabilityForwarder ->
realtime WS -> bouncer -> Datadog @event.* path that the previous
change established. No new transport, no new endpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/sdk/src/realtime/signaling-channel.ts Outdated
}

// Re-export for convenience so consumers know what's available.
export { Track };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused Track import and re-export

Low Severity

Track is imported from livekit-client and re-exported at line 370, but it's never used within this file, and no other file imports Track from network-instrumentation. This is dead code — the only import from this module is attachRoomInstrumentation.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f58f9c7. Configure here.

…ebug signal

Earlier instrumentation pass dumped everything over the WS — periodic
WebRTC stats (every 1s), every signaling frame, every signalingstatechange,
every track/visibility/connection-quality event. In a 15-min Datadog
window with 3 successful sessions, stats alone produced 141 of 250 events
(56%) while actual ICE-level data was ~7%. That signal-to-noise ratio
makes the stream useless for debugging connection failures.

Forward only what helps diagnose WHY a connection fails or stalls:

DROPPED (noise during a healthy session, not useful for failure debug):
- Periodic WebRTC stats (still local via onStats + platform telemetry)
- signaling-sent for every outbound frame
- signaling-received for routine generation_tick / prompt_ack / set_image_ack
- signaling-state on PC (SDP renegotiation chatter)
- negotiation-needed, track-received, data-channel-opened on PC
- track-subscribed/-muted/-unmuted, local-track-published,
  participant-connected, connection-quality on Room
- page-visibility

KEPT (the ICE / connection debug picture):
- ice-candidate, ice-candidate-past, ice-candidate-error
- ice-connection-state, ice-gathering-state, pc-connection-state
- pc-attached (with snapshot of current state)
- selected-candidate-pair (with local/remote addresses + RTT)
- connected-address (LiveKit's getter for the chosen address)
- room-connected/-disconnected/-reconnecting/-signal-reconnecting/-reconnected
- room-connection-state
- media-devices-error
- network-state (initial + on change), browser-online/-offline
- signaling-received for livekit_room_info / session_id (the only
  signaling messages that matter for connection setup)
- client-session-connection-breakdown (the breakdown diagnostic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 4518087. Configure here.

this.options.onStats?.(stats);
this.telemetryReporter.addStats(stats);
this.detectVideoStall(stats);
// Stats intentionally not forwarded over the WS. They fire every 1s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stop() doesn't clear observabilityForwarder like other fields

Low Severity

The stop() method resets every other private field (statsCollector, telemetryReporter, connectionBreakdown, etc.) but does not clear observabilityForwarder. After stop() is called, diagnostic() and emitInstrumentationEvent() still forward payloads over the WebSocket. In client.ts, observability.stop() is called before session.disconnect(), creating a window where the "stopped" observability instance continues sending data.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4518087. Configure here.

nagar-decart and others added 2 commits May 29, 2026 16:40
- pc-attached now carries the PC's iceServers config (URLs only;
  credentials redacted to presence-bools) + iceTransportPolicy.
  Directly answers "did the JoinResponse give us STUN/TURN at all?",
  which was the core unknown in the 29CM ICE-failure investigation.
- room-disconnected now decodes the numeric DisconnectReason enum
  into a readable string (signal_close, connection_timeout,
  media_failure, etc.). Mirrored in the test mock.
- livekit_room_info signaling event now carries livekitUrl /
  roomName / sessionId — tells us which SFU node owns the room
  without needing to grep server-side logs.
- Dropped room-connection-state (duplicated room-connected /
  -disconnected / -reconnecting).
- Dropped connected-address (redundant with
  selected-candidate-pair.remote.address).
- ice-candidate trimmed from 15 fields to 12: drop raw SDP `candidate`
  string, sdpMid, sdpMLineIndex, usernameFragment. Keep type, protocol,
  address, port, priority, foundation, component, tcpType,
  relatedAddress/Port, networkType, url.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TS lib used in CI typecheck (different from the DOM types vitest
uses locally) doesn't expose RTCIceCandidateStats. The
RTCIceCandidatePairStats reference also tripped because the same lib
exposes it under a slightly different shape — narrowing failed and
field accesses landed on 'never'.

Replaced with two minimal structural types (IceCandidateStat,
IceCandidatePairStat) covering only the fields we actually read from
getStats(). Functional behavior is unchanged.

Also cast the session_id signaling case through `unknown` — it's sent
by the bouncer but not in the typed IncomingRealtimeMessage union, so
TS narrowed the comparison to never.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant