Skip to content

feat(site-agent): log trace id on gRPC client responses#3005

Open
CTOmari360 wants to merge 1 commit into
NVIDIA:mainfrom
CTOmari360:feat/site-agent-grpc-log-trace-id
Open

feat(site-agent): log trace id on gRPC client responses#3005
CTOmari360 wants to merge 1 commit into
NVIDIA:mainfrom
CTOmari360:feat/site-agent-grpc-log-trace-id

Conversation

@CTOmari360

Copy link
Copy Markdown
Contributor

Thread the per-RPC OpenTelemetry trace id into the site-agent gRPC client response logs so an operator can correlate a log line back to the workflow that issued the call.

Today RecordRpcResponse logs only method, code, and duration:

method=/forge.Forge/FindMachinesByIds, code=DeadlineExceeded, duration=36.817µs

There is no way to tell which workflow triggered the RPC, which makes reasoning about e.g. a DeadlineExceeded impossible. The Flow and Core gRPC clients are already otelgrpc-instrumented, so every call carries a span; this passes the call context.Context into the Metrics.RecordRpcResponse callback and tags the log with the span's trace_id when present.

Related issues

Fixes #2650

Type of Change

  • Add - New feature or capability

Breaking Changes

  • This PR contains breaking changes

Metrics.RecordRpcResponse gains a leading ctx context.Context parameter. The interface has only two in-tree implementations (coregrpc, flowgrpc) and two call sites (the unary and stream interceptors), all updated here — no other implementers exist in the repo.

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Added TestTraceIDFromContext (table-driven, testify) covering: no span context → "", valid span → hex-encoded trace id, all-zero trace id → "". go build, go vet, gofmt, and go test are green for the changed packages.

Additional Notes

  • The new client.TraceIDFromContext helper is panic-safe: it returns "" when the context carries no valid span, so the field is omitted rather than logged empty, and it does not depend on the call originating inside a Temporal activity.
  • The existing log message text is unchanged; only a structured trace_id field is added, matching the existing zerolog .Str(...) style elsewhere in site-agent.
  • Scoped to the live RecordRpcResponse path; the unrelated, currently-unused RecordLatency/WorkflowMetrics callback is left untouched to keep the change single-purpose.

The gRPC client metrics callback logged only method, code, and duration,
so an operator reading e.g. a DeadlineExceeded line could not tell which
workflow issued the call. The Flow and Core gRPC clients are already
otelgrpc-instrumented, so every RPC carries a span; thread the call
context into Metrics.RecordRpcResponse and tag the log with the span's
trace id when present, letting operators thread together all logs for a
single workflow.

- Add ctx to client.Metrics.RecordRpcResponse and pass it from the unary
  and stream interceptors.
- Add client.TraceIDFromContext, a panic-safe helper returning the
  hex-encoded trace id (or "" when absent), covered by a unit test.
- Log trace_id via the existing zerolog style in the coregrpc and
  flowgrpc implementations; the existing message text is unchanged.

Fixes NVIDIA#2650

Signed-off-by: Omar Refai <omar@refai.org>
@CTOmari360 CTOmari360 requested a review from a team as a code owner June 30, 2026 03:02
@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 27cbfae1-6794-46aa-a733-1dad23e391e0

📥 Commits

Reviewing files that changed from the base of the PR and between d63a8ad and fb15808.

📒 Files selected for processing (4)
  • rest-api/site-agent/pkg/components/managers/coregrpc/metrics.go
  • rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
  • rest-api/site-workflow/pkg/grpc/client/metrics.go
  • rest-api/site-workflow/pkg/grpc/client/metrics_test.go

Summary by CodeRabbit

  • New Features

    • RPC metrics now include trace ID information when available, improving request correlation in logs and monitoring.
    • Added support for passing request context through RPC metric reporting.
  • Tests

    • Added coverage for trace ID extraction from request context, including missing and invalid trace IDs.

Walkthrough

The Metrics interface's RecordRpcResponse method gains a context.Context parameter. A new TraceIDFromContext helper extracts the OpenTelemetry trace ID from a span context. Both coregrpc and flowgrpc implementations conditionally append trace_id to debug log events when a non-zero trace ID is present.

Changes

Trace ID propagation through gRPC client metrics

Layer / File(s) Summary
Metrics interface, TraceIDFromContext helper, and interceptor wiring
rest-api/site-workflow/pkg/grpc/client/metrics.go, rest-api/site-workflow/pkg/grpc/client/metrics_test.go
RecordRpcResponse gains a ctx context.Context first parameter; TraceIDFromContext is added to extract a hex trace ID from the OTel span context, returning "" for absent or all-zero IDs; stream interceptor passes ctx to the updated call; table-driven tests cover no-span, valid, and zero-trace-ID cases.
coregrpc and flowgrpc implementation updates
rest-api/site-agent/pkg/components/managers/coregrpc/metrics.go, rest-api/site-agent/pkg/components/managers/flowgrpc/metrics.go
Both grpcClientMetrics.RecordRpcResponse implementations are updated to accept ctx, construct a structured debug log event, and conditionally attach trace_id via client.TraceIDFromContext(ctx) before emitting the method, code, and duration fields.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately reflects the main change: logging trace IDs on gRPC client responses.
Description check ✅ Passed The description matches the implemented change and explains the rationale, scope, and testing clearly.
Linked Issues check ✅ Passed The PR adds trace-context logging so responses can be correlated back to workflows, satisfying #2650's tracing goal.
Out of Scope Changes check ✅ Passed The added helper, call-site updates, and tests are all directly supporting the trace ID logging change.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@ajf

ajf commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

/ok to test fb15808

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
nico-flow 43 3 21 9 2 8
nico-nsm 49 0 10 31 8 0
nico-psm 43 3 21 9 2 8
nico-rest-api 43 3 21 9 2 8
nico-rest-cert-manager 42 3 21 9 1 8
nico-rest-db 43 3 21 9 2 8
nico-rest-site-agent 42 3 21 9 1 8
nico-rest-site-manager 42 3 21 9 1 8
nico-rest-workflow 43 3 21 9 2 8
TOTAL 390 24 178 103 21 64

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@github-actions

Copy link
Copy Markdown

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-06-30 20:04:36 UTC | Commit: fb15808

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: site-agent logging should contain more context information

3 participants