feat: introduce bespoke tracing crate #2022

Open
Mirko-von-Leipzig wants to merge 32 commits into main from mirko/tracing

Conversation

@Mirko-von-Leipzig
Collaborator

@Mirko-von-Leipzig Mirko-von-Leipzig commented Apr 30, 2026

This PR introduces a highly opinionated framework around the tracing and open-telemetry crates in an attempt to standardize and remove "pits of failure".

Warning

AI code, human driver + API design.

I still need to do another review pass, but since it's a long weekend here I'm opening this early for critique.

This is inspired by the discussion in #1949 and the ideas in #1500/#1835.

Problems this addresses

  1. Field names are all over the show. We have transaction.id, tx.id, tx_id, transaction_id, tx etc.
  2. Errors use Display and not error chaining, dropping context.
  3. We use events instead of spans in many locations.
  4. We use errors as events instead of attaching error attributes to spans.
  5. Panics aren't always associated with their span trace.
  6. Our stdout log sink gets flooded with our traces, rendering them useless.
  7. target = COMPONENT is overly broad.
  8. We cannot change filters dynamically at runtime.

Usage

New macros, same as the old macros, but different

The miden-node-tracing crate introduces opinionated re-imaginings of the tracing family of macros. The macros all follow the same syntax: target name level user? (level is of course skipped where it is implied e.g. info!).

Event macros also require a justification = "..." because I want to encourage more spans and fewer events.

Spans and events can only be created via macros. This lets us do additional cool things like keeping a list of all events and spans created.

// Instruments a function; these are required in order. No fields allowed.
#[instrument(store::database, "load block header", debug)]
async fn load_header() -> Result<(), Error> { /* ... */ }

// Appending `user` marks the span for user-facing stdout logs as well as OTEL export.
#[instrument(sequencer::block_builder, "build block", info, user)]
async fn build_block() -> Result<(), Error> { /* ... */ }

// Creates a span manually; same syntax order, no fields.
let span = info_span!(sequencer::block_builder, "build block");

// Also supports `user`.
let span = info_span!(rpc, "submit transaction", user);

// Events require a justification because spans are preferred for normal work.
// The justification isn't used at runtime; it's purely a code-audit tool.
let event = warn!(rpc, "request used stale block", justification = "good reason");

// `user` also supported for events.
let event = info!(sequencer::block_builder, "block accepted", user, justification = "user-visible block production milestone");

Recording attributes on spans and events is not allowed as part of the macros; instead it is done via function calls to be more explicit.

// Records a field using its trait-defined default key.
// Output: `block.number=<value>`
span.record_field(&block_number);

// Records a field using an explicit key override.
// Output: `request.account.id=<value>`
span.record_field_as(&account_id, "request.account.id");

// Records a field for both traces and user-facing stdout logs.
// Trace output: `miden.user.transaction.id=<value>`; stdout output: `transaction.id=<value>`
span.record_user_field(&transaction_id);

// Records a user-facing field with an explicit key override.
// Trace output: `miden.user.batch.id=<value>`; stdout output: `batch.id=<value>`
span.record_user_field_as(&batch_id, "batch.id");

// Records an object using its trait-defined default prefix.
// Output: `block.header.hash=<value>`, `block.header.prev_hash=<value>`, ...
span.record_object(&block_header);

// Records an object using an explicit prefix override.
// Output: `candidate_block.hash=<value>`, `candidate_block.prev_hash=<value>`, ...
span.record_object_as(&block_header, "candidate_block");

// Same recording style is available on events before `emit`/drop.
// Output: `proof.kind=<value>`
event.record_field(&proof_kind);

Features

Most features are driven by the re-imagined macros.

User logs

Macros (both events and spans) have an optional user tag which indicates to the stdout exporter that this item should be logged. Tagged events are logged verbatim; spans are expected to be named verb noun and are logged as verb noun completed/failed based on the span's error status.

The stdout exporter also has a filter separate from the OTEL exporter's.

Field name unification

The macros don't accept fields as part of their syntax. Instead we have Field and Object traits which define the default property names of our types and how their values can be recorded on events and spans. This recording is done manually, i.e. span.record_field(&tx_id); in theory there is nothing stopping us from doing this in the macros as well, but I prefer this being a bit more explicit.

We also have record_user variants which include the field as part of the stdout log. The intention is that these would be used for attributes important to the basic user, e.g. account ID, transaction ID, note ID.

Dynamic filters

Both user and otel trace filters can be configured at runtime. The idea is to have an admin interface which allows setting these, e.g. a basic REST API.

In addition, because we use macros for everything, we can use inventory to list all spans and events and their metadata (target, name, level). This could be exposed in the admin UI to make it easier to know what is available. Another use is a test case which gives an overview of them all, e.g. using insta to keep an automated and visible list within a unit test.

Targets

The macros have a closed enum of valid target paths. This means we can easily control the list and ensure it's sensible.

The current list is purely an example. Actual usage would begin empty and be filled over time.

Panics

Panics sometimes dangle and can be hard to find. I think I've solved this, though it's hard to be certain; maybe it's silly. The panic handler now emits a specific event which is internally handled by the exporters and not actually exported. This lets the exporter find a parent/root span to attach it to, even if the panicked span itself is disabled.

Errors

#[instrument] now doesn't emit an error! on error but instead attaches the error to the span and uses an error report. This doesn't help for manual span!, but it's better than nothing.

Bummers

It's non-trivial to make this general purpose, because of the orphan rule for the macros and because the macros need access to the whitelist of allowed targets.

The orphan rule might be solvable with a lot more generics and additional traits. The whitelist might need a required build.rs, or could be abandoned in favor of a snapshot test that keeps it bounded, at the cost of fewer compile-time checks.

@Mirko-von-Leipzig Mirko-von-Leipzig added the no changelog (This PR does not require an entry in the `CHANGELOG.md` file) and telemetry (Relates to tracing and open-telemetry) labels on Apr 30, 2026
@Mirko-von-Leipzig Mirko-von-Leipzig marked this pull request as ready for review April 30, 2026 15:54
@Mirko-von-Leipzig
Collaborator Author

Mirko-von-Leipzig commented Apr 30, 2026

This is a large chunk of code; I can break it up into smaller iterations, though maybe it isn't too bad since it's all additive.

I don't think a deep review is needed; initially just a brief once-over to see if you like the ideas, then we decide whether this is worth doing or whether we should drop it. Then we decide how to approach a review.

Similar to the gRPC server work, this can be adopted in chunks. It might be good to do this alongside the store embedding for example.

@Mirko-von-Leipzig
Collaborator Author

cc @bitwalker @huitseeker to tell me this is a terrible idea :)

Contributor

@huitseeker huitseeker left a comment


The telemetry policy is valuable (fixed trace targets, stable field names, typed domain fields, a span/event catalog, and a clear split between internal traces and user-facing logs), but the implementation makes very poor reuse of functionality that's already available as standard issue in either tracing itself, or in telemetry (see comments inline).

Reload, panic capture, error span context, event-to-log export, and stdout formatting are already covered by tracing ecosystem crates. Please have a look at tracing_subscriber::reload, consider tracing-panic, tracing-error, opentelemetry-appender-tracing, and fmt::Layer, and make #[instrument] either clearly Result-only or closer to upstream tracing::instrument.

let env_filter = parse_filter(&filter)?;

self.set_inner(env_filter);
self.set_current(filter);
Contributor


Changing the filter value here does not refresh tracing's callsite interest cache. With an initial filter like info, a debug! callsite can be registered as Interest::never; after set_otel_filter("rpc=debug"), that callsite may still be skipped before this new EnvFilter is consulted.

The usual fix is to rebuild the callsite cache after swapping the filter, or to route this through tracing_subscriber::reload, which already does that.

::miden_node_tracing::Span::current().__record_metadata(#target, #level_name);
#mark_user_span
let __miden_node_tracing_result = (async #block).await;
if let ::core::result::Result::Err(ref __miden_node_tracing_error) =
Contributor


This macro looks like a general #[instrument], but the generated body assumes the function returns Result. If someone uses it on an infallible function, the error will be a confusing type mismatch from this generated if let.

Collaborator Author


Definitely unintentional, it should follow the original non-result/result split.

/// Default filter used when callers do not provide an initial exporter filter.
pub(crate) const DEFAULT_FILTER: &str = "info";

const ALLOWED_TARGETS: &[&str] = &[
Contributor


This target list is also defined in the proc-macro crate. If the two lists drift, a target can compile but be blocked by the runtime filter, or a filter can accept a target no macro can create.

Could both lists come from one source, or be checked by a test that fails when they differ?

Collaborator Author


There should be a single list. At the moment this means this would live in the macro crate which is a bit.. meh.

I'm thinking of ways to generalize this because we have multiple binary crates (and will have more in the future), and it would be nice to somehow have these be separated.

So far the best I've come up with is either:

  • no compile time list, list is enforced via insta style test maintained by each binary
  • build.rs injects each crate's list
  • stick with a single central list in the macro crate

rendered.push(' ');
rendered.push_str(key);
rendered.push('=');
rendered.push_str(&attribute.value.as_str());
Contributor


These values go straight into a single stdout line. If a user-facing field contains a newline, terminal control character, or spaces, it can make the operator log misleading or hard to parse.

Collaborator Author


iiuc you want to escape the string? tbh the stdout is meant only for local-node users as something to read, not really to parse. Ideally we wouldn't supply a stdout at all.

/// User-facing fields are explicitly selected to keep local stdout logs focused. The stored
/// attribute key is prefixed so exporters can strip the marker prefix for display while still
/// distinguishing these fields from trace-only attributes.
pub fn record_user_field<F>(&self, field: &F)
Contributor


record_user_field makes any field show up in operator stdout, but the type system does not say whether that field is safe for user-facing logs. It's easy to accidentally expose verbose or sensitive data later, e.g. a private key.

Collaborator Author


I don't think I've seen this handled in tracing before, but it would be nice to somehow have this guarantee. Given that secrets should never be logged or emitted, they should simply never implement the traits; aside from that I'm unsure how we're supposed to do anything about it?

Comment thread crates/tracing/src/lib.rs
INSTALL.call_once(|| {
let previous_hook = std::panic::take_hook();
std::panic::set_hook(Box::new(move |info| {
internal::emit_panic(info);
Contributor


This panic hook is close to what tracing-panic already does: emit panic payload, location, and optional backtrace as a tracing event. I verified the preserving-hook pattern compiles locally:

let previous = std::panic::take_hook();
std::panic::set_hook(Box::new(move |panic_info| {
    tracing_panic::panic_hook(panic_info);
    previous(panic_info);
}));

Could this hook delegate to tracing-panic, and keep only the Miden-specific span-status/fallback behavior locally?

Collaborator Author

@Mirko-von-Leipzig Mirko-von-Leipzig May 2, 2026


afaik the tracing-panic hook doesn't allow us to attach the panic to the root span? But maybe I've misunderstood how events attach to spans.

The problem I want to address is when a panic event is emitted within a disabled span: e.g. we panic inside a non-root debug_span! while the filter is set to info, so the event dangles. What I want is for the panic event to still attach at least to the root span if it's enabled.

Maybe this is already the case and I've misunderstood. But I do see "floating" panic events occasionally in the node and that concerns me.

Contributor


I show some ways of tying the span in #2033.

lines.push(format_span(&span));
}

for event in span.events.events.iter().filter(|event| is_user_facing(&event.attributes))
Contributor


For events that are meant to be logs, opentelemetry-appender-tracing bridges tracing events into OTel logs directly. And this works with the current OTel versions:

use opentelemetry_appender_tracing::layer::OpenTelemetryTracingBridge;
use opentelemetry_sdk::logs::SdkLoggerProvider;

let logger_provider = SdkLoggerProvider::builder()
    .with_simple_exporter(opentelemetry_stdout::LogExporter::default())
    .build();
let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);

User-facing events should use an OTel log layer, leaving this exporter to handle only the span-completed/span-failed operator summaries.

Collaborator Author


But we don't have tracing events here; we have otel events, and tracing cannot see those.

Contributor


Yes, TIL.

}
}

impl SpanExporter for UserFacingStdoutExporter {
Contributor


This exporter hand-rolls line formatting, escaping, and writing. tracing-subscriber's fmt layer already handles structured stdout output and writer plumbing. This is standard issue:

use tracing_subscriber::filter::LevelFilter;
use tracing_subscriber::prelude::*;

let subscriber = tracing_subscriber::registry().with(
    tracing_subscriber::fmt::layer()
        .with_ansi(false)
        .json()
        .with_filter(LevelFilter::INFO),
);

Could the Miden-specific part be just a miden.user filter layered over fmt::Layer, instead of a custom stdout exporter?

Collaborator Author


tracing-subscriber unfortunately only sees tracing fields and not open-telemetry fields, so it's blind to the record family of functions we introduce here.

Tracing fields have to be statically known at compile time and must be set as part of the macros, which is both tedious and a pain; this is where fields(x = field::Empty) becomes required in tracing::instrument and other macro calls to "pre-register" all fields.


quote! {
#(#attrs)*
#[::miden_node_tracing::__private::tracing::instrument(#tracing_args)]
Contributor

@huitseeker huitseeker May 2, 2026


Since this expands to tracing::instrument anyway, it may be worth leaning on the upstream macro for most behavior. tracing already supports the core options this wrapper is generating, including err. I verified this compiles locally:

#[tracing::instrument(
    skip_all,
    target = "rpc",
    name = "rpc::get_block",
    level = "info",
    err
)]
fn get_block() -> Result<(), std::io::Error> {
    Err(std::io::Error::other("boom"))
}

This macro could stay focused on Miden policy, like target validation and metadata registration, while deferring as much instrumentation behavior as possible to tracing::instrument?

Collaborator Author


err does not do the correct thing, unfortunately. It uses Display (or Debug), neither of which actually prints an error report. So it's either a Debug impl or only the top-level error, which is not helpful.

It also emits an error! event, which the otel-tracing bridge then converts into marking the span as an error, essentially doubling the error.

Or in code:

// tracing::instrument
error!(err.to_string());
Span::current().set_error(&err.to_string());

// PR impl
Span::current().set_error(err.to_report());

Contributor


I have tried to set up specific tests for this in the PoC at #2033

use syn::{Expr, ExprPath, Path, PathArguments};

const ALLOWED_TARGETS: &[&str] = &[
Contributor


For the target taxonomy, a shared enum would avoid keeping string lists in sync manually. One option is strum for parsing/displaying target names from a normal shared crate used by both the proc macro and runtime crates.

At a high level, could the macro parser and runtime filter share one target definition instead of maintaining separate ALLOWED_TARGETS slices?
