feat: introduce bespoke tracing crate#2022
Conversation
|
This is a large chunk of code, I can break it up into smaller iterations, though maybe it isn't too bad since its all additive. I don't think a deep review is needed; initially just a brief easy over check, see if you like the ideas, then we decide whether this is worth doing, or whether we should drop it. Then we decide how to approach a review. Similar to the gRPC server work, this can be adopted in chunks. It might be good to do this alongside the store embedding for example. |
|
cc @bitwalker @huitseeker to tell me this is a terrible idea :) |
There was a problem hiding this comment.
The telemetry policy is valuable (fixed trace targets, stable field names, typed domain fields, a span/event catalog, and a clear split between internal traces and user-facing logs), but the implementation makes very poor reuse of functionality that's already available as standard issue in either tracing itself, or in telemetry (see comments inline).
Reload, panic capture, error span context, event-to-log export, and stdout formatting are already covered by tracing ecosystem crates. Please have a look at tracing_subscriber::reload, consider tracing-panic, tracing-error, opentelemetry-appender-tracing, and fmt::Layer, and make #[instrument] either clearly Result-only or closer to upstream tracing::instrument.
| let env_filter = parse_filter(&filter)?; | ||
|
|
||
| self.set_inner(env_filter); | ||
| self.set_current(filter); |
There was a problem hiding this comment.
Changing the filter value here does not refresh tracing's callsite interest cache. With an initial filter like info, a debug! callsite can be registered as Interest::never; after set_otel_filter("rpc=debug"), that callsite may still be skipped before this new EnvFilter is consulted.
The usual fix is to rebuild the callsite cache after swapping the filter, or to route this through tracing_subscriber::reload, which already does that.
| ::miden_node_tracing::Span::current().__record_metadata(#target, #level_name); | ||
| #mark_user_span | ||
| let __miden_node_tracing_result = (async #block).await; | ||
| if let ::core::result::Result::Err(ref __miden_node_tracing_error) = |
There was a problem hiding this comment.
This macro looks like a general #[instrument], but the generated body assumes the function returns Result. If someone uses it on an infallible function, the error will be a confusing type mismatch from this generated if let.
There was a problem hiding this comment.
Definitely unintentional, it should follow the original non-result/result split.
| /// Default filter used when callers do not provide an initial exporter filter. | ||
| pub(crate) const DEFAULT_FILTER: &str = "info"; | ||
|
|
||
| const ALLOWED_TARGETS: &[&str] = &[ |
There was a problem hiding this comment.
This target list is also defined in the proc-macro crate. If the two lists drift, a target can compile but be blocked by the runtime filter, or a filter can accept a target no macro can create.
Could both lists come from one source, or be checked by a test that fails when they differ?
There was a problem hiding this comment.
There should be a single list. At the moment this means this would live in the macro crate which is a bit.. meh.
I'm thinking of ways to generalize this because we have multiple binary crates (and will have more in the future), and it would be nice to somehow have these be separated.
So far the best I've come up with is either:
- no compile time list, list is enforced via
instastyle test maintained by each binary build.rsinjects each crates list- stick with a single central list in the macro crate
| rendered.push(' '); | ||
| rendered.push_str(key); | ||
| rendered.push('='); | ||
| rendered.push_str(&attribute.value.as_str()); |
There was a problem hiding this comment.
These values go straight into a single stdout line. If a user-facing field contains a newline, terminal control character, or spaces, it can make the operator log misleading or hard to parse.
There was a problem hiding this comment.
iiuc you want to escape the string? tbh the stdout is meant only for local-node users as something to read - not really to parse or anything. Ideally we wouldn't even supply a stdout at all.
| /// User-facing fields are explicitly selected to keep local stdout logs focused. The stored | ||
| /// attribute key is prefixed so exporters can strip the marker prefix for display while still | ||
| /// distinguishing these fields from trace-only attributes. | ||
| pub fn record_user_field<F>(&self, field: &F) |
There was a problem hiding this comment.
record_user_field makes any field show up in operator stdout, but the type system does not say whether that field is safe for user-facing logs. It's easy to accidentally expose verbose or sensitive data later, e.g. a private key.
There was a problem hiding this comment.
I don't think I've seen this handled in tracing before but it would be nice to somehow have this guarantee. Given that secrets should never be logged or emitted, they should simply never implement the traits -- aside from that I'm unsure how we're supposed to do anything about it?
| INSTALL.call_once(|| { | ||
| let previous_hook = std::panic::take_hook(); | ||
| std::panic::set_hook(Box::new(move |info| { | ||
| internal::emit_panic(info); |
There was a problem hiding this comment.
This panic hook is close to what tracing-panic already does: emit panic payload, location, and optional backtrace as a tracing event. I verified the preserving-hook pattern compiles locally:
let previous = std::panic::take_hook();
std::panic::set_hook(Box::new(move |panic_info| {
tracing_panic::panic_hook(panic_info);
previous(panic_info);
}));Could this hook delegate to tracing-panic, and keep only the Miden-specific span-status/fallback behavior locally?
There was a problem hiding this comment.
afaik the tracing-panic hook doesn't allow us to attach the panic to the root span? But maybe I've misunderstood how events attach to spans.
The problem I want to address is when a panic event is emitted within a disabled span. e.g. we panic inside a non-root debug_span! but the filter is set to info, then the event dangles. What I want is that the panic event should still attach at least to the root span if its enabled.
Maybe this is already the case and I've misunderstood. But I do see "floating" panic events occasionally in the node and that concerns me.
| lines.push(format_span(&span)); | ||
| } | ||
|
|
||
| for event in span.events.events.iter().filter(|event| is_user_facing(&event.attributes)) |
There was a problem hiding this comment.
For events that are meant to be logs, opentelemetry-appender-tracing bridges tracing events into OTel logs directly. And this works with the current OTel versions:
use opentelemetry_appender_tracing::layer::OpenTelemetryTracingBridge;
use opentelemetry_sdk::logs::SdkLoggerProvider;
let logger_provider = SdkLoggerProvider::builder()
.with_simple_exporter(opentelemetry_stdout::LogExporter::default())
.build();
let log_layer = OpenTelemetryTracingBridge::new(&logger_provider);User-facing events should use an OTel log layer, leaving this exporter to handle only the span-completed/span-failed operator summaries.
There was a problem hiding this comment.
But we don't have tracing events - we have otel events and tracing cannot see those
| } | ||
| } | ||
|
|
||
| impl SpanExporter for UserFacingStdoutExporter { |
There was a problem hiding this comment.
This exporter hand-rolls line formatting, escaping, and writing. tracing-subscriber's fmt layer already handles structured stdout output and writer plumbing. This is standard issue:
use tracing_subscriber::filter::LevelFilter;
use tracing_subscriber::prelude::*;
let subscriber = tracing_subscriber::registry().with(
tracing_subscriber::fmt::layer()
.with_ansi(false)
.json()
.with_filter(LevelFilter::INFO),
);Could the Miden-specific part be just a miden.user filter layered over fmt::Layer, instead of a custom stdout exporter?
There was a problem hiding this comment.
tracing-subscriber unfortunately only see's tracing fields and not open-telemetry fields, so its blind to the record family of things we introduce here.
Tracing fields have to be statically known at compile time, and must be set as part of the macros which is both tedious and a pain, which is where the fields(x = Field::empty()) becomes required in tracing::instrument and other macro calls to "pre-register" all fields.
|
|
||
| quote! { | ||
| #(#attrs)* | ||
| #[::miden_node_tracing::__private::tracing::instrument(#tracing_args)] |
There was a problem hiding this comment.
Since this expands to tracing::instrument anyway, it may be worth leaning on the upstream macro for most behavior. tracing already supports the core options this wrapper is generating, including err. I verified this compiles locally:
#[tracing::instrument(
skip_all,
target = "rpc",
name = "rpc::get_block",
level = "info",
err
)]
fn get_block() -> Result<(), std::io::Error> {
Err(std::io::Error::other("boom"))
}This macro could stay focused on Miden policy, like target validation and metadata registration, while deferring as much instrumentation behavior as possible to tracing::instrument?
There was a problem hiding this comment.
err does not do the correct thing unfortunately. It uses Display (or Debug), neither of which actually print an error report. So its either a debug impl or only uses the top level error which is not helpful.
It also emits an error! event, which the otel-tracing bridge then converts into marking the span as an error, essentially doubling the error.
Or ito code:
// tracing::instrument
error!(err.to_string());
Span::currrent().set_error(&err.to_string());
// PR impl
Span::current().set_error(err.to_report());There was a problem hiding this comment.
I have tried to set up specific tests for this in the PoC at #2033
| @@ -0,0 +1,112 @@ | |||
| use syn::{Expr, ExprPath, Path, PathArguments}; | |||
|
|
|||
| const ALLOWED_TARGETS: &[&str] = &[ | |||
There was a problem hiding this comment.
For the target taxonomy, a shared enum would avoid keeping string lists in sync manually. One option is strum for parsing/displaying target names from a normal shared crate used by both the proc macro and runtime crates.
At a high level, could the macro parser and runtime filter share one target definition instead of maintaining separate ALLOWED_TARGETS slices?
This PR introduces a highly opinionated framework around the
tracingandopen-telemetrycrates in an attempt to standardize and remove "pits of failure".Warning
AI code, human driver + API design.
I still need to do another review pass but since its long weekend here I'm opening this early for critique.
This is inspired by the discussion in #1949 and the ideas in #1500/#1835.
Problems this addresses
transaction.id,tx.id,tx_id,transaction_id,txetc.Displayand not error chaining, dropping context.stdoutlog sync gets flooded with our traces, rendering them useless.target = COMPONENTis overly broad.Usage
New macros, same as the old macros, but different
The miden-node-tracing crate introduces opinionated re-imaginings of the tracing family of macros. The macros all follow the same syntax:
target name level user?(levelis of course skipped where it is implied e.g. info!).Event macros also require a
justification = "..."because I want to encourage more spans, less events.Spans and events can only be created via macros. This lets us do additional cool things like keeping a list of all events and spans created.
Recording attributes on spans and events. This is not allowed as part of macros, but is done via function calls to be more explicit.
Features
Most features are driven by the re-imagined macros.
User logs
Macros (both events and spans) have an optional
usertag which indicates to thestdoutexporter that this thing should be logged. Tagged events are logged verbatim, spans are expected to be namedverb nounand are logged asverb noun completed/failedbased on the span's error status.The
stdoutexporter also has a separate filter to the otel exporter.Field name unification
The macros don't accept fields as part of their syntax. Instead we have
FieldandObjecttraits which define the default property names of our types and how their values can be recorded on events and spans. This recording is done manually i.e.span.record_field(&tx_id)- in theory there is nothing stopping us from doing this in the macros as well, but I prefer this being a bit more explicit.We also have
recordvariants which will include the field as part of thestdoutlog. The intention is that this would be used for attributes which are important to the basic user, e.g. account ID, transaction ID, note ID.Dynamic filters
Both user and otel trace filters can be configured at runtime. The idea is to have an admin interface which allows setting this e.g. a basic REST API.
In addition, because we use macros for everything, we can use
inventoryto list all spans and events and their metadata (target, name, level). This can be exposed in the admin UI maybe to make it easier to know what is available. Another usage is as a test case which gives an overview of them all e.g. usinginstato keep an automated and visible list within a unit test.Targets
The macros have a closed enum of valid target paths. This means we can easily control the list and ensure its sensible.
The current list is purely an example. Actual usage would begin empty and be filled over time.
Panics
Panics sometimes dangled and can be hard to find. I think I've solved this -- hard to be certain here, maybe its silly. But the panic handler now emits a specific event which is internally handled by the exporters and not actually exporterted. This lets the exporter find a parent/root span to attach it to, even if the panicked span itself is disabled.
Errors
#[instrument]now doesn't emit anerror!on error but instead attaches the error to the span and uses error report. This doesn't help for manualspan!, but its better than nothing.New macros, same as the old macros, but different
The
miden-node-tracingcrate introduces opinionated re-imaginings of thetracingfamily of macros. The macros all follow the same syntax:target name level user?(levelis of course skipped where it is implied e.g.info!).Event macros also require a
justification = "..."because I want more spans, less events.Bummers
Its non-trivial to make this general purpose. This is because of the orphan rule for the macros, and for the whitelist of allowed targets (which the macros require access to).
Orphan rule might be solvable by a lot more generics and additional traits. Whitelist.. might be using a required
build.rs, or by abandoning it in favor of a snapshot test to keep it bounded. Less compile time checks though.