refactor(change-buffer)!: replace slot index with span_id, fix segment isolation#2105
refactor(change-buffer)!: replace slot index with span_id, fix segment isolation#2105rochdev wants to merge 13 commits into
Conversation
Add the main ChangeBufferState<T> struct and its full implementation: span creation/pooling, change buffer interpretation (opcodes), deferred tag resolution, span header materialization, flush logic, and string table management. This is the core state machine that processes batched span mutations from JS/WASM via a shared memory buffer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… isolation Segment isolation bug When service A calls B which calls back to A, both of A's segments share the same trace_id. The segment-level state map (origin, meta, metrics) was keyed by trace_id, so SetTraceOrigin/SetTraceMetaAttr/ SetTraceMetricsAttr written for segment 2 would overwrite segment 1's values before it was flushed. Fix: add segment_id (u64) to the Create op so each segment gets its own isolated entry in SmallSegmentMap. Slot index removal Operations previously carried a u32 slot_index — a JS-managed array index into parallel Vecs (spans, deferred_meta, deferred_metrics, segment_ids). JS had to allocate and free slot numbers, silently reusing a slot would drop a live span, and ensure_slot had to grow all four Vecs in lockstep on every Create. Replaced with span_id (u64) as the natural key into HashMaps, removing the slot allocation contract entirely. Opcode size The opcode was encoded as u64 because JS was writing via setBigUint64. With only 17 opcodes this wastes 6 bytes per op. Reduced to u16. SpanHeader removal materialize_header was a direct-write path where JS wrote span fields into a repr(C) struct in WASM memory, bypassing the change buffer protocol. It had no segment_id field and would have reproduced the isolation bug. It was not in use, so deleted along with span_header.rs. Rename Trace→Segment SmallTraceMap/Trace renamed to SmallSegmentMap/Segment since the map holds per-segment state, not per-trace state. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kept our implementation (HashMap-keyed by span_id) as it supersedes the slot-based style improvements in the base branch. Adopted the SPANS_POOL_MAX_SIZE named constant from the base branch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29b0d7d5bd
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
🔒 Cargo Deny Results📦
|
…buffer corruption Co-locate segment_id with its span in spans: HashMap<u64, (Span<T>, u64)>, removing the separate segment_ids: HashMap<u64, u64>. This eliminates a second HashMap lookup on every op that needs the segment_id. Group the 5 cache state variables in flush_change_buffer into a SpanCache<T> struct, fixing the too_many_arguments clippy lint on interpret_operation_cached. Fix a buffer-corruption bug (P1) where BatchSetMeta/BatchSetMetric in the cached path read the count but skipped the payload loop when cached_deferred_meta/cached_deferred_metrics was null (e.g. after materialize_span was called between flushes). The unread bytes were then decoded as the next op's opcode, corrupting all subsequent ops in the buffer. The fix moves the null-check inside the loop so bytes are always consumed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
| /// trigger a rehash — i.e. before every Create op. | ||
| struct SpanCache<T: TraceData> { | ||
| span_id: u64, | ||
| span_ptr: *mut Span<T>, |
There was a problem hiding this comment.
Having *mut pointers here seems a little scary in terms of ensuring stacked borrows etc when they're used
There was a problem hiding this comment.
I agree. I got rid of them in the upstream PR, using an index instead. I'm not sure saving one vec indexing is worth the unsafe business.
| change_buffer, | ||
| spans: Vec::with_capacity(256), | ||
| traces: SmallTraceMap::default(), | ||
| spans: HashMap::with_capacity(256), |
| pub fn flush_chunk( | ||
| &mut self, | ||
| slot_indices: &[u32], | ||
| span_ids: Vec<u64>, |
| let (mut span, segment_id) = self | ||
| .spans | ||
| .remove(span_id) | ||
| .ok_or(ChangeBufferError::SpanNotFound(*span_id))?; |
There was a problem hiding this comment.
If this fails, we stop removing the other spans. Is that what we want?
| self.materialize_deferred_tags(*span_id, &mut span); | ||
|
|
||
| chunk_trace_id = Some(span.trace_id); | ||
| if let Some(entry) = seen_segment_ids |
There was a problem hiding this comment.
Is this data-structure small enough that iter_mut.find is faster than keeping a set?
There was a problem hiding this comment.
This whole seen_segment_ids has been removed upstream in the end.
| // Segment 1 — span_id=1: origin → "origin-A" (string id 0) | ||
| w.op(OP_CREATE, 1); // span_id=1 in header | ||
| w.u128(TRACE_ID); | ||
| w.u64(1); // segment_id=1 | ||
| w.u64(0); // parent_id | ||
| w.op(OP_SET_TRACE_ORIGIN, 1); | ||
| w.u32(0); // string_id → "origin-A" |
There was a problem hiding this comment.
Should this have a helper function?
|
|
||
| impl BufWriter { | ||
| fn new() -> Self { | ||
| let mut data = Vec::with_capacity(256); |
| let key_id: u32 = self.get_num_arg(index)?; | ||
| let val: f64 = self.get_num_arg(index)?; | ||
| deferred_metrics_at(&mut self.deferred_metrics, slot)?.insert(key_id, val); | ||
| if let Some(dm) = self.deferred_metrics.get_mut(&op.span_id) { |
There was a problem hiding this comment.
I'd probably use .entry here, but YMMV
| deferred_meta_at(&mut self.deferred_meta, slot)?.insert(key_id, val_id); | ||
| if !cache.deferred_meta.is_null() { | ||
| // SAFETY: pointer into self.deferred_meta, valid per function contract. | ||
| unsafe { &mut *cache.deferred_meta }.insert(key_id, val_id); |
There was a problem hiding this comment.
Are we guaranteed to enforce stacked-borrows here?
| } | ||
| } | ||
| } | ||
| OpCode::Create | OpCode::CreateSpan | OpCode::CreateSpanFull => unreachable!(), |
There was a problem hiding this comment.
do we want to have a potential panic here?
Summary
trace_id, trace-level state (SetTraceOrigin,SetTraceMetaAttr,SetTraceMetricsAttr) was keyed bytrace_id, so segment 2's writes overwrote segment 1's values before it was flushed. Fixed by addingsegment_id(u64) to Create ops and keyingSmallSegmentMapby segment_id instead.u32 slot_index— a JS-managed array index into four parallelVecs. JS had to allocate/free slot numbers; reusing a slot prematurely would silently drop a live span;ensure_slothad to grow all four Vecs in lockstep. Replaced withspan_id(u64) as the natural key intoHashMaps, removing the slot allocation contract entirely.u64because JS wrote viasetBigUint64, wasting 6 bytes per op with only 17 opcodes. Reduced tou16.materialize_headerwas an unused direct-write path where JS wrote span fields into a#[repr(C)]struct in WASM memory, bypassing the change buffer protocol. It had nosegment_idfield and would have reproduced the isolation bug. Deleted along withspan_header.rs.Trace→Segment:SmallTraceMap/Tracerenamed toSmallSegmentMap/Segmentsince the map holds per-segment state, not per-trace state.Test plan
cargo test -p libdd-trace-utils --lib --features change-buffer -- change_bufferpasses (16 tests including 3 new segment isolation tests)cargo check -p libdd-trace-utils --features change-bufferclean