perf(profiling): reduce profiler arena memory footprint #2048
taegyunkim wants to merge 4 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@            Coverage Diff            @@
    ##  taegyunkim/prof-14423-prof-dictinary-bench    #2048    +/-  ##
    ==================================================================
    + Coverage    73.53%    73.57%    +0.04%
    ==================================================================
      Files          475       475
      Lines        79007     79200     +193
    ==================================================================
    + Hits         58095     58270     +175
    - Misses       20912     20930      +18
Note that historically the tension here was between fragmentation and memory use -- that's why we set the higher defaults. (See for instance https://docs.google.com/document/d/1g_H7G9s_H9yoxlpyw_B0aoUyIVmo0ZQBzQkp5EUUyX8/edit?tab=t.0 ) This is not to say that we can't or shouldn't adjust these numbers; it's more to add context on why larger numbers were chosen rather than starting with the smallest possible and just letting it grow.

@ivoanjo Thanks for the context! That makes sense, and this is why this PR uses capped geometric growth. A couple of differences make this less risky than the story from your report:
So this keeps the lower memory floor for small/common profiles, while avoiding the "smallest possible and just keep growing tiny chunks" behavior. I agree we should validate this with real workloads, especially Ruby if we're worried about fragmentation.

Ahh that's great, thanks for the extra context. In particular, I missed the detail where these come from Excited to see the improvements from this one :D

@ivoanjo the DoE results look very good for Python with this change. For all three archetypes, we see a reduction in heap live size, heap live samples, allocated memory, and allocations, with no change in CPU time.

    /// doesn't have enough space for the requested allocation, and then links the
    /// new [LinearAllocator] to the previous one, creating a chain. This is where
    /// its name comes from.
    /// its name comes from. Each successful growth doubles the target chunk size

Have we experimented with other factors? E.g. 1.5x would still grow geometrically, but not as fast.
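
To put rough numbers on the question above, here is a minimal sketch (not code from this PR) that counts how many growth events it takes to reach a cap for a given factor. The helper name `growth_steps` and the rational `num / den` factor encoding are assumptions made for illustration; only the 64 KiB and 1 MiB sizes come from this PR.

```rust
/// Hypothetical helper: counts the growth events needed to go from `initial`
/// to `cap` when every event multiplies the chunk size by `num / den`.
fn growth_steps(initial: usize, cap: usize, num: usize, den: usize) -> usize {
    let mut size = initial;
    let mut steps = 0;
    while size < cap {
        // Integer arithmetic; the result is clamped to the cap.
        size = (size / den * num).min(cap);
        steps += 1;
    }
    steps
}

fn main() {
    const KIB: usize = 1024;
    let doubling = growth_steps(64 * KIB, 1024 * KIB, 2, 1);
    let one_and_a_half = growth_steps(64 * KIB, 1024 * KIB, 3, 2);
    println!("2x: {doubling} growth events, 1.5x: {one_and_a_half} growth events");
}
```

Under these assumptions, a smaller factor means more, smaller chunks on the way to the cap, which is the fragmentation side of the trade-off discussed earlier in this thread.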

    /// this in mind when sizing your hint if you are trying to be precise,
    /// such as making sure a specific object fits.
    pub const fn new_in(chunk_size_hint: usize, allocator: A) -> Self {
        let initial_node_size = Self::normalize_node_size(chunk_size_hint);

Why does one of these get a function while the other is inline?

    assert!(function_arena_reserved_bytes(&dict) <= 4 * SMALL_ARENA_HINT);
    assert!(mapping_arena_reserved_bytes(&dict) <= 2 * SMALL_ARENA_HINT);

Where do these constants come from?

    pub const SIZE_HINT: usize = 1024 * 1024;
    // Keep the per-shard arena small; larger dictionaries grow
    // geometrically up to the historical 1 MiB chunk size.
    pub const SIZE_HINT: usize = 64 * 1024;

Should this be named `INITIAL_SIZE_HINT`?

    // geometrically up to the historical 4 MiB chunk size, while common
    // profiles fit comfortably below this initial size. Talk to .NET
    // profiling engineers before making this any bigger.
    const SIZE_HINT: usize = 512 * 1024;

For the other case, we went from 64K to 1M; here we go from 512K to 4M. Why

    let bool_layout = Layout::new::<bool>();

    const GROWTH_ITERATIONS: usize = 16;

    if Layout::from_size_align(next, align).is_ok() {
        next
    } else {
        current

Why is this the right fall-back when `from_size_align` fails?
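
For readers following this thread, the pattern in the hunk can be sketched in isolation as below. This is an illustration, not the PR's code: the function name `next_chunk_size` and the doubling-then-clamp step are assumptions. It only shows what the fallback does: when the doubled size cannot form a valid `Layout`, the size stays at `current`.

```rust
use std::alloc::Layout;

/// Hypothetical helper: doubles `current` up to `max`, and keeps `current`
/// when the doubled size does not form a valid `Layout` for `align`.
fn next_chunk_size(current: usize, max: usize, align: usize) -> usize {
    let next = current.saturating_mul(2).min(max);
    if Layout::from_size_align(next, align).is_ok() {
        next
    } else {
        // The candidate size was rejected, so growth simply stops here.
        current
    }
}

fn main() {
    const KIB: usize = 1024;
    // Routine growth: 64 KiB doubles to 128 KiB.
    assert_eq!(next_chunk_size(64 * KIB, 1024 * KIB, 8), 128 * KIB);
    // At the cap, the size no longer changes.
    assert_eq!(next_chunk_size(1024 * KIB, 1024 * KIB, 8), 1024 * KIB);
    // A doubled size above isize::MAX is not a valid layout, so the
    // sketch falls back to the current size.
    let huge = isize::MAX as usize;
    assert_eq!(next_chunk_size(huge, usize::MAX, 8), huge);
}
```

One reading of the fallback, offered as a guess, is that `current` was already accepted as a layout size earlier, so reusing it cannot introduce a new failure.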

    } else {
        chunk_size_hint
    },
    node_size: Cell::new(initial_node_size),

Document why this needs to be a `Cell`?
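
As background for this question, here is a minimal sketch of why interior mutability tends to appear in this kind of design. It assumes, without confirmation from this PR, that the growth step runs behind a shared `&self` reference, as allocator-style methods usually do. `GrowthState` and its fields are hypothetical names.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the allocator's size bookkeeping.
struct GrowthState {
    /// Next target chunk size; a `Cell` so it can change through `&self`.
    node_size: Cell<usize>,
    /// Upper bound for routine growth.
    max_node_size: usize,
}

impl GrowthState {
    const fn new(initial: usize, max: usize) -> Self {
        Self {
            node_size: Cell::new(initial),
            max_node_size: max,
        }
    }

    /// Takes `&self`, yet still records the doubled target size.
    fn grow(&self) -> usize {
        let next = self
            .node_size
            .get()
            .saturating_mul(2)
            .min(self.max_node_size);
        self.node_size.set(next);
        next
    }
}

fn main() {
    const KIB: usize = 1024;
    let state = GrowthState::new(64 * KIB, 256 * KIB);
    assert_eq!(state.grow(), 128 * KIB);
    assert_eq!(state.grow(), 256 * KIB);
    // Further growth is clamped to the cap.
    assert_eq!(state.grow(), 256 * KIB);
}
```

Note that `Cell` makes the containing type `!Sync`, which may be relevant to how the real type is shared across threads.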



What does this PR do?
Reduces the profiler arena memory floor while preserving larger-workload behavior by making `ChainAllocator` grow geometrically up to a cap.

This PR is stacked on top of #2088, which adds a `ProfilesDictionary` Criterion benchmark so this change can be compared by the GitLab benchmark job.

Changes:
- Adds capped geometric growth to `ChainAllocator`.
- Adds `ChainAllocator::new_capped_in(initial, max, allocator)` for callers that want a smaller initial chunk but a historical/max chunk size after growth.
- Reduces `StringTable` initial chunks from 4 MiB to 512 KiB, capped at the historical 4 MiB chunk size.
- Keeps `ParallelStringSet`/`ParallelSliceSet` at 16 shards.

Motivation
Python profiler memory analysis showed that common profiles keep only tens to hundreds of KiB of dictionary/string-table content, but libdatadog reserved much larger arena chunks up front. This created a high per-process memory floor, especially across forked workers.
The smaller initial chunks reduce that floor. Geometric growth avoids keeping large/high-cardinality services on tiny chunks indefinitely, so they ramp back to the previous chunk sizes after a few growth events.
Shard count
This PR keeps the existing 16-shard default.
I originally explored reducing `ParallelStringSet`/`ParallelSliceSet` from 16 shards to 4, but dropped that from this PR. The extra memory saved by reducing shards after the arena-size change is relatively small (`12 * 64 KiB = 768 KiB` for the string set), while 16 shards preserve better concurrent insertion headroom.

Consumer concurrency summary:
So this PR focuses on the main memory win: smaller initial arenas with capped growth, without reducing shard count.
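
The shard arithmetic in this section can be checked with a few lines. This is only a sketch of the calculation; `shard_savings` is a hypothetical helper, and the 1 MiB figure is the per-shard chunk size used before this PR.

```rust
const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

/// Hypothetical helper: bytes of initial arena reservation saved by going
/// from `from` shards to `to` shards at a given per-shard chunk size.
fn shard_savings(from: usize, to: usize, chunk: usize) -> usize {
    from.saturating_sub(to) * chunk
}

fn main() {
    // With the old 1 MiB chunks, dropping 12 shards would have saved 12 MiB.
    let with_old_chunks = shard_savings(16, 4, MIB);
    // With 64 KiB chunks, the same reduction saves only 768 KiB.
    let with_new_chunks = shard_savings(16, 4, 64 * KIB);
    println!(
        "16 -> 4 shards saves {} KiB at 1 MiB chunks, {} KiB at 64 KiB chunks",
        with_old_chunks / KIB,
        with_new_chunks / KIB
    );
}
```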
Additional Notes
Expected growth patterns:
- Dictionary arenas: `64 KiB -> 128 KiB -> 256 KiB -> 512 KiB -> 1 MiB -> ...`
- `StringTable`: `512 KiB -> 1 MiB -> 2 MiB -> 4 MiB -> ...`

Oversized individual allocations still allocate chunks large enough for the request, even if larger than the routine growth cap.
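
The growth patterns and the oversized-allocation rule can be modeled with a small sketch. It is a simplified model, not the `ChainAllocator` implementation: `Policy` and `grow_for` are hypothetical names, and it assumes that an oversized request does not change the routine target size.

```rust
/// Hypothetical model of capped geometric growth.
struct Policy {
    /// Size of the next routinely allocated chunk.
    target: usize,
    /// Upper bound for routine growth.
    cap: usize,
}

impl Policy {
    /// Returns the chunk size allocated for a request of `request` bytes,
    /// then doubles the routine target up to the cap.
    fn grow_for(&mut self, request: usize) -> usize {
        // An oversized request gets a chunk large enough to hold it.
        let chunk = self.target.max(request);
        self.target = self.target.saturating_mul(2).min(self.cap);
        chunk
    }
}

fn main() {
    const KIB: usize = 1024;
    const MIB: usize = 1024 * KIB;

    // StringTable-like configuration: 512 KiB initial, 4 MiB cap.
    let mut policy = Policy { target: 512 * KIB, cap: 4 * MIB };
    assert_eq!(policy.grow_for(100), 512 * KIB);
    assert_eq!(policy.grow_for(100), MIB);
    // An 8 MiB request exceeds the cap and still gets an 8 MiB chunk.
    assert_eq!(policy.grow_for(8 * MIB), 8 * MIB);
    // Routine growth then continues at the cap.
    assert_eq!(policy.grow_for(100), 4 * MIB);
    assert_eq!(policy.grow_for(100), 4 * MIB);
}
```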
Approximate initial arena floor after this change:
- `ParallelStringSet`: `16 * 64 KiB = 1 MiB` instead of `16 * 1 MiB = 16 MiB`.
- `FunctionSet`: `4 * 64 KiB = 256 KiB` instead of `4 * 1 MiB = 4 MiB`.
- `MappingSet`: `2 * 64 KiB = 128 KiB` instead of `2 * 1 MiB = 2 MiB`.
- `StringTable`: `512 KiB` instead of `4 MiB`.

How to test the change?
Ran:

    cargo +nightly-2026-02-08 fmt --all -- --check
    cargo check -p libdd-alloc -p libdd-profiling
    cargo check -p libdd-profiling --benches
    cargo +stable clippy -p libdd-alloc -p libdd-profiling --all-targets --all-features -- -D warnings
    cargo nextest run -p libdd-alloc -p libdd-profiling
    cargo test --doc -p libdd-alloc -p libdd-profiling

PROF-14423