compute: cache per-batch heap size in ArrangementSize operator #36977

Merged
antiguru merged 1 commit into
MaterializeInc:main from
antiguru:arrangement-size-cache
Jun 11, 2026

@antiguru

Member

Motivation

The arrangement-size logging operator (log_arrangement_size) can be slow for large arrangements.
Before this change, every activation recomputed the heap size of every live batch by walking each batch's backing regions, even when no batch had changed and the operator fired for unrelated reasons.
For large arrangements this repeated walk dominates the operator's run time.

Description

Batches are immutable once sealed, so their heap size never changes after creation.
The operator already keys its batch map on Rc::as_ptr, relying on stable pointer identity.
This change caches the computed (size, capacity, allocations) tuple alongside the weak reference when a batch is first observed (via the input stream or map_batches), and only sums the cached values on subsequent activations.
Dead batches are still dropped, but they are now detected via Weak::strong_count rather than by upgrading the weak reference, which cloned the Rc.
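The caching scheme described above can be sketched in isolation. This is a minimal illustration rather than the operator's actual code: `Batch`, `SizeCache`, `observe`, and `total` are hypothetical names, and each inner `Vec<u8>` stands in for one backing region. Only `Rc::as_ptr`, `Rc::downgrade`, and `Weak::strong_count` come from the description.

```rust
use std::collections::HashMap;
use std::rc::{Rc, Weak};

/// (size, capacity, allocations), as named in the description.
type HeapSize = (usize, usize, usize);

/// Hypothetical stand-in for a sealed, immutable batch.
/// Each inner Vec models one backing region.
pub struct Batch {
    regions: Vec<Vec<u8>>,
}

impl Batch {
    pub fn new(regions: Vec<Vec<u8>>) -> Self {
        Batch { regions }
    }

    /// Walks every region. This is the expensive step the cache avoids repeating.
    pub fn heap_size(&self) -> HeapSize {
        let mut totals = (0, 0, 0);
        for region in &self.regions {
            totals.0 += region.len();
            totals.1 += region.capacity();
            // Count a region as one allocation only if it owns memory.
            totals.2 += usize::from(region.capacity() > 0);
        }
        totals
    }
}

/// Hypothetical cache keyed on the batch's pointer identity.
#[derive(Default)]
pub struct SizeCache {
    batches: HashMap<*const Batch, (Weak<Batch>, HeapSize)>,
}

impl SizeCache {
    /// Walks the batch only the first time it is observed.
    /// Holding the Weak keeps the Rc allocation reserved, so the key
    /// cannot be reused by another batch while the entry exists.
    pub fn observe(&mut self, batch: &Rc<Batch>) {
        self.batches
            .entry(Rc::as_ptr(batch))
            .or_insert_with(|| (Rc::downgrade(batch), batch.heap_size()));
    }

    /// Drops dead batches without upgrading, then sums the cached tuples.
    pub fn total(&mut self) -> HeapSize {
        self.batches.retain(|_, (weak, _)| weak.strong_count() > 0);
        self.batches
            .values()
            .fold((0, 0, 0), |acc, (_, s)| (acc.0 + s.0, acc.1 + s.1, acc.2 + s.2))
    }

    /// Number of batches currently tracked.
    pub fn len(&self) -> usize {
        self.batches.len()
    }
}
```

In this sketch, calling `observe` repeatedly with the same `Rc` is a map lookup, and `total` never touches a region. Checking `strong_count` avoids the temporary `Rc` clone that `upgrade` would create.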

Per-activation cost drops from O(batches * columns * regions) to O(new batches * regions) plus an O(batches) sum, eliminating the repeated region walk.
Logged deltas are unchanged: because batches are immutable, the cached size equals what a re-walk would produce.
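The cost and equivalence claims can be illustrated with a toy model that counts region walks. It is a sketch under simplifying assumptions, not the operator's code: `Batch`, `activate_rewalk`, and `activate_cached` are hypothetical, regions are reduced to byte lengths, and batches are assumed to be append-only so that a plain `Vec` can serve as the cache.

```rust
use std::cell::Cell;

/// Hypothetical immutable batch: region byte lengths plus a walk counter.
pub struct Batch {
    region_lens: Vec<usize>,
    walks: Cell<usize>,
}

impl Batch {
    pub fn new(region_lens: Vec<usize>) -> Self {
        Batch { region_lens, walks: Cell::new(0) }
    }

    /// Sums all regions and records that a walk happened.
    pub fn walk(&self) -> usize {
        self.walks.set(self.walks.get() + 1);
        self.region_lens.iter().sum()
    }

    /// How many times this batch has been walked so far.
    pub fn walks(&self) -> usize {
        self.walks.get()
    }
}

/// Old behavior: every activation walks every live batch.
pub fn activate_rewalk(batches: &[Batch]) -> usize {
    batches.iter().map(Batch::walk).sum()
}

/// New behavior: walk only batches not yet cached, then sum cached sizes.
/// Assumes `batches` only grows at the end between activations.
pub fn activate_cached(batches: &[Batch], cache: &mut Vec<usize>) -> usize {
    for batch in batches.iter().skip(cache.len()) {
        cache.push(batch.walk());
    }
    cache.iter().sum()
}
```

Because the region lengths never change after construction, both functions return the same total on every activation. The cached variant walks each batch once overall, while the re-walk variant walks each batch once per activation.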

Verification

Behavior-preserving change covered by existing arrangement-size logging tests and the mz_arrangement_sizes introspection views.
cargo clippy -p mz-compute and cargo fmt are clean.

🤖 Generated with Claude Code

The arrangement-size logging operator recomputed every live batch's heap
size on every activation, walking each batch's backing regions repeatedly
even when nothing changed. Batches are immutable once sealed, so this work
is redundant.

Cache the computed (size, capacity, allocations) tuple alongside the weak
reference when a batch is first observed, and only sum the cached values on
subsequent activations. This drops per-activation cost from
O(batches * columns * regions) to O(new batches * regions) plus an
O(batches) sum, eliminating the repeated region walk that made the operator
slow for large arrangements.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@antiguru antiguru marked this pull request as ready for review June 11, 2026 08:28
@antiguru antiguru requested a review from a team as a code owner June 11, 2026 08:28
@antiguru antiguru requested review from DAlperin and petrosagg June 11, 2026 08:28

@petrosagg petrosagg left a comment

Contributor

nice!

@antiguru antiguru merged commit 0fe50dd into MaterializeInc:main Jun 11, 2026
117 checks passed
@antiguru antiguru deleted the arrangement-size-cache branch June 11, 2026 09:34

2 participants