Skip to content

[ntuple] Support multiple column representations in the merger#22017

Open
silverweed wants to merge 17 commits into
root-project:masterfrom
silverweed:ntuple_merge_colrep2
Open

[ntuple] Support multiple column representations in the merger#22017
silverweed wants to merge 17 commits into
root-project:masterfrom
silverweed:ntuple_merge_colrep2

Conversation

@silverweed

@silverweed silverweed commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

This Pull request:

Significantly reworks the innards of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
Basically it does two things:

  • turns all L3 merging cases into L2/L1.
  • no longer rejects merging fields with different column representations (previously this was only supported for representations that were the split/unsplit version of each other, and only via L3 merging).

A potentially negative consequence that we might want to revisit is that now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.

In order to achieve this, some new internal functionality had to be added, most notably RPagePersistentSink::AddColumnRepresentation.

Note that this PR is independent on #21740, which in fact might not be needed at all.

IMPORTANT

This PR introduces our first feature flag and thus the first bump to the specs' major version (1.1.0.0). This means we can now start producing RNTuples which cannot be read by older ROOT versions.

TODO

  • check if we need a feature flag for the changes in AddExtendedColumnRanges
  • add a test for merging of Real32Trunc/Quant columns with different bit width/value range
  • properly split the big merger commit
  • update Merging.md

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

@silverweed silverweed requested a review from jblomer as a code owner April 22, 2026 15:07
@silverweed silverweed marked this pull request as draft April 22, 2026 15:07
@silverweed silverweed changed the title Ntuple merge colrep2 [ntuple] Support multiple column representations in the merger Apr 22, 2026
@silverweed silverweed self-assigned this Apr 22, 2026
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 3 times, most recently from b2ae5fc to 1db6b5e Compare April 22, 2026 15:22
@github-actions

github-actions Bot commented Apr 22, 2026

Copy link
Copy Markdown

Test Results

    19 files      19 suites   2d 20h 11m 13s ⏱️
 3 869 tests  3 868 ✅ 0 💤 1 ❌
66 349 runs  66 348 ✅ 0 💤 1 ❌

For more details on these failures, see this check.

Results for commit 9eaef1b.

♻️ This comment has been updated with latest results.

@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 9 times, most recently from 82e5299 to d3efcfc Compare April 30, 2026 09:45
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 2 times, most recently from 857d425 to 60b1bde Compare May 4, 2026 13:36
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 3 times, most recently from 1d462ee to a249301 Compare May 5, 2026 07:42
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch from a249301 to ab08930 Compare May 13, 2026 15:05
@silverweed silverweed marked this pull request as ready for review May 15, 2026 06:48
@pcanal

pcanal commented May 23, 2026

Copy link
Copy Markdown
Member

now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.

indeed, we do need to provide a way for the user to require the L3 type of merging ('urgency' of this is less if we already have a way to for L4 type of merging).

Addendum: L4 is currently not supported, so re-adding L3 would be helpful. In particular to allow re-selection of the compression algorithm used.

@pcanal pcanal closed this May 23, 2026

| Flag Bit | Introduced in | Name | Meaning |
|----------|---------------|-------------------------|----------------------------------------------|
| 0 | 1.1.0.0 | Nested Deferred Columns | Signals that the RNTuple contains at least one deferred column that is part of a collection and was extended<br>(i.e. it appears in the footer). This can happen when merging two RNTuples that have the same collection field<br>backed by columns with different encoding, e.g. a `vector<float>` whose elements are represented by SplitReal32<br>in the first ntuple and by Real32 in the second. |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mentions explicitly collections. Is the feature (merging RNTuple with 'same' column with different representation) not supported for simple type (i.e. just a float instead of a vector<float>)? If not, why not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is supported but it requires no feature flag because it produces files that are still readable by previous ROOT versions.
The feature flag is only added if the feature produces files that are forward-incompatible for older versions, which is only the case for collections.

@pcanal

pcanal commented May 23, 2026

Copy link
Copy Markdown
Member

I did not mean to close this.

@pcanal pcanal reopened this May 23, 2026

@hahnjo hahnjo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments for the preparatory changes. I wasn't able to properly review the actual changes in the merger, it's probably me missing enough context and time to thoroughly think it through...

Comment thread tree/ntuple/inc/ROOT/RField/RFieldFundamental.hxx
Comment thread tree/ntuple/src/RNTupleSerialize.cxx
Comment thread tree/ntuple/src/RNTupleDescriptor.cxx
@silverweed silverweed added this to the 6.42.00 milestone Jun 8, 2026
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 3 times, most recently from 4697f42 to db9c393 Compare June 9, 2026 13:47

@jblomer jblomer left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! That is a huge step forward.

With all the added complexity, I think we need to give the merger another fresh look and try reduce the logic a bit to smaller blocks. For a future PR.

We should remember to backport

  • [ntuple] Fix RFieldFundamental::GenerateColumns using the wrong repr idx
  • [ntuple] update merger test to make sure we test nRepetitions
  • [ntuple] Clarify a bit RFieldBase::EntryToColumnElementIndex
  • [ntuple] Introduce feature flag 0 (nested deferred columns)
  • Patch up column ids in the header extension (part of [ntuple] Support multiple column representations in the merger)

Comment thread tree/ntuple/src/RNTupleMerger.cxx Outdated
Comment thread tree/ntuple/src/RNTupleSerialize.cxx
Comment thread tree/ntuple/src/RNTupleSerialize.cxx Outdated
Comment thread tree/ntuple/test/ntuple_merger.cxx
Comment on lines +517 to +522
const RColumnRange *TryGetColumnRange(ROOT::DescriptorId_t physicalId) const
{
if (auto it = fColumnRanges.find(physicalId); it != fColumnRanges.end())
return &it->second;
return nullptr;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is only used once and can be replaed by ContainsColumn() followed by GetColumnRange() (at the cost of a small performance penalty). I think it's not worth it at this point adding the public method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, is it really a problem to have the method? I imagine this could be useful for users in general and since we're hiding the underlying maps we should not force a suboptimal access pattern via the API imho.

Comment on lines +1437 to +1443
if (!projectionPointeesInitialized) {
for (const auto &field : descCmp.fExtraDstFields) {
if (field->IsProjectedField())
projectionPointees[field->GetProjectionSourceId()].push_back(field);
}
projectionPointeesInitialized = true;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, for simplicity, we could always fill that map

Comment thread tree/ntuple/test/ntuple_merger.cxx Outdated
}
}

TEST(RNTupleMerger, MergeDeferredAdvanced2)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message says that we test fields with a different number of column representations. Can you pinpoint where this happens?

Comment thread tree/ntuple/inc/ROOT/RPageStorage.hxx Outdated
if (std::find_if(fColumnRepresentatives.begin(), fColumnRepresentatives.end(),
[&r](const auto &rep) { return r == rep.get(); }) == fColumnRepresentatives.end())
fColumnRepresentatives.emplace_back(*itRepresentative);
fColumnRepresentatives.emplace_back(*itRepresentative);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps also for backport.

Instead of calling continue multiple times in the AddColumnFromField
loop, just early return in case of projected fields.
We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.

e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
    - col 0 -> field A, repr 0
    - col 1 -> field B, repr 0

Now you add a new column representation to field "A"; this new column
has id 2:
    - col 2 -> field A, repr 1

When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.

After this change, they get written as [0, 1, 2] which is the correct
order.

Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.
This is necessary to support multiple representations that use the same
column type but different metadata (e.g. different bit width on
Real32Trunc columns)
In code, this feature is represented by an update in the logic of
RClusterDescriptorBuilder::AddExtendedColumnRanges(), which now handles
properly the case where a column is added in a later cluster and
therefore cannot rely on the (NEntries * NRepetitions) calculation, so
it instead copies its FirstElementIndex and NElements from the 0th
representation (which is guaranteed to have valid numbers for them).
- RPagePersistentSink::AddColumnRepresentation
- RPagePersistentSink::AddAliasColumn

Internal functionality to be used by the Merger.

This entails 2 additional changes:

- AddExtendedColumnRanges needs to be updated to handle the case where
a column representation is added to a field during writing after some
clusters have already been written;
- ShiftAliasColumns needs to properly shift the ids of extended alias
columns when called, otherwise a mismatch may happen when serializing
the footer
Two field descriptors might legitimately have different numbers of
elements in their LogicalColumnIds vectors due to having a different
number of representations.

The proper check to do is the one on the column cardinality, which
makes sure that the same number of columns is actually used at a time.
This test checks two things:

1. that the fix applied by 14ade04 works
   properly (this is already checked by MergeDeferredAdvanced but that
   might not be the case anymore if we update the merger to change the
   column encodings depending on the output compression - in which case
   that test would not be adding a late extended column anymore);
2. that the merger properly handles fields with the same cardinality
   but different number of column representations
…ation

Also set the anchor version in InitFromDescriptor().
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch from db9c393 to 9eaef1b Compare June 17, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants