[ntuple] Support multiple column representations in the merger#22017
[ntuple] Support multiple column representations in the merger#22017silverweed wants to merge 17 commits into
Conversation
b2ae5fc to
1db6b5e
Compare
Test Results 19 files 19 suites 2d 20h 11m 13s ⏱️ For more details on these failures, see this check. Results for commit 9eaef1b. ♻️ This comment has been updated with latest results. |
82e5299 to
d3efcfc
Compare
857d425 to
60b1bde
Compare
1d462ee to
a249301
Compare
a249301 to
ab08930
Compare
indeed, we do need to provide a way for the user to require the L3 type of merging ('urgency' of this is less if we already have a way to for L4 type of merging). Addendum: L4 is currently not supported, so re-adding L3 would be helpful. In particular to allow re-selection of the compression algorithm used. |
|
|
||
| | Flag Bit | Introduced in | Name | Meaning | | ||
| |----------|---------------|-------------------------|----------------------------------------------| | ||
| | 0 | 1.1.0.0 | Nested Deferred Columns | Signals that the RNTuple contains at least one deferred column that is part of a collection and was extended<br>(i.e. it appears in the footer). This can happen when merging two RNTuples that have the same collection field<br>backed by columns with different encoding, e.g. a `vector<float>` whose elements are represented by SplitReal32<br>in the first ntuple and by Real32 in the second. | |
There was a problem hiding this comment.
This mentions explicitly collections. Is the feature (merging RNTuple with 'same' column with different representation) not supported for simple type (i.e. just a float instead of a vector<float>)? If not, why not?
There was a problem hiding this comment.
It is supported but it requires no feature flag because it produces files that are still readable by previous ROOT versions.
The feature flag is only added if the feature produces files that are forward-incompatible for older versions, which is only the case for collections.
|
I did not mean to close this. |
hahnjo
left a comment
There was a problem hiding this comment.
Some initial comments for the preparatory changes. I wasn't able to properly review the actual changes in the merger, it's probably me missing enough context and time to thoroughly think it through...
4697f42 to
db9c393
Compare
jblomer
left a comment
There was a problem hiding this comment.
Thank you! That is a huge step forward.
With all the added complexity, I think we need to give the merger another fresh look and try reduce the logic a bit to smaller blocks. For a future PR.
We should remember to backport
- [ntuple] Fix RFieldFundamental::GenerateColumns using the wrong repr idx
- [ntuple] update merger test to make sure we test nRepetitions
- [ntuple] Clarify a bit RFieldBase::EntryToColumnElementIndex
- [ntuple] Introduce feature flag 0 (nested deferred columns)
- Patch up column ids in the header extension (part of
[ntuple] Support multiple column representations in the merger)
| const RColumnRange *TryGetColumnRange(ROOT::DescriptorId_t physicalId) const | ||
| { | ||
| if (auto it = fColumnRanges.find(physicalId); it != fColumnRanges.end()) | ||
| return &it->second; | ||
| return nullptr; | ||
| } |
There was a problem hiding this comment.
I think this is only used once and can be replaed by ContainsColumn() followed by GetColumnRange() (at the cost of a small performance penalty). I think it's not worth it at this point adding the public method.
There was a problem hiding this comment.
On the other hand, is it really a problem to have the method? I imagine this could be useful for users in general and since we're hiding the underlying maps we should not force a suboptimal access pattern via the API imho.
| if (!projectionPointeesInitialized) { | ||
| for (const auto &field : descCmp.fExtraDstFields) { | ||
| if (field->IsProjectedField()) | ||
| projectionPointees[field->GetProjectionSourceId()].push_back(field); | ||
| } | ||
| projectionPointeesInitialized = true; | ||
| } |
There was a problem hiding this comment.
Again, for simplicity, we could always fill that map
| } | ||
| } | ||
|
|
||
| TEST(RNTupleMerger, MergeDeferredAdvanced2) |
There was a problem hiding this comment.
The commit message says that we test fields with a different number of column representations. Can you pinpoint where this happens?
| if (std::find_if(fColumnRepresentatives.begin(), fColumnRepresentatives.end(), | ||
| [&r](const auto &rep) { return r == rep.get(); }) == fColumnRepresentatives.end()) | ||
| fColumnRepresentatives.emplace_back(*itRepresentative); | ||
| fColumnRepresentatives.emplace_back(*itRepresentative); |
There was a problem hiding this comment.
Perhaps also for backport.
Instead of calling continue multiple times in the AddColumnFromField loop, just early return in case of projected fields.
We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.
e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
- col 0 -> field A, repr 0
- col 1 -> field B, repr 0
Now you add a new column representation to field "A"; this new column
has id 2:
- col 2 -> field A, repr 1
When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.
After this change, they get written as [0, 1, 2] which is the correct
order.
Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.
Also fix the type of result
This is necessary to support multiple representations that use the same column type but different metadata (e.g. different bit width on Real32Trunc columns)
In code, this feature is represented by an update in the logic of RClusterDescriptorBuilder::AddExtendedColumnRanges(), which now handles properly the case where a column is added in a later cluster and therefore cannot rely on the (NEntries * NRepetitions) calculation, so it instead copies its FirstElementIndex and NElements from the 0th representation (which is guaranteed to have valid numbers for them).
- RPagePersistentSink::AddColumnRepresentation - RPagePersistentSink::AddAliasColumn Internal functionality to be used by the Merger. This entails 2 additional changes: - AddExtendedColumnRanges needs to be updated to handle the case where a column representation is added to a field during writing after some clusters have already been written; - ShiftAliasColumns needs to properly shift the ids of extended alias columns when called, otherwise a mismatch may happen when serializing the footer
Two field descriptors might legitimately have different numbers of elements in their LogicalColumnIds vectors due to having a different number of representations. The proper check to do is the one on the column cardinality, which makes sure that the same number of columns is actually used at a time.
This test checks two things: 1. that the fix applied by 14ade04 works properly (this is already checked by MergeDeferredAdvanced but that might not be the case anymore if we update the merger to change the column encodings depending on the output compression - in which case that test would not be adding a late extended column anymore); 2. that the merger properly handles fields with the same cardinality but different number of column representations
…ation Also set the anchor version in InitFromDescriptor().
db9c393 to
9eaef1b
Compare
This Pull request:
Significantly reworks the innards of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
Basically it does two things:
A potentially negative consequence that we might want to revisit is that now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.
In order to achieve this, some new internal functionality had to be added, most notably
RPagePersistentSink::AddColumnRepresentation.Note that this PR is independent on #21740, which in fact might not be needed at all.
IMPORTANT
This PR introduces our first feature flag and thus the first bump to the specs' major version (1.1.0.0). This means we can now start producing RNTuples which cannot be read by older ROOT versions.
TODO
AddExtendedColumnRangesChecklist: