fix(delta): null partition value for composite generated-column partitions #828
DeltaPartitionExtractor#getSerializedPartitionValue joined the values of
a composite (generated column) partition with Collectors.joining("-").
When one of the component values was absent from the partition values
map, the missing value was rendered as the literal string "null",
producing a corrupted partition value such as "2013-null-20" that was
then fed to the date parser.
Return null when any component value is missing so the partition value
resolves to null, consistent with the single-field branch that uses
getOrDefault(name, null). Added a regression test covering a composite
generated-column partition with a missing component.
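The failure mode and the guard described above can be sketched in isolation. This is a minimal, hypothetical stand-in and not the actual DeltaPartitionExtractor code: the class name CompositePartitionValueSketch and the helpers joinUnsafe and joinNullSafe are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; names here are illustrative, not taken from the project.
public class CompositePartitionValueSketch {

  // Old behavior: a component missing from the map is looked up as null,
  // and Collectors.joining renders that null as the literal string "null".
  static String joinUnsafe(List<String> names, Map<String, String> partitionValues) {
    return names.stream()
        .map(partitionValues::get)
        .collect(Collectors.joining("-"));
  }

  // Guarded behavior: if any component is missing, the whole composite
  // value resolves to null instead of a corrupted string.
  static String joinNullSafe(List<String> names, Map<String, String> partitionValues) {
    List<String> values = new ArrayList<>();
    for (String name : names) {
      String value = partitionValues.get(name);
      if (value == null) {
        return null;
      }
      values.add(value);
    }
    return String.join("-", values);
  }

  public static void main(String[] args) {
    List<String> names = Arrays.asList("year", "month", "day");
    Map<String, String> partitionValues = new HashMap<>();
    partitionValues.put("year", "2013");
    partitionValues.put("day", "20"); // "month" is deliberately absent

    System.out.println(joinUnsafe(names, partitionValues));   // prints 2013-null-20
    System.out.println(joinNullSafe(names, partitionValues)); // prints null
  }
}
```

Returning null early keeps the composite branch aligned with the single-field branch, where a missing key already resolves to null.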
DeltaKernelPartitionExtractor#getSerializedPartitionValue contained the
same composite (generated column) partition handling as the Spark-based
DeltaPartitionExtractor: joining component values with
Collectors.joining("-") rendered a missing component as the literal
string "null", corrupting the partition value.
Apply the same fix here - return null when any component value is
missing - and add a regression test mirroring the one added for
DeltaPartitionExtractor.
@alealandreev are you running into these cases in datasets you are working with? I am curious how the dataset can get into a state where one of the fields is missing.
The composite (generated column) partition columns are all derived from a single source column, so the realistic trigger for a missing component is a null source value, which makes every derived partition column null.
- Reframe the DeltaPartitionExtractor and DeltaKernelPartitionExtractor unit tests around that null-source case (all components null) instead of an artificial single-missing-component map.
- Add an integration test (ITDeltaConversionSource) that creates a Delta table partitioned by year/month/day generated columns derived from a nullable timestamp, inserts a row with a null timestamp, and verifies the snapshot resolves the partition value to null. Without the fix this reproduces the failure (ParseException on "null-null-null").
- Create the gen-col partitioned test table via the DeltaTable builder API instead of CREATE TABLE SQL (Spark 3.4 rejects generated partition columns via SQL), and use a named-column INSERT so generated columns are computed.
- Drop misleading scalaMap alias in the kernel test.
- Tighten comments.
@the-other-tim-brown The three partition columns are all generated from one source ( On write, Delta puts that file under So with all three components null, the old composite branch did This was never caught before because no existing partitioned test exercised a null partition value:
What is the purpose of the pull request
When a Delta table is partitioned by a composite generated-column partition (e.g. year/month/day columns all generated from a single source column), the partition extractor reconstructs the partition value by joining the component values with Collectors.joining("-").

If the source column value is null, every generated component is null, and the join produced the literal string "null-null-null" (StringJoiner renders null elements as "null"). That corrupted value was then passed to the date parser, throwing ParseException: Unable to parse partition value and failing the snapshot.

This affected both the Spark-based DeltaPartitionExtractor and the DeltaKernelPartitionExtractor, which carried identical logic. A null partition value is the correct result here: Delta stores null partition values as null in the AddFile partitionValues map, and the __HIVE_DEFAULT_PARTITION__ sentinel only ever appears in the physical directory path.

Brief change log
- DeltaPartitionExtractor#getSerializedPartitionValue: return null when any component of a composite partition is null, consistent with the single-field branch (getOrDefault(name, null)).
- DeltaKernelPartitionExtractor#getSerializedPartitionValue: same fix.

Verify this pull request
- testGeneratedPartitionValueExtractionWithNullSource in both TestDeltaPartitionExtractor and TestDeltaKernelPartitionExtractor: asserts the composite partition value resolves to null instead of "null-null-null".
- ITDeltaConversionSource#getCurrentSnapshotGenColPartitionedWithNullSourceTest: creates a Delta table partitioned by year/month/day generated from a nullable event_time, inserts a row with a null timestamp, and verifies the snapshot reads back one file with a null partition value (previously this threw while parsing the partition value).
- ITDeltaConversionSource: Tests run: 15, Failures: 0, Errors: 0.
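
For readers who want to see why "null-null-null" fails downstream while a null value passes through, the parsing step can be sketched as below. This is an illustration under stated assumptions: it uses java.text.SimpleDateFormat with a "yyyy-MM-dd" pattern as a stand-in for the project's date parser, and the class NullPartitionParseSketch and method parseToEpochMillis are hypothetical names, not part of the codebase.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical sketch; SimpleDateFormat is an assumed stand-in for the real parser.
public class NullPartitionParseSketch {

  // A null serialized value short-circuits to a null partition value;
  // anything else is parsed strictly against the assumed date pattern.
  static Long parseToEpochMillis(String serializedValue) throws ParseException {
    if (serializedValue == null) {
      return null;
    }
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    format.setLenient(false);
    return format.parse(serializedValue).getTime();
  }

  public static void main(String[] args) throws ParseException {
    // With the fix, a null source column yields a null serialized value.
    System.out.println(parseToEpochMillis(null)); // prints null

    // Without the fix, the joined string reaches the parser and is rejected.
    try {
      parseToEpochMillis("null-null-null");
    } catch (ParseException e) {
      System.out.println("ParseException: " + e.getMessage());
    }
  }
}
```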