fix(delta): null partition value for composite generated-column partitions #828

Merged
vinishjail97 merged 4 commits into apache:main from alealandreev:fix/delta-composite-partition-null-value on Jun 22, 2026

Conversation

@alealandreev alealandreev commented Jun 6, 2026

What is the purpose of the pull request

When a Delta table is partitioned by a composite generated-column partition — e.g. year/month/day columns all generated from a single source column — the partition extractor reconstructs the partition value by joining the component values with Collectors.joining("-").

If the source column value is null, every generated component is null, and the join produced the literal string "null-null-null" (StringJoiner renders null elements as "null"). That corrupted value was then passed to the date parser, throwing ParseException: Unable to parse partition value and failing the snapshot.
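The join behaviour can be reproduced in isolation. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and are not taken from the extractor code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of XTable.
final class NullJoinDemo {
  // Joins component values with "-" the way the old composite branch is described to.
  static String joinComponents(List<String> components) {
    return components.stream().collect(Collectors.joining("-"));
  }

  public static void main(String[] args) {
    // Every generated component is null because the source column is null.
    List<String> components = Arrays.asList((String) null, null, null);
    System.out.println(joinComponents(components)); // prints null-null-null
  }
}
```

No exception is thrown at the join itself, which is why the failure only surfaced later, in the date parser.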

This affected both the Spark-based DeltaPartitionExtractor and the DeltaKernelPartitionExtractor, which carried identical logic. A null partition value is the correct result here: Delta stores null partition values as null in the AddFile partitionValues map, and the __HIVE_DEFAULT_PARTITION__ sentinel only ever appears in the physical directory path.

Brief change log

  • DeltaPartitionExtractor#getSerializedPartitionValue: return null when any component of a composite partition is null, consistent with the single-field branch (getOrDefault(name, null)).
  • DeltaKernelPartitionExtractor#getSerializedPartitionValue: same fix.
  • Added regression tests (unit + integration).
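For readers who want the shape of the fix without opening the diff, here is a rough sketch of the null-propagating behaviour described above. It is not the merged code: the class name, method name, and signature are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual DeltaPartitionExtractor implementation.
final class CompositePartitionValueSketch {
  // Returns null if any component is null or absent; otherwise joins the components with "-".
  static String serialize(Map<String, String> partitionValues, List<String> columnNames) {
    List<String> components = new ArrayList<>();
    for (String name : columnNames) {
      String value = partitionValues.get(name);
      if (value == null) {
        return null; // a null component makes the whole composite value null
      }
      components.add(value);
    }
    return String.join("-", components);
  }

  public static void main(String[] args) {
    List<String> columns = Arrays.asList("year", "month", "day");

    Map<String, String> populated = new HashMap<>();
    populated.put("year", "2013");
    populated.put("month", "08");
    populated.put("day", "20");
    System.out.println(serialize(populated, columns)); // prints 2013-08-20

    Map<String, String> nullSource = new HashMap<>();
    nullSource.put("year", null);
    nullSource.put("month", null);
    nullSource.put("day", null);
    System.out.println(serialize(nullSource, columns)); // prints null
  }
}
```

Returning null on the first null component keeps the composite branch consistent with the single-field branch, which already yields null for a missing key.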

Verify this pull request

  • Unit: testGeneratedPartitionValueExtractionWithNullSource in both TestDeltaPartitionExtractor and TestDeltaKernelPartitionExtractor — asserts the composite partition value resolves to null instead of "null-null-null".
  • Integration: ITDeltaConversionSource#getCurrentSnapshotGenColPartitionedWithNullSourceTest — creates a Delta table partitioned by year/month/day generated from a nullable event_time, inserts a row with a null timestamp, and verifies the snapshot reads back one file with a null partition value (previously this threw while parsing the partition value).
  • Verified locally: ITDeltaConversionSource, Tests run: 15, Failures: 0, Errors: 0.

alealandreev and others added 2 commits June 3, 2026 21:33
DeltaPartitionExtractor#getSerializedPartitionValue joined the values of
a composite (generated column) partition with Collectors.joining("-").
When one of the component values was absent from the partition values
map, the missing value was rendered as the literal string "null",
producing a corrupted partition value such as "2013-null-20" that was
then fed to the date parser.

Return null when any component value is missing so the partition value
resolves to null, consistent with the single-field branch that uses
getOrDefault(name, null). Added a regression test covering a composite
generated-column partition with a missing component.

DeltaKernelPartitionExtractor#getSerializedPartitionValue contained the
same composite (generated column) partition handling as the Spark-based
DeltaPartitionExtractor: joining component values with
Collectors.joining("-") rendered a missing component as the literal
string "null", corrupting the partition value.

Apply the same fix here - return null when any component value is
missing - and add a regression test mirroring the one added for
DeltaPartitionExtractor.
@the-other-tim-brown

@alealandreev are you running into these cases in datasets you are working with? I am curious how the dataset can get into a state where one of the fields is missing.

The composite (generated column) partition columns are all derived from
a single source column, so the realistic trigger for a missing component
is a null source value, which makes every derived partition column null.

- Reframe the DeltaPartitionExtractor and DeltaKernelPartitionExtractor
  unit tests around that null-source case (all components null) instead
  of an artificial single-missing-component map.
- Add an integration test (ITDeltaConversionSource) that creates a Delta
  table partitioned by year/month/day generated columns derived from a
  nullable timestamp, inserts a row with a null timestamp, and verifies
  the snapshot resolves the partition value to null. Without the fix this
  reproduces the failure (ParseException on "null-null-null").
- Create the gen-col partitioned test table via the DeltaTable builder API
  instead of CREATE TABLE SQL (Spark 3.4 rejects generated partition columns
  via SQL), and use a named-column INSERT so generated columns are computed.
- Drop misleading scalaMap alias in the kernel test.
- Tighten comments.
@vinishjail97

@the-other-tim-brown The three partition columns are all generated from one source (event_time), so a row with a null event_time makes year/month/day all null at once — not one-of-three. It's just a nullable-timestamp scenario (optional / late-arriving / backfilled rows; generated columns are nullable unless constrained).

On write, Delta puts that file under …=__HIVE_DEFAULT_PARTITION__/… and parses it back into the AddFile partitionValues as genuine nulls — DelayedCommitProtocol.parsePartitions (v2.4.0 L121-146) casts the NULL literal produced by PartitionUtils (v2.4.0 L586-589) to a null string. On read both Delta (kernel literalForPartitionValue, v4.0.0 L479-482) and xtable take partition values from that map, where they are nulls — the __HIVE_DEFAULT_PARTITION__ sentinel only ever lives in the physical path.
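To illustrate the distinction between the path sentinel and the stored value, here is a small sketch. It is not Delta's DelayedCommitProtocol or PartitionUtils code; the class, method, and example path are hypothetical, and the parsing is deliberately simplified.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; Delta's real path parsing is more involved.
final class HivePathSketch {
  static final String SENTINEL = "__HIVE_DEFAULT_PARTITION__";

  // Extracts name=value segments from a relative path, mapping the sentinel to a genuine null.
  static Map<String, String> parsePartitionValues(String relativePath) {
    Map<String, String> values = new LinkedHashMap<>();
    for (String segment : relativePath.split("/")) {
      int eq = segment.indexOf('=');
      if (eq <= 0) {
        continue; // not a partition segment, e.g. the data file name
      }
      String name = segment.substring(0, eq);
      String raw = segment.substring(eq + 1);
      values.put(name, SENTINEL.equals(raw) ? null : raw);
    }
    return values;
  }

  public static void main(String[] args) {
    String path = "year=" + SENTINEL + "/month=" + SENTINEL + "/day=" + SENTINEL + "/part-0000.parquet";
    System.out.println(parsePartitionValues(path)); // prints {year=null, month=null, day=null}
  }
}
```

A consumer of the resulting map sees null values, never the sentinel string, so it has to handle null explicitly.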

So with all three components null, the old composite branch did Collectors.joining("-") over [null, null, null], which yields the literal string "null-null-null" (StringJoiner appends "null" for null elements — no NPE). convertFromDeltaPartitionValue then skips its null guard and, since this is a DAY transform, runs formatter.parse("null-null-null"), which throws org.apache.xtable.model.exception.ParseException: Unable to parse partition value. The fix returns null for the composite when any component is null, matching Delta's own read semantics.
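The parse step can be sketched the same way. This is a simplified stand-in rather than convertFromDeltaPartitionValue itself: the "yyyy-MM-dd" pattern, the UTC time zone, and the use of IllegalArgumentException in place of XTable's own ParseException are all assumptions.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical stand-in for the DAY-transform parsing step.
final class PartitionValueParseSketch {
  // Returns epoch millis for a serialized day value, or null when the value is null.
  static Long toEpochMillis(String serialized) {
    if (serialized == null) {
      return null; // the null guard that the literal "null-null-null" slipped past
    }
    SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
    formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
    formatter.setLenient(false);
    try {
      return formatter.parse(serialized).getTime();
    } catch (ParseException e) {
      // Stand-in for the project's own unchecked parse exception.
      throw new IllegalArgumentException("Unable to parse partition value: " + serialized, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toEpochMillis(null)); // prints null
    System.out.println(toEpochMillis("1970-01-02")); // prints 86400000
    try {
      toEpochMillis("null-null-null");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints Unable to parse partition value: null-null-null
    }
  }
}
```

With the fix in place the serialized value is a real null, so the guard short-circuits before the formatter is reached.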

This was never caught before because no existing partitioned test exercised a null partition value: testVariousOperations uses a single-field level:VALUE partition (never the multi-field join, never null), and the only generated partition (YEAR(birthDate)) is a single column. The new test is the first to hit the composite branch with a null source.

@vinishjail97 vinishjail97 changed the title from "Fix/delta composite partition null value" to "fix(delta): null partition value for composite generated-column partitions" on Jun 22, 2026
@vinishjail97 vinishjail97 merged commit 244003d into apache:main Jun 22, 2026
2 checks passed
3 participants