fix(writer): exclude I8/I16 from global dict — reader can't decode narrow-int dict#100
Merged
Conversation
…rrow-int dict Mutation coverage (PR #99) surfaced that the writer's global dict admitted I8/U8/I16/U16 columns, but the reader's lazy dict decode only supports I32/I64/F32/F64 — a low-cardinality I16 column wrote a vortex.dict the reader rejected with "unsupported ptype for lazy dict: I16", i.e. an unreadable file. Cross-checked against the Rust reference: the JNI writer does NOT dict-encode a low-cardinality I16 column, and the Java reader reads its output back exactly (new RustWritesJavaReadsIntegrationTest#jniWriter_javaReader_lowCardinalityI16). So the Java reader is already Rust-conformant; the Java writer was the outlier. Fix: exclude I8/U8/I16/U16 from isDictCandidate (alongside F16/F32). Narrow-int dict gives no real benefit (a U8/U16 code is no smaller than the value), matches Rust, and only ever produced files the reader couldn't read — so excluding it is not a regression. A low-card I16 column now encodes via the cascade and round-trips (GlobalDictPrimitiveTest#lowCardinality_i16_notDicted_roundTrips). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the write/read incompatibility surfaced by mutation coverage in #99.
The bug
The writer's global dict admitted I8/U8/I16/U16 columns, but the reader's lazy dict decode (
ScanIterator.buildLazyDictPrimitive) only supports I32/I64/F32/F64. A low-cardinality I16 column therefore wrote avortex.dictthe reader rejected on read-back:i.e. the writer produced an unreadable file.
Cross-checked against Rust (ground truth)
New
RustWritesJavaReadsIntegrationTest#jniWriter_javaReader_lowCardinalityI16: the Rust/JNI writer encodes a 10k-row, 3-distinct I16 column and the Java reader reads it back exactly — no dict error. So Rust does not dict-encode narrow ints, and the Java reader is already Rust-conformant. The Java writer was the outlier.Fix
Exclude I8/U8/I16/U16 from
isDictCandidate(alongside F16/F32):A low-card I16 column now encodes via the cascade and round-trips cleanly (
GlobalDictPrimitiveTest#lowCardinality_i16_notDicted_roundTrips), andVortexWriterDictDecisionTestpins the new exclusion.Verify
🤖 Generated with Claude Code