
GH-3511: Add JMH encoding benchmarks and fix parquet-benchmarks shaded jar#3512

Open
iemejia wants to merge 3 commits into apache:master from iemejia:perf-benchmarks

Conversation


@iemejia iemejia commented Apr 19, 2026

Summary

Resolves #3511.

The parquet-benchmarks shaded jar built from current master is non-functional — it fails at runtime with RuntimeException: Unable to find the resource: /META-INF/BenchmarkList. This PR fixes that and adds 11 JMH benchmarks covering the encode/decode paths exercised by the open performance PRs, so reviewers can reproduce the reported numbers.

What's broken on master

parquet-benchmarks/pom.xml is missing two pieces of configuration:

  • maven-compiler-plugin lacks the annotationProcessorPaths / annotationProcessors config for jmh-generator-annprocess, so the JMH annotation processor never runs and META-INF/BenchmarkList / META-INF/CompilerHints are never generated.
  • maven-shade-plugin lacks AppendingTransformer entries for those two resources, so even if generated they would be dropped during shading.

Both problems are fixed in this PR.
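For readers skimming the diff, the two additions look roughly like the following pom fragments (a sketch: the exact plugin sections and the `${jmh.version}` property name are assumptions, not copied from this PR's diff):

```xml
<!-- maven-compiler-plugin: run the JMH annotation processor so that
     META-INF/BenchmarkList and META-INF/CompilerHints get generated -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <annotationProcessorPaths>
      <path>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>${jmh.version}</version>
      </path>
    </annotationProcessorPaths>
  </configuration>
</plugin>

<!-- maven-shade-plugin: keep (and merge) the generated JMH resources -->
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/BenchmarkList</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/CompilerHints</resource>
</transformer>
```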

Benchmarks added

11 new files in parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/:

| Benchmark | Coverage |
| --- | --- |
| `IntEncodingBenchmark` | int encode/decode: PLAIN, DELTA_BINARY_PACKED, BYTE_STREAM_SPLIT, RLE, DICTIONARY |
| `BinaryEncodingBenchmark` | Binary write/read paths, parameterized on length and cardinality |
| `ByteStreamSplitEncodingBenchmark` / `ByteStreamSplitDecodingBenchmark` | BSS for float / double / int / long |
| `FixedLenByteArrayEncodingBenchmark` | FLBA encode/decode |
| `FileReadBenchmark` / `FileWriteBenchmark` | CPU-focused file-level benchmarks |
| `RowGroupFlushBenchmark` | Flush path |
| `ConcurrentReadWriteBenchmark` | Multi-threaded read/write throughput |
| `BlackHoleOutputFile` | OutputFile that discards bytes — isolates CPU from I/O |
| `TestDataFactory` | Shared data-generation utilities |
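The idea behind `BlackHoleOutputFile` is an output sink that discards every byte while still tracking the write position, so file-level benchmarks measure CPU work rather than disk throughput. A minimal stdlib sketch of that idea (the real class implements Parquet's `OutputFile` interface; the class name and shape below are illustrative, not the PR's code):

```java
import java.io.OutputStream;

// Discards all bytes written to it, but still tracks how many bytes
// were "written" — the position a black-hole sink must report.
class DiscardingOutputStream extends OutputStream {
    private long pos = 0;

    @Override
    public void write(int b) {
        pos++; // drop the byte, advance the position
    }

    @Override
    public void write(byte[] b, int off, int len) {
        pos += len; // drop the whole buffer in O(1)
    }

    long getPos() {
        return pos;
    }
}
```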

Validation

After this PR, the shaded jar is runnable and registers 87 benchmarks:

$ ./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
$ java -jar parquet-benchmarks/target/parquet-benchmarks.jar -l | wc -l
87

Sanity check — IntEncodingBenchmark.decodePlain reproduces the master baseline cited in #3493/#3494 (~91M ops/s on JDK 21, JMH 1.37, 3 warmup + 5 measurement iterations):

Benchmark                            (dataPattern)   Mode  Cnt         Score         Error  Units
IntEncodingBenchmark.decodePlain        SEQUENTIAL  thrpt    5  93528419.575 ± 1472148.214  ops/s
IntEncodingBenchmark.decodePlain            RANDOM  thrpt    5  90908523.483 ± 1978982.394  ops/s
IntEncodingBenchmark.decodePlain   LOW_CARDINALITY  thrpt    5  92672978.255 ± 2071927.851  ops/s
IntEncodingBenchmark.decodePlain  HIGH_CARDINALITY  thrpt    5  90770177.655 ± 2427904.955  ops/s

Out of scope (deferred)

Modernization of the existing ReadBenchmarks / WriteBenchmarks / NestedNullWritingBenchmarks (Hadoop-free LocalInputFile, parameterization, JMH-idiomatic state setup) is a separate concern and will be proposed in a follow-up PR.

Follow-up

Once this lands, each open perf PR (#3494, #3496, #3500, #3504, #3506, #3510) will be updated with a one-line "How to reproduce" snippet referencing the relevant *Benchmark class.
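Such a snippet might look like the following (the benchmark name matches a class added in this PR; the JMH flags `-wi`, `-i`, `-f` are standard warmup/measurement/fork options, and the iteration counts mirror the setup cited above):

```shell
# build the shaded benchmarks jar (same skip flags as above)
./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

# run one benchmark: 3 warmup iterations, 5 measurement iterations, 1 fork
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    IntEncodingBenchmark.decodePlain -wi 3 -i 5 -f 1
```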

… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks. On master, a default build produces an unrunnable jar, and even a build that completes registers 0 benchmarks because the annotation processor never runs.
iemejia added 2 commits April 20, 2026 01:25
Pre-generate deterministic rows for the file and concurrent benchmarks so row construction does not skew the timed section, and make the encoding benchmarks include real dictionary-page and dictionary-decode work instead of only value buffers.

Split synthetic RLE dictionary-index decoding into its own benchmark, and encode generated binary payloads as UTF-8 explicitly so benchmark inputs stay consistent across runs and platforms.
Make the dictionary encode/decode benchmarks symmetric by routing both
sides through a shared EncodedDictionary helper, guard against the
dictionary writer falling back to plain encoding (which previously NPE'd
in BinaryEncodingBenchmark setup for high-cardinality long strings),
and drop redundant close() calls after toDictPageAndClose().

Share the pre-generated row array across threads in
ConcurrentReadWriteBenchmark via Scope.Benchmark, eliminating 4x heap
duplication and a now-unnecessary ThreadData inner class.

Centralize the RNG seed as TestDataFactory.DEFAULT_SEED and add
seed-overload variants for the int and binary generators so generators
in the same setup no longer share a Random and silently depend on call
order. Wrap the RLE encoder in try-with-resources and validate that
LOW_CARDINALITY_DISTINCT fits within the configured bit width.
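The seed-overload pattern described above can be sketched as follows: each generator seeds its own `Random` explicitly, so two generators called in the same setup are independent and reproducible regardless of call order. (`DEFAULT_SEED` is named in the commit; the generator method and the seed value here are illustrative.)

```java
import java.util.Random;

// Sketch of TestDataFactory's seed-overload generators: the explicit-seed
// variant makes output deterministic; the no-arg variant delegates to it.
class SeededGenerators {
    static final long DEFAULT_SEED = 42L; // illustrative value

    // Explicit-seed variant: a fresh Random per call, no shared state.
    static int[] randomInts(int n, long seed) {
        Random rnd = new Random(seed);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextInt();
        }
        return out;
    }

    // Default variant: same data on every run and platform.
    static int[] randomInts(int n) {
        return randomInts(n, DEFAULT_SEED);
    }
}
```

Because each call constructs its own `Random`, interleaving two generators can no longer change what either one produces.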
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 20, 2026
Replace fastutil's *2IntLinkedOpenHashMap with the plain *2IntOpenHashMap
plus a separate primitive-typed list to track insertion order in the five
dictionary writers (binary, long, double, float, int).

The Linked variant was used because the dictionary page must be emitted
in insertion order, but it pays an avoidable cost on every put: two extra
long fields per slot (prev, next), 3-4 scattered writes per insert to fix
up the doubly-linked list, and re-stitching on rehash. None of this is
vectorizable. With the plain map plus an append-only list, the hash map
is a pure id lookup with the smallest possible slot, and the list is
contiguous and cache-friendly to iterate at flush time.

Both candidates are fastutil primitive-keyed maps, so this is not a
boxing change. The win is structural: an ordering guarantee that was
being paid for on every insert is replaced with an explicit append-only
list that provides it more cheaply.

Benchmark results (BinaryEncodingBenchmark.encodeDictionary,
IntEncodingBenchmark.encodeDictionary - added in apache#3512):

  - encodeDictionary (binary, high cardinality, short strings): +23-42%
  - encodeDictionary (int, high cardinality):                   ~+2x
  - low-cardinality cases: flat (linked-list overhead doesn't matter
    when there are few inserts)

No public API change. No file format change. Behavior is identical:
dictionary pages emit values in the same order.

Validation: parquet-column 573 tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
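The map-plus-list structure this commit describes can be sketched with stdlib collections (the real writers use fastutil primitive-keyed maps to avoid boxing; boxed types here only illustrate the shape):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Structural sketch: a plain hash map assigns dictionary ids, and a
// separate append-only list records insertion order for emitting the
// dictionary page — the ordering guarantee the Linked map used to
// provide, now paid for only on first insert of each value.
class IntDictionarySketch {
    private final Map<Integer, Integer> ids = new HashMap<>();
    private final List<Integer> insertionOrder = new ArrayList<>();

    int idOf(int value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();            // next dictionary id
            ids.put(value, id);
            insertionOrder.add(value);  // ordering tracked here, not in the map
        }
        return id;
    }

    // Values in first-seen order, contiguous to iterate at flush time.
    List<Integer> pageValues() {
        return insertionOrder;
    }
}
```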
Development

Successfully merging this pull request may close these issues.

Add JMH benchmarks for encoding/decoding paths and fix parquet-benchmarks shaded jar

1 participant