perf(omezarr): open the z5 dataset once, not per tile #361

Open: sameeul wants to merge 1 commit into PolusAI:main from sameeul:perf/omezarr-cache-dataset-handle

sameeul commented Jul 1, 2026
Summary

Follow-up to the z5 3.0.1 upgrade (#360). Both OME-Zarr loaders opened the dataset on every tile read:

void loadTile(...) {
    auto ds = z5::openDataset(*zarr_ptr_, ds_name_);  // re-parses .zarray every call
    ...
    z5::multiarray::readSubarray<FileType>(*ds, view, offset.begin());
}

z5::openDataset re-stats and re-parses the dataset's .zarray metadata (shape, chunks, dtype, compressor) from disk each time. On a whole-slide image that's thousands of redundant parses of an immutable dataset.

This opens the dataset once in the constructor, caches the returned std::unique_ptr<z5::Dataset>, and reads every tile through the cached handle.
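
As a rough illustration of the open-once pattern described above, here is a minimal, self-contained sketch. It does not use z5: `FakeDataset`, `openFakeDataset`, `g_open_calls` and `CachedLoader` are hypothetical stand-ins invented for this example, and the real loaders may be structured differently. The counter only makes it visible that the metadata is parsed once, however many tiles are read.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for z5::Dataset: metadata that is fixed once parsed.
struct FakeDataset {
    std::vector<std::size_t> shape;   // full image extent (rows, cols)
    std::vector<std::size_t> chunks;  // tile extent (rows, cols)
};

// Counts how often the (mock) metadata parse runs.
int g_open_calls = 0;

// Hypothetical stand-in for z5::openDataset; the name argument is unused here.
std::unique_ptr<FakeDataset> openFakeDataset(const std::string& /*ds_name*/) {
    ++g_open_calls;
    auto ds = std::make_unique<FakeDataset>();
    ds->shape = {4096, 4096};
    ds->chunks = {512, 512};
    return ds;
}

class CachedLoader {
public:
    // Open once in the constructor; a failure here would surface at construction.
    explicit CachedLoader(const std::string& ds_name)
        : ds_(openFakeDataset(ds_name)) {}

    // Every tile read goes through the cached handle instead of re-opening.
    // Returns the linear offset of the tile origin as a placeholder for a real read.
    std::size_t loadTile(std::size_t tile_row, std::size_t tile_col) const {
        const FakeDataset& ds = *ds_;
        return tile_row * ds.chunks[0] * ds.shape[1] + tile_col * ds.chunks[1];
    }

private:
    std::unique_ptr<FakeDataset> ds_;
};
```

Constructing one `CachedLoader` and calling `loadTile` for every tile of an 8x8 grid leaves `g_open_calls` at 1, whereas the per-tile variant would reach 64.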

Why it's safe

  • readSubarray takes a const Dataset&, so a single cached handle is safe to share across all tile reads, including the multithreaded tile-loader path (reads don't mutate the handle).
  • The constructor already opens the store and throws on a bad dataset, so opening the handle there keeps the same failure semantics.
  • Pure optimization: no change to what's read or returned.
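
The sharing argument in the first bullet can be sketched with plain standard-library types. This is only an illustration under assumptions: `ReadOnlyHandle`, `sumRow` and `sumRowsConcurrently` are hypothetical names, and the sketch says nothing about z5 internals. A `const` reference rules out mutation through the public interface; whether concurrent reads are actually safe still depends on the callee holding no hidden mutable state, which is what the PR asserts for `readSubarray`.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical read-only handle; stands in for the cached dataset.
struct ReadOnlyHandle {
    std::vector<std::uint32_t> pixels;  // row-major pixel data
    std::size_t width;                  // pixels per row
};

// Reads one row through a const reference, mirroring a read path
// that takes `const Dataset&` and does not modify the handle.
std::uint64_t sumRow(const ReadOnlyHandle& h, std::size_t row) {
    auto first = h.pixels.begin() + static_cast<std::ptrdiff_t>(row * h.width);
    auto last = first + static_cast<std::ptrdiff_t>(h.width);
    return std::accumulate(first, last, std::uint64_t{0});
}

// One thread per row, all sharing the same handle.
// Each thread writes only its own output slot, so no locking is needed.
std::vector<std::uint64_t> sumRowsConcurrently(const ReadOnlyHandle& h,
                                               std::size_t rows) {
    std::vector<std::uint64_t> out(rows, 0);
    std::vector<std::thread> workers;
    workers.reserve(rows);
    for (std::size_t r = 0; r < rows; ++r) {
        workers.emplace_back([&h, &out, r] { out[r] = sumRow(h, r); });
    }
    for (auto& t : workers) {
        t.join();
    }
    return out;
}
```

The only shared state is read-only, and the writes go to disjoint elements of `out`, which is the same shape as tile loaders filling separate tile buffers from one cached handle.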

Testing

Verified both readers (NyxusOmeZarrLoader and RawOmezarrLoader) still return exact pixel values and full-image checksums against the committed tests/data/omezarr datasets, including the multi-tile / partial-tile cases — those now read all tiles through the single cached handle, which exercises the reuse path directly. The existing test_omezarr.h GTest suite covers this; all assertions pass.

Scope

Two files, +23/-9, no behavior change. This is item 1 from the #360 review follow-ups (the highest-value, zero-risk one). The remaining follow-ups (8-bit/float16 dtype-string matching, float→uint32 truncation, added dtype test fixtures) change what inputs the reader accepts and will come as a separate PR with their own test data.

🤖 Generated with Claude Code

Both OME-Zarr loaders called z5::openDataset(*zarr_ptr_, ds_name_)
inside loadTile, i.e. on every tile read. openDataset re-stats and
re-parses the dataset's .zarray metadata (shape, chunks, dtype,
compressor) from disk each time. On a whole-slide image that is
thousands of redundant metadata parses of an immutable dataset.

Open the dataset once in the constructor, cache the returned
std::unique_ptr<z5::Dataset>, and read every tile through the cached
handle. readSubarray takes a const Dataset&, so sharing one handle
across all (including multithreaded) tile reads is safe.

No behavior change: verified both readers still return exact pixel
values and checksums, including the multi-tile / partial-tile paths
that now exercise handle reuse across tiles.
sameeul requested a review from darkclad on July 2, 2026, 14:09