Skip to content

feat: Fix offset overflow by casting slide_ids to large_string#3

Merged
JakubPekar merged 17 commits into
masterfrom
feature/large-meta-dataset
Apr 15, 2026
Merged

feat: Fix offset overflow by casting slide_ids to large_string#3
JakubPekar merged 17 commits into
masterfrom
feature/large-meta-dataset

Conversation

@JakubPekar

@JakubPekar JakubPekar commented Apr 5, 2026

Copy link
Copy Markdown
Collaborator

Cast slide_ids to large_string to prevent offset overflow.

Summary by CodeRabbit

  • Refactor

    • Improved slide/tile indexing to preserve original tile order and be robust for unsorted or very large datasets.
    • Data loading now consolidates partition sources (directory or single-file inputs) before concatenation and surfaces a clear load error instead of returning empty datasets.
  • New Features

    • Added an optional dataset-loading options parameter to customize how slide/tile files are read.
    • Tile filtering now accepts either text or binary slide identifiers.

Cast slide_ids to large_string to prevent offset overflow.
Copilot AI review requested due to automatic review settings April 5, 2026 19:53
@coderabbitai

coderabbitai Bot commented Apr 5, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

MetaTiledSlides now accepts passthrough hf_kwargs, stops sorting tiles before indexing, and builds tile indices from physical row positions using Arrow group_by(list). Slide/tile loading resolves partition inputs as directories or files and concatenates loaded datasets; prior missing-file early-empty behavior was removed.

Changes

Cohort / File(s) Summary
Tile index & loader
rationai/mlkit/data/datasets/meta_tiled_slides.py
Added optional hf_kwargs to constructor and loader; removed pre-sorting and run-end encoding. _build_tile_index creates a sequential idx array and uses Arrow group_by(...).aggregate([("idx","list")]) to map slide_id -> list[row_indices] (handles large string/binary slide IDs). filter_tiles_by_slide accepts `str

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Caller as Caller
participant Meta as MetaTiledSlides
participant HF as HF LoadDataset
participant Arrow as Apache Arrow
Caller->>Meta: request load_slides_and_tiles(paths, uris, hf_kwargs)
Meta->>HF: resolve partitions → call load_dataset(each, hf_kwargs)
HF-->>Meta: return slides & tiles datasets (per partition)
Meta->>Meta: concatenate partition datasets
Meta->>Arrow: build idx column (physical row positions)
Meta->>Arrow: group_by(slide_id).aggregate(("idx","list"))
Arrow-->>Meta: mapping slide_id -> ListScalar(indices)
Meta-->>Caller: return combined slides dataset and tile-indexed tiles dataset

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I count the rows without a race,
I keep each tile in its true place,
I group and list with Arrow’s art,
No sorting hops to break apart,
A tidy map — from slide to space 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main technical change: preventing offset overflow by casting slide_ids to large_string, which is the core focus of the refactored _build_tile_index method and related updates.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/large-meta-dataset

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a potential 32-bit offset overflow issue in Arrow tables by casting slide IDs to a large string format before combining chunks. A review comment correctly identified that the use of pa.large_string() would cause a NameError due to a missing import and suggested using a string alias as a fix.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a PyArrow offset overflow issue when building the slide_id -> tile index range lookup for large tile datasets by casting slide_id to large_string before calling combine_chunks() / run_end_encode().

Changes:

  • Cast slide_id Arrow column to large_string (64-bit offsets) prior to combine_chunks() to avoid 32-bit offset overflow.
  • Update comments around Arrow table access and the rationale for the cast.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 82-84: The code casts slide_ids using pc.cast(...,
pa.large_string()) but pyarrow is not imported, causing NameError; add an import
for pyarrow (e.g., import pyarrow as pa) at the top of the module so
pa.large_string() is defined; ensure the import sits alongside the existing
pyarrow.compute import used by pc and reference the cast usage in
large_slide_ids = pc.cast(slide_ids, pa.large_string()) to verify it compiles.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eef786e5-fe51-4474-a661-4d2bf7eb9cd3

📥 Commits

Reviewing files that changed from the base of the PR and between 91156fa and 186d943.

📒 Files selected for processing (1)
  • rationai/mlkit/data/datasets/meta_tiled_slides.py

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Co-authored-by: Jakub Pekar <46449289+JakubPekar@users.noreply.github.com>
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)

62-75: Update _build_tile_index()’s contract to match the new behavior.

The function now builds slide_id -> list[int] for unsorted tiles, but the annotation and docstring still promise dict[str, range] on a sorted dataset. Leaving the old contract in place will mislead callers and type checks.

Suggested cleanup
-    def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
-        """Creates a fast lookup table for slide indices.
+    def _build_tile_index(tiles: HFDataset) -> dict[str, list[int]]:
+        """Creates a fast lookup table for tile indices.
@@
-        This function builds a mapping from `slide_id` to the range of indices in the
-        `tiles` dataset that correspond to that slide. It assumes that the `tiles` dataset
-        is sorted by `slide_id`, which allows for efficient retrieval of tile indices
-        for each slide without needing to scan the entire dataset for each slide.
+        This function builds a mapping from `slide_id` to the physical row indices in the
+        `tiles` dataset that correspond to that slide. The input dataset does not need to
+        be sorted by `slide_id`.
@@
-            tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.
+            tiles: A dataset containing a `slide_id` column.
@@
-            A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
+            A dictionary mapping each `slide_id` to a list of row indices in the `tiles` dataset.

Also applies to: 86-87

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 62 - 75, The
_build_tile_index function currently documents and types its return as dict[str,
range] for a sorted tiles dataset but its implementation constructs slide_id ->
list[int] to support unsorted tiles; update the function signature and docstring
to reflect the actual contract: change the return annotation to dict[str,
list[int]] and rewrite the docstring to state it accepts unsorted HFDataset
tiles and returns a mapping from slide_id to a list of tile indices; also update
any adjacent code/comments (see _build_tile_index and usages of its return) and
adjust callers/type hints to expect list[int] instead of range where necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Line 83: The code calls ChunkedArray.to_pandas(name="slide_id") which raises
TypeError because to_pandas() doesn't accept a name arg; change the call to
first convert to pandas then name the column by moving the name to to_frame(),
e.g. replace slide_ids.to_pandas(name="slide_id").to_frame() with
slide_ids.to_pandas().to_frame(name="slide_id") so the dataframe column is
correctly named (look for the slide_ids variable usage around that line).

---

Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 62-75: The _build_tile_index function currently documents and
types its return as dict[str, range] for a sorted tiles dataset but its
implementation constructs slide_id -> list[int] to support unsorted tiles;
update the function signature and docstring to reflect the actual contract:
change the return annotation to dict[str, list[int]] and rewrite the docstring
to state it accepts unsorted HFDataset tiles and returns a mapping from slide_id
to a list of tile indices; also update any adjacent code/comments (see
_build_tile_index and usages of its return) and adjust callers/type hints to
expect list[int] instead of range where necessary.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1d78e3ab-80b0-4af3-bf12-d8966ea910d8

📥 Commits

Reviewing files that changed from the base of the PR and between a00ad1b and ced613c.

📒 Files selected for processing (1)
  • rationai/mlkit/data/datasets/meta_tiled_slides.py

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
@JakubPekar JakubPekar requested review from a team April 13, 2026 11:03
@JakubPekar JakubPekar requested a review from Copilot April 13, 2026 11:53
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@JakubPekar

Copy link
Copy Markdown
Collaborator Author

/gemini

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (2)

rationai/mlkit/data/datasets/meta_tiled_slides.py:70

  • self.tiles is no longer sorted, but _build_tile_index docstring (and type) still says the tiles dataset is sorted by slide_id and returns contiguous ranges. Either reintroduce sorting or update the docstring/typing and any assumptions to match the unsorted behavior.
        self.slides = slides
        self.tiles = tiles
        self._slide_id_to_indices = self._build_tile_index(self.tiles)

        super().__init__(self.generate_datasets())

    @staticmethod
    def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
        """Creates a fast lookup table for slide indices.

        This function builds a mapping from `slide_id` to the range of indices in the
        `tiles` dataset that correspond to that slide.

        Args:
            tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.

rationai/mlkit/data/datasets/meta_tiled_slides.py:73

  • _build_tile_index is annotated to return dict[str, range], but it now returns dict[slide_id, list[int]] (from to_pylist() on idx_list). This will break type-checking and is inconsistent with filter_tiles_by_slide expecting a range-like slice; update the return type/docstring and default in filter_tiles_by_slide to a consistent indices type (e.g., Sequence[int]).
    def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
        """Creates a fast lookup table for slide indices.

        This function builds a mapping from `slide_id` to the range of indices in the
        `tiles` dataset that correspond to that slide.

        Args:
            tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.

        Returns:
            A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
        """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)

62-72: Update _build_tile_index's contract to match the new implementation.

This helper now returns per-slide index lists, not range objects, and it no longer requires tiles to be pre-sorted by slide_id. Keeping the old signature/docstring will mislead callers and static analysis.

Suggested update
-    def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
+    def _build_tile_index(tiles: HFDataset) -> dict[str, list[int]]:
         """Creates a fast lookup table for slide indices.

-        This function builds a mapping from `slide_id` to the range of indices in the
-        `tiles` dataset that correspond to that slide.
+        This function builds a mapping from `slide_id` to the physical row indices in
+        the `tiles` dataset that correspond to that slide.

         Args:
-            tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.
+            tiles: A dataset containing a `slide_id` column.

         Returns:
-            A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
+            A dictionary mapping each `slide_id` to a list of row indices in the `tiles` dataset.
         """
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 62 - 72,
Update the _build_tile_index docstring and return type to reflect the new
behavior: state that it returns a dict[str, list[int]] (per-slide lists of
indices) instead of range objects, and remove the requirement that the tiles
HFDataset be pre-sorted by slide_id; also update the "Returns" section and any
type hints in the function signature to list[int] so callers and static analysis
see the correct contract (refer to function _build_tile_index and parameter
tiles: HFDataset).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 175-187: The constructor currently always attempts to build
slides_ds and tiles_ds via list comprehensions and concatenate_datasets, which
raises when both resolve_search_path("slides") and resolve_search_path("tiles")
return empty lists even if slides_and_tiles was passed; update the logic in
meta_tiled_slides.py to short-circuit that path: check the results of
resolve_search_path("slides") and resolve_search_path("tiles") first and only
call load_dataset/concatenate_datasets when those lists are non-empty, or
alternatively skip constructing slides_ds/tiles_ds and directly use the provided
slides_and_tiles tuple when slides_and_tiles is not None; refer to
resolve_search_path, load_dataset, concatenate_datasets, and
load_slides_and_tiles to locate and guard the problematic comprehensions.
- Around line 94-99: The current code re-materializes every tile index into
Python ints via grouped.column("idx_list").to_pylist(), causing huge memory use;
instead keep the grouped indices in Arrow/NumPy form and avoid to_pylist().
Replace the values assignment so you keep the Arrow list/array (e.g., values =
grouped.column("idx_list") or values = grouped.column("idx_list").to_numpy() /
to_numpy(arrow=False) depending on your pyarrow version) and then build
_slide_id_to_indices from keys (converted if needed) zipped with those
Arrow/NumPy arrays; ensure the returned dict maps slide IDs to the Arrow/NumPy
list-array objects rather than Python lists to avoid OOM.

---

Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 62-72: Update the _build_tile_index docstring and return type to
reflect the new behavior: state that it returns a dict[str, list[int]]
(per-slide lists of indices) instead of range objects, and remove the
requirement that the tiles HFDataset be pre-sorted by slide_id; also update the
"Returns" section and any type hints in the function signature to list[int] so
callers and static analysis see the correct contract (refer to function
_build_tile_index and parameter tiles: HFDataset).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ffca2efa-00b2-4e47-aca5-f61e19501204

📥 Commits

Reviewing files that changed from the base of the PR and between a00ad1b and a6db6c3.

📒 Files selected for processing (1)
  • rationai/mlkit/data/datasets/meta_tiled_slides.py

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
@JakubPekar JakubPekar requested a review from Copilot April 13, 2026 13:18
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

@coderabit review

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

rationai/mlkit/data/datasets/meta_tiled_slides.py:83

  • _build_tile_index() is annotated and documented as returning dict[str, range], but the implementation now returns a mapping to PyArrow ListScalars (from values_array[i]). This is a type contract break that can confuse callers and static type checkers; update the return annotation/docstring (and _slide_id_to_indices usage) to reflect the new value type, or convert the computed indices back into range/Python sequences consistently.
    @staticmethod
    def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
        """Creates a fast lookup table for slide indices.

        This function builds a mapping from `slide_id` to the range of indices in the
        `tiles` dataset that correspond to that slide.

        Args:
            tiles: A dataset containing a `slide_id` column.

        Returns:
            A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
        """

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

@coderabit review

@JakubPekar

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the MetaTiledSlides dataset to support chunked Parquet files and removes the sorting requirement for tile datasets. The _build_tile_index method was refactored to use PyArrow's grouping functionality, and the loading logic was modified to handle both single files and directory-based partitions. Feedback includes updating an outdated docstring to reflect the new index return type and restoring descriptive information in the error message for failed data loading to improve debuggability.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
@JakubPekar JakubPekar requested a review from Copilot April 13, 2026 14:26
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

/gemini review

@JakubPekar

Copy link
Copy Markdown
Collaborator Author

@CodeRabbit review

@coderabbitai

coderabbitai Bot commented Apr 13, 2026

Copy link
Copy Markdown
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the MetaTiledSlides dataset to support chunked Parquet files and introduces hf_kwargs for more flexible dataset loading via HuggingFace. The indexing logic in _build_tile_index has been refactored to use PyArrow's native grouping and aggregation, which removes the previous requirement for the tiles dataset to be pre-sorted. Feedback was provided to optimize the conversion of PyArrow scalars to NumPy arrays in filter_tiles_by_slide to avoid performance bottlenecks when handling large datasets.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)

59-63: ⚠️ Potential issue | 🔴 Critical

slides_and_tiles-only initialization is still broken.

Line 59 always loads source datasets, and Lines 187-188 return empty-schema datasets when no paths/URIs are provided. Then Lines 61-63 attempt concatenation, which can fail for slides_and_tiles-only usage.

💡 Proposed fix
-        slides, tiles = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs)
-
-        if slides_and_tiles is not None:
-            slides = concatenate_datasets([slides, slides_and_tiles[0]])
-            tiles = concatenate_datasets([tiles, slides_and_tiles[1]])
+        has_sources = bool(paths) or bool(uris)
+        if has_sources:
+            slides, tiles = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs)
+            if slides_and_tiles is not None:
+                slides = concatenate_datasets([slides, slides_and_tiles[0]])
+                tiles = concatenate_datasets([tiles, slides_and_tiles[1]])
+        else:
+            # validated by constructor precondition
+            slides, tiles = slides_and_tiles  # type: ignore[misc]
#!/bin/bash
# Verify the unconditional load + empty-source return path still exists.
rg -n "slides, tiles = self\.load_slides_and_tiles\(paths or \[\], uris or \[\], hf_kwargs\)" rationai/mlkit/data/datasets/meta_tiled_slides.py
rg -n -C2 "if not len\(search_dirs\):" rationai/mlkit/data/datasets/meta_tiled_slides.py
rg -n -C2 "if slides_and_tiles is not None:" rationai/mlkit/data/datasets/meta_tiled_slides.py

Also applies to: 187-189

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 59 - 63, The
current code always calls load_slides_and_tiles(paths or [], uris or [],
hf_kwargs) into slides, tiles which returns empty-schema datasets when no
paths/uris are provided, then unconditionally concatenates slides_and_tiles onto
them causing failures for slides_and_tiles-only usage; fix by only calling
load_slides_and_tiles when paths or uris are non-empty (or by allowing
load_slides_and_tiles to return None for empty inputs) and change the
concatenate logic so it handles three cases: (1) base slides/tiles present and
slides_and_tiles present -> concatenate via concatenate_datasets, (2) only
slides_and_tiles present -> assign slides, tiles = slides_and_tiles[0],
slides_and_tiles[1], and (3) only base present -> keep base; update the code
paths referencing load_slides_and_tiles, slides_and_tiles, slides, tiles, and
concatenate_datasets to implement these guards.
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)

215-217: Over-broad exception remapping reduces debuggability.

Line 216 converts every failure into FileNotFoundError, including schema/parse/runtime issues. Prefer preserving error class context or raising a neutral RuntimeError with source details.

💡 Proposed fix
-        except Exception as e:
-            msg = "Failed to load Parquet files."
-            raise FileNotFoundError(msg) from e
+        except Exception as e:
+            raise RuntimeError("Failed to load slides/tiles datasets.") from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 215 - 217,
The catch-all except block in meta_tiled_slides.py that does "except Exception
as e: ... raise FileNotFoundError(msg) from e" is remapping all errors to
FileNotFoundError and losing the original error class; change it to either
re-raise the original exception or raise a neutral RuntimeError that includes
the original exception message and context and keep "from e" to preserve the
traceback. Locate the try/except around the Parquet loading (the except
Exception as e block) and replace the FileNotFoundError remapping with either
"raise" to propagate e or "raise RuntimeError(f'Failed to load Parquet files:
{e}') from e" so schema/parse/runtime errors remain debuggable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 59-63: The current code always calls load_slides_and_tiles(paths
or [], uris or [], hf_kwargs) into slides, tiles which returns empty-schema
datasets when no paths/uris are provided, then unconditionally concatenates
slides_and_tiles onto them causing failures for slides_and_tiles-only usage; fix
by only calling load_slides_and_tiles when paths or uris are non-empty (or by
allowing load_slides_and_tiles to return None for empty inputs) and change the
concatenate logic so it handles three cases: (1) base slides/tiles present and
slides_and_tiles present -> concatenate via concatenate_datasets, (2) only
slides_and_tiles present -> assign slides, tiles = slides_and_tiles[0],
slides_and_tiles[1], and (3) only base present -> keep base; update the code
paths referencing load_slides_and_tiles, slides_and_tiles, slides, tiles, and
concatenate_datasets to implement these guards.

---

Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 215-217: The catch-all except block in meta_tiled_slides.py that
does "except Exception as e: ... raise FileNotFoundError(msg) from e" is
remapping all errors to FileNotFoundError and losing the original error class;
change it to either re-raise the original exception or raise a neutral
RuntimeError that includes the original exception message and context and keep
"from e" to preserve the traceback. Locate the try/except around the Parquet
loading (the except Exception as e block) and replace the FileNotFoundError
remapping with either "raise" to propagate e or "raise RuntimeError(f'Failed to
load Parquet files: {e}') from e" so schema/parse/runtime errors remain
debuggable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78cdb1f9-5af1-4877-97f4-1987b69476e6

📥 Commits

Reviewing files that changed from the base of the PR and between a6db6c3 and 16637a7.

📒 Files selected for processing (1)
  • rationai/mlkit/data/datasets/meta_tiled_slides.py

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

rationai/mlkit/data/datasets/meta_tiled_slides.py:144

  • The docstring claims filtering happens "without copying data", but the implementation materializes indices with tile_indices.as_py() (copies into a Python list) before calling select. Please adjust the docstring to reflect the actual behavior, or change the implementation to avoid the Python materialization if zero-copy is required.
        This function creates a view of the `self.tiles` dataset that contains only
        the tiles belonging to the specified slide. It uses the precomputed
        `_slide_id_to_indices` mapping to efficiently retrieve the relevant tiles
        without copying data.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

/gemini review

@JakubPekar

Copy link
Copy Markdown
Collaborator Author

@coderabit review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the MetaTiledSlides dataset to support chunked Parquet files stored in subdirectories and introduces hf_kwargs for more flexible data loading. The tile indexing logic was updated to use PyArrow's native grouping and aggregation for better performance and flexibility. Feedback includes improving the robustness of path resolution by verifying file existence, handling missing slide IDs more safely to avoid type mismatches, and enhancing error messages for better debuggability.

Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py Outdated
Comment thread rationai/mlkit/data/datasets/meta_tiled_slides.py
@JakubPekar JakubPekar merged commit 8758c82 into master Apr 15, 2026
2 of 3 checks passed
@JakubPekar JakubPekar deleted the feature/large-meta-dataset branch April 15, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants