feat: Fix offset overflow by casting slide_ids to large_string#3
Conversation
Cast slide_ids to large_string to prevent offset overflow.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMetaTiledSlides now accepts passthrough hf_kwargs, stops sorting tiles before indexing, and builds tile indices from physical row positions using Arrow group_by(list). Slide/tile loading resolves partition inputs as directories or files and concatenates loaded datasets; prior missing-file early-empty behavior was removed. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses a potential 32-bit offset overflow issue in Arrow tables by casting slide IDs to a large string format before combining chunks. A review comment correctly identified that the use of pa.large_string() would cause a NameError due to a missing import and suggested using a string alias as a fix.
There was a problem hiding this comment.
Pull request overview
This PR addresses a PyArrow offset overflow issue when building the slide_id -> tile index range lookup for large tile datasets by casting slide_id to large_string before calling combine_chunks() / run_end_encode().
Changes:
- Cast
slide_idArrow column tolarge_string(64-bit offsets) prior tocombine_chunks()to avoid 32-bit offset overflow. - Update comments around Arrow table access and the rationale for the cast.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 82-84: The code casts slide_ids using pc.cast(...,
pa.large_string()) but pyarrow is not imported, causing NameError; add an import
for pyarrow (e.g., import pyarrow as pa) at the top of the module so
pa.large_string() is defined; ensure the import sits alongside the existing
pyarrow.compute import used by pc and reference the cast usage in
large_slide_ids = pc.cast(slide_ids, pa.large_string()) to verify it compiles.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eef786e5-fe51-4474-a661-4d2bf7eb9cd3
📒 Files selected for processing (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py
Co-authored-by: Jakub Pekar <46449289+JakubPekar@users.noreply.github.com>
Co-authored-by: Jakub Pekar <46449289+JakubPekar@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
62-75: Update_build_tile_index()’s contract to match the new behavior.The function now builds
slide_id -> list[int]for unsorted tiles, but the annotation and docstring still promisedict[str, range]on a sorted dataset. Leaving the old contract in place will mislead callers and type checks.Suggested cleanup
- def _build_tile_index(tiles: HFDataset) -> dict[str, range]: - """Creates a fast lookup table for slide indices. + def _build_tile_index(tiles: HFDataset) -> dict[str, list[int]]: + """Creates a fast lookup table for tile indices. @@ - This function builds a mapping from `slide_id` to the range of indices in the - `tiles` dataset that correspond to that slide. It assumes that the `tiles` dataset - is sorted by `slide_id`, which allows for efficient retrieval of tile indices - for each slide without needing to scan the entire dataset for each slide. + This function builds a mapping from `slide_id` to the physical row indices in the + `tiles` dataset that correspond to that slide. The input dataset does not need to + be sorted by `slide_id`. @@ - tiles: A dataset containing a `slide_id` column, sorted by `slide_id`. + tiles: A dataset containing a `slide_id` column. @@ - A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset. + A dictionary mapping each `slide_id` to a list of row indices in the `tiles` dataset.Also applies to: 86-87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 62 - 75, The _build_tile_index function currently documents and types its return as dict[str, range] for a sorted tiles dataset but its implementation constructs slide_id -> list[int] to support unsorted tiles; update the function signature and docstring to reflect the actual contract: change the return annotation to dict[str, list[int]] and rewrite the docstring to state it accepts unsorted HFDataset tiles and returns a mapping from slide_id to a list of tile indices; also update any adjacent code/comments (see _build_tile_index and usages of its return) and adjust callers/type hints to expect list[int] instead of range where necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Line 83: The code calls ChunkedArray.to_pandas(name="slide_id") which raises
TypeError because to_pandas() doesn't accept a name arg; change the call to
first convert to pandas then name the column by moving the name to to_frame(),
e.g. replace slide_ids.to_pandas(name="slide_id").to_frame() with
slide_ids.to_pandas().to_frame(name="slide_id") so the dataframe column is
correctly named (look for the slide_ids variable usage around that line).
---
Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 62-75: The _build_tile_index function currently documents and
types its return as dict[str, range] for a sorted tiles dataset but its
implementation constructs slide_id -> list[int] to support unsorted tiles;
update the function signature and docstring to reflect the actual contract:
change the return annotation to dict[str, list[int]] and rewrite the docstring
to state it accepts unsorted HFDataset tiles and returns a mapping from slide_id
to a list of tile indices; also update any adjacent code/comments (see
_build_tile_index and usages of its return) and adjust callers/type hints to
expect list[int] instead of range where necessary.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1d78e3ab-80b0-4af3-bf12-d8966ea910d8
📒 Files selected for processing (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
/gemini |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
rationai/mlkit/data/datasets/meta_tiled_slides.py:70
self.tilesis no longer sorted, but_build_tile_indexdocstring (and type) still says thetilesdataset is sorted byslide_idand returns contiguous ranges. Either reintroduce sorting or update the docstring/typing and any assumptions to match the unsorted behavior.
self.slides = slides
self.tiles = tiles
self._slide_id_to_indices = self._build_tile_index(self.tiles)
super().__init__(self.generate_datasets())
@staticmethod
def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
"""Creates a fast lookup table for slide indices.
This function builds a mapping from `slide_id` to the range of indices in the
`tiles` dataset that correspond to that slide.
Args:
tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.
rationai/mlkit/data/datasets/meta_tiled_slides.py:73
_build_tile_indexis annotated to returndict[str, range], but it now returnsdict[slide_id, list[int]](fromto_pylist()onidx_list). This will break type-checking and is inconsistent withfilter_tiles_by_slideexpecting a range-like slice; update the return type/docstring and default infilter_tiles_by_slideto a consistent indices type (e.g.,Sequence[int]).
def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
"""Creates a fast lookup table for slide indices.
This function builds a mapping from `slide_id` to the range of indices in the
`tiles` dataset that correspond to that slide.
Args:
tiles: A dataset containing a `slide_id` column, sorted by `slide_id`.
Returns:
A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
62-72: Update_build_tile_index's contract to match the new implementation.This helper now returns per-slide index lists, not
rangeobjects, and it no longer requirestilesto be pre-sorted byslide_id. Keeping the old signature/docstring will mislead callers and static analysis.Suggested update
- def _build_tile_index(tiles: HFDataset) -> dict[str, range]: + def _build_tile_index(tiles: HFDataset) -> dict[str, list[int]]: """Creates a fast lookup table for slide indices. - This function builds a mapping from `slide_id` to the range of indices in the - `tiles` dataset that correspond to that slide. + This function builds a mapping from `slide_id` to the physical row indices in + the `tiles` dataset that correspond to that slide. Args: - tiles: A dataset containing a `slide_id` column, sorted by `slide_id`. + tiles: A dataset containing a `slide_id` column. Returns: - A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset. + A dictionary mapping each `slide_id` to a list of row indices in the `tiles` dataset. """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 62 - 72, Update the _build_tile_index docstring and return type to reflect the new behavior: state that it returns a dict[str, list[int]] (per-slide lists of indices) instead of range objects, and remove the requirement that the tiles HFDataset be pre-sorted by slide_id; also update the "Returns" section and any type hints in the function signature to list[int] so callers and static analysis see the correct contract (refer to function _build_tile_index and parameter tiles: HFDataset).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 175-187: The constructor currently always attempts to build
slides_ds and tiles_ds via list comprehensions and concatenate_datasets, which
raises when both resolve_search_path("slides") and resolve_search_path("tiles")
return empty lists even if slides_and_tiles was passed; update the logic in
meta_tiled_slides.py to short-circuit that path: check the results of
resolve_search_path("slides") and resolve_search_path("tiles") first and only
call load_dataset/concatenate_datasets when those lists are non-empty, or
alternatively skip constructing slides_ds/tiles_ds and directly use the provided
slides_and_tiles tuple when slides_and_tiles is not None; refer to
resolve_search_path, load_dataset, concatenate_datasets, and
load_slides_and_tiles to locate and guard the problematic comprehensions.
- Around line 94-99: The current code re-materializes every tile index into
Python ints via grouped.column("idx_list").to_pylist(), causing huge memory use;
instead keep the grouped indices in Arrow/NumPy form and avoid to_pylist().
Replace the values assignment so you keep the Arrow list/array (e.g., values =
grouped.column("idx_list") or values = grouped.column("idx_list").to_numpy() /
to_numpy(arrow=False) depending on your pyarrow version) and then build
_slide_id_to_indices from keys (converted if needed) zipped with those
Arrow/NumPy arrays; ensure the returned dict maps slide IDs to the Arrow/NumPy
list-array objects rather than Python lists to avoid OOM.
---
Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 62-72: Update the _build_tile_index docstring and return type to
reflect the new behavior: state that it returns a dict[str, list[int]]
(per-slide lists of indices) instead of range objects, and remove the
requirement that the tiles HFDataset be pre-sorted by slide_id; also update the
"Returns" section and any type hints in the function signature to list[int] so
callers and static analysis see the correct contract (refer to function
_build_tile_index and parameter tiles: HFDataset).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ffca2efa-00b2-4e47-aca5-f61e19501204
📒 Files selected for processing (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py
|
@coderabit review |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py:83
- _build_tile_index() is annotated and documented as returning
dict[str, range], but the implementation now returns a mapping to PyArrow ListScalars (fromvalues_array[i]). This is a type contract break that can confuse callers and static type checkers; update the return annotation/docstring (and_slide_id_to_indicesusage) to reflect the new value type, or convert the computed indices back intorange/Python sequences consistently.
@staticmethod
def _build_tile_index(tiles: HFDataset) -> dict[str, range]:
"""Creates a fast lookup table for slide indices.
This function builds a mapping from `slide_id` to the range of indices in the
`tiles` dataset that correspond to that slide.
Args:
tiles: A dataset containing a `slide_id` column.
Returns:
A dictionary mapping each `slide_id` to a range of indices in the `tiles` dataset.
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@coderabit review |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the MetaTiledSlides dataset to support chunked Parquet files and removes the sorting requirement for tile datasets. The _build_tile_index method was refactored to use PyArrow's grouping functionality, and the loading logic was modified to handle both single files and directory-based partitions. Feedback includes updating an outdated docstring to reflect the new index return type and restoring descriptive information in the error message for failed data loading to improve debuggability.
|
/gemini review |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Code Review
This pull request enhances the MetaTiledSlides dataset to support chunked Parquet files and introduces hf_kwargs for more flexible dataset loading via HuggingFace. The indexing logic in _build_tile_index has been refactored to use PyArrow's native grouping and aggregation, which removes the previous requirement for the tiles dataset to be pre-sorted. Feedback was provided to optimize the conversion of PyArrow scalars to NumPy arrays in filter_tiles_by_slide to avoid performance bottlenecks when handling large datasets.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
59-63:⚠️ Potential issue | 🔴 Critical
slides_and_tiles-only initialization is still broken.Line 59 always loads source datasets, and Lines 187-188 return empty-schema datasets when no paths/URIs are provided. Then Lines 61-63 attempt concatenation, which can fail for
slides_and_tiles-only usage.💡 Proposed fix
- slides, tiles = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs) - - if slides_and_tiles is not None: - slides = concatenate_datasets([slides, slides_and_tiles[0]]) - tiles = concatenate_datasets([tiles, slides_and_tiles[1]]) + has_sources = bool(paths) or bool(uris) + if has_sources: + slides, tiles = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs) + if slides_and_tiles is not None: + slides = concatenate_datasets([slides, slides_and_tiles[0]]) + tiles = concatenate_datasets([tiles, slides_and_tiles[1]]) + else: + # validated by constructor precondition + slides, tiles = slides_and_tiles # type: ignore[misc]#!/bin/bash # Verify the unconditional load + empty-source return path still exists. rg -n "slides, tiles = self\.load_slides_and_tiles\(paths or \[\], uris or \[\], hf_kwargs\)" rationai/mlkit/data/datasets/meta_tiled_slides.py rg -n -C2 "if not len\(search_dirs\):" rationai/mlkit/data/datasets/meta_tiled_slides.py rg -n -C2 "if slides_and_tiles is not None:" rationai/mlkit/data/datasets/meta_tiled_slides.pyAlso applies to: 187-189
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 59 - 63, The current code always calls load_slides_and_tiles(paths or [], uris or [], hf_kwargs) into slides, tiles which returns empty-schema datasets when no paths/uris are provided, then unconditionally concatenates slides_and_tiles onto them causing failures for slides_and_tiles-only usage; fix by only calling load_slides_and_tiles when paths or uris are non-empty (or by allowing load_slides_and_tiles to return None for empty inputs) and change the concatenate logic so it handles three cases: (1) base slides/tiles present and slides_and_tiles present -> concatenate via concatenate_datasets, (2) only slides_and_tiles present -> assign slides, tiles = slides_and_tiles[0], slides_and_tiles[1], and (3) only base present -> keep base; update the code paths referencing load_slides_and_tiles, slides_and_tiles, slides, tiles, and concatenate_datasets to implement these guards.
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
215-217: Over-broad exception remapping reduces debuggability.Line 216 converts every failure into
FileNotFoundError, including schema/parse/runtime issues. Prefer preserving error class context or raising a neutralRuntimeErrorwith source details.💡 Proposed fix
- except Exception as e: - msg = "Failed to load Parquet files." - raise FileNotFoundError(msg) from e + except Exception as e: + raise RuntimeError("Failed to load slides/tiles datasets.") from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 215 - 217, The catch-all except block in meta_tiled_slides.py that does "except Exception as e: ... raise FileNotFoundError(msg) from e" is remapping all errors to FileNotFoundError and losing the original error class; change it to either re-raise the original exception or raise a neutral RuntimeError that includes the original exception message and context and keep "from e" to preserve the traceback. Locate the try/except around the Parquet loading (the except Exception as e block) and replace the FileNotFoundError remapping with either "raise" to propagate e or "raise RuntimeError(f'Failed to load Parquet files: {e}') from e" so schema/parse/runtime errors remain debuggable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 59-63: The current code always calls load_slides_and_tiles(paths
or [], uris or [], hf_kwargs) into slides, tiles which returns empty-schema
datasets when no paths/uris are provided, then unconditionally concatenates
slides_and_tiles onto them causing failures for slides_and_tiles-only usage; fix
by only calling load_slides_and_tiles when paths or uris are non-empty (or by
allowing load_slides_and_tiles to return None for empty inputs) and change the
concatenate logic so it handles three cases: (1) base slides/tiles present and
slides_and_tiles present -> concatenate via concatenate_datasets, (2) only
slides_and_tiles present -> assign slides, tiles = slides_and_tiles[0],
slides_and_tiles[1], and (3) only base present -> keep base; update the code
paths referencing load_slides_and_tiles, slides_and_tiles, slides, tiles, and
concatenate_datasets to implement these guards.
---
Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 215-217: The catch-all except block in meta_tiled_slides.py that
does "except Exception as e: ... raise FileNotFoundError(msg) from e" is
remapping all errors to FileNotFoundError and losing the original error class;
change it to either re-raise the original exception or raise a neutral
RuntimeError that includes the original exception message and context and keep
"from e" to preserve the traceback. Locate the try/except around the Parquet
loading (the except Exception as e block) and replace the FileNotFoundError
remapping with either "raise" to propagate e or "raise RuntimeError(f'Failed to
load Parquet files: {e}') from e" so schema/parse/runtime errors remain
debuggable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 78cdb1f9-5af1-4877-97f4-1987b69476e6
📒 Files selected for processing (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py:144
- The docstring claims filtering happens "without copying data", but the implementation materializes indices with
tile_indices.as_py()(copies into a Python list) before callingselect. Please adjust the docstring to reflect the actual behavior, or change the implementation to avoid the Python materialization if zero-copy is required.
This function creates a view of the `self.tiles` dataset that contains only
the tiles belonging to the specified slide. It uses the precomputed
`_slide_id_to_indices` mapping to efficiently retrieve the relevant tiles
without copying data.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
|
@coderabit review |
There was a problem hiding this comment.
Code Review
This pull request refactors the MetaTiledSlides dataset to support chunked Parquet files stored in subdirectories and introduces hf_kwargs for more flexible data loading. The tile indexing logic was updated to use PyArrow's native grouping and aggregation for better performance and flexibility. Feedback includes improving the robustness of path resolution by verifying file existence, handling missing slide IDs more safely to avoid type mismatches, and enhancing error messages for better debuggability.
Cast slide_ids to large_string to prevent offset overflow.
Summary by CodeRabbit
Refactor
New Features