refactor: extract SlidesTilesLoader from MetaTiledSlides#5
Conversation
Separates metadata loading/indexing logic (no torch dependency) into SlidesTilesLoader; MetaTiledSlides delegates to it while keeping the same public interface.
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 55 minutes and 38 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughMetaTiledSlides delegates slide/tile loading, concatenation, and slide→tile indexing to a new SlidesTilesLoader. MetaTiledSlides assigns Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant MetaTiledSlides
participant SlidesTilesLoader
participant DataSource as FileSystem/MLFlow/Preloaded
Client->>MetaTiledSlides: __init__(paths/uris/slides_and_tiles)
MetaTiledSlides->>SlidesTilesLoader: __init__(paths/uris/slides_and_tiles, hf_kwargs)
SlidesTilesLoader->>DataSource: load_slides_and_tiles(...)
DataSource-->>SlidesTilesLoader: (slides, tiles) HFDataset
SlidesTilesLoader->>SlidesTilesLoader: precompute _slide_id_to_indices
SlidesTilesLoader-->>MetaTiledSlides: loader initialized
rect rgba(100,150,200,0.5)
Client->>MetaTiledSlides: filter_tiles_by_slide(slide_id)
MetaTiledSlides->>SlidesTilesLoader: filter_tiles_by_slide(slide_id)
SlidesTilesLoader->>SlidesTilesLoader: lookup indices -> select tiles rows
SlidesTilesLoader-->>MetaTiledSlides: filtered HFDataset
MetaTiledSlides-->>Client: filtered tiles
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the MetaTiledSlides class by extracting its data loading and indexing logic into a new SlidesTilesLoader class. MetaTiledSlides now acts as a wrapper that delegates loading and filtering operations to this helper class. Review feedback suggests improving error handling by avoiding broad exception catching that masks the root cause of failures and updating the MetaTiledSlides docstring to correctly describe all supported data sources.
There was a problem hiding this comment.
Code Review
This pull request refactors the MetaTiledSlides class by delegating its data loading and indexing logic to a new SlidesTilesLoader class. The review feedback identifies an opportunity to improve error handling by using a more generic RuntimeError instead of FileNotFoundError when Parquet file loading fails, as the error may stem from issues other than missing files. Furthermore, the docstring for MetaTiledSlides.init should be updated to reflect the expanded support for local paths and preloaded datasets.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
rationai/mlkit/data/datasets/meta_tiled_slides.py (2)
44-46:⚠️ Potential issue | 🟠 MajorReplace
assertwith an explicit exception for input validation.
assertstatements are stripped when Python runs with-O/PYTHONOPTIMIZE, which would silently accept an invalid construction and surface later as an obscure error (emptysearch_dirs+ a concat against empty datasets). Prefer an explicit exception for public-API validation.🛡️ Proposed fix
- assert paths or uris or slides_and_tiles, ( - "At least one of paths, uris or slides_and_tiles must be provided." - ) + if not (paths or uris or slides_and_tiles): + raise ValueError( + "At least one of paths, uris or slides_and_tiles must be provided." + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 44 - 46, The code currently uses an assert to validate inputs (checking paths or uris or slides_and_tiles) which can be stripped under -O; replace that assert with an explicit input validation that raises a clear exception (e.g., ValueError) when none of paths, uris, or slides_and_tiles are provided, preserving the existing error message and ensuring downstream code (like search_dirs concatenation in this module) fails early with a clear error; update the check in the function/constructor where paths, uris, and slides_and_tiles are validated (the same location as the current assert) to raise the exception instead of asserting.
51-55:⚠️ Potential issue | 🟠 MajorSchema-mismatch risk when only
slides_and_tilesis provided.When callers pass only
slides_and_tiles(nopaths, nouris),load_slides_and_tiles([], [], ...)returnsHFDataset.from_dict({})at lines 61–62, which has no defined feature schema.concatenate_datasets([empty, slides_and_tiles[0]])then fails because the empty dataset lacks features that the populated dataset has. Consider short-circuiting when bothpathsandurisare empty, or skipping the concat when the loaded side is empty.♻️ Suggested refactor
- slides, tiles = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs) - - if slides_and_tiles is not None: - slides = concatenate_datasets([slides, slides_and_tiles[0]]) - tiles = concatenate_datasets([tiles, slides_and_tiles[1]]) - - self.slides = slides - self.tiles = tiles + if paths or uris: + slides, tiles = self.load_slides_and_tiles( + paths or [], uris or [], hf_kwargs + ) + if slides_and_tiles is not None: + slides = concatenate_datasets([slides, slides_and_tiles[0]]) + tiles = concatenate_datasets([tiles, slides_and_tiles[1]]) + else: + slides, tiles = slides_and_tiles # guaranteed non-None by the check above + + self.slides = slides + self.tiles = tiles🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 51 - 55, In load_slides_and_tiles usage inside meta_tiled_slides.py, avoid concatenating an empty HFDataset (created by load_slides_and_tiles(paths or [], uris or [], ...)) with slides_and_tiles[0] because the empty dataset has no feature schema; instead short-circuit or skip the concat: if both paths and uris are empty, assign slides, tiles = slides_and_tiles directly (or if slides returned from load_slides_and_tiles has no features/num_rows==0, skip concatenating and use the non-empty slides_and_tiles entries). Modify the block that currently calls concatenate_datasets([slides, slides_and_tiles[0]]) and concatenate_datasets([tiles, slides_and_tiles[1]]) to first check for emptiness (e.g., absence of features or zero rows) and only call concatenate_datasets when both sides have schemas; otherwise set slides/tiles to the non-empty dataset.
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
188-190: Broadexcept Exception→FileNotFoundErroris misleading.This swallows every failure inside
load_dataset/concatenate_datasets(schema issues, auth problems, corrupt Parquet, OOM, etc.) and re-raises asFileNotFoundError, which will mislead callers catching on type. Either narrow the catch to IO-related errors, or keep the broader catch but raise a more general error (e.g.RuntimeError) while preserving the original viafrom e.♻️ Proposed change
- except Exception as e: - msg = "Failed to load Parquet files." - raise FileNotFoundError(msg) from e + except (FileNotFoundError, OSError) as e: + raise FileNotFoundError("Failed to load Parquet files.") from eIf you do want to translate all errors uniformly, prefer
RuntimeErroroverFileNotFoundErrorso the exception type reflects reality.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 188 - 190, The current except block in load_dataset/concatenate_datasets catches Exception and re-raises FileNotFoundError which is misleading; change the handler to either narrow the catch to IO-related errors (e.g., replace "except Exception as e:" with "except (FileNotFoundError, OSError) as e:" and keep "raise FileNotFoundError(msg) from e") or, if you intend to translate all errors, raise a more general error type (replace the raise with "raise RuntimeError(msg) from e") so the exception semantics match reality; preserve the original exception via "from e" and keep the existing msg variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 212-226: Update the MetaTiledSlides.__init__ docstring to remove
the stale “Load slides and tiles from MLFlow artifacts” summary and explicitly
document that the constructor accepts local filesystem paths (`paths`), MLFlow
artifact URIs (`uris`), or a preloaded tuple (`slides_and_tiles`) similar to
SlidesTilesLoader; keep parameter descriptions for `paths`, `uris`,
`slides_and_tiles`, and `hf_kwargs` but adjust examples/phrasing to cover all
three input modes and mirror the wording used in SlidesTilesLoader for
consistency.
---
Outside diff comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 44-46: The code currently uses an assert to validate inputs
(checking paths or uris or slides_and_tiles) which can be stripped under -O;
replace that assert with an explicit input validation that raises a clear
exception (e.g., ValueError) when none of paths, uris, or slides_and_tiles are
provided, preserving the existing error message and ensuring downstream code
(like search_dirs concatenation in this module) fails early with a clear error;
update the check in the function/constructor where paths, uris, and
slides_and_tiles are validated (the same location as the current assert) to
raise the exception instead of asserting.
- Around line 51-55: In load_slides_and_tiles usage inside meta_tiled_slides.py,
avoid concatenating an empty HFDataset (created by load_slides_and_tiles(paths
or [], uris or [], ...)) with slides_and_tiles[0] because the empty dataset has
no feature schema; instead short-circuit or skip the concat: if both paths and
uris are empty, assign slides, tiles = slides_and_tiles directly (or if slides
returned from load_slides_and_tiles has no features/num_rows==0, skip
concatenating and use the non-empty slides_and_tiles entries). Modify the block
that currently calls concatenate_datasets([slides, slides_and_tiles[0]]) and
concatenate_datasets([tiles, slides_and_tiles[1]]) to first check for emptiness
(e.g., absence of features or zero rows) and only call concatenate_datasets when
both sides have schemas; otherwise set slides/tiles to the non-empty dataset.
---
Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 188-190: The current except block in
load_dataset/concatenate_datasets catches Exception and re-raises
FileNotFoundError which is misleading; change the handler to either narrow the
catch to IO-related errors (e.g., replace "except Exception as e:" with "except
(FileNotFoundError, OSError) as e:" and keep "raise FileNotFoundError(msg) from
e") or, if you intend to translate all errors, raise a more general error type
(replace the raise with "raise RuntimeError(msg) from e") so the exception
semantics match reality; preserve the original exception via "from e" and keep
the existing msg variable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a562d747-255e-4d0b-8d3f-d8ecf4f3f7af
📒 Files selected for processing (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
44-46:⚠️ Potential issue | 🟡 MinorReplace
assertwith explicitValueErrorfor input validation.Constructor input validation should not use
assert, as assertions can be disabled with Python's-Oflag, allowing invalid constructors to be created silently. Additionally, materializingpathsanduristo lists ensures empty generators are caught during validation rather than silently producing empty datasets.Proposed fix
+ paths = list(paths or []) + uris = list(uris or []) + - assert paths or uris or slides_and_tiles, ( - "At least one of paths, uris or slides_and_tiles must be provided." - ) + if not (paths or uris or slides_and_tiles): + raise ValueError( + "At least one of paths, uris or slides_and_tiles must be provided." + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 44 - 46, Replace the assert-based validation in the MetaTiledSlides constructor with explicit checks that materialize generators: convert paths and uris to lists (e.g., paths = list(paths) if paths is not None else None; same for uris) before validation, then if not (paths or uris or slides_and_tiles) raise ValueError("At least one of paths, uris or slides_and_tiles must be provided."); update any references in the class that used the original generators to use the new lists (look for uses of paths, uris, and slides_and_tiles in class methods) so empty generators are detected and proper exceptions are raised instead of relying on assert.
♻️ Duplicate comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
212-226:⚠️ Potential issue | 🟡 MinorUpdate the stale constructor docstring.
Line 212 still says this only loads from MLFlow artifacts, but the constructor accepts local paths and preloaded datasets too.
📝 Proposed wording
- """Load slides and tiles from MLFlow artifacts. + """Load slides and tiles from local paths, MLFlow URIs, or preloaded datasets.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 212 - 226, The constructor docstring for the class in meta_tiled_slides.py incorrectly states it only loads from MLFlow artifacts; update the __init__ docstring to reflect that it also accepts local filesystem paths and preloaded Datasets. Edit the docstring for the constructor (the method that accepts parameters paths, uris, slides_and_tiles, and hf_kwargs) to describe each parameter correctly: paths can be local directories containing slides.parquet/tiles.parquet or chunked subdirs, uris are MLFlow artifact URIs, slides_and_tiles can be a tuple of already-loaded Dataset objects, and hf_kwargs are passed to HuggingFace load_dataset (default {"path": "parquet", "split": "train"}). Mention that the loader supports any combination of these inputs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 18-19: SlidesTilesLoader is a pure metadata utility but lives in a
module that imports torch; extract SlidesTilesLoader into a new torch-free
module (e.g., create a new file containing the SlidesTilesLoader class and its
helpers) and remove any torch imports from that new file, update the original
module to import and re-export SlidesTilesLoader from the new module, and add an
import/export in rationai/mlkit/data/datasets/__init__.py to re-export
SlidesTilesLoader so existing callers keep the same import path; ensure no torch
symbols remain referenced in the new module and update any local imports to
point to the new module name.
- Around line 257-268: The example in the SlideTiles generator mixes mapping
access and attribute access—change the slide.id usage to use the mapping key
like the rest (e.g., replace filter_tiles_by_slide(slide.id) with
filter_tiles_by_slide(slide["id"])) so the example consistently treats slide as
a dict; update any other occurrences to use slide["..."] to avoid misleading
users copying the snippet for HuggingFace dataset rows.
---
Outside diff comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 44-46: Replace the assert-based validation in the MetaTiledSlides
constructor with explicit checks that materialize generators: convert paths and
uris to lists (e.g., paths = list(paths) if paths is not None else None; same
for uris) before validation, then if not (paths or uris or slides_and_tiles)
raise ValueError("At least one of paths, uris or slides_and_tiles must be
provided."); update any references in the class that used the original
generators to use the new lists (look for uses of paths, uris, and
slides_and_tiles in class methods) so empty generators are detected and proper
exceptions are raised instead of relying on assert.
---
Duplicate comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 212-226: The constructor docstring for the class in
meta_tiled_slides.py incorrectly states it only loads from MLFlow artifacts;
update the __init__ docstring to reflect that it also accepts local filesystem
paths and preloaded Datasets. Edit the docstring for the constructor (the method
that accepts parameters paths, uris, slides_and_tiles, and hf_kwargs) to
describe each parameter correctly: paths can be local directories containing
slides.parquet/tiles.parquet or chunked subdirs, uris are MLFlow artifact URIs,
slides_and_tiles can be a tuple of already-loaded Dataset objects, and hf_kwargs
are passed to HuggingFace load_dataset (default {"path": "parquet", "split":
"train"}). Mention that the loader supports any combination of these inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fd4ff924-9c6f-4ea3-a728-69c4a4ca7846
📒 Files selected for processing (2)
rationai/mlkit/data/datasets/__init__.pyrationai/mlkit/data/datasets/meta_tiled_slides.py
✅ Files skipped from review due to trivial changes (1)
- rationai/mlkit/data/datasets/init.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
rationai/mlkit/data/datasets/meta_tiled_slides.py (2)
34-48:⚠️ Potential issue | 🟡 MinorUpdate the stale constructor summary.
The constructor no longer loads only from MLflow artifacts; it also accepts local paths and preloaded datasets.
📝 Proposed wording
- """Load slides and tiles from MLFlow artifacts. + """Load slides and tiles from local paths, MLflow URIs, or preloaded datasets.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 34 - 48, The constructor docstring for MetaTiledSlides.__init__ is stale — it says it only loads from MLFlow artifacts but the implementation also accepts local paths and preloaded datasets (slides_and_tiles tuple); update the summary and Args so they describe all accepted inputs (paths, uris, and slides_and_tiles), clarify that both MLflow URIs and local directories are supported, and note that slides_and_tiles may be a preloaded (slides, tiles) Dataset tuple; ensure hf_kwargs default description remains accurate.
79-90:⚠️ Potential issue | 🟡 MinorFix the copied example’s slide-id access.
The example treats
slideas a mapping everywhere except Line 87, so users copying it for HuggingFace dataset rows will hit an attribute lookup error.📝 Proposed fix
- tiles=self.filter_tiles_by_slide(slide.id), + tiles=self.filter_tiles_by_slide(slide["id"]),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 79 - 90, The example generator mixes mapping and attribute access for slide (uses slide["path"] etc. but slide.id), causing an AttributeError; update the example in SlideTiles creation to consistently use mapping access (replace slide.id with slide["id"]) and ensure the call to filter_tiles_by_slide uses that mapping key (i.e., call self.filter_tiles_by_slide(slide["id"])) so the SlideTiles example and filter_tiles_by_slide invocation align with the other slide["..."] accesses.
🧹 Nitpick comments (1)
rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
59-73: Align the filter docstring with delegation.This method no longer owns
_slide_id_to_indices; the implementation delegates toSlidesTilesLoader.📝 Proposed wording
def filter_tiles_by_slide(self, slide_id: str | bytes) -> HFDataset: - """Returns a view of the dataset using a slice or indices. + """Return the tiles dataset filtered to a single slide. - This function creates a view of the `self.tiles` dataset that contains only - the tiles belonging to the specified slide. It uses the precomputed - `_slide_id_to_indices` mapping to efficiently retrieve the relevant tiles - without copying data. + Delegates to `SlidesTilesLoader`, which uses its precomputed slide-id + index to retrieve the relevant tile rows.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 59 - 73, The docstring for filter_tiles_by_slide incorrectly states the method uses a local _slide_id_to_indices mapping; update the wording to reflect that it delegates to SlidesTilesLoader via self._meta.filter_tiles_by_slide and does not manage indices itself. Edit the summary and description to mention delegation to SlidesTilesLoader/self._meta, remove references to owning or using _slide_id_to_indices, and keep args/returns the same for filter_tiles_by_slide.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/slides_tiles_loader.py`:
- Around line 39-50: Replace the assert-based validation with explicit input
checks: validate that at least one real data source exists by checking if paths
or uris contain any elements (handle generators by converting to list or using
any()), or slides_and_tiles is provided; set hf_kwargs default if None as done
now. Only call load_slides_and_tiles(paths_list, uris_list, hf_kwargs) when
paths_list or uris_list are non-empty to avoid creating empty/schema-less
datasets, and when slides_and_tiles is provided, conditionally concatenate using
concatenate_datasets only if the loaded slides/tiles are non-empty (or vice
versa) to avoid mixing with placeholder empty datasets; reference
load_slides_and_tiles, slides_and_tiles, and concatenate_datasets when making
these changes.
---
Duplicate comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 34-48: The constructor docstring for MetaTiledSlides.__init__ is
stale — it says it only loads from MLFlow artifacts but the implementation also
accepts local paths and preloaded datasets (slides_and_tiles tuple); update the
summary and Args so they describe all accepted inputs (paths, uris, and
slides_and_tiles), clarify that both MLflow URIs and local directories are
supported, and note that slides_and_tiles may be a preloaded (slides, tiles)
Dataset tuple; ensure hf_kwargs default description remains accurate.
- Around line 79-90: The example generator mixes mapping and attribute access
for slide (uses slide["path"] etc. but slide.id), causing an AttributeError;
update the example in SlideTiles creation to consistently use mapping access
(replace slide.id with slide["id"]) and ensure the call to filter_tiles_by_slide
uses that mapping key (i.e., call self.filter_tiles_by_slide(slide["id"])) so
the SlideTiles example and filter_tiles_by_slide invocation align with the other
slide["..."] accesses.
---
Nitpick comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 59-73: The docstring for filter_tiles_by_slide incorrectly states
the method uses a local _slide_id_to_indices mapping; update the wording to
reflect that it delegates to SlidesTilesLoader via
self._meta.filter_tiles_by_slide and does not manage indices itself. Edit the
summary and description to mention delegation to SlidesTilesLoader/self._meta,
remove references to owning or using _slide_id_to_indices, and keep args/returns
the same for filter_tiles_by_slide.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9364453a-ba29-4630-9e07-392fdf9b8d92
📒 Files selected for processing (3)
rationai/mlkit/data/datasets/__init__.pyrationai/mlkit/data/datasets/meta_tiled_slides.pyrationai/mlkit/data/datasets/slides_tiles_loader.py
🚧 Files skipped from review as they are similar to previous changes (1)
- rationai/mlkit/data/datasets/init.py
|
/review gemini |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the SlidesTilesLoader class to centralize the logic for loading and indexing slide and tile metadata from various sources, including local paths and MLFlow URIs. The MetaTiledSlides class has been refactored to utilize this new loader, simplifying its implementation. Feedback includes a bug report regarding potential schema mismatches when concatenating empty datasets, a recommendation to use ValueError instead of assert for input validation, and a suggestion to revert a change in exception type to maintain backward compatibility and specificity.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the data loading and indexing logic from MetaTiledSlides into a new standalone class, SlidesTilesLoader, to improve modularity and reusability. MetaTiledSlides now delegates these responsibilities to the new loader. The review feedback suggests replacing assert statements with ValueError for more robust input validation in production and reverting a change in exception type from RuntimeError back to FileNotFoundError to maintain backward compatibility and error specificity.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
rationai/mlkit/data/datasets/slides_tiles_loader.py (1)
39-60:⚠️ Potential issue | 🟠 MajorUse explicit validation before loading or concatenating datasets.
assertcan be disabled, empty generators are still truthy, and an emptypaths/urisiterator can append schema-less placeholder datasets before concatenating withslides_and_tiles.🐛 Proposed fix
- assert paths or uris or slides_and_tiles, ( - "At least one of paths, uris or slides_and_tiles must be provided." - ) + paths = list(paths or []) + uris = list(uris or []) + + if not paths and not uris and slides_and_tiles is None: + raise ValueError( + "At least one of paths, uris or slides_and_tiles must be provided." + ) if hf_kwargs is None: hf_kwargs = {"path": "parquet", "split": "train"} slides = [] tiles = [] if paths or uris: - s, t = self.load_slides_and_tiles(paths or [], uris or [], hf_kwargs) + s, t = self.load_slides_and_tiles(paths, uris, hf_kwargs) slides.append(s) tiles.append(t)Verification:
#!/bin/bash # Description: Verify Python behavior relevant to the validation bug. python - <<'PY' print("empty generator is truthy:", bool((x for x in []))) PY python -O - <<'PY' assert False, "this is skipped under -O" print("assert skipped under optimization") PY🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/slides_tiles_loader.py` around lines 39 - 60, Replace the fragile assert with explicit validation: verify that at least one of paths, uris, or slides_and_tiles is provided and non-empty (handle generators/iterators by converting to a list or checking with any()) before calling load_slides_and_tiles; only call load_slides_and_tiles and append returned datasets to slides/tiles when those inputs yield actual items, and only append slides_and_tiles[0]/[1] when slides_and_tiles is a valid non-empty pair; ensure self.slides and self.tiles are assigned from non-empty datasets (or raise a clear ValueError) before calling _build_tile_index to avoid concatenating empty or placeholder datasets.rationai/mlkit/data/datasets/meta_tiled_slides.py (1)
34-48:⚠️ Potential issue | 🟡 MinorUpdate the constructor docstring to match the supported sources.
The summary still says MLFlow artifacts only, but the constructor accepts local paths, MLFlow URIs, and preloaded
slides_and_tiles.📝 Proposed wording
- """Load slides and tiles from MLFlow artifacts. + """Load slides and tiles from local paths, MLFlow URIs, or preloaded datasets.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@rationai/mlkit/data/datasets/meta_tiled_slides.py` around lines 34 - 48, The constructor docstring currently says it loads from MLFlow artifacts only; update MetaTiledSlides.__init__ docstring to list all supported sources (local filesystem paths, MLFlow artifact URIs, or a preloaded slides_and_tiles tuple), clarify the expected formats for each (single files slides.parquet/tiles.parquet or chunked subdirectories slides/ and tiles/), and preserve/mention the hf_kwargs default (`{"path": "parquet", "split": "train"}`) and purpose; reference the constructor by name (MetaTiledSlides.__init__) when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 49-57: MetaTiledSlides lost its public method
load_slides_and_tiles when the loader was moved to SlidesTilesLoader; restore a
forwarding API by adding a method MetaTiledSlides.load_slides_and_tiles(self,
*args, **kwargs) that simply delegates to
self._meta.load_slides_and_tiles(*args, **kwargs), preserving the original
signature and docstring so callers of MetaTiledSlides continue to work; ensure
the method uses self._meta (the SlidesTilesLoader instance) and
raises/propagates the same exceptions as the underlying loader.
In `@rationai/mlkit/data/datasets/slides_tiles_loader.py`:
- Around line 152-191: Currently the MLflow downloads via ThreadPoolExecutor and
download_artifacts are executed before the try/except, so download failures
escape the RuntimeError wrapper; move the artifact download logic (the
ThreadPoolExecutor block that produces artifacts_paths and the construction of
search_dirs that depends on artifacts_paths) inside the try block before calling
resolve_search_path/load_dataset/concatenate_datasets so any exceptions from
download_artifacts are caught and re-raised as the documented RuntimeError; keep
resolve_search_path, concatenate_datasets, load_dataset and the final return
behavior unchanged and still raise RuntimeError(msg) from e on any exception.
---
Duplicate comments:
In `@rationai/mlkit/data/datasets/meta_tiled_slides.py`:
- Around line 34-48: The constructor docstring currently says it loads from
MLFlow artifacts only; update MetaTiledSlides.__init__ docstring to list all
supported sources (local filesystem paths, MLFlow artifact URIs, or a preloaded
slides_and_tiles tuple), clarify the expected formats for each (single files
slides.parquet/tiles.parquet or chunked subdirectories slides/ and tiles/), and
preserve/mention the hf_kwargs default (`{"path": "parquet", "split": "train"}`)
and purpose; reference the constructor by name (MetaTiledSlides.__init__) when
making the change.
In `@rationai/mlkit/data/datasets/slides_tiles_loader.py`:
- Around line 39-60: Replace the fragile assert with explicit validation: verify
that at least one of paths, uris, or slides_and_tiles is provided and non-empty
(handle generators/iterators by converting to a list or checking with any())
before calling load_slides_and_tiles; only call load_slides_and_tiles and append
returned datasets to slides/tiles when those inputs yield actual items, and only
append slides_and_tiles[0]/[1] when slides_and_tiles is a valid non-empty pair;
ensure self.slides and self.tiles are assigned from non-empty datasets (or raise
a clear ValueError) before calling _build_tile_index to avoid concatenating
empty or placeholder datasets.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 90c1aeec-0210-4d74-908a-387369574706
📒 Files selected for processing (2)
rationai/mlkit/data/datasets/meta_tiled_slides.pyrationai/mlkit/data/datasets/slides_tiles_loader.py
Separates metadata loading/indexing logic (no torch dependency) into SlidesTilesLoader; MetaTiledSlides delegates to it while keeping the same public interface.
Summary by CodeRabbit