Feature/mlflow dataset by JakubPekar · Pull Request #6 · RationAI/mlkit

JakubPekar · 2026-04-21T13:31:53Z

Summary by CodeRabbit

New Features
- Added MLflow ParquetDataset support with lazy Parquet wrapping, dataset digest/profile metadata, and schema inference, plus a from_parquet(...) helper.
- Exposed new MLflow Parquet dataset helpers as part of the public API.
Bug Fixes
- Improved stratified batch sampling to accept a single or multiple stratification columns reliably.
Improvements
- Enhanced metric collection typing/casting for more consistent updates and aggregation.
- Kept behavior unchanged while refining related sampling/logging/sharding/CLI argument handling documentation and structure.

coderabbitai · 2026-04-21T13:32:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 066fe671-88e7-4b7d-9e94-92bbf819369b

📥 Commits

Reviewing files that changed from the base of the PR and between 573d7b4 and c9b00ee.

📒 Files selected for processing (2)

rationai/mlkit/data/samplers/stratified_batch_sampler.py
rationai/mlkit/mlflow/parquet_dataset.py

🚧 Files skipped from review as they are similar to previous changes (2)

rationai/mlkit/data/samplers/stratified_batch_sampler.py
rationai/mlkit/mlflow/parquet_dataset.py

📝 Walkthrough

Walkthrough

Adds a new ParquetDataset MLflow integration with lazy PyArrow-backed schema, profile, digest, and serialization support, plus a from_parquet factory and exports. The rest of the patch updates metric typing and suppressions, adjusts nested class-name output, and makes small comment and variable cleanup changes.

Changes

MLflow ParquetDataset

Layer / File(s)	Summary
ParquetDataset class: digest, profile, schema, serialization `rationai/mlkit/mlflow/parquet_dataset.py`	Introduces `ParquetDataset(Dataset)` with lazy PyArrow construction, schema- and path-based digest computation, cached Parquet profile metadata, MLflow schema inference with fallback, and `to_dict()` serialization.
from_parquet factory and exports `rationai/mlkit/mlflow/parquet_dataset.py`, `rationai/mlkit/mlflow/__init__.py`	Adds `from_parquet(...)` source resolution from `DatasetSource`, string input, or MLflow context tags, and exports `ParquetDataset` and `from_parquet` from the package namespace.

Type Annotations, Suppression Migrations, and Code Cleanup

Layer / File(s)	Summary
Metrics typing and suppression updates `rationai/mlkit/metrics/aggregated_metric_collection.py`, `rationai/mlkit/metrics/aggregators.py`, `rationai/mlkit/metrics/lazy_metric_dict.py`, `rationai/mlkit/metrics/nested_metric_collection.py`	Adds explicit `Tensor` attribute annotations, wraps metric values and collection factories in `cast()`, changes override suppressions, types `defaultdict` state, and updates `NestedMetricCollection` to emit string `class_names`.
Data and utility comment cleanup `rationai/mlkit/data/datasets/slides_tiles_loader.py`, `rationai/mlkit/data/shard_parquet.py`, `rationai/mlkit/data/samplers/stratified_batch_sampler.py`, `rationai/mlkit/lightning/loggers/mlflow.py`, `rationai/mlkit/with_cli_args.py`	Rewords inline comments in `_build_tile_index`, removes section-header comments from `shard_parquet`, renames the shuffled batch list variable, widens `stratify_by` typing, updates the MLflow logger comment, and refactors `sys.argv` handling in the CLI wrapper.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

RationAI/mlkit#4: Introduces the same shard_parquet area that this PR only rewords in-place.

Suggested reviewers

matejpekar
vejtek

🐇 I hopped through Parquet’s schema lane,
With digests, profiles, and exports in train.
My type casts are tidy, my comments are swept,
And strings in the metrics now neatly են? kept.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 55.17% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is related to the main change: adding an MLflow dataset implementation and exports.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feature/mlflow-dataset

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces the ParquetDataset class and a from_parquet factory function to support lazy-loaded Parquet datasets within MLflow. The implementation includes metadata-based digest computation, schema inference, and dataset profiling. Feedback was provided to correct a docstring typo, ensure the target column is included in both the dataset digest and the configuration dictionary for accurate tracking, remove a redundant property override, and cache the profile property to optimize performance by avoiding repeated I/O.

Copilot

Pull request overview

Adds an MLflow Dataset implementation for Parquet sources (via PyArrow) so Parquet training/eval inputs can be logged to MLflow with inferred schema and lightweight metadata profiling.

Changes:

Introduce ParquetDataset (lazy-loaded PyArrow dataset) plus from_parquet() factory for MLflow input logging.
Export the new dataset utilities from rationai.mlkit.mlflow.
Bump locked package version to 0.4.1.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

File	Description
`uv.lock`	Updates the locked project version to `0.4.1`.
`rationai/mlkit/mlflow/parquet_dataset.py`	Implements `ParquetDataset`, digest computation, metadata profiling, and schema inference for Parquet datasets.
`rationai/mlkit/mlflow/__init__.py`	Exposes `ParquetDataset` and `from_parquet` as public mlflow utilities.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rationai/mlkit/mlflow/parquet_dataset.py`:
- Line 28: The constructor docstring in the class containing the __init__ method
still uses placeholder text (“Hety.”), so replace it with a concise, meaningful
one-line summary that describes what the constructor initializes. Update the
docstring near the __init__ definition in parquet_dataset.py so it accurately
reflects the purpose of the Parquet dataset constructor and matches the
surrounding class documentation style.
- Line 48: The md5 digest used for content-addressing in parquet_dataset should
be marked non-security so it works on FIPS-enabled interpreters. Update the
hashlib.md5() call in the digest setup to use usedforsecurity=False, keeping the
behavior the same while making the hash explicitly non-cryptographic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95fca85c-a42c-4095-bb59-1df26c29ad9e

📥 Commits

Reviewing files that changed from the base of the PR and between 51d1b52 and 573d7b4.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

rationai/mlkit/data/datasets/slides_tiles_loader.py
rationai/mlkit/data/samplers/stratified_batch_sampler.py
rationai/mlkit/data/shard_parquet.py
rationai/mlkit/lightning/loggers/mlflow.py
rationai/mlkit/metrics/aggregated_metric_collection.py
rationai/mlkit/metrics/aggregators.py
rationai/mlkit/metrics/lazy_metric_dict.py
rationai/mlkit/metrics/nested_metric_collection.py
rationai/mlkit/mlflow/__init__.py
rationai/mlkit/mlflow/parquet_dataset.py
rationai/mlkit/with_cli_args.py

💤 Files with no reviewable changes (1)

rationai/mlkit/data/shard_parquet.py

JakubPekar · 2026-06-24T08:52:06Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a new lazy-loaded ParquetDataset for MLflow tracking, refactors various modules to improve type safety and clean up comments, and bumps the package version to 0.4.1. Key feedback includes addressing a critical bug in shard_parquet.py where shard indices and row counts are not updated when a shard reaches capacity, optimizing schema inference in parquet_dataset.py to avoid unnecessary I/O, and correcting the type hint for **kwargs in stratified_batch_sampler.py.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-24T08:54:23Z

                if current_shard_rows >= rows_per_shard:
                    writer.close()
                    writer = None


There is a critical bug here: when a shard reaches its capacity (rows_per_shard), the writer is closed and set to None, but shard_idx is never incremented and current_shard_rows is never reset to 0.

This causes two major issues:

The next batch will overwrite the same file (shard_00000.parquet) because shard_idx remains 0.

current_shard_rows will keep accumulating and stay above rows_per_shard, causing a new file to be opened and closed on every single subsequent iteration.

We must increment shard_idx and reset current_shard_rows to 0 when closing the writer.

Suggested change

if current_shard_rows >= rows_per_shard:

writer.close()

writer = None

if current_shard_rows >= rows_per_shard:

writer.close()

writer = None

shard_idx += 1

current_shard_rows = 0

that is not true, the code is there already

gemini-code-assist · 2026-06-24T08:54:23Z

+from functools import cached_property
+from typing import Any
+
+import pyarrow.dataset as ds


Import pyarrow directly to allow creating empty tables for schema inference without calling head(0).

Suggested change

import pyarrow.dataset as ds

import pyarrow as pa

import pyarrow.dataset as ds

gemini-code-assist · 2026-06-24T08:54:23Z

+            # Fetch an empty Pandas dataframe from the schema to utilize MLflow's built-in
+            # inference securely and correctly map PyArrow types to MLflow types.
+            empty_df = self._ds.head(0).to_pandas()
+            inferred_schema = _infer_schema(empty_df)
+            return inferred_schema
+        except Exception as e:


Instead of calling self._ds.head(0).to_pandas(), which can trigger I/O and requires PyArrow >= 12.0.0, you can construct an empty pyarrow.Table directly from the already-loaded self._ds.schema and convert that to a Pandas DataFrame. This is a zero-I/O, highly compatible, and fast operation.

Suggested change

# Fetch an empty Pandas dataframe from the schema to utilize MLflow's built-in

# inference securely and correctly map PyArrow types to MLflow types.

empty_df = self._ds.head(0).to_pandas()

inferred_schema = _infer_schema(empty_df)

return inferred_schema

except Exception as e:

try:

# Create an empty PyArrow Table from the schema and convert to Pandas

# to avoid calling head(0) which may trigger I/O or require newer PyArrow versions.

empty_df = pa.Table.from_batches([], schema=self._ds.schema).to_pandas()

inferred_schema = _infer_schema(empty_df)

return inferred_schema

gemini-code-assist · 2026-06-24T08:54:23Z

    def __init__(
        self,
        data: pd.DataFrame,
-        stratify_by: None,
+        stratify_by: str | list[str],
        batch_size: int,
        **kwargs: dict[str, Any],
    ) -> None:


In Python type hinting, **kwargs: T specifies that all keyword arguments passed must be of type T. Specifying **kwargs: dict[str, Any] means each individual keyword argument must be a dictionary, which is likely not the intention here. It should be typed as **kwargs: Any.

Suggested change

def __init__(

self,

data: pd.DataFrame,

stratify_by: None,

stratify_by: str | list[str],

batch_size: int,

**kwargs: dict[str, Any],

) -> None:

def __init__(

self,

data: pd.DataFrame,

stratify_by: str | list[str],

batch_size: int,

**kwargs: Any,

) -> None:

JakubPekar added 2 commits April 21, 2026 12:53

feat: mlflow datset

58ea683

feat: load parquet

6cb5472

JakubPekar requested review from a team and Copilot April 21, 2026 13:31

Copilot started reviewing on behalf of JakubPekar April 21, 2026 13:32 View session

feat: lint

8411863

gemini-code-assist Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated

Comment thread rationai/mlkit/mlflow/parquet_dataset.py

Comment thread rationai/mlkit/mlflow/parquet_dataset.py

Comment thread rationai/mlkit/mlflow/parquet_dataset.py

Comment thread rationai/mlkit/mlflow/parquet_dataset.py

Copilot AI reviewed Apr 21, 2026

View reviewed changes

feat: fix mypy

573d7b4

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated

Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated

JakubPekar added 2 commits June 24, 2026 10:49

feat: docs

c5da4b8

feat: CR fix

689cd6c

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

fix: CR

c9b00ee

	import pyarrow.dataset as ds
	import pyarrow as pa
	import pyarrow.dataset as ds

Uh oh!

Conversation

JakubPekar commented Apr 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JakubPekar commented Jun 24, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

JakubPekar Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JakubPekar commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading