Skip to content

Feature/mlflow dataset#6

Open
JakubPekar wants to merge 7 commits into
masterfrom
feature/mlflow-dataset
Open

Feature/mlflow dataset#6
JakubPekar wants to merge 7 commits into
masterfrom
feature/mlflow-dataset

Conversation

@JakubPekar

@JakubPekar JakubPekar commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added MLflow ParquetDataset support with lazy Parquet wrapping, dataset digest/profile metadata, and schema inference, plus a from_parquet(...) helper.
    • Exposed new MLflow Parquet dataset helpers as part of the public API.
  • Bug Fixes

    • Improved stratified batch sampling to accept a single or multiple stratification columns reliably.
  • Improvements

    • Enhanced metric collection typing/casting for more consistent updates and aggregation.
    • Kept behavior unchanged while refining related sampling/logging/sharding/CLI argument handling documentation and structure.

@JakubPekar JakubPekar requested review from a team and Copilot April 21, 2026 13:31
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 066fe671-88e7-4b7d-9e94-92bbf819369b

📥 Commits

Reviewing files that changed from the base of the PR and between 573d7b4 and c9b00ee.

📒 Files selected for processing (2)
  • rationai/mlkit/data/samplers/stratified_batch_sampler.py
  • rationai/mlkit/mlflow/parquet_dataset.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • rationai/mlkit/data/samplers/stratified_batch_sampler.py
  • rationai/mlkit/mlflow/parquet_dataset.py

📝 Walkthrough

Walkthrough

Adds a new ParquetDataset MLflow integration with lazy PyArrow-backed schema, profile, digest, and serialization support, plus a from_parquet factory and exports. The rest of the patch updates metric typing and suppressions, adjusts nested class-name output, and makes small comment and variable cleanup changes.

Changes

MLflow ParquetDataset

Layer / File(s) Summary
ParquetDataset class: digest, profile, schema, serialization
rationai/mlkit/mlflow/parquet_dataset.py
Introduces ParquetDataset(Dataset) with lazy PyArrow construction, schema- and path-based digest computation, cached Parquet profile metadata, MLflow schema inference with fallback, and to_dict() serialization.
from_parquet factory and exports
rationai/mlkit/mlflow/parquet_dataset.py, rationai/mlkit/mlflow/__init__.py
Adds from_parquet(...) source resolution from DatasetSource, string input, or MLflow context tags, and exports ParquetDataset and from_parquet from the package namespace.

Type Annotations, Suppression Migrations, and Code Cleanup

Layer / File(s) Summary
Metrics typing and suppression updates
rationai/mlkit/metrics/aggregated_metric_collection.py, rationai/mlkit/metrics/aggregators.py, rationai/mlkit/metrics/lazy_metric_dict.py, rationai/mlkit/metrics/nested_metric_collection.py
Adds explicit Tensor attribute annotations, wraps metric values and collection factories in cast(), changes override suppressions, types defaultdict state, and updates NestedMetricCollection to emit string class_names.
Data and utility comment cleanup
rationai/mlkit/data/datasets/slides_tiles_loader.py, rationai/mlkit/data/shard_parquet.py, rationai/mlkit/data/samplers/stratified_batch_sampler.py, rationai/mlkit/lightning/loggers/mlflow.py, rationai/mlkit/with_cli_args.py
Rewords inline comments in _build_tile_index, removes section-header comments from shard_parquet, renames the shuffled batch list variable, widens stratify_by typing, updates the MLflow logger comment, and refactors sys.argv handling in the CLI wrapper.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • RationAI/mlkit#4: Introduces the same shard_parquet area that this PR only rewords in-place.

Suggested reviewers

  • matejpekar
  • vejtek

🐇 I hopped through Parquet’s schema lane,
With digests, profiles, and exports in train.
My type casts are tidy, my comments are swept,
And strings in the metrics now neatly են? kept.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is related to the main change: adding an MLflow dataset implementation and exports.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/mlflow-dataset

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the ParquetDataset class and a from_parquet factory function to support lazy-loaded Parquet datasets within MLflow. The implementation includes metadata-based digest computation, schema inference, and dataset profiling. Feedback was provided to correct a docstring typo, ensure the target column is included in both the dataset digest and the configuration dictionary for accurate tracking, remove a redundant property override, and cache the profile property to optimize performance by avoiding repeated I/O.

Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated
Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an MLflow Dataset implementation for Parquet sources (via PyArrow) so Parquet training/eval inputs can be logged to MLflow with inferred schema and lightweight metadata profiling.

Changes:

  • Introduce ParquetDataset (lazy-loaded PyArrow dataset) plus from_parquet() factory for MLflow input logging.
  • Export the new dataset utilities from rationai.mlkit.mlflow.
  • Bump locked package version to 0.4.1.

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 6 comments.

File Description
uv.lock Updates the locked project version to 0.4.1.
rationai/mlkit/mlflow/parquet_dataset.py Implements ParquetDataset, digest computation, metadata profiling, and schema inference for Parquet datasets.
rationai/mlkit/mlflow/__init__.py Exposes ParquetDataset and from_parquet as public mlflow utilities.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated
Comment thread rationai/mlkit/mlflow/parquet_dataset.py
Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated
Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rationai/mlkit/mlflow/parquet_dataset.py`:
- Line 28: The constructor docstring in the class containing the __init__ method
still uses placeholder text (“Hety.”), so replace it with a concise, meaningful
one-line summary that describes what the constructor initializes. Update the
docstring near the __init__ definition in parquet_dataset.py so it accurately
reflects the purpose of the Parquet dataset constructor and matches the
surrounding class documentation style.
- Line 48: The md5 digest used for content-addressing in parquet_dataset should
be marked non-security so it works on FIPS-enabled interpreters. Update the
hashlib.md5() call in the digest setup to use usedforsecurity=False, keeping the
behavior the same while making the hash explicitly non-cryptographic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 95fca85c-a42c-4095-bb59-1df26c29ad9e

📥 Commits

Reviewing files that changed from the base of the PR and between 51d1b52 and 573d7b4.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • rationai/mlkit/data/datasets/slides_tiles_loader.py
  • rationai/mlkit/data/samplers/stratified_batch_sampler.py
  • rationai/mlkit/data/shard_parquet.py
  • rationai/mlkit/lightning/loggers/mlflow.py
  • rationai/mlkit/metrics/aggregated_metric_collection.py
  • rationai/mlkit/metrics/aggregators.py
  • rationai/mlkit/metrics/lazy_metric_dict.py
  • rationai/mlkit/metrics/nested_metric_collection.py
  • rationai/mlkit/mlflow/__init__.py
  • rationai/mlkit/mlflow/parquet_dataset.py
  • rationai/mlkit/with_cli_args.py
💤 Files with no reviewable changes (1)
  • rationai/mlkit/data/shard_parquet.py

Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated
Comment thread rationai/mlkit/mlflow/parquet_dataset.py Outdated
@JakubPekar

Copy link
Copy Markdown
Collaborator Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new lazy-loaded ParquetDataset for MLflow tracking, refactors various modules to improve type safety and clean up comments, and bumps the package version to 0.4.1. Key feedback includes addressing a critical bug in shard_parquet.py where shard indices and row counts are not updated when a shard reaches capacity, optimizing schema inference in parquet_dataset.py to avoid unnecessary I/O, and correcting the type hint for **kwargs in stratified_batch_sampler.py.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 61 to 63
if current_shard_rows >= rows_per_shard:
writer.close()
writer = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical bug here: when a shard reaches its capacity (rows_per_shard), the writer is closed and set to None, but shard_idx is never incremented and current_shard_rows is never reset to 0.

This causes two major issues:

  1. The next batch will overwrite the same file (shard_00000.parquet) because shard_idx remains 0.
  2. current_shard_rows will keep accumulating and stay above rows_per_shard, causing a new file to be opened and closed on every single subsequent iteration.

We must increment shard_idx and reset current_shard_rows to 0 when closing the writer.

Suggested change
if current_shard_rows >= rows_per_shard:
writer.close()
writer = None
if current_shard_rows >= rows_per_shard:
writer.close()
writer = None
shard_idx += 1
current_shard_rows = 0

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is not true, the code is there already

from functools import cached_property
from typing import Any

import pyarrow.dataset as ds

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Import pyarrow directly to allow creating empty tables for schema inference without calling head(0).

Suggested change
import pyarrow.dataset as ds
import pyarrow as pa
import pyarrow.dataset as ds

Comment on lines +108 to +113
# Fetch an empty Pandas dataframe from the schema to utilize MLflow's built-in
# inference securely and correctly map PyArrow types to MLflow types.
empty_df = self._ds.head(0).to_pandas()
inferred_schema = _infer_schema(empty_df)
return inferred_schema
except Exception as e:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of calling self._ds.head(0).to_pandas(), which can trigger I/O and requires PyArrow >= 12.0.0, you can construct an empty pyarrow.Table directly from the already-loaded self._ds.schema and convert that to a Pandas DataFrame. This is a zero-I/O, highly compatible, and fast operation.

Suggested change
# Fetch an empty Pandas dataframe from the schema to utilize MLflow's built-in
# inference securely and correctly map PyArrow types to MLflow types.
empty_df = self._ds.head(0).to_pandas()
inferred_schema = _infer_schema(empty_df)
return inferred_schema
except Exception as e:
try:
# Create an empty PyArrow Table from the schema and convert to Pandas
# to avoid calling head(0) which may trigger I/O or require newer PyArrow versions.
empty_df = pa.Table.from_batches([], schema=self._ds.schema).to_pandas()
inferred_schema = _infer_schema(empty_df)
return inferred_schema

Comment on lines 94 to 100
def __init__(
self,
data: pd.DataFrame,
stratify_by: None,
stratify_by: str | list[str],
batch_size: int,
**kwargs: dict[str, Any],
) -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Python type hinting, **kwargs: T specifies that all keyword arguments passed must be of type T. Specifying **kwargs: dict[str, Any] means each individual keyword argument must be a dictionary, which is likely not the intention here. It should be typed as **kwargs: Any.

Suggested change
def __init__(
self,
data: pd.DataFrame,
stratify_by: None,
stratify_by: str | list[str],
batch_size: int,
**kwargs: dict[str, Any],
) -> None:
def __init__(
self,
data: pd.DataFrame,
stratify_by: str | list[str],
batch_size: int,
**kwargs: Any,
) -> None:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants