rootml-bridge

A Python toolkit for bridging ROOT and machine learning workflows in high-energy physics. Export ROOT files to Parquet, train ML models, and attach predictions back to ROOT files.

Features

Export ROOT → Parquet: Convert ROOT TTree data to Parquet format with metadata preservation
Train ML Models: Built-in support for XGBoost with configurable training pipelines
Attach Predictions: Add ML scores back to ROOT files as new branches
Chunked Processing: Memory-efficient handling of large datasets
Provenance Tracking: Automatic metadata capture (git commit, timestamps, config)

Installation

# Clone the repository
git clone https://github.com/zagraywolf/rootml-bridge.git
cd rootml-bridge

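# Install dependencies (preferably inside a virtual environment;
# --break-system-packages overrides pip's PEP 668 protection of the system Python)
pip install --break-system-packages \
    ROOT \
    pandas \
    pyarrow \
    xgboost \
    scikit-learn \
    pyyaml \
    typer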

Quick Start

1. Generate Synthetic Data (Optional)

python examples/make_synthetic_root.py

This creates synthetic.root with 50,000 events containing:

Features: x1, x2, x3
Label: label (binary classification)
Event identifiers: run, lumi, event
Event weights: weight

2. Export ROOT to Parquet

python -m rootml.cli.main export \
    --config configs/export.yaml \
    --out data.parquet

Export config (configs/export.yaml):

input_files:
  - synthetic.root

tree: Events

features:
  - x1
  - x2
  - x3

label: label
weight: weight

event_id:
  - run
  - lumi
  - event

selection: null  # Optional ROOT selection string
chunk_size: 10000  # Rows per chunk

3. Train ML Model

python -m rootml.cli.main train \
    --data data.parquet \
    --config configs/train.yaml \
    --out outputs/train_run_1

Training config (configs/train.yaml):

model: xgboost

target: label
weight: weight

features:
  - x1
  - x2
  - x3

test_size: 0.2
val_size: 0.1
seed: 42

xgb_params:
  max_depth: 4
  n_estimators: 200
  learning_rate: 0.1
  subsample: 0.8

Outputs:

model.json: Trained XGBoost model
metrics.json: Test AUC and other metrics
scores.parquet: Full dataset with ML predictions

4. Attach Scores to ROOT File

python -m rootml.cli.main attach \
    --input-root synthetic.root \
    --tree Events \
    --scores outputs/train_run_1/scores.parquet \
    --out synthetic_with_scores.root

This creates a new ROOT file with an additional ml_score branch.
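
Conceptually, attachment is a lookup from each tree entry's event identifiers to a score. The sketch below illustrates that idea with plain Python containers. `align_scores` is a hypothetical helper, not part of the rootml API, and the use of (run, lumi, event) tuples as keys is an assumption made for illustration.

```python
def align_scores(tree_ids, scored):
    """Return scores ordered like the tree entries.

    tree_ids: list of (run, lumi, event) tuples in tree order.
    scored:   dict mapping (run, lumi, event) -> score.
    Raises KeyError if any tree entry has no score.
    """
    missing = [key for key in tree_ids if key not in scored]
    if missing:
        raise KeyError(f"{len(missing)} events have no score, e.g. {missing[0]}")
    return [scored[key] for key in tree_ids]


# Scores arrive in arbitrary order; the result follows tree order.
tree_ids = [(1, 1, 10), (1, 1, 11), (1, 2, 12)]
scored = {(1, 2, 12): 0.9, (1, 1, 10): 0.1, (1, 1, 11): 0.5}
ml_score = align_scores(tree_ids, scored)
```

Failing loudly on a missing event, rather than filling in a default score, keeps a mismatched scores file from silently producing a misleading branch.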

Workflow Overview

ROOT File (TTree)
    ↓
    └─ Export (chunked) → Parquet
                            ↓
                            └─ Train ML → Model + Scores
                                            ↓
                                            └─ Attach → ROOT File + ML Branch

Configuration

Export Configuration

Field	Type	Description	Required
`input_files`	List[str]	ROOT files to process	Yes
`tree`	str	TTree name	Yes
`features`	List[str]	Feature columns	Yes
`label`	str	Target variable	Yes
`event_id`	List[str]	Event identifier columns	Yes
`weight`	str	Event weight column	No
`selection`	str	ROOT selection string	No
`chunk_size`	int	Rows per chunk (default: 100k)	No
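
The table above can be read as a small validation contract. The following is a minimal sketch of such a check, assuming the config has already been loaded into a dict (for example with pyyaml). `validate_export_config` is a hypothetical name and is not taken from rootml/config.py; the default chunk size of 100,000 comes from the table.

```python
# Required fields and their expected Python types, per the table above.
REQUIRED_FIELDS = {
    "input_files": list,
    "tree": str,
    "features": list,
    "label": str,
    "event_id": list,
}
OPTIONAL_DEFAULTS = {"weight": None, "selection": None, "chunk_size": 100_000}


def validate_export_config(cfg):
    """Check required fields and fill defaults for optional ones."""
    for name, expected in REQUIRED_FIELDS.items():
        if name not in cfg:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(cfg[name], expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
    merged = {**OPTIONAL_DEFAULTS, **cfg}
    # A YAML `null` loads as None; fall back to the default chunk size.
    if merged["chunk_size"] is None:
        merged["chunk_size"] = OPTIONAL_DEFAULTS["chunk_size"]
    if merged["chunk_size"] <= 0:
        raise ValueError("chunk_size must be positive")
    return merged


cfg = validate_export_config({
    "input_files": ["synthetic.root"],
    "tree": "Events",
    "features": ["x1", "x2", "x3"],
    "label": "label",
    "event_id": ["run", "lumi", "event"],
})
```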

Training Configuration

Field	Type	Description
`model`	str	Model type (currently: `xgboost`)
`target`	str	Target variable name
`weight`	str	Weight column name
`features`	List[str]	Feature columns
`test_size`	float	Test set fraction
`val_size`	float	Validation set fraction
`seed`	int	Random seed
`xgb_params`	dict	XGBoost hyperparameters
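
To make the interaction of test_size, val_size, and seed concrete, here is a minimal sketch of a reproducible three-way split using only the standard library. It assumes both fractions refer to the full dataset; whether rootml instead takes val_size from the remaining training data is not stated here. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, test_size, val_size, seed):
    """Shuffle row indices and cut them into train/val/test lists."""
    if not 0 <= test_size + val_size < 1:
        raise ValueError("test_size + val_size must be in [0, 1)")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


# Values from the example training config, applied to 50,000 rows.
train, val, test = split_indices(50_000, test_size=0.2, val_size=0.1, seed=42)
```

Under that assumption, the example config leaves 70% of the rows for training, 10% for validation, and 20% for testing.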

Advanced Usage

Manual Attachment (Python API)

from rootml.attach import attach_scores

attach_scores(
    input_root="input.root",
    tree_name="Events",
    scores_path="scores.parquet",
    event_col="event",
    score_col="ml_score",
    output_root="output.root"
)

Custom Selection Filters

Apply ROOT selection strings during export:

selection: "pt > 30 && abs(eta) < 2.4"

Metadata Access

The exported Parquet file contains provenance metadata:

import pyarrow.parquet as pq

# File-level key-value metadata is a dict with bytes keys and bytes values
file_metadata = pq.read_metadata('data.parquet')
custom_metadata = file_metadata.metadata

print(custom_metadata[b'rootml_export_time'].decode())
print(custom_metadata[b'git_commit'].decode())
print(custom_metadata[b'config'].decode())
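
For orientation, the sketch below shows one way a dict with these three keys could be assembled before being written to a Parquet file. It is not rootml's implementation: `build_provenance` is a hypothetical helper, and serializing the config as JSON and using a UTC ISO 8601 timestamp are assumptions.

```python
import json
import subprocess
from datetime import datetime, timezone


def build_provenance(config):
    """Collect provenance as a bytes -> bytes dict, as Parquet metadata expects."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not a git checkout, or git is not installed
    return {
        b"rootml_export_time": datetime.now(timezone.utc).isoformat().encode(),
        b"git_commit": commit.encode(),
        b"config": json.dumps(config, sort_keys=True).encode(),
    }


meta = build_provenance({"tree": "Events", "chunk_size": 10000})
```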

Architecture

Module Structure

rootml/
├── __init__.py
├── attach.py          # Attach scores to ROOT files
├── config.py          # Configuration loading/validation
├── export.py          # ROOT → Parquet export
├── cli/
│   └── main.py        # Command-line interface
└── train/
    ├── __init__.py
    ├── base.py        # BaseTrainer abstract class
    ├── run.py         # Training dispatcher
    └── xgb_trainer.py # XGBoost implementation

Key Design Choices

Chunked Processing: Uses RDataFrame's entry index (rdfentry_) to process large files in chunks without loading everything into memory
Implicit Multi-Threading Disabled: ROOT.ROOT.DisableImplicitMT() is called to ensure deterministic chunking behavior
Event ID Preservation: Maintains run/lumi/event identifiers throughout the pipeline for accurate score attachment
Flattened Arrays: ROOT-exported columns (stored as 1-element arrays) are automatically flattened during training
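
The chunking idea can be sketched without ROOT: split the entry range into half-open intervals and turn each into a selection on rdfentry_. The helper names below are hypothetical, and the exact filter string rootml builds is an assumption.

```python
def chunk_ranges(n_entries, chunk_size):
    """Yield half-open (start, stop) entry ranges covering [0, n_entries)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_entries, chunk_size):
        yield start, min(start + chunk_size, n_entries)


def chunk_filter(start, stop):
    """Build a selection string on the entry index for one chunk."""
    return f"rdfentry_ >= {start} && rdfentry_ < {stop}"


# 25 entries in chunks of 10: the last chunk is shorter.
ranges = list(chunk_ranges(25, 10))
filters = [chunk_filter(start, stop) for start, stop in ranges]
```

Half-open ranges mean that adjacent chunks never overlap and no entry is skipped, which matters when the chunks are later concatenated into one Parquet file.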

Common Issues

Array Flattening

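The export step writes numeric columns as 1-element arrays, which show up as object-dtype columns in pandas. The training pipeline flattens these automatically:

# Automatic flattening in xgb_trainer.py
# Caveat: str values also have __len__, so a string column would be cut to its first character
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].apply(lambda x: x[0] if hasattr(x, "__len__") else x)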
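
If a dataset also contains string columns or empty arrays, a stricter unwrapping rule may be useful. The following is a sketch of one, not the code used in xgb_trainer.py; `flatten_scalar` is a hypothetical helper that could be passed to `df[col].apply`.

```python
def flatten_scalar(value):
    """Unwrap a 1-element sequence; leave everything else untouched."""
    if isinstance(value, (str, bytes, dict)):
        return value  # these have __len__ but are not array-like here
    if hasattr(value, "__len__") and len(value) == 1:
        return value[0]
    return value  # scalars, empty sequences, and longer sequences pass through


examples = [flatten_scalar(v) for v in ([1.5], (7,), "abc", 3, [], [1, 2])]
```

Because the check relies on `__len__` and indexing rather than a specific type, it also works on NumPy arrays without importing NumPy.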

Memory Management

For very large files (>100M events), lower chunk_size in the export config to reduce peak memory:

chunk_size: 50000  # Smaller chunks = less memory

Missing Event IDs

If attachment fails with missing events, verify that:

Event IDs are unique in the ROOT file
The same events exist in both the ROOT file and scores Parquet
Event ID column names match between config and data
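
The first two checks can be automated once the identifiers from both files are available as lists of tuples. The sketch below assumes (run, lumi, event) tuples and uses a hypothetical helper, `diagnose_event_ids`; reading the identifiers out of the ROOT and Parquet files is left out.

```python
from collections import Counter


def diagnose_event_ids(root_ids, score_ids):
    """Compare two lists of (run, lumi, event) tuples."""
    counts = Counter(root_ids)
    root_set, score_set = set(root_ids), set(score_ids)
    return {
        "duplicates_in_root": sorted(k for k, n in counts.items() if n > 1),
        "missing_in_scores": sorted(root_set - score_set),
        "missing_in_root": sorted(score_set - root_set),
    }


report = diagnose_event_ids(
    root_ids=[(1, 1, 10), (1, 1, 11), (1, 1, 11), (1, 2, 12)],
    score_ids=[(1, 1, 10), (1, 1, 11), (2, 1, 99)],
)
```

An empty list under every key means the identifiers are unique and the two files cover the same events.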

Extending the Framework

Adding a New Model Type

Create a new trainer in rootml/train/:

from rootml.train.base import BaseTrainer

class MyTrainer(BaseTrainer):
    def train(self, data_path, config, out_dir):
        # Your training logic
        pass

Register in rootml/train/run.py:

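# Module paths below are assumed from the module structure; adjust to your layout
from rootml.train.xgb_trainer import XGBTrainer
from rootml.train.my_trainer import MyTrainer  # hypothetical file for the new trainer

def run_training(data_path, config, out_dir):
    model_type = config["model"]

    if model_type == "xgboost":
        trainer = XGBTrainer()
    elif model_type == "mymodel":
        trainer = MyTrainer()
    else:
        raise ValueError(f"Unknown model type: {model_type}")

    # Assumed: the dispatcher hands off to the trainer's train() method
    return trainer.train(data_path, config, out_dir)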
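
As the number of model types grows, a dictionary-based registry is one alternative to the if/elif chain. The sketch below is a suggestion, not how rootml/train/run.py is currently written. `TRAINERS` and `register_trainer` are hypothetical names, and the trainer class is a stand-in that does not subclass BaseTrainer so that the example runs on its own.

```python
TRAINERS = {}


def register_trainer(name):
    """Class decorator that records a trainer class under a model name."""
    def decorator(cls):
        TRAINERS[name] = cls
        return cls
    return decorator


@register_trainer("mymodel")
class MyTrainer:  # stand-in; a real trainer would subclass BaseTrainer
    def train(self, data_path, config, out_dir):
        return f"trained {config['model']} on {data_path}"


def run_training(data_path, config, out_dir):
    """Look up the trainer for config['model'] and run it."""
    model_type = config["model"]
    try:
        trainer_cls = TRAINERS[model_type]
    except KeyError:
        raise ValueError(f"Unknown model type: {model_type}") from None
    return trainer_cls().train(data_path, config, out_dir)


result = run_training("data.parquet", {"model": "mymodel"}, "outputs/run")
```

With this pattern, adding a model only requires decorating the new class; the dispatcher itself stays unchanged.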

Performance

Qualitative observations from runs on the synthetic dataset with chunked RDataFrame processing (no timings are reported here):

Task	Dataset Size	Chunk Size	Notes
Export	~50k events, 7 features	10,000 rows	Stable memory usage across chunks
Training (XGBoost)	Same dataset	—	Train/val/test split, AUC logged
Attachment	Same dataset	—	Event-ID matched merge

Chunked processing keeps memory bounded regardless of file size. chunk_size is configurable in the export config (configs/export.yaml).

Contributing

Contributions welcome! Areas for improvement:

Additional model types (LightGBM, neural networks)
GPU acceleration for training
Distributed processing support
More sophisticated feature engineering

License

MIT License - see LICENSE file for details.

Contact

For questions or issues, please open a GitHub issue or contact the maintainer.
