A Python toolkit for bridging ROOT and machine learning workflows in high-energy physics. Export ROOT files to Parquet, train ML models, and attach predictions back to ROOT files.
- Export ROOT → Parquet: Convert ROOT TTree data to Parquet format with metadata preservation
- Train ML Models: Built-in support for XGBoost with configurable training pipelines
- Attach Predictions: Add ML scores back to ROOT files as new branches
- Chunked Processing: Memory-efficient handling of large datasets
- Provenance Tracking: Automatic metadata capture (git commit, timestamps, config)
# Clone the repository
git clone https://github.com/zagraywolf/rootml-bridge.git
cd rootml-bridge
# Install dependencies
pip install --break-system-packages \
ROOT \
pandas \
pyarrow \
xgboost \
scikit-learn \
pyyaml \
typer

python examples/make_synthetic_root.py

This creates synthetic.root with 50,000 events containing:
- Features: x1, x2, x3
- Label: label (binary classification)
- Event identifiers: run, lumi, event
- Event weights: weight

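To make the event layout concrete, here is a minimal, stdlib-only sketch of records with the same fields. It is not the code in examples/make_synthetic_root.py; the function name `make_synthetic_events`, the distributions, and the run/lumi numbering are all assumptions chosen for illustration.

```python
import random


def make_synthetic_events(n_events, seed=42):
    """Generate toy events with the fields listed above (hypothetical layout)."""
    rng = random.Random(seed)
    events = []
    for i in range(n_events):
        label = rng.randint(0, 1)
        events.append({
            # Assumed: signal (label == 1) is shifted in x1 so a classifier can learn.
            "x1": rng.gauss(1.0 if label else 0.0, 1.0),
            "x2": rng.gauss(0.0, 1.0),
            "x3": rng.uniform(0.0, 1.0),
            "label": label,
            # Assumed numbering: one run, 1000 events per lumi section.
            "run": 1,
            "lumi": i // 1000 + 1,
            "event": i,
            "weight": 1.0,
        })
    return events


events = make_synthetic_events(5)
print(sorted(events[0]))
```

A fixed seed keeps the toy sample reproducible, which helps when comparing training runs.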
python -m rootml.cli.main export \
--config configs/export.yaml \
--out data.parquet

Export config (configs/export.yaml):
input_files:
- synthetic.root
tree: Events
features:
- x1
- x2
- x3
label: label
weight: weight
event_id:
- run
- lumi
- event
selection: null # Optional ROOT selection string
chunk_size: 10000 # Rows per chunk

python -m rootml.cli.main train \
--data data.parquet \
--config configs/train.yaml \
--out outputs/train_run_1

Training config (configs/train.yaml):
model: xgboost
target: label
weight: weight
features:
- x1
- x2
- x3
test_size: 0.2
val_size: 0.1
seed: 42
xgb_params:
  max_depth: 4
  n_estimators: 200
  learning_rate: 0.1
  subsample: 0.8

Outputs:
- model.json: Trained XGBoost model
- metrics.json: Test AUC and other metrics
- scores.parquet: Full dataset with ML predictions

python -m rootml.cli.main attach \
--input-root synthetic.root \
--tree Events \
--scores outputs/train_run_1/scores.parquet \
--out synthetic_with_scores.root

This creates a new ROOT file with an additional ml_score branch.

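The core of the attach step is lining scores up with events by identifier rather than by row position. The sketch below shows that idea with plain Python containers. The helper `align_scores` is hypothetical and is not the rootml implementation; keying on the full (run, lumi, event) tuple is also an assumption.

```python
def align_scores(event_ids, score_table, default=float("nan")):
    """Return one score per event, in the order the events appear in the tree."""
    return [score_table.get(event_id, default) for event_id in event_ids]


# (run, lumi, event) keys in the order they appear in the tree
tree_ids = [(1, 1, 100), (1, 1, 101), (1, 2, 102)]
# Scores keyed by the same identifiers, stored in a different order
scores = {(1, 2, 102): 0.91, (1, 1, 100): 0.12, (1, 1, 101): 0.55}

ml_score = align_scores(tree_ids, scores)
print(ml_score)  # [0.12, 0.55, 0.91]
```

Events without a score receive NaN here; a real pipeline might prefer to raise an error instead.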
ROOT File (TTree)
↓
└─ Export (chunked) → Parquet
↓
└─ Train ML → Model + Scores
↓
└─ Attach → ROOT File + ML Branch
| Field | Type | Description | Required |
|---|---|---|---|
| input_files | List[str] | ROOT files to process | Yes |
| tree | str | TTree name | Yes |
| features | List[str] | Feature columns | Yes |
| label | str | Target variable | Yes |
| event_id | List[str] | Event identifier columns | Yes |
| weight | str | Event weight column | No |
| selection | str | ROOT selection string | No |
| chunk_size | int | Rows per chunk (default: 100k) | No |

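The field table above translates directly into a small validation routine. The following is a sketch only: `validate_export_config` is a hypothetical helper, and the checks in rootml/config.py may differ.

```python
# Field names and types taken from the export config table.
REQUIRED = {"input_files": list, "tree": str, "features": list,
            "label": str, "event_id": list}
OPTIONAL = {"weight": str, "selection": str, "chunk_size": int}


def validate_export_config(cfg):
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    for key, expected in REQUIRED.items():
        if key not in cfg:
            problems.append(f"missing required field: {key}")
        elif not isinstance(cfg[key], expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    for key, expected in OPTIONAL.items():
        value = cfg.get(key)
        # Optional fields may be absent or null (None after YAML loading).
        if value is not None and not isinstance(value, expected):
            problems.append(f"{key} should be of type {expected.__name__}")
    return problems


cfg = {
    "input_files": ["synthetic.root"],
    "tree": "Events",
    "features": ["x1", "x2", "x3"],
    "label": "label",
    "weight": "weight",
    "event_id": ["run", "lumi", "event"],
    "selection": None,
    "chunk_size": 10000,
}
print(validate_export_config(cfg))  # []
```

Collecting all problems before reporting them saves a round trip when a config has more than one mistake.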
| Field | Type | Description |
|---|---|---|
| model | str | Model type (currently: xgboost) |
| target | str | Target variable name |
| weight | str | Weight column name |
| features | List[str] | Feature columns |
| test_size | float | Test set fraction |
| val_size | float | Validation set fraction |
| seed | int | Random seed |
| xgb_params | dict | XGBoost hyperparameters |

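To show how test_size, val_size, and seed interact, here is a stdlib-only sketch of a seeded three-way split. It assumes both fractions are taken relative to the full dataset; the actual trainer may interpret val_size differently (for example, as a fraction of the remaining training data). `split_indices` is a hypothetical name.

```python
import random


def split_indices(n_rows, test_size, val_size, seed):
    """Shuffle row indices with a fixed seed and cut them into train/val/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    n_val = round(n_rows * val_size)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_indices(50_000, test_size=0.2, val_size=0.1, seed=42)
print(len(train), len(val), len(test))  # 35000 5000 10000
```

Because the shuffle is seeded, rerunning with the same seed reproduces the same partition.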
from rootml.attach import attach_scores

attach_scores(
    input_root="input.root",
    tree_name="Events",
    scores_path="scores.parquet",
    event_col="event",
    score_col="ml_score",
    output_root="output.root"
)

Apply ROOT selection strings during export:

selection: "pt > 30 && abs(eta) < 2.4"

The exported Parquet file contains provenance metadata:

import pyarrow.parquet as pq

metadata = pq.read_metadata('data.parquet')
# File-level key-value metadata; keys and values are bytes
custom_metadata = metadata.metadata
print(custom_metadata[b'rootml_export_time'].decode())
print(custom_metadata[b'git_commit'].decode())
print(custom_metadata[b'config'].decode())

rootml/
├── __init__.py
├── attach.py # Attach scores to ROOT files
├── config.py # Configuration loading/validation
├── export.py # ROOT → Parquet export
├── cli/
│ └── main.py # Command-line interface
└── train/
    ├── __init__.py
    ├── base.py # BaseTrainer abstract class
    ├── run.py # Training dispatcher
    └── xgb_trainer.py # XGBoost implementation

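The provenance metadata stored by the export step (export time, git commit, config) could be assembled along the following lines. This is a sketch, not the code in export.py: the helper names `get_git_commit` and `build_provenance` are hypothetical, and only the three metadata keys come from this README.

```python
import json
import subprocess
from datetime import datetime, timezone


def get_git_commit():
    """Return the current commit hash, or 'unknown' outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_provenance(config):
    """Build a bytes-to-bytes mapping suitable for Parquet key-value metadata."""
    return {
        b"rootml_export_time": datetime.now(timezone.utc).isoformat().encode(),
        b"git_commit": get_git_commit().encode(),
        b"config": json.dumps(config, sort_keys=True).encode(),
    }


provenance = build_provenance({"tree": "Events", "chunk_size": 10000})
print(provenance[b"config"].decode())
```

With pyarrow, a mapping like this can be attached to a table through `Table.replace_schema_metadata` before writing the file.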
- Chunked Processing: Uses RDataFrame's entry index (rdfentry_) to process large files in chunks without loading everything into memory
- Implicit Multi-Threading Disabled: ROOT.ROOT.DisableImplicitMT() is called to ensure deterministic chunking behavior
- Event ID Preservation: Maintains run/lumi/event identifiers throughout the pipeline for accurate score attachment
- Flattened Arrays: ROOT-exported columns (stored as 1-element arrays) are automatically flattened during training

ROOT exports numeric columns as 1-element arrays. The training pipeline automatically flattens these:

# Automatic flattening in xgb_trainer.py
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].apply(lambda x: x[0] if hasattr(x, "__len__") else x)

For very large files (>100M events), adjust chunk_size in the export config:

chunk_size: 50000 # Smaller chunks = less memory

If attachment fails with missing events, verify that:
- Event IDs are unique in the ROOT file
- The same events exist in both the ROOT file and scores Parquet
- Event ID column names match between config and data
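The first two checks above can be automated once the identifiers from both files are available as tuples. The sketch below is a hypothetical diagnostic, not part of rootml; it assumes events are keyed by (run, lumi, event).

```python
from collections import Counter


def diagnose_event_ids(root_ids, score_ids):
    """Report duplicate and unmatched event identifiers between the two sources."""
    root_counts = Counter(root_ids)
    score_counts = Counter(score_ids)
    return {
        "duplicates_in_root": sorted(k for k, n in root_counts.items() if n > 1),
        "missing_from_scores": sorted(set(root_counts) - set(score_counts)),
        "missing_from_root": sorted(set(score_counts) - set(root_counts)),
    }


root_ids = [(1, 1, 100), (1, 1, 101), (1, 1, 101), (1, 2, 102)]
score_ids = [(1, 1, 100), (1, 1, 101), (1, 3, 200)]

report = diagnose_event_ids(root_ids, score_ids)
for name, ids in report.items():
    print(name, ids)
```

An empty list for all three entries means every event in the ROOT file has exactly one matching score.
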
- Create a new trainer in rootml/train/:

from rootml.train.base import BaseTrainer

class MyTrainer(BaseTrainer):
    def train(self, data_path, config, out_dir):
        # Your training logic
        pass

- Register in rootml/train/run.py:

def run_training(data_path, config, out_dir):
    model_type = config["model"]
    if model_type == "xgboost":
        trainer = XGBTrainer()
    elif model_type == "mymodel":
        trainer = MyTrainer()
    else:
        raise ValueError(f"Unknown model type: {model_type}")

Benchmarks on synthetic datasets with chunked RDataFrame processing:

| Task | Dataset Size | Chunk Size | Notes |
|---|---|---|---|
| Export | ~50k events, 7 features | 10,000 rows | Stable memory usage across chunks |
| Training (XGBoost) | Same dataset | — | Train/val/test split, AUC logged |
| Attachment | Same dataset | — | Event-ID matched merge |
Chunked processing keeps memory use bounded by chunk_size rather than by file size; chunk_size is configurable in export.yaml.
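
As an illustration of why memory stays bounded, the entry ranges for a chunked export can be computed up front and processed one at a time. The sketch below is stdlib-only; `chunk_ranges` is a hypothetical helper, and the filter strings built from rdfentry_ show one possible way to select a chunk, not necessarily how export.py does it.

```python
def chunk_ranges(n_entries, chunk_size):
    """Yield half-open (start, stop) entry ranges covering [0, n_entries)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_entries, chunk_size):
        yield start, min(start + chunk_size, n_entries)


ranges = list(chunk_ranges(50_000, 10_000))
print(len(ranges))  # 5
print(ranges[-1])   # (40000, 50000)

# Assumed usage: one selection string per chunk, based on the entry index.
filters = [f"rdfentry_ >= {start} && rdfentry_ < {stop}" for start, stop in ranges]
print(filters[0])   # rdfentry_ >= 0 && rdfentry_ < 10000
```

Only one range is materialized at a time, so peak memory scales with chunk_size and the number of columns, not with the total number of entries.
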
Contributions welcome! Areas for improvement:
- Additional model types (LightGBM, neural networks)
- GPU acceleration for training
- Distributed processing support
- More sophisticated feature engineering
MIT License - see LICENSE file for details.
For questions or issues, please open a GitHub issue or contact the maintainer.