GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.
- Two simulation modes: Fully synthetic or templated (extract factor from input reference data + simulate batch effect)
- Paired multi-glycome simulation:
simulate_paired()generates two glycomic datasets (e.g., N- and O-glycomics) from the same biological samples, with shared batch labels and optional controllable cross-class coupling - Controllable effects injection: Systematic grid search over biological effect or batch effect strength parameters
- Motif-level effects: For both bio and batch effects, desired motif differences (e.g.,
Neu5Ac: down) can be introduced. These are propagated in a dynamically constructed biosynthetic network to ensure physiological glycomics data (e.g., corresponding increase in desialylated glycans in the example ofNeu5Ac: down) - MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans
- Python 3.10–3.12 required (
>=3.10,<3.13). We recommend creating a dedicated virtual environment:
python3.10 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate- Core dependency:
glycowork>=1.8.0
pip install glycoforgeOR
git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .See run_simulation.ipynb for interactive simulation examples, or use_cases/batch_correction/
for batch correction workflows, and benchmarking_batch_effect_removal.ipynb
for the full six-method benchmark.
All simulation operates in CLR (centered log-ratio) space using a Gaussian copula sampler:
- Reference construction: In synthetic mode, pool all class-matched
glycoworkdatasets, compute per-dataset Ledoit-Wolf covariance, extract and average correlation matrices into a consensusR, and collect empirical CLR marginals from the pooled samples. In templated mode, compute a single Ledoit-Wolf covariance and extract empirical CLR marginals from the input dataset's samples. - Clean data generation (
simulate_clean_data): DrawZ ~ N(0, R_reg), apply the probability integral transform to get uniform marginals, then map to real CLR empirical quantiles via linear interpolation. This preserves both inter-feature correlation (Mantel r) and marginal realism (KS). - Biological injection: In templated mode, inject real effect sizes in CLR space:
z_U = z_H + m * lambda * d_robust, wheremis the differential mask,lambdaisbio_strength, andd_robustis the effect vector afterrobust_effect_size_processing. In synthetic mode, inject along the top-K eigenvectors of the pooled covariance with random signs, scaled bybio_strength * std * sqrt(n_glycans)to remain PVCA-meaningful across feature counts. Thealpha_Handalpha_Uparameters are computed to define the CLR-space injection direction (viaclr(p_U) - clr(p_H)) - Batch effects: Two modes controlled by
batch_mode. In additive mode (default, ComBat-correctable):Y_batch = Y_clean + kappa_mu * sigma * u_b + epsilon. In multiplicative mode (non-linear):Y_batch = Y_clean + kappa_mu * u_b * Y_clean + epsilon. Variance inflation uses batch-specific scale factors spread evenly around 1.0, controlled byvar_b. Compositional pairing ensures substrate-product glycans receive correlated noise. - MNAR missingness: Logistic model in log-abundance space with per-sample intercept solved via Brent's method for exact target missingness fraction.
glycoforge.simulate() generates a single glycomic dataset per entity in two modes controlled by data_source. For paired multi-glycome data from the same samples, use glycoforge.simulate_paired() instead (see below). Configuration files are in sample_config/.
Synthetic mode (data_source="simulated") – Fully synthetic simulation (click to show detail introduction)
No real data dependency. Ideal for controlled experiments with known ground truth.
Pipeline steps:
- Pools all class-matched glycowork datasets (filtered by
glycan_class) to build a consensus Ledoit-Wolf correlation matrix and empirical CLR marginals via_build_copula_ref - Constructs
alpha_Hfrom the pooled mean abundance profile across all candidate datasets, scaled to sum proportional to10 * n_glycans - For each random seed, generates
alpha_Uby randomly scalingalpha_H(30% up, 35% down by default), or via motif-guided compositional pairing ifmotif_rulesis provided - Samples clean cohorts via Gaussian copula (LW correlation + empirical marginals) with biological injection along top-K eigenvectors of the pooled covariance, scaled by
bio_strength - Defines batch effect direction vectors
u_dictonce per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep) - Applies batch effects controlled by
kappa_mu(shift strength) andvar_b(variance scaling) - Optionally applies MNAR (Missing Not At Random) missingness:
missing_fraction: proportion of missing values (0.0–1.0)mnar_bias: intensity-dependent bias (default 2.0, range 0.5–5.0)- Left-censored pattern: low-abundance glycans more likely to be missing
- Grid search over
kappa_muandvar_bproduces multiple datasets under identical batch effect structure
Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, batch_mode, missing_fraction, mnar_bias
Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction)
Starts from real glycomics data to preserve biological signal structure. Accepts CSV file or glycowork.glycan_data datasets.
Pipeline steps:
- Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via
column_prefix) - Runs CLR-based differential expression via
glycowork.get_differential_expressionto compute Cohen's d effect sizes - Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
- Applies
differential_maskto select which glycans receive biological signal injection:"All": inject into all glycans"significant": only glycans marked significant by glycowork"Top-N": top N glycans by absolute effect size (e.g.,"Top-10")
- Processes effect sizes through
robust_effect_size_processing:- Centers effect sizes to remove global shift
- Applies Winsorization to clip extreme outliers (auto-selects percentile 85–99, or uses
winsorize_percentile) - Normalizes by baseline (
baseline_method: median, MAD, or p75) - Returns normalized
d_robustscaled bybio_strength
- Injects effects in CLR space:
z_U = z_H + mask * bio_strength * d_robust - Converts back to proportions:
p_U = invclr(z_U) - Computes Ledoit-Wolf covariance from pooled CLR data
- Samples clean cohorts via Gaussian copula using LW correlation and empirical CLR marginals, with the CLR injection vector shifting unhealthy samples
- Defines batch effect direction vectors
u_dictonce per run (fixed seed ensures fair comparison across parameter combinations) - Applies batch effects: in additive mode,
y_batch = y_clean + kappa_mu * sigma * u_b + epsilon; in multiplicative mode,y_batch = y_clean + kappa_mu * u_b * y_clean + epsilon, where variance inflation uses batch-specific scale factors controlled byvar_b - Optionally applies MNAR missingness (same as Simplified mode)
- Grid search over
bio_strength,k_dir,variance_ratio,kappa_mu,var_bto systematically test biological signal and batch effect interactions
Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias, batch_mode
Paired mode (simulate_paired()) – Two glycomic classes from the same biological samples (click to show detail introduction)
Generates two glycomic datasets (e.g., N- and O-glycomics) that share sample identity: the same n_H healthy and n_U unhealthy individuals appear in both, so bio_labels, batch_labels, and column names are identical across glycomes. Each glycome is otherwise independently parameterised (different glycan counts, Dirichlet parameters, biological effect structures, and batch direction vectors).
Pipeline steps:
- Draws independent log-normal healthy baselines
alpha_H_Aandalpha_H_B(fixed seeds 42/43 for reproducibility) - Generates per-glycome
alpha_Uusing independent seeds so biological effects are not correlated by seed reuse - Simulates clean compositional data for both glycomes via Gaussian copula, using independently constructed pooled references from
_build_copula_ref(class-filtered byglycan_class_A/B) - Optionally injects cross-class coupling in CLR space via shared latent factors
Z ~ N(0, I):Y_A_clr += coupling_strength * (Z @ U_A.T) * sigma_AY_B_clr += coupling_strength * (Z @ U_B.T) * sigma_B- At
coupling_strength=0the two glycomes are conditionally independent given sample labels; induced HSIC scales ascoupling_strength² - Direction matrices
U_A,U_Bcan be biased toward motif-matching glycans viacoupling_motif_A/B
- Round-trips through
invclrto restore simplex validity after coupling injection - Applies shared batch labels with independent per-glycome direction vectors (same samples in the same batches, but different glycans affected)
- Applies MNAR missingness independently per glycome (independent seeds prevent artificially correlated missing-value patterns)
Key parameters: n_glycans_A/B, bio_strength_A/B, k_dir_A/B, variance_ratio_A/B, coupling_strength, n_coupling_components, coupling_motif_A/B, kappa_mu, var_b, missing_fraction, mnar_bias
The use_cases/batch_correction/ directory demonstrates:
- Call
glycoforgesimulation, and then apply correction workflow - Six-method batch correction benchmark (ComBat, Percentile, Ratio-ComBat, Harmony, limma-style, Stratified ComBat) across a parameter grid of biological signal strengths and batch effect severities
- Batch correction effectiveness metrics visualization
Two biological groups only: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (≥3 groups) requires refactoring parameter generation and evaluation metrics.
If you use GlycoForge in your research, please cite:
Hu, S. and Bojar, D. (2026). GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking. bioRxiv, doi:10.64898/2026.02.20.707134
BibTeX:
@article{hu2026glycoforge,
title = {GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking},
author = {Hu, Siyu and Bojar, Daniel},
journal = {bioRxiv},
year = {2026},
doi = {10.64898/2026.02.20.707134},
url = {https://www.biorxiv.org/content/10.64898/2026.02.20.707134v1}
}