Skip to content

BojarLab/GlycoForge

Repository files navigation

GlycoForge logo

GlycoForge is a simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection.

Key Features

  • Two simulation modes: Fully synthetic or templated (extract factor from input reference data + simulate batch effect)
  • Paired multi-glycome simulation: simulate_paired() generates two glycomic datasets (e.g., N- and O-glycomics) from the same biological samples, with shared batch labels and optional controllable cross-class coupling
  • Controllable effects injection: Systematic grid search over biological effect or batch effect strength parameters
  • Motif-level effects: For both bio and batch effects, desired motif differences (e.g., Neu5Ac: down) can be introduced. These are propagated in a dynamically constructed biosynthetic network to ensure physiological glycomics data (e.g., corresponding increase in desialylated glycans in the example of Neu5Ac: down)
  • MNAR missing data simulation: Mimics left-censored patterns biased toward low-abundance glycans

Quick Start

Installation

  • Python 3.10–3.12 required (>=3.10,<3.13). We recommend creating a dedicated virtual environment:
python3.10 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
  • Core dependency: glycowork>=1.8.0
pip install glycoforge

OR

git clone https://github.com/BojarLab/GlycoForge.git
cd GlycoForge
python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .

Usage

See run_simulation.ipynb Open In Colab for interactive simulation examples, or use_cases/batch_correction/ Open In Colab for batch correction workflows, and benchmarking_batch_effect_removal.ipynb Open In Colab for the full six-method benchmark.

How the simulator works

All simulation operates in CLR (centered log-ratio) space using a Gaussian copula sampler:

  • Reference construction: In synthetic mode, pool all class-matched glycowork datasets, compute per-dataset Ledoit-Wolf covariance, extract and average correlation matrices into a consensus R, and collect empirical CLR marginals from the pooled samples. In templated mode, compute a single Ledoit-Wolf covariance and extract empirical CLR marginals from the input dataset's samples.
  • Clean data generation (simulate_clean_data): Draw Z ~ N(0, R_reg), apply the probability integral transform to get uniform marginals, then map to real CLR empirical quantiles via linear interpolation. This preserves both inter-feature correlation (Mantel r) and marginal realism (KS).
  • Biological injection: In templated mode, inject real effect sizes in CLR space: z_U = z_H + m * lambda * d_robust, where m is the differential mask, lambda is bio_strength, and d_robust is the effect vector after robust_effect_size_processing. In synthetic mode, inject along the top-K eigenvectors of the pooled covariance with random signs, scaled by bio_strength * std * sqrt(n_glycans) to remain PVCA-meaningful across feature counts. The alpha_H and alpha_U parameters are computed to define the CLR-space injection direction (via clr(p_U) - clr(p_H))
  • Batch effects: Two modes controlled by batch_mode. In additive mode (default, ComBat-correctable): Y_batch = Y_clean + kappa_mu * sigma * u_b + epsilon. In multiplicative mode (non-linear): Y_batch = Y_clean + kappa_mu * u_b * Y_clean + epsilon. Variance inflation uses batch-specific scale factors spread evenly around 1.0, controlled by var_b. Compositional pairing ensures substrate-product glycans receive correlated noise.
  • MNAR missingness: Logistic model in log-abundance space with per-sample intercept solved via Brent's method for exact target missingness fraction.

Simulation Modes

glycoforge.simulate() generates a single glycomic dataset per entity in two modes controlled by data_source. For paired multi-glycome data from the same samples, use glycoforge.simulate_paired() instead (see below). Configuration files are in sample_config/.

Synthetic mode (data_source="simulated") – Fully synthetic simulation (click to show detail introduction)

No real data dependency. Ideal for controlled experiments with known ground truth.

Pipeline steps:

  1. Pools all class-matched glycowork datasets (filtered by glycan_class) to build a consensus Ledoit-Wolf correlation matrix and empirical CLR marginals via _build_copula_ref
  2. Constructs alpha_H from the pooled mean abundance profile across all candidate datasets, scaled to sum proportional to 10 * n_glycans
  3. For each random seed, generates alpha_U by randomly scaling alpha_H (30% up, 35% down by default), or via motif-guided compositional pairing if motif_rules is provided
  4. Samples clean cohorts via Gaussian copula (LW correlation + empirical marginals) with biological injection along top-K eigenvectors of the pooled covariance, scaled by bio_strength
  5. Defines batch effect direction vectors u_dict once per simulation run (fixed seed ensures reproducible batch geometry across parameter sweep)
  6. Applies batch effects controlled by kappa_mu (shift strength) and var_b (variance scaling)
  7. Optionally applies MNAR (Missing Not At Random) missingness:
    • missing_fraction: proportion of missing values (0.0–1.0)
    • mnar_bias: intensity-dependent bias (default 2.0, range 0.5–5.0)
    • Left-censored pattern: low-abundance glycans more likely to be missing
  8. Grid search over kappa_mu and var_b produces multiple datasets under identical batch effect structure

Key parameters: n_glycans, n_H, n_U, kappa_mu, var_b, batch_mode, missing_fraction, mnar_bias

Templated mode (data_source="real") – Extract biological effect from input reference data + simulate batch effect (click to show detail introduction)

Starts from real glycomics data to preserve biological signal structure. Accepts CSV file or glycowork.glycan_data datasets.

Pipeline steps:

  1. Loads CSV and extracts healthy/unhealthy sample columns by prefix (configurable via column_prefix)
  2. Runs CLR-based differential expression via glycowork.get_differential_expression to compute Cohen's d effect sizes
  3. Reindexes effect sizes to match input glycan order (fills missing glycans with 0.0)
  4. Applies differential_mask to select which glycans receive biological signal injection:
    • "All": inject into all glycans
    • "significant": only glycans marked significant by glycowork
    • "Top-N": top N glycans by absolute effect size (e.g., "Top-10")
  5. Processes effect sizes through robust_effect_size_processing:
    • Centers effect sizes to remove global shift
    • Applies Winsorization to clip extreme outliers (auto-selects percentile 85–99, or uses winsorize_percentile)
    • Normalizes by baseline (baseline_method: median, MAD, or p75)
    • Returns normalized d_robust scaled by bio_strength
  6. Injects effects in CLR space: z_U = z_H + mask * bio_strength * d_robust
  7. Converts back to proportions: p_U = invclr(z_U)
  8. Computes Ledoit-Wolf covariance from pooled CLR data
  9. Samples clean cohorts via Gaussian copula using LW correlation and empirical CLR marginals, with the CLR injection vector shifting unhealthy samples
  10. Defines batch effect direction vectors u_dict once per run (fixed seed ensures fair comparison across parameter combinations)
  11. Applies batch effects: in additive mode, y_batch = y_clean + kappa_mu * sigma * u_b + epsilon; in multiplicative mode, y_batch = y_clean + kappa_mu * u_b * y_clean + epsilon, where variance inflation uses batch-specific scale factors controlled by var_b
  12. Optionally applies MNAR missingness (same as Simplified mode)
  13. Grid search over bio_strength, k_dir, variance_ratio, kappa_mu, var_b to systematically test biological signal and batch effect interactions

Key parameters: data_file, column_prefix, bio_strength, k_dir, variance_ratio, differential_mask, winsorize_percentile, baseline_method, kappa_mu, var_b, missing_fraction, mnar_bias, batch_mode

Paired mode (simulate_paired()) – Two glycomic classes from the same biological samples (click to show detail introduction)

Generates two glycomic datasets (e.g., N- and O-glycomics) that share sample identity: the same n_H healthy and n_U unhealthy individuals appear in both, so bio_labels, batch_labels, and column names are identical across glycomes. Each glycome is otherwise independently parameterised (different glycan counts, Dirichlet parameters, biological effect structures, and batch direction vectors).

Pipeline steps:

  1. Draws independent log-normal healthy baselines alpha_H_A and alpha_H_B (fixed seeds 42/43 for reproducibility)
  2. Generates per-glycome alpha_U using independent seeds so biological effects are not correlated by seed reuse
  3. Simulates clean compositional data for both glycomes via Gaussian copula, using independently constructed pooled references from _build_copula_ref (class-filtered by glycan_class_A/B)
  4. Optionally injects cross-class coupling in CLR space via shared latent factors Z ~ N(0, I):
    • Y_A_clr += coupling_strength * (Z @ U_A.T) * sigma_A
    • Y_B_clr += coupling_strength * (Z @ U_B.T) * sigma_B
    • At coupling_strength=0 the two glycomes are conditionally independent given sample labels; induced HSIC scales as coupling_strength²
    • Direction matrices U_A, U_B can be biased toward motif-matching glycans via coupling_motif_A/B
  5. Round-trips through invclr to restore simplex validity after coupling injection
  6. Applies shared batch labels with independent per-glycome direction vectors (same samples in the same batches, but different glycans affected)
  7. Applies MNAR missingness independently per glycome (independent seeds prevent artificially correlated missing-value patterns)

Key parameters: n_glycans_A/B, bio_strength_A/B, k_dir_A/B, variance_ratio_A/B, coupling_strength, n_coupling_components, coupling_motif_A/B, kappa_mu, var_b, missing_fraction, mnar_bias

Use Cases

The use_cases/batch_correction/ directory demonstrates:

  • Call glycoforge simulation, and then apply correction workflow
  • Six-method batch correction benchmark (ComBat, Percentile, Ratio-ComBat, Harmony, limma-style, Stratified ComBat) across a parameter grid of biological signal strengths and batch effect severities
  • Batch correction effectiveness metrics visualization

Limitation

Two biological groups only: Current implementation targets healthy/unhealthy setup. Supporting multi-stage disease (≥3 groups) requires refactoring parameter generation and evaluation metrics.

Citation

If you use GlycoForge in your research, please cite:

Hu, S. and Bojar, D. (2026). GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking. bioRxiv, doi:10.64898/2026.02.20.707134

BibTeX:

@article{hu2026glycoforge,
  title   = {GlycoForge generates realistic glycomics data under known ground truth for rigorous method benchmarking},
  author  = {Hu, Siyu and Bojar, Daniel},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.20.707134},
  url     = {https://www.biorxiv.org/content/10.64898/2026.02.20.707134v1}
}

About

A simulation tool for generating glycomic relative-abundance datasets with customizable biological group differences and controllable batch-effect injection

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors