RmvLMM: A Rotated Multivariate Linear Mixed Model for Dual Large-Scale GWAS

Rotated multivariate Linear Mixed Model (RmvLMM) is a powerful and scalable statistical framework for dual large-scale GWAS, applicable to biobank-scale samples and a large number of phenotypes. Existing multi-trait GWAS methods are often computationally prohibitive for such massive datasets due to memory constraints and processing time. RmvLMM addresses these challenges using an orthogonal rotation framework, achieving both computational efficiency and high detection power.

Key Features

Orthogonal Rotation: Decorrelates multiple traits to enable efficient multi-trait analysis and achieve high detection power.
Fast MoM-REML Iterative Algorithm: A novel algorithm combining the Method-of-Moments and REML for rapid covariance matrix estimation.
Divided-and-Combined Strategy: Parallelizes analysis across split-sample groups, aggregating signals into a robust omnibus test and calibrating results via generalized gamma distribution approximation.
Biobank Scalability: Efficiently handles hundreds of thousands of individuals and millions of SNPs. It allows users to flexibly partition samples and process SNP matrices in batches based on available computational resources, enabling efficient multi-trait GWAS at the Biobank scale.
Modular Design: RmvLMM package follows a modular philosophy by focusing on the core statistical inference, allowing for seamless integration with high-performance I/O tools like PLINK.

Installation

You can install the development version of RmvLMM from GitHub:

# install.packages("remotes")
remotes::install_github("amss-stat/RmvLMM")

Dependencies

Tested with: R (>= 4.4.0) data.table (1.17.2) MASS (7.3.65) survival (3.8.3) flexsurv (2.3.2)

Quick Start

1. Input Data Format

Before running the analysis, ensure your data follows these formats:

Note: RmvLMM package is designed to be compatible with standard GWAS pipelines. Upstream tasks such as quality control and sample/SNP splitting are ideally performed using tools like PLINK before loading matrices into R.

Phenotype Matrix (Y): $N \times D$ matrix of quantitative traits.

ID	Trait_1	Trait_2	...	Trait_D
Indiv_1	1.25	-0.42	...	0.88
Indiv_2	-0.12	2.15	...	-1.04
Indiv_3	0.55	0.33	...	0.12

Covariate Matrix (X): $N \times C$ matrix, must include a column of 1s as the intercept.

ID	Intercept	Age	Sex	PC1	...
Indiv_1	1	45	1	-0.012	...
Indiv_2	1	52	0	0.045	...
Indiv_3	1	38	1	-0.008	...

Genotype Matrix (G): $M \times N$ matrix (SNPs in rows, Individuals in columns). SNP IDs must be provided as row names. A numeric vector is also acceptable for a single SNP.

SNP	Indiv_1	Indiv_2	Indiv_3	...
rs1001	0	1	0	...
rs1002	2	1	0	...
rs1003	0	0	1	...

Note: For all input matrices (Y, X, G), IDs should be provided as row/column names and not as the first row/column of the numeric data.

Sample Relatedness Matrix (K): $N \times N$ kinship or GRM matrix. $K$ must be positive definite. If not, consider adding small perturbations to the eigenvalues or diagonal elements.

Independent SNPs List (independent_snps.csv): A single-column file containing IDs of at least 20,000 approximately independent SNPs (selected via LD pruning or physical distance). The file must contain a header named SNP with the corresponding IDs.

SNP
rs125
rs458
rs992

2. Basic Multi-trait GWAS

For smaller datasets (e.g., $N < 20,000$), you can process the entire sample at once:

library(RmvLMM)

# Y: Quantitative phenotype matrix (N x D numeric matrix)
# X: Covariate matrix (N x C numeric matrix, includes a column of 1s)
# K: Sample relatedness matrix (N x N numeric matrix)
# G: Genotype matrix (M x N numeric matrix). SNP IDs must be provided as row names.
#    A numeric vector is also acceptable for a single SNP.
results <- run_RmvLMM(Y = Y, X = X, K = K, G = G, out_file = "results.rds")

3. Biobank-Scale Aggregation

For large cohorts (e.g., UK Biobank), we recommend a "Divided-and-Combined" strategy:

Divide Samples: Divide the total sample into multiple groups, each containing about 10,000 individuals.
Parallel Processing: For each group, run run_RmvLMM() independently and save the results as _partXX.rds files.
Identify Calibration SNPs: Prepare an ID list of at least 20,000 approximately independent SNPs.
Combine & Calibrate: Use bank_RmvLMM() to aggregate the results into a final omnibus test.

# Combine results from multiple group files
final_results <- bank_RmvLMM(
  rds_files = c("_part1.rds", "_part2.rds", "_part3.rds", ...),    # obtained from run_RmvLMM() for all groups
  indep_snp_file = "independent_snps.csv",    # ID List of ~20,000+ approximately independent SNPs
  out_file = "final_calibrated_results.rds"
)

For more details, you can browse the demo code in the example folder, or view the function comments in main.r in the R folder.

Output Formats

Both run_RmvLMM() and bank_RmvLMM() return a list object.

1. Results from `run_RmvLMM()`

Element	Type	Description
`$results`	`data.frame`	A table with two columns: - `SNP`: SNP identifier (from rownames of G). - `p_TOP2`: The raw joint p-value for the group (or small sample), calculated via the TOP2 statistics method.
`$Vb`	`matrix`	Estimated random polygenic covariance matrix ($D \times D$).
`$Ve`	`matrix`	Estimated random error covariance matrix ($D \times D$).

2. Results from `bank_RmvLMM()`

Element	Type	Description
`$results`	`data.frame`	A table with two columns: - `SNP`: SNP identifier. - `p_omnibus`: The final, calibrated p-value. The value is derived from Fisher’s combination of group-wise results and calibrated using the generalized gamma distribution.
`$Vb`	`matrix`	Final estimated random polygenic covariance matrix ($D \times D$)
`$Ve`	`matrix`	Final estimated random error covariance matrix ($D \times D$).

Performance and Computational Tips

1. Memory and Efficiency

RmvLMM is highly optimized for large-scale tasks.

Reference Task: A task with 26 traits, 10,000 individuals and 100,000 SNPs typically requires ~16GB of RAM.
Real-world Benchmark: In our study, an analysis of 323,839 individuals, 454,296 SNPs, and 26 traits was completed in just 13 hours using a server with 60 cores and 180GB of RAM.
Flexibility: Users can flexibly adjust the sample group size and SNP batching based on available computational resources. To our knowledge, RmvLMM is currently the only method capable of performing exact multi-trait LMM-based GWAS at this scale within a reasonable timeframe.

2. Optimization: Pre-computing Eigen-decomposition

The eigendecomposition of the relatedness matrix $K$ is computationally expensive. If you need to call run_RmvLMM() multiple times for the same set of individuals (e.g., when processing SNPs in different batches), you can pre-compute the decomposition and pass it via the K_precomp parameter to avoid redundant calculations:

# Pre-compute eigen-decomposition once
eK <- eigen(K, symmetric = TRUE)
K_precomp <- list(
  Ak_diag = eK$values,
  Qk = eK$vectors)

# Pass K_precomp to run_RmvLMM
results <- run_RmvLMM(Y = Y, X = X, G = G, K_precomp = K_precomp, out_file = "batch1.rds")

Citation

If you use RmvLMM in your research, please cite:

Guo, H.^†, Zhang, Q.^†, Zheng, X., Quan, Y., & Li, Q.^* (2026). A rotated multivariate linear mixed model for dual large-scale genome-wide association study (Submitted).

^† The first two authors should be regarded as Joint First Authors.
^* Corresponding author. E-mail: liqz@amss.ac.cn

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
R		R
example		example
man		man
.Rbuildignore		.Rbuildignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.md		README.md
RmvLMM.Rproj		RmvLMM.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RmvLMM: A Rotated Multivariate Linear Mixed Model for Dual Large-Scale GWAS

Key Features

Installation

Dependencies

Quick Start

1. Input Data Format

2. Basic Multi-trait GWAS

3. Biobank-Scale Aggregation

Output Formats

1. Results from `run_RmvLMM()`

2. Results from `bank_RmvLMM()`

Performance and Computational Tips

1. Memory and Efficiency

2. Optimization: Pre-computing Eigen-decomposition

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RmvLMM: A Rotated Multivariate Linear Mixed Model for Dual Large-Scale GWAS

Key Features

Installation

Dependencies

Quick Start

1. Input Data Format

2. Basic Multi-trait GWAS

3. Biobank-Scale Aggregation

Output Formats

1. Results from run_RmvLMM()

2. Results from bank_RmvLMM()

Performance and Computational Tips

1. Memory and Efficiency

2. Optimization: Pre-computing Eigen-decomposition

Citation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Results from `run_RmvLMM()`

2. Results from `bank_RmvLMM()`

Packages