THGLab/IDPForge

IDPForge (Intrinsically Disordered Protein, FOlded and disordered Region GEnerator)

A transformer protein language diffusion model that creates all-atom ensembles of IDPs, and of IDRs whose folded domains are kept intact.

Getting started

To get started, this repository must be cloned using the following command:

git clone https://github.com/THGLab/IDPForge.git

Following that, the working conda environment can be established in two ways.

IDPForge Main Installation Protocol

First, navigate to the new IDPForge directory:

cd IDPForge

The base environment can be built manually via the environment.yml file in the repo. To do this, run the following command:

conda env create -f environment.yml

Note: The default environment.yml file is set to install torch==2.5.1 and cuda==12.1 for earlier GPUs (sm_60 - sm_80). If you have newer GPUs (released after Q4 2025) and run into issues with the 12.1 installation, switch to torch==2.7.1 and cuda==12.8. Although the 12.8 build nominally covers sm_60 - sm_120, it is not fully backwards compatible with older architectures, so the 12.1 default is preferred for earlier GPUs. Refer to the comments in the file for modification instructions.
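As a rough aid for choosing between the two builds, the note above can be written as a lookup on the GPU's compute capability. This is only a sketch: the name `suggest_build` is hypothetical, and the handling of architectures between sm_80 and sm_100 (try 12.1 first, then 12.8) is an assumption, since the note names only the sm_60 - sm_80 range and "newer GPUs" explicitly.

```python
BUILD_121 = "torch==2.5.1 / cuda==12.1"
BUILD_128 = "torch==2.7.1 / cuda==12.8"


def suggest_build(compute_capability):
    """Return the builds to try, in order, for a capability string such as "8.0".

    sm_60 - sm_80  -> the 12.1 default, as stated in the note.
    sm_81 - sm_99  -> 12.1 first, 12.8 as a fallback (assumption).
    sm_100 and up  -> 12.8 only (assumption: Blackwell-class cards).
    """
    major, minor = (int(part) for part in compute_capability.strip().split("."))
    sm = major * 10 + minor
    if sm < 60:
        raise ValueError(f"sm_{sm} is older than any architecture the note covers")
    if sm <= 80:
        return [BUILD_121]
    if sm < 100:
        return [BUILD_121, BUILD_128]
    return [BUILD_128]


print(suggest_build("8.0"))
print(suggest_build("12.0"))
```

On recent drivers the capability string can usually be read with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`.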

Once the environment is created, activate it.

conda activate IDPForge

Then install IDPForge as a module in the environment.

pip install -e .

This repo also requires OpenFold utilities, so that repository must be cloned in the same directory as IDPForge. To do this, first navigate to the parent directory.

cd ../

Then clone the OpenFold repository into the parent directory.

git clone https://github.com/aqlaboratory/openfold.git

Once the repository is cloned, proceed into the resources of OpenFold.

cd openfold/openfold/resources

In there, download the following file.

wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

Once this is done, navigate back into the main OpenFold directory.

cd ../../

The OpenFold setup must be replaced. To do this, first locate the 2 setup replacements provided with the IDPForge repository.

ls path/to/my/IDPForge/dockerfiles/openfold_setup_12*

Note: The output should look like the following:
path/to/my/IDPForge/dockerfiles/openfold_setup_12.1.py
path/to/my/IDPForge/dockerfiles/openfold_setup_12.8.py

Then copy openfold_setup_12.1.py into the OpenFold directory as the new setup.py.

cp path/to/my/IDPForge/dockerfiles/openfold_setup_12.1.py path/to/my/openfold/setup.py

Note: If the alternative installation was chosen during the setup of the IDPForge environment, copy the openfold_setup_12.8.py version instead.

Finally, install OpenFold as a module in the environment.

pip install -e .

This makes the environment fully ready for use.

Alternative Installation for Compute Cluster

If you have issues setting up the base environment from the yml file, or if you are setting IDPForge up on an HPC cluster, it is recommended to follow OpenFold's own installation procedure. Start by cloning both repositories into the same directory.

git clone https://github.com/THGLab/IDPForge.git
git clone https://github.com/aqlaboratory/openfold.git

Then navigate into the OpenFold directory.

cd openfold/

First, make a copy of the environment.yml file without flash-attn so it is not installed during environment creation.

python - <<'PY'
from pathlib import Path
src = Path("environment.yml")
dst = Path("environment_noflash.yml")
lines = src.read_text().splitlines()
lines = [ln for ln in lines if "flash-attn" not in ln]
dst.write_text("\n".join(lines) + "\n")
print("Wrote", dst)
PY

Then create the OpenFold environment from the stripped file.

mamba env create -n openfold_env -f environment_noflash.yml

Note: This can also be run with conda env create -n openfold_env -f environment_noflash.yml

Then activate the environment.

conda activate openfold_env

Install other dependencies required by IDPForge using the following commands:

conda install einops mdtraj pdb-tools -c conda-forge
conda install mmseqs2 -c bioconda
pip install tensorboard topoly

Navigate into the resources of OpenFold.

cd openfold/resources

In there, download the following file.

wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

Navigate back to the main directory of OpenFold.

cd ../../

Install OpenFold as a module in the environment.

pip install . --no-build-isolation

Note: If pip install . --no-build-isolation does not work, proceed with pip install -e . instead.

Navigate to the IDPForge directory.

cd ../IDPForge

Install IDPForge as a module.

pip install . --no-build-isolation

Note: If pip install . --no-build-isolation does not work, proceed with pip install -e . instead.

This makes the environment fully ready for use.

Note: For more information on OpenFold installation, please refer to the installation guide. https://openfold.readthedocs.io/en/latest/Installation.html

Downloading model weights and other files

Model weights, example training data, and other inference input files can be downloaded from Figshare.

It is recommended to copy the weights/ directory directly into the IDPForge repository as IDPForge/weights/. Similarly, the contents of data/ can be copied into the given IDPForge/data/ directory.
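A quick way to confirm the files landed where the example commands in this README expect them is a small pre-flight check. The sketch below is not part of the repository; `missing_inputs` is a hypothetical helper, and the list of paths is taken only from the commands shown in this document.

```python
from pathlib import Path

# Paths used by the example commands in this README, relative to the repo root.
EXPECTED = (
    "weights/mdl.ckpt",
    "configs/sample.yml",
    "data",
)


def missing_inputs(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected inputs are in place.")
```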

Notes on ESM2 and Attention

ESM2 utilities are refactored into this repo for network modules and for exploring the effects of ESM embedding on IDP modeling. Alternatively, ESM2 can be installed from its GitHub repository, https://github.com/facebookresearch/esm.git, or via pip install fair-esm.

Optional: pip install flash-attn==2.3 to speed up attention calculation.

Using Docker

IDPForge can also be built as a Docker container using either of the included Dockerfiles (Blackwell or Ampere). The Blackwell image runs on CUDA 12.8 and the Ampere image on CUDA 12.1. Optionally, the weights and data files from Figshare may be merged into the repository before the image is built. The image then contains the merged files, removing the need for additional /weights and /data mounting.

To build the image, run the following command from the root of this repository choosing either Blackwell or Ampere based on preference:

docker build -f dockerfiles/Dockerfile_[Blackwell/Ampere] -t idpforge:latest .

To confirm that the idpforge:latest image was built successfully, run

docker images

To run a container from the newly created image, run

docker run --rm -it --gpus all idpforge:latest

To verify that your docker installation is able to properly communicate with your GPU, run

docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi

Once the image is created, outside directories can be added into a container by mounting them as follows.

# Optional: repeat -v for any other mounts
docker run --rm -it --gpus all \
    -v "[path-to-directory]":/app/[directory-name-in-container] \
    idpforge:latest

Examples of this are given in later sections.

Using Apptainer

The same two recipes are provided as Apptainer (formerly Singularity) definition files, dockerfiles/idpforge_ampere.def and dockerfiles/idpforge_blackwell.def. Build a .sif image from the root of this repository:

apptainer build idpforge.sif dockerfiles/idpforge_ampere.def   # or idpforge_blackwell.def

As with Docker, weights and data files from Figshare may be merged into the repository before building so they are baked into the image; otherwise mount them at runtime.

Running differs from Docker in a few ways: bind mounts use -B src:dst (not -v), the GPU flag is --nv (not --gpus all), environment variables use --env (not -e), and the application lives at /opt/IDPForge (not /app). The image filesystem is read-only, so outputs must go to a bound, writable directory. The %runscript is python "$@", so everything after the image name is passed straight to Python.
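The flag differences listed above can be summarised as a small command builder. This is a sketch for illustration, not a utility shipped with IDPForge; the function name `container_command` and its parameters are hypothetical, and it only assembles an argument list without running anything.

```python
def container_command(runtime, image, script_args, binds=(), env=None, gpu=True):
    """Build an argv list for `docker run` or `apptainer run`.

    binds: iterable of (host_path, container_path) pairs.
    env:   mapping of environment variables to set in the container.
    """
    env = env or {}
    if runtime == "docker":
        cmd = ["docker", "run", "--rm", "-it"]
        if gpu:
            cmd += ["--gpus", "all"]
        for src, dst in binds:
            cmd += ["-v", f"{src}:{dst}"]
        for key, value in env.items():
            cmd += ["-e", f"{key}={value}"]
        # The Docker examples in this README invoke Python explicitly.
        return cmd + [image, "python", "-u", *script_args]
    if runtime == "apptainer":
        cmd = ["apptainer", "run"]
        if gpu:
            cmd.append("--nv")
        for src, dst in binds:
            cmd += ["-B", f"{src}:{dst}"]
        for key, value in env.items():
            cmd += ["--env", f"{key}={value}"]
        # The %runscript is `python "$@"`, so no explicit interpreter is needed.
        return cmd + [image, *script_args]
    raise ValueError(f"unknown runtime: {runtime!r}")


print(" ".join(container_command(
    "apptainer", "idpforge.sif",
    ["/opt/IDPForge/score_ensemble.py", "PROTEIN", "/app/output", "--jc"],
    binds=[("./out", "/app/output")],
    env={"IDPFORGE_EXP_DATA": "/data"},
    gpu=False,
)))
```

Remember that script paths also differ between the two images (/app for Docker, /opt/IDPForge for Apptainer), so `script_args` has to be written for the runtime in use.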

Example (single-chain IDP, Sic1), run from the repository root:

mkdir -p out
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
apptainer run --nv \
    -B ./weights:/opt/IDPForge/weights \
    -B ./out:/app/out \
    idpforge.sif /opt/IDPForge/sample_idp.py "$sequence" \
        /opt/IDPForge/weights/mdl.ckpt /app/out /opt/IDPForge/configs/sample.yml \
        --nconf 10 --batch 4 --cuda --verbose

For IDRs with folded domains, swap in sample_ldr.py with its CKPT NPZ OUTDIR CFG arguments and mount the directory holding the .npz template (e.g. -B ./data:/opt/IDPForge/data).

Training

We use pytorch-lightning for training. Training can be customized via the documented flags under trainer in the config file.

conda activate IDPForge
python train.py --model_config_path configs/train.yml

Sampling

Sampling loops through three phases until the target is met:

  1. Generate + Relax: Calls sample_ldr.py to produce diffusion conformers, which are immediately relaxed via AMBER minimization (relax config loaded from configs/sample.yml).
  2. Repair: Checks each relaxed structure for D-amino acids (chirality) and broken HIS ring bonds. Applies fixes and re-relaxes if any repairs were made.
  3. Validate: Runs unified validation checking chirality, bond integrity, clash score (adaptive smart threshold), and backbone topology (knot detection). Passing structures are renamed to N_validated.pdb.
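The control flow of the three phases can be sketched as follows. This is an illustration of the loop structure only, not the repository's implementation: the three callables are hypothetical stand-ins for the generate/relax, repair, and validate steps, and the `max_rounds` cap is an added assumption to guarantee termination.

```python
def sample_until(target, generate_and_relax, repair, validate, max_rounds=20):
    """Collect validated structures until `target` is reached.

    generate_and_relax(n) -> list of n relaxed structures
    repair(structure)     -> structure (fixed and re-relaxed if needed)
    validate(structure)   -> bool
    """
    accepted = []
    for _ in range(max_rounds):
        needed = target - len(accepted)
        if needed <= 0:
            break
        batch = generate_and_relax(needed)               # phase 1: generate + relax
        batch = [repair(s) for s in batch]               # phase 2: repair
        accepted += [s for s in batch if validate(s)]    # phase 3: validate
    return accepted[:target]
```

Requesting only the number of structures still missing in each round keeps the final count at the target without discarding extra validated conformers.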

With that, sampling scripts are provided for wholly disordered and partially disordered proteins below.

Single chain IDP/IDRs

We provide a commandline interface to sample single chain IDP/IDRs.

usage: sample_idp.py seq ckpt_path output_dir sample_cfg
                     [-h] [--batch BATCH] [--nconf NCONF] [--cuda] [--verbose]

positional arguments:
  seq                protein sequence
  ckpt_path          path to model weights
  output_dir         directory to output pdbs
  sample_cfg         path to a sampling configuration yaml file

optional arguments:
  --batch BATCH      batch size
  --nconf NCONF      number of conformers to sample
  --cuda             whether to use cuda or cpu
  --verbose          show or hide debugging logs
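For readers who want to wrap or script this interface, the usage text above corresponds to an argparse definition along the following lines. This is a reconstruction from the help text, not the actual source of sample_idp.py; in particular the integer types and the default values are assumptions, since the usage text states neither.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="sample_idp.py")
    parser.add_argument("seq", help="protein sequence")
    parser.add_argument("ckpt_path", help="path to model weights")
    parser.add_argument("output_dir", help="directory to output pdbs")
    parser.add_argument("sample_cfg", help="path to a sampling configuration yaml file")
    # Integer types and default values are assumptions; the usage text gives neither.
    parser.add_argument("--batch", type=int, default=1, help="batch size")
    parser.add_argument("--nconf", type=int, default=1, help="number of conformers to sample")
    parser.add_argument("--cuda", action="store_true", help="whether to use cuda or cpu")
    parser.add_argument("--verbose", action="store_true", help="show or hide debugging logs")
    return parser


args = build_parser().parse_args(
    ["GSMTPS", "weights/mdl.ckpt", "test", "configs/sample.yml",
     "--nconf", "10", "--batch", "4", "--cuda"]
)
print(args.nconf, args.batch, args.cuda, args.verbose)
```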

Example to generate 10 conformers for Sic1:

mkdir test
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
python sample_idp.py $sequence weights/mdl.ckpt test configs/sample.yml --nconf 10 --batch 4 --cuda --verbose

Inference-time experimental guidance can be activated with the potential flag in configs/sample.yml. An example PRE experimental data file is provided in data/sic1_pre_exp.txt.

This can also be run within the previously created docker image. Set the working directory to the root of the previously cloned and merged version of this repository and run the following.

mkdir test
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
docker run -it --rm --gpus all \
    -v "./test/":/app/output \
    -v "./data/":/app/data \
    -v "./weights/":/app/weights \
    -w /app \
    idpforge:latest \
    python -u /app/sample_idp.py $sequence /app/weights/mdl.ckpt /app/output /app/configs/sample.yml --nconf 10 --batch 4 --cuda --verbose

IDRs with folded domains

First, prepare the folded template with mk_ldr_template.py (shown below). We provide an example for sampling the low confidence region of AF entry P05231:

python mk_ldr_template.py data/AF-P05231-F1-model_v4.pdb 1-41 data/AF-P05231_ndr.npz

The provided model weights are not recommended for predicting multiple domains at the same time.

To generate an ensemble of IDRs with folded domains, run:

mkdir P05231_build
python sample_ldr.py weights/mdl.ckpt data/AF-P05231_ndr.npz P05231_build configs/sample.yml --nconf 10 --batch 4 --cuda --verbose

One can set attention_chunk to manage memory usage for long sequences. Note that inference on long disordered sequences may be limited by the training sequence length.

This can also be run within the previously created docker image. Set the working directory to the root of the previously cloned and merged version of this repository and run the following.

mkdir P05231_build
docker run -it --rm --gpus all \
    -v "./P05231_build/":/app/output \
    -v "./data/":/app/data \
    -v "./weights/":/app/weights \
    -w /app \
    idpforge:latest \
    python -u /app/sample_ldr.py /app/weights/mdl.ckpt /app/data/AF-P05231_ndr.npz /app/output /app/configs/sample.yml --nconf 10 --batch 4 --cuda --verbose

Chemical shifts prediction and evaluating ensembles with X-EISD (optional)

We use UCBShift for chemical shift prediction; it can be installed from https://github.com/THGLab/CSpred.git. If you wish to use X-EISD for evaluation or reweighting with experimental data, please refer to https://github.com/THGLab/X-EISDv2.

Integrated X-EISD scorer (score_ensemble.py)

score_ensemble.py scores a PDB ensemble against experimental data: it back-calculates the requested observables and reports per-property MAE and X-EISD log-likelihood (utilities live in scoring/).

# default: 30 trials of 100-conformer subsamples -> scores_trials.csv (the benchmark protocol)
python score_ensemble.py PROTEIN path/to/ensemble_dir --jc --noe --pre --fret [--force]
# --all: score every conformer in one pass -> scores_all.csv (quick test-case scoring)
python score_ensemble.py PROTEIN path/to/ensemble_dir --jc --noe --pre --fret --all

The default 30×100 run produces scores_trials.csv (the file --normalize consumes); --all writes a separate scores_all.csv. J-couplings, NOE, PRE, and smFRET need only Biopython. Experimental data is read from ../Data/exp/{protein}/, overridable via IDPFORGE_EXP_DATA.
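To make the 30×100 protocol concrete, the sketch below draws 30 random subsets of 100 conformers each from a list of file names. It is not taken from score_ensemble.py: the function name `draw_trials` and the seed are hypothetical, and sampling without replacement inside a trial is an assumption.

```python
import random


def draw_trials(conformers, n_trials=30, subsample=100, seed=0):
    """Return n_trials random subsets of `subsample` conformers each.

    Sampling is without replacement inside a trial (an assumption); the same
    conformer may still appear in several different trials.
    """
    conformers = list(conformers)
    if len(conformers) < subsample:
        raise ValueError("ensemble is smaller than the requested subsample")
    rng = random.Random(seed)
    return [rng.sample(conformers, subsample) for _ in range(n_trials)]


trials = draw_trials([f"{i}_validated.pdb" for i in range(500)])
print(len(trials), len(trials[0]))
```

Scoring each subset separately is what yields one row per trial, in contrast to the single pass over every conformer that --all performs.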

Chemical shifts (--cs) require CSpred (UCBShift). CSpred has its own dependency stack, so it must live in a separate environment; the scorer calls it once per conformer as a subprocess rather than importing it. To enable --cs:

  1. Clone and install CSpred (https://github.com/THGLab/CSpred.git) into its own conda environment, following that repository's instructions.
  2. Point the scorer at that environment's interpreter and the CSpred entry point, then run --cs:
    export CSPRED_PYTHON=/path/to/envs/cspred/bin/python   # interpreter with CSpred installed
    export CSPRED_PATH=/path/to/CSpred/CSpred.py           # default: ../Scoring/CSpred/CSpred.py
    python score_ensemble.py PROTEIN path/to/ensemble_dir --cs
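The subprocess arrangement described above can be pictured with a short sketch that resolves the two environment variables and assembles one command per conformer. It is not the scorer's actual code: `cspred_command` is a hypothetical name, and passing the PDB file as the only argument to CSpred.py is an assumption about its command line.

```python
import os


def cspred_command(pdb_path, environ=None):
    """Build the argv list for one CSpred call in the separate environment."""
    environ = os.environ if environ is None else environ
    python = environ.get("CSPRED_PYTHON")
    if not python:
        raise RuntimeError("set CSPRED_PYTHON to the interpreter that has CSpred installed")
    # Default location as documented above.
    script = environ.get("CSPRED_PATH", "../Scoring/CSpred/CSpred.py")
    # Assumed invocation: CSpred.py with the PDB file as its only argument.
    return [python, script, str(pdb_path)]


print(cspred_command("0_validated.pdb", {"CSPRED_PYTHON": "/path/to/envs/cspred/bin/python"}))
```

Each returned list could be handed to `subprocess.run(cmd, check=True)`, which keeps CSpred's dependencies out of the IDPForge environment.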

Pass --normalize to build the cross-method Eq. S11 benchmark table. Methods are the immediate subdirectories of --ens-base (layout {ens_base}/{method}/{protein}/scores_trials.csv); use --score-file to aggregate a different per-protein CSV:

python score_ensemble.py --normalize --ens-base DIR [--score-file scores_trials.csv] [--outdir DIR] [--rg-json FILE]
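Before running --normalize it can help to check which score files the documented layout actually contains. The sketch below walks {ens_base}/{method}/{protein}/ and collects the per-protein CSVs; `find_score_files` is a hypothetical helper written for this README, not a function from scoring/.

```python
from pathlib import Path


def find_score_files(ens_base, score_file="scores_trials.csv"):
    """Map (method, protein) -> path for every per-protein score file found."""
    found = {}
    for method_dir in sorted(Path(ens_base).iterdir()):
        if not method_dir.is_dir():
            continue
        for protein_dir in sorted(method_dir.iterdir()):
            if not protein_dir.is_dir():
                continue
            candidate = protein_dir / score_file
            if candidate.is_file():
                found[(method_dir.name, protein_dir.name)] = candidate
    return found
```

A (method, protein) pair missing from the result means that ensemble has not been scored yet, or was scored with --all, which writes scores_all.csv instead.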

The %|dRg|/Rg column compares each ensemble's Rg to an experimental target. Because only proteins with NMR data are X-EISD-scored, ensemble Rg is computed by the --rg mode (mass-weighted all-atom Rg, same 30×100 protocol → {method}/{protein}/rg_trials.csv), so the column also covers proteins that have an exp Rg target but no NMR data. Run --rg before --normalize, and pass the exp targets via --rg-json (JSON {"exp_rg": {protein: [mean, err]}}):

python score_ensemble.py --rg --ens-base DIR                              # writes rg_trials.csv per ensemble
python score_ensemble.py --normalize --ens-base DIR --rg-json exp_rg.json
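For reference, a mass-weighted radius of gyration and the relative deviation from a target can be computed as in the sketch below. This is a plain-Python illustration rather than the code behind --rg; normalising the deviation by the experimental value is an assumption about how %|dRg|/Rg is defined, and the protein name and numbers in the JSON are placeholders.

```python
import json
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted Rg for coords [(x, y, z), ...] and matching masses."""
    total = sum(masses)
    com = [sum(m * c[k] for c, m in zip(coords, masses)) / total for k in range(3)]
    sq = sum(m * sum((c[k] - com[k]) ** 2 for k in range(3))
             for c, m in zip(coords, masses))
    return math.sqrt(sq / total)


def percent_rg_deviation(rg_ensemble, rg_exp):
    """100 * |Rg_ens - Rg_exp| / Rg_exp (normalising by the target is an assumption)."""
    return 100.0 * abs(rg_ensemble - rg_exp) / rg_exp


# Two unit-mass atoms 2 units apart: each sits 1 unit from the centre of mass.
print(radius_of_gyration([(0, 0, 0), (2, 0, 0)], [1.0, 1.0]))

# Layout of the --rg-json file as described above, with placeholder values.
print(json.dumps({"exp_rg": {"PROTEIN": [30.0, 1.0]}}))
```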

Running the scorer in a container

The scorer is included in both images, but experimental data is not bundled: mount your Data/exp tree and point IDPFORGE_EXP_DATA at it. Scoring is CPU-only, so GPU flags are optional. J-couplings, NOE, PRE, and smFRET work out of the box; --cs additionally needs a CSpred environment, which is not in the image.

Docker:

docker run --rm -it \
    -v "./out":/app/output \
    -v "../Data/exp":/data -e IDPFORGE_EXP_DATA=/data \
    -w /app idpforge:latest \
    python -u /app/score_ensemble.py PROTEIN /app/output --jc --noe --pre --fret

Apptainer:

apptainer run \
    -B ./out:/app/output \
    -B ../Data/exp:/data --env IDPFORGE_EXP_DATA=/data \
    idpforge.sif /opt/IDPForge/score_ensemble.py PROTEIN /app/output --jc --noe --pre --fret

Replace PROTEIN with a name that has a subdirectory under Data/exp/ (pass only the flags whose experimental files exist for it), and point the ensemble argument (/app/output) at the directory of relaxed PDBs produced by sampling.
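How the experimental-data location is resolved, inside or outside a container, can be pictured with the following sketch. It mirrors the behaviour described in this README (default ../Data/exp/{protein}/, overridable via IDPFORGE_EXP_DATA) but is not the scorer's own code; `exp_data_dir` is a hypothetical name.

```python
import os
from pathlib import Path


def exp_data_dir(protein, environ=None):
    """Return the directory expected to hold experimental data for `protein`."""
    environ = os.environ if environ is None else environ
    # IDPFORGE_EXP_DATA points at the tree that holds one subdirectory per protein.
    root = Path(environ.get("IDPFORGE_EXP_DATA", "../Data/exp"))
    return root / protein


print(exp_data_dir("PROTEIN", {"IDPFORGE_EXP_DATA": "/data"}))
```

With the mounts used in the container commands above, the scorer would therefore look for its inputs under /data/PROTEIN.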

Citation

@article {Zhang2026,
	author = {Zhang, Oufan and Liu, Zi Hao and Forman-Kay, Julie Deborah and Head-Gordon, Teresa},
	title = {IDPForge: Deep Learning of Proteins with Global and Local Regions of Disorder},
	elocation-id = {2026.03.25.714313},
	year = {2026},
	doi = {10.64898/2026.03.25.714313},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2026/03/27/2026.03.25.714313},
	eprint = {https://www.biorxiv.org/content/early/2026/03/27/2026.03.25.714313.full.pdf},
	journal = {bioRxiv}
}
