A transformer-based protein language diffusion model for creating all-atom IDP ensembles and IDR ensembles that maintain their folded domains.
To get started, this repository must be cloned using the following command:
git clone https://github.com/THGLab/IDPForge.git
Following that, the working conda environment can be established in two ways.
First, navigate to the new IDPForge directory:
cd IDPForge
The base environment can be built manually via the environment.yml file in the repo. To do this, run the following command:
conda env create -f environment.yml
Note: The default environment.yml file is set to install torch==2.5.1 and cuda==12.1 for earlier GPUs (sm_60 - sm_80). If you have newer GPUs (released after Q4 2025) and run into issues with the 12.1 installation, switch to torch==2.7.1 and cuda==12.8. Although the 12.8 build nominally covers sm_60 - sm_120, it is not fully backwards compatible with older architectures, so the 12.1 default is preferred for earlier GPUs. Refer to the comments in the file for modification instructions.
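If you are unsure which build applies to your machine, the choice described in the note can be sketched in Python. This is an illustrative helper, not part of the repository: the function names are hypothetical, and the cutoff (a compute capability major version of 10 or higher selects the 12.8 build) is an assumption based on Blackwell-class devices reporting sm_100 and above.

```python
def recommended_cuda_build(capability):
    """Map a (major, minor) compute capability to a build named in this README.

    Assumption: Blackwell-class devices report a major version of 10 or
    higher and get the torch 2.7.1 / CUDA 12.8 build; everything else keeps
    the torch 2.5.1 / CUDA 12.1 default.
    """
    major, _minor = capability
    if major >= 10:
        return {"torch": "2.7.1", "cuda": "12.8"}
    return {"torch": "2.5.1", "cuda": "12.1"}


def detect_capability():
    """Return the first GPU's compute capability, or None if it cannot be read."""
    try:
        import torch  # only importable once some PyTorch build is installed
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible; keeping the default 12.1 build.")
    else:
        print(f"sm_{cap[0]}{cap[1]} ->", recommended_cuda_build(cap))
```

Detection needs an existing PyTorch install, so on a fresh machine you may prefer to look up the GPU's compute capability and call `recommended_cuda_build` directly.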
Once the environment is created, activate it.
conda activate IDPForge
Then install IDPForge as a module in the environment.
pip install -e .
This repo also requires OpenFold utilities, so that repository must be cloned in the same directory as IDPForge. To do this, first navigate to the parent directory.
cd ../
Then clone the OpenFold repository into the parent directory.
git clone https://github.com/aqlaboratory/openfold.git
Once the repository is cloned, proceed into the resources of OpenFold.
cd openfold/openfold/resources
In there, download the following file.
wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
Once this is done, navigate back into the main OpenFold directory.
cd ../../
The OpenFold setup must be replaced. To do this, first locate the two replacement setup files provided with the IDPForge repository.
ls path/to/my/IDPForge/dockerfiles/openfold_setup_12*
Note: The output should look like the following:
path/to/my/IDPForge/dockerfiles/openfold_setup_12.1.py
path/to/my/IDPForge/dockerfiles/openfold_setup_12.8.py
Then copy openfold_setup_12.1.py into the OpenFold directory as the new setup.py.
cp path/to/my/IDPForge/dockerfiles/openfold_setup_12.1.py path/to/my/openfold/setup.py
Note: If the alternative installation was chosen during the setup of the IDPForge environment, copy the openfold_setup_12.8.py version instead.
Finally, install OpenFold as a module in the environment.
pip install -e .
This makes the environment fully ready for use.
If you have issues setting up the base environment from the yml file, or if you are setting IDPForge up for use on an HPC cluster, it is recommended to follow OpenFold's own installation procedure instead. To do this, start by cloning both repositories in the same directory.
git clone https://github.com/THGLab/IDPForge.git
git clone https://github.com/aqlaboratory/openfold.git
Then navigate into the OpenFold directory.
cd openfold/
First, make a copy of the environment.yml file without flash-attn so it is not installed during environment creation.
python - <<'PY'
from pathlib import Path
src = Path("environment.yml")
dst = Path("environment_noflash.yml")
lines = src.read_text().splitlines()
lines = [ln for ln in lines if "flash-attn" not in ln]
dst.write_text("\n".join(lines) + "\n")
print("Wrote", dst)
PY
Then create the OpenFold environment from the stripped file.
mamba env create -n openfold_env -f environment_noflash.yml
Note: This can also be run with conda:
conda env create -n openfold_env -f environment_noflash.yml
Then activate the environment.
conda activate openfold_env
Install other dependencies required by IDPForge using the following commands:
conda install einops mdtraj pdb-tools -c conda-forge
conda install mmseqs2 -c bioconda
pip install tensorboard topoly
Navigate into the resources of OpenFold.
cd openfold/resources
In there, download the following file.
wget https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt
Navigate back to the main directory of OpenFold.
cd ../../
Install OpenFold as a module in the environment.
pip install . --no-build-isolation
Note: If pip install . --no-build-isolation does not work, proceed with pip install -e . instead.
Navigate to the IDPForge directory.
cd ../IDPForge
Install IDPForge as a module.
pip install . --no-build-isolation
Note: If pip install . --no-build-isolation does not work, proceed with pip install -e . instead.
This makes the environment fully ready for use.
Note: For more information on OpenFold installation, please refer to the installation guide. https://openfold.readthedocs.io/en/latest/Installation.html
Model weights, example training data, and other inference input files can be downloaded from Figshare.
It is recommended to copy the weights/ directory directly into the IDPForge repository as IDPForge/weights/. Similarly, the contents of data/ can be copied into the given IDPForge/data/ directory.
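Before sampling, it can help to confirm that the downloaded files ended up where the later examples expect them. The following is a minimal sketch: the helper `missing_inputs` is hypothetical, and the path list is taken from the example commands in this README rather than from any check the repository performs itself.

```python
from pathlib import Path

# Paths used by the example commands in this README; the check is illustrative.
EXPECTED = ("weights/mdl.ckpt", "configs/sample.yml", "data")


def missing_inputs(repo_root, expected=EXPECTED):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_inputs(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected inputs found.")
```

Run it from the root of the IDPForge repository; an empty result means the layout matches the examples.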
ESM2 utilities are refactored into this repo for network modules and exploring the effects of ESM embedding on IDP modeling. Alternatively, ESM can be installed from its GitHub repository https://github.com/facebookresearch/esm.git, or via pip with pip install fair-esm.
Optional: pip install flash-attn==2.3 to speed up attention calculation.
IDPForge can also be built as a Docker container using either of the included dockerfiles (Blackwell or Ampere). Blackwell runs on CUDA 12.8 and Ampere runs on CUDA 12.1. Optionally, the weights and data files from Figshare may be merged into the repository before the image is created. The image then contains the merged files, removing the need to mount weights/ and data/ at runtime.
To build the image, run the following command from the root of this repository choosing either Blackwell or Ampere based on preference:
docker build -f dockerfiles/Dockerfile_[Blackwell/Ampere] -t idpforge:latest .
To confirm that the idpforge:latest image was built successfully, run
docker images
To run a container from the newly created image, run
docker run --rm -it --gpus all idpforge:latest
To verify that your Docker installation is able to properly communicate with your GPU, run
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
Once the image is created, outside directories can be added into a container by mounting them as follows.
# Optional: add one more -v line for each additional mount
docker run --rm -it --gpus all \
-v "[path-to-directory]":/app/[directory-name-in-container] \
idpforge:latest
Examples of this are given in later sections.
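When several directories need mounting, assembling the command programmatically avoids quoting and line-continuation mistakes. The sketch below builds the argument list for the mount pattern shown above; `docker_run_args` is a hypothetical helper, and only the Docker flags already used in this README (`--rm`, `-it`, `--gpus`, `-v`, `-w`) appear in it.

```python
import shlex


def docker_run_args(image, command, mounts=None, gpus="all", workdir=None):
    """Build an argument list for `docker run` with bind mounts.

    mounts maps host paths to container paths (both plain strings).
    """
    args = ["docker", "run", "--rm", "-it"]
    if gpus:
        args += ["--gpus", gpus]
    for host, container in (mounts or {}).items():
        args += ["-v", f"{host}:{container}"]
    if workdir:
        args += ["-w", workdir]
    return args + [image] + list(command)


if __name__ == "__main__":
    argv = docker_run_args(
        "idpforge:latest",
        ["python", "-u", "/app/sample_idp.py", "--help"],
        mounts={"./weights": "/app/weights", "./test": "/app/output"},
        workdir="/app",
    )
    # Print the command for inspection; subprocess.run(argv) would execute it.
    print(shlex.join(argv))
```

Keeping the command as a list rather than a single string means no shell quoting is needed when it is passed to `subprocess.run`.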
The same two recipes are provided as Apptainer (formerly Singularity) definition files,
dockerfiles/idpforge_ampere.def and dockerfiles/idpforge_blackwell.def. Build a .sif
image from the root of this repository:
apptainer build idpforge.sif dockerfiles/idpforge_ampere.def
# or idpforge_blackwell.def
As with Docker, weights and data files from Figshare may be merged into the repository before building so they are baked into the image; otherwise mount them at runtime.
Running differs from Docker in a few ways: bind mounts use -B src:dst (not -v), the GPU flag is
--nv (not --gpus all), environment variables use --env (not -e), and the application lives at
/opt/IDPForge (not /app). The image filesystem is read-only, so outputs must go to a bound,
writable directory. The %runscript is python "$@", so everything after the image name is passed
straight to Python.
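The flag differences listed above can be summarised as a small translation helper. This is a sketch only: `apptainer_run_args` is a hypothetical name, and it covers just the three options named in the paragraph (`-B`, `--nv`, `--env`).

```python
import shlex


def apptainer_run_args(image, script_args, binds=None, env=None, gpu=True):
    """Build an argument list for `apptainer run` from Docker-style inputs."""
    args = ["apptainer", "run"]
    if gpu:
        args.append("--nv")                  # Docker equivalent: --gpus all
    for src, dst in (binds or {}).items():
        args += ["-B", f"{src}:{dst}"]       # Docker equivalent: -v src:dst
    for key, value in (env or {}).items():
        args += ["--env", f"{key}={value}"]  # Docker equivalent: -e KEY=VALUE
    # The runscript is `python "$@"`, so script_args begin with the script path.
    return args + [image] + list(script_args)


if __name__ == "__main__":
    argv = apptainer_run_args(
        "idpforge.sif",
        ["/opt/IDPForge/sample_idp.py", "--help"],
        binds={"./weights": "/opt/IDPForge/weights"},
    )
    print(shlex.join(argv))
```

Because the image filesystem is read-only, any output directory passed in `script_args` should also appear as a destination in `binds`.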
Example (single-chain IDP, Sic1), run from the repository root:
mkdir -p out
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
apptainer run --nv \
-B ./weights:/opt/IDPForge/weights \
-B ./out:/app/out \
idpforge.sif /opt/IDPForge/sample_idp.py "$sequence" \
/opt/IDPForge/weights/mdl.ckpt /app/out /opt/IDPForge/configs/sample.yml \
--nconf 10 --batch 4 --cuda --verbose
For IDRs with folded domains, swap in sample_ldr.py with its CKPT NPZ OUTDIR CFG arguments and
mount the directory holding the .npz template (e.g. -B ./data:/opt/IDPForge/data).
We use pytorch-lightning for training and one can customize training via the documented flags under trainer in the config file.
conda activate IDPForge
python train.py --model_config_path configs/train.yml
Sampling loops through three phases until the target is met:
- Generate + Relax: Calls sample_ldr.py to produce diffusion conformers, which are immediately relaxed via AMBER minimization (relax config loaded from configs/sample.yml).
- Repair: Checks each relaxed structure for D-amino acids (chirality) and broken HIS ring bonds. Applies fixes and re-relaxes if any repairs were made.
- Validate: Runs unified validation checking chirality, bond integrity, clash score (adaptive smart threshold), and backbone topology (knot detection). Passing structures are renamed to N_validated.pdb.
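The control flow of the three phases can be sketched as follows. This is not the repository's implementation: the three callables are hypothetical stand-ins for the generate, repair, and validate steps, and the `max_rounds` cap is an added safeguard so the loop ends even if too few structures pass.

```python
import itertools


def sample_until_target(target, generate_and_relax, repair, validate, max_rounds=50):
    """Loop generate+relax -> repair -> validate until `target` structures pass.

    The callables are hypothetical stand-ins:
      generate_and_relax(n) -> list of relaxed structures
      repair(structure)     -> (structure, was_repaired); re-relaxes if needed
      validate(structure)   -> True if every check passes
    """
    validated = []
    for _ in range(max_rounds):
        if len(validated) >= target:
            break
        # Only request as many conformers as are still missing.
        batch = generate_and_relax(target - len(validated))
        for structure in batch:
            structure, _was_repaired = repair(structure)
            if validate(structure):
                validated.append(structure)
    return validated[:target]


if __name__ == "__main__":
    # Toy run: "structures" are integers and only even ones validate.
    counter = itertools.count()
    kept = sample_until_target(
        target=3,
        generate_and_relax=lambda n: [next(counter) for _ in range(n)],
        repair=lambda s: (s, False),
        validate=lambda s: s % 2 == 0,
    )
    print(kept)
```

Structures that fail validation are simply dropped here, so the number of conformers generated is usually larger than the number kept.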
With that, sampling scripts are provided for wholly disordered and partially disordered proteins below.
We provide a commandline interface to sample single chain IDP/IDRs.
usage: sample_idp.py seq ckpt_path output_dir sample_cfg
[-h] [--batch BATCH] [--nconf NCONF] [--cuda]
[--verbose]
positional arguments:
seq protein sequence
ckpt_path path to model weights
output_dir directory to output pdbs
sample_cfg path to a sampling configuration
yaml file
optional arguments:
--batch BATCH batch size
--nconf NCONF number of conformers to sample
--cuda whether to use cuda or cpu
--verbose show or hide debugging logs
Example to generate 10 conformers for Sic1:
mkdir test
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
python sample_idp.py "$sequence" weights/mdl.ckpt test configs/sample.yml --nconf 10 --batch 4 --cuda --verbose
Inference-time experimental guidance can be activated by the potential flag in configs/sample.yml. An example PRE experimental data file is also provided in data/sic1_pre_exp.txt.
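For scripted runs over many sequences, the same call can be assembled in Python. The sketch below follows the usage text above; `sample_idp_args` is a hypothetical helper, and rejecting anything outside the 20 standard one-letter residue codes is an assumption, not documented behaviour of sample_idp.py.

```python
import sys

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def sample_idp_args(seq, ckpt, outdir, cfg, nconf=None, batch=None,
                    cuda=False, verbose=False):
    """Build the argument list for sample_idp.py from its documented usage."""
    unknown = sorted(set(seq) - STANDARD_RESIDUES)
    if unknown:
        # Assumption: only the 20 standard one-letter codes are accepted.
        raise ValueError(f"unexpected residue codes: {''.join(unknown)}")
    args = [sys.executable, "sample_idp.py", seq, ckpt, outdir, cfg]
    if nconf is not None:
        args += ["--nconf", str(nconf)]
    if batch is not None:
        args += ["--batch", str(batch)]
    if cuda:
        args.append("--cuda")
    if verbose:
        args.append("--verbose")
    return args


if __name__ == "__main__":
    argv = sample_idp_args("GSMTPSTPPRSRGTRYLAQPSG", "weights/mdl.ckpt", "test",
                           "configs/sample.yml", nconf=10, batch=4, cuda=True)
    print(argv)  # subprocess.run(argv, check=True) would launch the sampler
```

Checking the sequence up front surfaces typos before any model weights are loaded.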
This can also be run within the previously created docker image. Set the working directory to the root of the previously cloned and merged version of this repository and run the following.
mkdir test
sequence="GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT"
docker run -it --rm --gpus all \
-v "./test/":/app/output \
-v "./data/":/app/data \
-v "./weights/":/app/weights \
-w /app \
idpforge:latest \
python -u /app/sample_idp.py "$sequence" /app/weights/mdl.ckpt /app/output /app/configs/sample.yml --nconf 10 --batch 4 --cuda --verbose
First, prepare the folded template with mk_ldr_template.py (shown below). We provide an example for sampling the low confidence region of AF entry P05231:
python mk_ldr_template.py data/AF-P05231-F1-model_v4.pdb 1-41 data/AF-P05231_ndr.npz
The provided model weights are not recommended for predicting multiple domains at the same time.
To generate an ensemble of IDRs with folded domains, run:
mkdir P05231_build
python sample_ldr.py weights/mdl.ckpt data/AF-P05231_ndr.npz P05231_build configs/sample.yml --nconf 10 --batch 4 --cuda --verbose
One can set attention_chunk to manage memory usage for long sequences (inference on long disordered sequences may be limited by the training sequence length).
This can also be run within the previously created docker image. Set the working directory to the root of the previously cloned and merged version of this repository and run the following.
mkdir P05231_build
docker run -it --rm --gpus all \
-v "./P05231_build/":/app/output \
-v "./data/":/app/data \
-v "./weights/":/app/weights \
-w /app \
idpforge:latest \
python -u /app/sample_ldr.py /app/weights/mdl.ckpt /app/data/AF-P05231_ndr.npz /app/output /app/configs/sample.yml --nconf 10 --batch 4 --cuda --verbose
We use UCBShift for chemical shift prediction, which can be installed from https://github.com/THGLab/CSpred.git. If you wish to use X-EISD for evaluation or reweighting with experimental data, please refer to https://github.com/THGLab/X-EISDv2.
score_ensemble.py scores a PDB ensemble against experimental data: it back-calculates the requested
observables and reports per-property MAE and X-EISD log-likelihood (utilities live in scoring/).
# default: 30 trials of 100-conformer subsamples -> scores_trials.csv (the benchmark protocol)
python score_ensemble.py PROTEIN path/to/ensemble_dir --jc --noe --pre --fret [--force]
# --all: score every conformer in one pass -> scores_all.csv (quick test-case scoring)
python score_ensemble.py PROTEIN path/to/ensemble_dir --jc --noe --pre --fret --all
The default 30×100 run produces scores_trials.csv (the file --normalize consumes); --all
writes a separate scores_all.csv. J-couplings, NOE, PRE, and smFRET need only Biopython.
Experimental data is read from ../Data/exp/{protein}/, overridable via IDPFORGE_EXP_DATA.
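The 30×100 protocol can be illustrated with a short, dependency-free sketch. It is not the scorer's code: `trial_maes` is a hypothetical function, conformers are assumed to be drawn without replacement, and observables are combined with a plain unweighted mean, which the real scorer may not do for every data type.

```python
import random
from statistics import mean


def trial_maes(per_conformer, experimental, n_trials=30, n_sub=100, seed=0):
    """MAE between ensemble-averaged and experimental values for each trial.

    per_conformer: one list of back-calculated observables per conformer.
    experimental:  the matching experimental values.
    Assumptions: sampling without replacement, plain unweighted averaging.
    """
    rng = random.Random(seed)
    size = min(n_sub, len(per_conformer))
    maes = []
    for _ in range(n_trials):
        subset = rng.sample(per_conformer, size)
        # Average each observable over the conformers in this subsample.
        averaged = [mean(values) for values in zip(*subset)]
        maes.append(mean(abs(a - e) for a, e in zip(averaged, experimental)))
    return maes


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy ensemble: 500 conformers, 3 observables each.
    ensemble = [[rng.gauss(5.0, 1.0) for _ in range(3)] for _ in range(500)]
    scores = trial_maes(ensemble, [5.0, 5.0, 5.0])
    print(len(scores), "trials, mean MAE", round(mean(scores), 3))
```

Repeating the subsampling gives a spread of scores per ensemble, which is what makes trial-based results comparable across methods that produce ensembles of different sizes.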
Chemical shifts (--cs) require CSpred (UCBShift). CSpred has its own dependency stack, so it
must live in a separate environment; the scorer calls it once per conformer as a subprocess rather
than importing it. To enable --cs:
- Clone and install CSpred (https://github.com/THGLab/CSpred.git) into its own conda environment, following that repository's instructions.
- Point the scorer at that environment's interpreter and the CSpred entry point, then run with --cs:
export CSPRED_PYTHON=/path/to/envs/cspred/bin/python   # interpreter with CSpred installed
export CSPRED_PATH=/path/to/CSpred/CSpred.py           # default: ../Scoring/CSpred/CSpred.py
python score_ensemble.py PROTEIN path/to/ensemble_dir --cs
Pass --normalize to build the cross-method Eq. S11 benchmark table. Methods are the immediate
subdirectories of --ens-base (layout {ens_base}/{method}/{protein}/scores_trials.csv); use
--score-file to aggregate a different per-protein CSV:
python score_ensemble.py --normalize --ens-base DIR [--score-file scores_trials.csv] [--outdir DIR] [--rg-json FILE]
The %|dRg|/Rg column compares each ensemble's Rg to an experimental target. Because only proteins
with NMR data are X-EISD-scored, ensemble Rg is computed by the --rg mode (mass-weighted all-atom Rg,
same 30×100 protocol → {method}/{protein}/rg_trials.csv), so the column also covers proteins that
have an exp Rg target but no NMR data. Run --rg before --normalize, and pass the exp targets via
--rg-json (JSON {"exp_rg": {protein: [mean, err]}}):
python score_ensemble.py --rg --ens-base DIR # writes rg_trials.csv per ensemble
python score_ensemble.py --normalize --ens-base DIR --rg-json exp_rg.json
The scorer is included in both images, but experimental data is not bundled; mount your
Data/exp tree and point IDPFORGE_EXP_DATA at it. Scoring is CPU-only, so GPU flags are optional.
J-couplings, NOE, PRE, and smFRET work out of the box; --cs additionally needs a CSpred
environment, which is not in the image.
Docker:
docker run --rm -it \
-v "./out":/app/output \
-v "../Data/exp":/data -e IDPFORGE_EXP_DATA=/data \
-w /app idpforge:latest \
python -u /app/score_ensemble.py PROTEIN /app/output --jc --noe --pre --fret
Apptainer:
apptainer run \
-B ./out:/app/output \
-B ../Data/exp:/data --env IDPFORGE_EXP_DATA=/data \
idpforge.sif /opt/IDPForge/score_ensemble.py PROTEIN /app/output --jc --noe --pre --fret
Replace PROTEIN with a name that has a subdirectory under Data/exp/ (pass only the flags whose
experimental files exist for it), and point the ensemble argument (/app/output) at the directory
of relaxed PDBs produced by sampling.
@article {Zhang2026,
author = {Zhang, Oufan and Liu, Zi Hao and Forman-Kay, Julie Deborah and Head-Gordon, Teresa},
title = {IDPForge: Deep Learning of Proteins with Global and Local Regions of Disorder},
elocation-id = {2026.03.25.714313},
year = {2026},
doi = {10.64898/2026.03.25.714313},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2026/03/27/2026.03.25.714313},
eprint = {https://www.biorxiv.org/content/early/2026/03/27/2026.03.25.714313.full.pdf},
journal = {bioRxiv}
}