GatorAffinity

GatorAffinity is a geometric deep learning model for protein–ligand binding affinity prediction. It is pre-trained on more than 1.45 million synthetic protein–ligand complexes: over 450,000 complexes with K_d/K_i values from the jointly released GatorAffinity-DB, and over 1 million IC₅₀-annotated complexes from the SAIR dataset [1]. The model is then fine-tuned on experimental structures from PDBbind for accurate and generalizable affinity prediction. For further details, please refer to the GatorAffinity paper.

Synthetic Dataset at Scale

450K+ Kd/Ki complexes with structures predicted by Boltz-1 [4]
1M+ IC50 complexes from the SAIR database [1]
Total: over 1.45M synthetic protein-ligand pairs for pre-training

Installation

Environment:

git clone https://github.com/AIDD-LiLab/GatorAffinity.git
cd GatorAffinity
bash environment.sh

Data Download

Original Structural Data

Preprocessed Data

Synthetic Kd+Ki+IC50 data for GatorAffinity pre-training
Filtered LP-PDBbind for fine-tuning: ./LP-PDBbind

Model Checkpoints

Pre-trained Models

Base model: Pre-trained on IC50+Kd+Ki datasets
./model_checkpoints/Kd+Ki+IC50_pretrain.ckpt
Fine-tuned model (best performance): Pre-trained on IC50+Kd+Ki, fine-tuned on experimental structures with LP-PDBbind split
./model_checkpoints/Kd+Ki+IC50_experimental_fine_tuning.ckpt

ATOMICA Backbone

ATOMICA: a universal atomic-scale molecular interaction representation model [3], used as GatorAffinity's backbone.
Download ATOMICA Checkpoints

Note: Our experiments show that the ATOMICA backbone significantly improves performance when few pre-training structures are available, though the benefit diminishes as the amount of synthetic training data increases.

Usage

Training

python train.py \
    --train_set_path LP-PDBbind/train.pkl \
    --valid_set_path LP-PDBbind/valid.pkl \
    --pretrain_ckpt model_checkpoints/Kd+Ki+IC50_pretrain.ckpt

Inference

python inference.py \
    --model_ckpt model_checkpoints/Kd+Ki+IC50_experimental_fine_tuning.ckpt \
    --test_set_path LP-PDBbind/test.pkl
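
Once predictions are available, a common next step is to compare them with the experimental labels. The sketch below computes RMSE and the Pearson correlation in plain Python. It assumes the labels and predictions have already been collected into two lists; how inference.py reports its predictions is not described here, so that loading step is left out. The function names are hypothetical and the numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length and at least 2 items")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


# Made-up values for illustration only; these are not GatorAffinity results.
labels = [5.92, 4.85, 7.10, 6.30]
predictions = [5.50, 5.10, 6.80, 6.55]
print(f"RMSE      = {rmse(labels, predictions):.3f}")
print(f"Pearson r = {pearson_r(labels, predictions):.3f}")
```

Both functions expect values on the same scale as the `label` column (pKd/pKi).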

Custom Data Processing

Note: the ligand you provide should not contain hydrogen atoms. A script for removing hydrogens is available at data/remove_h.py.
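
For readers who want to see what hydrogen removal involves, here is a minimal sketch that filters hydrogen records out of PDB-formatted lines. It is not the repository's data/remove_h.py, whose implementation is not shown here; the function names are hypothetical. It assumes standard fixed-column PDB records, with the element symbol in columns 77-78 and the atom name in columns 13-16.

```python
def is_hydrogen(pdb_line):
    """Return True if an ATOM/HETATM record describes hydrogen or deuterium."""
    element = pdb_line[76:78].strip().upper()
    if element:
        return element in {"H", "D"}
    # Fallback when the element column is blank: use the atom name, ignoring
    # leading digits (e.g. "1HB"). This heuristic can misclassify elements
    # whose symbol starts with H, such as Hg, if the element column is missing.
    name = pdb_line[12:16].strip().upper().lstrip("0123456789")
    return name.startswith("H")


def strip_hydrogens(pdb_lines):
    """Return the lines with hydrogen ATOM/HETATM records removed.

    All other records are kept unchanged. CONECT records that refer to
    removed atoms are NOT rewritten in this sketch.
    """
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and is_hydrogen(line):
            continue
        kept.append(line)
    return kept
```

To apply it to a file, read the file with `readlines()`, pass the result to `strip_hydrogens`, and write the returned lines to a new file.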

GatorAffinity supports processing your own PDB data for training and inference.

Example Data

We provide example data in data/example/ to help you get started:

1a4h_pocket_5A.pdb, 1a4h_ligand.pdb: Example protein pocket and ligand structure
1bux_pocket_5A.pdb, 1bux_ligand.pdb: Example protein pocket and ligand structure
example.csv: Example data index file
example.pkl: Pre-processed example data

Data Format

Create a CSV file with the following columns:

Column	Description	Example
`pdb_id`	PDB identifier	`1a4h`
`protein_pdb`	Path to protein pocket PDB file	`data/example/1a4h_pocket_5A.pdb`
`ligand_pdb`	Path to ligand PDB file	`data/example/1a4h_ligand.pdb`
`protein_chains`	Protein chain(s) for pocket	`A` or `A_B` for multiple chains
`lig_code`	Ligand residue name	`UNL`, `LIG`, `ATP`
`smiles`	Ligand SMILES string	`CCO`, `c1ccccc1`
`lig_resi`	Ligand residue number	`1`, `100`
`label`	Binding affinity label (pKd/pKi)	`5.92`, `4.85`
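
The index file can also be generated programmatically. The sketch below writes the eight columns listed above with Python's `csv` module and includes a helper for the standard conversion pKd = -log10(Kd in mol/L). It assumes process_pdbs.py reads a comma-separated file with a header row; the helper names are hypothetical. The ligand code, SMILES, residue number, and label in the example row are placeholders taken from the Example column, not the actual values for 1a4h.

```python
import csv
import io
import math

COLUMNS = ["pdb_id", "protein_pdb", "ligand_pdb", "protein_chains",
           "lig_code", "smiles", "lig_resi", "label"]


def to_p_affinity(value_molar):
    """Convert a Kd or Ki given in mol/L to pKd/pKi = -log10(value)."""
    if value_molar <= 0:
        raise ValueError("affinity must be a positive concentration in mol/L")
    return -math.log10(value_molar)


def write_index(rows, handle):
    """Write row dictionaries to an open text handle as a CSV index file."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        missing = [col for col in COLUMNS if col not in row]
        if missing:
            raise ValueError(f"row {row.get('pdb_id')!r} is missing {missing}")
        writer.writerow(row)


# Placeholder row for illustration; replace every value with your own data.
example_row = {
    "pdb_id": "1a4h",
    "protein_pdb": "data/example/1a4h_pocket_5A.pdb",
    "ligand_pdb": "data/example/1a4h_ligand.pdb",
    "protein_chains": "A",
    "lig_code": "LIG",
    "smiles": "CCO",
    "lig_resi": 1,
    "label": round(to_p_affinity(1.2e-6), 2),  # a 1.2 uM Kd, as an example
}

buffer = io.StringIO()
write_index([example_row], buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a handle opened with `open("your_data.csv", "w", newline="")`.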

Processing Your Data

python data/process_pdbs.py \
    --data_index_file your_data.csv \
    --out_path processed_data.pkl

Example with Provided Data

python data/process_pdbs.py \
    --data_index_file data/example/example.csv \
    --out_path data/example/example.pkl
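
After processing, it can be useful to check what the output file contains. The layout of the pickle written by process_pdbs.py is not documented here, so the sketch below makes no assumption about it and only reports generic properties of the loaded object. The helper names are hypothetical. Unpickling may require running from the repository root so that any project classes can be imported, and pickle files should only be loaded from trusted sources.

```python
import pickle


def load_pickle(path):
    """Load a pickle file from disk (only use with files you trust)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Return a small dictionary describing a loaded object generically."""
    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, (list, tuple)) and len(obj) > 0:
        first = obj[0]
        summary["first_item_type"] = type(first).__name__
        if isinstance(first, dict):
            summary["first_item_keys"] = sorted(str(key) for key in first)
    return summary


# In-memory round trip with a stand-in object, so the sketch runs anywhere.
stand_in = pickle.loads(pickle.dumps([{"id": "demo", "label": 5.92}]))
print(summarize(stand_in))
```

For the provided example, `summarize(load_pickle("data/example/example.pkl"))` would describe the pre-processed example data.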

Performance

GatorAffinity achieves state-of-the-art performance on filtered LP-PDBbind [2].

License

This repository is licensed under two different licenses:

Main Repository - MIT License

The source code, documentation, and most files are licensed under the MIT License.

Model Checkpoints - CC BY-NC-SA 4.0

The model checkpoints in the ./model_checkpoints/ directory are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Model checkpoints:

Kd+Ki+IC50_pretrain.ckpt
Kd+Ki+IC50_experimental_fine_tuning.ckpt

Other Data

For the license of other data, please refer to the specific license file provided by the repository.

References

[1] Lemos, P., Beckwith, Z., Bandi, S., Van Damme, M., Crivelli-Decker, J., Shields, B.J., Merth, T., Jha, P.K., De Mitri, N., Callahan, T.J., et al. (2025). SAIR: Enabling deep learning for protein-ligand interactions with a synthetic structural dataset. bioRxiv.

[2] Wang, Y., Sun, K., Li, J., Guan, X., Zhang, O., Bagni, D., Zhang, Y., Carlson, H.A., Head-Gordon, T. (2025). A workflow to create a high-quality protein–ligand binding dataset for training, validation, and prediction tasks. Digital Discovery, 4(5), 1209-1220.

[3] Fang, A., Zhang, Z., Zhou, A., Zitnik, M. (2025). ATOMICA: Learning Universal Representations of Intermolecular Interactions. bioRxiv.

[4] Wohlwend, J., Corso, G., Passaro, S., Reveiz, M., Leidal, K., Swiderski, W., Portnoi, T., Chinn, I., Silterra, J., Jaakkola, T., et al. (2024). Boltz-1: Democratizing biomolecular interaction modeling. bioRxiv.

Acknowledgments

This work builds upon the ATOMICA framework. We thank the ATOMICA authors for making their codebase available. We also thank the SAIR authors for making their dataset accessible to the research community. This code repository was primarily developed by Jinhang Wei.

Citation

If you use the code or data in this package, please cite:

@article{wei2025gatoraffinity,
  title={GatorAffinity: Boosting Protein-Ligand Binding Affinity Prediction with Large-Scale Synthetic Structural Data},
  author={Wei, Jinhang and Zhang, Yupu and Ramdhan, Peter A and Huang, Zihang and Seabra, Gustavo and Jiang, Zhe and Li, Chenglong and Li, Yanjun},
  journal={bioRxiv},
  pages={2025--09},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
