GatorAffinity is a geometric deep learning model for protein–ligand binding affinity prediction. It leverages large-scale synthetic structural data, including over 1.45 million protein–ligand complexes sourced from the jointly released GatorAffinity-DB (over 450,000 complexes with Kd/Ki values) and the SAIR dataset (over 1 million IC50-annotated complexes). The model is pre-trained on these synthetic complexes and subsequently fine-tuned using experimental structures from PDBbind, enabling accurate and generalizable affinity prediction. For further details, please refer to the GatorAffinity paper.
- 450K+ Kd/Ki complexes generated using Boltz-1 [4] structure prediction
- 1M+ IC50 complexes from SAIR database [1]
- Total: roughly 1.5M (over 1.45 million) synthetic protein-ligand pairs for pre-training
git clone https://github.com/AIDD-LiLab/GatorAffinity.git
cd GatorAffinity
bash environment.sh

- Synthetic Kd+Ki+IC50 data for GatorAffinity pre-training
- Filtered LP-PDBbind for fine-tuning: ./LP-PDBbind
- Base model, pre-trained on the IC50+Kd+Ki datasets: ./model_checkpoints/Kd+Ki+IC50_pretrain.ckpt
- Fine-tuned model (best performance), pre-trained on IC50+Kd+Ki and fine-tuned on experimental structures with the LP-PDBbind split: ./model_checkpoints/Kd+Ki+IC50_experimental_fine_tuning.ckpt
ATOMICA [3] is a universal atomic-scale molecular interaction representation model, used as GatorAffinity's backbone.
Download ATOMICA Checkpoints
Note: Our experiments show that the ATOMICA backbone significantly improves performance when pre-training structures are limited, though the benefit diminishes as the amount of synthetic training data increases.
python train.py \
--train_set_path LP-PDBbind/train.pkl \
--valid_set_path LP-PDBbind/valid.pkl \
--pretrain_ckpt model_checkpoints/Kd+Ki+IC50_pretrain.ckpt

python inference.py \
--model_ckpt model_checkpoints/Kd+Ki+IC50_experimental_fine_tuning.ckpt \
--test_set_path LP-PDBbind/test.pkl

(Please note that the ligand you provide should not contain hydrogen atoms. A script for removing hydrogens is available at data/remove_h.py.)
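To illustrate what hydrogen removal on a ligand PDB file involves, here is a minimal sketch. It is not the repository's data/remove_h.py, whose behavior is not described here; the function names `strip_hydrogens` and `strip_hydrogens_file` are hypothetical. It assumes standard fixed-column PDB records, with the element symbol in columns 77-78 and the atom name in columns 13-16.

```python
from pathlib import Path


def strip_hydrogens(pdb_text: str) -> str:
    """Return PDB text with hydrogen (and deuterium) atom records removed.

    Illustrative only: atom serial numbers are not renumbered and CONECT
    records that mention removed atoms are left untouched.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Element symbol lives in columns 77-78 (0-based slice 76:78).
            element = line[76:78].strip().upper()
            if not element:
                # Fallback when the element column is missing: guess from the
                # atom name, ignoring leading digits (e.g. "1HB" -> "HB").
                # Caveat: this guess would also drop names such as "HG".
                name = line[12:16].strip().lstrip("0123456789")
                element = name[:1].upper()
            if element in ("H", "D"):
                continue
        kept.append(line)
    return "\n".join(kept) + "\n"


def strip_hydrogens_file(src: str, dst: str) -> None:
    """Read a PDB file, drop hydrogens, and write the result to dst."""
    Path(dst).write_text(strip_hydrogens(Path(src).read_text()))
```

For real inputs, the script shipped with the repository is the safer choice, since it presumably matches what the preprocessing pipeline expects.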
GatorAffinity supports processing your own PDB data for training and inference.
We provide example data in data/example/ to help you get started:
- 1a4h_pocket_5A.pdb, 1a4h_ligand.pdb: Example protein pocket and ligand structure
- 1bux_pocket_5A.pdb, 1bux_ligand.pdb: Example protein pocket and ligand structure
- example.csv: Example data index file
- example.pkl: Pre-processed example data
Create a CSV file with the following columns:
| Column | Description | Example |
|---|---|---|
| `pdb_id` | PDB identifier | `1a4h` |
| `protein_pdb` | Path to protein pocket PDB file | `data/example/1a4h_pocket_5A.pdb` |
| `ligand_pdb` | Path to ligand PDB file | `data/example/1a4h_ligand.pdb` |
| `protein_chains` | Protein chain(s) for pocket | `A`, or `A_B` for multiple chains |
| `lig_code` | Ligand residue name | `UNL`, `LIG`, `ATP` |
| `smiles` | Ligand SMILES string | `CCO`, `c1ccccc1` |
| `lig_resi` | Ligand residue number | `1`, `100` |
| `label` | Binding affinity label (pKd/pKi) | `5.92`, `4.85` |
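As a rough sketch of how such an index file could be assembled programmatically, the snippet below writes the columns listed above with Python's standard `csv` module. The helper names (`write_index`, `kd_nm_to_pkd`) and the row values are hypothetical placeholders, not real annotations for 1a4h. It is also an assumption that data/process_pdbs.py accepts the columns in this order, and that labels follow the usual definition pKd = -log10(Kd in mol/L).

```python
import csv
import math

# Column names taken from the table above; the order is an assumption.
COLUMNS = [
    "pdb_id", "protein_pdb", "ligand_pdb", "protein_chains",
    "lig_code", "smiles", "lig_resi", "label",
]


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a Kd (or Ki) in nanomolar to pKd: -log10(Kd in mol/L)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return 9.0 - math.log10(kd_nm)


def write_index(rows, out_path):
    """Write the index CSV, refusing rows with missing or unknown columns."""
    with open(out_path, "w", newline="") as handle:
        # DictWriter raises ValueError on keys outside fieldnames by default.
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:
            missing = [col for col in COLUMNS if col not in row]
            if missing:
                raise ValueError(f"row {row.get('pdb_id')!r} lacks {missing}")
            writer.writerow(row)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example_row = {
        "pdb_id": "1a4h",
        "protein_pdb": "data/example/1a4h_pocket_5A.pdb",
        "ligand_pdb": "data/example/1a4h_ligand.pdb",
        "protein_chains": "A",
        "lig_code": "LIG",
        "smiles": "CCO",
        "lig_resi": 1,
        "label": round(kd_nm_to_pkd(1200.0), 2),
    }
    write_index([example_row], "your_data.csv")
```

Checking for missing columns before writing keeps malformed rows out of the index, so problems surface here instead of during preprocessing.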
python data/process_pdbs.py \
--data_index_file your_data.csv \
--out_path processed_data.pkl

python data/process_pdbs.py \
--data_index_file data/example/example.csv \
--out_path data/example/example.pkl

State-of-the-art on filtered LP-PDBbind [2]:
This repository is licensed under two different licenses:
The source code, documentation, and most files are licensed under the MIT License.
The model checkpoints in the ./model_checkpoints/ directory are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Model checkpoints:
- Kd+Ki+IC50_pretrain.ckpt
- Kd+Ki+IC50_experimental_fine_tuning.ckpt
For the license of other data, please refer to the specific license file provided by the repository.
[1] Lemos, P., Beckwith, Z., Bandi, S., Van Damme, M., Crivelli-Decker, J., Shields, B.J., Merth, T., Jha, P.K., De Mitri, N., Callahan, T.J., et al. (2025). SAIR: Enabling deep learning for protein-ligand interactions with a synthetic structural dataset. bioRxiv.
[2] Wang, Y., Sun, K., Li, J., Guan, X., Zhang, O., Bagni, D., Zhang, Y., Carlson, H.A., Head-Gordon, T. (2025). A workflow to create a high-quality protein–ligand binding dataset for training, validation, and prediction tasks. Digital Discovery, 4(5), 1209-1220.
[3] Fang, A., Zhang, Z., Zhou, A., and Zitnik, M. (2025). ATOMICA: Learning Universal Representations of Intermolecular Interactions. bioRxiv.
[4] Wohlwend, J., Corso, G., Passaro, S., Reveiz, M., Leidal, K., Swiderski, W., Portnoi, T., Chinn, I., Silterra, J., Jaakkola, T., et al. (2024). Boltz-1: Democratizing biomolecular interaction modeling. bioRxiv.
This work builds upon the ATOMICA framework. We thank the ATOMICA authors for making their codebase available. We also thank the SAIR authors for making their dataset accessible to the research community. This code repository was primarily developed by Jinhang Wei.
If you use the code or data in this package, please cite:
@article{wei2025gatoraffinity,
title={GatorAffinity: Boosting Protein-Ligand Binding Affinity Prediction with Large-Scale Synthetic Structural Data},
author={Wei, Jinhang and Zhang, Yupu and Ramdhan, Peter A and Huang, Zihang and Seabra, Gustavo and Jiang, Zhe and Li, Chenglong and Li, Yanjun},
journal={bioRxiv},
pages={2025--09},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}


