PyTorch implementation for "Integrating Diffusion Models and Molecular Modeling for PARP1 Inhibitors Generation", submitted to the Journal of Biomolecular Structure & Dynamics. This repository combines the DiGress diffusion model for molecule generation with a GNN-based predictor for pIC50 estimation.
This code was tested with PyTorch 2.0.1, CUDA 11.8, and torch_geometric 2.3.1.
```bash
# Download anaconda/miniconda if needed

# Create a rdkit environment that directly contains rdkit
conda create -c conda-forge -n digress rdkit=2023.03.2 python=3.9

# Activate the environment
conda activate digress

# Check that RDKit is installed correctly
python -c 'from rdkit import Chem'

# Install graph-tool
conda install -c conda-forge graph-tool=2.45

# Check that graph-tool is installed correctly
python -c 'import graph_tool as gt'

# Install the nvcc drivers for your CUDA version
conda install -c "nvidia/label/cuda-11.8.0" cuda

# Install a compatible version of PyTorch
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118

# Install remaining packages
pip install -r requirements.txt
```

Data and trained weights can be downloaded here: https://drive.google.com/drive/folders/1WgtLS8pAy-bgU_L9s94MvZg1IwTbiIrr?usp=sharing
After downloading the data and weights files, extract them and organize the directories:
```bash
unzip data.zip
unzip weights.zip
```

Make sure the `./data/` directory contains the `generator` and `predictor` folders with the necessary training data and pre-trained weights.
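A quick sanity check that the extracted folders are in place (the directory names follow the layout described above; adjust the paths if your setup differs):

```python
import os

# Expected subfolders after extracting data.zip and weights.zip,
# per the layout described in this README
expected = ["data/generator", "data/predictor"]

for d in expected:
    status = "OK" if os.path.isdir(d) else "MISSING"
    print(f"{d}: {status}")
```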
```bash
# Ensure you're in the project root directory

# Train the DiGress generator
python generator.py --model digress --task train --n_epochs 100 --batch_size 1024
```

Configuration for DiGress training can be modified in `configs/digress/train/train_default.yaml`.

```bash
# Train the MOOD generator
python generator.py --model mood --task train --n_epochs 100 --batch_size 1024
```

Configuration for MOOD training can be modified in `configs/mood/prop_train.yaml`.

```bash
# Train the GDSS generator
python generator.py --model gdss --task train --n_epochs 100 --batch_size 1024
```

Configuration for GDSS training can be modified in `configs/gdss/zinc250k.yaml`.

```bash
# Train the Molecular VAE generator
python generator.py --model vae --task train --n_epochs 100 --batch_size 1024
```

Configuration for VAE training can be modified in `configs/vae/vae.yaml`.
The GNN predictor is pre-trained on pIC50 data. If you need to retrain it:
```bash
# Train the GNN predictor
cd predictors/molecularGNN_smiles/main/
python train.py --config ../../configs/gnn/gnn.yaml
```

To improve model robustness, the GNN predictor uses SMILES data augmentation during training. This process generates multiple SMILES representations of the same molecule, effectively increasing the training dataset size. The augmentation script is in `data/augment_smiles.py`.
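A minimal sketch of this kind of SMILES enumeration using RDKit's randomized SMILES output (the actual implementation lives in `data/augment_smiles.py`; the helper below is an illustrative stand-in, not the repository's code):

```python
from rdkit import Chem

def augment_smiles(smiles, n_variants=5):
    """Return up to n_variants distinct randomized SMILES for one molecule.

    Illustrative sketch of SMILES augmentation: each variant encodes the
    same molecular graph with a different atom traversal order.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = set()
    # Oversample random traversals, then deduplicate
    for _ in range(n_variants * 10):
        variants.add(Chem.MolToSmiles(mol, doRandom=True))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

# Example: aspirin yields several equivalent SMILES strings
print(augment_smiles("CC(=O)Oc1ccccc1C(=O)O", n_variants=3))
```

All variants canonicalize back to the same molecule, so the augmented dataset carries identical chemistry under different string encodings.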
You can directly generate molecules using any generator without filtering:
```bash
# Generate molecules using DiGress
python generator.py --model digress --task generate --n_samples_to_generate 100

# Generate molecules using MOOD
python generator.py --model mood --task generate --n_samples_to_generate 100

# Generate molecules using GDSS
python generator.py --model gdss --task generate --n_samples_to_generate 100

# Generate molecules using Molecular VAE
python generator.py --model vae --task generate --n_samples_to_generate 100
```

These commands generate the specified number of SMILES strings directly from the corresponding generator, without any filtering. The results are saved to a text file named `generated_smiles_{model}.txt`.
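As a quick sanity check on an output file, the generated SMILES can be loaded and validated with RDKit (the file name below follows the `generated_smiles_{model}.txt` convention above; this validity filter is illustrative, not part of the pipeline's own filtering):

```python
import os
from rdkit import Chem

def load_valid_smiles(path):
    """Read one SMILES per line and keep only RDKit-parsable entries."""
    valid = []
    with open(path) as f:
        for line in f:
            smi = line.strip()
            if smi and Chem.MolFromSmiles(smi) is not None:
                valid.append(smi)
    return valid

# Example output file from the DiGress generator, if present
path = "generated_smiles_digress.txt"
if os.path.exists(path):
    smiles = load_valid_smiles(path)
    print(f"{len(smiles)} valid molecules in {path}")
```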
Pipeline Summary: The complete molecule generation and filtering pipeline consists of four main stages:
- Molecule Generation, using the best-performing generator, DiGress
- Property Prediction, where generated molecules are evaluated with the GNN-based pIC50 predictor, alongside calculation of other molecular properties such as logP, SA score, and number of large rings
- Filtering, where molecules are screened against the specified property thresholds and structural constraints
- Output of the final set of optimized molecules that meet all criteria for potential PARP1 inhibitor activity
To run the complete pipeline (generation, property prediction, and filtering):
```bash
# Complete pipeline with DiGress generator
python run.py --model digress --n_final_smiles 20
```

This pipeline will:
- Generate a larger batch of molecules using the specified generator
- Calculate properties (logP, SA, pIC50) for each molecule
- Filter molecules based on property thresholds
- Return the requested number of filtered molecules
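The property-based filtering step can be sketched with RDKit alone. The sketch below checks only logP and large-ring count; the repository's `filterer.py` also applies the SA-score and predicted-pIC50 thresholds, which require the trained models, and the cutoff values here are placeholders rather than the paper's settings:

```python
from rdkit import Chem
from rdkit.Chem import Crippen

def passes_filters(smiles, logp_range=(-1.0, 5.0),
                   max_large_rings=0, large_ring_size=7):
    """Illustrative property filter on logP and large-ring count.

    Placeholder thresholds for demonstration only; SA-score and pIC50
    checks from the actual pipeline are omitted here.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    # Crippen logP must fall inside the allowed range
    logp = Crippen.MolLogP(mol)
    if not (logp_range[0] <= logp <= logp_range[1]):
        return False
    # Count rings at or above the "large" size threshold
    ring_sizes = [len(r) for r in mol.GetRingInfo().AtomRings()]
    n_large = sum(1 for size in ring_sizes if size >= large_ring_size)
    return n_large <= max_large_rings

candidates = ["CCO", "C1CCCCCCCC1", "c1ccc2ccccc2c1"]
print([s for s in candidates if passes_filters(s)])
```

Cyclononane (`C1CCCCCCCC1`) is rejected by the large-ring constraint, while the two fused six-membered rings of naphthalene pass.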
For a user-friendly interface that runs the complete pipeline:
```bash
# Launch the Gradio interface
python gradio_demo.py
```

The interface allows you to:
- Specify the number of molecules to generate
- Set property ranges (logP, SA, pIC50, number of large rings)
- Generate and visualize molecules by clicking "Generate Molecules"
- Export results to CSV by clicking "Export to CSV"
- `run.py`: Complete pipeline script (generation + filtering)
- `gradio_demo.py`: Web interface for molecule generation
- `generator.py`: Contains implementations of molecule generators
- `filterer.py`: Handles SMILES filtering based on molecular properties
- `predictor.py`: Contains the GNN-based pIC50 predictor
- `configs/`: Configuration files for generators and predictors
- `generators/`: Contains the different molecule generation models
  - `DiGress/`: Implementation of the DiGress diffusion model
  - `MOOD/`: Implementation of the MOOD generator
  - `GDSS/`: Implementation of the GDSS generator
  - `Molecular-VAE/`: Implementation of the Molecular VAE generator
- `predictors/`: Contains the property prediction models
  - `molecularGNN_smiles/`: GNN-based pIC50 predictor
If you use this code, please cite our paper:
@article{
title={Integrating Diffusion Models and Molecular Modeling for PARP1 Inhibitors Generation},
author={},
journal={},
year={}
}


