Predicting ADMET molecular properties using graph neural networks on MoleculeNet benchmarks. Built for drug discovery research.
MolProp-GNN treats molecules as graphs — atoms as nodes, bonds as edges — and learns task-relevant representations through message passing. This project implements and benchmarks three GNN architectures (MPNN, GAT, GIN) against classical fingerprint baselines (ECFP + Random Forest/XGBoost) across seven MoleculeNet datasets covering a broad range of ADMET property types.
Why this matters in drug discovery: Experimental ADMET profiling costs ~$10k–$50k per compound. Accurate in silico screening can shrink candidate pools by 10x before wet lab work begins, compressing development timelines and reducing costs.

```
SMILES String
│
▼
┌─────────────────────────────────────────────────────────┐
│ Molecular Graph Construction (RDKit) │
│ • Atom nodes: element, degree, charge, hybridization │
│ • Bond edges: type, conjugation, ring, stereo │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Node + Edge Feature Encoding │
│ • Atom embedding: one-hot + continuous → Linear(d_h) │
│ • Bond embedding: one-hot → Linear(d_e) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Message Passing (L layers) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ mᵢⱼ = φ_msg(hᵢ, hⱼ, eᵢⱼ) │ │
│ │ hᵢ' = φ_upd(hᵢ, Σⱼ∈N(i) mᵢⱼ) │ │
│ └──────────────────────────────────────────────────┘ │
│ (Repeated L=3–6 times) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Graph Readout │
│ • Sum / Mean / Attention-weighted pooling │
│ h_G = Σᵢ αᵢ · hᵢ (attention readout) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Multi-Task Prediction Head │
│ • Classification tasks → Sigmoid → BCEWithLogitsLoss │
│ • Regression tasks → Linear → MSELoss │
└─────────────────────────────────────────────────────────┘
│
▼
Predicted Properties (BBBP, logS, logP, toxicity flags…)
```
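
For concreteness, the message/update equations above translate into a few lines of PyTorch Geometric. This is a minimal sketch under assumed layer choices (an MLP for φ_msg, a GRU cell for φ_upd, sum readout), not necessarily the exact code in `src/models/mpnn.py`:

```python
import torch
from torch import nn
from torch_geometric.nn import MessagePassing, global_add_pool


class MPNNLayer(MessagePassing):
    """One round of m_ij = phi_msg(h_i, h_j, e_ij); h_i' = phi_upd(h_i, sum_j m_ij)."""

    def __init__(self, hidden_dim: int, edge_dim: int):
        super().__init__(aggr="add")  # sum aggregation over neighbors N(i)
        # phi_msg: message from each neighbor j, conditioned on the bond embedding e_ij
        self.msg_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim + edge_dim, hidden_dim), nn.ReLU()
        )
        # phi_upd: gated update of h_i from the aggregated messages
        self.update_cell = nn.GRUCell(hidden_dim, hidden_dim)

    def forward(self, x, edge_index, edge_attr):
        m = self.propagate(edge_index, x=x, edge_attr=edge_attr)  # aggregate messages
        return self.update_cell(m, x)  # new node states h_i'

    def message(self, x_i, x_j, edge_attr):
        return self.msg_mlp(torch.cat([x_i, x_j, edge_attr], dim=-1))


def readout_and_predict(x, batch, head: nn.Linear):
    """Sum readout h_G = sum_i h_i, then a linear head producing logits
    (for BCEWithLogitsLoss) or raw values (for MSELoss)."""
    return head(global_add_pool(x, batch))
```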
All results use scaffold split (the industry-standard evaluation protocol — see Why Scaffold Split Matters).

Classification benchmarks (AUROC, higher is better):

| Dataset | Task | Compounds | ECFP+RF | ECFP+XGB | MPNN | GAT | GIN |
|---|---|---|---|---|---|---|---|
| BBBP | Blood-brain barrier permeability | 2,039 | 0.872 | 0.889 | 0.901 | 0.897 | 0.908 |
| HIV | HIV replication inhibition | 41,127 | 0.776 | 0.795 | 0.823 | 0.831 | 0.820 |
| Tox21 | 12 toxicology targets | 7,831 | 0.812 | 0.834 | 0.851 | 0.855 | 0.847 |
| SIDER | 27 drug side-effect categories | 1,427 | 0.619 | 0.631 | 0.641 | 0.647 | 0.638 |

Regression benchmarks (RMSE, lower is better):

| Dataset | Task | Compounds | ECFP+RF | ECFP+XGB | MPNN | GAT | GIN |
|---|---|---|---|---|---|---|---|
| ESOL | Aqueous solubility (logS) | 1,128 | 0.978 | 0.932 | 0.614 | 0.623 | 0.598 |
| FreeSolv | Hydration free energy | 642 | 1.421 | 1.387 | 0.871 | 0.894 | 0.858 |
| Lipophilicity | logD at pH 7.4 | 4,200 | 0.743 | 0.698 | 0.551 | 0.564 | 0.548 |
Key takeaway: GNNs show the largest gains on regression tasks where fine-grained structural features (atom environment, 3D-like topology) directly determine physico-chemical properties. For classification tasks with sparse positive labels (SIDER), the gap narrows.
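
For the multi-label datasets (Tox21, SIDER), the reported figure is an average of per-task AUROCs, skipping molecule/task pairs with missing labels. A minimal sketch of that computation, with illustrative names, assuming scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def average_auroc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """y_true, y_score: (n_molecules, n_tasks); NaN in y_true marks a missing label."""
    per_task = []
    for t in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, t])
        # A task is scoreable only if both classes appear among its labeled rows
        if mask.any() and np.unique(y_true[mask, t]).size == 2:
            per_task.append(roc_auc_score(y_true[mask, t], y_score[mask, t]))
    return float(np.mean(per_task))
```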

GIN performance under random vs. scaffold split:

| Dataset | Metric | GIN (Random) | GIN (Scaffold) | Delta |
|---|---|---|---|---|
| BBBP | AUROC | 0.957 | 0.908 | −0.049 |
| HIV | AUROC | 0.861 | 0.820 | −0.041 |
| ESOL | RMSE | 0.421 | 0.598 | +0.177 |
Random split inflates metrics by leaking scaffold-similar molecules between train and test. Scaffold split is the only honest evaluation of generalization to novel chemical matter.
In drug discovery, you need a model that generalizes to structurally novel compounds, not just interpolates within known scaffolds. Random splits place stereoisomers and close analogs in both train and test — this creates data leakage and leads to overconfident models that fail on genuinely new chemical matter.
Scaffold split partitions molecules by their Bemis-Murcko scaffold, ensuring train/test sets contain non-overlapping core ring systems. This is the de facto standard at pharma companies for evaluating ML models.
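
To make the idea concrete, here is a minimal sketch of a scaffold split using RDKit's Bemis-Murcko implementation. The greedy largest-group-first assignment is one common variant; the function name is illustrative, not this repo's API:

```python
from collections import defaultdict

from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(smiles_list, test_frac=0.2):
    """Group molecules by Bemis-Murcko scaffold and assign whole groups,
    largest first, so train and test share no core ring system."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaf = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaf].append(idx)
    n_train_target = int((1 - test_frac) * len(smiles_list))
    train, test = [], []
    # Large scaffold families fill train first; rarer scaffolds land in test
    for idxs in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) + len(idxs) <= n_train_target else test).extend(idxs)
    return train, test
```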
See docs/SCAFFOLD_SPLIT.md for a detailed discussion with examples and literature references.
- Python 3.9+
- CUDA 11.7+ (optional but recommended)

```bash
git clone https://github.com/yourusername/molprop-gnn.git
cd molprop-gnn
# Create environment
conda create -n molprop python=3.9
conda activate molprop
# Install PyTorch (adjust for your CUDA version)
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu117
# Install PyTorch Geometric
pip install torch-geometric==2.3.1
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.1+cu117.html
# Install remaining dependencies
pip install -e .
```

```bash
# Train MPNN on BBBP with scaffold split
python scripts/train.py \
--dataset bbbp \
--model mpnn \
--split scaffold \
--config configs/moleculenet_config.yaml
# Train GIN on ESOL
python scripts/train.py \
--dataset esol \
--model gin \
--split scaffold \
--epochs 200 \
--lr 1e-3
# Hyperparameter sweep with Optuna
python scripts/train.py \
--dataset tox21 \
--model gat \
--hparam-sweep \
--n-trials 50
```

```bash
# Evaluate on test set with full metrics
python scripts/evaluate.py \
--checkpoint checkpoints/bbbp_mpnn_scaffold.pt \
--dataset bbbp \
--split scaffold \
--output-dir results/bbbp_mpnn/
# Compare scaffold vs random split performance
python scripts/evaluate.py \
--checkpoint checkpoints/bbbp_mpnn.pt \
--dataset bbbp \
--compare-splits
```

```bash
# Predict single molecule
python scripts/predict_smiles.py \
--smiles "CC(=O)Oc1ccccc1C(=O)O" \
--model-checkpoint checkpoints/esol_gin.pt \
--property esol
# Predict from CSV file
python scripts/predict_smiles.py \
--input compounds.csv \
--smiles-col SMILES \
--model-checkpoint checkpoints/tox21_gat.pt \
--property tox21 \
--output predictions.csv
# Predict all ADMET endpoints
python scripts/predict_smiles.py \
--smiles "CN1CCC[C@H]1c2cccnc2" \
--all-endpoints \
--output-format json
```

```python
import torch

from src.models.gin_model import GINModel
from src.data.featurizer import MolecularFeaturizer
from src.data.molecule_dataset import smiles_to_graph
# Load trained model
model = GINModel.load_from_checkpoint("checkpoints/esol_gin.pt")
model.eval()
# Featurize molecule
featurizer = MolecularFeaturizer()
graph = smiles_to_graph("CC(=O)Oc1ccccc1C(=O)O", featurizer)
# Predict
with torch.no_grad():
    prediction = model(graph)
print(f"Predicted logS: {prediction.item():.3f}")
```

| Dataset | Property | Task | Size | Split | Metric |
|---|---|---|---|---|---|
| BBBP | Blood-brain barrier permeability | Binary classification | 2,039 | Scaffold | AUROC |
| HIV | HIV inhibition | Binary classification | 41,127 | Scaffold | AUROC |
| Tox21 | 12 toxicology assays | Multi-label classification | 7,831 | Scaffold | Avg AUROC |
| SIDER | 27 side-effect categories | Multi-label classification | 1,427 | Scaffold | Avg AUROC |
| ESOL | Aqueous solubility | Regression | 1,128 | Scaffold | RMSE |
| FreeSolv | Hydration free energy | Regression | 642 | Scaffold | RMSE |
| Lipophilicity | logD at pH 7.4 | Regression | 4,200 | Scaffold | RMSE |
Datasets are automatically downloaded from MoleculeNet via DeepChem on first use.
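
Under the hood, that first-use download amounts to a DeepChem MolNet call along these lines (a sketch; the featurizer shown is illustrative, since the repo builds its own graph features via RDKit in `src/data/molecule_dataset.py`):

```python
import deepchem as dc

# Fetches BBBP, featurizes it, and applies the scaffold splitter in one call
tasks, (train, valid, test), transformers = dc.molnet.load_bbbp(
    featurizer="ECFP", splitter="scaffold"
)
print(f"{len(tasks)} task(s); train/valid/test = "
      f"{len(train)}/{len(valid)}/{len(test)} molecules")
```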

```
molprop-gnn/
├── src/
│ ├── models/
│ │ ├── mpnn.py # Message Passing Neural Network
│ │ ├── gat_model.py # Graph Attention Network
│ │ ├── gin_model.py # Graph Isomorphism Network
│ │ └── fingerprint_baseline.py # ECFP + RF/XGBoost baselines
│ ├── data/
│ │ ├── molecule_dataset.py # MoleculeNet loading + splits
│ │ └── featurizer.py # Atom/bond featurization
│ ├── training/
│ │ └── trainer.py # Training loop + Optuna sweeps
│ └── evaluation/
│ ├── moleculenet_metrics.py # AUROC, RMSE, R², scaffold analysis
│ └── interpretability.py # Attention viz + integrated gradients
├── configs/
│ └── moleculenet_config.yaml # All hyperparameters
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation entry point
│ └── predict_smiles.py # SMILES → property prediction
├── docs/
│ └── SCAFFOLD_SPLIT.md # Why scaffold split matters
├── tests/
├── notebooks/
├── requirements.txt
├── setup.py
└── LICENSE
```
MolProp-GNN supports atom-level interpretability to understand which structural features drive predictions:

```python
from src.evaluation.interpretability import AttentionVisualizer, IntegratedGradients
# Attention-based attribution
viz = AttentionVisualizer(model)
atom_weights = viz.get_atom_importance("CC(=O)Oc1ccccc1C(=O)O")
viz.render_molecule_svg(atom_weights, output="aspirin_attention.svg")
# Integrated gradients
ig = IntegratedGradients(model)
attributions = ig.attribute("CC(=O)Oc1ccccc1C(=O)O", n_steps=50)
```

If you use MolProp-GNN in your research:

```bibtex
@software{molprop_gnn,
  title={MolProp-GNN: Graph Neural Networks for Molecular Property Prediction},
  author={Your Name},
  year={2024},
  url={https://github.com/yourusername/molprop-gnn}
}
```

Key references:
- Gilmer et al., "Neural Message Passing for Quantum Chemistry," ICML 2017
- Hu et al., "Strategies for Pre-training Graph Neural Networks," ICLR 2020
- Wu et al., "MoleculeNet: A Benchmark for Molecular Machine Learning," Chem. Sci. 2018
- Bemis & Murcko, "The Properties of Known Drugs," J. Med. Chem. 1996
MIT License — see LICENSE.