This study details the development of an ensemble Siamese neural network model for predicting prodrug activation. The model accurately predicts activation of FDA-approved prodrugs and performs particularly well in two cases: non-traditional biotransformations (prodrug activations not captured by established metabolic reaction rules) and prodrugs with lower chemical structural similarity to their active pharmaceutical ingredient (API). A prospective analysis confirmed the model's ability to rank prodrug candidates by their observed release profile, establishing this tool as a generalizable resource for rational prodrug design.

This folder contains code used to train and evaluate modified versions of the three SiameseNN models in the ensemble for an ablation study. Ablation study model architectures are as follows:
This folder contains code used to train and evaluate baseline ensemble models. Baseline machine learning models include:
- k-Nearest Neighbors
- Decision Tree
- Random Forest
- Gradient Boosting
- XGBoost
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (Radial Basis Function Kernel)
- Logistic Regression
- Multilayer Perceptron
This folder contains data files used to train and evaluate models:
- Three train sets of endogenous and xenobiotic metabolic reactions used to train each of the three SiameseNNs in the ensemble model
- Three test sets corresponding to each train set and SiameseNN
- An external test set of FDA-approved prodrugs
This folder contains code used to extract latent embeddings representing the prodrug test set from each of the three SiameseNN models in the ensemble. Because the extracted embedding files are large, they are stored on Zenodo (DOI: 10.5281/zenodo.17237868).
This folder contains code used to calculate integrated gradients for a set of prodrugs with proven or theoretical competing metabolic pathways for each of the three SiameseNN models in the ensemble. This folder also contains gradients calculated by each model for this set of prodrugs.
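For readers unfamiliar with integrated gradients, the following is a minimal sketch of the technique on a toy model, not the code in this folder. The logistic scorer, its weights, the all-zeros baseline, and every function name are assumptions chosen for illustration; the repository's models and input features are not reproduced here. The sketch approximates the path integral with a midpoint Riemann sum and checks the completeness property, which states that the attributions sum to the difference between the model output at the input and at the baseline.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable scorer (assumption): logistic function of a weighted sum."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    """Analytic gradient of the toy scorer with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight-line path from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

# Hypothetical weights and a three-feature input, for illustration only.
w = [1.5, -2.0, 0.5]
b = 0.1
x = [1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
# Completeness: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(sum(attributions) - (model(x, w, b) - model(baseline, w, b)))
```

A feature that does not differ from the baseline (the second feature here) receives zero attribution regardless of its weight, which is why the choice of baseline matters when interpreting the gradients.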
This folder contains code used to train and evaluate each of the three SiameseNN models in the ensemble.
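As a conceptual illustration only, the sketch below shows the general shape of a Siamese comparison and a three-member ensemble. It is not the architecture in this folder: the single tanh layer, the random untrained weights, the distance-to-score mapping, the averaging of member scores, and the four-element feature vectors are all assumptions. The defining feature it demonstrates is that both inputs of a pair pass through the same encoder weights before their embeddings are compared.

```python
import math
import random

def make_encoder(n_in, n_out, seed):
    """Hypothetical encoder: one weight matrix with random, untrained values."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

def encode(weights, x):
    """Map an input vector to an embedding with a single tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def siamese_score(weights, a, b):
    """Encode both inputs with the SAME weights, then compare the embeddings."""
    ea, eb = encode(weights, a), encode(weights, b)
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))
    # Assumed mapping: smaller embedding distance gives a score closer to 1.
    return math.exp(-dist)

def ensemble_score(encoders, a, b):
    """Assumed combination rule: plain average of the member scores."""
    scores = [siamese_score(w, a, b) for w in encoders]
    return sum(scores) / len(scores)

# Three independently initialised members, mirroring the three-model ensemble.
encoders = [make_encoder(4, 3, seed) for seed in (0, 1, 2)]

# Hypothetical feature vectors for a prodrug and its API.
prodrug = [1.0, 0.0, 1.0, 1.0]
api = [1.0, 0.0, 1.0, 0.0]

score = ensemble_score(encoders, prodrug, api)
identical = ensemble_score(encoders, api, api)
```

Because the encoder is shared, a pair of identical inputs always maps to identical embeddings and receives the maximum score, whatever the weights are.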
This folder contains data files detailing predictions made by:
- SiameseNN and modified models
- Baseline models
- Existing metabolism prediction tools
The copyrights of the software are owned by Duke University. As such, two licenses for this software are offered:
- An open-source license under the GPLv2 license for non-commercial academic use.
- A custom license with Duke University, for commercial use or uses without the GPLv2 license restrictions.