mgktools is a Python package for molecular property prediction using Marginalized Graph Kernels (MGK). It provides a comprehensive framework for training Gaussian Process and Support Vector Machine models on molecular datasets, with built-in support for hyperparameter optimization and model interpretability.
- Graph Kernel Methods: Marginalized Graph Kernel (MGK) for molecular similarity computation
- Multiple Model Types: Gaussian Process Regression (GPR), Gaussian Process Classification (GPC), and SVM models
- Hyperparameter Optimization: Bayesian optimization via Optuna and gradient-based optimization
- Model Interpretability: Atomic and molecular attribution for understanding predictions
- Flexible Molecular Representations: Support for 13+ fingerprint types and molecular descriptors
- Scalable Methods: Nystrom approximation and local expert models for large datasets
- Cross-Validation: Built-in k-fold, leave-one-out, and Monte Carlo cross-validation
- Python == 3.12
- GCC == 9 or 11
- CUDA >= 11 (No CPU implementation for graph kernel computation)
# Install graphdot dependency (required)
pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2
# Install mgktools from PyPI
pip install mgktools# Install graphdot dependency (required)
pip install git+https://gitlab.com/Xiangyan93/graphdot.git@v0.8.2
git clone https://github.com/Xiangyan93/mgktools.git
cd mgktools
pip install -e .GPU is required to compute graph kernels. Try the interactive tutorial:
mgktools provides several CLI commands:
Pre-cache graph objects and molecular features for faster processing.
mgk_cache_data --data_paths data.csv --smiles_columns smiles --cache_graph --cache_path cache.pklCompute and save kernel matrices.
mgk_kernel_calc --save_dir output/ --data_path data.csv --smiles_columns smiles \
--graph_kernel_type graph --graph_hyperparameters additive-pnorm.jsonRun cross-validation experiments.
mgk_cross_validation --save_dir output/ --data_path data.csv --smiles_columns smiles \
--targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
--model_type gpr --task_type regression --cross_validation kFold --n_splits 5 \
--alpha 0.01 --metric rmseBayesian hyperparameter optimization with Optuna.
mgk_optuna --save_dir output/ --data_path data.csv --smiles_columns smiles \
--targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
--model_type gpr --task_type regression --cross_validation leave-one-out \
--alpha 0.01 --alpha_bounds 0.001 0.1 --metric rmse --num_iters 100Compute molecular embeddings (t-SNE or kPCA).
mgk_embedding --save_dir output/ --data_path data.csv --smiles_columns smiles \
--targets_columns target --graph_kernel_type graph --graph_hyperparameters additive-pnorm.json \
--embedding_algorithm tSNE --n_components 2 --perplexity 30import pandas as pd
from mgktools.data.data import Dataset
from mgktools.kernels.utils import get_kernel_config
from mgktools.evaluators.cross_validation import Evaluator
# Load data from DataFrame
df = pd.DataFrame({
'smiles': ['CCO', 'CCC', 'CCCC', 'CCCCC', 'CCCCCC'],
'target': [1.0, 2.0, 3.0, 4.0, 5.0]
})
# Create dataset
dataset = Dataset.from_df(
df,
smiles_columns=['smiles'],
targets_columns=['target']
)
# Set up for graph kernel computation
dataset.set_status(graph_kernel_type='graph')
dataset.create_graphs(n_jobs=4)
dataset.unify_datatype()
# Configure kernel (using additive kernel with p-normalization)
kernel_config = get_kernel_config(
dataset=dataset,
graph_kernel_type='graph',
mgk_hyperparameters_files=['mgktools/hyperparameters/configs/additive-pnorm.json']
)
# Run cross-validation
evaluator = Evaluator(
dataset=dataset,
model_type='gpr',
task_type='regression',
kernel_config=kernel_config,
split_type='random',
n_splits=5,
metrics=['rmse', 'r2']
)
results = evaluator.run_cross_validation()from mgktools.features_mol.features_generators import FeaturesGenerator
# Create feature generator
fg = FeaturesGenerator('morgan', radius=2, num_bits=2048)
# Set up dataset with molecular features
dataset.set_status(
graph_kernel_type='graph',
features_generators=[fg],
features_combination='concat'
)
dataset.create_graphs(n_jobs=4)
dataset.create_features_mol(n_jobs=4)from mgktools.hyperparameters.optuna import bayesian_optimization
# Run Bayesian optimization
best_params, best_score = bayesian_optimization(
dataset_train=train_dataset,
dataset_val=val_dataset,
kernel_config=kernel_config,
model_type='gpr',
task_type='regression',
metric='rmse',
n_trials=100
)mgktools/
├── data/ # Dataset handling and caching
│ ├── data.py # CachedDict, Datapoint, Dataset classes
│ └── split.py # Train/val/test splitting utilities
├── kernels/ # Kernel implementations
│ ├── base.py # MicroKernel, BaseKernelConfig
│ ├── GraphKernel.py # MGK and GraphKernelConfig
│ ├── FeatureKernel.py
│ ├── HybridKernel.py
│ └── utils.py # get_kernel_config factory
├── models/ # Model implementations
│ ├── regression/ # GPR, NLE, ensemble models
│ └── classification/# GPC, SVM classifiers
├── evaluators/ # Evaluation utilities
│ ├── cross_validation.py # Evaluator class
│ └── metric.py # Metric computation
├── features_mol/ # Molecular feature generators
│ └── features_generators.py
├── graph/ # Graph conversion utilities
│ └── hashgraph.py # HashGraph class
├── interpret/ # Model interpretability
│ └── interpret.py # Atomic/molecular attribution
├── hyperparameters/ # Optimization and configs
│ ├── optuna.py # Bayesian optimization
│ └── configs/ # Pre-defined kernel configs
└── exe/ # CLI entry points
└── run.py # Command-line tools
| Name | Description | Default Size |
|---|---|---|
morgan |
Binary Morgan fingerprint | 2048 |
morgan_count |
Count-based Morgan fingerprint | 2048 |
rdkit_2d |
RDKit 2D descriptors | ~200 |
rdkit_2d_normalized |
Normalized RDKit 2D descriptors | ~200 |
rdkit_208 |
RDKit 208 descriptors | ~210 |
rdkit_topol |
RDKit topological fingerprint | 2048 |
layered |
Layered fingerprint | 2048 |
torsion |
Topological torsion fingerprint | 2048 |
atom_pair |
Atom pair fingerprint | 2048 |
avalon |
Avalon fingerprint | 2048 |
avalon_count |
Count-based Avalon fingerprint | 2048 |
maccskey |
MACCS keys fingerprint | 167 |
pattern |
Pattern fingerprint | 2048 |
Pre-defined kernel configurations are available in mgktools/hyperparameters/configs/:
- Additive kernels:
additive-norm.json,additive-pnorm.json,additive-msnorm.json - Product kernels:
product-norm.json,product-pnorm.json,product-msnorm.json - Feature kernels:
rbf.json,dot_product.json
# Run all tests
pytest tests/
# Run specific test file
pytest tests/kernel/test_kernel.py
# Run with verbose output
pytest tests/ -v- Predicting Single-Substance Phase Diagrams: A Kernel Approach on Graph Representations of Molecules
- A Comparative Study of Marginalized Graph Kernel and Message-Passing Neural Network
- Interpretable Molecular Property Predictions Using Marginalized Graph Kernels
This project is licensed under the MIT License - see the LICENSE file for details.