Task Hardness Estimation for Molecular Activity Prediction
A Python library for calculating distances between chemical datasets to enable intelligent dataset selection for molecular activity prediction tasks.
- Overview
- Installation
- Quick Start
- CLI Reference
- Usage Examples
- Reproducing FS-Mol Experiments
- Documentation
- Contributing
- Citation
- License
THEMAP is a Python library designed to calculate distances between chemical datasets for molecular activity prediction tasks. The primary goal is to enable intelligent dataset selection for:
- Transfer Learning: Identify the most relevant source datasets for your target prediction task
- Domain Adaptation: Measure dataset similarity to guide model adaptation strategies
- Task Hardness Assessment: Quantify how difficult a prediction task will be based on dataset characteristics
- Dataset Curation: Select optimal training datasets from large chemical databases like ChEMBL
The easiest way to install THEMAP with all features:
git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh

This automatically:
- Installs uv (fast Python package manager) if needed
- Creates a virtual environment in .venv
- Installs all dependencies
- Activates the environment
After installation, try an example:
python examples/quickstart.py

To reactivate the environment later:
source .venv/bin/activate

For more control, install with pip:
pip install themap # Basic installation from PyPI
pip install -e ".[all]" # Full installation (editable)
pip install -e ".[protein]" # Protein analysis only
pip install -e ".[otdd]" # Optimal transport only
pip install -e ".[dev,test]" # Development + testing

For GPU support with specific CUDA versions:
conda env create -f environment.yml
conda activate themap
pip install -e . --no-deps

Requirements:
- Python 3.10 or higher
- For GPU features: CUDA-compatible GPU and drivers
The simplest way to compute distances between molecular datasets:
from themap import quick_distance
results = quick_distance(
    data_dir="datasets",            # Directory with train/ and test/ folders
    output_dir="output",            # Where to save results
    molecule_featurizer="ecfp",     # Fingerprint type (ecfp, maccs, etc.)
    molecule_method="euclidean",    # Distance metric
)
# Results saved to output/molecule_distances.csv

For reproducible experiments, use a YAML configuration:
from themap import run_pipeline
results = run_pipeline("config.yaml")

Example config.yaml:
data:
  directory: "datasets"
distances:
  molecule:
    enabled: true
    featurizer: "ecfp"
    method: "euclidean"
output:
  directory: "output"
  format: "csv"

Organize your data in this structure:
datasets/
├── train/                      # Source datasets
│   ├── CHEMBL123456.jsonl.gz
│   └── ...
└── test/                       # Target datasets
    ├── CHEMBL111111.jsonl.gz
    └── ...
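A quick way to confirm that a directory follows this layout is to list the dataset files found in each fold. The sketch below uses only the Python standard library. The helper name `list_fold_files` is hypothetical and not part of THEMAP, and the script builds a throwaway directory so it can run anywhere.

```python
import tempfile
from pathlib import Path


def list_fold_files(data_dir):
    """Map each fold name ("train", "test") to its sorted .jsonl.gz file names.

    Hypothetical helper for illustration; not part of the THEMAP API.
    """
    folds = {}
    for fold in ("train", "test"):
        fold_dir = Path(data_dir) / fold
        if fold_dir.is_dir():
            folds[fold] = sorted(p.name for p in fold_dir.glob("*.jsonl.gz"))
        else:
            # A missing fold yields an empty list rather than an error.
            folds[fold] = []
    return folds


# Build a throwaway layout matching the tree above and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "datasets"
    for fold, name in [("train", "CHEMBL123456"), ("test", "CHEMBL111111")]:
        (root / fold).mkdir(parents=True)
        (root / fold / f"{name}.jsonl.gz").touch()
    layout = list_fold_files(root)

print(layout)
```

Pointing `list_fold_files` at a real `datasets/` directory should show at a glance whether any source or target files are missing before a longer run is started.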
Each .jsonl.gz file contains molecules in JSON lines format:
{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}

THEMAP provides a command-line interface for all core operations. After installation, the themap command is available in your terminal.
themap --help # Show all available commands
themap <command> --help # Show help for a specific command

Compute distances between datasets with minimal setup (no config file needed):
themap quick datasets/ -f ecfp -m euclidean -o output/
themap quick datasets/ -f maccs -m cosine -j 4

For reproducible experiments, use a YAML configuration:
themap init # Generate a config.yaml template
themap run config.yaml # Run the full pipeline
themap run config.yaml -o results/ # Custom output directory
themap run config.yaml --molecule-only # Skip protein distances
themap run config.yaml -j 4 # Set parallel workers

Featurize datasets and cache them to disk (useful before running multiple distance computations):
# Single featurizer
themap featurize datasets/ -f ecfp
# Multiple featurizers at once
themap featurize datasets/ -f ecfp -f maccs -f desc2D
# Featurize a specific fold or file
themap featurize datasets/ -f ecfp --fold train
themap featurize datasets/test/CHEMBL123.jsonl.gz -f ecfp
# Force recompute (ignore cached features)
themap featurize datasets/ -f ecfp --force

# Convert CSV to THEMAP's JSONL.GZ format
themap convert data.csv CHEMBL123456
themap convert data.csv CHEMBL123456 --smiles-column SMILES --activity-column pIC50
# Inspect a dataset directory
themap info datasets/
# List all available featurizers (27 molecule + 5 protein featurizers)
themap list-featurizers

Add -v before any command for verbose/debug output: themap -v quick datasets/
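For readers who prepare data by hand rather than with `themap convert`, the JSONL.GZ dataset format can be written and read with the standard library alone. The sketch below is a minimal round trip, assuming records carry only the `SMILES` and `Property` keys shown in the dataset format example. The function names are hypothetical, and the exact output of `themap convert` for other column choices is not covered here.

```python
import gzip
import json
import tempfile
from pathlib import Path

# Toy records using the SMILES/Property keys from the dataset format example.
records = [
    {"SMILES": "CCO", "Property": 1},
    {"SMILES": "CCCO", "Property": 0},
]


def write_jsonl_gz(path, rows):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round trip through a temporary file so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "CHEMBL123456.jsonl.gz"
    write_jsonl_gz(dataset_path, records)
    loaded = read_jsonl_gz(dataset_path)

print(loaded)
```

Reading a file back this way is also a convenient sanity check that every line parses as JSON before the file is placed under `train/` or `test/`.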
import pandas as pd
# Load computed distances
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)
# Find closest source for each target (transfer learning selection)
for target in distances.columns:
    closest = distances[target].idxmin()
    dist = distances[target].min()
    print(f"{target} <- {closest} (distance: {dist:.4f})")
# Estimate task hardness (average distance to k-nearest sources)
k = 3
for target in distances.columns:
    hardness = distances[target].nsmallest(k).mean()
    print(f"Task hardness for {target}: {hardness:.4f}")

The companion data for our paper "Quantifying the hardness of bioactivity prediction tasks for transfer learning" (J. Chem. Inf. Model. 64(10), 4031–4046, 2024) is published on Zenodo (record 10605093). It contains pre-computed OTDD distance matrices across multiple molecular featurizers, ESM-2 protein embeddings, internal chemical hardness measures, and ProtoNet evaluation summaries on the FS-Mol benchmark: everything needed to reproduce the figures and tables without re-running the expensive embedding pipelines.
source install.sh # creates .venv and installs themap[all,dev,test]

The reproduction notebooks rely on the optional ml extras (torch, ESM, etc.); the all-in-one install above covers them.
You need ~35 GB of free disk space (16 GB zip + ~16 GB extracted). The script downloads with resume support, verifies the MD5 checksum, extracts into datasets/fsmol_hardness/, and removes the zip when done.
make download-fsmol
# or, equivalently:
python scripts/download_fsmol_data.py

Useful flags: --keep-zip (don't delete the archive after extraction), --force (re-download), --no-verify (skip the MD5 check; use only if you've already verified out-of-band), --dest DIR (custom location).
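The MD5 verification mentioned above can also be done by hand, for example after downloading the archive with `--no-verify`. The following is a minimal sketch of a chunked checksum using `hashlib`. It is not the code used by `scripts/download_fsmol_data.py`, and the helper name `md5_of_file` is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file instead of the full archive.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    checksum = md5_of_file(sample)

print(checksum)
```

For the real archive, the returned string would be compared against the MD5 value listed in the manual download instructions.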
After it completes (~31 GB extracted) you should see:
datasets/fsmol_hardness/
├── ext_chem/ # OTDD distance matrices per molecular featurizer
├── ext_prot/ # ESM-2 protein-distance matrices (t6_8M ... t36_3B)
├── int_chem/{train,test}/ # Internal chemical hardness (RF baselines)
├── embeddings/ # Per-task molecular embeddings used to compute the OTDDs
├── FSMol_Eval_ProtoNet/summary/ # ProtoNet performance per support-set size (16/32/64/128)
└── FSMol_Eval_randomForest/summary/ # Random-forest baseline performance summaries
The reproduction notebooks read from ext_chem/, ext_prot/, int_chem/, and FSMol_Eval_ProtoNet/; the other two directories are provided so users can rebuild the OTDD matrices from raw embeddings if desired.
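Before opening the notebooks, it can help to confirm that extraction produced the directories listed above. The sketch below checks for them with `pathlib`. The list of paths is copied from the listing, while the helper name `missing_subdirs` is hypothetical and the demo runs against a throwaway directory rather than the real data.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the directory listing above.
EXPECTED = [
    "ext_chem",
    "ext_prot",
    "int_chem/train",
    "int_chem/test",
    "embeddings",
    "FSMol_Eval_ProtoNet/summary",
    "FSMol_Eval_randomForest/summary",
]


def missing_subdirs(root, expected=EXPECTED):
    """Return the expected subdirectories that are absent under root."""
    return [rel for rel in expected if not (Path(root) / rel).is_dir()]


# Demo on a throwaway directory that contains only part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "fsmol_hardness"
    (root / "ext_chem").mkdir(parents=True)
    (root / "int_chem" / "train").mkdir(parents=True)
    missing = missing_subdirs(root)

print(missing)
```

An empty list from `missing_subdirs("datasets/fsmol_hardness")` would indicate that all listed directories are present.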
Manual download (no Python)
mkdir -p datasets/fsmol_hardness
cd datasets
wget -c https://zenodo.org/records/10605093/files/fsmol_hardness.zip
echo "10644660a53d8d106b6883cb53eb1f3b fsmol_hardness.zip" | md5sum -c -
unzip fsmol_hardness.zip -d fsmol_hardness/

cd notebooks
jupyter lab # or: jupyter notebook

| Notebook | What it reproduces |
|---|---|
| external_chemical_hardness.ipynb | External chemical-space hardness: correlation between k-nearest source-task OTDD distance and ProtoNet performance, across molecular featurizers (GIN, UniMol, ChemBERTa/Roberta-Zinc, desc2D). |
| external_protein_hardness.ipynb | External protein-space hardness: correlation between target/source protein-embedding distance and performance, across ESM-2 model sizes (t6_8M → t36_3B). |
| task_hardness.ipynb | Combined task-hardness score (external chemical + external protein + internal chemical) and its correlation with ProtoNet performance at support-set sizes 16/32/64/128. |
Notebook paths are resolved relative to the notebooks/ directory, so launch Jupyter from there. Outputs are auto-stripped on commit by the pre-commit hook (nbstripout).
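To make the idea behind these notebooks concrete, the sketch below computes a k-nearest hardness score per target task from a toy distance table and correlates it with toy performance values. All numbers and task names are made up for illustration and are not taken from the paper or the Zenodo data. The use of a Spearman rank correlation is an assumption, since the correlation measure used in the notebooks is not stated here.

```python
from statistics import mean

# Made-up distances, indexed as distances[target][source]. Illustrative only.
distances = {
    "target_A": {"src_1": 0.2, "src_2": 0.4, "src_3": 0.9, "src_4": 1.1},
    "target_B": {"src_1": 0.8, "src_2": 0.7, "src_3": 0.6, "src_4": 1.5},
    "target_C": {"src_1": 1.4, "src_2": 1.2, "src_3": 1.6, "src_4": 1.3},
}
# Made-up model performance per target (higher is better).
performance = {"target_A": 0.80, "target_B": 0.65, "target_C": 0.50}


def knn_hardness(source_distances, k):
    """Mean distance from a target to its k nearest source tasks."""
    return mean(sorted(source_distances.values())[:k])


def ranks(values):
    """1-based ranks; assumes no ties, which holds for the toy data."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


targets = sorted(distances)
hardness = [knn_hardness(distances[t], k=3) for t in targets]
scores = [performance[t] for t in targets]

# Spearman correlation computed as the Pearson correlation of the ranks.
spearman = pearson(ranks(hardness), ranks(scores))
print(dict(zip(targets, hardness)), spearman)
```

In this toy setup the targets that sit farther from their nearest sources also have lower performance, so the rank correlation is strongly negative. Real data will be noisier.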
Full documentation is available at hfooladi.github.io/THEMAP or can be built locally:
mkdocs serve # Serve locally at http://127.0.0.1:8000

We welcome contributions! Please see our Contributing Guidelines for details.
git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP
source install.sh # creates .venv and installs all deps

Or manually:
pip install -e ".[dev,test,ml]"
pre-commit install # one-time; install.sh does this automatically

source .venv/bin/activate # always activate venv first
python run_tests.py # all tests
python run_tests.py fast # skip slow tests
python run_tests.py coverage # with coverage
pytest -k "test_name" # specific test by name

ruff check . # linting
ruff format . # formatting
mypy -p themap # type checking

If you use THEMAP in your research, please cite our paper:
@article{fooladi2024quantifying,
  title={Quantifying the hardness of bioactivity prediction tasks for transfer learning},
  author={Fooladi, Hosein and Hirte, Steffen and Kirchmair, Johannes},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={10},
  pages={4031-4046},
  year={2024},
  publisher={ACS Publications}
}

This project is licensed under the MIT License - see the LICENSE file for details.
