From f47c5f00bce356bc8eced26df8f533e1d5ad7a58 Mon Sep 17 00:00:00 2001
From: Zachary Fralish <127516906+zacharyfralish@users.noreply.github.com>
Date: Fri, 24 Mar 2023 16:36:29 -0400
Subject: [PATCH] Update README.md

---
 README.md | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 261b8f9..54bfdb3 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,26 @@
 # Active Subsampling
-Using active learning for data curation.
+**Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning**
+
+## Overview
+We use active machine learning as an autonomous and adaptive data subsampling strategy and show that active learning-based subsampling can lead to better molecular machine learning performance than both training models on the complete training data and 19 state-of-the-art subsampling strategies. We find that active learning is robust to errors in the data, highlighting the utility of this approach for low-quality datasets. Taken together, we describe a new, adaptive machine learning pre-processing approach and provide novel insights into the behavior and robustness of active machine learning for the molecular sciences.
+
+For more information, please refer to: [Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning](https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/63e5c76e1d2d18406337135d/original/improving-molecular-machine-learning-through-adaptive-subsampling-with-active-learning.pdf)
+
+If you use this data or code, please kindly cite: Wen, Y., Li, Z., Xiang, Y., & Reker, D. (2023). Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning.
 ## Files
 - **code.py** contains all code and functions to run and evaluate active learning subsampling
 - **Example_workflow_for_AL_Subsampling.ipynb** contains an example notebook that runs BBBP but can be run out of the box on a local machine or on Google Colab to apply this technique to new datasets
+## Dependencies
+* [numpy](https://numpy.org/)
+* [scipy](https://scipy.org/)
+* [pandas](https://github.com/pandas-dev/pandas)
+* [scikit-learn](https://scikit-learn.org/stable/)
+* [deepchem](https://deepchem.io/)
+* [matplotlib](https://matplotlib.org/)
+
+
 ## Quickstart
 Datasets can be loaded from DeepChem
@@ -14,7 +30,7 @@ tasks, data, transformers = dc.molnet.load_bbbp(splitter=None)
 bbbp = data[0]
 ```
 
-Model and performance metric need to be initialized, we recommend random forest models and Matthews correlation coefficient
+The model and performance metric need to be initialized; we recommend random forest models and the Matthews correlation coefficient (MCC)
 ```
 # initialize model and performance metric
 model = RF()
 metric = mcc()
 ```
@@ -35,7 +51,7 @@ pl.savefig("learning_curve.pdf")
 pl.close()
 ```
 
-Delta Performance can be directly calculated from the resulting curves
+Delta performance can be directly calculated from the resulting curves
 ```
 # report deltaPerformance
 print(calc_deltaPerformances(result))
@@ -47,6 +63,4 @@ Subsampled data can be extracted by calling the subsample_data function
 subsample = subsample_data(model, data, metric, 5)
 ```
 
-## Dependencies
-This code uses numpy, scipy, sklearn, numpy, deepchem, and matplotlib.
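The quickstart above relies on the repo's own `RF`, `mcc`, and `subsample_data` helpers from code.py plus a DeepChem dataset. For readers without those installed, the underlying idea can be sketched with scikit-learn alone: train on a small seed set, repeatedly query the points the model is least certain about, and keep the accumulated indices as the curated subsample. This is a minimal illustrative sketch, not the repo's implementation; the synthetic data, the `RandomForestClassifier`, and the probability-near-0.5 uncertainty heuristic are stand-in assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

# synthetic stand-in for a featurized molecular dataset (not BBBP)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# seed set: 5 examples per class so the first model sees both labels
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(20):                      # 20 rounds x 5 picks = 100 additions
    model.fit(X[labeled], y[labeled])
    # uncertainty sampling: pool points whose predicted
    # class-1 probability is closest to 0.5
    proba = model.predict_proba(X[pool])[:, 1]
    picks = np.argsort(np.abs(proba - 0.5))[:5]
    for p in sorted(picks, reverse=True): # pop from the back to keep indices valid
        labeled.append(pool.pop(p))

subsample = np.array(labeled)            # indices of the curated subsample
model.fit(X[subsample], y[subsample])
score = matthews_corrcoef(y, model.predict(X))
print(len(subsample))                    # 110 (10 seed + 100 queried)
```

The final model here is trained on the 110 selected points only, mirroring the paper's claim that an actively chosen subsample can stand in for the full training set; `score` is the MCC of that model over all 500 points.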