From f47c5f00bce356bc8eced26df8f533e1d5ad7a58 Mon Sep 17 00:00:00 2001
From: Zachary Fralish <127516906+zacharyfralish@users.noreply.github.com>
Date: Fri, 24 Mar 2023 16:36:29 -0400
Subject: [PATCH] Update README.md

---
 README.md | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 261b8f9..54bfdb3 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,26 @@
 # Active Subsampling
-Using active learning for data curation.
+**Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning**
+
+## Overview
+We use active machine learning as an autonomous and adaptive data subsampling strategy and show that active learning-based subsampling can lead to better molecular machine learning performance than both training models on the complete training data and 19 state-of-the-art subsampling strategies. We find that active learning is robust to errors in the data, highlighting the utility of this approach for low-quality datasets. Taken together, we describe a new, adaptive machine learning pre-processing approach and provide novel insights into the behavior and robustness of active machine learning for the molecular sciences.
+
+For more information, please refer to: [Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning](https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/63e5c76e1d2d18406337135d/original/improving-molecular-machine-learning-through-adaptive-subsampling-with-active-learning.pdf)
+
+If you use this data or code, please kindly cite: Wen, Y., Li, Z., Xiang, Y., & Reker, D. (2023). Improving Molecular Machine Learning Through Adaptive Subsampling with Active Learning.
 ## Files
 - **code.py** contains all code and functions to run and evaluate active learning subsampling
 - **Example_workflow_for_AL_Subsampling.ipynb** contains an example notebook that runs BBBP but can be run out of the box on a local machine or on Google Colab to apply this technique to new datasets
+## Dependencies
+* [numpy](https://numpy.org/)
+* [scipy](https://scipy.org/)
+* [pandas](https://github.com/pandas-dev/pandas)
+* [scikit-learn](https://scikit-learn.org/stable/)
+* [deepchem](https://deepchem.io/)
+* [matplotlib](https://matplotlib.org/)
+
+
 ## Quickstart
 Datasets can be loaded from DeepChem
@@ -14,7 +30,7 @@ tasks, data, transformers = dc.molnet.load_bbbp(splitter=None)
 bbbp = data[0]
 ```
 
-Model and performance metric need to be initialized, we recommend random forest models and Matthews correlation coefficient
+The model and performance metric need to be initialized; we recommend random forest models and the Matthews correlation coefficient (MCC)
 ```
 # initialize model and performance metric
 model = RF()
 metric = mcc()
 ```
@@ -35,7 +51,7 @@ pl.savefig("learning_curve.pdf")
 pl.close()
 ```
 
-Delta Performance can be directly calculated from the resulting curves
+Delta performance can be directly calculated from the resulting curves
 ```
 # report deltaPerformance
 print(calc_deltaPerformances(result))
@@ -47,6 +63,4 @@ Subsampled data can be extracted by calling the subsample_data function
 subsample = subsample_data(model, data, metric, 5)
 ```
 
-## Dependencies
-This code uses numpy, scipy, sklearn, numpy, deepchem, and matplotlib.
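The quickstart above relies on the repo's own `RF`, `mcc`, and `subsample_data` helpers from code.py plus a DeepChem dataset. For readers without those installed, the underlying idea can be sketched with scikit-learn alone: train on a small seed set, repeatedly query the points the model is least certain about, and keep the accumulated indices as the curated subsample. This is a minimal illustrative sketch, not the repo's implementation; the synthetic data, the `RandomForestClassifier`, and the probability-near-0.5 uncertainty heuristic are stand-in assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef

# synthetic stand-in for a featurized molecular dataset (not BBBP)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# seed set: 5 examples per class so the first model sees both labels
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in set(labeled)]

model = RandomForestClassifier(n_estimators=100, random_state=0)
for _ in range(20):                      # 20 rounds x 5 picks = 100 additions
    model.fit(X[labeled], y[labeled])
    # uncertainty sampling: pool points whose predicted
    # class-1 probability is closest to 0.5
    proba = model.predict_proba(X[pool])[:, 1]
    picks = np.argsort(np.abs(proba - 0.5))[:5]
    for p in sorted(picks, reverse=True): # pop from the back to keep indices valid
        labeled.append(pool.pop(p))

subsample = np.array(labeled)            # indices of the curated subsample
model.fit(X[subsample], y[subsample])
score = matthews_corrcoef(y, model.predict(X))
print(len(subsample))                    # 110 (10 seed + 100 queried)
```

The final model here is trained on the 110 selected points only, mirroring the paper's claim that an actively chosen subsample can stand in for the full training set; `score` is the MCC of that model over all 500 points.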