Overview

mol_data_prep is a collection of R and Python scripts that streamline the preparation of molecular datasets for machine learning tasks. It provides methods for descriptor calculation, dataset standardization, response variable binarization, clustering for test set selection, and synthetic data generation using SMOTE.

The main script provides a class with methods for preparing molecular datasets for machine learning:

Descriptor calculation using MORDRED (peptide descriptors and fingerprints soon to be added)
Standardization
Binarization of the response variable
Clustering strategies for building more representative test sets
Synthetic datapoint generation through SMOTE (SMOGN soon to be added)
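
To make three of the steps listed above concrete, here is a minimal pure-Python sketch of standardization, response binarization, and SMOTE-style interpolation. It is not the mol_data_prep API: the function names (`standardize`, `binarize`, `smote_like`), their parameters, and the toy values are hypothetical and chosen only for illustration, and the interpolation is a simplified version of SMOTE rather than the full algorithm.

```python
import random
import statistics


def standardize(rows):
    """Z-score every descriptor column: (x - mean) / population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    # A constant column has zero spread, so it is mapped to 0.0.
    return [
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def binarize(values, threshold):
    """Label a continuous response as 1 (at or above threshold) or 0."""
    return [1 if v >= threshold else 0 for v in values]


def smote_like(minority_rows, n_new, k=2, seed=0):
    """Interpolate between a minority row and one of its k nearest
    minority neighbours (needs at least two minority rows)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Rank the other minority rows by squared Euclidean distance.
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Hypothetical toy data: three molecules, two descriptors each.
descriptors = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
activity = [5.2, 6.8, 7.0]

scaled = standardize(descriptors)
labels = binarize(activity, threshold=7.0)
new_rows = smote_like(descriptors, n_new=4)
```

In practice these steps are usually delegated to established libraries such as scikit-learn and imbalanced-learn; whether mol_data_prep wraps them is not stated here, so treat the sketch as an explanation of the ideas only.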

Usage

The "example" folder contains a step-by-step example of a standard procedure for a classification problem.

Citation

A similar pipeline was originally used in:

Machado, L. A., Krempser, E. and Guimarães, A. C. R. A machine learning-based virtual screening for natural compounds capable of inhibiting the HIV-1 integrase. Frontiers in Drug Discovery (2022).

