RithvikReddy0-0/ML_Bioengineering_Detection


K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens

This repository contains the full implementation of the research project
“K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens.”
It includes preprocessing scripts, trained machine-learning models, notebooks, FASTA test files, and a FastAPI-based web interface for inference.

The system detects whether a plasmid is natural or engineered and predicts its host genus using an interpretable ML pipeline built on 4-mer frequency vectors.


Overview

Features

  • BLAST-based genus inference for synthetic plasmids
  • Automated preprocessing and k-mer feature vectorization
  • Binary (Natural vs Engineered) classification
  • Genus attribution across eight bacterial taxa
  • High-performance Random Forest classifier
  • t-SNE visualizations of learned genomic structure
  • Reproducible code and saved models
  • Deployable FastAPI-based web application

Model Performance

| Task | Accuracy | ROC–AUC |
|---|---|---|
| Binary (Natural vs Engineered) | 98.98% | 0.9994 |
| Genus Attribution (8-class) | 90.34% | 0.9895 |
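
For readers unfamiliar with the two metrics reported above, the following is a minimal pure-Python sketch of how accuracy and binary ROC–AUC are defined. The function names and the toy labels and scores are illustrative only; they are not the project's evaluation code or data. The README does not state how ROC–AUC was averaged for the 8-class task, so only the binary case is shown.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Binary ROC-AUC via pairwise ranking.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    with ties counted as one half.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values, not project results).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(labels, [0, 1, 1, 1]))  # 3 of 4 correct
print(roc_auc(labels, scores))         # 3 of 4 positive/negative pairs ranked correctly
```

In practice, scikit-learn's `accuracy_score` and `roc_auc_score` would normally be used instead of hand-written versions.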

Project Structure

ML_Bioengineered_Detection/
│
├── models/
│   ├── binary/
│   │   ├── binary_model.joblib
│   │   ├── binary_vectorizer.joblib
│   │   ├── binary_simple_model.joblib
│   │   └── binary_simple_vectorizer.joblib
│   │
│   └── genus/
│       ├── genus_full_model.joblib
│       ├── genus_full_vectorizer.joblib
│       ├── genus_simple_model.joblib
│       └── genus_simple_vectorizer.joblib
│
├── notebooks/
│   ├── ML_ENDV2.ipynb
│   └── ML_ENDV3.ipynb
│
├── test_fasta_files/
│   ├── AB282595.1.fasta
│   ├── AP040173.1.fasta
│   └── DL143694.1.fasta
│
├── web_app/
│   ├── test_fasta_files/
│   │   ├── AB282595.1.fasta
│   │   ├── AP040173.1.fasta
│   │   └── DL143694.1.fasta
│   │
│   └── app.py
│
├── requirements.txt
├── README.md
└── LICENSE

Installation

Clone the repository

git clone https://github.com/RithvikReddy0-0/ML_Bioengineered_Detection.git
cd ML_Bioengineered_Detection

Install dependencies

Ensure you have Python installed, then install the required packages:

pip install -r requirements.txt

Usage

Running the Web Application

The project includes a web interface built with FastAPI and served by Uvicorn. Because app.py lives in web_app/, run the command below from that directory:

uvicorn app:app --reload

CLI Testing

You can test specific FASTA files using the command line:

python app.py --file test_fasta_files/AB282595.1.fasta

Adjust the path to point to any FASTA file you want to analyze.


Prediction Pipeline

The end‑to‑end prediction pipeline applies standard k‑mer profiling in five steps:

  1. FASTA upload

    • The user uploads a nucleotide sequence in FASTA format via the web UI or API.
  2. Sequence preprocessing

    • Convert sequence to uppercase.
    • Filter to valid IUPAC nucleotide characters.
    • Optionally truncate sequences longer than 10,000 bp.
  3. 4‑mer feature extraction

    • Slide a window of length 4 along the sequence.
    • Compute normalized 4‑mer frequencies over A, C, G, and T, yielding a 256‑dimensional feature vector (4^4 = 256 possible 4‑mers).
  4. Model inference

    • The 4‑mer vector is passed into two classifiers:
      • binary_model.joblib: predicts Natural vs Engineered origin.
      • genus_full_model.joblib: predicts Genus attribution.
  5. Outputs

    • Origin prediction (Natural / Engineered) with probability or confidence score.
    • Genus prediction with associated probability or confidence.
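
Steps 2 and 3 can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code in app.py: the function names, the exact IUPAC character set, and the choice to skip windows containing ambiguity codes are all assumptions made here so that the output has exactly 256 entries.

```python
from collections import Counter
from itertools import product

IUPAC_CODES = set("ACGTRYSWKMBDHVN")  # assumed set of "valid" characters
MAX_LENGTH = 10_000                   # truncation threshold from step 2


def preprocess(sequence, max_length=MAX_LENGTH):
    """Uppercase, keep only IUPAC nucleotide codes, optionally truncate."""
    cleaned = "".join(ch for ch in sequence.upper() if ch in IUPAC_CODES)
    return cleaned if max_length is None else cleaned[:max_length]


def kmer_frequencies(sequence, k=4):
    """Return normalized k-mer frequencies in a fixed A/C/G/T order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]  # 4**k entries
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows containing ambiguity codes (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


vector = kmer_frequencies(preprocess("acgtacgt"))
print(len(vector))  # 256
```

Fixing the k-mer order (here, the lexicographic order produced by `itertools.product`) matters because a trained model expects each feature position to mean the same k-mer at inference time as it did during training.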

Both “simple” and “full” variants of the models and vectorizers are provided in the models/ directory for experimentation and ablation studies.
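
As a rough illustration of step 4, the sketch below loads a saved model/vectorizer pair and returns the most likely class with its probability. It assumes the artifacts follow scikit-learn conventions (`transform` on the vectorizer, `predict_proba` and `classes_` on the model) and that the vectorizer accepts a list of sequence strings; neither is confirmed by this README, and the helper names are hypothetical.

```python
def load_artifacts(model_path, vectorizer_path):
    """Load a saved model/vectorizer pair (requires joblib to be installed)."""
    import joblib  # imported lazily so the helper below has no hard dependency
    return joblib.load(model_path), joblib.load(vectorizer_path)


def predict_with_confidence(sequence, model, vectorizer):
    """Return (label, probability) for the most likely class.

    Assumes scikit-learn-style objects; the real input format expected
    by the saved vectorizers may differ.
    """
    features = vectorizer.transform([sequence])
    probabilities = model.predict_proba(features)[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return model.classes_[best], float(probabilities[best])
```

Under those assumptions, usage would look like `model, vec = load_artifacts("models/binary/binary_model.joblib", "models/binary/binary_vectorizer.joblib")` followed by `predict_with_confidence(sequence, model, vec)`, with the genus or "simple" artifacts swapped in the same way.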


Notebooks

The notebooks/ directory contains Jupyter notebooks that reproduce the data processing and model development workflow.

  • ML_ENDV2.ipynb

    • Data collection from public nucleotide repositories.
    • BLAST‑based genus inference and label curation.
    • Preprocessing and k‑mer feature generation.
  • ML_ENDV3.ipynb

    • Model training and hyperparameter tuning using scikit‑learn.
    • Evaluation (ROC curves, confusion matrices, and other metrics).
    • Figure generation for the associated research manuscript.

Test FASTA Files

Example sequences are included in test_fasta_files/:

  • AB282595.1.fasta
  • AP040173.1.fasta
  • DL143694.1.fasta

You can quickly verify the pipeline with:

python app.py --file test_fasta_files/AB282595.1.fasta

Each file corresponds to a real genomic record sourced from public nucleotide databases such as NCBI.
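
To inspect one of these files outside the application, a small FASTA reader is enough. The sketch below is a generic standard-library parser written for illustration (the name `read_fasta` is hypothetical, and app.py may parse files differently): header lines start with `>`, and all following lines up to the next header are concatenated into one sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the final record
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open("test_fasta_files/AB282595.1.fasta") as fh:` followed by a loop over `read_fasta(fh)`.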


Citation

If you use this code or these models in your research, please cite:

@article{bioengineered_detection_2025,
  title  = {K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens},
  author = {Reddy, Mukkara Rithvik and Barshilia, Vasudha and Bhattacharya, Debanjali},
  year   = {2025}
}

License

This project is distributed under the MIT License; see the LICENSE file for details.

The MIT License is a permissive open‑source license that allows reuse, modification, and distribution of the software, provided that the original copyright and license notice are included in copies or substantial portions of the software.


Acknowledgements

  • NCBI Nucleotide Database for access to curated genomic sequences used in this work.
  • PLSDB plasmid repository for plasmid sequence resources.
  • Google Colab for providing GPU‑enabled compute resources for training and experiments.
  • Scikit‑learn, NumPy, Matplotlib, and other open‑source libraries that underpin the modeling and analysis pipeline.
