K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens

This repository contains the full implementation of the research project
“K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens.”
It includes preprocessing scripts, trained machine-learning models, notebooks, FASTA test files, and a FastAPI-based web interface for inference.

The system detects whether a plasmid is natural or engineered and predicts its host genus using an interpretable ML pipeline built on 4-mer frequency vectors.

Overview

Features

BLAST-based genus inference for synthetic plasmids
Automated preprocessing and k-mer feature vectorization
Binary (Natural vs Engineered) classification
Genus attribution across eight bacterial taxa
High-performance Random Forest classifier
t-SNE visualizations of learned genomic structure
Reproducible code and saved models
Deployable FastAPI-based web application

Model Performance

Task                             Accuracy   ROC–AUC
Binary (Natural vs Engineered)   98.98%     0.9994
Genus Attribution (8-class)      90.34%     0.9895

Project Structure

ML_Bioengineered_Detection/
│
├── models/
│   ├── binary/
│   │   ├── binary_model.joblib
│   │   ├── binary_vectorizer.joblib
│   │   ├── binary_simple_model.joblib
│   │   └── binary_simple_vectorizer.joblib
│   │
│   └── genus/
│       ├── genus_full_model.joblib
│       ├── genus_full_vectorizer.joblib
│       ├── genus_simple_model.joblib
│       └── genus_simple_vectorizer.joblib
│
├── notebooks/
│   ├── ML_ENDV2.ipynb
│   └── ML_ENDV3.ipynb
│
├── test_fasta_files/
│   ├── AB282595.1.fasta
│   ├── AP040173.1.fasta
│   └── DL143694.1.fasta
│
├── web_app/
│   ├── test_fasta_files/
│   │   ├── AB282595.1.fasta
│   │   ├── AP040173.1.fasta
│   │   └── DL143694.1.fasta
│   │
│   └── app.py
│
├── requirements.txt
├── README.md
└── LICENSE

Installation

Clone the repository

git clone https://github.com/RithvikReddy0-0/ML_Bioengineered_Detection.git
cd ML_Bioengineered_Detection

Install dependencies

Ensure you have Python installed, then install the required packages:

pip install -r requirements.txt

Usage

Running the Web Application

The project includes a web interface built with FastAPI and served with Uvicorn. Because app.py lives in web_app/, start the server from that directory:

cd web_app
uvicorn app:app --reload

App Interface: http://127.0.0.1:8000
API Documentation: http://127.0.0.1:8000/docs

CLI Testing

You can also test individual FASTA files from the command line (run from web_app/, where app.py lives):

python app.py --file test_fasta_files/AB282595.1.fasta

Adjust the path to point to any FASTA file you want to analyze.

Prediction Pipeline

The end‑to‑end prediction pipeline follows these steps, using standard k‑mer profiling approaches for genomic analysis.

1. FASTA upload
   - The user uploads a nucleotide sequence in FASTA format via the web UI or API.
2. Sequence preprocessing
   - Convert the sequence to uppercase.
   - Filter to valid IUPAC nucleotide characters.
   - Optionally truncate sequences longer than 10,000 bp.
3. 4‑mer feature extraction
   - Slide a window of length 4 along the sequence.
   - Compute normalized 4‑mer frequencies, yielding a 256‑dimensional feature vector.
4. Model inference
   - The 4‑mer vector is passed into two classifiers:
     - binary_model.joblib: predicts Natural vs Engineered origin.
     - genus_full_model.joblib: predicts genus attribution.
5. Outputs
   - Origin prediction (Natural / Engineered) with probability or confidence score.
   - Genus prediction with associated probability or confidence.
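
The preprocessing and 4‑mer extraction steps above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation. The function names (clean_sequence, kmer_frequencies), the choice to skip windows that contain ambiguity codes such as N, and the lexicographic ordering of the 256 features are all assumptions; the saved vectorizers may order or normalize features differently.

```python
from collections import Counter
from itertools import product

K = 4                            # window length used by the pipeline
MAX_LEN = 10_000                 # optional truncation threshold (bp)
IUPAC = set("ACGTRYSWKMBDHVN")   # IUPAC nucleotide codes

# All 4**4 = 256 possible 4-mers over A/C/G/T, in lexicographic order.
# Assumption: this ordering is illustrative, not the saved vectorizer's.
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def clean_sequence(seq, max_len=MAX_LEN):
    """Uppercase, keep only IUPAC nucleotide codes, optionally truncate."""
    cleaned = "".join(ch for ch in seq.upper() if ch in IUPAC)
    return cleaned[:max_len] if max_len else cleaned


def kmer_frequencies(seq):
    """Return normalized 4-mer frequencies as a list aligned with KMERS."""
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Assumption: windows containing ambiguity codes (e.g. N) are ignored,
    # which keeps the vector at exactly 256 dimensions.
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


vector = kmer_frequencies(clean_sequence("acgtacgtnn-acgt"))
print(len(vector))  # 256
```

In this toy input, six windows contain only A/C/G/T and three of them are ACGT, so the ACGT entry of the vector is 0.5 and all entries sum to 1.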

Both “simple” and “full” variants of the models and vectorizers are provided in the models/ directory for experimentation and ablation studies.
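
For experimentation outside the web app, the inference step might look like the sketch below. It assumes the saved classifiers expose scikit-learn's predict_proba and classes_ attributes, which is typical for a Random Forest stored with joblib but is not confirmed here. The helper names (predict_with_confidence, load_models) are hypothetical, and whether the models expect the raw 256-value vector or the output of the matching *_vectorizer.joblib should be checked against app.py.

```python
def predict_with_confidence(model, features):
    """Return (label, probability) for a single feature vector.

    Assumption: `model` follows scikit-learn's classifier interface
    (`predict_proba` and `classes_`).
    """
    probabilities = model.predict_proba([features])[0]
    # Index of the most probable class.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return model.classes_[best], float(probabilities[best])


def load_models(model_dir="models"):
    """Load the binary and genus classifiers from the models/ directory."""
    import joblib  # imported lazily so the helper above needs no dependencies

    binary = joblib.load(f"{model_dir}/binary/binary_model.joblib")
    genus = joblib.load(f"{model_dir}/genus/genus_full_model.joblib")
    return binary, genus


# Hypothetical usage, run from the repository root:
# binary_model, genus_model = load_models()
# origin, p_origin = predict_with_confidence(binary_model, features)
# genus, p_genus = predict_with_confidence(genus_model, features)
```

Swapping in the "simple" variants only requires changing the file names passed to joblib.load.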

Notebooks

The notebooks/ directory contains Jupyter notebooks that reproduce the data processing and model development workflow.

ML_ENDV2.ipynb
- Data collection from public nucleotide repositories.
- BLAST‑based genus inference and label curation.
- Preprocessing and k‑mer feature generation.
ML_ENDV3.ipynb
- Model training and hyperparameter tuning using scikit‑learn.
- Evaluation (ROC curves, confusion matrices, and other metrics).
- Figure generation for the associated research manuscript.
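
As a reminder of what the two reported metrics measure, here is a dependency-free sketch of accuracy and binary ROC–AUC. It is an illustration only: the notebooks presumably rely on scikit-learn's own metric functions (such as accuracy_score and roc_auc_score), and how the 8-class AUC is averaged is not stated here. The function names and toy data are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Binary ROC-AUC via its rank interpretation: the probability that a
    random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = Engineered, 0 = Natural (label encoding is an assumption).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
print(accuracy(y_true, y_pred), roc_auc(y_true, scores))  # 0.5 0.75
```

The pairwise form is quadratic in the number of samples, so it is suitable for small checks rather than full evaluation runs.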

Test FASTA Files

Example sequences are included in test_fasta_files/:

AB282595.1.fasta
AP040173.1.fasta
DL143694.1.fasta

You can quickly verify the pipeline with:

python app.py --file test_fasta_files/AB282595.1.fasta

Each file corresponds to a real genomic record sourced from public nucleotide databases such as NCBI.
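
To inspect one of these files without starting the app, a minimal FASTA reader could look like the following. This is a sketch using only the standard library; the function name read_fasta is hypothetical, and the application may parse files differently (for example with Biopython).

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical usage:
# for header, seq in read_fasta("test_fasta_files/AB282595.1.fasta"):
#     print(header, len(seq))
```

Sequence lines are joined as-is, so case normalization and character filtering remain the job of the preprocessing step.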

Citation

If you use this code or models in your research, please cite:

@article{bioengineered_detection_2025,
  title  = {K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens},
  author = {Reddy, Mukkara Rithvik and Barshilia, Vasudha and Bhattacharya, Debanjali},
  year   = {2025}
}

License

This project is distributed under the MIT License; see the LICENSE file for details.

The MIT License is a permissive open‑source license that allows reuse, modification, and distribution of the software, provided that the original copyright and license notice are included in copies or substantial portions of the software.

Acknowledgements

NCBI Nucleotide Database for access to curated genomic sequences used in this work.
PLSDB plasmid repository for plasmid sequence resources.
Google Colab for providing GPU‑enabled compute resources for training and experiments.
Scikit‑learn, NumPy, Matplotlib, and other open‑source libraries that underpin the modeling and analysis pipeline.
