This repository contains the full implementation of the research project
“K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens.”
It includes preprocessing scripts, trained machine-learning models, notebooks, FASTA test files, and a FastAPI-based web interface for inference.
The system detects whether a plasmid is natural or engineered and predicts its host genus using an interpretable ML pipeline built on 4-mer frequency vectors.
- BLAST-based genus inference for synthetic plasmids
- Automated preprocessing and k-mer feature vectorization
- Binary (Natural vs Engineered) classification
- Genus attribution across eight bacterial taxa
- High-performance Random Forest classifier
- t-SNE visualizations of learned genomic structure
- Reproducible code and saved models
- Deployable FastAPI-based web application
| Task | Accuracy | ROC–AUC |
|---|---|---|
| Binary (Natural vs Engineered) | 98.98% | 0.9994 |
| Genus Attribution (8-class) | 90.34% | 0.9895 |
ML_Bioengineered_Detection/
│
├── models/
│   ├── binary/
│   │   ├── binary_model.joblib
│   │   ├── binary_vectorizer.joblib
│   │   ├── binary_simple_model.joblib
│   │   └── binary_simple_vectorizer.joblib
│   │
│   └── genus/
│       ├── genus_full_model.joblib
│       ├── genus_full_vectorizer.joblib
│       ├── genus_simple_model.joblib
│       └── genus_simple_vectorizer.joblib
│
├── notebooks/
│   ├── ML_ENDV2.ipynb
│   └── ML_ENDV3.ipynb
│
├── test_fasta_files/
│   ├── AB282595.1.fasta
│   ├── AP040173.1.fasta
│   └── DL143694.1.fasta
│
├── web_app/
│   ├── test_fasta_files/
│   │   ├── AB282595.1.fasta
│   │   ├── AP040173.1.fasta
│   │   └── DL143694.1.fasta
│   │
│   └── app.py
│
├── requirements.txt
├── README.md
└── LICENSE
    git clone https://github.com/RithvikReddy0-0/ML_Bioengineered_Detection.git
    cd ML_Bioengineered_Detection

Ensure you have Python installed, then install the required packages:

    pip install -r requirements.txt

The project includes a web interface built with FastAPI and served by Uvicorn. Start it from the web_app/ directory, where app.py lives:

    uvicorn app:app --reload

- App Interface: http://127.0.0.1:8000
- API Documentation: http://127.0.0.1:8000/docs

You can test specific FASTA files using the command line:

    python app.py --file test_fasta_files/AB282595.1.fasta

Adjust the path to point to any FASTA file you want to analyze.
The end‑to‑end prediction pipeline follows these steps, using standard k‑mer profiling approaches for genomic analysis.
- FASTA upload
  - The user uploads a nucleotide sequence in FASTA format via the web UI or API.
- Sequence preprocessing
  - Convert sequence to uppercase.
  - Filter to valid IUPAC nucleotide characters.
  - Optionally truncate sequences longer than 10,000 bp.
- 4‑mer feature extraction
  - Slide a window of length 4 along the sequence.
  - Compute normalized 4‑mer frequencies, yielding a 256‑dimensional feature vector.
- Model inference
  - The 4‑mer vector is passed into two classifiers:
    - binary_model.joblib: predicts Natural vs Engineered origin.
    - genus_full_model.joblib: predicts genus attribution.
- Outputs
  - Origin prediction (Natural / Engineered) with probability or confidence score.
  - Genus prediction with associated probability or confidence.
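
The preprocessing and 4‑mer extraction steps above can be sketched in plain Python. This is a minimal illustration, not the code shipped in app.py: the function names `preprocess` and `kmer_frequencies` are hypothetical, and the choice to skip windows containing ambiguity codes (such as N) is an assumption made so that the output has exactly 4^4 = 256 dimensions.

```python
from collections import Counter
from itertools import product
from typing import List, Optional

# IUPAC nucleotide codes: the four bases plus ambiguity symbols.
IUPAC_CODES = set("ACGTRYSWKMBDHVN")
MAX_LENGTH = 10_000  # truncation limit mentioned in the pipeline description


def preprocess(sequence: str, max_length: Optional[int] = MAX_LENGTH) -> str:
    """Uppercase the sequence, keep IUPAC characters only, optionally truncate."""
    cleaned = "".join(ch for ch in sequence.upper() if ch in IUPAC_CODES)
    return cleaned if max_length is None else cleaned[:max_length]


def kmer_frequencies(sequence: str, k: int = 4) -> List[float]:
    """Return normalized k-mer frequencies in lexicographic A, C, G, T order.

    Assumption: windows containing any non-ACGT character are ignored,
    so the vector always has 4**k entries (256 for k = 4).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        # Sequence too short or without any unambiguous window.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


vector = kmer_frequencies(preprocess("acgtacgtnn"))
print(len(vector))  # 256
```

Normalizing by the number of counted windows makes vectors from plasmids of different lengths comparable, which is presumably why the pipeline uses frequencies rather than raw counts.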
Both “simple” and “full” variants of the models and vectorizers are provided in the models/ directory for experimentation and ablation studies.
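
As a rough illustration of how one of these vectorizer/model pairs might be loaded and applied, the sketch below uses `joblib.load` and the scikit‑learn conventions `transform`, `predict_proba`, and `classes_`. It rests on assumptions: that each saved vectorizer accepts a list of sequence strings, and that each model exposes class probabilities. The helper names `load_artifacts` and `predict_label` are hypothetical, and the actual logic in app.py may differ.

```python
from pathlib import Path


def load_artifacts(model_dir, name):
    """Load a (vectorizer, model) pair, e.g. name='binary' or 'genus_full'.

    Assumes the file naming seen in models/: <name>_vectorizer.joblib
    and <name>_model.joblib.
    """
    import joblib  # third-party; imported lazily so predict_label works without it

    model_dir = Path(model_dir)
    vectorizer = joblib.load(model_dir / f"{name}_vectorizer.joblib")
    model = joblib.load(model_dir / f"{name}_model.joblib")
    return vectorizer, model


def predict_label(vectorizer, model, sequence):
    """Return (label, probability) for the most likely class.

    Assumption: the vectorizer turns a list of raw sequences into the
    feature matrix the model was trained on.
    """
    features = vectorizer.transform([sequence])
    probabilities = model.predict_proba(features)[0]
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return model.classes_[best], float(probabilities[best])
```

Under these assumptions, `load_artifacts("models/binary", "binary")` would return the pair used for origin prediction, and swapping in `"binary_simple"` would select the simple variant for comparison.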
The notebooks/ directory contains Jupyter notebooks that reproduce the data processing and model development workflow.
- ML_ENDV2.ipynb
  - Data collection from public nucleotide repositories.
  - BLAST‑based genus inference and label curation.
  - Preprocessing and k‑mer feature generation.
- ML_ENDV3.ipynb
  - Model training and hyperparameter tuning using scikit‑learn.
  - Evaluation (ROC curves, confusion matrices, and other metrics).
  - Figure generation for the associated research manuscript.
Example sequences are included in test_fasta_files/:
- AB282595.1.fasta
- AP040173.1.fasta
- DL143694.1.fasta

You can quickly verify the pipeline with:

    python app.py --file test_fasta_files/AB282595.1.fasta

Each file corresponds to a real genomic record sourced from public nucleotide databases such as NCBI.
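
To inspect one of these files before running the pipeline, a small standard‑library FASTA reader is enough. The sketch below is illustrative only: `read_fasta` is a hypothetical helper, not a function from this repository, and it assumes plain‑text FASTA with `>` header lines.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record begins; emit the previous one if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

For example, `for header, seq in read_fasta("test_fasta_files/AB282595.1.fasta"): print(header, len(seq))` would list each record's header and length.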
If you use this code or the models in your research, please cite:

    @article{bioengineered_detection_2025,
      title = {K-mer–Driven Genome-Informed Machine Intelligence for Detecting Bioengineered Pathogens},
      author = {Reddy, Mukkara Rithvik and Barshilia, Vasudha and Bhattacharya, Debanjali},
      year = {2025}
    }

This project is distributed under the MIT License; see the LICENSE file for details.
The MIT License is a permissive open‑source license that allows reuse, modification, and distribution of the software, provided that the original copyright and license notice are included in copies or substantial portions of the software.
- NCBI Nucleotide Database for access to curated genomic sequences used in this work.
- PLSDB plasmid repository for plasmid sequence resources.
- Google Colab for providing GPU‑enabled compute resources for training and experiments.
- Scikit‑learn, NumPy, Matplotlib, and other open‑source libraries that underpin the modeling and analysis pipeline.