Version 4.7: Deep Learning-Driven Environmental DNA (eDNA) Taxonomic Clustering and Diversity Analysis
The eDNA Biodiversity Analysis Pipeline provides an end-to-end workflow for DNA-based biodiversity discovery.
It simulates, embeds, clusters, and analyzes environmental DNA (eDNA) sequences using a Variational Autoencoder (VAE) and HDBSCAN clustering, all exposed through a Flask web API.
- Mock Sequence Generator: Create realistic eDNA datasets with mutations and random variation
- K-mer Encoding: Convert raw sequences into normalized frequency vectors
- Deep Latent Embedding (VAE): Learn compressed, noise-tolerant DNA representations
- HDBSCAN Clustering: Detect taxonomic groups without fixed cluster numbers
- Biodiversity Metrics: Compute Shannon, Simpson, Pielou, and species richness indices
- Partial Taxonomy for Noise: Infer likely taxa for unclustered sequences
- UMAP Visualization: Interactive 2D plots of latent embeddings
- Flask Web API: Easily integrate into web or bioinformatics workflows
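As a rough illustration of the Mock Sequence Generator feature, the sketch below builds a small FASTA-style dataset by mutating one random ancestor sequence per taxon. The helper names `mutate` and `mock_fasta`, the substitution-only mutation model, and every parameter default are assumptions made for this example; the repository's `generate_mock_edna_sequences()` may work differently.

```python
import random

BASES = "ACGT"

def mutate(sequence, rate, rng):
    """Substitute each base with a different base with probability `rate`."""
    out = []
    for base in sequence:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def mock_fasta(n_taxa=3, per_taxon=4, length=60, rate=0.05, seed=42):
    """Build FASTA text: one random ancestor per taxon, emitted as mutated copies."""
    rng = random.Random(seed)  # local generator so results are reproducible
    records = []
    for t in range(n_taxa):
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        for i in range(per_taxon):
            records.append(f">taxon{t}_seq{i}\n{mutate(ancestor, rate, rng)}")
    return "\n".join(records)

fasta_text = mock_fasta()
print(fasta_text.splitlines()[0])  # header of the first record: >taxon0_seq0
```

Sequences derived from the same ancestor differ by only a few substitutions, which gives a clustering step some group structure to find.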
eDNA-Biodiversity
├── analysis.py        # Core ML and analysis pipeline
├── main.py            # Flask web service exposing endpoints
├── templates/
│   └── index.html     # Optional front-end page for testing
└── README.md
git clone https://github.com/<your-username>/eDNA-Biodiversity.git
cd eDNA-Biodiversity
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
python3 -m pip install -r requirements.txt

If you don't have a requirements.txt, install manually:

python3 -m pip install flask tensorflow scikit-learn numpy matplotlib umap-learn hdbscan
python3 main.py

Access the web app at:
http://localhost:5000
Generates 1000 mock DNA sequences for testing.
Response Example
{
"sequences": ">sample_sequence_1\nATGCTAG...\n>sample_sequence_2\nTGCATGA...\n..."
}

Runs the complete biodiversity pipeline.
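The `sequences` field above carries FASTA-formatted text inside a JSON string. Here is a minimal sketch of turning such text into (header, sequence) pairs, assuming plain FASTA with `>` header lines. `parse_fasta` is a hypothetical helper, not a function from `analysis.py`, and the short sequences in the payload are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap, so join them; lines before any header are ignored.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Illustrative payload with invented sequences
payload = {"sequence": ">seq1\nATGCTAGCTAG\n>seq2\nCGTA\nTCGT\n"}
for name, seq in parse_fasta(payload["sequence"]):
    print(name, seq)
```

With this payload the loop prints `seq1 ATGCTAGCTAG` and `seq2 CGTATCGT`; the wrapped second record is joined into one sequence.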
Request JSON
{
"sequence": ">seq1\nATGCTAGCTAG...\n>seq2\nCGTATCGT...\n..."
}

Response JSON
{
"summary": {
"biodiversity_metrics": {
"species_richness": 42,
"shannon_diversity_index": 2.8765,
"simpson_diversity_index": 0.9132,
"pielou_evenness_index": 0.8231
},
"taxonomic_summary": {
"identified_taxa": 36,
"novel_taxa": 6,
"noise_points": 12
},
"taxa_details": [...],
"noise_point_details": [...]
},
"plots": {
"cluster_plot": "<base64_image>"
}
}

| Step | Description | Function |
|---|---|---|
| 1 | Generate mock or input eDNA sequences | generate_mock_edna_sequences() |
| 2 | Convert sequences into 4-mer vectors | sequences_to_kmers() |
| 3 | Train VAE for latent embeddings | train_vae_and_get_embeddings() |
| 4 | Cluster embeddings using HDBSCAN | run_hdbscan() |
| 5 | Compute biodiversity metrics | analyze_biodiversity() |
| 6 | Save 2D UMAP visualization | save_cluster_scatter() |
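Step 2 in the table can be sketched without any dependencies. The function below counts overlapping 4-mers over the alphabet A, C, G, T and divides by the total so each vector sums to 1. The name `kmer_frequencies` is hypothetical, and skipping k-mers that contain ambiguous bases is an assumption; `sequences_to_kmers()` in `analysis.py` may normalize or handle such bases differently.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k=4):
    """Return a list of 4**k normalized k-mer frequencies in lexicographic order."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # K-mers with ambiguous bases (e.g. N) are not in the vocabulary and are skipped.
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]

vector = kmer_frequencies("ATGCTAGCTAG")
print(len(vector), round(sum(vector), 6))  # 256 1.0
```

A fixed-length vector like this is what lets sequences of different lengths share one model input size.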
import analysis
# Step 1: Generate mock eDNA data
sequences = analysis.generate_mock_edna_sequences(500)
# Step 2: Convert to k-mer vectors
kmer_vectors = analysis.sequences_to_kmers(sequences)
# Step 3: Train VAE and extract embeddings
embeddings = analysis.train_vae_and_get_embeddings(kmer_vectors)
# Step 4: Cluster with HDBSCAN
labels = analysis.run_hdbscan(embeddings)
# Step 5: Analyze biodiversity
summary = analysis.analyze_biodiversity(labels, embeddings, sequences)
print(summary['biodiversity_metrics'])

--- Starting VAE Training ---
--- VAE Training and Embedding Extraction Complete ---
--- Running HDBSCAN (min_cluster_size=10) ---
HDBSCAN found 8 clusters and 12 noise points.
--- Analyzing Biodiversity ---
Biodiversity analysis complete.
Biodiversity Metrics:
- Species Richness: 42
- Shannon Index: 2.87
- Simpson Index: 0.91
- Pielou Evenness: 0.82
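The four indices reported above can be computed from cluster labels alone. The sketch below treats each cluster as one taxon and ignores the label `-1`, which HDBSCAN assigns to noise points. It assumes the Gini-Simpson form (1 minus the sum of squared proportions) and natural logarithms; `diversity_metrics` is a hypothetical helper, and `analyze_biodiversity()` may use different conventions.

```python
import math
from collections import Counter

def diversity_metrics(labels):
    """Compute richness, Shannon, Gini-Simpson, and Pielou evenness from labels."""
    counts = Counter(label for label in labels if label != -1)  # -1 marks noise
    total = sum(counts.values())
    if total == 0:
        return {"species_richness": 0, "shannon": 0.0, "simpson": 0.0, "pielou": 0.0}
    proportions = [n / total for n in counts.values()]
    richness = len(proportions)
    shannon = -sum(p * math.log(p) for p in proportions)
    simpson = 1.0 - sum(p * p for p in proportions)
    # Pielou evenness is Shannon divided by its maximum, ln(richness).
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    return {
        "species_richness": richness,
        "shannon": shannon,
        "simpson": simpson,
        "pielou": pielou,
    }

labels = [0, 0, 1, 1, 2, 2, -1]  # three equal clusters plus one noise point
print(diversity_metrics(labels))
```

For three equally sized clusters the Shannon index equals ln 3 and the Pielou evenness equals 1, a convenient sanity check.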
| Area | Library |
|---|---|
| Machine Learning | TensorFlow / Keras |
| Clustering | HDBSCAN |
| Dimensionality Reduction | UMAP |
| Preprocessing | scikit-learn |
| Visualization | Matplotlib |
| API Framework | Flask |
- MOKSHI SHAH: Developer in AI, ML
- Rishit Modi: Developer in AI, ML
- Mahi Desai: Deep Learning Researcher
- ARYAN DOSHI: Developer in AI, ML
- INDRANEEL HAJARNIS: Developer in AI, ML
- Requires ≥50 valid sequences for meaningful clustering
- Random seeds are fixed (`42`) for reproducibility
- Mock taxa and confidences are simulated for demonstration
If you find this project useful, consider starring the repo to support development!