GEMS is an end-to-end platform for fungal metabolic model reconstruction, ML-driven growth-condition optimisation, and geometry-aware optimisation design using polytope sampling. It combines a ModelSEED-based genome-scale model (GEM) pipeline, a trained multi-target regressor, and a convex-geometry analysis layer for four industrial fungal strains.
- Overview
- Architecture
- Project Structure
- Backend
- Frontend
- GEM Pipeline (src/ + scripts/)
- Experimental Analysis: Polytope Module
- Data
- Quick Start
- CLI Usage
- API Reference
- Supported Fungal Strains
GEMS has three integrated components:
| Component | Purpose |
|---|---|
| GEM Pipeline | Protein FASTA → draft metabolic model → gapfill → FBA analysis → validation |
| ML Recommender | Historical growth data (online learning) → train Random Forest / XGBoost / LightGBM → recommend optimal media conditions |
| Experimental / Geometry Aware | Fungal GEM + scenario generator → polytope sampling → geometry features → surrogate ML → industrial ranking |
All three components are accessible through a single Streamlit UI and a FastAPI backend.
┌─────────────────────────────────────────────────────────────────────┐
│                   Streamlit UI (frontend_app.py)                    │
│  ┌───────────────┐  ┌──────────────────────┐  ┌───────────────┐     │
│  │ GEM Pipeline  │  │    ML Recommender    │  │ Experimental  │     │
│  │      Tab      │  │         Tab          │  │ Analysis Tab  │     │
│  └───────┬───────┘  └──────────┬───────────┘  └───────┬───────┘     │
└──────────┼─────────────────────┼──────────────────────┼─────────────┘
           │ HTTP (REST)         │ Direct Python        │ Direct Python
           ▼                     ▼                      ▼
┌─────────────────────┐  ┌────────────────────────┐  ┌──────────────────────┐
│  FastAPI Backend    │  │ ML Backend (backend/)  │  │ Experimental/        │
│  (backend/main.py)  │  │ model_trainer.py       │  │ Polytopes/           │
│                     │  │ recommender.py         │  │ dataset_builder.py   │
│  POST /run          │  │ retrainer.py           │  │ train_model.py       │
│  POST /run/custom   │  │ data_ingestion.py      │  │ postprocess_scores.py│
│  GET  /health       │  └────────────────────────┘  └──────────────────────┘
└──────────┬──────────┘
           │ subprocess
           ▼
┌───────────────────────────────────────────────┐
│ GEM Pipeline (scripts/ + src/)                │
│ run_mvp_pipeline.py → analyze_mvp.py          │
│                     → validate_mvp.py         │
└───────────────────────────────────────────────┘
GEMS/
├── frontend_app.py  # Streamlit UI: GEM Pipeline + ML Recommender + Experimental tabs
├── requirements.txt  # Python dependencies
├── installation.txt  # Step-by-step setup and pipeline walkthrough
├── USAGE.md  # Detailed usage examples
├── ARCHITECTURE.md  # In-depth architecture notes
│
├── backend/  # ML recommender + API orchestration
│   ├── main.py  # FastAPI app: /run, /run/custom, /health
│   ├── pipeline_runner.py  # PipelineRunner: orchestrates MVP pipeline steps
│   ├── config.py  # Paths, feature/target columns, model hyperparams
│   ├── data_loader.py  # Load / save combined training dataset
│   ├── feature_engineering.py  # Encoders, scalers, sample weight computation
│   ├── model_trainer.py  # Train RF / XGBoost / LightGBM; CV; persistence
│   ├── recommender.py  # Generate Exploit + Explore condition recommendations
│   ├── retrainer.py  # Adaptive retraining with round tracking
│   ├── data_ingestion.py  # Validate and ingest new wet-lab CSV results
│   ├── lab_exporter.py  # Export recommendations to Excel lab sheets
│   └── __init__.py
│
├── scripts/  # CLI entry points for the GEM pipeline
│   ├── run_mvp_pipeline.py  # Step 1: build draft model, gapfill, COBRA inspect
│   ├── analyze_mvp.py  # Steps 2-4: theoretical / preset / custom analysis
│   ├── validate_mvp.py  # Step 5: FBA, dead-ends, FVA, gene essentiality
│   ├── build_draft_model.py  # Standalone draft-model builder
│   ├── gapfill_and_export_model.py
│   ├── inspect_with_cobra.py
│   ├── screen_media.py
│   ├── diagnose_exchange_space.py
│   ├── debug_growth.py
│   ├── run_oracle_growth.py
│   ├── screen_oracle_medium.py
│   ├── benchmark_bio2.py
│   ├── inspect_oracle_condition.py
│   ├── first_modelseed_step.py
│   ├── prepare_input.py
│   └── compare_template_runs.py
│
├── src/  # Core GEM pipeline library
│   ├── paths.py  # Canonical path constants (PROJECT_ROOT, MODELS_DIR, …)
│   ├── reconstruction.py  # MSBuilder draft-model construction
│   ├── template_loader.py  # Load built-in or local ModelSEED templates
│   ├── gapfill.py  # Best-effort minimal gapfilling
│   ├── export_model.py  # SBML / JSON model export helpers
│   ├── cobra_loader.py  # Load COBRA model from directory
│   ├── cobra_inspect.py  # FBA, exchange table, baseline optimization
│   ├── cobra_outputs.py  # Save COBRA inspection outputs
│   ├── cobra_debug.py  # Debug utilities for COBRA models
│   ├── mvp_analysis.py  # Theoretical / preset / custom condition analysis
│   ├── mvp_outputs.py  # Save all MVP analysis outputs + plots
│   ├── validation.py  # Dead-end analysis, exchange FVA, gene essentiality
│   ├── validation_outputs.py  # Save validation dashboard and summary files
│   ├── media_screen.py  # First-pass media screening
│   ├── media_outputs.py  # Save media screen outputs
│   ├── exchange_diagnostics.py
│   ├── exchange_diagnostic_outputs.py
│   ├── oracle_growth.py  # Oracle growth check
│   ├── oracle_medium.py  # Oracle-derived debug media
│   ├── oracle_medium_outputs.py
│   ├── bio2_benchmark.py  # Benchmark bio2 reaction rate
│   ├── bio2_benchmark_outputs.py
│   ├── modelseed_step.py  # ModelSEED first-pass step helpers
│   ├── input_parser.py  # Detect protein FASTA / genome FASTA / accession input
│   ├── model_io.py  # Save model summary text/JSON
│   ├── plot_utils.py  # Ranked bar chart plotting helpers
│   ├── report_utils.py  # Plain-text report builders
│   ├── logging_utils.py  # Configured logger factory
│   └── __init__.py
│
├── Experimental/  # Geometry-aware fermentation optimisation (polytope module)
│   ├── README.md  # Experimental module documentation
│   ├── A_oryzae_optimized.xml  # Aspergillus oryzae GEM (SBML) used for simulations
│   ├── scenarios_fungi.json  # Fermentation scenario definitions (nutrients, T, pH, mixing)
│   ├── scenarios.json  # Additional scenarios (standard exchange reaction names)
│   ├── dataset_builder.py  # Main engine: FBA → FVA → PolyRound → polytope sampling → features
│   ├── scenario_generator_adaptive.py  # Adaptive explore/exploit scenario generator
│   ├── train_model.py  # Train Random Forest surrogate on dataset.csv
│   ├── rank_scenarios.py  # Rank scenarios by predicted overall_rank_score
│   ├── postprocess_scores.py  # Compute economic, morphology, meatiness, industrial scores
│   ├── rank_scenarios_industrial.py  # Rank by industrial_score
│   ├── feature_importance.py  # Feature importance from trained surrogate model
│   ├── top_region_summary.py  # Summarise top-performing scenario region (medians, ranges)
│   ├── plot_pareto.py  # Pareto plot: growth vs byproduct burden
│   ├── plot_industrial_tradeoff.py  # Industrial score vs growth scatter + trend line
│   ├── plot_geometry_vs_growth.py  # 3-panel: geometry/byproduct/validation plots
│   ├── reactions.py  # Search model reactions by keyword
│   ├── test_fungal_model.py  # Verify GEM loading and biomass reaction
│   ├── ml_pipeline.py  # ML pipeline utility
│   └── Results A_oryzae/  # Pre-computed results for Aspergillus oryzae
│       ├── dataset.csv  # Raw FBA + geometry dataset
│       ├── dataset_postprocessed.csv  # With industrial scores
│       ├── model.pkl  # Trained surrogate model
│       ├── feature_importances.csv
│       ├── predicted_ranked_scenarios.csv
│       ├── predicted_ranked_scenarios_industrial.csv
│       ├── top_region_summary.txt
│       ├── pareto_growth_vs_byproduct.png
│       └── plot_industrial_tradeoff.png
│
├── polytopes/  # Mirror of Experimental/ (identical content)
│
├── data/
│   ├── synthetic_fungal_growth_dataset.csv  # 2,000-row synthetic training set
│   ├── intermediate/  # Combined dataset, encoded features (auto-generated)
│   ├── models/  # GEM model output directories + ML model checkpoints
│   └── raw/uploads/  # Uploaded protein FASTA files
│
├── config/
│   └── media_library.yml  # Named media definitions for screening
│
├── ModelSEEDDatabase/  # Local copy of the ModelSEED reference database
│   ├── Templates/
│   │   ├── Fungi/Fungi.json  # Fungal reconstruction template (local source)
│   │   ├── Core/  # Core template
│   │   └── …  # GramNeg, GramPos, Human, Plant, etc.
│   ├── Biochemistry/  # Compounds, reactions, aliases, structures
│   └── Annotations/  # Complexes, Roles
│
└── docs/  # Pipeline diagrams and template comparison reports
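The backend/ package has two distinct responsibilities: the pipeline API (FastAPI endpoints plus PipelineRunner) and the ML recommender modules. The API exposes three endpoints: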
| Endpoint | Method | Description |
|---|---|---|
| `/run` | POST | Upload a .faa file; run the 4-step MVP pipeline; return model_id + step status |
| `/run/custom` | POST | Run an optional custom-condition analysis on an existing model |
| `/health` | GET | Liveness check |
PipelineRunner (in pipeline_runner.py) orchestrates four steps in order:
1. `run_mvp_pipeline.py`: build draft model
2. `analyze_mvp.py --mode theoretical`
3. `analyze_mvp.py --mode preset`
4. `validate_mvp.py --mode theoretical_upper_bound`

Each step runs as a child subprocess. If a step exits with a non-zero return code, the pipeline stops and returns partial results.
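The stop-on-failure behaviour can be sketched as follows. This is a minimal illustration, not the actual PipelineRunner code: `StepResult`, `run_steps`, and the stand-in commands are hypothetical names, chosen so the sketch runs without the GEM scripts installed.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    returncode: int
    stdout: str
    stderr: str


def run_steps(steps):
    """Run (name, argv) steps in order; stop at the first non-zero return code."""
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append(StepResult(name, proc.returncode, proc.stdout, proc.stderr))
        if proc.returncode != 0:
            break  # later steps are skipped; partial results are returned
    all_succeeded = len(results) == len(steps) and all(
        r.returncode == 0 for r in results
    )
    return results, all_succeeded


# Stand-in commands (assumption): the real steps would invoke the pipeline scripts.
demo_steps = [
    ("step_ok", [sys.executable, "-c", "print('built')"]),
    ("step_fail", [sys.executable, "-c", "import sys; sys.exit(3)"]),
    ("step_skipped", [sys.executable, "-c", "print('never runs')"]),
]
results, all_succeeded = run_steps(demo_steps)
```

Here the third step never runs because the second one exits with code 3, so `results` holds two entries and `all_succeeded` is `False`.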
| Module | Responsibility |
|---|---|
| `config.py` | Feature columns, target names, model hyperparameters, directory paths |
| `data_loader.py` | Load/save the combined (synthetic + real) training CSV |
| `feature_engineering.py` | Label-encode categoricals, min-max scale numerics, compute sample weights |
| `model_trainer.py` | Cross-validate Random Forest / XGBoost / LightGBM; select best; persist with joblib |
| `recommender.py` | Sample 2,000 candidate conditions; predict all targets; return top-N exploit + explore |
| `retrainer.py` | Adaptive retraining loop with round tracking (retrain_log.json) |
| `data_ingestion.py` | Validate lab CSV schema; rename columns; recompute composite score; append to combined dataset |
| `lab_exporter.py` | Render recommendations into an Excel workbook for the wet lab |
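To make the feature-engineering step concrete, here is a small standard-library sketch of label encoding and min-max scaling. The function names and example columns are hypothetical, the real module may well use library encoders instead, and the sample-weight scheme is not described here so it is left out.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


def fit_min_max(values):
    """Return (min, max) for later scaling."""
    return min(values), max(values)


def min_max_scale(value, lo, hi):
    """Scale into [0, 1]; a constant column maps to 0.0 to avoid dividing by zero."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


# Hypothetical rows; column names are illustrative only.
rows = [
    {"carbon_source": "glucose", "pH": 5.0},
    {"carbon_source": "sucrose", "pH": 6.5},
    {"carbon_source": "glucose", "pH": 8.0},
]

encoder = fit_label_encoder(r["carbon_source"] for r in rows)
lo, hi = fit_min_max([r["pH"] for r in rows])

encoded = [
    (encoder[r["carbon_source"]], min_max_scale(r["pH"], lo, hi))
    for r in rows
]
```

The fitted encoder and the `(lo, hi)` pair are what would need to be persisted, so that new lab data is transformed the same way as the training set.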
frontend_app.py is a Streamlit single-page application with three top-level tabs:
GEM Pipeline tab:
- Upload & Run: upload a `.faa` file, choose a template (Core / Fungal), toggle RAST, click ▶ Run Pipeline
- View Results: model selector dropdown; six sub-tabs:
  - Draft Model: `mvp_summary.json` metrics card + mode comparison plot
  - Theoretical Upper Bound: FBA benchmark plot, condition table, JSON summary
  - Preset Conditions: ranked bar chart, conditions table, text summary
  - Custom Condition: run and display a user-defined media condition
  - Validation: dashboard image, FBA status, dead-end metabolites, exchange FVA, gene essentiality
  - Full Pipeline Files: all 12 intermediate file outputs in pipeline order

ML Recommender tab:
- Train: train all 3 model types × 4 targets; display CV R² scores
- Recommendations: select strain, get top-N exploit + explore conditions; download Excel lab sheet
- Upload & Retrain: upload a filled lab results CSV, ingest, retrain with updated data

Experimental Analysis tab, which visualises results from the geometry-aware fermentation optimisation pipeline in Experimental/:
- Results Overview: display pre-computed `Results A_oryzae/` outputs
- Scenario Rankings: tabular view of `predicted_ranked_scenarios.csv` and `predicted_ranked_scenarios_industrial.csv`
- Pareto Analysis: Pareto plot image (growth vs byproduct burden)
- Industrial Tradeoff: industrial score vs growth scatter with trend line
- Geometry vs Growth: 3-panel figure: feasible space log-volume / byproduct pressure / ML validation
- Feature Importances: bar chart of which variables drive the surrogate model score
- Top Region Summary: median and range of the top-performing scenario cluster
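For readers unfamiliar with the Pareto view of growth vs byproduct burden, the sketch below extracts the non-dominated scenarios from (growth, byproduct) pairs. It is illustrative only: whether `plot_pareto.py` computes an explicit front or just plots the points is not stated here, and the numbers are made up.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (growth, byproduct). A point dominates another when it has
    growth >= and byproduct <= the other, with at least one strict inequality.
    """
    front = []
    for i, (g_i, b_i) in enumerate(points):
        dominated = any(
            (g_j >= g_i and b_j <= b_i) and (g_j > g_i or b_j < b_i)
            for j, (g_j, b_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((g_i, b_i))
    return sorted(front)


# Hypothetical (growth, byproduct) values for five scenarios.
scenarios = [(0.10, 0.5), (0.20, 0.9), (0.15, 0.4), (0.20, 1.2), (0.05, 0.45)]
front = pareto_front(scenarios)
```

Only two of the five scenarios survive: one with the lowest byproduct burden and one with the highest growth at a lower byproduct level than its equal-growth rival.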
The MVP pipeline runs in a fixed order via scripts/run_mvp_pipeline.py:
Protein FASTA (.faa)
        │
        ▼
MSGenome.from_fasta()              → load features
        │
        ▼
MSBuilder.build_metabolic_model()  → draft reconstruction
        │   (template: Core builtin OR Fungi local)
        ▼
gapfill_model_minimally()          → best-effort gapfill on bio1
        │
        ▼
save_model_sbml_if_possible()      → export model.xml (SBML) or model.json
        │
        ▼
load_cobra_model()                 → load via COBRApy
run_baseline_optimization()        → FBA baseline
get_exchange_table()               → exchange metabolite fluxes
        │
        ▼
save_mvp_summary()                 → mvp_summary.json / .txt
Analysis steps (run after step 1):
| Script | Mode | Output |
|---|---|---|
| `analyze_mvp.py` | `theoretical` | `theoretical_upper_bound.{json,txt,png,csv}` |
| `analyze_mvp.py` | `preset` | `preset_conditions.{json,csv,txt,png}` |
| `analyze_mvp.py` | `custom` | `custom_condition_NAME.{json,txt,png}` |
| `validate_mvp.py` | `theoretical_upper_bound` | validation dashboard, dead-end CSV, FVA CSV, gene essentiality CSV |
| Label | `--template-name` | `--template-source` | File |
|---|---|---|---|
| Core Template (built-in) | `template_core` | `builtin` | modelseedpy built-in |
| Fungal Template (local) | `fungi` | `local` | `ModelSEEDDatabase/Templates/Fungi/Fungi.json` |
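The mapping from template label to CLI flags can be expressed as a small lookup. This is a sketch: `TEMPLATES` and `build_pipeline_argv` are hypothetical helpers, while the flag names are the ones used in the CLI usage examples of this README.

```python
TEMPLATES = {
    "Core Template (built-in)": ("template_core", "builtin"),
    "Fungal Template (local)": ("fungi", "local"),
}


def build_pipeline_argv(input_faa, model_id, template_label, use_rast=False):
    """Build the argument list for scripts/run_mvp_pipeline.py."""
    template_name, template_source = TEMPLATES[template_label]
    argv = [
        "python", "GEMS/scripts/run_mvp_pipeline.py",
        "--input", input_faa,
        "--model-id", model_id,
        "--template-name", template_name,
        "--template-source", template_source,
    ]
    if use_rast:
        argv.append("--use-rast")
    return argv


argv = build_pipeline_argv(
    "protein.faa", "fungi_test", "Fungal Template (local)", use_rast=True
)
```

Building the command as a list rather than a shell string avoids quoting problems when paths contain spaces.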
Located in Experimental/ (and mirrored in polytopes/), this module implements a geometry-aware, biologically grounded optimisation framework for fermentation design.
Instead of optimising a single metabolic solution, this system:
- Explores the full feasible metabolic flux space (solution polytope) for a fungal GEM
- Extracts geometric and biological features from that space
- Trains a surrogate ML model that learns how environmental conditions shape performance
- Identifies robust and efficient operating regions for fermentation
scenarios_fungi.json      → fermentation scenario definitions
        │
        ▼
dataset_builder.py
  ├── cobra.io.read_sbml_model(A_oryzae_optimized.xml)
  ├── apply_model_specific_medium(scenario)
  ├── model.optimize()                          → FBA
  ├── flux_variability_analysis()               → FVA range
  ├── polyround_preprocess()                    → convert SBML → polytope (Ax ≤ b)
  ├── PolytopeSampler.sample_from_polytope()    → MCMC interior sampling
  ├── back_transform()                          → recover flux vectors
  └── extract geometry features                 → log-volume, anisotropy, flux_std
        │
        ▼
dataset.csv               → FBA + geometry features per scenario
        │
        ▼
train_model.py            → Random Forest surrogate on overall_rank_score
        │
        ▼
rank_scenarios.py         → predicted_ranked_scenarios.csv
        │
        ▼
postprocess_scores.py
  ├── economic scores   (substrate + mixing cost / yield)
  ├── morphology score  (growth × mixing × pH penalty)
  └── meatiness score   (growth + biomass + morphology - byproducts)
        │
        ▼
dataset_postprocessed.csv → enhanced with industrial scores
        │
        ▼
rank_scenarios_industrial.py → predicted_ranked_scenarios_industrial.csv
| Script | Description | Key Output |
|---|---|---|
| `dataset_builder.py` | Main engine: FBA + FVA + polytope sampling + feature extraction | `results/dataset.csv` |
| `scenario_generator_adaptive.py` | Generate uniform explore + local exploit scenarios | `scenarios.json` |
| `train_model.py` | Train RandomForest surrogate; evaluate R² / MAE | `results/model.pkl` |
| `rank_scenarios.py` | Apply trained model; rank by predicted score | `results/predicted_ranked_scenarios.csv` |
| `postprocess_scores.py` | Compute economic, morphology, meatiness, industrial scores | `results/dataset_postprocessed.csv` |
| `rank_scenarios_industrial.py` | Rank by composite industrial score | `results/predicted_ranked_scenarios_industrial.csv` |
| `feature_importance.py` | Extract and display feature importances | `results/feature_importances.csv` |
| `top_region_summary.py` | Summarise top-performing scenario cluster (medians, ranges) | `results/top_region_summary.txt` |
| `plot_pareto.py` | Pareto view: growth vs total byproducts | `results/pareto_growth_vs_byproduct.png` |
| `plot_industrial_tradeoff.py` | Industrial score vs growth with trend line | `results/plot_industrial_tradeoff.png` |
| `plot_geometry_vs_growth.py` | 3-panel: geometry / byproduct pressure / ML validation | `results/final_3panel_figure.png` |
| `test_fungal_model.py` | Verify GEM loads and biomass reaction exists | - |
| `reactions.py` | Search GEM reactions by keyword | - |
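The "uniform explore + local exploit" idea behind the scenario generator can be illustrated with the standard library alone. Everything specific here is an assumption for illustration: the parameter names, the bounds, the Gaussian perturbation, and the 5% relative step size are not taken from `scenario_generator_adaptive.py`.

```python
import random

# Hypothetical parameter bounds; the real scenario schema may differ.
BOUNDS = {"glucose": (1.0, 20.0), "pH": (4.0, 8.0), "temperature": (25.0, 37.0)}


def explore(n, rng):
    """Draw n scenarios uniformly from the full parameter box."""
    return [
        {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        for _ in range(n)
    ]


def exploit(best, n, rng, rel_sigma=0.05):
    """Draw n scenarios near `best` with Gaussian noise, clipped to the bounds."""
    out = []
    for _ in range(n):
        scenario = {}
        for k, (lo, hi) in BOUNDS.items():
            value = rng.gauss(best[k], rel_sigma * (hi - lo))
            scenario[k] = min(hi, max(lo, value))
        out.append(scenario)
    return out


rng = random.Random(0)  # fixed seed so the batch is reproducible
best_so_far = {"glucose": 10.0, "pH": 6.0, "temperature": 30.0}
scenarios = explore(5, rng) + exploit(best_so_far, 5, rng)
```

Mixing the two kinds of draws keeps some coverage of the whole design space while concentrating new simulations around the current best region.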
| Feature | Description |
|---|---|
| `log_volume` | Log of polytope volume (sum of log eigenvalues of the flux covariance matrix) |
| `anisotropy_log` | Log of max eigenvalue / median eigenvalue; measures directionality of the flux space |
| `flux_std` | Mean standard deviation of flux samples; measures overall variability |
| `biomass_flux_mean` | Mean biomass flux across polytope samples |
| `biomass_std` | Standard deviation of biomass flux |
| `fva_range` | Mean FVA (min-max) range across all reactions |
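The three covariance-based features can be computed from a matrix of flux samples roughly as follows. This is a sketch following the descriptions in the table, not the code in `dataset_builder.py`. The `eps` clipping is an added assumption to keep the logarithms finite, and the synthetic Gaussian samples merely stand in for real polytope samples. Note that the sum of log eigenvalues is the log-determinant of the sample covariance, which acts as a proxy for volume and is not an exact polytope volume.

```python
import numpy as np


def geometry_features(samples, eps=1e-12):
    """Compute covariance-based features from flux samples.

    samples: array of shape (n_samples, n_reactions).
    """
    cov = np.cov(samples, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)       # symmetric matrix -> real eigenvalues
    eigvals = np.clip(eigvals, eps, None)   # guard against zeros before taking logs
    return {
        "log_volume": float(np.sum(np.log(eigvals))),
        "anisotropy_log": float(np.log(eigvals.max() / np.median(eigvals))),
        "flux_std": float(np.mean(np.std(samples, axis=0))),
    }


# Synthetic samples standing in for polytope samples: three independent
# "fluxes" with standard deviations 1, 2 and 4.
rng = np.random.default_rng(0)
samples = rng.normal(size=(5000, 3)) * np.array([1.0, 2.0, 4.0])
features = geometry_features(samples)
```

With these synthetic inputs the covariance eigenvalues are close to 1, 4 and 16, so `log_volume` comes out near log(64) and `anisotropy_log` near log(4).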
| Component | Weight | Description |
|---|---|---|
| Growth (FBA) | 0.25 | Raw FBA biomass rate |
| Biomass flux mean | 0.15 | Average biomass across polytope samples |
| Biomass yield | 0.15 | Growth / glucose_uptake |
| Economic score | 0.20 | 1 - (substrate cost + mixing cost) / yield |
| Morphology score | 0.15 | Growth × mixing_fibrousness - pH penalty |
| Meatiness score | 0.10 | Growth + biomass + morphology - byproducts |
| Byproduct penalty | -0.20 | Total byproduct excretion |
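Assuming the table describes a linear composite (the exact formula and any normalisation are not spelled out here), the weighted score could be computed as below. The dictionary keys and the example values are hypothetical; only the weights come from the table.

```python
WEIGHTS = {
    "growth": 0.25,
    "biomass_flux_mean": 0.15,
    "biomass_yield": 0.15,
    "economic_score": 0.20,
    "morphology_score": 0.15,
    "meatiness_score": 0.10,
    "byproduct_penalty": -0.20,  # negative weight: more byproduct lowers the score
}


def composite_score(components):
    """Weighted sum of the components listed in WEIGHTS."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical component values, assumed already normalised to [0, 1].
example = {
    "growth": 0.8,
    "biomass_flux_mean": 0.6,
    "biomass_yield": 0.5,
    "economic_score": 0.7,
    "morphology_score": 0.4,
    "meatiness_score": 0.5,
    "byproduct_penalty": 0.3,
}
score = composite_score(example)
```

The six positive weights sum to 1.0, so with normalised inputs the score stays at or below 1 and the byproduct term can only pull it down.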
cd GEMS/Experimental
# 1. Generate the scenarios
python scenario_generator_adaptive.py
# 2. Build the dataset (requires PolyRound + PolytopeSampler)
python dataset_builder.py
# 3. Train the surrogate model
python train_model.py
# 4. Rank scenarios by overall score
python rank_scenarios.py
# 5. Add industrial scoring layer
python postprocess_scores.py
python rank_scenarios_industrial.py
# 6. Analyse and visualise
python feature_importance.py
python top_region_summary.py
python plot_pareto.py
python plot_industrial_tradeoff.py
python plot_geometry_vs_growth.py

| File | Description |
|---|---|
| `data/synthetic_fungal_growth_dataset.csv` | 2,000 synthetic growth experiments across 4 strains; features include carbon source, nitrogen source, pH, temperature, RPM, inoculum size, nutrient concentrations |
| `data/intermediate/combined_dataset.csv` | Merged synthetic + real uploaded data (auto-generated after ingest) |
| `data/intermediate/features.pkl` | Fitted encoder/scaler pipeline (auto-generated after training) |
| `data/models/` | One directory per GEM model run; one directory per ML training run (`run_YYYYMMDD_HHMMSS/`) |
| `Experimental/Results A_oryzae/dataset.csv` | FBA + geometry features for A. oryzae simulated scenarios |
| `Experimental/Results A_oryzae/dataset_postprocessed.csv` | Enhanced dataset with industrial scores |
| `Experimental/Results A_oryzae/model.pkl` | Trained Random Forest surrogate for A. oryzae |
# 1. Install dependencies
pip install -r GEMS/requirements.txt
pip install modelseedpy cobra fastapi uvicorn
# 2. Start the API server (from the GEMS/ directory)
cd GEMS
uvicorn backend.main:app --reload --port 8000
# 3. Start the Streamlit UI (separate terminal, from GEMS/ directory)
cd GEMS
streamlit run frontend_app.py

Navigate to http://localhost:8501 to access the UI.
# Build a draft fungal model using the local Fungi template
python GEMS/scripts/run_mvp_pipeline.py \
--input ncbi_dataset/data/GCA_000182925.2/protein.faa \
--model-id fungi_test \
--use-rast \
--template-name fungi \
--template-source local
# Run theoretical upper bound analysis
python GEMS/scripts/analyze_mvp.py \
--model-dir GEMS/data/models/fungi_test \
--mode theoretical
# Run preset conditions
python GEMS/scripts/analyze_mvp.py \
--model-dir GEMS/data/models/fungi_test \
--mode preset
# Run validation
python GEMS/scripts/validate_mvp.py \
--model-dir GEMS/data/models/fungi_test \
--mode theoretical_upper_bound \
--biomass-reaction bio2

POST /run: upload a protein FASTA and run the full 4-step MVP pipeline.
| Field | Type | Default | Description |
|---|---|---|---|
| `file` | .faa upload | required | Protein FASTA file |
| `use_rast` | bool | `false` | Annotate with RAST |
| `template_name` | string | `template_core` | `template_core` or `fungi` |
| `template_source` | string | `builtin` | `builtin` or `local` |
Response: model_id, steps[] (name, returncode, stdout, stderr), all_succeeded
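A client would typically inspect `steps[]` to find where a run stopped. The helper below is a hypothetical sketch that assumes the response is JSON decoded into a dictionary with exactly the fields listed; the step names in the example are invented.

```python
def first_failed_step(response):
    """Return the name of the first step with a non-zero returncode, or None."""
    for step in response["steps"]:
        if step["returncode"] != 0:
            return step["name"]
    return None


# Hypothetical response body shaped like the documented fields.
response = {
    "model_id": "fungi_test",
    "steps": [
        {"name": "run_mvp_pipeline", "returncode": 0, "stdout": "ok", "stderr": ""},
        {"name": "analyze_theoretical", "returncode": 1, "stdout": "", "stderr": "boom"},
    ],
    "all_succeeded": False,
}
failed = first_failed_step(response)
```

Because the pipeline stops at the first failure, the failed step is also the last entry in `steps`, and its `stderr` is the natural place to look for the cause.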
POST /run/custom: run a single custom-condition analysis on an existing model.
| Field | Type | Default | Description |
|---|---|---|---|
| `model_id` | string | required | Existing model directory name |
| `condition_name` | string | required | Output filename stem |
| `preset_seed` | string | `rich_debug_medium` | Starting preset |
| `metabolite_ids` | string | optional | Comma-separated metabolite IDs |
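The request fields can be assembled with a small helper like the one below. It is a sketch under assumptions: `build_custom_payload` is a hypothetical name, the metabolite IDs are illustrative ModelSEED-style identifiers, and whether the endpoint expects form fields or a JSON body is not stated here, so the helper only builds a plain dictionary.

```python
def build_custom_payload(model_id, condition_name,
                         preset_seed="rich_debug_medium", metabolite_ids=None):
    """Assemble the documented fields; metabolite_ids is joined with commas."""
    payload = {
        "model_id": model_id,
        "condition_name": condition_name,
        "preset_seed": preset_seed,
    }
    if metabolite_ids:
        # The API takes a single comma-separated string, not a list.
        payload["metabolite_ids"] = ",".join(metabolite_ids)
    return payload


payload = build_custom_payload(
    "fungi_test", "glucose_only", metabolite_ids=["cpd00027", "cpd00013"]
)
```

Leaving `metabolite_ids` out of the payload when it is empty keeps the optional field truly optional on the wire.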
Works with any SBML file and any protein FASTA file.