taowis/PathogenHawk

🧠 PathogenHawk

PathogenHawk is a modular machine learning toolkit for predicting antimicrobial resistance (AMR) from genome sequences of fungal and bacterial pathogens.

It supports multiple species including Candida auris, E. coli, and Aspergillus fumigatus, using genomic features (e.g. SNPs, resistance genes) to train interpretable ML models.
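
To make the idea of genomic features concrete, here is a minimal sketch of how SNP calls and resistance-gene hits might be encoded as a binary presence/absence matrix before model training. The isolate IDs, feature names, and the `build_feature_matrix` helper are hypothetical and are not taken from PathogenHawk's code; the toolkit's actual feature format may differ.

```python
from typing import Dict, List, Set, Tuple


def build_feature_matrix(
    calls: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Encode per-sample feature sets as a binary presence/absence matrix.

    Rows follow the sorted sample IDs, columns follow the sorted feature names.
    """
    samples = sorted(calls)
    features = sorted(set().union(*calls.values())) if calls else []
    matrix = [[1 if f in calls[s] else 0 for f in features] for s in samples]
    return samples, features, matrix


# Hypothetical input: features observed in each isolate (illustrative only).
calls = {
    "isolate_A": {"snp:ERG11_Y132F", "gene:CDR1"},
    "isolate_B": {"gene:CDR1"},
    "isolate_C": set(),
}

samples, features, matrix = build_feature_matrix(calls)
print(features)
for sample, row in zip(samples, matrix):
    print(sample, row)
```

A matrix like this keeps each column tied to a named SNP or gene, which is what lets feature importances from a tree-based model be read back in biological terms.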


📄 Project Links

📚 Rationale

C. auris is an emerging pathogen with significant resistance concerns. Related reading: Novel antifungals and treatment approaches to tackle resistance and improve outcomes of invasive fungal disease.


📁 Project Structure

PathogenHawk/
├── configs/                 # YAML configs for each pathogen
├── data/                    # Raw and processed data
├── genome_processing/       # Preprocessing and variant calling
├── feature_engineering/     # Genomic feature extraction
├── ml_model/                # Model training, evaluation, interpretation
├── genes/                   # Resistance gene annotations
├── metadata/                # MIC values and resistance phenotype labels
├── ref/                     # Reference genomes (e.g. Cauris_CDC317.fa)
├── notebooks/               # Jupyter demo notebooks
└── scripts/                 # Utility scripts

🚀 Quick Start

  1. Clone the repository and install dependencies:
     conda env create -f environment.yml
     conda activate pathogenhawk
  2. Create a configuration file under configs/, e.g. configs/Cauris.yaml:
     pathogen: candida_auris
     reference_genome: data/raw/C_auris_B8441V3_ref.fa
     resistance_genes: data/metadata/Cauris_amr_genes.tsv
     phenotype_file: data/metadata/Cauris_MIC.csv
     feature_file: data/metadata/Cauris_features.tsv
     labels: fluconazole_resistance
     alignment_tool: bwa
     variant_caller: freebayes
     features: ["snp", "resistance_genes"]
     ml_model: xgboost
     output_dir: results/Candida_auris
  3. Run your pipeline:
     export PYTHONPATH=$(pwd)
     python ml_model/train_model.py --config configs/Cauris.yaml
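
Before launching a run, it can help to check that a config carries every key used in the example above. The sketch below works on an already-parsed config dictionary (for instance, the result of `yaml.safe_load` from PyYAML). The `check_config` helper is hypothetical, and treating every key as required is an assumption of this sketch; PathogenHawk itself may accept defaults or optional keys.

```python
from typing import Any, Dict, List

# Keys taken from the example Cauris.yaml; treating all of them as required
# is an assumption of this sketch, not documented PathogenHawk behaviour.
REQUIRED_KEYS = [
    "pathogen", "reference_genome", "resistance_genes", "phenotype_file",
    "feature_file", "labels", "alignment_tool", "variant_caller",
    "features", "ml_model", "output_dir",
]


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    features = cfg.get("features")
    if features is not None and not isinstance(features, list):
        problems.append("features must be a list, e.g. ['snp', 'resistance_genes']")
    return problems


# Parsed equivalent of the example configs/Cauris.yaml.
example_cfg = {
    "pathogen": "candida_auris",
    "reference_genome": "data/raw/C_auris_B8441V3_ref.fa",
    "resistance_genes": "data/metadata/Cauris_amr_genes.tsv",
    "phenotype_file": "data/metadata/Cauris_MIC.csv",
    "feature_file": "data/metadata/Cauris_features.tsv",
    "labels": "fluconazole_resistance",
    "alignment_tool": "bwa",
    "variant_caller": "freebayes",
    "features": ["snp", "resistance_genes"],
    "ml_model": "xgboost",
    "output_dir": "results/Candida_auris",
}

print(check_config(example_cfg))
print(check_config({"pathogen": "candida_auris", "features": "snp"}))
```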

Run Nextflow

# For Candida auris
nextflow run pathogenhawk.nf --config configs/Cauris.yaml

# For Escherichia coli
nextflow run pathogenhawk.nf --config configs/Ecoli.yaml
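
When several pathogens need to be processed in one go, the same command can be assembled per config from a small script. This is a minimal sketch that only builds and prints the command lines shown above; the `nextflow_command` helper is hypothetical, and the list of config paths simply mirrors the two examples.

```python
import shlex
from typing import List


def nextflow_command(config_path: str, script: str = "pathogenhawk.nf") -> List[str]:
    """Build the argument list for one Nextflow run, mirroring the commands above."""
    return ["nextflow", "run", script, "--config", config_path]


# Config paths taken from the examples above.
configs = ["configs/Cauris.yaml", "configs/Ecoli.yaml"]

for cfg in configs:
    # Printed as a dry run; pass the list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(nextflow_command(cfg)))
```

Building the command as a list rather than a single string avoids shell-quoting problems if a config path ever contains spaces.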

Configuration

workflow/nextflow.config includes basic container settings. You can customize profiles for HPC/cloud usage.


📦 Example Files

For Candida auris

For Escherichia coli


📚 Citation

If you use PathogenHawk in your research, please cite the associated JOSS paper (under review):

Lai, K. (2025). PathogenHawk: A Pathogen Machine Learning Toolkit for Predicting Antimicrobial Resistance from Genomic Features. Journal of Open Source Software (under review). https://github.com/biosciences/PathogenHawk


👩‍💻 Author

Developed by Kaitao Lai

🪪 License

MIT © 2025 Kaitao Lai
