PathogenHawk is a modular machine learning toolkit for predicting antimicrobial resistance (AMR) from genome sequences of fungal and bacterial pathogens.
It supports multiple species including Candida auris, E. coli, and Aspergillus fumigatus, using genomic features (e.g. SNPs, resistance genes) to train interpretable ML models.
- 📂 Source Code: Explore the full repository
- 🔗 Live Report: View the interactive HTML output
C. auris is an emerging pathogen with significant resistance concerns. See: Novel antifungals and treatment approaches to tackle resistance and improve outcomes of invasive fungal disease.
PathogenHawk/
├── configs/ # YAML configs for each pathogen
├── data/ # Raw and processed data
├── genome_processing/ # Preprocessing and variant calling
├── feature_engineering/ # Genomic feature extraction
├── ml_model/ # Model training, evaluation, interpretation
├── genes/ # Resistance gene annotations
├── metadata/ # MIC values and resistance phenotype labels
├── ref/ # Reference genomes (e.g. Cauris_CDC317.fa)
├── notebooks/ # Jupyter demo notebooks
└── scripts/ # Utility scripts
- Clone the repository and install dependencies:
conda env create -f environment.yml
conda activate pathogenhawk
- Create a configuration file under configs/, e.g. configs/Cauris.yaml:
pathogen: candida_auris
reference_genome: data/raw/C_auris_B8441V3_ref.fa
resistance_genes: data/metadata/Cauris_amr_genes.tsv
phenotype_file: data/metadata/Cauris_MIC.csv
feature_file: data/metadata/Cauris_features.tsv
labels: fluconazole_resistance
alignment_tool: bwa
variant_caller: freebayes
features: ["snp", "resistance_genes"]
ml_model: xgboost
output_dir: results/Candida_auris
- Run your pipeline:
export PYTHONPATH=$(pwd)
python ml_model/train_model.py --config configs/Cauris.yaml
# For Candida auris
nextflow run pathogenhawk.nf --config configs/Cauris.yaml
# For Escherichia coli
nextflow run pathogenhawk.nf --config configs/Ecoli.yaml
workflow/nextflow.config includes basic container settings. You can customize profiles for HPC/cloud usage.
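Before launching a run, it can help to check that a config file contains the keys the pipeline expects. The following is a minimal sketch, not part of PathogenHawk itself: the helper names `load_config` and `validate_config` are hypothetical, and treating every key from the example config above as required is an assumption. Parsing the YAML file requires PyYAML; the validation step works on any plain dict.

```python
from pathlib import Path

# Keys copied from the example configs/Cauris.yaml shown above.
# Assumption: this sketch treats all of them as required.
REQUIRED_KEYS = {
    "pathogen", "reference_genome", "resistance_genes", "phenotype_file",
    "feature_file", "labels", "alignment_tool", "variant_caller",
    "features", "ml_model", "output_dir",
}
# Keys whose values are expected to point at input files.
PATH_KEYS = ("reference_genome", "resistance_genes", "phenotype_file", "feature_file")


def load_config(path):
    """Parse a YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so validate_config works without PyYAML
    with open(path) as handle:
        return yaml.safe_load(handle)


def validate_config(cfg, check_paths=False):
    """Return a list of problems found in cfg; an empty list means it looks OK."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(cfg))]
    if "features" in cfg and not isinstance(cfg["features"], list):
        problems.append("features must be a list")
    if check_paths:
        for key in PATH_KEYS:
            if key in cfg and not Path(cfg[key]).is_file():
                problems.append(f"{key} not found: {cfg[key]}")
    return problems


# Example: a dict mirroring the Cauris.yaml config, with one key left out.
example = {
    "pathogen": "candida_auris",
    "reference_genome": "data/raw/C_auris_B8441V3_ref.fa",
    "resistance_genes": "data/metadata/Cauris_amr_genes.tsv",
    "phenotype_file": "data/metadata/Cauris_MIC.csv",
    "feature_file": "data/metadata/Cauris_features.tsv",
    "labels": "fluconazole_resistance",
    "alignment_tool": "bwa",
    "variant_caller": "freebayes",
    "features": ["snp", "resistance_genes"],
    "ml_model": "xgboost",
}
for problem in validate_config(example):
    print(problem)
```

With PyYAML installed, `validate_config(load_config("configs/Cauris.yaml"), check_paths=True)` would additionally report input files that do not exist on disk.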
- data/metadata/auris_amr_genes.tsv: Known resistance genes
- data/metadata/cauris_MIC.csv: MIC values and resistance phenotypes
- data/metadata/Ecoli_res_genes.tsv: Known resistance genes
- data/metadata/Ecoli_MICs.csv: MIC values and resistance phenotypes
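To illustrate how files like these could feed a model, the sketch below turns MIC values into binary resistance labels and encodes each sample as a presence/absence vector over a list of known resistance genes. It is only an illustration under stated assumptions: the column names `sample_id` and `fluconazole_mic`, the sample data, the gene list, and the threshold of 32 are all hypothetical, and the actual file layout used by PathogenHawk may differ. It also assumes MIC values are plain numbers; censored values such as ">64" would need extra handling.

```python
import csv
import io


def binarize_mic(rows, mic_column, threshold):
    """Map sample_id -> 1 (resistant) or 0 (susceptible) using MIC >= threshold."""
    labels = {}
    for row in rows:
        value = row[mic_column].strip()
        if not value:
            continue  # skip samples without a measured MIC
        labels[row["sample_id"]] = int(float(value) >= threshold)
    return labels


def presence_matrix(sample_genes, known_genes):
    """Encode each sample as a 0/1 vector over the known resistance genes."""
    return {
        sample: [int(gene in genes) for gene in known_genes]
        for sample, genes in sample_genes.items()
    }


# Hypothetical MIC table; column names and values are made up for illustration.
mic_csv = io.StringIO("sample_id,fluconazole_mic\nS1,64\nS2,2\nS3,\n")
labels = binarize_mic(csv.DictReader(mic_csv), "fluconazole_mic", threshold=32.0)

# Hypothetical gene list and per-sample detections (illustrative only).
known_genes = ["ERG11", "FKS1", "CDR1"]
detected = {"S1": {"ERG11", "CDR1"}, "S2": {"FKS1"}}
features = presence_matrix(detected, known_genes)

for sample in sorted(labels):
    print(sample, features[sample], labels[sample])
```

Keeping features as named 0/1 columns is one simple way to preserve interpretability, since each model input maps directly to a gene.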
If you use PathogenHawk in your research, please cite the associated JOSS paper (under review):
Lai, K. (2025). PathogenHawk: A Pathogen Machine Learning Toolkit for Predicting Antimicrobial Resistance from Genomic Features. Journal of Open Source Software (under review). https://github.com/biosciences/PathogenHawk
Developed by Kaitao Lai
MIT © 2025 Kaitao Lai