Releases: AnantharamanLab/CheckAMG

v0.10.0

22 Feb 18:48

0.10.0

  • Added genome-context curation to the curate_annots.py module:

    • Flag proteins located outside strict viral regions or directly adjacent to their boundaries, based on window average VL-scores and V-scores of genes (or hallmark genes if enabled). These proteins are not removed from final AVG predictions by default, since strict viral region calls can be too conservative when viral origin confidence remains high.
    • Flag proteins in contiguous runs of 3 or more AVGs. These proteins are excluded by default, since such runs indicate non-auxiliary function.
    • Parameter values are configurable with --min-flank-vscore, --min-window-avg-vlscore, and --max-avg-array-length.
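
The run-length flag described above can be sketched as follows. This is a minimal illustration, not the code in curate_annots.py: the function name `flag_avg_arrays` is hypothetical, and the default of 2 for `max_avg_array_length` is an assumption inferred from the "3 or more" wording rather than a documented default.

```python
from itertools import groupby


def flag_avg_arrays(is_avg, max_avg_array_length=2):
    """Flag genes that sit in a contiguous run of AVGs that is too long.

    is_avg: booleans in gene order along one contig (True = gene is an AVG).
    max_avg_array_length: longest run that is still accepted (assumed default).
    Returns one boolean per gene, True where the gene should be flagged.
    """
    flags = []
    for value, run in groupby(is_avg):
        run = list(run)
        # Only runs of AVGs longer than the allowed length are flagged.
        too_long = bool(value) and len(run) > max_avg_array_length
        flags.extend([too_long] * len(run))
    return flags


genes = [True, True, False, True, True, True, False]
print(flag_avg_arrays(genes))
```

With these inputs, the pair of AVGs at the start is kept and only the run of three is flagged.
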
  • Updated the viral origin confidence LGBM

    • Previous versions trained and evaluated the model using train/test splits that could inadvertently exclude some proteins from a contig.
    • As a result, the model was sometimes trained with incomplete genome context information, though this did not introduce data leakage between training, validation, and test sets.
    • This has been corrected. The model was retrained and re-evaluated using datasets that retain all proteins encoded by each contig (including a new test dataset composed only of host chromosomes with integrated proviruses and some integrated MGEs).
    • Inference is now parallelized by batching multiple contigs per prediction call to improve throughput.
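
One common way to keep every protein of a contig in the same dataset is to split at the contig level rather than the protein level. The sketch below illustrates that idea only. It is not the procedure used to retrain the model, and `split_by_contig`, `test_fraction`, and `seed` are hypothetical names.

```python
import random


def split_by_contig(protein_to_contig, test_fraction=0.2, seed=0):
    """Split proteins into train/test so that each contig stays whole.

    protein_to_contig: dict mapping protein ID -> contig ID.
    Returns (train_proteins, test_proteins).
    """
    contigs = sorted(set(protein_to_contig.values()))
    rng = random.Random(seed)
    rng.shuffle(contigs)
    # Assign whole contigs to the test set; all their proteins follow.
    n_test = max(1, round(len(contigs) * test_fraction))
    test_contigs = set(contigs[:n_test])
    train = [p for p, c in protein_to_contig.items() if c not in test_contigs]
    test = [p for p, c in protein_to_contig.items() if c in test_contigs]
    return train, test
```

Because assignment happens per contig, no contig loses part of its gene neighbourhood to another split.
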
  • Additional changes:

    • Added support for gzipped FASTA file inputs (.fasta.gz, .fna.gz and .faa.gz).
    • Now logs how many AVGs of each type were filtered during the curation module.
    • Changed the order of some logging messages in organize_proteins.py.
    • Now writes full HMMsearch results to parquet instead of TSV when --keep_full_hmm_results is enabled.
    • Added additional methyltransferase annotations from Pfam to the AMG and AMG filter lists.
    • Added additional defense/anti-defense annotations to the APG and AReG lists.
    • Added an option to save all intermediate and final tables to parquet instead of TSV, reducing file size for large input datasets.
    • Changed some argument names (e.g., --genomes to --input-contigs, --vmags to --input-bins) and renamed variables and log messages to refer to contigs/bins in general rather than genomes/vMAGs specifically.
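
Reading plain and gzipped FASTA inputs through one code path usually comes down to choosing the opener by file extension. The sketch below shows that pattern with the standard library. The helper names `open_fasta` and `count_records` are hypothetical and do not come from the CheckAMG source.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file in text mode, decompressing if it ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open_fasta(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

The same `count_records` call then works for both `.faa` and `.faa.gz` inputs.
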

v0.9.0

01 Feb 07:07

0.9.0

  • Updated checkamg download to retrieve and extract a pre-built, standardized CheckAMG database containing all required profile HMMs and cutoff files, rather than downloading individual databases from their original sources.
    • This ensures reproducibility across CheckAMG and database versions and avoids failures caused by upstream download links changing or disappearing.
    • Added the notebook build_checkamg_db.ipynb, which documents how the standardized CheckAMG database is assembled, including data sources and formatting steps.

v0.8.1

26 Jan 22:47

0.8.1

  • Modified annotate_hmm.py so it resumes HMM searches from the last completed database instead of restarting all HMM searches across all databases.

    • This allows long runs with very large inputs that crash due to memory issues to resume where they left off, saving time when rerunning with more memory.
    • HMM search parameters, filtering strategy, and other aspects of the annotation pipeline are unchanged.
  • Added the --keep_full_hmm_results option to CheckAMG annotate to control whether full HMM search results are written.

    • Previously, full results were always written by default, which can use substantial disk space for large inputs. This option now defaults to False.
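
Resuming from the last completed database can be sketched with one marker file per database. This is an illustration of the general pattern, not the logic in annotate_hmm.py: `run_searches`, `search_fn`, and the `.done` marker naming are assumptions made for this example.

```python
from pathlib import Path


def run_searches(databases, out_dir, search_fn):
    """Run search_fn once per database, skipping databases already finished.

    A marker file is written only after a search returns, so a crash during
    a search leaves that database unmarked and it is retried on the next run.
    Returns the databases searched during this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for db in databases:
        marker = out_dir / f"{db}.done"
        if marker.exists():
            continue  # completed in an earlier run
        search_fn(db)
        marker.touch()
        ran.append(db)
    return ran
```

Rerunning after a crash with the same `out_dir` then repeats only the searches that never finished.
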

v0.8.0

23 Jan 15:15

0.8.0

  • Removed the split between "hard" and "soft" keyword filters for AVG annotation filtering.

    • All keywords are now treated as a single filter set, including those previously classified as "soft".
    • These filter hits are no longer bypassed based on exceptional profile HMM matches.
    • As a result, the --scaling_factor argument has been removed.
  • Genome context now reports the distance to contig ends for each gene
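
As a rough illustration of the distance-to-contig-end values, the sketch below assumes 1-based inclusive gene coordinates and measures distance in bases. Both are assumptions, since the release note does not state the coordinate convention or units, and `distance_to_contig_ends` is a hypothetical name.

```python
def distance_to_contig_ends(start, end, contig_length):
    """Return (bases before the gene, bases after the gene) on its contig.

    Assumes 1-based inclusive coordinates with start <= end <= contig_length.
    """
    return start - 1, contig_length - end


left, right = distance_to_contig_ends(start=101, end=400, contig_length=1000)
print(left, right)  # 100 600
```
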

v0.7.0

21 Dec 00:19

0.7.0

This release expands AVG annotations, standardizes HMM annotation and filtering across databases, improves HMMsearch reporting and filtering, and adds reproducibility assets for rebuilding reference tables used by CheckAMG.

Major changes include:

  • Expansion of the curated annotations used by CheckAMG (AMGs, APGs, AReGs), plus a large expansion of FOAM and KEGG reference annotations.
  • Added CAMPER profile HMMs (McGivern et al., 2024) to the CheckAMG database.
  • Added reproducibility assets for rebuilding the required tables/files used by CheckAMG in the notebooks folder.
  • KEGG AMGs were expanded using BRITE KO classifications (beyond the previous KOs sourced from VIBRANT).
  • False-positive filtering is now driven by explicit, standardized, pre-flagged HMM ID tables (hard/soft and exception categories) that have been inspected (see make_checkamg_required_tables.ipynb), instead of only keyword lists.
  • Refined the terms used to filter false positives that were either too strict or too lenient.
  • HMMsearch reporting is more explicit: the pipeline now carries per-hit "kept vs removed" information (and rationale) and writes a best-hit-per-sequence filtered output in addition to the full hit table.
  • Default annotation thresholds were updated: --scaling_factor -> 3.0, --bit_score -> 30, --cov_fraction -> 0.30 (and cov_fraction is now HMM profile coverage, not sequence coverage).
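
The difference between profile coverage and sequence coverage can be made concrete with a small sketch. It assumes coverage is the aligned span of the profile divided by the profile length, which is a common definition but is not confirmed by these notes. The dictionary keys and `passes_thresholds` are hypothetical, and the scaling factor is left out for brevity.

```python
def profile_coverage(hmm_from, hmm_to, hmm_length):
    """Fraction of the profile HMM covered by the alignment (1-based, inclusive)."""
    return (hmm_to - hmm_from + 1) / hmm_length


def passes_thresholds(hit, bit_score=30.0, cov_fraction=0.30):
    """Keep a hit only if it meets both the bit score and profile coverage cutoffs."""
    cov = profile_coverage(hit["hmm_from"], hit["hmm_to"], hit["hmm_length"])
    return hit["bit_score"] >= bit_score and cov >= cov_fraction


# A short protein aligned over its full length (high sequence coverage) can
# still cover only a small part of a long profile, so it is rejected here.
fragment = {"bit_score": 45.0, "hmm_from": 1, "hmm_to": 50, "hmm_length": 400}
full_hit = {"bit_score": 45.0, "hmm_from": 10, "hmm_to": 350, "hmm_length": 400}
print(passes_thresholds(fragment), passes_thresholds(full_hit))  # False True
```
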

v0.6.2

12 Dec 03:29

0.6.2

  • AMGs.tsv, AReGs.tsv, hmm_id_to_name.csv:
    • Fixed FOAM profile HMM annotations that had valid KO labels but whose names/descriptions were not mapped correctly.

v0.6.1

10 Dec 19:11

0.6.1

  • filter_by_cds.py:
    • Fixed a bug where vMAGs with too few ORFs to pass the --min_orf filter were still written as empty files under filtered_faa_by_cds, causing the annotation step to crash.

v0.6.0

28 Oct 19:22

0.6.0

  • lgbm_model.joblib, lgbm_feature_names.joblib, lgbm_thresholds.joblib:

    • Improvements to the viral origin confidence LGBM
    • Added additional features that consider the V/VL-scores of the 3 nearest genes on the left and right flanks of each gene, and the V/VL-scores of the nearest mobile genes
  • AMGs.tsv, APGs.tsv, AReGs.tsv, FOAM.tsv, hmm_id_to_name.csv, mobile_genes.csv, viral_hallmark_genes.csv:

    • Slight modifications to the AMG, APG, AReG, viral hallmark, and mobile genes lists
    • Updated the all-HMM list with updated dbCAN and missing FOAM annotations
  • download_db.py:

    • Added functionality to download and prepare the database-provided bitscore thresholds for KEGG and FOAM using the same versions as the downloaded HMMs
    • Fixed incorrect version label for the dbCAN HMM files
  • CheckAMG_annotate.smk:

    • Updated the KEGG and FOAM threshold file locations from the 'files' directory (with the AMG, APG, AReG, etc. tables) to the 'db' directory (with the HMM profiles)
    • Now puts the snakemake *.done files in their own folder
  • CheckAMG_annotate.py:

    • Set up a subfolder for snakemake files
  • main.py:

    • Changed the default --scaling_factor from 1.6 to 1.8 due to KEGG threshold updates
  • pyproject.toml:

    • Added a missing scikit-learn dependency

v0.5.3

12 Oct 19:39

0.5.3

  • download_db.py:
    • Updated URLs to download dbCAN v14

v0.5.2

06 Oct 22:13

0.5.2