Releases: AnantharamanLab/CheckAMG
v0.10.0
- Added genome-context curation to the curate_annots.py module:
  - Flag proteins located outside strict viral regions or directly adjacent to their boundaries, based on window average VL-scores and V-scores of genes (or hallmark genes if enabled). These proteins are not removed from final AVG predictions by default, since strict viral region calls can be too conservative when viral origin confidence remains high.
  - Flag proteins in contiguous runs of 3 or more AVGs. These are excluded by default, since such runs indicate non-auxiliary function.
  - Parameter values are configurable with --min-flank-vscore, --min-window-avg-vlscore, and --max-avg-array-length.
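The run-based flag can be illustrated with a minimal sketch. It is not the curate_annots.py implementation: the function name and the `min_run` parameter are hypothetical, and how `--max-avg-array-length` maps onto the "3 or more" rule is not stated here.

```python
from itertools import groupby


def flag_avg_runs(is_avg, min_run=3):
    """Return one flag per gene: True if the gene is an AVG inside a
    contiguous run of at least `min_run` AVGs.

    `is_avg` lists genes in contig order; True marks a candidate AVG.
    `min_run` is a hypothetical parameter chosen to mirror the
    "3 or more" rule in the release note.
    """
    flags = []
    for value, run in groupby(is_avg):
        run_len = len(list(run))
        # Only runs of AVGs (value is True) that are long enough get flagged.
        flags.extend([value and run_len >= min_run] * run_len)
    return flags


calls = [True, False, True, True, True, False, True, True]
flags = flag_avg_runs(calls)
# Only the middle run of three consecutive AVGs is flagged.
```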
- Updated the viral origin confidence LGBM
  - Previous versions trained and evaluated the model using train/test splits that could inadvertently exclude some proteins from a contig.
  - As a result, the model was sometimes trained with incomplete genome context information, though this did not introduce data leakage between training, validation, and test sets.
  - This has been corrected. The model was retrained and re-evaluated using datasets that retain all proteins encoded by each contig (including a new test dataset comprised of only host chromosomes with integrated proviruses and some integrated MGEs).
  - Inference is now parallelized by batching multiple contigs per prediction call to improve throughput.
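As a rough illustration of batching whole contigs per prediction call, the sketch below groups proteins by contig so that no contig is split across batches. The function name, the `(contig_id, protein_id)` tuple layout, and the `contigs_per_batch` parameter are assumptions made for this example and are not taken from CheckAMG.

```python
from collections import defaultdict


def batch_by_contig(proteins, contigs_per_batch=2):
    """Yield batches of (contig_id, protein_id) tuples.

    Each batch holds every protein of up to `contigs_per_batch` contigs,
    so genome context for a contig is never split between batches.
    """
    by_contig = defaultdict(list)
    for contig_id, protein_id in proteins:
        by_contig[contig_id].append(protein_id)

    contig_ids = list(by_contig)  # insertion order is preserved
    for i in range(0, len(contig_ids), contigs_per_batch):
        chunk = contig_ids[i:i + contigs_per_batch]
        yield [(c, p) for c in chunk for p in by_contig[c]]


proteins = [("c1", "p1"), ("c1", "p2"), ("c2", "p3"), ("c3", "p4"), ("c3", "p5")]
batches = list(batch_by_contig(proteins))
```

Each batch could then be passed to a single model prediction call instead of calling the model once per contig.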
- Additional changes:
  - Added support for gzipped FASTA file inputs (.fasta.gz, .fna.gz, and .faa.gz).
  - Now logs how many AVGs of each type were filtered during the curation module.
  - Changed the order of some logging messages in organize_proteins.py.
  - Now writes full HMMsearch results to parquet instead of TSV when --keep_full_hmm_results is enabled.
  - Added additional methyltransferase annotations from Pfam to the AMG and AMG filter lists.
  - Added additional defense/anti-defense annotations to the APG and AReG lists.
  - Added a feature to optionally save all intermediate and final tables to parquet instead of TSV to reduce file size on large input datasets.
  - Changed some argument names (e.g., --genomes to --input-contigs, --vmags to --input-bins) and renamed variables and log messages to be less specific to genomes/vMAGs and more generalized to contigs/bins.
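For the gzipped FASTA support listed above, a minimal standard-library sketch of transparent reading might look like the following. The helper names are hypothetical and this is not how CheckAMG itself parses input; it only shows the general pattern of choosing an opener by file extension.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if it ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) tuples from a plain or gzipped FASTA file."""
    header, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    # Emit the final record, if the file contained any.
    if header is not None:
        yield header, "".join(chunks)
```

The same `read_fasta` call then works for both `contigs.fna` and `contigs.fna.gz` (file names here are placeholders).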
v0.9.0
- Updated checkamg download to retrieve and extract a pre-built, standardized CheckAMG database containing all required profile HMMs and cutoff files, rather than downloading individual databases from their original sources.
  - This ensures reproducibility across CheckAMG and database versions and avoids failures caused by upstream download links changing or disappearing.
- Added the notebook build_checkamg_db.ipynb, which documents how the standardized CheckAMG database is assembled, including data sources and formatting steps.
v0.8.1
- Modified annotate_hmm.py so it resumes HMM searches from the last completed database instead of restarting all HMM searches across all databases.
  - This allows long runs with very large inputs that crash due to memory issues to resume where they left off, saving time when rerunning with more memory.
  - HMM search parameters, filtering strategy, and other aspects of the annotation pipeline are unchanged.
- Added the --keep_full_hmm_results option to CheckAMG annotate to control whether full HMM search results are written.
  - Previously, full results were always written by default, which can use substantial disk space for large inputs. This option now defaults to False.
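The resume behavior for HMM searches can be pictured with a small checkpointing sketch. This is an assumption-laden illustration, not the annotate_hmm.py code: the per-database `.done` marker files, the `run_searches` function, and the `search_fn` callback are all hypothetical.

```python
from pathlib import Path


def run_searches(databases, out_dir, search_fn):
    """Run search_fn(db) for each database, skipping completed ones.

    A marker file `<db>.done` is written only after a search succeeds,
    so a rerun after a crash restarts at the first unfinished database.
    Returns the list of databases searched in this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for db in databases:
        marker = out_dir / f"{db}.done"
        if marker.exists():
            continue  # finished in an earlier run
        search_fn(db)
        marker.touch()
        ran.append(db)
    return ran
```

Writing the marker after the search, rather than before, is what makes a crash mid-search safe: the interrupted database is simply searched again on the next run.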
v0.8.0
- Removed the split between "hard" and "soft" keyword filters for AVG annotation filtering.
  - All keywords are now treated as a single filter set, including those previously classified as "soft".
  - These filter hits are no longer bypassed based on exceptional profile HMM matches.
  - As a result, the --scaling_factor argument has been removed.
- Genome context now reports the distance to contig ends for each gene.
v0.7.0
This release expands AVG annotations, standardizes HMM annotation and filtering across databases, improves HMMsearch reporting and filtering, and adds reproducibility assets for rebuilding reference tables used by CheckAMG.
Major changes include:
- Expansion of the curated annotations used by CheckAMG (AMGs, APGs, AReGs), plus a large expansion of FOAM and KEGG reference annotations.
- Added CAMPER profile HMMs (McGivern et al., 2024) to the CheckAMG database.
- Added reproducibility assets for rebuilding the required tables/files used by CheckAMG in the notebooks folder.
- KEGG AMGs were expanded using BRITE KO classifications (beyond the previous KOs sourced from VIBRANT).
- False-positive filtering is now driven by explicit, standardized, and inspected (see make_checkamg_required_tables.ipynb) pre-flagged HMM ID tables (hard/soft and exception categories) instead of only keyword lists.
- Refined terms used to filter false positives that were either too strict or too lenient.
- HMMsearch reporting is more explicit: the pipeline now carries per-hit "kept vs removed" information (and rationale) and writes a best-hit-per-sequence filtered output in addition to the full hit table.
- Default annotation thresholds were updated: --scaling_factor -> 3.0, --bit_score -> 30, --cov_fraction -> 0.30 (and cov_fraction is now HMM profile coverage, not sequence coverage).
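To make the distinction between HMM profile coverage and sequence coverage concrete, the sketch below computes the fraction of the profile covered by an alignment and applies the two default thresholds quoted above. The hit dictionary keys, the function names, and the use of inclusive (>=) comparisons are assumptions for illustration; the scaling factor is not modeled because its semantics are not described here.

```python
def hmm_profile_coverage(hmm_from, hmm_to, hmm_length):
    """Fraction of the profile HMM covered by the alignment.

    Coordinates are 1-based and inclusive on the profile, not on the
    query protein sequence.
    """
    return (hmm_to - hmm_from + 1) / hmm_length


def passes_thresholds(hit, min_bit_score=30.0, min_cov_fraction=0.30):
    """Check a hit against bit score and profile coverage cutoffs."""
    cov = hmm_profile_coverage(hit["hmm_from"], hit["hmm_to"], hit["hmm_length"])
    return hit["bit_score"] >= min_bit_score and cov >= min_cov_fraction


good_hit = {"hmm_from": 11, "hmm_to": 90, "hmm_length": 100, "bit_score": 45.2}
short_hit = {"hmm_from": 1, "hmm_to": 20, "hmm_length": 100, "bit_score": 45.2}
```

Here `good_hit` covers 80 of 100 profile positions and passes, while `short_hit` covers only 20 and fails despite having the same bit score.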
v0.6.2
v0.6.1
v0.6.0
- lgbm_model.joblib, lgbm_feature_names.joblib, lgbm_thresholds.joblib:
  - Improvements to the viral origin confidence LGBM
  - Added additional features that consider the V/VL-scores of the 3 nearest genes on the left and right flanks of each gene, and the V/VL-scores of the nearest mobile genes
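The flank-based features can be sketched as a simple neighbor lookup over per-gene scores in contig order. This is only an illustration of the idea: the function name, the `pad` value used near contig ends, and the flat list of scores are assumptions, and the actual feature engineering in the model may differ.

```python
def flank_scores(scores, index, n_flank=3, pad=None):
    """Return (left, right) score lists for the gene at `index`.

    Each list has exactly `n_flank` entries ordered by genome position.
    Positions that fall off the contig end are filled with `pad`
    (an assumed convention for this sketch).
    """
    left = scores[max(0, index - n_flank):index]
    right = scores[index + 1:index + 1 + n_flank]
    left = [pad] * (n_flank - len(left)) + left
    right = right + [pad] * (n_flank - len(right))
    return left, right


v_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
left, right = flank_scores(v_scores, index=1)
# The second gene has one real left neighbor and three right neighbors.
```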
- AMGs.tsv, APGs.tsv, AReGs.tsv, FOAM.tsv, hmm_id_to_name.csv, mobile_genes.csv, viral_hallmark_genes.csv:
  - Slight modifications to the AMG, APG, AReG, viral hallmark, and mobile genes lists
  - Updated the all-HMM list with updated dbCAN and missing FOAM annotations
- download_db.py:
  - Added functionality to download and prepare the database-provided bitscore thresholds for KEGG and FOAM using the same versions as the downloaded HMMs
  - Fixed incorrect version label for the dbCAN HMM files
- CheckAMG_annotate.smk:
  - Updated the KEGG and FOAM threshold file locations from the 'files' directory (with the AMG, APG, AReG, etc. tables) to the 'db' directory (with the HMM profiles)
  - Now puts the snakemake *.done files in their own folder
- CheckAMG_annotate.py:
  - Set up a subfolder for snakemake files
- main.py:
  - Changed the default --scaling_factor from 1.6 to 1.8 due to KEGG threshold updates
- pyproject.toml:
  - Added a missing scikit-learn dependency
v0.5.3
v0.5.2
- Minor fix to the formatting in the 'false' AMG keywords