Tandem-repeat array identifier — a Python port of the TRASH program.
trash-py is a Python re-implementation of the TRASH program written by Piotr Włodzimierz (pwlodzimierz@ibb.waw.pl, Institute of Biochemistry and Biophysics, Polish Academy of Sciences). The upstream repository lives at https://github.com/vlothec/TRASH_2. All algorithmic credit for the underlying approach belongs to the original author; this port is an independent rewrite (no code is byte-identical with the upstream) that retains the structure and logic of the upstream pipeline.
The upstream MIT license is reproduced in LICENSES/TRASH-UPSTREAM-MIT.txt to satisfy its notice-preservation clause.
trash-py aims to build on the substantial work that went into the original TRASH repeat-annotation pipeline by adopting a more flexible and faster architecture: a Python runner backed by C libraries. It exposes a library of reusable functions that can be incorporated into other repeat-annotation workflows beyond simply running the TRASH pipeline, and the hottest functions have been ported to C to maximise performance.
Currently, on smaller, less repetitive genomes (e.g. Arabidopsis, human), the bottleneck is nhmmer rather than the TRASH pipeline itself, which completes in a fraction of the time the original takes.
The stderr logs have also been changed to reflect the new internals and to give trash-py a little of its own personality.
The best way to install the tool is using conda/mamba/micromamba:
conda install -c bioconda trash-py
If you would rather build and install from source, install nhmmer and Clustal Omega and make sure both are available on the PATH. You will also need a suitable C compiler (gcc or clang) for the C extensions. There is no need to compile these separately; pip should pick up the build instructions in setup.py.
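Before building from source, it can help to confirm that the external tools are actually discoverable. The following is a minimal sketch rather than part of trash-py: the helper name `find_missing` is hypothetical, and it assumes the executables are called `nhmmer` (as shipped with HMMER) and `clustalo` (the usual Clustal Omega binary name).

```python
import shutil

# Assumed executable names: `nhmmer` ships with HMMER and `clustalo` is the
# usual Clustal Omega binary. Adjust these if your installation differs.
REQUIRED_TOOLS = ("nhmmer", "clustalo")
C_COMPILERS = ("gcc", "clang")  # any one of these is enough


def find_missing(required=REQUIRED_TOOLS, compilers=C_COMPILERS, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to build."""
    problems = [f"{tool} not found on PATH" for tool in required if which(tool) is None]
    if not any(which(cc) is not None for cc in compilers):
        problems.append("no C compiler found (looked for: " + ", ".join(compilers) + ")")
    return problems


if __name__ == "__main__":
    issues = find_missing()
    for issue in issues:
        print("missing:", issue)
    if not issues:
        print("all build prerequisites found")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` searches the current PATH.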
trash-py can then be installed by:
git clone https://github.com/mbeavitt/trash-py
cd trash-py
pip install .
Once installed, a basic run takes an input FASTA file and an output directory:
trash-py -f input.fasta -o output_dir
Currently, the CLI aims to mirror that of the original TRASH tool as closely
as possible, so that it can serve as a drop-in replacement. trash-py adds two
options over upstream: -q to silence logs, and -p to run the
array-identification and repeat-mapping stages across multiple worker processes.
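For scripted or batch use, the CLI can be driven from Python. The sketch below is an illustration only and does not use trash-py's own library API: `build_command` and `run_trash` are hypothetical helper names, the file names are placeholders, and the only flags used are the ones documented in this README (-f, -o, -i, -m, -t, -q, -p).

```python
import subprocess


def build_command(fasta, output_dir, processes=1, quiet=False,
                  min_rep_size=None, max_rep_size=None, templates=None):
    """Assemble a trash-py argument list from the documented CLI flags."""
    cmd = ["trash-py", "-f", str(fasta), "-o", str(output_dir)]
    if min_rep_size is not None:
        cmd += ["-i", str(min_rep_size)]
    if max_rep_size is not None:
        cmd += ["-m", str(max_rep_size)]
    if templates is not None:
        cmd += ["-t", str(templates)]
    if quiet:
        cmd.append("-q")
    if processes != 1:
        # -p is omitted for a single process, which is the documented default.
        cmd += ["-p", str(processes)]
    return cmd


def run_trash(fasta, output_dir, **options):
    """Run trash-py and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(fasta, output_dir, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command, it does not run it.
    print(" ".join(build_command("input.fasta", "output_dir", processes=4, quiet=True)))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.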
If you use trash-py in academic work, please cite the original TRASH
publication:
Wlodzimierz, P., Hong, M., & Henderson, I. R. (2023). TRASH: tandem repeat annotation and structural hierarchy. Bioinformatics, 39(5), btad308.
BibTeX:
@article{wlodzimierz2023trash,
title={TRASH: tandem repeat annotation and structural hierarchy},
author={Wlodzimierz, Piotr and Hong, Michael and Henderson, Ian R},
journal={Bioinformatics},
volume={39},
number={5},
pages={btad308},
year={2023},
publisher={Oxford University Press}
}
trash-py is released under the MIT License; see LICENSE.
$ trash-py --help
usage: trash-py [-h] [-V] -f FASTA -o OUTPUT [-m MAX_REP_SIZE]
                [-i MIN_REP_SIZE] [-t TEMPLATES] [-q] [-p PROCESSES]

TRASH — tandem-repeat array identifier (Python)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -f, --fasta FASTA     input fasta
  -o, --output OUTPUT   output directory
  -m, --max-rep-size MAX_REP_SIZE
  -i, --min-rep-size MIN_REP_SIZE
  -t, --templates TEMPLATES
                        optional template fasta — assigns class names from
                        headers
  -q, --quiet           suppress progress output
  -p, --processes PROCESSES
                        parallel worker processes for the array-identification
                        and repeat-mapping stages (default 1 = serial)