Topological recurrence analysis of genome alignments using topological data analysis.
© 2020-26 Andreas Ott, Michael Bleher, Maximilian Neumann
EVOtRec (topological Recurrence in EVOlution) is a Python module for the efficient and scalable topological recurrence analysis of large genome alignments. It uses persistent homology to compute the topological recurrence index (tRI) of single nucleotide variations in a nucleotide sequence alignment. EVOtRec enables the inference of the dynamics of convergently evolving genomic variants directly from topological patterns in the genomic dataset, without requiring the construction of a phylogenetic tree. For a paper describing EVOtRec and its applications to molecular evolution in more detail, see Bleher et al., 2026.
- Python: version 3.8 or higher
- C++ compiler: for building Ripser (e.g. gcc>=11.4.0)
- Make: for building Ripser (>=4.3)
git clone https://github.com/ottamj/evotrec.git
cd evotrec
The Python dependencies are:
- Biopython: for sequence analysis and FASTA file handling (==1.86)
- hammingdist: for efficient Hamming distance calculations (==1.4.0)
To install these dependencies run:
pip install -r requirements.txt
Note: While we do not expect breaking changes in future versions of these dependencies, we cannot guarantee compatibility. We recommend testing your installation with the provided example data.
EVOtRec requires a special version of Ripser that also computes representative cycles:
git clone --branch tight-representative-cycles https://github.com/Ripser/ripser.git
cd ripser
make
This will create the ripser-representatives executable in the ripser/ directory.
Add it to your PATH by adding the following to your .bashrc or .bash_profile:
export PATH=$PATH:/path/to/ripser
You can check that Ripser was installed successfully by running it on the example distance matrix provided in the ripser repository:
ripser-representatives examples/sphere_3_192.lower_distance_matrix
For more detailed installation instructions, see the Ripser GitHub repository.
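If EVOtRec later fails at the Ripser step, a common cause is that the executable is not visible on PATH. The following minimal sketch checks this from Python using only the standard library. The function name find_ripser is hypothetical and not part of EVOtRec.

```python
import shutil


def find_ripser(executable="ripser-representatives"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ripser()
    if path is None:
        print("ripser-representatives not found on PATH; check the export PATH line.")
    else:
        print(f"Found Ripser at {path}")
```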
You can test EVOtRec using the example alignments in the examples/ directory.
python evotrec.py examples/sars-cov-2_spike_gene.fasta "NC_045512.2|Severeacuterespiratorysyndromecoronavirus2isolateWuhan-Hu-1,completegenome|China|2019-12-30" --timeseries
The command produces the following output:
===============================================================
EVOtRec -- Topological recurrence analysis of genome alignments
(c) 2020-26 Andreas Ott, Michael Bleher, Maximilian Neumann
===============================================================
Preparing examples/sars-cov-2_spike_gene.fasta...
Start date: 2019-12-30
End date: 2021-01-03
Time range: 371 days
Number of sequences: 3274
Hammingdist...
# hammingdist <status>
MuRiT...
Ripser...
Analyzing cycles...
Computing tRI...
Results written to examples/sars-cov-2_spike_gene.csv.
After successful execution, the following text files will be created:
examples/
├── sars-cov-2_spike_gene.csv
├── sars-cov-2_spike_gene.dist
├── sars-cov-2_spike_gene.ripser
└── sars-cov-2_spike_gene.timedist
- FASTA file: a nucleotide sequence alignment in FASTA format
- Reference sequence header: the header (excluding the leading ">") of the reference sequence in the FASTA file, enclosed in double quotes
- --timeseries (optional): When using the --timeseries flag, EVOtRec requires that all sequence headers in the FASTA file contain a date in the following format:
>seq-id|field1|field2|...|YYYY-MM-DD
Requirements for sequence dates:
- Format: must be exactly YYYY-MM-DD (ISO 8601 format)
- Position: must be the last field when splitting by |
- Chronological order: all sequence dates must be on or after the reference sequence date
If any sequence header does not follow this format or contains an invalid date, EVOtRec will raise a ValueError with details about the problematic sequence.
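To catch header problems before starting a long run, the date requirements can be pre-checked with a few lines of Python. This is a minimal sketch of the rules as stated in this README, not EVOtRec's own validation code. The names parse_header_date and check_headers are hypothetical, and headers are assumed to be passed without the trailing newline.

```python
import re
from datetime import date

# Exactly four digits, two digits, two digits, separated by hyphens.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def parse_header_date(header):
    """Return the date stored in the last '|'-separated field of a FASTA header."""
    field = header.strip().lstrip(">").split("|")[-1]
    if not DATE_PATTERN.fullmatch(field):
        raise ValueError(f"last field is not YYYY-MM-DD in header: {header!r}")
    # fromisoformat raises ValueError for impossible dates such as 2020-02-30.
    return date.fromisoformat(field)


def check_headers(headers, refseq_header):
    """Raise ValueError if any header is malformed or predates the reference."""
    ref_date = parse_header_date(refseq_header)
    for header in headers:
        if parse_header_date(header) < ref_date:
            raise ValueError(f"sequence date is before the reference date: {header!r}")
```

For example, check_headers(["seq1|China|2020-03-01"], "ref|China|2019-12-30") returns without error, while a header ending in 2020/03/01 raises a ValueError.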
Remark: For best computational performance, we recommend that the input FASTA is sorted in reverse chronological order (newest sequences first).
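One way to produce such an ordering is sketched below using only the standard library. It assumes every header ends in a valid YYYY-MM-DD field as described above. The FASTA parsing is deliberately simplified, and the function names read_fasta, sort_newest_first, and write_sorted are hypothetical helpers rather than part of EVOtRec.

```python
from datetime import date


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_newest_first(records):
    """Sort (header, sequence) pairs by the date in the last '|' field, newest first."""
    return sorted(
        records,
        key=lambda record: date.fromisoformat(record[0].split("|")[-1]),
        reverse=True,
    )


def write_sorted(in_path, out_path):
    """Read a FASTA file and write a copy sorted in reverse chronological order."""
    with open(in_path) as handle:
        records = sort_newest_first(read_fasta(handle))
    with open(out_path, "w") as out:
        for header, sequence in records:
            out.write(f">{header}\n{sequence}\n")
```

Note that write_sorted emits each sequence on a single line, so any line wrapping in the input file is not preserved.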
python evotrec.py <input_fasta> "<refseq_header>" [--timeseries]
- <input_fasta>: path to the input FASTA file containing sequence alignments
- <refseq_header>: reference sequence header as it appears in the FASTA file (excluding the leading ">")
- --timeseries: optional flag to enable time series analysis (requires a date for each sequence)
The script generates several text files with the same base name as the input file and the following filename extensions:
- <input_fasta>.dist: Hamming distance matrix between all sequences in sparse format
- <input_fasta>.timedist: Rips-transformed distance matrix (generated only when the --timeseries flag is used)
- <input_fasta>.ripser: raw output from the Ripser persistent homology computation
- <input_fasta>.csv: final results of the topological recurrence analysis (tRI list)
The structure of <input_fasta>.csv depends on whether the --timeseries flag is used:
- POS: genomic position of the variation
- REF: reference nucleotide at this position
- ALT: alternative nucleotide (the mutation)
Without --timeseries flag:
tRI: topological recurrence index of the variation
With --timeseries flag:
1,2,3,...,N: one column per day in the time range, each containing the tRI values on that day
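As an illustration of working with the results, the sketch below ranks variants by tRI for a file produced without the --timeseries flag. It assumes the CSV is comma-delimited with a header row naming the columns POS, REF, ALT, and tRI exactly as listed above; check this against your own output before relying on it. The function name top_variants is hypothetical, and the sample values are made up solely to exercise the code.

```python
import csv
import io


def top_variants(csv_file, n=10):
    """Return the n rows with the highest tRI as (POS, REF, ALT, tRI) tuples."""
    reader = csv.DictReader(csv_file)
    rows = [(r["POS"], r["REF"], r["ALT"], float(r["tRI"])) for r in reader]
    return sorted(rows, key=lambda row: row[3], reverse=True)[:n]


# Made-up values for illustration only; this is not real EVOtRec output.
sample = io.StringIO("POS,REF,ALT,tRI\n100,A,G,3\n250,C,T,12\n400,G,A,0\n")
for pos, ref, alt, tri in top_variants(sample, n=2):
    print(pos, ref, alt, tri)
```

To apply this to a real result, pass an open file handle instead of the in-memory sample, for example open("examples/sars-cov-2_spike_gene.csv", newline="") for a run without --timeseries.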
For questions, issues, or feature requests:
- Check the Issues page
- Search existing issues before creating a new one
- Provide detailed information including:
- operating system and Python version
- input data characteristics
- error messages (if any)
- steps to reproduce the issue
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes and add tests
- Run the test suite (pytest tests/)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Run the pytest suite to verify all components produce the desired outputs:
pytest tests/
Or run tests with verbose output:
pytest -v tests/
The test suite includes:
- Unit tests: testing individual functions with mocked data
- Integration tests: end-to-end testing with real example data
- Output validation: comparing results against expected outputs
- Follow PEP 8 for Python code style
- Use meaningful variable and function names
- Add docstrings for new functions
- Include unit tests for new functionality
This project is licensed under the terms specified in the LICENSE file.
If you use EVOtRec in your research, please cite the following references:
@article{bleher2026topological,
title={Ultrafast topological data analysis reveals pandemic-scale dynamics of convergent evolution},
author={Bleher, Michael and Hahn, Lukas and Neumann, Maximilian and Ardern, Zachary and Patino-Galindo, Juan Angel and Carriere, Mathieu and Bauer, Ulrich and Rabadan, Raul and Ott, Andreas},
journal={arXiv preprint arXiv:2106.07292},
year={2026},
url={https://arxiv.org/abs/2106.07292}
}
@article{neumann2022murit,
title={MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations},
author={Neumann, Maximilian and Bleher, Michael and Hahn, Lukas and Braun, Samuel and Obermaier, Holger and Soysal, Mehmet and Caspart, Rene and Ott, Andreas},
journal={arXiv preprint arXiv:2207.03394},
year={2022},
url={https://arxiv.org/abs/2207.03394}
}