09 Feb 15:48

Waschina

26f5af1

Sievers Apple (v2.0.1) Latest

What's Changed

Make gapseq test work on macOS and add support for mawk, by @jonasoh in #283
Fix a bug in the md5sum calculation of reaction names under macOS (#286)

Full Changelog: v2.0.0...v2.0.1

Contributors

jonasoh

14 Jan 09:42

Waschina

v2.0.0

0d96416

Vanilla Orange (v2.0.0)

What's Changed

As a major change, this version includes a re-implementation of parts of gapseq find and gapseq find-transport:

Run time is greatly improved by performing only one large multiple sequence alignment rather than many smaller ones.
Users can now choose between three sequence alignment algorithms (blast, diamond, mmseqs2) with the option -A <algorithm> in gapseq find and gapseq find-transport.
A number of bug fixes (see PR #258)
The output table <query>-Pathways.tbl now includes additional columns that fully document how the completion percent was calculated and why the pathways were predicted to be present or absent. Also, an FAQ and its answer concerning completeness calculations were added to the documentation.
When a genomic nucleotide FASTA file is used as input, it’s first translated into amino acid sequences of open reading frames (ORFs). For this step, the optional dependency pyrodigal is required.
gapseq automatically selects the appropriate codon translation table by running pyrodigal with each of three tables:
- Table 4: "Mycoplasma/Spiroplasma (Mollicutes)"
- Table 11: "Bacterial, Archaeal, and Plant Plastid Code" (default for most prokaryotic tools)
- Table 25: "Candidate Division SR1 and Gracilibacteria"
The choice between Table 11 and Tables 4/25 depends on genome coverage. If using Table 4 or 25 gives at least 5% higher coverage than Table 11, then 4 or 25 is used. Choosing between Table 4 and 25 is more nuanced since both yield the same coverage. The key difference is how the codon UGA is interpreted:
- In Table 11, UGA is a stop codon.
- In Table 4, UGA codes for Tryptophan.
- In Table 25, UGA codes for Glycine.
Since the Tryptophan content in proteins is typically around 1%, the table that produces a Tryptophan usage closest to this value is selected.

Admittedly, this approach relies on an arbitrary threshold, but it works well in practice. If users already know the correct codon table for their genome, they can provide a protein FASTA file directly to avoid translation by gapseq.
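The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not gapseq's actual code: the function name, the input dictionaries, and the reading of "5% higher" as an absolute difference in coverage fraction are all assumptions.

```python
TARGET_TRP_FRACTION = 0.01  # tryptophan is typically around 1% of residues
MIN_COVERAGE_GAIN = 0.05    # assumption: "5% higher" read as an absolute gain


def choose_translation_table(coverage, trp_fraction):
    """Pick codon table 11, 4 or 25 (hypothetical helper).

    coverage:     dict mapping table number -> genome coverage (0..1)
    trp_fraction: dict mapping table number -> Trp share of translated residues
    """
    # Tables 4 and 25 are expected to give (nearly) identical coverage.
    alt_coverage = max(coverage[4], coverage[25])

    # Keep the standard table unless the alternatives are clearly better.
    if alt_coverage - coverage[11] < MIN_COVERAGE_GAIN:
        return 11

    # UGA is read as Trp in table 4 and as Gly in table 25, so the Trp
    # share differs; take the table whose Trp share is closest to ~1%.
    return min((4, 25), key=lambda t: abs(trp_fraction[t] - TARGET_TRP_FRACTION))
```

For example, with coverage values of 0.55 (Table 11) and 0.90 (Tables 4 and 25), the sketch returns whichever of 4 or 25 has the Trp share nearer to 0.01.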
There are fewer dependencies on other software libraries. Specifically, the dependencies on 'exonerate', 'barrnap', 'bedtools', 'perl', and 'parallel' were dropped.
Users can now specify a custom directory for the reference sequence database, and which version to use (not only the latest). This option is especially relevant in cases where gapseq is installed in a location where the user does not have write permissions. See documentation for details.
For protein complexes, gapseq infers which subunit a reference sequence belongs to from the FASTA headers. However, subunit naming is often inconsistent. Example: for EC 1.2.7.1 (pyruvate synthase), some proteins have the subunits stated as "subunit alpha/beta/gamma/delta", while others have "subunit PorA/PorB/PorC/PorD". For enzymes where this is often an issue, there is now a subunit ID dictionary in dat/complex_subunit_dict.tsv that links synonyms to common IDs. Currently, the dictionary is curated manually, although this could probably be automated in the future.
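The idea of mapping inconsistent subunit names to a common ID could look roughly like the Python sketch below. The column layout of dat/complex_subunit_dict.tsv is not described in these notes, so the in-memory dictionary, the common IDs ("su1" to "su4"), the pairing of alpha/PorA and so on, and the function name are all hypothetical.

```python
import re

# Hypothetical synonym table keyed by (EC number, lower-cased subunit name).
# The pairing of names below is illustrative, not taken from the real file.
SUBUNIT_SYNONYMS = {
    ("1.2.7.1", "alpha"): "su1", ("1.2.7.1", "pora"): "su1",
    ("1.2.7.1", "beta"): "su2",  ("1.2.7.1", "porb"): "su2",
    ("1.2.7.1", "gamma"): "su3", ("1.2.7.1", "porc"): "su3",
    ("1.2.7.1", "delta"): "su4", ("1.2.7.1", "pord"): "su4",
}

# Capture the word that follows "subunit" in a FASTA header.
SUBUNIT_PATTERN = re.compile(r"subunit\s+(\w+)", re.IGNORECASE)


def common_subunit_id(ec, header, synonyms=SUBUNIT_SYNONYMS):
    """Return a common subunit ID for a FASTA header, or None if no subunit is named."""
    match = SUBUNIT_PATTERN.search(header)
    if match is None:
        return None
    name = match.group(1).lower()
    # Fall back to the raw name when the dictionary has no entry.
    return synonyms.get((ec, name), name)
```

With such a lookup, headers that say "subunit alpha" and "subunit PorA" for the same EC number end up under one ID, so hits to either count towards the same subunit.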

Other small changes in the new gapseq version

Complex detection

In both the old and the new gapseq version, complexes are detected by analysing the FASTA sequence headers for key terms such as "chain" or "subunit". In rare cases, a reaction had several reference sequences but only very few of them indicated a subunit association; gapseq nevertheless required hits to those few sequences to predict that the complex is present. In most organisms, however, such an enzyme might not be a complex/heteromer at all.

New approach: If 20% or less of the sequences are predicted to be a specific subunit, the reaction is not tested as a complex; i.e., no subunit hits are required for the reaction prediction to be TRUE. This is implemented in src/complex_prediction.R.
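As a rough illustration of the threshold rule (the actual implementation is the R code in src/complex_prediction.R, which is not reproduced here), the decision could be sketched in Python as follows. The function name, the regular expression, and the reading of the 20% as the share of reference sequences whose header carries any subunit term are assumptions.

```python
import re

# Key terms mentioned in the release notes; the real detection may use more.
SUBUNIT_TERMS = re.compile(r"\b(subunit|chain)\b", re.IGNORECASE)
MAX_SHARE_FOR_NON_COMPLEX = 0.20


def treat_as_complex(headers):
    """Decide whether a reaction should be tested as a complex (hypothetical helper).

    headers: list of FASTA header strings of the reaction's reference sequences
    """
    if not headers:
        return False
    n_subunit = sum(1 for h in headers if SUBUNIT_TERMS.search(h))
    # At 20% or less, subunit hits are not required for the prediction.
    return n_subunit / len(headers) > MAX_SHARE_FOR_NON_COMPLEX
```

Under this reading, a reaction with one subunit-annotated header among five reference sequences is handled like a single-protein enzyme, while two among five would still be tested as a complex.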

Gram prediction

Gram prediction is used to determine which biomass reaction to add to a bacterial metabolic model. In the previous version, the prediction was made within gapseq draft, where the biomass reaction is also added to the model. Now, the Gram-staining prediction has moved to gapseq find. The rationale is that gapseq find already takes the genome sequence as input, and the HMM-based Gram prediction also requires the genome. The predicted Gram staining is added to the headers of the output tables "...-Reactions.tbl" and "...-Pathways.tbl".

Updating reference sequence databases

gapseq now has a new module to update the reference sequence database. Two examples:

gapseq update-sequences -t Bacteria # Update reference sequences for Bacteria
gapseq update-sequences -t Bacteria -D ~/gapseqDB/ # Update reference sequences for Bacteria and save the database in a user-defined directory

New Contributors

@cmkobel made their first contribution in #217

Full Changelog: v1.4.0...v2.0.0

Contributors

cmkobel

10 Feb 07:58

Waschina

v1.4.0

9b5d3ac

Berkeley Pit (v1.4.0)