FASTA file parsing written in C++ with Python bindings

Installation

pip install pyfastatools

Usage

The pyfastatools.Parser object is the primary API that parses FASTA files and yields pyfastatools.Record objects.

If you have a FASTA file called proteins.faa that looks like this:

>seq_1
MSKFKKIPL
>seq_2
MQSSSKTCN
>seq_3
MEDNMITIY

You can then parse this file in Python like this:

from pyfastatools import Parser

for record in Parser("proteins.faa"):
    print(record.header.name, record.seq)

which will print:

seq_1 MSKFKKIPL
seq_2 MQSSSKTCN
seq_3 MEDNMITIY

API

The library has a small API, summarized below:

Parser

This is the main class and covers most use cases. It parses a FASTA file and produces Record objects. Only the path to the FASTA file is required:

pyfastatools.Parser("my_fasta.fasta")

The parser will attempt to auto-detect the RecordType of the file by checking the input file extension and the first 5 sequences.

However, the record type can optionally be specified:

pyfastatools.Parser("my_fasta.fasta", pyfastatools.RecordType.PROTEIN)
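The auto-detection described above could work roughly as in the following sketch. It is not the library's implementation: the `SketchRecordType` enum, the extension table, the nucleotide alphabet, and the rule that the extension takes priority over the sequence content are all assumptions made for illustration.

```python
from enum import Enum
from pathlib import Path


class SketchRecordType(Enum):
    """Hypothetical stand-in for pyfastatools.RecordType."""
    PROTEIN = "protein"
    NUCLEOTIDE = "nucleotide"
    UNKNOWN = "unknown"


# Assumed extension conventions; the library's actual table may differ.
EXTENSION_HINTS = {
    ".faa": SketchRecordType.PROTEIN,
    ".fna": SketchRecordType.NUCLEOTIDE,
    ".ffn": SketchRecordType.NUCLEOTIDE,
}

# Assumed alphabet used to recognise nucleotide sequences.
NUCLEOTIDE_ALPHABET = set("ACGTUN")


def guess_record_type(filename, sequences, max_records=5):
    """Guess the type from the extension first, then from the first few sequences."""
    hint = EXTENSION_HINTS.get(Path(filename).suffix.lower())
    if hint is not None:
        return hint
    # Fall back to inspecting the residues of the first max_records sequences.
    sample = "".join(sequences[:max_records]).upper().replace("*", "")
    if not sample:
        return SketchRecordType.UNKNOWN
    if set(sample) <= NUCLEOTIDE_ALPHABET:
        return SketchRecordType.NUCLEOTIDE
    return SketchRecordType.PROTEIN
```

A generic extension such as `.fasta` carries no hint in this sketch, so the decision falls to the sequence content. That ambiguity is the case where passing the record type explicitly is most useful.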

The parser can be iterated over to yield one Record at a time:

parser = pyfastatools.Parser("my_fasta.fasta")
for record in parser:
    ...

Methods

The parser also provides several convenience methods:

all - Read all records into a list-like object.
take - Take up to n records into a list-like object.
filter - Keep/exclude sequences based on the sequence name.
remove_stops - Yield sequences without a * stop codon character if the sequences are proteins.
clean_header - Yield sequences while cleaning the header to not have a description.
headers - Yield Header objects only without parsing the sequence itself.
all_headers - Return all headers into a list-like object.
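To make the behaviour of these methods concrete, here is a pure-Python sketch of what `take`, `filter`, and `remove_stops` are described as doing. It operates on plain `(name, sequence)` tuples rather than the library's objects, and the function names, parameters, and the `keep` flag are hypothetical. The actual signatures in pyfastatools may differ.

```python
from itertools import islice

# Sequences from the usage example; the trailing '*' on the first two
# is added here purely for illustration.
records = [
    ("seq_1", "MSKFKKIPL*"),
    ("seq_2", "MQSSSKTCN*"),
    ("seq_3", "MEDNMITIY"),
]


def take(records, n):
    """Return up to n records as a list."""
    return list(islice(records, n))


def filter_by_name(records, names, keep=True):
    """Yield records whose name is in names (keep=True) or is not (keep=False)."""
    names = set(names)
    for name, seq in records:
        if (name in names) == keep:
            yield name, seq


def remove_stops(records):
    """Yield records with every '*' character removed from the sequence."""
    for name, seq in records:
        yield name, seq.replace("*", "")
```

For example, `take(records, 2)` returns the first two tuples, and `take(records, 10)` returns all three because fewer than ten records are available.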

Properties

num_records - Returns the number of sequences in the FASTA file. The value is cached after the first access. It can also be obtained with len(parser).
format - Returns the RecordType enum that corresponds to the FASTA file's record type
extension - Returns the file extension based on the format
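The caching behaviour of `num_records` can be illustrated with a small stand-in class. This is a sketch under assumptions, not the library's code: `CountingSketch` is hypothetical, it counts records by scanning for lines that start with `>`, and the `scans` counter exists only to show that the file is read once.

```python
import os
import tempfile
from functools import cached_property


class CountingSketch:
    """Hypothetical stand-in that counts records by scanning for '>' lines."""

    def __init__(self, path):
        self.path = path
        self.scans = 0  # number of times the file was actually read

    @cached_property
    def num_records(self):
        # Runs only on first access; the result is then stored on the instance.
        self.scans += 1
        with open(self.path) as handle:
            return sum(1 for line in handle if line.startswith(">"))

    def __len__(self):
        return self.num_records


def demo():
    """Write the example file, then read the count twice."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "proteins.faa")
        with open(path, "w") as handle:
            handle.write(">seq_1\nMSKFKKIPL\n>seq_2\nMQSSSKTCN\n>seq_3\nMEDNMITIY\n")
        sketch = CountingSketch(path)
        return sketch.num_records, len(sketch), sketch.scans
```

Calling `demo()` reads the count through both the property and `len()`, yet the file is scanned only once.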

Record

A single FASTA record. It has the following fields:

header - A Header object that has the fields name and desc
seq - A str storing the entire sequence
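As an illustration of the `name` and `desc` fields, the sketch below splits a FASTA header line at the first run of whitespace, which is the common FASTA convention. Whether pyfastatools splits headers in exactly this way is an assumption, and `HeaderSketch` and `split_header` are hypothetical names.

```python
from typing import NamedTuple


class HeaderSketch(NamedTuple):
    """Hypothetical stand-in for the library's Header object."""
    name: str
    desc: str


def split_header(line):
    """Split a header line into (name, desc) at the first run of whitespace."""
    text = line.lstrip(">").strip()
    parts = text.split(maxsplit=1)
    name = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    return HeaderSketch(name, desc)
```

Under this convention, a header such as `>seq_1` from the usage example has the name `seq_1` and an empty description.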

Methods

empty - Checks if the Header and sequence are empty
clear - Sets the Header and sequence to empty strings
to_string - Returns the record as a string representation identical to what was parsed from the file
clean_header - Sets the Header description to an empty string
remove_stops - Removes * stop codon characters from the sequence if they are present
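The semantics of these methods can be summarised with a small dataclass. This is a simplified sketch, not the library's Record: `RecordSketch` is hypothetical, it flattens the header into `name` and `desc` fields, it assumes the methods modify the record in place, and its `to_string` assumes the sequence was stored on a single line.

```python
from dataclasses import dataclass


@dataclass
class RecordSketch:
    """Hypothetical stand-in for a FASTA record."""
    name: str = ""
    desc: str = ""
    seq: str = ""

    def empty(self):
        """True when the header fields and the sequence are all empty."""
        return not (self.name or self.desc or self.seq)

    def clear(self):
        """Reset the header fields and the sequence to empty strings."""
        self.name = self.desc = self.seq = ""

    def clean_header(self):
        """Drop the description, keeping only the name."""
        self.desc = ""

    def remove_stops(self):
        """Remove '*' stop characters from the sequence."""
        self.seq = self.seq.replace("*", "")

    def to_string(self):
        """Render the record as a header line followed by the sequence line."""
        header = f">{self.name} {self.desc}" if self.desc else f">{self.name}"
        return f"{header}\n{self.seq}\n"
```

With a record built as `RecordSketch("seq_1", "hypothetical protein", "MSKFKKIPL*")`, calling `remove_stops()` and then `clean_header()` leaves a record whose `to_string()` output is the two lines `>seq_1` and `MSKFKKIPL`.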
