pip install pyfastatools
The pyfastatools.Parser object is the primary API that parses FASTA files and yields pyfastatools.Record objects.
If you have a FASTA file called proteins.faa that looks like this:
>seq_1
MSKFKKIPL
>seq_2
MQSSSKTCN
>seq_3
MEDNMITIY
Then you can parse this file in Python like this:
from pyfastatools import Parser
for record in Parser("proteins.faa"):
    print(record.header.name, record.seq)
which will print:
>>> 'seq_1 MSKFKKIPL'
>>> 'seq_2 MQSSSKTCN'
>>> 'seq_3 MEDNMITIY'
This library has a very simple API that can be displayed in a few lines:
Parser is the main class and should cover most use cases. While parsing a FASTA file, it produces Record objects. Only the name of the FASTA file is needed:
pyfastatools.Parser("my_fasta.fasta")
The parser will attempt to auto-detect the RecordType of the file by checking the input file extension and the first 5 sequences.
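The exact detection rules are not spelled out here, so the following is only a rough pure-Python sketch of how an extension-then-content check of this kind could work. The function `guess_record_kind`, the extension sets, and the nucleotide alphabet are all hypothetical stand-ins and are not part of pyfastatools.

```python
from itertools import islice

# Hypothetical stand-ins: none of these names come from pyfastatools.
PROTEIN_EXTENSIONS = {".faa"}
NUCLEOTIDE_EXTENSIONS = {".fna", ".ffn"}
NUCLEOTIDE_ALPHABET = set("ACGTUN")


def guess_record_kind(filename, sequences, n_check=5):
    """Guess 'protein' or 'nucleotide' from the extension, then from content."""
    lowered = filename.lower()
    if any(lowered.endswith(ext) for ext in PROTEIN_EXTENSIONS):
        return "protein"
    if any(lowered.endswith(ext) for ext in NUCLEOTIDE_EXTENSIONS):
        return "nucleotide"
    # Ambiguous extension such as .fasta: look at the first n_check sequences only.
    for seq in islice(sequences, n_check):
        # Any character outside the nucleotide alphabet suggests a protein.
        if set(seq.upper()) - NUCLEOTIDE_ALPHABET:
            return "protein"
    return "nucleotide"


print(guess_record_kind("my_fasta.fasta", ["MSKFKKIPL", "MQSSSKTCN"]))  # protein
```

Under these assumptions, a .fasta file is classified from its content alone, which is why a heuristic like this can be wrong for short or unusual sequences.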
However, the record type can optionally be specified:
pyfastatools.Parser("my_fasta.fasta", pyfastatools.RecordType.PROTEIN)
The parser can be iterated over to yield one Record at a time:
parser = pyfastatools.Parser("my_fasta.fasta")
for record in parser:
    ...
There are also other convenience methods:
all - Read all records into a list-like object.
take - Take up to n records into a list-like object.
filter - Keep/exclude sequences based on the sequence name.
remove_stops - Yield sequences without a * stop codon character if the sequences are proteins.
clean_header - Yield sequences while cleaning the header to not have a description.
headers - Yield Header objects only, without parsing the sequence itself.
all_headers - Return all headers into a list-like object.
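The behavior these methods describe can be illustrated with a small pure-Python sketch. It works on plain (name, description, sequence) tuples rather than pyfastatools objects, and the function signatures (for example the `keep` flag) are assumptions made for illustration, not the library's actual signatures.

```python
from itertools import islice

# Toy records as (name, description, sequence) tuples; not pyfastatools objects.
records = [
    ("seq_1", "hypothetical protein", "MSKFKKIPL*"),
    ("seq_2", "", "MQSSSKTCN*"),
    ("seq_3", "kinase", "MEDNMITIY"),
]


def take(records, n):
    """Return up to n records as a list."""
    return list(islice(records, n))


def filter_by_name(records, names, keep=True):
    """Yield records whose name is in names (keep=True) or not in names (keep=False)."""
    names = set(names)
    for record in records:
        if (record[0] in names) == keep:
            yield record


def remove_stops(records):
    """Yield records with '*' stop characters removed from the sequence."""
    for name, desc, seq in records:
        yield name, desc, seq.replace("*", "")


def clean_header(records):
    """Yield records with the description blanked out."""
    for name, _desc, seq in records:
        yield name, "", seq


print(take(records, 2))
print(list(filter_by_name(records, {"seq_2"}, keep=False)))
```

Writing the transformations as generators mirrors the "yield" wording above: records are processed one at a time instead of being loaded all at once.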
num_records - Returns the number of sequences in the FASTA file. This is cached after the first time it is called. Note: This can also be computed using len(parser)
format - Returns the RecordType enum that corresponds to the FASTA file's record type
extension - Returns the file extension based on the format
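To make the iteration and record-count behavior concrete, here is a minimal pure-Python FASTA reader. It is a stand-in for illustration only and does not reflect how pyfastatools is implemented. The function `iter_fasta` is hypothetical, and it assumes the record name is the first whitespace-delimited token of the header line, with the remainder treated as the description.

```python
import io

FASTA_TEXT = """>seq_1 example description
MSKFKKIPL
>seq_2
MQSSS
KTCN
>seq_3
MEDNMITIY
"""


def iter_fasta(handle):
    """Yield (name, desc, seq) tuples from a FASTA-formatted text handle."""
    name, desc, chunks = None, "", []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, desc, "".join(chunks)
            # Assumption: the name is the first token, the rest is the description.
            name, _, desc = line[1:].partition(" ")
            chunks = []
        elif line:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        yield name, desc, "".join(chunks)


records = list(iter_fasta(io.StringIO(FASTA_TEXT)))
for name, desc, seq in records:
    print(name, seq)
print(len(records))  # 3
```

Counting records this way requires a full pass over the file, which is one reason a parser might cache the count after the first call.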
A Record is a single FASTA record. It has the following fields:
header - A Header object that has the fields name and desc
seq - A str storing the entire sequence
empty - Checks if the Header and sequence are empty
clear - Sets the Header and sequence to empty strings
to_string - Returns the record as a string representation identical to what was parsed from the file
clean_header - Sets the Header description to an empty string
remove_stops - Removes * stop codon characters from the sequence if they are present
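The fields and methods listed above can be mirrored in a small dataclass sketch. `ToyHeader` and `ToyRecord` are hypothetical stand-ins, not the pyfastatools classes, and the output layout of `to_string` (one header line followed by the sequence on a single line) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHeader:
    name: str = ""
    desc: str = ""


@dataclass
class ToyRecord:
    header: ToyHeader = field(default_factory=ToyHeader)
    seq: str = ""

    def empty(self):
        """True when the name, description, and sequence are all empty."""
        return not (self.header.name or self.header.desc or self.seq)

    def clear(self):
        """Reset the header and sequence to empty strings."""
        self.header.name = ""
        self.header.desc = ""
        self.seq = ""

    def to_string(self):
        """Format the record as FASTA text (assumed single-line sequence)."""
        header = f">{self.header.name}"
        if self.header.desc:
            header += f" {self.header.desc}"
        return f"{header}\n{self.seq}\n"

    def clean_header(self):
        """Drop the description, keeping only the name."""
        self.header.desc = ""

    def remove_stops(self):
        """Remove '*' stop characters from the sequence."""
        self.seq = self.seq.replace("*", "")


record = ToyRecord(ToyHeader("seq_1", "demo"), "MSKFKKIPL*")
record.remove_stops()
record.clean_header()
print(record.to_string(), end="")
```

In this sketch the methods modify the record in place, matching the "Sets" and "Removes" wording in the list above.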