A collection of Python utilities for fetching protein and nucleotide sequences from public databases (UniProt and GenBank/NCBI).
Fetches protein sequences from the UniProt REST API using a list of UniProt accession IDs.
Features:
- Reads UniProt IDs from input file
- Fetches FASTA records from the UniProt REST API
- Writes results to output FASTA file
- Automatic output directory creation
- Progress tracking and summary statistics
- Error handling with detailed logging
Usage:
python scripts/retrieve_records_by_uniprot_ids.py
Input: input/uniprot_ids.txt (one ID per line)
Output: output/results.fasta
Example input file:
P12345
A0A024B7W1
Q9Y5K6
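The flow described above (read IDs, fetch each FASTA record, append it to the output file) could be sketched roughly as follows. This is not the script's actual code. The helper names are hypothetical, the `https://rest.uniprot.org/uniprotkb/<accession>.fasta` URL pattern is an assumption, and the sketch uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies.

```python
import sys
import time
import urllib.request
from pathlib import Path

# Assumed URL pattern for downloading a single entry as FASTA.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def read_ids(path):
    """Return the non-empty, stripped lines of a one-ID-per-line file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_url(accession):
    """Build the FASTA download URL for one accession."""
    return UNIPROT_FASTA_URL.format(accession=accession)


def fetch_fasta(accession, timeout=30):
    """Download one FASTA record as text, or return None on failure."""
    try:
        with urllib.request.urlopen(build_url(accession), timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError as exc:  # URLError, HTTPError and timeouts all derive from OSError
        print(f"Failed to fetch {accession}: {exc}", file=sys.stderr)
        return None


def main(input_path="input/uniprot_ids.txt", output_path="output/results.fasta"):
    """Fetch every ID in input_path and append the records to output_path."""
    ids = read_ids(input_path)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)  # create output/ if missing
    fetched = 0
    with out.open("a", encoding="utf-8") as handle:
        for index, accession in enumerate(ids, start=1):
            record = fetch_fasta(accession)
            if record:
                handle.write(record if record.endswith("\n") else record + "\n")
                fetched += 1
            print(f"[{index}/{len(ids)}] {accession}", file=sys.stderr)
            time.sleep(0.1)  # pause between requests
    print(f"Fetched {fetched} of {len(ids)} records", file=sys.stderr)
```

Calling `main()` with no arguments would use the same fixed paths the script is described as using.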
Fetches protein sequences from GenBank (NCBI) using a list of GenBank accession IDs.
Features:
- Reads GenBank IDs from an input file passed as a command-line argument
- Fetches FASTA records from NCBI Entrez API
- Writes results to output FASTA file (default: output/results.fasta)
- Automatic output directory creation
- Prevents accidental file overwriting
- Progress tracking and summary statistics
- Error handling with detailed logging
Usage:
python scripts/retrieve_records_by_genebank_id.py <input_file> [output_file]
Examples:
# Use default output location
python scripts/retrieve_records_by_genebank_id.py input/genbank_ids.txt
# Specify custom output location
python scripts/retrieve_records_by_genebank_id.py input/genbank_ids.txt output/custom_results.fasta
Input: Comma- or space-separated GenBank IDs (one or more per line)
Output: output/results.fasta (or custom path)
Example input file:
NM_000001
NC_000001
AF123456
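Because the input may mix commas, spaces, and line breaks, the ID parsing can be done with a single split on all three. The sketch below shows one way to do it; `parse_genbank_ids` is a hypothetical name, not necessarily what the script uses.

```python
import re

# One or more commas and/or whitespace characters (including newlines).
_SEPARATORS = re.compile(r"[,\s]+")


def parse_genbank_ids(text):
    """Split raw file contents into a list of IDs, dropping empty tokens."""
    return [token for token in _SEPARATORS.split(text) if token]
```

For example, `parse_genbank_ids("NM_000001, NC_000001\nAF123456")` yields the three IDs in their original order.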
bioinformatics/
├── scripts/
│ ├── retrieve_records_by_uniprot_ids.py
│ └── retrieve_records_by_genebank_id.py
├── input/
│ ├── uniprot_ids.txt
│ └── genbank_ids.txt
├── output/
│ └── results.fasta
└── README.md
- Python 3.6+
- Dependencies:
  - requests - For HTTP requests
  - biopython - For sequence parsing and writing
  - Bio.Entrez - For NCBI database access (included in biopython)
Install dependencies:
pip install requests biopython
The UniProt script uses hardcoded paths:
- Input: input/uniprot_ids.txt
- Output: output/results.fasta
The GenBank script requires command-line arguments:
python scripts/retrieve_records_by_genebank_id.py <input_file> [output_file]
The NCBI Entrez email is set to: erbayyigit@gmail.com
(Modify in the script if needed)
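As an alternative to editing the address in the script, the email could be supplied from the environment and passed to Biopython before each fetch. The sketch below is only an illustration. The `NCBI_EMAIL` variable and both function names are hypothetical, and the `db="nucleotide"` default is an assumption that may need to be `"protein"` depending on the accessions being fetched.

```python
import os
import sys


def resolve_entrez_email(environ=None, variable="NCBI_EMAIL"):
    """Return the contact email from the environment, or raise if it is unset."""
    environ = os.environ if environ is None else environ
    email = environ.get(variable, "").strip()
    if not email:
        raise ValueError(f"Set {variable} to a contact email before querying NCBI")
    return email


def fetch_genbank_fasta(accession, email, db="nucleotide"):
    """Fetch one record as FASTA text via Entrez efetch; return None on failure."""
    # Imported here so the helper above works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # contact address sent along with each request
    try:
        handle = Entrez.efetch(db=db, id=accession, rettype="fasta", retmode="text")
        try:
            return handle.read()
        finally:
            handle.close()
    except Exception as exc:  # HTTP errors, network failures, malformed replies
        print(f"Failed to fetch {accession}: {exc}", file=sys.stderr)
        return None
```

A call such as `fetch_genbank_fasta("AF123456", resolve_entrez_email())` would then use whatever address is set in the environment.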
✅ Automatic directory creation
✅ Error handling and validation
✅ Input file existence checking
✅ Duplicate file prevention (GenBank script)
✅ Progress tracking during retrieval
✅ Summary statistics after completion
✅ Detailed logging to stderr for errors
Both scripts include:
- Input validation: Checks if input file exists
- Output validation: Prevents overwriting existing files (GenBank)
- API error handling: Catches HTTP errors and parsing exceptions
- Polite rate limiting: 0.1 second delay between API requests
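The checks listed above might look something like the following sketch. It assumes hypothetical helper names (`prepare_output_path`, `fetch_all`) and takes the per-ID fetch function as a parameter, so it does not reflect how the scripts are actually organized.

```python
import sys
import time
from pathlib import Path


def prepare_output_path(path, allow_overwrite=False):
    """Refuse to reuse an existing file, then create the parent directory."""
    out = Path(path)
    if out.exists() and not allow_overwrite:
        raise FileExistsError(f"{out} already exists; choose another output path")
    out.parent.mkdir(parents=True, exist_ok=True)
    return out


def fetch_all(ids, fetch_one, delay=0.1, sleep=time.sleep):
    """Call fetch_one for each ID, pausing between calls.

    Returns (records, failed_ids). Exceptions from fetch_one are logged to
    stderr and counted as failures instead of stopping the run.
    """
    records, failed = [], []
    for accession in ids:
        try:
            record = fetch_one(accession)
        except Exception as exc:
            print(f"Error fetching {accession}: {exc}", file=sys.stderr)
            record = None
        if record:
            records.append(record)
        else:
            failed.append(accession)
        sleep(delay)  # polite pause before the next request
    return records, failed
```

Passing `sleep` as a parameter keeps the delay easy to replace in tests; in normal use the default `time.sleep` applies.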
All output is in FASTA format:
>sequence_id description
MKVLWAALLV...
Each sequence retrieved is appended to the output file.
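To make the format and the append behaviour concrete, here is a small sketch that builds a FASTA record and appends it to a file. The function names are hypothetical, and the 60-character line width is an assumed convention, not something the scripts are known to enforce.

```python
def format_fasta(seq_id, description, sequence, width=60):
    """Return one FASTA record: a '>' header line, then the wrapped sequence."""
    header = f">{seq_id} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines]) + "\n"


def append_record(path, record):
    """Append a record to the output file, leaving earlier records intact."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(record)
```

Opening the file in append mode (`"a"`) is what lets records accumulate one after another during a run.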
- The UniProt script uses fixed file paths. Modify the script if you need different paths.
- The GenBank script accepts the input filename as a command-line argument.
- Both scripts create the output/ directory automatically if it doesn't exist.
- Rate limiting is set to 0.1 seconds between requests to be respectful to public APIs.
- An NCBI Entrez email is required for GenBank queries. Update it in the script if needed.
Older edirect utilities are available in the edirect-utils/ directory:
- efetch-proteins-by-id.py - Original NCBI protein fetcher
- efetch-proteins-by-id-and-analyze.py - NCBI fetcher with protein analysis features
These have been superseded by the scripts in the scripts/ directory, which offer improved error handling and user experience.