Bioinformatics Sequence Retrieval Scripts

A collection of Python utilities for fetching protein and nucleotide sequences from public databases (UniProt and GenBank/NCBI).

Scripts

1. retrieve_records_by_uniprot_ids.py

Fetches protein sequences from UniProt REST API using a list of UniProt accession IDs.

Features:

Reads UniProt IDs from input file
Fetches FASTA records from the UniProt REST API
Writes results to output FASTA file
Automatic output directory creation
Progress tracking and summary statistics
Error handling with detailed logging

Usage:

python scripts/retrieve_records_by_uniprot_ids.py

Input: input/uniprot_ids.txt (one ID per line)
Output: output/results.fasta

Example input file:

P12345
A0A024B7W1
Q9Y5K6

2. retrieve_records_by_genebank_id.py

Fetches protein sequences from GenBank (NCBI) using a list of GenBank accession IDs.

Features:

Reads GenBank IDs from command-line argument
Fetches FASTA records from NCBI Entrez API
Writes results to output FASTA file (default: output/results.fasta)
Automatic output directory creation
Prevents accidental file overwriting
Progress tracking and summary statistics
Error handling with detailed logging

Usage:

python scripts/retrieve_records_by_genebank_id.py <input_file> [output_file]

Examples:

# Use default output location
python scripts/retrieve_records_by_genebank_id.py input/genbank_ids.txt

# Specify custom output location
python scripts/retrieve_records_by_genebank_id.py input/genbank_ids.txt output/custom_results.fasta

Input: Comma or space-separated GenBank IDs (one per line or multiple per line)
Output: output/results.fasta (or custom path)

Example input file:

NM_000001
NC_000001
AF123456

Project Structure

bioinformatics/
├── scripts/
│   ├── retrieve_records_by_uniprot_ids.py
│   └── retrieve_records_by_genebank_id.py
├── input/
│   ├── uniprot_ids.txt
│   └── genbank_ids.txt
├── output/
│   └── results.fasta
└── README.md

Requirements

Python 3.6+
Dependencies:
- requests - For HTTP requests
- biopython - For sequence parsing and writing
- Bio.Entrez - For NCBI database access (included in biopython)

Install dependencies:

pip install requests biopython

Configuration

UniProt Script

The UniProt script uses hardcoded paths:

Input: input/uniprot_ids.txt
Output: output/results.fasta

GenBank Script

The GenBank script requires command-line arguments:

python scripts/retrieve_records_by_genebank_id.py <input_file> [output_file]

The NCBI Entrez email is set to: erbayyigit@gmail.com
(Modify in the script if needed)

Features

✅ Automatic directory creation
✅ Error handling and validation
✅ Input file existence checking
✅ Duplicate file prevention (GenBank script)
✅ Progress tracking during retrieval
✅ Summary statistics after completion
✅ Detailed logging to stderr for errors

Error Handling

Both scripts include:

Input validation: Checks if input file exists
Output validation: Prevents overwriting existing files (GenBank)
API error handling: Catches HTTP errors and parsing exceptions
Polite rate limiting: 0.1 second delay between API requests

Output Format

All output is in FASTA format:

>sequence_id description
MKVLWAALLV...

Each sequence retrieved is appended to the output file.

Notes

The UniProt script uses fixed file paths. Modify the script if you need different paths.
The GenBank script accepts input filename as a command-line argument.
Both scripts create the output/ directory automatically if it doesn't exist.
Rate limiting is set to 0.1 seconds between requests to be respectful to public APIs.
An NCBI Entrez email is required for GenBank queries. Update it in the script if needed.

Legacy Scripts

Older edirect utilities are available in the edirect-utils/ directory:

efetch-proteins-by-id.py - Original NCBI protein fetcher
efetch-proteins-by-id-and-analyze.py - NCBI fetcher with protein analysis features

These have been superseded by the modern scripts in the scripts/ directory with improved error handling and user experience.

Name		Name	Last commit message	Last commit date
Latest commit History 184 Commits
codon_optimization		codon_optimization
edirect_utils		edirect_utils
output		output
retrieve_fasta_records_from_id		retrieve_fasta_records_from_id
.gitignore		.gitignore
README.md		README.md
analyze-protein-sequences.py		analyze-protein-sequences.py
sample.fasta		sample.fasta

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Bioinformatics Sequence Retrieval Scripts

Scripts

1. retrieve_records_by_uniprot_ids.py

2. retrieve_records_by_genebank_id.py

Project Structure

Requirements

Configuration

UniProt Script

GenBank Script

Features

Error Handling

Output Format

Notes

Legacy Scripts

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Bioinformatics Sequence Retrieval Scripts

Scripts

1. retrieve_records_by_uniprot_ids.py

2. retrieve_records_by_genebank_id.py

Project Structure

Requirements

Configuration

UniProt Script

GenBank Script

Features

Error Handling

Output Format

Notes

Legacy Scripts

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages