A comprehensive Python package for downloading and managing sequencing data from the European Nucleotide Archive (ENA), from the terminal or through a Python interface.
- Extract Metadata - Get comprehensive sample information from ENA projects
- Download FASTQ Files - Automated download with progress tracking
- Auto Fallback - Automatically tries NCBI if ENA metadata is unavailable
- Progress Bars - Real-time progress for downloads and metadata retrieval
- Interactive Reports - Generate searchable HTML tables with DataTables.js
- Export to CSV - Save metadata in standard formats
- Smart Verification - Check FASTQ file integrity and skip existing files
- Command line and Python interface
# Install from PyPI
pip install ENATool

# Custom output directory
enatool download PRJNA335681 --path data/my_project

import ENATool
# Fetch metadata AND download files in one command
info, downloads = ENATool.fetch('PRJNA335681', path='data/my_project', download=True)

ENATool creates organized output:
my_project/
├── PRJNA335681.csv            # Sample metadata
├── PRJNA335681.html           # Interactive table
├── downoad_info_table.csv     # Download tracking
└── raw_reads/                 # Downloaded FASTQ files
    ├── SRR123456/
    │   ├── SRR123456_1.fastq.gz
    │   └── SRR123456_2.fastq.gz
    └── SRR123457/
        └── SRR123457.fastq.gz
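Given this layout, the downloaded reads can be collected per run with a short script. The following is a minimal sketch, not part of ENATool: the helper name `list_fastq_by_run` is hypothetical, and it assumes only the `raw_reads/<run>/<file>.fastq.gz` structure shown above.

```python
from pathlib import Path


def list_fastq_by_run(project_dir):
    """Map each run directory under raw_reads/ to its sorted FASTQ file names."""
    raw_reads = Path(project_dir) / "raw_reads"
    runs = {}
    if not raw_reads.is_dir():
        return runs  # nothing has been downloaded yet
    for run_dir in sorted(p for p in raw_reads.iterdir() if p.is_dir()):
        runs[run_dir.name] = sorted(f.name for f in run_dir.glob("*.fastq.gz"))
    return runs


if __name__ == "__main__":
    # 'data/my_project' is the example output directory used in this README.
    for run, files in list_fastq_by_run("data/my_project").items():
        print(f"{run}: {len(files)} file(s) -> {files}")
```

A run with two files corresponds to paired-end reads (`_1` and `_2`), and a run with one file to single-end reads.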
- Python >= 3.7
- pandas >= 1.3.0
- numpy >= 1.20.0
- requests >= 2.25.0
- xmltodict >= 0.12.0
- tqdm >= 4.60.0
- lxml >= 4.6.0
Download metadata for all samples in an ENA project using enatool fetch.
Syntax:
enatool fetch PROJECT_ID [--path DIR]

Arguments:
- PROJECT_ID (required): ENA project accession (e.g., PRJNA335681)
- --path DIR or -p DIR: Output directory (default: PROJECT_ID)
What it does:
- Downloads sample metadata from ENA
- Tries NCBI BioSample as fallback if ENA fails
- Creates CSV file with all metadata
- Generates interactive HTML report
- Shows progress bars
Output files:
- PROJECT_ID.csv - Metadata in CSV format
- PROJECT_ID.html - Interactive HTML table
Examples:
# Basic usage - saves to PRJNA335681/
enatool fetch PRJNA335681
# Custom output directory
enatool fetch PRJNA335681 --path data/my_project

Download metadata for all samples in an ENA project, along with the sample files, using enatool download.
Syntax:
enatool download PROJECT_ID [--path DIR]

Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR: Output directory (default: PROJECT_ID)
What it does:
- Downloads metadata (same as fetch)
- Downloads all FASTQ files for all samples
- Uses enaDataGet tool
- Skips files that already exist
- Tracks download status
Output files:
- PROJECT_ID.csv - Metadata
- PROJECT_ID.html - Interactive table
- downoad_info_table.csv - Download tracking
- raw_reads/ - Directory with FASTQ files
  - SRR123456/ - One directory per run
    - SRR123456_1.fastq.gz - Forward reads
    - SRR123456_2.fastq.gz - Reverse reads (if paired-end)
Examples:
# Download everything
enatool download PRJNA335681
# Custom output directory
enatool download PRJNA335681 --path data/project1

Display summary information about a downloaded project using enatool info.
Syntax:
enatool info PROJECT_ID --path DIR

Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR (required): Directory containing metadata
What it does:
- Reads metadata from CSV file
- Shows summary statistics
- Displays organism breakdown
- Shows sequencing platforms
- Shows download status (if available)
Examples:
# Show info for custom directory
enatool info PRJNA335681 --path data/my_project

Output:
Project Information: PRJNAXXXXXX
============================================================
Total samples: 50
Organisms (2):
  • Homo sapiens: 45
  • Mus musculus: 5
Sequencing Platforms:
  • ILLUMINA: 50
Library Strategies:
  • RNA-Seq: 30
  • WGS: 15
  • ChIP-Seq: 5
Library Layout:
  • PAIRED: 45
  • SINGLE: 5
Download Status:
  • OK: 48
  • Error: 2
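A similar summary can be computed directly from the metadata CSV. This is a minimal sketch using only the standard library, not the code behind `enatool info`, and the helper `summarize_metadata` is hypothetical. The columns `scientific_name` and `instrument_platform` appear in the Python examples of this README; `library_strategy` and `library_layout` are assumed to follow ENA field naming, so columns missing from the file are skipped.

```python
import csv
from collections import Counter

# Column names are assumptions about the CSV header; absent ones are skipped.
SUMMARY_COLUMNS = (
    "scientific_name",
    "instrument_platform",
    "library_strategy",
    "library_layout",
)


def summarize_metadata(csv_path, columns=SUMMARY_COLUMNS):
    """Return (row count, {column: Counter of values}) for a metadata CSV."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        present = [c for c in columns if c in (reader.fieldnames or [])]
        counts = {c: Counter() for c in present}
        total = 0
        for row in reader:
            total += 1
            for c in present:
                counts[c][row[c]] += 1
    return total, counts
```

For example, `summarize_metadata('data/my_project/PRJNA335681.csv')` returns the number of rows and one `Counter` per column found, which can then be printed in any layout.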
Download all FASTQ files using previously fetched metadata, or a subset of the metadata table, with enatool download-files. This also forces a re-download of files that previously ended with an error.
Syntax:
enatool download-files PROJECT_ID --path DIR

Arguments:
- PROJECT_ID (required): ENA project accession
- --path DIR or -p DIR (required): Directory containing metadata
What it does:
- Loads sample names from the existing CSV file (PROJECT_ID.csv)
- Downloads FASTQ files
- Useful if you already have metadata and just want the files, or for filtered metadata tables
Use cases:
- You fetched metadata earlier with enatool fetch
- You filtered the CSV file manually
- You want to re-download after failures
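For the filtering use case, the CSV can be edited by hand or with a short script. The sketch below keeps only the rows matching one value in one column and rewrites the file in place, using only the standard library. The helper `filter_metadata_csv` is hypothetical, the `scientific_name` column is taken from the Python examples in this README, and it is an assumption that `download-files` accepts a CSV with rows removed this way. Keep a backup of the original file before overwriting it.

```python
import csv


def filter_metadata_csv(csv_path, column, value):
    """Keep only rows where row[column] == value; return the number of rows kept."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        rows = [row for row in reader if row.get(column) == value]
    # Rewrite the same file with the original header and the surviving rows.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `filter_metadata_csv('my_project/PRJNA335681.csv', 'scientific_name', 'Homo sapiens')` would leave only the human samples in the table.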
Examples:
# First get metadata (fast)
enatool fetch PRJNA335681 --path my_project
# Later, download files
enatool download-files PRJNA335681 --path my_project
# Or after filtering CSV file
enatool download-files PRJNA335681 --path my_project

By default, ENATool removes files that turned out to be corrupted or whose MD5 checksum did not match. Use the --keep-failed parameter to prevent the removal.
Syntax:
# with download command
enatool download PROJECT_ID --path DIR --keep-failed
# with download-files command
enatool download-files PROJECT_ID --path DIR --keep-failed

For processing multiple projects:
# Simple loop
for project in PRJNA335681 PRJNA123456 PRJNA789012; do
    echo "Processing $project..."
    enatool fetch "$project" --path "data/$project"
done
# Or with download
for project in PRJNA335681 PRJNA123456; do
    echo "Downloading $project..."
    enatool download "$project" --path "data/$project"
done

Use the global enatool option --no-banner. It goes right after enatool and before the action command.
Example:
enatool --no-banner fetch PRJNA335681

Use the global enatool option --no-progress-bar. It goes right after enatool and before the action command.
Example:
enatool --no-progress-bar fetch PRJNA335681
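When driving the CLI from a script, the ordering rule above (global options directly after `enatool`, before the action command) is easy to get wrong. A minimal sketch of a command builder follows. `build_enatool_command` and `run_enatool` are hypothetical helpers, not part of ENATool, and only commands and flags documented in this README are intended as inputs.

```python
import subprocess


def build_enatool_command(action, project_id, path=None,
                          global_options=(), action_options=()):
    """Assemble argv as: enatool [global options] ACTION PROJECT_ID [--path DIR] [action options]."""
    command = ["enatool", *global_options, action, project_id]
    if path is not None:
        command += ["--path", str(path)]
    command += list(action_options)
    return command


def run_enatool(*args, **kwargs):
    """Run the assembled command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_enatool_command(*args, **kwargs), check=True)
```

For instance, `build_enatool_command('fetch', 'PRJNA335681', global_options=['--no-banner'])` yields `['enatool', '--no-banner', 'fetch', 'PRJNA335681']`, matching the example above.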
Use the fetch() function to download metadata:
import ENATool
# Basic usage - just get metadata
info_table = ENATool.fetch('PRJNA335681')
# Specify custom directory
info_table = ENATool.fetch('PRJNA335681', path='data/my_project')
# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)
# Show some basic stats
print(f"Total samples: {len(info_table)}")
print(f"Organisms: {info_table['scientific_name'].unique()}")
print(f"Platforms: {info_table['instrument_platform'].value_counts()}")

What you get:
- Sample accessions and metadata
- Run accessions and sequencing details
- FASTQ file URLs and checksums
- Organism and experimental information
- Interactive HTML report
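The checksums in the metadata make it possible to re-verify a downloaded file independently. The following is a minimal sketch with `hashlib`, not ENATool's own verification code. `file_md5` and `md5_matches` are hypothetical helpers, and the expected checksum is assumed to be a hex MD5 string copied from the metadata table.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True if the file's MD5 equals the expected hex string (case-insensitive)."""
    return file_md5(path) == expected_md5.strip().lower()
```

Reading in chunks matters here because FASTQ archives are often several gigabytes in size.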
import ENATool
# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)
# Check results
print(downloads['download_status'].value_counts())

Download status values:
- OK - Successfully downloaded
- Exists - File already exists (skipped)
- Error - Download failed
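To act on these statuses after a run, the tracking table can also be tallied from disk. A minimal sketch, assuming the tracking file `downoad_info_table.csv` contains the `download_status` column shown in the example above; the helper `tally_download_status` is hypothetical and not part of ENATool.

```python
import csv
from collections import Counter


def tally_download_status(csv_path, status_column="download_status"):
    """Return (Counter of status values, list of rows whose status is 'Error')."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    statuses = Counter(row.get(status_column, "") for row in rows)
    failed = [row for row in rows if row.get(status_column) == "Error"]
    return statuses, failed
```

If the list of failed rows is not empty, running `enatool download-files` again on the same directory retries those downloads, as described in the command-line section.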
import ENATool
# Get metadata
info = ENATool.fetch('PRJNA335681')
# Filter samples
human_samples = info[info['scientific_name'] == 'Homo sapiens']
# ! Important !
# Re-initialize for filtered table
human_samples.ena.reinit(info)
# Download only filtered samples
downloads = human_samples.ena.download()
# Save to CSV
human_samples.to_csv('human_samples.csv', index=False)

Prevent ENATool from automatically removing corrupted files.
import ENATool
# Could be used in fetch method
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, keep_failed=True)
# Could be used in download method
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(keep_failed=True)

import ENATool
# Could be used in fetch method
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, NO_PROGRESS_BAR=True)
# Could be used in download method
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(NO_PROGRESS_BAR=True)

import ENATool
projects = ['PRJNA335681', 'PRJEB2961', 'PRJEB28350']
for project_id in projects:
    try:
        info = ENATool.fetch(project_id, path=f'data/{project_id}')
        print(f"✓ {project_id}: {len(info)} samples")
    except Exception as e:
        print(f"✗ {project_id}: {e}")

Main entry point for fetching ENA data.
Parameters:
- project_id (str): ENA project accession (e.g., 'PRJNA335681')
- path (str, optional): Directory for outputs (defaults to project_id)
- download (bool, optional): Auto-download FASTQ files (default: False)
Returns:
- DataFrame (if download=False)
- Tuple of (info_table, download_table) (if download=True)
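Because the return type depends on `download`, calling code that handles both cases may want to normalize it. A minimal sketch: `split_fetch_result` is a hypothetical helper, not part of ENATool, and it relies only on the two return shapes listed above.

```python
def split_fetch_result(result):
    """Return (info_table, download_table); download_table is None if only metadata was fetched."""
    if isinstance(result, tuple):
        info_table, download_table = result
        return info_table, download_table
    return result, None


# Usage (requires ENATool and network access):
# import ENATool
# info, downloads = split_fetch_result(ENATool.fetch('PRJNA335681', download=True))
# if downloads is not None:
#     print(downloads['download_status'].value_counts())
```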
Download FASTQ files for the samples in the DataFrame.
Returns:
- DataFrame with download status
If you use ENATool in your research, please cite:
Tikhonova, P. (2021). ENATool: European Nucleotide Archive Data Manager
(v2.0.0). Zenodo. https://doi.org/10.5281/zenodo.17443004
This project is licensed under the MIT License - see the LICENSE file for details.
- PyPI: https://pypi.org/project/ENATool/
- GitHub: https://github.com/PollyTikhonova/ENATool
- Documentation: https://github.com/PollyTikhonova/ENATool#readme
- Bug Reports: https://github.com/PollyTikhonova/ENATool/issues