PollyTikhonova/ENATool


ENATool 🧬

PyPI version Python 3.7+ License: MIT

A comprehensive Python package for downloading and managing sequencing data from the European Nucleotide Archive (ENA), either from the terminal or through a Python interface.

✨ Features

  • 📊 Extract Metadata - Get comprehensive sample information from ENA projects
  • 📥 Download FASTQ Files - Automated download with progress tracking
  • 🔄 Auto Fallback - Automatically tries NCBI if ENA metadata unavailable
  • 📈 Progress Bars - Real-time progress for downloads and metadata retrieval
  • 📋 Interactive Reports - Generate searchable HTML tables with DataTables.js
  • 💾 Export to CSV - Save metadata in standard formats
  • 🔍 Smart Verification - Check fastq file integrity and skip existing files
  • 💻 Command line and Python interface

🚀 Quick Start

Installation

# Install from PyPI
pip install ENATool

Basic Usage in Terminal

# Custom output directory
enatool download PRJNA335681 --path data/my_project

Basic Usage in Python

import ENATool

# Fetch metadata AND download files in one command
info, downloads = ENATool.fetch('PRJNA335681', path='data/my_project', download=True)

📊 Example Output Files

ENATool creates organized output:

my_project/
├── PRJNA335681.csv              # Sample metadata
├── PRJNA335681.html             # Interactive table
├── downoad_info_table.csv       # Download tracking
└── raw_reads/                   # Downloaded FASTQ files
    ├── SRR123456/
    │   ├── SRR123456_1.fastq.gz
    β”‚   └── SRR123456_2.fastq.gz
    └── SRR123457/
        └── SRR123457.fastq.gz
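
The layout above can also be inspected from Python. The following is a minimal sketch, assuming the raw_reads/ layout shown above (one sub-directory per run, files ending in .fastq.gz). list_run_files is a hypothetical helper, not part of ENATool.

```python
from pathlib import Path


def list_run_files(project_dir):
    """Map each run directory under raw_reads/ to its sorted FASTQ file names."""
    raw_reads = Path(project_dir) / "raw_reads"
    runs = {}
    if not raw_reads.is_dir():
        return runs  # nothing has been downloaded yet
    for run_dir in sorted(p for p in raw_reads.iterdir() if p.is_dir()):
        runs[run_dir.name] = sorted(f.name for f in run_dir.glob("*.fastq.gz"))
    return runs


# Example: print each run with the number of FASTQ files it holds
# for run, files in list_run_files("data/my_project").items():
#     print(run, len(files), files)
```

A run with two files ending in _1 and _2 is paired-end in this layout; a run with a single file is single-end.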

🔧 Requirements

  • Python >= 3.7
  • pandas >= 1.3.0
  • numpy >= 1.20.0
  • requests >= 2.25.0
  • xmltodict >= 0.12.0
  • tqdm >= 4.60.0
  • lxml >= 4.6.0

📖 Documentation

Use ENATool in Terminal

Fetching Metadata

Download metadata for all samples in an ENA project using enatool fetch.

Syntax:

enatool fetch PROJECT_ID [--path DIR]

Arguments:

  • PROJECT_ID (required): ENA project accession (e.g., PRJNA335681)
  • --path DIR or -p DIR: Output directory (default: PROJECT_ID)

What it does:

  • Downloads sample metadata from ENA
  • Tries NCBI BioSample as fallback if ENA fails
  • Creates CSV file with all metadata
  • Generates interactive HTML report
  • Shows progress bars

Output files:

  • PROJECT_ID.csv - Metadata in CSV format
  • PROJECT_ID.html - Interactive HTML table

Examples:

# Basic usage - saves to PRJNA335681/
enatool fetch PRJNA335681

# Custom output directory
enatool fetch PRJNA335681 --path data/my_project

Download Reads and Fetch Metadata

Download metadata for all samples in an ENA project, together with the sample files, using enatool download.

Syntax:

enatool download PROJECT_ID [--path DIR]

Arguments:

  • PROJECT_ID (required): ENA project accession
  • --path DIR or -p DIR: Output directory (default: PROJECT_ID)

What it does:

  • Downloads metadata (same as fetch)
  • Downloads all FASTQ files for all samples
  • Uses enaDataGet tool
  • Skips files that already exist
  • Tracks download status

Output files:

  • PROJECT_ID.csv - Metadata
  • PROJECT_ID.html - Interactive table
  • downoad_info_table.csv - Download tracking
  • raw_reads/ - Directory with FASTQ files
    • SRR123456/ - One directory per run
      • SRR123456_1.fastq.gz - Forward reads
      • SRR123456_2.fastq.gz - Reverse reads (if paired-end)

Examples:

# Download everything
enatool download PRJNA335681

# Custom output directory
enatool download PRJNA335681 --path data/project1
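
After a download, a quick sanity check is to count the records in each FASTQ file; reading to the end also surfaces a truncated gzip stream. The sketch below uses only the standard library and is independent of ENATool's own verification, whose internals are not described here. count_fastq_reads is a hypothetical helper, and it assumes the common four-lines-per-record FASTQ layout.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (four lines per record).

    Reading the whole stream raises EOFError or OSError if the file is
    truncated or is not valid gzip. A line count that is not a multiple
    of four raises ValueError.
    """
    lines = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            lines += 1
    if lines % 4:
        raise ValueError(f"{path}: {lines} lines is not a multiple of 4")
    return lines // 4


# Example:
# print(count_fastq_reads("data/project1/raw_reads/SRR123456/SRR123456_1.fastq.gz"))
```

For paired-end runs, the two files of a run would normally be expected to report the same count.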

Show Project Summary [stdout]

Display summary information about a downloaded project using enatool info.

Syntax:

enatool info PROJECT_ID --path DIR

Arguments:

  • PROJECT_ID (required): ENA project accession
  • --path DIR or -p DIR (required): Directory containing metadata

What it does:

  • Reads metadata from CSV file
  • Shows summary statistics
  • Displays organism breakdown
  • Shows sequencing platforms
  • Shows download status (if available)

Examples:

# Show info for custom directory
enatool info PRJNA335681 --path data/my_project

Output:

📊 Project Information: PRJNAXXXXXX
============================================================
Total samples: 50

Organisms (2):
  • Homo sapiens: 45
  • Mus musculus: 5

Sequencing Platforms:
  • ILLUMINA: 50

Library Strategies:
  • RNA-Seq: 30
  • WGS: 15
  • ChIP-Seq: 5

Library Layout:
  • PAIRED: 45
  • SINGLE: 5

Download Status:
  • OK: 48
  • Error: 2
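
A similar breakdown can be computed directly from the metadata CSV. This is a minimal sketch with the standard library, not the code behind enatool info. The column names scientific_name and instrument_platform appear in the Python examples of this README, while library_strategy and library_layout are assumptions based on ENA field naming; columns missing from the file are skipped. summarize_metadata is a hypothetical helper.

```python
import csv
from collections import Counter

# Column names are assumptions; adjust them to the header of your CSV.
DEFAULT_COLUMNS = ("scientific_name", "instrument_platform",
                   "library_strategy", "library_layout")


def summarize_metadata(csv_path, columns=DEFAULT_COLUMNS):
    """Return (row_count, {column: Counter of values}) for the columns present."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        present = [c for c in columns if c in (reader.fieldnames or [])]
        counts = {c: Counter() for c in present}
        total = 0
        for row in reader:
            total += 1
            for column in present:
                counts[column][row[column]] += 1
    return total, counts


# Example:
# total, counts = summarize_metadata("data/my_project/PRJNA335681.csv")
# print("Total samples:", total)
# for column, counter in counts.items():
#     print(column, dict(counter))
```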

Redownload Corrupted Files or Download Only Selected Files

Download FASTQ files using previously fetched metadata, or a manually subsetted metadata table, with enatool download-files. This command also forces a re-download of files that previously ended with an error.

Syntax:

enatool download-files PROJECT_ID --path DIR

Arguments:

  • PROJECT_ID (required): ENA project accession
  • --path DIR or -p DIR (required): Directory containing metadata

What it does:

  • Loads sample names from existing CSV file (PROJECT_ID.csv)
  • Downloads FASTQ files
  • Useful when you already have the metadata and only need the files, or when working from a filtered metadata table.

Use cases:

  • You fetched metadata earlier with enatool fetch
  • You filtered the CSV file manually
  • You want to re-download after failures

Examples:

# First get metadata (fast)
enatool fetch PRJNA335681 --path my_project

# Later, download files 
enatool download-files PRJNA335681 --path my_project

# Or after filtering CSV file
enatool download-files PRJNA335681 --path my_project
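
Filtering the metadata before running download-files could look like the sketch below. It assumes that download-files accepts a PROJECT_ID.csv that has been rewritten with fewer rows and unchanged columns (this document lists a manually filtered CSV as a use case, but does not say how strict the parser is), and that the organism column is named scientific_name as in the Python examples. filter_metadata is a hypothetical helper; it keeps a backup of the original file.

```python
import csv
import shutil


def filter_metadata(csv_path, column, value, backup_suffix=".bak"):
    """Keep only rows where row[column] == value; return the number of rows kept.

    The original file is first copied to csv_path + backup_suffix.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        if column not in fieldnames:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        kept = [row for row in reader if row[column] == value]
    shutil.copyfile(csv_path, str(csv_path) + backup_suffix)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(kept)


# Example (then run: enatool download-files PRJNA335681 --path my_project):
# filter_metadata("my_project/PRJNA335681.csv", "scientific_name", "Homo sapiens")
```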

Keep files with an incorrect MD5 checksum

By default, ENATool removes files that turn out to be corrupted or whose MD5 checksum does not match. Use the --keep-failed parameter to prevent the removal.

Syntax:

# with download command
enatool download PROJECT_ID --path DIR --keep-failed

# with download-files command
enatool download-files PROJECT_ID --path DIR --keep-failed
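
To check a kept file by hand, its MD5 digest can be compared with the checksum recorded in the metadata. A minimal sketch with hashlib follows; where the expected checksum lives in the metadata table (for example a fastq_md5 column, as in ENA's own file reports) is an assumption, and file_md5 and md5_matches are hypothetical helpers rather than ENATool's verification routine.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected):
    """Compare a file's MD5 with an expected hex digest, ignoring case and padding."""
    return file_md5(path) == expected.strip().lower()


# Example (the expected value would come from the metadata table):
# ok = md5_matches("raw_reads/SRR123456/SRR123456_1.fastq.gz", expected_md5)
```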

Process multiple projects

For processing multiple projects:

# Simple loop
for project in PRJNA335681 PRJNA123456 PRJNA789012; do
    echo "Processing $project..."
    enatool fetch $project --path data/$project
done

# Or with download
for project in PRJNA335681 PRJNA123456; do
    echo "Downloading $project..."
    enatool download $project --path data/$project
done
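
The same loop can be driven from Python when failures need to be collected rather than only printed. This is a minimal sketch that shells out to the enatool commands shown above via subprocess. run_projects and its runner parameter are hypothetical (the latter exists only so the function can be exercised without network access), and it is an assumption that enatool signals failure through a non-zero exit code.

```python
import subprocess


def run_projects(projects, action="fetch", base_dir="data", runner=subprocess.run):
    """Run `enatool ACTION PROJECT --path BASE_DIR/PROJECT` for each project.

    Returns a dict mapping each project ID to the process return code.
    """
    results = {}
    for project in projects:
        cmd = ["enatool", action, project, "--path", f"{base_dir}/{project}"]
        completed = runner(cmd)  # subprocess.run does not raise on non-zero exit
        results[project] = completed.returncode
    return results


# Example (requires enatool on PATH and network access):
# codes = run_projects(["PRJNA335681", "PRJNA123456"], action="download")
# failed = [project for project, code in codes.items() if code != 0]
```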

Hide banner

Use the global enatool option --no-banner. It must come right after enatool and before the action command.

Example:

enatool --no-banner fetch PRJNA335681

Disable progress bar

Use the global enatool option --no-progress-bar. It must come right after enatool and before the action command.

Example:

enatool --no-progress-bar fetch PRJNA335681
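
Because global options must precede the action command, scripts that assemble enatool calls need to preserve that order. Below is a minimal sketch of a command builder. build_enatool_command is a hypothetical helper, only flags documented in this README are used, and it is an assumption that both global options can be combined in one call.

```python
def build_enatool_command(action, project_id, path=None,
                          no_banner=False, no_progress_bar=False,
                          keep_failed=False):
    """Return an argv list with global options placed before the action."""
    cmd = ["enatool"]
    # Global options go right after the program name.
    if no_banner:
        cmd.append("--no-banner")
    if no_progress_bar:
        cmd.append("--no-progress-bar")
    # The action and its own arguments follow.
    cmd += [action, project_id]
    if path is not None:
        cmd += ["--path", str(path)]
    if keep_failed:
        cmd.append("--keep-failed")  # documented for download and download-files
    return cmd


# Example:
# import subprocess
# subprocess.run(build_enatool_command("fetch", "PRJNA335681", no_banner=True))
```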


Use ENATool in Python

Fetch Metadata

Use the fetch() function to download metadata:

import ENATool

# Basic usage - just get metadata
info_table = ENATool.fetch('PRJNA335681')

# Specify custom directory
info_table = ENATool.fetch('PRJNA335681', path='data/my_project')

# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)

# Show some basic stats
print(f"Total samples: {len(info_table)}")
print(f"Organisms: {info_table['scientific_name'].unique()}")
print(f"Platforms: {info_table['instrument_platform'].value_counts()}")

What you get:

  • Sample accessions and metadata
  • Run accessions and sequencing details
  • FASTQ file URLs and checksums
  • Organism and experimental information
  • Interactive HTML report

Download FASTQ Files

import ENATool

# Get metadata AND download files
info_table, downloads = ENATool.fetch('PRJNA335681', download=True)

# Check results
print(downloads['download_status'].value_counts())

Download status values:

  • OK - Successfully downloaded
  • Exists - File already exists (skipped)
  • Error - Download failed

Download only a subset of samples

import ENATool

# Get metadata
info = ENATool.fetch('PRJNA335681')

# Filter samples
human_samples = info[info['scientific_name'] == 'Homo sapiens']

# ! Important ! 
# Re-initialize for filtered table
human_samples.ena.reinit(info)

# Download only filtered samples
downloads = human_samples.ena.download()

# Save to CSV
human_samples.to_csv('human_samples.csv', index=False)

Keep files with an incorrect MD5 checksum

Prevent ENATool from automatically removing corrupted files.

import ENATool

# keep_failed can be passed to fetch()
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, keep_failed=True)

# It can also be passed to download()
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(keep_failed=True)

Disable progress bar

import ENATool

# NO_PROGRESS_BAR can be passed to fetch()
info_table, downloads = ENATool.fetch('PRJNA335681', download=True, NO_PROGRESS_BAR=True)

# It can also be passed to download()
info = ENATool.fetch('PRJNA335681')
downloads = info.ena.download(NO_PROGRESS_BAR=True)

Work with multiple datasets

import ENATool

projects = ['PRJNA335681', 'PRJEB2961', 'PRJEB28350']

for project_id in projects:
    try:
        info = ENATool.fetch(project_id, path=f'data/{project_id}')
        print(f"✓ {project_id}: {len(info)} samples")
    except Exception as e:
        print(f"✗ {project_id}: {e}")

Python API Reference

ENATool.fetch(project_id, path=None, download=False)

Main entry point for fetching ENA data.

Parameters:

  • project_id (str): ENA project accession (e.g., 'PRJNA335681')
  • path (str, optional): Directory for outputs (defaults to project_id)
  • download (bool, optional): Auto-download FASTQ files (default: False)

Returns:

  • DataFrame (if download=False)
  • Tuple of (info_table, download_table) (if download=True)

DataFrame.ena.download()

Download FASTQ files for samples in DataFrame.

Returns:

  • DataFrame with download status

πŸ“ Citation

If you use ENATool in your research, please cite:

Tikhonova, P. (2021). ENATool: European Nucleotide Archive Data Manager
(v2.0.0). Zenodo. https://doi.org/10.5281/zenodo.17443004

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
