Automated notification system for detecting and reporting species duplication across Earth BioGenome Project (EBP) projects in the GoaT (Genomes on a Tree) database.
The system analyzes project data from GoaT, identifies species overlaps across EBP-affiliated projects, and sends automated email notifications to project representatives with detailed duplication reports. Each report includes three types of analyses:
- Report #1: Species with EBP-standard assemblies available
- Report #2: Active duplications with non-standard assemblies
- Report #3: Potential duplications (not started by target project)
- `notification_streamlined.ipynb`: Execution and testing interface (Jupyter notebook)
- `duplication_analysis.py`: Business logic for matrix analysis, report generation, and email creation
- `notification_utils.py`: Utility functions for Gmail API authentication, data loading, and HTML formatting
- `projectsMap.py`: Project name to NCBI BioProject ID mapping dictionary
- `validate_tsv.py`: TSV input file validator
- Python 3.x
- Required packages:
  - google-auth-oauthlib
  - google-api-python-client
  - pandas
  - requests
- Gmail API credentials (`credentials.json` and `token.json`)
- Input TSV file with project information (columns: `goat_project`, `sequencing_status`, `contact_email`, `contact_name`)
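
As a quick illustration of the input requirement above, the following minimal sketch checks a TSV header for the four listed columns. It is not the logic of `validate_tsv.py`; the function name `missing_columns` and the sample row values are hypothetical and only serve the example.

```python
import csv
import io

# Column names taken from the requirements list above.
REQUIRED_COLUMNS = ("goat_project", "sequencing_status", "contact_email", "contact_name")


def missing_columns(tsv_text):
    """Return the required columns that are absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]


# Illustrative placeholder data: the header lacks contact_name.
sample = (
    "goat_project\tsequencing_status\tcontact_email\n"
    "PROJECT_A\tin_progress\tsomeone@example.org\n"
)
print(missing_columns(sample))  # ['contact_name']
```

Checking only the header keeps the sketch small; a fuller validator would presumably also inspect the rows themselves.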
Execute all cells in notification_streamlined.ipynb to run the complete duplication alert system. The notebook imports and orchestrates all components including duplication_analysis.py and notification_utils.py.
- Input File: Modify `INPUT_FILE` in `duplication_analysis.py` (default: `GoaT_Projects_Test.tsv`)
- CC Emails: Uncomment and modify the `cc_emails` list in the `main()` function of `duplication_analysis.py`
- Output Directory: All outputs are saved to `./output_files/` (emails, logs)
- Email Reports: HTML emails sent to project contacts via Gmail API
- Saved Emails: HTML copies archived in `./output_files/saved_emails/`
- Processing Logs: Detailed logs in `./output_files/Overlaps_GoaT_Projects_[DATE].txt`
- Projects without bioproject IDs will be skipped (no email sent)
- Multiple recipients per project are supported (comma-separated in TSV)
- TSV validation available via `python validate_tsv.py [filename]`
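
The first two notes above can be sketched as follows. This is an assumption-laden illustration, not the repository's code: `split_recipients` and `plan_notifications` are hypothetical helpers, the mapping stands in for the dictionary in `projectsMap.py`, and `PRJNA000000` is a placeholder ID.

```python
def split_recipients(contact_email):
    """Split a comma-separated contact_email cell into clean addresses."""
    return [addr.strip() for addr in contact_email.split(",") if addr.strip()]


def plan_notifications(rows, project_to_bioproject):
    """Separate projects that can be notified from those lacking a bioproject ID."""
    to_send, skipped = [], []
    for row in rows:
        project = row["goat_project"]
        bioproject = project_to_bioproject.get(project)
        if not bioproject:
            skipped.append(project)  # no bioproject ID: no email is sent
            continue
        to_send.append((project, bioproject, split_recipients(row["contact_email"])))
    return to_send, skipped


# Placeholder rows and mapping for illustration only.
rows = [
    {"goat_project": "PROJECT_A", "contact_email": "a@example.org, b@example.org"},
    {"goat_project": "PROJECT_B", "contact_email": "c@example.org"},
]
mapping = {"PROJECT_A": "PRJNA000000"}

to_send, skipped = plan_notifications(rows, mapping)
print(to_send)
print(skipped)
```

Here `PROJECT_B` ends up in `skipped` because the mapping has no entry for it, while `PROJECT_A` gets two recipients.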
- Load and validate project data from TSV
- For each project, query GoaT API for duplication data
- Generate three reports with enhanced matrix analysis
- Create HTML email content with formatted tables
- Send notifications via Gmail API
- Log all processing steps and results
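
The HTML-email and Gmail steps above might look roughly like the sketch below. It uses only the Python standard library and does not send anything. The helper names `rows_to_html_table` and `build_gmail_payload`, the addresses, and the table contents are all hypothetical; the actual formatting in `notification_utils.py` may differ.

```python
import base64
import html
from email.message import EmailMessage


def rows_to_html_table(headers, rows):
    """Render headers and rows as a simple HTML table, escaping cell text."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def build_gmail_payload(sender, recipients, subject, html_body, cc=()):
    """Build the {'raw': ...} body that the Gmail API expects for a send request."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    if cc:
        msg["Cc"] = ", ".join(cc)
    msg["Subject"] = subject
    # Plain-text fallback plus the HTML alternative.
    msg.set_content("This report is best viewed in an HTML-capable mail client.")
    msg.add_alternative(html_body, subtype="html")
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")
    return {"raw": raw}


table = rows_to_html_table(["Species"], [["Homo sapiens"]])
payload = build_gmail_payload(
    "sender@example.org",
    ["a@example.org", "b@example.org"],
    "Duplication report",
    table,
)
```

With an authenticated `google-api-python-client` service object, a payload like this would typically be passed to `service.users().messages().send(userId="me", body=payload).execute()`.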