Add annotation-database project and GFF3-to-Iceberg HealthOmics workflow by wisech-aws · Pull Request #85 · aws-samples/aws-healthomics-tutorials

wisech-aws · 2026-04-27T14:38:52Z

annotation-database/: Standalone project for loading GFF3 files into Apache Iceberg tables on Amazon S3 Tables using PyIceberg
- Schema 1 (normalized): features, sources, feature_relationships tables
- Schema 2 (denormalized): single genomic_annotations table
- GFF3 parser with URL-decoding, reserved attribute extraction, parent-child relationship tracking, and FASTA section handling
- Metadata schema for GFF3 directives and pragmas
templates/gff3-to-iceberg/: HealthOmics WDL workflow template
- 7-stage pipeline: validate, setup catalog, check connectivity, check permissions, initialize tables, load GFF3, generate summary
- Supports S3 Tables and vanilla Iceberg (Glue catalog)
- Dockerfile, ECR build scripts, parameter template
test_data/sample.gff3: Sample GFF3 with 30 features across chr1/chr2 (genes, mRNAs, exons, CDS) from Ensembl and Havana sources
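
As a rough illustration of the parser behavior listed above (URL-decoding, reserved attribute extraction, parent-child tracking, FASTA section handling), here is a minimal standard-library sketch. It is not the code in this PR: the names `parse_attributes` and `parse_gff3` and the returned structures are hypothetical, chosen only for this example.

```python
from urllib.parse import unquote

# The eight fixed GFF3 columns; column 9 (attributes) is handled separately.
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")


def parse_attributes(column9):
    """Split column 9 into a dict mapping each tag to a list of decoded values."""
    attrs = {}
    if column9 in ("", "."):
        return attrs
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Split on commas before decoding so an escaped comma (%2C) stays in its value.
        attrs[unquote(key)] = [unquote(v) for v in value.split(",")]
    return attrs


def parse_gff3(lines):
    """Return (features, relationships, directives) from an iterable of GFF3 lines.

    Hypothetical helper for illustration; stops reading at the FASTA section.
    """
    features, relationships, directives = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA") or line.startswith(">"):
            break  # everything after this point is sequence data, not features
        if line.startswith("##"):
            directives.append(line[2:])  # e.g. "gff-version 3"
            continue
        if not line or line.startswith("#"):
            continue  # blank line or ordinary comment
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows in this sketch
        feature = dict(zip(GFF3_COLUMNS, cols[:8]))
        feature["seqid"] = unquote(feature["seqid"])
        feature["start"] = int(feature["start"])
        feature["end"] = int(feature["end"])
        attrs = parse_attributes(cols[8])
        feature["id"] = attrs.get("ID", [None])[0]  # reserved attribute
        feature["attributes"] = attrs
        features.append(feature)
        # A feature may list several parents; record one (parent, child) pair per parent.
        for parent in attrs.get("Parent", []):
            relationships.append((parent, feature["id"]))
    return features, relationships, directives
```

In a normalized layout like Schema 1, the `relationships` pairs would plausibly map to rows of a relationship table, while a denormalized layout would keep the parent IDs on each feature row. That mapping is an assumption for illustration, not something this description specifies.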

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


wisech-aws wants to merge 1 commit into aws-samples:main from wisech-aws:feature/annotation-database
