Add annotation-database project and GFF3-to-Iceberg HealthOmics workflow #85

Open
wisech-aws wants to merge 1 commit into aws-samples:main from wisech-aws:feature/annotation-database

  • annotation-database/: Standalone project for loading GFF3 files into Apache Iceberg tables on AWS S3 Tables using PyIceberg

    • Schema 1 (normalized): features, sources, feature_relationships tables
    • Schema 2 (denormalized): single genomic_annotations table
    • GFF3 parser with URL-decoding, reserved attribute extraction, parent-child relationship tracking, and FASTA section handling
    • Metadata schema for GFF3 directives and pragmas
  • templates/gff3-to-iceberg/: HealthOmics WDL workflow template

    • 7-stage pipeline: validate, setup catalog, check connectivity, check permissions, initialize tables, load GFF3, generate summary
    • Supports S3 Tables and vanilla Iceberg (Glue catalog)
    • Dockerfile, ECR build scripts, parameter template
  • test_data/sample.gff3: Sample GFF3 with 30 features across chr1/chr2 (genes, mRNAs, exons, CDS) from Ensembl and Havana sources
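
The parser behaviour listed above (URL-decoding, reserved attribute extraction, parent-child relationship tracking, and FASTA section handling) could look roughly like the following. This is a minimal sketch using only the Python standard library, not the code in this PR: the names `parse_attributes`, `parse_gff3`, `RESERVED_ATTRS`, and `SAMPLE`, as well as the dictionary layout of each feature, are hypothetical and chosen for illustration. The reserved tag list follows the GFF3 specification.

```python
from urllib.parse import unquote

# Reserved attribute tags from the GFF3 specification.
RESERVED_ATTRS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap",
    "Derives_from", "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def parse_attributes(column9):
    """Split GFF3 column 9 into {tag: [values]}, percent-decoding each part."""
    attrs = {}
    if column9 in ("", "."):
        return attrs
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Split on commas before decoding so an encoded comma (%2C) survives.
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attrs


def parse_gff3(lines):
    """Return (features, relationships, directives) from an iterable of lines."""
    features, relationships, directives = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        # Everything after ##FASTA (or a bare FASTA header) is sequence data.
        if line.startswith("##FASTA") or line.startswith(">"):
            break
        if line.startswith("##"):
            directives.append(line[2:])
            continue
        if not line or line.startswith("#"):
            continue  # blank line or ordinary comment
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        seqid, source, ftype, start, end, score, strand, phase, attr_col = cols
        attrs = parse_attributes(attr_col)
        feature_id = attrs.get("ID", [None])[0]
        features.append({
            "seqid": unquote(seqid),
            "source": source,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "id": feature_id,
            # Reserved tags are kept apart from free-form ones.
            "reserved": {k: v for k, v in attrs.items() if k in RESERVED_ATTRS},
            "extra": {k: v for k, v in attrs.items() if k not in RESERVED_ATTRS},
        })
        # One (parent, child) edge per Parent value; a feature may have several.
        for parent in attrs.get("Parent", []):
            relationships.append((parent, feature_id))
    return features, relationships, directives


# Hypothetical input for illustration; not the contents of test_data/sample.gff3.
SAMPLE = [
    "##gff-version 3\n",
    "chr1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tID=gene1;Name=Gene%20One\n",
    "chr1\tHavana\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna1;Parent=gene1;biotype=protein_coding\n",
    "chr1\tHavana\texon\t1000\t1500\t.\t+\t.\tID=exon1;Parent=mrna1\n",
    "##FASTA\n",
    ">chr1\n",
    "ACGT\n",
]

features, relationships, directives = parse_gff3(SAMPLE)
```

In a normalized layout like Schema 1, the `relationships` pairs would plausibly feed a table such as feature_relationships, while a denormalized layout like Schema 2 could keep the parent IDs inline on each row. How the PR actually maps parsed records to Iceberg columns is not shown here.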

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
