-
Notifications
You must be signed in to change notification settings - Fork 1
historical assembly version backfill (phase 0) #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
8797ac1
one-time backfill process to populate historical assembly versions.
fchen13 b0c2c60
Add version status column to the existing parser
fchen13 645b7ea
Phase 0 summary
fchen13 237742e
Merge upstream/main: resolve conflict in parse_ncbi_assemblies.py
fchen13 8d1666b
Add strict accession validation as suggested by reviewer
fchen13 7101396
Merge remote-tracking branch 'upstream/main' into fang-assemblies
fchen13 899074a
Remove parse_s3_file stub, use existing implementation
fchen13 792f34e
Update documentation to run parser as module
fchen13 91ab12b
Fix YAML config as suggested by reviewer
fchen13 6786644
Simplify cache key format to use accession only
fchen13 8576cb3
Use shared argument parser from flows.parsers.args
fchen13 9283f42
update on Phase 0 scripts and test
fchen13 3d4d84c
Merge upstream/main: keep version_status param in process_assembly_re…
fchen13 096e53e
update on flake8 import
fchen13 97961a3
Fix flake8: remove unused imports, add noqa for E402
fchen13 d075ae7
Merge remote-tracking branch 'upstream/main' into feature/assembly-hi…
fchen13 4d8a0f4
fix: rename yaml to .types.yaml, taxon_names field, and checkpoint re…
fchen13 09a0fc7
untrack backfill test file
fchen13 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,119 @@ | ||
| # Pull Request: Historical Assembly Version Backfill (Phase 0) | ||
|
|
||
| ## Summary | ||
|
|
||
| One-time backfill process to populate historical (superseded) assembly versions | ||
| from NCBI for all existing assemblies with version > 1. Enables version-aware | ||
| milestone tracking by capturing previously untracked superseded versions. | ||
|
|
||
| ## What Changed (latest revision) | ||
|
|
||
| ### Bug fixes | ||
| - **Fixed data-loss bug**: `write_to_tsv` overwrites by default, so the | ||
| original per-batch write + `parsed = {}` clearing lost all but the last | ||
| batch. All rows now accumulate in memory and `write_to_tsv` is called | ||
| exactly once at the end. Checkpoints only record resume progress. | ||
|
|
||
| ### Structural changes | ||
| - **Renamed** `backfill_historical_versions.py` → | ||
| `parse_backfill_historical_versions.py` so `register.py` discovers it as a | ||
| plugin via the `parse_` prefix convention. | ||
| - **Changed orchestrator decorator** from `@task` to `@flow(log_prints=True)` | ||
| to match how `parse_ncbi_assemblies` structures its flow/task hierarchy. | ||
| - **Split `find_all_assembly_versions`** into `discover_version_accessions` | ||
| (FTP) + `fetch_version_metadata` (datasets CLI) for modularity and | ||
| independent testability. | ||
| - **Added `backfill_historical_versions_wrapper`** matching the | ||
| `fetch_parse_validate` parser signature so the flow integrates with the | ||
| Prefect pipeline. | ||
|
|
||
| ### Convention alignment | ||
| - **CLI arguments** now use `shared_args` exclusively (`INPUT_PATH`, | ||
| `YAML_PATH`, `WORK_DIR`); removed ad-hoc `--checkpoint` argument. | ||
| Checkpoint path is derived via `derive_checkpoint_path`. | ||
| - **Replaced `subprocess.run`** with `utils.run_quoted` for shell-safe | ||
| argument quoting (consistent with `parse_ncbi_assemblies`). | ||
| - **Code style** aligned with GenomeHubs conventions: Google-style docstrings, | ||
| lowercase type hints (`dict`, `list`, `tuple`), `e` for exception variables, | ||
| removed shebang and section banners. | ||
| - **`assembly_historical.types.yaml`**: moved `needs` under `file:` section, | ||
| references `ATTR_assembly.types.yaml` (matches `ncbi_datasets_eukaryota` | ||
| convention), and renamed `names` → `taxon_names` for validation compliance. | ||
|
|
||
| ### Test suite rewrite | ||
| - Rewrote `tests/test_backfill.py` using pytest (was a custom runner). | ||
| - **33 tests**, all passing, covering: | ||
| - Accession parsing helpers | ||
| - Assembly identification from JSONL fixture | ||
| - Cache round-trip with expiry | ||
| - Checkpoint save/load/derive | ||
| - Accession format validation | ||
| - `parse_historical_version` delegation (mocked) | ||
| - Orchestrator: single TSV write, multi-assembly accumulation, | ||
| current-version skipping, no-op for v1-only input | ||
| - **Regression test for the batch-overwrite data-loss bug** | ||
|
|
||
| ## Solution Overview | ||
|
|
||
| The backfill script: | ||
| 1. Identifies assemblies with version > 1 from the input JSONL | ||
| 2. Discovers all historical versions via NCBI FTP directory listing | ||
| 3. Fetches metadata for each version using NCBI Datasets CLI | ||
| 4. Parses each version through `process_assembly_report` with | ||
| `version_status="superseded"` | ||
| 5. Writes all accumulated rows to a single TSV via `write_to_tsv` | ||
|
|
||
| ### Smart 2-Tier Caching | ||
| - **Version discovery cache** (7-day expiry): FTP directory listings | ||
| - **Metadata cache** (30-day expiry): Datasets CLI responses | ||
| - Reduces reruns from hours to minutes | ||
|
|
||
| ### Checkpoint System | ||
| - Saves progress every 100 assemblies to `{work_dir}/checkpoints/` | ||
| - Allows resuming after interruptions without re-fetching cached data | ||
| - Marks the checkpoint as `completed` at the end of a full run so the next | ||
| re-run starts from index 0 (all rows collected; network fetches still served | ||
| from cache) | ||
| - Does **not** trigger intermediate TSV writes (avoids the overwrite bug) | ||
|
|
||
| ## Files | ||
|
|
||
| ### New | ||
| - `flows/parsers/parse_backfill_historical_versions.py` — Main backfill flow | ||
| - `configs/assembly_historical.types.yaml` — Output schema configuration | ||
| - `tests/test_backfill.py` — pytest suite (33 tests) | ||
| - `tests/test_data/assembly_test_sample.jsonl` — Test fixture (3 assemblies) | ||
|
|
||
| ### Modified | ||
| - `flows/parsers/parse_ncbi_assemblies.py` — Added `version_status` parameter | ||
|
|
||
| ## Usage | ||
|
|
||
| ### As a standalone CLI | ||
| ```bash | ||
| python -m flows.parsers.parse_backfill_historical_versions \ | ||
| --input_path data/assembly_data_report.jsonl \ | ||
| --yaml_path configs/assembly_historical.types.yaml \ | ||
| --work_dir tmp | ||
| ``` | ||
|
|
||
| ### Via Prefect pipeline | ||
| Discovered automatically by `register.py` and invoked through | ||
| `fetch_parse_validate` with the standard parser signature. | ||
|
|
||
| ### Expected performance | ||
| - **First run**: ~10–15 hours (~3,700 assemblies, ~8,500 versions) | ||
| - **Subsequent runs**: ~15–30 minutes (cache hits) | ||
|
|
||
| ### Running tests | ||
| ```bash | ||
| set SKIP_PREFECT=true | ||
| python -m pytest tests/test_backfill.py -v | ||
| ``` | ||
|
|
||
| ## Next Steps (after merge) | ||
|
|
||
| 1. Run full backfill on production dataset | ||
| 2. Upload output to S3 | ||
| 3. Implement Phase 1: incremental daily updates for new historical versions | ||
| 4. Implement Phase 2: version-aware milestone tracking |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,201 @@ | ||
| # Configuration for historical assembly version backfill | ||
| # This config defines the schema for assembly_historical.tsv | ||
| # which contains all superseded assembly versions | ||
|
|
||
| attributes: | ||
| assembly_level: | ||
| header: assemblyLevel | ||
| path: assemblyInfo.assemblyLevel | ||
| assembly_span: | ||
| header: totalSequenceLength | ||
| path: assemblyStats.totalSequenceLength | ||
| assigned_percent: | ||
| header: assignedProportion | ||
| path: processedAssemblyStats.assignedProportion | ||
| assembly_status: | ||
| header: primaryValue | ||
| path: processedAssemblyInfo.primaryValue | ||
| translate: | ||
| "1": primary | ||
| assembly_type: | ||
| header: assemblyType | ||
| path: assemblyInfo.assemblyType | ||
| bioproject: | ||
| header: bioProjectAccession | ||
| path: assemblyInfo.bioprojectLineage.bioprojects.accession | ||
| separator: | ||
| - "," | ||
| biosample: | ||
| header: biosampleAccession | ||
| path: assemblyInfo.biosample.accession | ||
| separator: | ||
| - "," | ||
| chromosome_count: | ||
| header: totalNumberOfChromosomes | ||
| path: assemblyStats.totalNumberOfChromosomes | ||
| contig_count: | ||
| header: numberOfContigs | ||
| path: assemblyStats.numberOfContigs | ||
| contig_l50: | ||
| header: contigL50 | ||
| path: assemblyStats.contigL50 | ||
| contig_n50: | ||
| header: contigN50 | ||
| path: assemblyStats.contigN50 | ||
| ebp_standard_criteria: | ||
| header: ebpStandardCriteria | ||
| path: processedAssemblyStats.ebpStandardCriteria | ||
| separator: | ||
| - "," | ||
| ebp_standard_date: | ||
| header: ebpStandardDate | ||
| path: processedAssemblyStats.ebpStandardDate | ||
| gc_percent: | ||
| header: gcPercent | ||
| path: assemblyStats.gcPercent | ||
| gene_count: | ||
| header: geneCountTotal | ||
| path: annotationInfo.stats.geneCounts.total | ||
| gene_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| isolate: | ||
| header: isolate | ||
| path: assemblyInfo.biosample.attributes.name==isolate.value | ||
| last_updated: | ||
| header: releaseDate | ||
| path: assemblyInfo.releaseDate | ||
| mitochondrion_accession: | ||
| header: mitochondrionAccession | ||
| path: processedOrganelleInfo.mitochondrion.accession | ||
| separator: | ||
| - ; | ||
| mitochondrion_assembly_span: | ||
| header: mitochondrionAssemblySpan | ||
| path: processedOrganelleInfo.mitochondrion.assemblySpan | ||
| mitochondrion_gc_percent: | ||
| header: mitochondrionGcPercent | ||
| path: processedOrganelleInfo.mitochondrion.gcPercent | ||
| mitochondrion_scaffolds: | ||
| header: mitochondrionScaffolds | ||
| path: processedOrganelleInfo.mitochondrion.scaffolds | ||
| separator: | ||
| - ; | ||
| noncoding_gene_count: | ||
| header: geneCountNoncoding | ||
| path: annotationInfo.stats.geneCounts.nonCoding | ||
| noncoding_gene_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| plastid_accession: | ||
| header: plastidAccession | ||
| path: processedOrganelleInfo.plastid.accession | ||
| separator: | ||
| - ; | ||
| plastid_assembly_span: | ||
| header: plastidAssemblySpan | ||
| path: processedOrganelleInfo.plastid.assemblySpan | ||
| plastid_gc_percent: | ||
| header: plastidGcPercent | ||
| path: processedOrganelleInfo.plastid.gcPercent | ||
| plastid_scaffolds: | ||
| header: plastidScaffolds | ||
| path: processedOrganelleInfo.plastid.scaffolds | ||
| separator: | ||
| - ; | ||
| protein_count: | ||
| header: geneCountProteincoding | ||
| path: annotationInfo.stats.geneCounts.proteinCoding | ||
| protein_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| pseudogene_count: | ||
| header: geneCountPseudogene | ||
| path: annotationInfo.stats.geneCounts.pseudogene | ||
| pseudogene.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| refseq_category: | ||
| header: refseqCategory | ||
| path: assemblyInfo.refseqCategory | ||
| sample_sex: | ||
| header: sex | ||
| path: assemblyInfo.biosample.attributes.name==sex.value | ||
| scaffold_count: | ||
| header: numberOfScaffolds | ||
| path: assemblyStats.numberOfScaffolds | ||
| scaffold_l50: | ||
| header: scaffoldL50 | ||
| path: assemblyStats.scaffoldL50 | ||
| scaffold_n50: | ||
| header: scaffoldN50 | ||
| path: assemblyStats.scaffoldN50 | ||
| sequence_count: | ||
| header: numberOfComponentSequences | ||
| path: assemblyStats.numberOfComponentSequences | ||
| submitter: | ||
| header: submitter | ||
| path: assemblyInfo.submitter | ||
| ungapped_span: | ||
| header: totalUngappedLength | ||
| path: assemblyStats.totalUngappedLength | ||
| # NEW: Version-specific field to indicate this is a historical/superseded version | ||
| version_status: | ||
| header: versionStatus | ||
| path: processedAssemblyInfo.versionStatus | ||
|
|
||
| file: | ||
| display_group: general | ||
| needs: | ||
| - ATTR_assembly.types.yaml | ||
| exclusions: | ||
| attributes: | ||
| - bioproject | ||
| - biosample | ||
| identifiers: | ||
| - assembly_id | ||
| taxonomy: | ||
| - taxon_id | ||
| format: tsv | ||
| header: true | ||
| name: assembly_historical.tsv | ||
| source: INSDC | ||
| source_url_stub: https://www.ncbi.nlm.nih.gov/assembly/ | ||
|
|
||
| identifiers: | ||
| assembly_id: | ||
| header: assemblyID | ||
| path: processedAssemblyInfo.assemblyID | ||
| assembly_name: | ||
| header: assemblyName | ||
| path: assemblyInfo.assemblyName | ||
| genbank_accession: | ||
| header: genbankAccession | ||
| path: processedAssemblyInfo.genbankAccession | ||
| refseq_accession: | ||
| header: refseqAccession | ||
| path: processedAssemblyInfo.refseqAccession | ||
| wgs_accession: | ||
| header: wgsProjectAccession | ||
| path: wgsInfo.wgsProjectAccession | ||
|
|
||
| metadata: | ||
| is_primary_value: | ||
| header: primaryValue | ||
| path: processedAssemblyInfo.primaryValue | ||
| source_slug: | ||
| header: genbankAccession | ||
| path: processedAssemblyInfo.genbankAccession | ||
|
|
||
| taxon_names: | ||
| common_name: | ||
| header: commonName | ||
| path: organism.commonName | ||
|
|
||
| taxonomy: | ||
| taxon: | ||
| header: organismName | ||
| path: organism.organismName | ||
| taxon_id: | ||
| header: taxId | ||
| path: organism.taxId | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.