genomehubs · rjchallis · Apr 2, 2026 · Dec 19, 2025 · Dec 19, 2025 · Dec 19, 2025
diff --git a/Phase_0_PR_SUMMARY.md b/Phase_0_PR_SUMMARY.md
@@ -0,0 +1,119 @@
+# Pull Request: Historical Assembly Version Backfill (Phase 0)
+
+## Summary
+
+One-time backfill process to populate historical (superseded) assembly versions
+from NCBI for all existing assemblies with version > 1.  Enables version-aware
+milestone tracking by capturing previously untracked superseded versions.
+
+## What Changed (latest revision)
+
+### Bug fixes
+- **Fixed data-loss bug**: `write_to_tsv` overwrites by default, so the
+  original per-batch write + `parsed = {}` clearing lost all but the last
+  batch.  All rows now accumulate in memory and `write_to_tsv` is called
+  exactly once at the end.  Checkpoints only record resume progress.
+
+### Structural changes
+- **Renamed** `backfill_historical_versions.py` →
+  `parse_backfill_historical_versions.py` so `register.py` discovers it as a
+  plugin via the `parse_` prefix convention.
+- **Changed orchestrator decorator** from `@task` to `@flow(log_prints=True)`
+  to match how `parse_ncbi_assemblies` structures its flow/task hierarchy.
+- **Split `find_all_assembly_versions`** into `discover_version_accessions`
+  (FTP) + `fetch_version_metadata` (datasets CLI) for modularity and
+  independent testability.
+- **Added `backfill_historical_versions_wrapper`** matching the
+  `fetch_parse_validate` parser signature so the flow integrates with the
+  Prefect pipeline.
+
+### Convention alignment
+- **CLI arguments** now use `shared_args` exclusively (`INPUT_PATH`,
+  `YAML_PATH`, `WORK_DIR`); removed ad-hoc `--checkpoint` argument.
+  Checkpoint path is derived via `derive_checkpoint_path`.
+- **Replaced `subprocess.run`** with `utils.run_quoted` for shell-safe
+  argument quoting (consistent with `parse_ncbi_assemblies`).
+- **Code style** aligned with GenomeHubs conventions: Google-style docstrings,
+  lowercase type hints (`dict`, `list`, `tuple`), `e` for exception variables,
+  removed shebang and section banners.
+- **`assembly_historical.types.yaml`**: moved `needs` under `file:` section,
+  references `ATTR_assembly.types.yaml` (matches `ncbi_datasets_eukaryota`
+  convention), and renamed `names` → `taxon_names` for validation compliance.
+
+### Test suite rewrite
+- Rewrote `tests/test_backfill.py` using pytest (was a custom runner).
+- **33 tests**, all passing, covering:
+  - Accession parsing helpers
+  - Assembly identification from JSONL fixture
+  - Cache round-trip with expiry
+  - Checkpoint save/load/derive
+  - Accession format validation
+  - `parse_historical_version` delegation (mocked)
+  - Orchestrator: single TSV write, multi-assembly accumulation,
+    current-version skipping, no-op for v1-only input
+  - **Regression test for the batch-overwrite data-loss bug**
+
+## Solution Overview
+
+The backfill script:
+1. Identifies assemblies with version > 1 from the input JSONL
+2. Discovers all historical versions via NCBI FTP directory listing
+3. Fetches metadata for each version using NCBI Datasets CLI
+4. Parses each version through `process_assembly_report` with
+   `version_status="superseded"`
+5. Writes all accumulated rows to a single TSV via `write_to_tsv`
+
+### Smart 2-Tier Caching
+- **Version discovery cache** (7-day expiry): FTP directory listings
+- **Metadata cache** (30-day expiry): Datasets CLI responses
+- Reduces reruns from hours to minutes
+
+### Checkpoint System
+- Saves progress every 100 assemblies to `{work_dir}/checkpoints/`
+- Allows resuming after interruptions without re-fetching cached data
+- Marks the checkpoint as `completed` at the end of a full run so the next
+  re-run starts from index 0 (all rows collected; network fetches still served
+  from cache)
+- Does **not** trigger intermediate TSV writes (avoids the overwrite bug)
+
+## Files
+
+### New
+- `flows/parsers/parse_backfill_historical_versions.py` — Main backfill flow
+- `configs/assembly_historical.types.yaml` — Output schema configuration
+- `tests/test_backfill.py` — pytest suite (33 tests)
+- `tests/test_data/assembly_test_sample.jsonl` — Test fixture (3 assemblies)
+
+### Modified
+- `flows/parsers/parse_ncbi_assemblies.py` — Added `version_status` parameter
+
+## Usage
+
+### As a standalone CLI
+```bash
+python -m flows.parsers.parse_backfill_historical_versions \
+    --input_path data/assembly_data_report.jsonl \
+    --yaml_path configs/assembly_historical.types.yaml \
+    --work_dir tmp
+```
+
+### Via Prefect pipeline
+Discovered automatically by `register.py` and invoked through
+`fetch_parse_validate` with the standard parser signature.
+
+### Expected performance
+- **First run**: ~10–15 hours (~3,700 assemblies, ~8,500 versions)
+- **Subsequent runs**: ~15–30 minutes (cache hits)
+
+### Running tests
+```bash
+set SKIP_PREFECT=true
+python -m pytest tests/test_backfill.py -v
+```
+
+## Next Steps (after merge)
+
+1. Run full backfill on production dataset
+2. Upload output to S3
+3. Implement Phase 1: incremental daily updates for new historical versions
+4. Implement Phase 2: version-aware milestone tracking
diff --git a/configs/assembly_historical.types.yaml b/configs/assembly_historical.types.yaml
@@ -0,0 +1,201 @@
+# Configuration for historical assembly version backfill
+# This config defines the schema for assembly_historical.tsv
+# which contains all superseded assembly versions
+
+attributes:
+  assembly_level:
+    header: assemblyLevel
+    path: assemblyInfo.assemblyLevel
+  assembly_span:
+    header: totalSequenceLength
+    path: assemblyStats.totalSequenceLength
+  assigned_percent:
+    header: assignedProportion
+    path: processedAssemblyStats.assignedProportion
+  assembly_status:
+    header: primaryValue
+    path: processedAssemblyInfo.primaryValue
+    translate:
+      "1": primary
+  assembly_type:
+    header: assemblyType
+    path: assemblyInfo.assemblyType
+  bioproject:
+    header: bioProjectAccession
+    path: assemblyInfo.bioprojectLineage.bioprojects.accession
+    separator:
+      - ","
+  biosample:
+    header: biosampleAccession
+    path: assemblyInfo.biosample.accession
+    separator:
+      - ","
+  chromosome_count:
+    header: totalNumberOfChromosomes
+    path: assemblyStats.totalNumberOfChromosomes
+  contig_count:
+    header: numberOfContigs
+    path: assemblyStats.numberOfContigs
+  contig_l50:
+    header: contigL50
+    path: assemblyStats.contigL50
+  contig_n50:
+    header: contigN50
+    path: assemblyStats.contigN50
+  ebp_standard_criteria:
+    header: ebpStandardCriteria
+    path: processedAssemblyStats.ebpStandardCriteria
+    separator:
+      - ","
+  ebp_standard_date:
+    header: ebpStandardDate
+    path: processedAssemblyStats.ebpStandardDate
+  gc_percent:
+    header: gcPercent
+    path: assemblyStats.gcPercent
+  gene_count:
+    header: geneCountTotal
+    path: annotationInfo.stats.geneCounts.total
+  gene_count.source.date:
+    header: annotationReleaseDate
+    path: annotationInfo.releaseDate
+  isolate:
+    header: isolate
+    path: assemblyInfo.biosample.attributes.name==isolate.value
+  last_updated:
+    header: releaseDate
+    path: assemblyInfo.releaseDate
+  mitochondrion_accession:
+    header: mitochondrionAccession
+    path: processedOrganelleInfo.mitochondrion.accession
+    separator:
+      - ;
+  mitochondrion_assembly_span:
+    header: mitochondrionAssemblySpan
+    path: processedOrganelleInfo.mitochondrion.assemblySpan
+  mitochondrion_gc_percent:
+    header: mitochondrionGcPercent
+    path: processedOrganelleInfo.mitochondrion.gcPercent
+  mitochondrion_scaffolds:
+    header: mitochondrionScaffolds
+    path: processedOrganelleInfo.mitochondrion.scaffolds
+    separator:
+      - ;
+  noncoding_gene_count:
+    header: geneCountNoncoding
+    path: annotationInfo.stats.geneCounts.nonCoding
+  noncoding_gene_count.source.date:
+    header: annotationReleaseDate
+    path: annotationInfo.releaseDate
+  plastid_accession:
+    header: plastidAccession
+    path: processedOrganelleInfo.plastid.accession
+    separator:
+      - ;
+  plastid_assembly_span:
+    header: plastidAssemblySpan
+    path: processedOrganelleInfo.plastid.assemblySpan
+  plastid_gc_percent:
+    header: plastidGcPercent
+    path: processedOrganelleInfo.plastid.gcPercent
+  plastid_scaffolds:
+    header: plastidScaffolds
+    path: processedOrganelleInfo.plastid.scaffolds
+    separator:
+      - ;
+  protein_count:
+    header: geneCountProteincoding
+    path: annotationInfo.stats.geneCounts.proteinCoding
+  protein_count.source.date:
+    header: annotationReleaseDate
+    path: annotationInfo.releaseDate
+  pseudogene_count:
+    header: geneCountPseudogene
+    path: annotationInfo.stats.geneCounts.pseudogene
+  pseudogene.source.date:
+    header: annotationReleaseDate
+    path: annotationInfo.releaseDate
+  refseq_category:
+    header: refseqCategory
+    path: assemblyInfo.refseqCategory
+  sample_sex:
+    header: sex
+    path: assemblyInfo.biosample.attributes.name==sex.value
+  scaffold_count:
+    header: numberOfScaffolds
+    path: assemblyStats.numberOfScaffolds
+  scaffold_l50:
+    header: scaffoldL50
+    path: assemblyStats.scaffoldL50
+  scaffold_n50:
+    header: scaffoldN50
+    path: assemblyStats.scaffoldN50
+  sequence_count:
+    header: numberOfComponentSequences
+    path: assemblyStats.numberOfComponentSequences
+  submitter:
+    header: submitter
+    path: assemblyInfo.submitter
+  ungapped_span:
+    header: totalUngappedLength
+    path: assemblyStats.totalUngappedLength
+  # NEW: Version-specific field to indicate this is a historical/superseded version
+  version_status:
+    header: versionStatus
+    path: processedAssemblyInfo.versionStatus
+
+file:
+  display_group: general
+  needs:
+    - ATTR_assembly.types.yaml
+  exclusions:
+    attributes:
+      - bioproject
+      - biosample
+    identifiers:
+      - assembly_id
+    taxonomy:
+      - taxon_id
+  format: tsv
+  header: true
+  name: assembly_historical.tsv
+  source: INSDC
+  source_url_stub: https://www.ncbi.nlm.nih.gov/assembly/
+
+identifiers:
+  assembly_id:
+    header: assemblyID
+    path: processedAssemblyInfo.assemblyID
+  assembly_name:
+    header: assemblyName
+    path: assemblyInfo.assemblyName
+  genbank_accession:
+    header: genbankAccession
+    path: processedAssemblyInfo.genbankAccession
+  refseq_accession:
+    header: refseqAccession
+    path: processedAssemblyInfo.refseqAccession
+  wgs_accession:
+    header: wgsProjectAccession
+    path: wgsInfo.wgsProjectAccession
+
+metadata:
+  is_primary_value:
+    header: primaryValue
+    path: processedAssemblyInfo.primaryValue
+  source_slug:
+    header: genbankAccession
+    path: processedAssemblyInfo.genbankAccession
+
+taxon_names:
+  common_name:
+    header: commonName
+    path: organism.commonName
+
+taxonomy:
+  taxon:
+    header: organismName
+    path: organism.organismName
+  taxon_id:
+    header: taxId
+    path: organism.taxId