Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
119 changes: 119 additions & 0 deletions Phase_0_PR_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# Pull Request: Historical Assembly Version Backfill (Phase 0)

## Summary

One-time backfill process to populate historical (superseded) assembly versions
from NCBI for all existing assemblies with version > 1. Enables version-aware
milestone tracking by capturing previously untracked superseded versions.

## What Changed (latest revision)

### Bug fixes
- **Fixed data-loss bug**: `write_to_tsv` overwrites by default, so the
original per-batch write + `parsed = {}` clearing lost all but the last
batch. All rows now accumulate in memory and `write_to_tsv` is called
exactly once at the end. Checkpoints only record resume progress.

### Structural changes
- **Renamed** `backfill_historical_versions.py` →
`parse_backfill_historical_versions.py` so `register.py` discovers it as a
plugin via the `parse_` prefix convention.
- **Changed orchestrator decorator** from `@task` to `@flow(log_prints=True)`
to match how `parse_ncbi_assemblies` structures its flow/task hierarchy.
- **Split `find_all_assembly_versions`** into `discover_version_accessions`
(FTP) + `fetch_version_metadata` (datasets CLI) for modularity and
independent testability.
- **Added `backfill_historical_versions_wrapper`** matching the
`fetch_parse_validate` parser signature so the flow integrates with the
Prefect pipeline.

### Convention alignment
- **CLI arguments** now use `shared_args` exclusively (`INPUT_PATH`,
`YAML_PATH`, `WORK_DIR`); removed ad-hoc `--checkpoint` argument.
Checkpoint path is derived via `derive_checkpoint_path`.
- **Replaced `subprocess.run`** with `utils.run_quoted` for shell-safe
argument quoting (consistent with `parse_ncbi_assemblies`).
- **Code style** aligned with GenomeHubs conventions: Google-style docstrings,
lowercase type hints (`dict`, `list`, `tuple`), `e` for exception variables,
removed shebang and section banners.
- **`assembly_historical.types.yaml`**: moved `needs` under `file:` section,
references `ATTR_assembly.types.yaml` (matches `ncbi_datasets_eukaryota`
convention), and renamed `names` → `taxon_names` for validation compliance.

### Test suite rewrite
- Rewrote `tests/test_backfill.py` using pytest (was a custom runner).
- **33 tests**, all passing, covering:
- Accession parsing helpers
- Assembly identification from JSONL fixture
- Cache round-trip with expiry
- Checkpoint save/load/derive
- Accession format validation
- `parse_historical_version` delegation (mocked)
- Orchestrator: single TSV write, multi-assembly accumulation,
current-version skipping, no-op for v1-only input
- **Regression test for the batch-overwrite data-loss bug**

## Solution Overview

The backfill script:
1. Identifies assemblies with version > 1 from the input JSONL
2. Discovers all historical versions via NCBI FTP directory listing
3. Fetches metadata for each version using NCBI Datasets CLI
4. Parses each version through `process_assembly_report` with
`version_status="superseded"`
5. Writes all accumulated rows to a single TSV via `write_to_tsv`

### Smart 2-Tier Caching
- **Version discovery cache** (7-day expiry): FTP directory listings
- **Metadata cache** (30-day expiry): Datasets CLI responses
- Reduces reruns from hours to minutes

### Checkpoint System
- Saves progress every 100 assemblies to `{work_dir}/checkpoints/`
- Allows resuming after interruptions without re-fetching cached data
- Marks the checkpoint as `completed` at the end of a full run so the next
re-run starts from index 0 (all rows collected; network fetches still served
from cache)
- Does **not** trigger intermediate TSV writes (avoids the overwrite bug)

## Files

### New
- `flows/parsers/parse_backfill_historical_versions.py` — Main backfill flow
- `configs/assembly_historical.types.yaml` — Output schema configuration
- `tests/test_backfill.py` — pytest suite (33 tests)
- `tests/test_data/assembly_test_sample.jsonl` — Test fixture (3 assemblies)

### Modified
- `flows/parsers/parse_ncbi_assemblies.py` — Added `version_status` parameter

## Usage

### As a standalone CLI
```bash
python -m flows.parsers.parse_backfill_historical_versions \
--input_path data/assembly_data_report.jsonl \
--yaml_path configs/assembly_historical.types.yaml \
--work_dir tmp
```

### Via Prefect pipeline
Discovered automatically by `register.py` and invoked through
`fetch_parse_validate` with the standard parser signature.

### Expected performance
- **First run**: ~10–15 hours (~3,700 assemblies, ~8,500 versions)
- **Subsequent runs**: ~15–30 minutes (cache hits)

### Running tests
```bash
set SKIP_PREFECT=true
python -m pytest tests/test_backfill.py -v
```

## Next Steps (after merge)

1. Run full backfill on production dataset
2. Upload output to S3
3. Implement Phase 1: incremental daily updates for new historical versions
4. Implement Phase 2: version-aware milestone tracking
201 changes: 201 additions & 0 deletions configs/assembly_historical.types.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,201 @@
# Configuration for historical assembly version backfill
# This config defines the schema for assembly_historical.tsv
# which contains all superseded assembly versions

attributes:
assembly_level:
header: assemblyLevel
path: assemblyInfo.assemblyLevel
assembly_span:
header: totalSequenceLength
path: assemblyStats.totalSequenceLength
assigned_percent:
header: assignedProportion
path: processedAssemblyStats.assignedProportion
assembly_status:
header: primaryValue
path: processedAssemblyInfo.primaryValue
translate:
"1": primary
assembly_type:
header: assemblyType
path: assemblyInfo.assemblyType
bioproject:
header: bioProjectAccession
path: assemblyInfo.bioprojectLineage.bioprojects.accession
separator:
- ","
biosample:
header: biosampleAccession
path: assemblyInfo.biosample.accession
separator:
- ","
chromosome_count:
header: totalNumberOfChromosomes
path: assemblyStats.totalNumberOfChromosomes
contig_count:
header: numberOfContigs
path: assemblyStats.numberOfContigs
contig_l50:
header: contigL50
path: assemblyStats.contigL50
contig_n50:
header: contigN50
path: assemblyStats.contigN50
ebp_standard_criteria:
header: ebpStandardCriteria
path: processedAssemblyStats.ebpStandardCriteria
separator:
- ","
ebp_standard_date:
header: ebpStandardDate
path: processedAssemblyStats.ebpStandardDate
gc_percent:
header: gcPercent
path: assemblyStats.gcPercent
gene_count:
header: geneCountTotal
path: annotationInfo.stats.geneCounts.total
gene_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
isolate:
header: isolate
path: assemblyInfo.biosample.attributes.name==isolate.value
last_updated:
header: releaseDate
path: assemblyInfo.releaseDate
mitochondrion_accession:
header: mitochondrionAccession
path: processedOrganelleInfo.mitochondrion.accession
separator:
- ;
mitochondrion_assembly_span:
header: mitochondrionAssemblySpan
path: processedOrganelleInfo.mitochondrion.assemblySpan
mitochondrion_gc_percent:
header: mitochondrionGcPercent
path: processedOrganelleInfo.mitochondrion.gcPercent
mitochondrion_scaffolds:
header: mitochondrionScaffolds
path: processedOrganelleInfo.mitochondrion.scaffolds
separator:
- ;
noncoding_gene_count:
header: geneCountNoncoding
path: annotationInfo.stats.geneCounts.nonCoding
noncoding_gene_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
plastid_accession:
header: plastidAccession
path: processedOrganelleInfo.plastid.accession
separator:
- ;
plastid_assembly_span:
header: plastidAssemblySpan
path: processedOrganelleInfo.plastid.assemblySpan
plastid_gc_percent:
header: plastidGcPercent
path: processedOrganelleInfo.plastid.gcPercent
plastid_scaffolds:
header: plastidScaffolds
path: processedOrganelleInfo.plastid.scaffolds
separator:
- ;
protein_count:
header: geneCountProteincoding
path: annotationInfo.stats.geneCounts.proteinCoding
protein_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
pseudogene_count:
header: geneCountPseudogene
path: annotationInfo.stats.geneCounts.pseudogene
pseudogene.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
refseq_category:
header: refseqCategory
path: assemblyInfo.refseqCategory
sample_sex:
header: sex
path: assemblyInfo.biosample.attributes.name==sex.value
scaffold_count:
header: numberOfScaffolds
path: assemblyStats.numberOfScaffolds
scaffold_l50:
header: scaffoldL50
path: assemblyStats.scaffoldL50
scaffold_n50:
header: scaffoldN50
path: assemblyStats.scaffoldN50
sequence_count:
header: numberOfComponentSequences
path: assemblyStats.numberOfComponentSequences
submitter:
header: submitter
path: assemblyInfo.submitter
ungapped_span:
header: totalUngappedLength
path: assemblyStats.totalUngappedLength
# NEW: Version-specific field to indicate this is a historical/superseded version
version_status:
header: versionStatus
path: processedAssemblyInfo.versionStatus

file:
Comment thread
fchen13 marked this conversation as resolved.
display_group: general
needs:
- ATTR_assembly.types.yaml
exclusions:
attributes:
- bioproject
- biosample
identifiers:
- assembly_id
taxonomy:
- taxon_id
format: tsv
header: true
name: assembly_historical.tsv
source: INSDC
source_url_stub: https://www.ncbi.nlm.nih.gov/assembly/

identifiers:
assembly_id:
header: assemblyID
path: processedAssemblyInfo.assemblyID
assembly_name:
header: assemblyName
path: assemblyInfo.assemblyName
genbank_accession:
header: genbankAccession
path: processedAssemblyInfo.genbankAccession
refseq_accession:
header: refseqAccession
path: processedAssemblyInfo.refseqAccession
wgs_accession:
header: wgsProjectAccession
path: wgsInfo.wgsProjectAccession

metadata:
is_primary_value:
header: primaryValue
path: processedAssemblyInfo.primaryValue
source_slug:
header: genbankAccession
path: processedAssemblyInfo.genbankAccession

taxon_names:
common_name:
header: commonName
path: organism.commonName

taxonomy:
taxon:
header: organismName
path: organism.organismName
taxon_id:
header: taxId
path: organism.taxId
Loading
Loading