Add PGGB pangenome build workflow (comparative_genomics) #1246
Builds a pangenome variation graph from per-strain assembly FASTAs via PGGB (wfmash + seqwish + smoothxg + gfaffix + odgi). PanSN-rename is mapped over the input collection, the renamed FASTAs are concatenated into a single multi-FASTA, and pggb is then run with the MultiQC HTML report enabled.

Validated end-to-end on the 8-strain P. vivax v3 reference pangenome (graph length within 2.7% of native build) and PGGB's own DRB1-3123 CI fixture (12 HLA haplotypes).

Tool dependencies (pggb, wfmash, seqwish, smoothxg, gfaffix, odgi, pansn_rename, fasta_concat) are currently in nekrut/brc-tools; they will move to tools-iuc post-IWC review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
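For readers unfamiliar with the first two steps, here is a minimal sketch of what "PanSN-rename, then concatenate" amounts to. It is not the `pansn_rename` or `fasta_concat` wrapper code; the function names, the default haplotype of 1, and the choice to keep only the first header token are assumptions made for illustration. The `sample#haplotype#contig` naming follows the PanSN convention.

```python
def pansn_rename(fasta_text: str, sample: str, haplotype: int = 1, delim: str = "#") -> str:
    """Rewrite every FASTA header as sample#haplotype#contig (hypothetical helper)."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumption: the contig ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            contig = tokens[0] if tokens else ""
            out.append(f">{sample}{delim}{haplotype}{delim}{contig}")
        else:
            out.append(line)
    return "".join(item + "\n" for item in out)


def build_multifasta(assemblies: dict) -> str:
    """Rename each per-strain FASTA, then concatenate into one multi-FASTA."""
    return "".join(pansn_rename(text, sample) for sample, text in assemblies.items())
```

Calling `build_multifasta({"PvA": ..., "PvB": ...})` with the FASTA text of each strain would give one string in which two strains that both have a `chr1` contig no longer collide, since their headers become `PvA#1#chr1` and `PvB#1#chr1`.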
|
Tool dependencies now live on the Main Tool Shed under owner
Browse: https://toolshed.g2.bx.psu.edu/view/nekrut
Flipping to ready-for-review.
Test Results (powered by Planemo)
Test Summary
Errored Tests
|
planemo workflow_lint flagged: 'The release of workflow ... does not match the version in the CHANGELOG'. CHANGELOG.md has [0.1]; the .ga had 0.1.0 (the release version is checked by the planemo IWC linter).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
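The mismatch above can be caught locally before pushing. The sketch below is not planemo's implementation; it assumes the `.ga` file carries a top-level `release` key and that the changelog uses `## [version]` headings with the newest entry first. Both function names are hypothetical.

```python
import json
import re


def changelog_version(changelog_text: str):
    """Return the version in the first '## [x.y]' heading, or None if absent."""
    match = re.search(r"^##\s*\[([^\]]+)\]", changelog_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def release_matches_changelog(workflow_json: str, changelog_text: str) -> bool:
    """True only when the workflow's release string equals the changelog version."""
    release = json.loads(workflow_json).get("release")
    return release is not None and release == changelog_version(changelog_text)
```

With a changelog headed `## [0.1]`, a workflow whose release is `0.1.0` fails this check and one whose release is `0.1` passes, because the comparison is a plain string match rather than a semantic-version comparison.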
|
Previous run failed in 'Combine chunked test results' with: 'Workflow was not invoked; the following required tools are not installed: pansn_rename, fasta_concat, pggb, odgi_stats'.

IWC CI runs planemo against a Galaxy that auto-installs tools from the Tool Shed using the workflow's tool_id, but only if the tool_id is in the full toolshed-qualified form. Bare 'pggb' worked locally because the tools were registered via local_tool_conf; the CI Galaxy has no such config and pulls from the Tool Shed.

Rewrote 4 tool steps (pansn_rename, fasta_concat, pggb, odgi_stats):
tool_id: toolshed.g2.bx.psu.edu/repos/nekrut/<repo>/<tool>/<ver>
+ tool_shed_repository block with changeset_revision/name/owner/tool_shed

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
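A quick way to spot bare tool IDs before CI does is to scan the `.ga` JSON. This is a rough sketch, not part of planemo or Galaxy; the function name is hypothetical, and it assumes steps live under a top-level `steps` mapping with a `tool_id` field that is null for input steps.

```python
import json

# Prefix used by tools installed from the Main Tool Shed.
TOOLSHED_PREFIX = "toolshed.g2.bx.psu.edu/repos/"


def unqualified_tool_steps(workflow_json: str) -> list:
    """List (step index, tool_id) pairs whose tool_id is a bare short name."""
    steps = json.loads(workflow_json).get("steps", {})
    bare = []
    for index, step in steps.items():
        tool_id = step.get("tool_id")
        # Input steps carry a null tool_id, so only real tool steps are checked.
        if tool_id and not tool_id.startswith(TOOLSHED_PREFIX):
            bare.append((index, tool_id))
    return bare
```

One caveat: tools that ship with Galaxy itself are legitimately unqualified, so a real check would probably need an allow-list for those.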
smoothed GFA:
  asserts:
    has_size:
      min: 100
Can we make that more specific and assert something about the contents?
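To illustrate what a content-level check could look like, here is a small sketch that counts GFA record types and verifies the path count. It assumes the smoothed graph is GFA v1 with one `P` line per input sequence, which is an assumption about this workflow's output rather than something confirmed in the PR; both function names are hypothetical.

```python
from collections import Counter


def gfa_record_counts(gfa_text: str) -> Counter:
    """Count GFA records by their one-letter type field (H, S, L, P, ...)."""
    counts = Counter()
    for line in gfa_text.splitlines():
        if line:
            counts[line.split("\t", 1)[0]] += 1
    return counts


def check_smoothed_gfa(gfa_text: str, expected_paths: int) -> None:
    """Fail if the graph has no segments or an unexpected number of paths."""
    counts = gfa_record_counts(gfa_text)
    assert counts["S"] > 0, "graph has no segments"
    assert counts["P"] == expected_paths, (
        f"expected {expected_paths} paths, found {counts['P']}"
    )
```

In the tests YAML itself, a content assertion such as `has_text` or `has_n_lines` alongside `has_size` would play a similar role, for example by checking for the PanSN path names of the three fixtures.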
Pull request overview
This PR adds a new IWC Galaxy workflow for building PGGB pangenome graphs from per-strain FASTA collections, including PanSN renaming, FASTA concatenation, PGGB graph construction, odgi stats, and MultiQC reporting.
Changes:
- Adds the `pggb-pangenome-build` workflow and metadata for Dockstore/WorkflowHub.
- Adds a Planemo smoke test with three small FASTA fixtures.
- Adds README and changelog documentation for workflow usage, outputs, and validation.
IWC workflow checklist reviewed: required files are present, test data is below the Zenodo threshold, and output labels mostly align with test names; unresolved comments cover metadata alignment, annotation format, human-readable input labels, and documented outputs that are not exposed by the workflow.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| workflows/comparative_genomics/pggb-pangenome-build/.dockstore.yml | Adds Dockstore workflow descriptor and author metadata. |
| workflows/comparative_genomics/pggb-pangenome-build/.workflowhub.yml | Adds WorkflowHub registration metadata. |
| workflows/comparative_genomics/pggb-pangenome-build/CHANGELOG.md | Documents the initial workflow release. |
| workflows/comparative_genomics/pggb-pangenome-build/README.md | Describes workflow purpose, inputs, steps, outputs, resource notes, and citations. |
| workflows/comparative_genomics/pggb-pangenome-build/pggb-pangenome-build.ga | Adds the Galaxy workflow definition for PanSN rename, concat, PGGB, and odgi stats. |
| workflows/comparative_genomics/pggb-pangenome-build/pggb-pangenome-build-tests.yml | Adds a Planemo smoke test for the workflow. |
| workflows/comparative_genomics/pggb-pangenome-build/test-data/PvA.fa | Adds small FASTA fixture for testing. |
| workflows/comparative_genomics/pggb-pangenome-build/test-data/PvB.fa | Adds small FASTA fixture for testing. |
| workflows/comparative_genomics/pggb-pangenome-build/test-data/PvC.fa | Adds small FASTA fixture for testing. |
@@ -0,0 +1,403 @@
{
    "a_galaxy_workflow": "true",
    "annotation": "PGGB pangenome graph build from PanSN-named per-strain FASTAs. Map PanSN-rename across the collection, concatenate, then run pggb.",

        "name": "n_haplotypes"
    }
],
"label": "n_haplotypes",

        "name": "segment_length"
    }
],
"label": "segment_length",

        "name": "map_pct_id"
    }
],
"label": "map_pct_id",

        "name": "min_match_len"
    }
],
"label": "min_match_len",

        "name": "vcf_spec"
    }
],
"label": "vcf_spec",

{
    "class": "Person",
    "name": "Claude Opus 4.7"

- **layout (.og.lay)** — 2D graph layout
- **layout PNG** — 2D layout rendered
- **viz PNG** — 1D path-coloured visualisation
- **pggb log** — full run log with all parameter hashes
- **MultiQC report** — interactive HTML collating odgi stats + viz across
  both the seqwish-induced intermediate and final smoothed graphs
- **deconstruct VCF** (optional) — only when `vcf_spec` is set

2D layout (`.og.lay`), layout/viz PNGs, run log, optional VCF
via `vg deconstruct`, graph stats TSV, and MultiQC HTML report
collating odgi stats + viz across the build.
|
CI gating: tool install timing
Updates:
Reverting to draft. This pattern works for tools-iuc-owned dependencies (which usegalaxy.* hosts via cvmfs at the CI's
@iwc-reviewers: what's the canonical way to get IWC CI to accept a workflow depending on a fresh-Tool-Shed-only suite? Options I see:
Happy to move toward whichever pattern reviewers prefer.
|
It just works; we don't use cvmfs here. Something else is wrong in your workflow.
Galaxy resolves each workflow step's tool by content_id at invoke time. The previous commit set the full toolshed tool_id but left content_id as the bare short name (e.g. pansn_rename), so invocation failed with 'required tools are not installed' even though the tools were installed and loaded into the panel. Set content_id to match tool_id for all four tool steps, matching every other IWC workflow.
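The fix described here can be expressed as a small transformation over the `.ga` JSON. This is a sketch of the idea, not the script used in the commit; the function name is hypothetical, and it assumes each tool step stores both `tool_id` and `content_id` as sibling fields under a top-level `steps` mapping.

```python
import json


def sync_content_ids(workflow_json: str) -> str:
    """Return workflow JSON in which every tool step's content_id equals its tool_id."""
    workflow = json.loads(workflow_json)
    for step in workflow.get("steps", {}).values():
        tool_id = step.get("tool_id")
        # Input steps have a null tool_id and are left untouched.
        if tool_id and step.get("content_id") != tool_id:
            step["content_id"] = tool_id
    return json.dumps(workflow, indent=4)
```

Running this over a workflow where `tool_id` was already rewritten to the toolshed-qualified form would leave the two fields identical for every tool step, which is the state the commit message describes.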
|
The tests are passing with d086321, but please address the Copilot comments.
|
Can you replace the fasta concat with https://usegalaxy.org/?tool_id=toolshed.g2.bx.psu.edu%2Frepos%2Fnml%2Fcollapse_collections%2Fcollapse_dataset%2F5.1.0&version=latest or similar? This seems like something we'd have existing tools for.
|
There are probably more tools and also there is some work @SaimMomin12 is doing that I need to sync with. Give me a few days. |
Summary
Adds a new workflow under `comparative_genomics/pggb-pangenome-build/` that builds a pangenome variation graph from per-strain assembly FASTAs via the PGGB pipeline (wfmash → seqwish → smoothxg → gfaffix → odgi), then surfaces an interactive MultiQC report summarising the build.
Pipeline
Outputs
.og, 2D layout (.og.lay), layout PNG, 1D viz PNG, run log
the seqwish-induced and final smoothed graphs
`vg deconstruct` when `vcf_spec` is set
Validation
Two e2e runs on a local Galaxy 26.1-dev with the wrappers installed:
target, accessions in PvP01/Sal-I/PvW1/PAM/PvSY56/PvT01/PvC01/MHC087 from NCBI datasets). ~28 min wall on 16 cores. Graph nucleotide length matched the v2 native reference build within −2.7%; the path count delta is driven entirely by upstream NCBI re-annotation (one strain has +10 contigs vs v2's input), not by the workflow.
compressed input). Standard pggb smoke test. Completed cleanly with the MultiQC path enabled; all 7 outputs produced.
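For clarity on how a figure like "within −2.7%" is typically computed, here is a minimal sketch of a signed relative difference. The function name is hypothetical and the lengths in the usage note are made-up round numbers, not the actual P. vivax graph lengths, which are not given in this PR.

```python
def relative_difference_pct(observed: int, reference: int) -> float:
    """Signed difference of observed vs reference, as a percentage of reference."""
    if reference == 0:
        raise ValueError("reference length must be non-zero")
    return (observed - reference) / reference * 100.0
```

With an illustrative reference length of 1,000,000 bp and an observed length of 973,000 bp, the function returns a value of about −2.7, meaning the rebuilt graph is slightly shorter than the reference build.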
Dependencies — important caveat
The tool wrappers required by this workflow are currently in nekrut/brc-tools, not yet in tools-iuc:
- `pansn_rename` (NEW custom)
- `fasta_concat` (NEW custom)
- `pggb` 0.7.4 (NEW)
- `wfmash` 0.24.2 (bumped from tools-iuc 0.14)
- `seqwish` 0.7.11 (refreshed)
- `smoothxg` 0.8.2 (NEW)
- `gfaffix` 0.2.2 (NEW)
- `odgi` 0.9.4 (bumped from 0.3, build/stats/viz subset)
- `vg` 1.73.0 (bumped from 1.23, with new `deconstruct` subcommand)

All 9 wrappers have planemo tests green (27 tests total) and 0 lint warnings; they were exercised end-to-end on a local Galaxy with the workflow itself.
This PR is opened as draft because IWC CI will fail until the wrappers are installable on usegalaxy.* / Tool Shed. The plan:
nekrut).
`tools-iuc` in a follow-up; bump tool versions here when they do.
Test plan
- `planemo workflow_lint` passes
- `.ga` imports into Galaxy without upgrade messages

🤖 Generated with Claude Code