Add viral detection from bulk metagenomes #1222
Test Results (powered by Planemo)
Test Summary
Errored Tests
Not sure why this test ran out of runtime again, but checking the log it seems that it has a
Pull request overview
This PR adds a new IWC microbiome workflow for detecting viral contigs from bulk metagenomic assemblies and running downstream quality, host prediction, annotation, coverage, and vMAG binning analyses.
Changes:
- Adds a Galaxy workflow using geNomad, CheckV, MMseqs2, iPHoP, Pharokka, Bowtie2, CoverM, and vRhyme.
- Adds Dockstore metadata and a Planemo test configuration with Zenodo-hosted inputs.
- Adds README and changelog documentation for the new workflow.
Checklist issues were identified around annotation wording, test file naming, changelog date validity, exposed outputs, and workflow output spelling.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| workflows/microbiome/viral-detection-from-bulk-metagenomes/viral-detection-from-bulk-metagenomes.ga | Defines the new Galaxy viral detection workflow and workflow outputs. |
| workflows/microbiome/viral-detection-from-bulk-metagenomes/viral-detection-from-bulk-metagenomes-test.yml | Adds Planemo tests and expected output assertions. |
| workflows/microbiome/viral-detection-from-bulk-metagenomes/README.md | Documents workflow purpose, inputs, steps, outputs, and license. |
| workflows/microbiome/viral-detection-from-bulk-metagenomes/CHANGELOG.md | Adds initial release changelog entry. |
| workflows/microbiome/viral-detection-from-bulk-metagenomes/.dockstore.yml | Adds Dockstore workflow metadata and test file reference. |
    publish: true
    primaryDescriptorPath: /viral-detection-from-bulk-metagenomes.ga
    testParameterFiles:
      - /viral-detection-from-bulk-metagenomes-test.yml
    @@ -0,0 +1,5 @@
    # Changelog

    ## [1.0] - 2026-26-04
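As an illustration of the changelog date issue noted in the overview (this is not part of the PR), the following minimal sketch checks whether a string parses as an ISO 8601 calendar date in YYYY-MM-DD order, which the heading appears to follow. `is_valid_iso_date` is a hypothetical helper name, and the second example value is only a guess at what the author may have intended.

```python
from datetime import date


def is_valid_iso_date(text: str) -> bool:
    """Return True if `text` parses as an ISO 8601 calendar date (YYYY-MM-DD)."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


# Value taken from the changelog heading: the month field would be 26.
print(is_valid_iso_date("2026-26-04"))
# One plausible intended value (an assumption, not confirmed by the author).
print(is_valid_iso_date("2026-04-26"))
```

Read as YYYY-MM-DD, the heading's value has no valid month, so it fails to parse.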
    ]
    },
    "19": {
        "annotation": "Additional filtering of VirSorter contigs using complex custom rules. If all conditions are AND-conditions you may be able to set them on the tool directly.\nBy default this step imposes a stricter score threshold of 0.9 on sequences shorter than 3000 bp, instead of the standard threshold of 0.5.",
        "post_job_actions": {
            "RenameDatasetActionoutput": {
                "action_arguments": {
                    "newname": "VirSorter filtered scores"
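The default rule described in this annotation can be written out as a small predicate. This is only a sketch of the stated thresholds, not the workflow's actual filter expression: the function and parameter names are hypothetical, and treating the score comparison as inclusive (`>=`) is an assumption, since the annotation does not say.

```python
def passes_score_filter(
    length_bp: int,
    score: float,
    short_cutoff_bp: int = 3000,      # sequences shorter than this count as "short"
    short_min_score: float = 0.9,     # stricter threshold for short sequences
    default_min_score: float = 0.5,   # standard threshold for everything else
) -> bool:
    """Apply a length-dependent minimum score (inclusive comparison assumed)."""
    if length_bp < short_cutoff_bp:
        min_score = short_min_score
    else:
        min_score = default_min_score
    return score >= min_score


# A 2.5 kb contig scoring 0.8 is rejected, a 5 kb contig scoring 0.6 is kept.
print(passes_score_filter(2500, 0.8), passes_score_filter(5000, 0.6))
```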
    "label": "best bins summery",
    "output_name": "best_bins_summery",
        asserts:
          - that: has_n_lines
            n: 0
      best bins summery:
    "type": "tool",
    "uuid": "08681d8f-a9a8-4bfc-b1e6-3c6694907756",
    "when": null,
    "workflow_outputs": []
    @@ -0,0 +1,1618 @@
    {
        "a_galaxy_workflow": "true",
        "annotation": "This workflow identifies viral contigs from metagenomic assemblies using geNomad and supports taxonomy, functional annotation, binning, and host prediction.",
bernt-matthias left a comment
Why do you run Bowtie2 on different FASTA files (filtered / unfiltered)? In other words: why does vRhyme get a different input than CoverM?
    Viral contigs from both geNomad runs are combined.

    ### 4. Clustering and dereplication
    Redundant sequences are removed using **MMseqs2** clustering.
Any specific reason not to use dRep?
1- Bowtie2 is currently run twice: once against the broader viral-contig set used for vRhyme binning, and once against the stricter CheckV-filtered set used for CoverM, which includes only medium-quality, high-quality, and complete viral contigs. I will check whether the workflow can be simplified by mapping once against the broader dereplicated viral-contig set and then filtering the resulting coverage/abundance table to the final CheckV quality set. If this works cleanly in Galaxy, I will update the workflow accordingly.
2- We used MMseqs2 here because at this stage we are still working with viral candidate contigs before vRhyme binning, not final vMAGs. dRep can technically compare and dereplicate genome FASTA files, so it could be more suitable if we want to dereplicate the final vMAGs later. But for this contig-level step, MMseqs2 seemed more appropriate.
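The single-mapping idea from point 1 could look roughly like this outside Galaxy. It is a sketch under assumptions, not the planned workflow change: `contig_id` and `checkv_quality` follow CheckV's `quality_summary.tsv`, while the `Contig` identifier column of the coverage table, the helper names, and the toy data are assumptions for illustration.

```python
import csv
import io

# Quality tiers named in the reply above.
KEEP = {"Medium-quality", "High-quality", "Complete"}


def contigs_passing_checkv(quality_summary_tsv: str) -> set:
    """Collect contig IDs whose CheckV quality tier is in KEEP."""
    reader = csv.DictReader(io.StringIO(quality_summary_tsv), delimiter="\t")
    return {row["contig_id"] for row in reader if row["checkv_quality"] in KEEP}


def filter_coverage(coverage_tsv: str, keep_ids: set, id_column: str = "Contig") -> list:
    """Keep only coverage rows whose contig ID is in keep_ids."""
    reader = csv.DictReader(io.StringIO(coverage_tsv), delimiter="\t")
    return [row for row in reader if row[id_column] in keep_ids]


# Toy inputs (made up for the example).
quality = (
    "contig_id\tcheckv_quality\n"
    "c1\tComplete\n"
    "c2\tLow-quality\n"
    "c3\tMedium-quality\n"
)
coverage = "Contig\tsample1 Mean\nc1\t12.5\nc2\t3.0\nc3\t7.1\n"

kept = filter_coverage(coverage, contigs_passing_checkv(quality))
print([row["Contig"] for row in kept])
```

Filtering the table after one mapping run keeps the read mapping itself identical for vRhyme and CoverM; whether abundances computed against the broader reference are acceptable for the final set is a question for the workflow authors.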
        "workflow_outputs": []
    },
    "5": {
        "annotation": "Additional filtering of geNomad contigs using complex custom rules. If all conditions are AND-conditions you may be able to set them on the tool directly.\nBy default this step does nothing beyond the conservative preset applied by geNomad.",
By default this step does nothing beyond the conservative preset applied by geNomad.
->
By default this step does nothing, i.e. it copies the results from geNomad.
        "owner": "bgruening",
        "tool_shed": "toolshed.g2.bx.psu.edu"
    },
    "tool_state": "{\"code\": \"/^>/ {print \\\">\\\" FILENAME \\\"_\\\" ++i} !/^>/ {print}\", \"infile\": {\"__class__\": \"ConnectedValue\"}, \"variables\": [], \"__page__\": 0, \"__rerun_remap_job_id__\": null}",
I do not understand what this is doing. Are you replacing fasta headers using filenames (would be Galaxy's internal filenames)? This seems wrong to me.
Thank you for catching this. I think the intention was to make the FASTA identifiers unique, but I agree the approach needs to be fixed.
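One possible direction, sketched here as an assumption and not as the fix the authors will choose: derive the prefix from an explicit, stable label instead of the file path, and keep the original header text so identifiers can be traced back. `relabel_fasta` and its `label` parameter are hypothetical; in Galaxy the label would have to come from something meaningful, such as a collection element identifier.

```python
from typing import Iterable, Iterator


def relabel_fasta(lines: Iterable[str], label: str) -> Iterator[str]:
    """Rewrite each header to '>{label}_{n} {original header}'; pass sequence lines through."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            # The original header text is kept after a space as the description.
            yield f">{label}_{count} {line[1:]}"
        else:
            yield line


# Two records that share the identifier "a" become distinguishable.
records = [">a desc", "ACGT", ">a", "GG"]
print(list(relabel_fasta(records, "sampleA")))
```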
    },
    "3": {
        "annotation": "",
        "content_id": "toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.2",
Wondering about the workflow flow: this will construct a list:list, won't it? Is it fine to give partial input to geNomad? Why split at all?
If I understood correctly, this step was added because running geNomad on the full assembly caused runtime/resource issues, so the idea was to process the FASTA in smaller chunks. But I agree that the collection structure needs to be checked, and I will review this part and revise it if needed.
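For readers unfamiliar with the chunking idea, this is a minimal sketch of splitting a FASTA file into chunks of whole records. It is not the logic of the Galaxy split tool; `chunk_fasta_records` and `records_per_chunk` are hypothetical names, and the sketch says nothing about whether geNomad results on partial input match a full-assembly run.

```python
from typing import Iterable, Iterator, List


def chunk_fasta_records(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most `records_per_chunk` whole records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    n_records = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Start a new chunk only at a record boundary, never mid-sequence.
            if n_records == records_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk


fasta = [">a", "AC", "GT", ">b", "GG", ">c", "TT"]
print(list(chunk_fasta_records(fasta, 2)))
```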
        "top": 50
    },
    "post_job_actions": {},
    "tool_id": "toolshed.g2.bx.psu.edu/repos/ufz/genomad_end_to_end/genomad_end_to_end/1.11.1+galaxy0",
License agreement should be a workflow parameter
What is the idea of the 2nd geNomad round?
I think the README could use some extension.
This follows the MVP workflow logic. The second geNomad run is used after CheckV separates proviral sequences, and the results are then combined with the other viral predictions. I will revise the README.
    },
    "8": {
        "annotation": "",
        "content_id": "toolshed.g2.bx.psu.edu/repos/iuc/qiime_filter_fasta/qiime_filter_fasta/1.9.1.0",
Not sure if the QIIME (v1) tools are a good choice, since they are EOL. Is there a replacement?
This is only used for FASTA filtering by sequence ID, not for QIIME-specific analysis. I will replace it.
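The operation in question is small enough to sketch directly. This is an illustration of filtering FASTA records by sequence ID, not a proposal for which Galaxy tool should replace the QIIME one; `filter_fasta_by_id` is a hypothetical name, and taking the ID as the header text up to the first whitespace is an assumption about the inputs.

```python
from typing import Iterable, Iterator, Set


def filter_fasta_by_id(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield only the records whose identifier is in keep_ids."""
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The identifier is the header text up to the first whitespace.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            keep = seq_id in keep_ids
        if keep:
            yield line


contigs = [">c1 viral", "AC", ">c2", "GG", ">c3", "TT"]
print(list(filter_fasta_by_id(contigs, {"c1", "c3"})))
```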
I didn't create the workflow; I was only supposed to add it to IWC. I messaged the creator, and she will answer the questions about the tool choices.
FOR CONTRIBUTOR:
FOR REVIEWERS:
- This workflow does/runs/performs … xyz … to generate/analyze/etc …
- name field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
- prefer dash (-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id