Add DNA sequence classification workflow using LSTM (RNN) neural network from Galaxy ML tool suite #1229
anuprulez wants to merge 27 commits into
Conversation
hujambo-dunia
left a comment
Content looks good.
| filetype: "tabular" | ||
| DNA sequences: | ||
| class: File | ||
| path: "test-data/dna-sequence.fasta" |
There was a problem hiding this comment.
outfile_predict.tabular isn't referenced, can you remove that file ?
| outfile_predict: | ||
| asserts: | ||
| has_n_columns: | ||
| n: 3 No newline at end of file |
There was a problem hiding this comment.
Can you please make that more specific ? Something about the contents of the file
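One possible shape for a content-level check, sketched in Python rather than in the test YAML. It assumes the three columns of the prediction file are per-class probabilities that sum to roughly 1 per row and that the file has no header line; both are assumptions, and `check_prediction_table` is a hypothetical helper, not part of Planemo or Galaxy.

```python
def check_prediction_table(text, n_classes=3, tol=1e-3):
    """Return True if every non-empty row holds n_classes probabilities summing to ~1.

    Assumption: tab-separated values, no header row.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return False
    for row in rows:
        if len(row) != n_classes:
            return False
        try:
            probs = [float(value) for value in row]
        except ValueError:
            # A non-numeric cell means this is not a probability table.
            return False
        if any(p < 0.0 or p > 1.0 for p in probs):
            return False
        if abs(sum(probs) - 1.0) > tol:
            return False
    return True


# Made-up example rows, not taken from outfile_predict.tabular.
example = "0.90\t0.05\t0.05\n0.10\t0.70\t0.20\n"
print(check_prediction_table(example))
```

A check like this says more about the output than a column count alone, but it still does not pin down specific predicted values.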
There was a problem hiding this comment.
Pull request overview
Adds a new Galaxy workflow under workflows/machine-learning/ to classify DNA sequences using an LSTM-based deep learning pipeline from the Galaxy ML tool suite, including Dockstore metadata, documentation, tests, and bundled test datasets.
Changes:
- Added dna-seq-classification-lstm.ga implementing encoding, train/test split, Keras model training/evaluation, and prediction.
- Added Dockstore descriptor, README, changelog, workflow test definition, and workflow-scoped test data files.
- Added an additional repository-level test dataset file under test-data/.
Reviewed changes
Copilot reviewed 6 out of 10 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| workflows/machine-learning/dna-seq-classification-lstm/dna-seq-classification-lstm.ga | New Galaxy workflow definition for LSTM-based DNA sequence classification. |
| workflows/machine-learning/dna-seq-classification-lstm/dna-seq-classification-lstm-tests.yml | Defines a Galaxy workflow test job and output assertions. |
| workflows/machine-learning/dna-seq-classification-lstm/.dockstore.yml | Registers the workflow and its test parameter file for Dockstore/TRS. |
| workflows/machine-learning/dna-seq-classification-lstm/README.md | Documents workflow purpose, inputs/outputs, and model/training details. |
| workflows/machine-learning/dna-seq-classification-lstm/CHANGELOG.md | Introduces initial changelog entry for version 0.1. |
| workflows/machine-learning/dna-seq-classification-lstm/test-data/dna-sequence-labels.tabular | Adds labels used as workflow test input data. |
| workflows/machine-learning/dna-seq-classification-lstm/test-data/outfile_predict.tabular | Adds example prediction-probability output data. |
| test-data/Labels of DNA sequences.tabular | Adds repository-level labels dataset (appears duplicative/unreferenced). |
| @@ -0,0 +1,703 @@ | |||
| { | |||
| "a_galaxy_workflow": "true", | |||
| "annotation": "", | |||
| "workflow_outputs": [ | ||
| { | ||
| "label": "outfile_predict", | ||
| "output_name": "outfile_predict", | ||
| "uuid": "14aee224-e581-42fb-9925-1c80c50b0934" | ||
| } | ||
| ] |
| "owner": "bgruening", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"__job_resource\": {\"__job_resource__select\": \"yes\", \"__current_case__\": 1, \"gpu\": \"1\"}, \"experiment_schemes\": {\"selected_exp_scheme\": \"train_val\", \"__current_case__\": 0, \"infile_estimator\": {\"__class__\": \"ConnectedValue\"}, \"hyperparams_swapping\": {\"param_set\": [{\"__index__\": 0, \"sp_name\": null, \"sp_value\": \"\"}]}, \"test_split\": {\"split_algos\": {\"shuffle\": \"simple\", \"__current_case__\": 1, \"test_size\": \"0.2\", \"random_state\": null}}, \"metrics\": {\"scoring\": {\"primary_scoring\": \"accuracy\", \"__current_case__\": 1, \"secondary_scoring\": [\"f1_macro\", \"recall_macro\"]}}}, \"input_options\": {\"selected_input\": \"tabular\", \"__current_case__\": 0, \"infile1\": {\"__class__\": \"ConnectedValue\"}, \"header1\": false, \"column_selector_options_1\": {\"selected_column_selector_option\": \"all_columns\", \"__current_case__\": 4}, \"infile2\": {\"__class__\": \"ConnectedValue\"}, \"header2\": false, \"column_selector_options_2\": {\"selected_column_selector_option2\": \"all_columns\", \"__current_case__\": 4}}, \"save\": [\"save_estimator\", \"save_prediction\"], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
| "top": 516.007763092226 | ||
| }, | ||
| "post_job_actions": {}, | ||
| "tool_id": "Cut1", | ||
| "tool_state": "{\"columnList\": \"c59\", \"delimiter\": \"T\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | ||
| "tool_uuid": null, | ||
| "tool_version": "1.0.2", |
There was a problem hiding this comment.
The number of columns corresponds to the number of k-mers in a DNA sequence, which depends directly on the sequence length and the k-mer size. The number of columns is therefore dynamic, and users need to adapt it in the workflow depending on the dataset being used.
There was a problem hiding this comment.
Then it has to be a workflow parameter. Editing the workflow cannot be the suggested way to deal with that.
There was a problem hiding this comment.
Wait, are you saying this works with a fixed-length FASTA? This seems like very important information that should go in the input description and the README.
There was a problem hiding this comment.
are you saying this works with a fixed-length FASTA
No, it supports variable-length FASTA. The to_categorical tool finds the maximum length across all sequences and then pads shorter sequences with zeros (which the deep learning model knows to ignore at those positions). I have added this information to the README.
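A rough sketch of the arithmetic discussed in this thread, assuming overlapping k-mers with stride 1 and right-padding with zeros. Both are assumptions: the padding side used by the to_categorical tool is not stated here. Under these assumptions a 60 bp sequence with k = 3 yields 58 k-mers, which would be consistent with the c1-c58 feature columns and the c59 label column cut in the workflow. `kmer_count` and `pad_to_max` are hypothetical helper names.

```python
def kmer_count(seq_len, k):
    """Number of overlapping k-mers (stride 1) in a sequence of length seq_len."""
    return max(seq_len - k + 1, 0)


def pad_to_max(encoded, pad_value=0):
    """Right-pad every integer-encoded sequence with pad_value up to the longest one.

    Assumption: padding goes on the right; the real tool may pad differently.
    """
    max_len = max(len(row) for row in encoded)
    return [row + [pad_value] * (max_len - len(row)) for row in encoded]


print(kmer_count(60, 3))                    # 58 feature columns
print(pad_to_max([[38, 29, 3], [12, 7]]))   # [[38, 29, 3], [12, 7, 0]]
```

Because the column count follows from the longest sequence and k, a hard-coded column range only fits datasets with the same maximum length.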
| "left": 1109.7736284525877, | ||
| "top": 195.5612919938918 | ||
| }, | ||
| "post_job_actions": {}, | ||
| "tool_id": "Cut1", | ||
| "tool_state": "{\"columnList\": \"c1-c58\", \"delimiter\": \"T\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | ||
| "tool_uuid": null, |
There was a problem hiding this comment.
Same comment as before.
| outputs: | ||
| outfile_predict: | ||
| asserts: |
| @@ -0,0 +1,5 @@ | |||
| # Changelog | |||
|
|
|||
| ## [0.1] | |||
| 0 | ||
| 0 | ||
| 0 | ||
| 0 | ||
| 0 |
| AGACCCGCCGGGAGGCGGAGGACCTGCAGGGTGAGCCCCACCGCCCCTCCGTGCCCCCGC | ||
| >3 | ||
| GAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATG | ||
| >4 | ||
| GGGCTGCGTTGCTGGTCACATTCCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCC | ||
| >5 | ||
| GCTCAGCCCCCAGGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCT | ||
| >6 | ||
| CAGACTGGGTGGACAACAAAACCTTCAGCGGTAAGAGAGGGCCAAGCTCAGAGACCACAG | ||
| >7 | ||
| CCTTTGAGGACAGCACCAAGAAGTGTGCAGGTACGTTCCCACCTGCCCTGGTGGCCGCCA | ||
| >8 | ||
| CCCTCGTGCGGTCCACGACCAAGACCAGCGGTGAGCCACGGGCAGGCCGGGGTCGTGGGG | ||
| >9 | ||
| TGGCGACTACGGCGCGGAGGCCCTGGAGAGGTGAGGACCCTCCTGTCCCTGCTCCAGTCC | ||
| >10 | ||
| AAGCTGACAGTGGACCCGGTCAACTTCAAGGTGAGCCAGGAGTCGGGTGGGAGGGTGAGA | ||
| >11 | ||
| TGGCGACTACGGCGCGGAGGCCCTGGAGAGGTGAGGACCCTGGTATCCCTGCTGCCAGTC | ||
| >12 | ||
| AAGCTGAGAGTGGACCCTGTCAACTTCAAGGTGAGCCACCAGTCGGGTGGGGAGGGTGAG | ||
| >13 | ||
| GGAAGATGCTGGAGGAGAAACCCTGGGAAGGTAGGCTCTGGTGACCAGGACAAGGGAGGG | ||
| >14 | ||
| AAGCTGCATGTGGATCCTGAGAACTTCAGGGTGAGTACAGGAGATGTTTCAGCCCTGTTG | ||
| >15 | ||
| GGAAGATGTTGGAGGAGAAACCCTGGGAAGGTAGGCTCTGGTGACCAGGACAAGGGAGGG | ||
| >16 | ||
| AAGCTGCATGTGGATCCTGAGAACTTCAGGGTGAGTACAGGAGATGTTTCAGCCCTGTTG | ||
| >17 | ||
| GGCACCACCACTGACCTGGGACAGTGAATCGTAAGTATGCCTTTCACTGCGAGGGGTTCT | ||
| >18 | ||
| TTGCTCTGGTGAATTACATCTTCTTTAAAGGTAAGGTTGCTCAACCAGCCTGAGCTGTTT | ||
| >19 | ||
| CACCAAGTTCCTGGAAAATGAAGACAGAAGGTGATTCCCCAACCTGAGGGTGACCAAGAA | ||
| >20 | ||
| ACAGAGGAGGCACCCCTGAAGCTCTCCAAGGTGAGATCACCCTGACGACCTTGTTGCACC | ||
| >21 | ||
| GTGCCCATCACCAACGCCACCCTGGACCGGGTGAGTGCCTGGGCTAGCCCTGTCCTGAGC | ||
| >22 | ||
| CACGATCTTTCTCAGAGAGTACCAGACCCGGTGAGAGCCCCCATTCCAATGCACCCCCGA | ||
| >23 | ||
| AGCGGGAGAATGGGACCGTCTCCAGATACGGTGAGGGCCAGCCCTCAGGCAGGAGGGTTC |
Test Results (powered by Planemo)
Test Summary
Passed Tests
| @@ -0,0 +1,118 @@ | |||
| # 🧬 DNA Sequence Classification using LSTM (Galaxy Workflow) | |||
There was a problem hiding this comment.
I read the README and I wouldn't say I'm a bioinfo noob, but I have no idea what this workflow does. Maybe an example would be helpful?
| ## 📥 Inputs | ||
| The workflow requires two datasets: | ||
| - DNA sequences (FASTA format) | ||
| - Labels for sequences (tabular format) |
There was a problem hiding this comment.
What is this format ? Each line is a value per nucleotide ? How does a user come across that file ? What's the logic of the train/test split ? What are we predicting and how can we predict something if we use the same input as training ? Add this also to the description of the label input in the workflow.
This reads so much like an AI written readme (apologies if handcrafted), which isn't a problem per se, but you should ask yourself if this is sufficient information for someone to understand and use the workflow.
I probably don't care about the model architecture; I want to know what it does and how it does it.
There was a problem hiding this comment.
What is this format ?
Labelled (biological) data for ML, including genomic sequences, usually comes in tabular format (in publicly available datasets, for example):
seq_id,seq,label
0,CGTAT,0 (Neither)
1,TGATA,1 (Exon-Intron)
2,TTGTA,2 (Intron-Exon)
The sequence column is usually converted to k-mers (CGTAT -> CGT GTA TAT for 3-mers).
This step requires the DNA sequences in FASTA format;
the labels stay in tabular format.
Later, each 3-mer becomes a token and is represented by an integer index (CGT is represented by 38, GTA by 29 and TAT by 3). So, CGTAT -> CGT GTA TAT -> 38, 29, 3.
Then 38, 29, 3 become three features associated with the original label, which is 0:
38, 29, 3, 0 (the last column is the label column and the rest are features).
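The encoding steps above can be sketched as follows. This is an illustration only: the vocabulary here is simply every 3-mer over ACGT in lexicographic order, so the integer indices will not match the 38, 29, 3 used in the example above, and `kmers`, `build_vocab` and `encode_row` are hypothetical names rather than the Galaxy tools' API.

```python
from itertools import product


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_vocab(k=3, alphabet="ACGT"):
    """Map every possible k-mer to an integer; index 0 is left free for padding."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k), start=1)}


def encode_row(seq, label, vocab, k=3):
    """Turn one labelled sequence into a feature row with the label as last column.

    Raises KeyError for characters outside the alphabet (for example N).
    """
    return [vocab[km] for km in kmers(seq, k)] + [label]


vocab = build_vocab()
print(kmers("CGTAT"))               # ['CGT', 'GTA', 'TAT']
row = encode_row("CGTAT", 0, vocab)
print(row)                          # three integer features followed by the label
```

The important property is that the same vocabulary is reused for every sequence, so identical k-mers always map to the same integer.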
Each line is a value per nucleotide
No, each label is for a sequence (CGTAT)
How does a user come across that file ?
There is no standard format, only a few standard steps such as k-mer conversion and feature transformation. Labels could also be embedded in the FASTA file itself, for example.
What's the logic of the train/test split ? What are we predicting and how can we predict something if we use the same input as training ?
We take the entire dataset and prepare it for ML by converting it into feature and label combinations (CGTAT 0 -> CGT GTA TAT 0 -> 38, 29, 3, 0); the last entry is the label.
Then the entire dataset is split row-wise into train and test sets, which is standard practice (75% of rows go to training and the rest to testing). The model is trained on the training data and then evaluated on the test data. Users can bring separate test data, but it needs to be transformed using the same k-mer vocabulary (just as LLMs need their own vocabulary/tokenizers), and it should come from the same distribution. For example, the workflow can predict splice junctions, which are the labels.
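A minimal sketch of the row-wise split described above, using only the standard library. The 75/25 ratio comes from this comment; the shuffling, the fixed seed and the helper name `train_test_split_rows` are assumptions for illustration and are not how the Galaxy split tool is necessarily implemented.

```python
import random


def train_test_split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle rows and split them into (train, test) by test_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Eight made-up rows: three integer features followed by a label in the last column.
rows = [[i, i + 1, i + 2, i % 3] for i in range(8)]
train, test = train_test_split_rows(rows)
print(len(train), len(test))  # 6 2
```

The model only ever sees the training rows during fitting, which is why scoring it on the held-out rows gives an estimate of how it behaves on unseen sequences.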
Add this also to the description of the label input in the workflow.
I think it is already there in the workflow file. Do you mean somewhere else as well?
"inputs": [
{
"description": "",
"name": "Labels of DNA sequences"
}
],
|
|
||
| ### An example task achieved by the workflow | ||
|
|
||
| The workflow can be used to perform DNA sequence classification on splice-junction gene sequences. In an example task, the workflow takes raw DNA sequence data as input and classifies each sequence according to whether it contains an exon–intron boundary, an intron–exon boundary, or no splice junction using an LSTM-based deep learning model. These classes correspond to donor sites (EI), acceptor sites (IE), and neither (N). The biological goal of such a task is to identify where RNA splicing occurs. During splicing, non-coding introns are removed and coding exons are joined together before a gene is translated into a protein. Detecting these splice-junction boundaries from DNA sequences helps in understanding gene structure and function. More information about such a dataset can be found in this [blogpost](https://galaxyproject.org/news/2026-04-28-tabpfn-v2-5/#splice-junction-gene-sequences) |
There was a problem hiding this comment.
Thanks for including the link. I see from the screenshot that you need to certify that you're not a commercial user. If TabPFN does not have a permissive license, we cannot merge this into the IWC.
There was a problem hiding this comment.
The blogpost only explains such a dataset, which can also be used with the LSTM-based workflow (this PR) for DNA sequence classification. This workflow, however, does not use TabPFN or any other commercial tool anywhere.
There was a problem hiding this comment.
Then please edit the text so it says, "see here for how to produce a label file" ? Is that what that blogpost describes ?
There was a problem hiding this comment.
Is that what that blogpost describes
No, the blogpost describes a dataset that comes from a public repository. It also describes another workflow that uses TabPFN as a DNA sequence classifier (instead of the LSTM network in this PR).
Then please edit the text so it says, "see here for how to produce a label file" ?
I will add a similar piece of text. Added here: 96dbca4
…-classification-lstm.ga Co-authored-by: Marius van den Beek <m.vandenbeek@gmail.com>
| The workflow requires two datasets: | ||
| - DNA sequences (FASTA format) | ||
| - Labels for sequences (tabular format) | ||
| - Labels for sequences (tabular format) (for example, a file containing a list of splice junction (exon-intron, intron-exon and neither) categories corresponding to each DNA sequence) |
There was a problem hiding this comment.
That is still insufficient; what does that look like ? And please add that in the workflow proper.
There was a problem hiding this comment.
I updated it to "Categories/labels/classes for DNA sequences (tabular format) (e.g. splice junctions (exon-intron, intron-exon and neither) corresponding to DNA sequences)"
In the workflow, it is "Task specific categories/labels/classes of DNA sequences".
| ### An example task achieved by the workflow | ||
|
|
||
| The workflow can be used to perform DNA sequence classification on splice-junction gene sequences. In an example task, the workflow takes raw DNA sequence data as input and classifies each sequence according to whether it contains an exon–intron boundary, an intron–exon boundary, or no splice junction using an LSTM-based deep learning model. These classes correspond to donor sites (EI), acceptor sites (IE), and neither (N). The biological goal of such a task is to identify where RNA splicing occurs. During splicing, non-coding introns are removed and coding exons are joined together before a gene is translated into a protein. Detecting these splice-junction boundaries from DNA sequences helps in understanding gene structure and function. More information about such a dataset can be found in this [blogpost](https://galaxyproject.org/news/2026-04-28-tabpfn-v2-5/#splice-junction-gene-sequences) | ||
| The workflow can be used to perform DNA sequence classification on splice-junction gene sequences. In an example task, the workflow takes raw DNA sequence data as input and classifies each sequence according to whether it contains an exon–intron boundary, an intron–exon boundary, or no splice junction using an LSTM-based deep learning model. These classes correspond to donor sites (EI), acceptor sites (IE), and neither (N). The biological goal of such a task is to identify where RNA splicing occurs. During splicing, non-coding introns are removed and coding exons are joined together before a gene is translated into a protein. Detecting these splice-junction boundaries from DNA sequences helps in understanding gene structure and function. More information about such a dataset can be found in this [blogpost](https://galaxyproject.org/news/2026-04-28-tabpfn-v2-5/#splice-junction-gene-sequences). The blogpost uses a publicly available dataset that contains DNA sequences and their respective splice junction categories or classes as EI, IE and N. |
There was a problem hiding this comment.
So it can only be used with that file ? In what situations is that an appropriate input ?
There was a problem hiding this comment.
This is what a dataset for an ML task looks like (from https://galaxyproject.org/news/2026-04-28-tabpfn-v2-5/#splice-junction-gene-sequences):
The first column contains the classes/labels and the third contains the DNA fragments.
Another example comes from https://doi.org/10.1038/s41586-024-08070-z
HF dataset: https://huggingface.co/datasets/HuggingFaceBio/malinois-mpra-regression
It contains DNA sequences and their corresponding labels (differential gene expression), which could be used for gene expression prediction tasks.
Sharing labels as tabular data is very common.
There was a problem hiding this comment.
Can you put this in the readme please ?
There was a problem hiding this comment.
Added the information in README with links
| # 🧬 DNA Sequence Classification using LSTM neural network (Galaxy Workflow) | ||
|
|
||
| ## 📌 Overview | ||
| This workflow implements a deep learning pipeline for DNA sequence classification using an LSTM-based neural network. It takes raw DNA sequences in FASTA format and their labels in tabular format, processes them into numerical representations, trains a model, and evaluates its performance. |
There was a problem hiding this comment.
Something isn't right here in how this is phrased. What does "their labels" mean here ? How can you "evaluate performance" when you don't have ground truth ? Is this some kind of ML lingo that doesn't correspond to what a scientist would call "evaluate performance" ? If I give this some random fasta file will the "performance" be worse than if this was from a genome of the species the labels originate from ?
There was a problem hiding this comment.
What does "their labels" mean here ?
corresponding labels of DNA sequences.
Is this some kind of ML lingo that doesn't correspond to what a scientist would call "evaluate performance"
Yes, in the ML field, evaluating performance means testing how well a trained model generalizes to unseen/test datasets.
If I give this some random fasta file will the "performance" be worse than if this was from a genome of the species the labels originate from ?
If the FASTA file containing the DNA sequences and the tabular file containing the labels don't share any biological meaning, the ML model will not learn much and its predictions cannot be trusted. If a random file is given that doesn't share biological meaning with the labels, the measured performance will not be meaningful. Typically, it should be worse.
There was a problem hiding this comment.
Sooooo, does that mean the model performance doesn't matter to the user ? If not please mention in the readme what a good "performance" looks like and how that can be evaluated. If the user can't control the result then I'm not sure why we'd include the performance evaluation ?
There was a problem hiding this comment.
does that mean the model performance doesn't matter to the user ?
I mean if the data quality is low, performance will also be bad.
If not please mention in the readme what a good "performance" looks like and how that can be evaluated.
The "Evaluation" section of the README lists the metrics to look out for. In general, what counts as good model performance is not objective, as it depends on several factors such as data quality, model architecture etc. A higher F1-score (closer to 1) usually indicates better performance, but it may not be achievable with every dataset; it varies from dataset to dataset. I have added this information to the same "Evaluation" section.
If the user can't control the result then I'm not sure why we'd include the performance evaluation ?
Users can control it by optimizing the model architecture; I added a "## Model optimisation" section to the README.
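For readers unfamiliar with the metric, here is a minimal pure-Python sketch of a macro-averaged F1-score, the kind of score the f1_macro setting in the training step refers to. It is an illustration of the definition under the usual convention (per-class F1, then an unweighted mean), not the code the Galaxy tool runs; `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for illustration: a perfect prediction scores 1.0.
print(macro_f1([0, 1, 2], [0, 1, 2]))
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class contributes equally, a model that ignores a rare class is penalised more by macro F1 than by plain accuracy.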
|
|
||
| ### An example task achieved by the workflow | ||
|
|
||
| The workflow can be used to perform DNA sequence classification on splice-junction gene sequences. In an example task, the workflow takes raw DNA sequence data as input and classifies each sequence according to whether it contains an exon–intron boundary, an intron–exon boundary, or no splice junction using an LSTM-based deep learning model. These classes correspond to donor sites (EI), acceptor sites (IE), and neither (N). The biological goal of such a task is to identify where RNA splicing occurs. During splicing, non-coding introns are removed and coding exons are joined together before a gene is translated into a protein. Detecting these splice-junction boundaries from DNA sequences helps in understanding gene structure and function. More information about such a dataset can be found in this [blogpost](https://galaxyproject.org/news/2026-04-28-tabpfn-v2-5/#splice-junction-gene-sequences). The blogpost uses a publicly available dataset that contains DNA sequences and their respective splice junction categories or classes as EI, IE and N. |
There was a problem hiding this comment.
In an example task, the workflow takes raw DNA sequence data as input and classifies each sequence according to whether it contains an exon–intron boundary, an intron–exon boundary, or no splice junction using an LSTM-based deep learning model.
What kind of DNA sequence ? contigs ? short reads ? You've skipped the label file here I assume ?
The biological goal of such a task is to identify where RNA splicing occurs
You've said that already in the second sentence, no ?
The blogpost uses a publicly available dataset that contains DNA sequences and their respective splice junction categories or classes as EI, IE and N
Nowhere in the blog is it discussed how that dataset was created. https://archive.ics.uci.edu/dataset/69/molecular+biology+splice+junction+gene+sequences only says "This dataset has been developed" ... this doesn't seem like something I would recommend to a person doing biology. I don't even know what species this is based on ?
I did however find this, is that the label file format ? If so, why does the test data not look like that ?
| - Ensure DNA sequences are properly formatted (FASTA) | ||
| - Labels must align with input sequences | ||
| - GPU acceleration is enabled (if available) | ||
| - Suitable for multi-class classification problems |
There was a problem hiding this comment.
You'll have to provide some more detail here. Properly formatted is such a wide range of things. If you mean fixed length, maybe even of a very specific fixed length you have to say that.
How does one align labels with input sequences
Users don't have access to whether or not "GPU acceleration" is enabled and what does that even mean ? CUDA ? ROCm ? Metal ? What version ? Does it make sense to list that here at all ?
There was a problem hiding this comment.
You'll have to provide some more detail here. Properly formatted is such a wide range of things. If you mean fixed length, maybe even of a very specific fixed length you have to say that.
Updated it to "Ensure input datasets are in correct format: DNA sequences as FASTA and labels as tabular"
How does one align labels with input sequences
They come aligned from published papers, public databases, Hugging Face datasets or Zenodo.
Users don't have access to whether or not "GPU acceleration" is enabled and what does that even mean ? CUDA ? ROCm ? Metal ? What version ? Does it make sense to list that here at all ?
It is important to mention it. I do not enable it by default because a Galaxy server may not have GPU support. But if users run it on a Galaxy server that has GPU support (Main or EU), enabling it would be very useful if, say, there are 10,000 DNA sequences in the analysis.
There was a problem hiding this comment.
Updated it to "Ensure input datasets are in correct format: DNA sequences as FASTA and labels as tabular"
Contigs or short read data or ... ? Correctly formatted fasta doesn't really mean much. Does it have to be the ATGC alphabet, are mixed length sequences supported, etc ? You mention 10000 sequences, that is very few if we're dealing with short read data. Maybe include expectation about input data types ? I assume you can't pass short read data if 10000 sequences requires GPU support ?
They come aligned from published papers, public databases, Hugging Face datasets or Zenodo.
I still don't know what that means and what I'm looking for if I want to classify e.g. transcription factor binding sites. How do I as a user know that a label is aligned with the input sequences. What do I look for on Zenodo or Hugging Face ?
It is important to mention it. I do not enable it by default because a Galaxy server may not have GPU support. But if users run it on a Galaxy server that has GPU support (Main or EU), enabling it would be very useful if, say, there are 10,000 DNA sequences in the analysis.
The text says "GPU acceleration is enabled (if available)", which contradicts "I do not enable it by default because a Galaxy server may not have a GPU". Which one is it ? How does one enable it if it's not on by default with a CPU fallback? What's the expected speedup ?
There was a problem hiding this comment.
Contigs or short read data or ... ?
The DNA sequences users bring can be of any kind: short, long, contigs etc. The workflow's job is to map the supplied sequences to their respective labels by learning the context that differentiates between labels. If I bring 10000 sequences that are 1000 bp long, GPUs would be extremely helpful because each base pair is represented by a fixed-size vector and, because of the length of the sequences, the number of matrix multiplications would be enormous. It is hard to put a number on when someone needs a GPU; again, it depends on the dataset.
I still don't know what that means and what I'm looking for if I want to classify e.g. transcription factor binding sites. How do I as a user know that a label is aligned with the input sequences. What do I look for on Zenodo or Hugging Face ?
Users have to bring their own datasets to use this workflow and tune its parameters to obtain the best possible performance. They can always use the test data we provide to play around.
How do I as a user know that a label is aligned with the input sequences. What do I look for on Zenodo or Hugging Face ?
Users know their research question and the kind of dataset they want to analyse. The README mentions that they can do classification or regression tasks. For example, in single-cell cell type annotation tasks, the cell type names come with AnnData.
The text says "GPU acceleration is enabled (if available)", which contradicts "I do not enable it by default because a Galaxy server may not have a GPU". Which one is it ?
It is changed to "Enable GPU for faster performance - consider this option when dataset is large (tested on Nvidia GPUs)". I don't know what kind of GPUs are available on the Main or AU servers. It is not enabled by default because a Galaxy server may not have a GPU. Also, CI/CD pipelines may not have a GPU for automatic testing.
how does one enable it if it's not on by default with a CPU fallback?
Added to README: "To enable it, open the workflow and go to the "Deep learning training and evaluation" tool. At the bottom of the tool definition, there is an option "Job Resource Parameters". Choose "Specify job resource parameters" and then set "Use GPU resources" to "Yes"."
What's the expected speedup ?
With a GPU, model training on the test data takes 2 mins; without one, it takes 4 mins (2X speedup). But this is subjective and depends on what kind of GPU is deployed on the server. With an A100, the speedup could be much larger.
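To see whether a given step in a .ga file requests a GPU, one option is to inspect its tool_state, which Galaxy stores as a JSON string. The sketch below assumes the field names visible in the tool_state of the training step in this diff (`__job_resource`, `__job_resource__select`, `gpu`); `gpu_requested` is a hypothetical helper, and the example string is a shortened stand-in, not the full tool_state.

```python
import json


def gpu_requested(tool_state_json):
    """Return True if the step's job resource parameters ask for a GPU.

    Assumption: the layout matches the tool_state shown in this PR's diff.
    """
    state = json.loads(tool_state_json)
    resource = state.get("__job_resource", {})
    # Some exports nest values as JSON strings; decode once more if needed.
    if isinstance(resource, str):
        resource = json.loads(resource)
    return (
        resource.get("__job_resource__select") == "yes"
        and str(resource.get("gpu")) == "1"
    )


# Shortened stand-in for a step's tool_state string.
example_state = json.dumps(
    {"__job_resource": {"__job_resource__select": "yes", "__current_case__": 1, "gpu": "1"}}
)
print(gpu_requested(example_state))
print(gpu_requested(json.dumps({"__job_resource": {"__job_resource__select": "no"}})))
```

Checking the exported file this way makes it easy to confirm whether the committed workflow matches the README's statement about the default.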
| ## 📄 License | ||
| MIT License | ||
|
|
||
| --- | ||
|
|
||
| ## 👤 Author | ||
| **Anup Kumar** | ||
| ORCID: 0000-0002-2068-4695 No newline at end of file |
There was a problem hiding this comment.
These are already in the workflow and the dockstore file
Test Results (powered by Planemo)
Test Summary
Errored Tests
| @@ -26,10 +26,10 @@ | |||
| "inputs": [ | |||
| { | |||
| "description": "", | |||
There was a problem hiding this comment.
Please put an example of the expected format here.
There was a problem hiding this comment.
Ok, I have added it to the WF
| @@ -0,0 +1,3189 @@ | |||
| 0 | |||
There was a problem hiding this comment.
This doesn't look like the label file you pointed to, why is that ?
There was a problem hiding this comment.
Fixed in 5bb4861
Added another tool to WF to encode raw labels to integer representation.
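A small sketch of what encoding raw labels to integers can look like. The alphabetical ordering used here is an assumption (it mirrors a common convention), so the resulting integers may differ from the mapping the added Galaxy tool produces; `encode_labels` is a hypothetical helper.

```python
def encode_labels(raw_labels):
    """Map each distinct label to an integer, ordered alphabetically.

    Returns the encoded list and the mapping so predictions can be decoded later.
    """
    classes = sorted(set(raw_labels))
    mapping = {name: index for index, name in enumerate(classes)}
    return [mapping[name] for name in raw_labels], mapping


# Class names from the splice-junction example: EI, IE and N.
encoded, mapping = encode_labels(["EI", "N", "IE", "N"])
print(mapping)   # {'EI': 0, 'IE': 1, 'N': 2}
print(encoded)   # [0, 2, 1, 2]
```

Keeping the mapping alongside the encoded labels matters because the prediction output only makes sense once the integers are translated back to class names.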
Test Results (powered by Planemo)
Test Summary
Passed Tests

FOR CONTRIBUTOR:
FOR REVIEWERS:
This workflow does/runs/performs … xyz … to generate/analyze/etc …
name field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id
would you like to have a look @bgruening @wm75 thanks!