From 93e1409f55cfc3396bb2676e342accc6491d95b7 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Thu, 19 Jun 2025 20:56:51 -0400
Subject: [PATCH 01/14] scope out master-params file and update metaphlan.nf params based on diff metaphlan versions

---
 processes/metaphlan.nf | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/processes/metaphlan.nf b/processes/metaphlan.nf
index 8e618f9..151b122 100644
--- a/processes/metaphlan.nf
+++ b/processes/metaphlan.nf
@@ -15,13 +15,23 @@ process metaphlan {
 
 
     script:
+    // metaphlan4 changed metaphlan db variable from bowtie2db to db_dir
+    // also changed from bowtie2out to mapout
+    if (params.metaphlan_ver == 'metaphlan4') {
+        db_arg = 'db_dir'
+        out_arg = 'mapout'}
+    else if (params.metaphlan_ver == 'metaphlan3.1.0'){
+        db_arg = 'bowtie2db'
+        out_arg = 'bowtie2out'
+    }
+
     """
     metaphlan $kneads -o ${sample}_profile.tsv \
-        --mapout ${sample}_bowtie2.tsv \
+        --${out_arg} ${sample}_bowtie2.tsv \
         --samout ${sample}.sam \
         --input_type fastq \
         --nproc ${task.cpus} \
-        --db_dir ${params.metaphlan_db} \
+        --${db_arg} ${params.metaphlan_db} \
         --index ${params.metaphlan_index} \
         -t rel_ab_w_read_stats
     """
From dd332d426039c5b1ec6a28041f45a3318e159546 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Sat, 21 Jun 2025 22:03:02 -0400
Subject: [PATCH 02/14] add documentation to the README

---
 README.md | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 36db0f9..d031f47 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,87 @@
 # Nextflow pipeline for running the bioBakery
 
-by Kevin Bonham, PhD
+by Kevin Bonham, PhD
 
-- `KneadData`
+bioBakery
+
+- `KneadData`: a data quality-control pipeline that removes host genomic data within our metagenomic samples. Particularly, this pipeline uses a database containing a reference human genome so that all human DNA is removed from the samples. Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
 - `MetaPhlAn`
 - `HUMAnN`
 
 ## Setup
+Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](LINK TO BE ADDED).
+
+Computing environments on the Tufts HPC and AWS should already be set-up with apptainer environments.
+
+## Running the pipeline
+This nextflow pipeline can be run on three different types of machines:
+1) Locally
+2) Tufts high performance cluster (HPC)
+3) Amazon website services cloud (AWS)
+
+Based on the profiles described in `nextflow.config`, we can run the pipeline with the following Nextflow commands:
+
+[NEED TO DOUBLE CHECK THIS PART]
+### Running locally
+`nextflow main.nf --local`
+
+### Running on the HPC
+TO DO: Still need to figure out the exact nextflow syntax
+
+Jobs on the Tufts HPC can be run in two different ways:
+- **Batch**: the job will be sent to the queue and it will be completed based on how many resources you have requested, current cluster load, and fairshare (have you recently used the cluster)
+  - `nextflow main.nf --tufts_hpc --batch`
+
+- **Preempt**: this allows you to run your job preemptively using free nodes from another lab that paid for these compute resources. However, if they are already running a job, your job will be killed and you'll have to resubmit it.
+
+  - `nextflow main.nf --tufts_hpc --preempt`
+
+
+### Running on AWS
+`nextflow main.nf --amazon`
+
+> Kevin may want to add additional comments here about different ways to run the pipeline
+
+## Databases
+Several databases must be installed to run the pipeline.
+
+### Kneaddata
+- A database containing a reference human genome so that unwanted human DNA can be removed from our metagenomic samples.
+
+### MetaPhlAn
+- `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using
+- Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it.
+
+### HUMAnN
+
+
+## Information on software versions
+This pipeline supports the following versions of MetaPhlAn and HUMAnN:
+
+ ### MetaPhlAn
+- MetaPhlAn 3.1.0
+- MetaPhlAn 4
+
+### HUMAnN
+- HUMAnN3 3.7
+- HUMAnN3 4 alpha
+
+## Testing the pipeline
+There are some raw fastq files in `test/` which can be processed through the pipeline
+
+## Using the `master-params.yaml` file
+The `master-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should not be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
 
-**TODO**
\ No newline at end of file
+### Overview of parameters in `master-params.yaml`
+- `paired_end`: True or False, given the type of input data
+- `metaphlan_ver`: MetaPhlAn software version (either `metaphlan3.1.0` or `metaphlan4`)
+- `humann_ver`: HUMAnN3 software version (either `humann3.7` or `humann4_alpha`)
+- `readsdir`: path to directory that contains raw data (bam files)
+- `outdir`: path to directory where processed results will be saved
+- `human_genome`: path to directory that contains human reference database used during Kneaddata
+- `metaphlan_db`: path to directory that contains metaphlan databases
+- `metaphlan_index`:
+- `humann_nucleotide_db`:
+- `humann_protein_db`:
+- `humann_utility_db`:
+- `filepattern`: regex describing samples should be named (relative to the input raw data)
\ No newline at end of file
From b1da87c5157e1fdf173b50159399737659b67305 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Sat, 21 Jun 2025 22:06:01 -0400
Subject: [PATCH 03/14] add param for paired-end reads

---
 master-params.yaml | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 master-params.yaml

diff --git a/master-params.yaml b/master-params.yaml
new file mode 100644
index 0000000..fb97efb
--- /dev/null
+++ b/master-params.yaml
@@ -0,0 +1,52 @@
+### Data type
+# paired end data
+paired_end: "True"
+
+### Metaphlan version
+# metaphlan3.1.0 params
+metaphlan_ver : "metaphlan3.1.0"
+# metaphlan4 params
+metaphlan_ver : "metaphlan4"
+
+# humann3.7 params
+humann_ver : "humann3.7"
+# humann4alpha params
+humann_ver : "humann4_alpha"
+
+
+
+### Computing environment
+# local params (will need to fill out yourself based on the location of files on your personal computer)
+
+# readsdir:
+# outdir:
+# human_genome:
+# metaphlan_db:
+# metaphlan_index:
+# humann_nucleotide_db:
+# humann_protein_db:
+# humann_utility_db:
+
+# Tufts HPC params
+readsdir: "/cluster/tufts/bonhamlab/shared/sequencing/bam"
+outdir: "/cluster/tufts/bonhamlab/shared/sequencing/processed"
+human_genome: "/cluster/tufts/bonhamlab/shared/databases/biobakery/kneaddata"
+metaphlan_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/metaphlan"
+metaphlan_index: "mpa_vOct22_CHOCOPhlAnSGB_202403"
+humann_nucleotide_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/chocophlan"
+humann_protein_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/uniref"
+humann_utility_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/utility_mapping"
+
+
+# AWS params
+readsdir: "s3://vkc-nextflow/rawfastq/"
+outdir: "s3://vkc-nextflow/output/"
+human_genome: "s3://biobakery-databases/kneaddata_databases/"
+metaphlan_db: "s3://biobakery-databases/metaphlan_databases/"
+humann_bowtie_db: "s3://biobakery-databases/humann_databases/chocophlan"
+humann_protein_db: "s3://biobakery-databases/humann_databases/uniref"
+humann_utility_db: "s3://biobakery-databases/humann_databases/utility_mapping"
+
+
+# Global params (same regardless of computer environment)
+filepattern: "*.bam" # need to adjust if bam or fastq
\ No newline at end of file
From 91064dfc673ac66c99952893b5df2c27fe7faf19 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Sun, 22 Jun 2025 22:07:42 -0400
Subject: [PATCH 04/14] add more documentation to README

---
 README.md | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index d031f47..757f83a 100644
--- a/README.md
+++ b/README.md
@@ -2,14 +2,14 @@
 
 by Kevin Bonham, PhD
 
-bioBakery
+[bioBakery](https://github.com/biobakery): software, documentation, and tutorials for microbial community profiling (created and mantained by the Huttenhower lab)
 
-- `KneadData`: a data quality-control pipeline that removes host genomic data within our metagenomic samples. Particularly, this pipeline uses a database containing a reference human genome so that all human DNA is removed from the samples. Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
-- `MetaPhlAn`
-- `HUMAnN`
+- [`KneadData`](https://github.com/biobakery/kneaddata): a data quality-control pipeline that removes host genomic data within our metagenomic samples. Particularly, this pipeline uses a database containing a reference human genome so that all human DNA is removed from the samples. Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
+- [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn): a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
+- [`HUMAnN`](https://github.com/biobakery/humann): a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. Link to more information here:(https://huttenhower.sph.harvard.edu/humann)
 
 ## Setup
-Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](LINK TO BE ADDED).
+Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes).
 
 Computing environments on the Tufts HPC and AWS should already be set-up with apptainer environments.
 
@@ -47,12 +47,18 @@ Several databases must be installed to run the pipeline.
 
 ### Kneaddata
 - A database containing a reference human genome so that unwanted human DNA can be removed from our metagenomic samples.
+  - The `Homo_sapiens_hg39_T2T_Bowtie2_v0.1` bowtie2 database can be downloaded from [here](https://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg39_T2T_Bowtie2_v0.1.tar.gz).
+  - This version of the database can be used for all analyses and there shouldn't be a big need to upgrade the database (unless we have an updated human genome!)
+- Other reference databases can be added as well if other types of data want to be removed (eg. human transcriptome, mouse genome, etc.)
 
 ### MetaPhlAn
-- `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using
+- `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using.
+  - It can be downloaded from [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/).
 - Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it.
 
 ### HUMAnN
+- Looks like there is only version available
+  - Database can be downloaded [here](http://cmprod1.cibio.unitn.it/databases/HUMAnN/).
 
 
 ## Information on software versions
@@ -80,7 +86,7 @@ The `master-params.yaml` file defines all input parameters that you may want to
 - `outdir`: path to directory where processed results will be saved
 - `human_genome`: path to directory that contains human reference database used during Kneaddata
 - `metaphlan_db`: path to directory that contains metaphlan databases
-- `metaphlan_index`:
+- `metaphlan_index`: database version (database must exist within `metaphlan_db`)
 - `humann_nucleotide_db`:
 - `humann_protein_db`:
 - `humann_utility_db`:
From 5dd757ee4476119f1481910d3fff70ba984da86d Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Mon, 23 Jun 2025 21:42:47 -0400
Subject: [PATCH 05/14] final first draft of README

---
 README.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 757f83a..20caba2 100644
--- a/README.md
+++ b/README.md
@@ -57,8 +57,7 @@ Several databases must be installed to run the pipeline.
 - Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it.
 
 ### HUMAnN
-- Looks like there is only version available
-  - Database can be downloaded [here](http://cmprod1.cibio.unitn.it/databases/HUMAnN/).
+- Database can be downloaded [here](http://cmprod1.cibio.unitn.it/databases/HUMAnN/).
 
 
 ## Information on software versions
@@ -87,7 +86,7 @@ The `master-params.yaml` file defines all input parameters that you may want to
 - `human_genome`: path to directory that contains human reference database used during Kneaddata
 - `metaphlan_db`: path to directory that contains metaphlan databases
 - `metaphlan_index`: database version (database must exist within `metaphlan_db`)
-- `humann_nucleotide_db`:
-- `humann_protein_db`:
+- `humann_nucleotide_db`: path to directory containing chocophlan database
+- `humann_protein_db`: path to directory containing UniRef database
 - `humann_utility_db`:
 - `filepattern`: regex describing samples should be named (relative to the input raw data)
\ No newline at end of file
From 0758a1c40f20f661baeb90a64310cc271576f05a Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Tue, 24 Jun 2025 20:33:00 -0400
Subject: [PATCH 06/14] rename master-params to template-params

---
 README.md                                  | 6 +++---
 master-params.yaml => template-params.yaml | 0
 2 files changed, 3 insertions(+), 3 deletions(-)
 rename master-params.yaml => template-params.yaml (100%)

diff --git a/README.md b/README.md
index 20caba2..9983548 100644
--- a/README.md
+++ b/README.md
@@ -74,10 +74,10 @@ This pipeline supports the following versions of MetaPhlAn and HUMAnN:
 ## Testing the pipeline
 There are some raw fastq files in `test/` which can be processed through the pipeline
 
-## Using the `master-params.yaml` file
-The `master-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should not be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
+## Using the `template-params.yaml` file
+The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should not be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
 
-### Overview of parameters in `master-params.yaml`
+### Overview of parameters in `template-params.yaml`
 - `paired_end`: True or False, given the type of input data
 - `metaphlan_ver`: MetaPhlAn software version (either `metaphlan3.1.0` or `metaphlan4`)
 - `humann_ver`: HUMAnN3 software version (either `humann3.7` or `humann4_alpha`)
diff --git a/master-params.yaml b/template-params.yaml
similarity index 100%
rename from master-params.yaml
rename to template-params.yaml
From fe48fb880898188a8a9b5a205c015df66e5bf390 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Tue, 24 Jun 2025 21:32:15 -0400
Subject: [PATCH 07/14] update nextflow commands in README

---
 README.md            | 22 ++++++++++++----------
 template-params.yaml |  3 +++
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 9983548..5fc9f7d 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ by Kevin Bonham, PhD
 - [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn): a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
 - [`HUMAnN`](https://github.com/biobakery/humann): a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. Link to more information here:(https://huttenhower.sph.harvard.edu/humann)
 
-## Setup
+## Environment setup
 Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes).
 
 Computing environments on the Tufts HPC and AWS should already be set-up with apptainer environments.
@@ -23,27 +23,28 @@ Based on the profiles described in `nextflow.config`, we can run the pipeline wi
 
 [NEED TO DOUBLE CHECK THIS PART]
 ### Running locally
-`nextflow main.nf --local`
+`nextflow run main.nf -profile local -params-file params.yaml`
 
 ### Running on the HPC
-TO DO: Still need to figure out the exact nextflow syntax
-
 Jobs on the Tufts HPC can be run in two different ways:
 - **Batch**: the job will be sent to the queue and it will be completed based on how many resources you have requested, current cluster load, and fairshare (have you recently used the cluster)
-  - `nextflow main.nf --tufts_hpc --batch`
 
 - **Preempt**: this allows you to run your job preemptively using free nodes from another lab that paid for these compute resources. However, if they are already running a job, your job will be killed and you'll have to resubmit it.
 
-  - `nextflow main.nf --tufts_hpc --preempt`
+With how the HPC environment is currently defined in `nextflow.config`, jobs will first be submitted to the batch queue. If there are not any available resources, it will be processed preemptively.
+
+- `nextflow run main.nf -profile tufts_hpc -params-file params.yaml`
 
 
 ### Running on AWS
-`nextflow main.nf --amazon`
+`nextflow main.nf -profile amazon -params-file params.yaml`
 
 > Kevin may want to add additional comments here about different ways to run the pipeline
 
+> Note: We can also process samples on the MIT `engaging` cluster, but that should probably not be used without permission
+
 ## Databases
-Several databases must be installed to run the pipeline.
+Several databases must be installed to run this pipeline.
 
 ### Kneaddata
 - A database containing a reference human genome so that unwanted human DNA can be removed from our metagenomic samples.
@@ -75,13 +76,14 @@ This pipeline supports the following versions of MetaPhlAn and HUMAnN:
 There are some raw fastq files in `test/` which can be processed through the pipeline
 
 ## Using the `template-params.yaml` file
-The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should not be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
+The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should **not** be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, input data etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
 
 ### Overview of parameters in `template-params.yaml`
+- `data_type`: type of input data (either `fastq` or `bam`)
 - `paired_end`: True or False, given the type of input data
 - `metaphlan_ver`: MetaPhlAn software version (either `metaphlan3.1.0` or `metaphlan4`)
 - `humann_ver`: HUMAnN3 software version (either `humann3.7` or `humann4_alpha`)
-- `readsdir`: path to directory that contains raw data (bam files)
+- `readsdir`: path to directory that contains raw data
 - `outdir`: path to directory where processed results will be saved
 - `human_genome`: path to directory that contains human reference database used during Kneaddata
 - `metaphlan_db`: path to directory that contains metaphlan databases
diff --git a/template-params.yaml b/template-params.yaml
index fb97efb..5eedd90 100644
--- a/template-params.yaml
+++ b/template-params.yaml
@@ -1,7 +1,10 @@
 ### Data type
+data_type: "fastq"
+data_type: "bam"
 # paired end data
 paired_end: "True"
 
+
 ### Metaphlan version
 # metaphlan3.1.0 params
 metaphlan_ver : "metaphlan3.1.0"
From 7242da6cb70eef43cc022458b79f7538989729c0 Mon Sep 17 00:00:00 2001
From: Danielle Pinto
Date: Wed, 25 Jun 2025 16:06:23 -0400
Subject: [PATCH 08/14] update according to Kevin's github feedback

---
 README.md            | 15 ++++++++-------
 template-params.yaml | 19 ++++++++++---------
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 5fc9f7d..df33e0d 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ This nextflow pipeline can be run on three different types of machines:
 
 Based on the profiles described in `nextflow.config`, we can run the pipeline with the following Nextflow commands:
 
-[NEED TO DOUBLE CHECK THIS PART]
+
 ### Running locally
 `nextflow run main.nf -profile local -params-file params.yaml`
 
@@ -54,7 +54,7 @@ Several databases must be installed to run this pipeline.
 
 ### MetaPhlAn
 - `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using.
-  - It can be downloaded from [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/).
+  - It can be found/downloaded manually from [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/). The easiest way to download is by running `metaphlan --install #any_other_args`
 - Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it.
 
 ### HUMAnN
@@ -79,10 +79,12 @@ There are some raw fastq files in `test/` which can be processed through the pip
 The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should **not** be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, input data etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.
 
 ### Overview of parameters in `template-params.yaml`
-- `data_type`: type of input data (either `fastq` or `bam`)
+- `input_data_type`: type of input data (either `fastq` or `bam`)
 - `paired_end`: True or False, given the type of input data
-- `metaphlan_ver`: MetaPhlAn software version (either `metaphlan3.1.0` or `metaphlan4`)
-- `humann_ver`: HUMAnN3 software version (either `humann3.7` or `humann4_alpha`)
+- `filepattern`: regex describing sample naming convention (relative to the input data type)
+
+- `metaphlan_version`: MetaPhlAn software version (either `metaphlan_v3` or `metaphlan_v4`)
+- `humann_version`: HUMAnN3 software version (either `humann_v37` or `humann_v4a`)
 - `readsdir`: path to directory that contains raw data
 - `outdir`: path to directory where processed results will be saved
 - `human_genome`: path to directory that contains human reference database used during Kneaddata
@@ -90,5 +92,4 @@ The `template-params.yaml` file defines all input parameters that you may want t
 - `metaphlan_index`: database version (database must exist within `metaphlan_db`)
 - `humann_nucleotide_db`: path to directory containing chocophlan database
 - `humann_protein_db`: path to directory containing UniRef database
-- `humann_utility_db`:
-- `filepattern`: regex describing samples should be named (relative to the input raw data)
\ No newline at end of file
+- `humann_utility_db`: path to directory containing databases that have conversions between different protein annotations (eg UniRef90 to KO or EC), and names for all of the different annotations that have them
diff --git a/template-params.yaml b/template-params.yaml
index 5eedd90..8c0696e 100644
--- a/template-params.yaml
+++ b/template-params.yaml
@@ -1,20 +1,24 @@
 ### Data type
-data_type: "fastq"
-data_type: "bam"
+input_data_type: "bam"
+input_data_type: "fastq"
 # paired end data
 paired_end: "True"
 
+filepattern: "*.bam" # need to adjust if bam or fastq
+# filepattern: "*.fastq"
+# filepattern: "*.fastq.gz"
+
+
 ### Metaphlan version
 # metaphlan3.1.0 params
-metaphlan_ver : "metaphlan3.1.0"
+metaphlan_version : "metaphlan_v3"
 # metaphlan4 params
-metaphlan_ver : "metaphlan4"
+metaphlan_version : "metaphlan_v4"
 
 # humann3.7 params
-humann_ver : "humann3.7"
+humann_version : "humann_v37"
 # humann4alpha params
-humann_ver : "humann4_alpha"
+humann_version : "humann_v4a"
 
 
 
@@ -50,6 +54,3 @@ humann_bowtie_db: "s3://biobakery-databases/humann_databases/chocophlan"
 humann_protein_db: "s3://biobakery-databases/humann_databases/uniref"
 humann_utility_db: "s3://biobakery-databases/humann_databases/utility_mapping"
 
-
-# Global params (same regardless of computer environment)
-filepattern: "*.bam" # need to adjust if bam or fastq
\ No newline at end of file
From 9e6a84b7ee96afabdbc90044fd92a42089057af1 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 16:06:55 -0400
Subject: [PATCH 09/14] add line breaks

Co-authored-by: Kevin Bonham
---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index df33e0d..0236379 100644
--- a/README.md
+++ b/README.md
@@ -26,8 +26,13 @@ Based on the profiles described in `nextflow.config`, we can run the pipeline wi
 `nextflow run main.nf -profile local -params-file params.yaml`
 
 ### Running on the HPC
+
 Jobs on the Tufts HPC can be run in two different ways:
-- **Batch**: the job will be sent to the queue and it will be completed based on how many resources you have requested, current cluster load, and fairshare (have you recently used the cluster)
+
+- **Batch**: the job will be sent to the queue
+  and it will be completed based on how many resources you have requested,
+  current cluster load,
+  and fairshare (have you recently used the cluster)
 
 - **Preempt**: this allows you to run your job preemptively using free nodes from another lab that paid for these compute resources. However, if they are already running a job, your job will be killed and you'll have to resubmit it.
 
From 6e978425afaae39915c0fa283d973f2891484a0d Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 16:07:28 -0400
Subject: [PATCH 10/14] update README with Kevin's suggestions about containers

Co-authored-by: Kevin Bonham
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 0236379..dc1cba6 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ by Kevin Bonham, PhD
 ## Environment setup
 Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes).
 
-Computing environments on the Tufts HPC and AWS should already be set-up with apptainer environments.
+Computing environments on the Tufts HPC and AWS should already be set-up with container-based (docker, apptainer) or conda environments.
 
 ## Running the pipeline
 This nextflow pipeline can be run on three different types of machines:
From d2d32a691f65ceaf63deca4a8442a63fb03a41d3 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 16:08:14 -0400
Subject: [PATCH 11/14] add line breaks to Kneaddata description

Co-authored-by: Kevin Bonham
---
 README.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index dc1cba6..c7d9cb1 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,12 @@ by Kevin Bonham, PhD
 
 [bioBakery](https://github.com/biobakery): software, documentation, and tutorials for microbial community profiling (created and mantained by the Huttenhower lab)
 
-- [`KneadData`](https://github.com/biobakery/kneaddata): a data quality-control pipeline that removes host genomic data within our metagenomic samples. Particularly, this pipeline uses a database containing a reference human genome so that all human DNA is removed from the samples. Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
+- [`KneadData`](https://github.com/biobakery/kneaddata):
+  a data quality-control pipeline that trims low quality reads
+  and removes host genomic data within our metagenomic samples.
+  Particularly, this pipeline uses a database containing a reference human genome
+  so that all human DNA is removed from the samples.
+  Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
 - [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn): a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
 - [`HUMAnN`](https://github.com/biobakery/humann): a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. Link to more information here:(https://huttenhower.sph.harvard.edu/humann)
 
From 2f1bb4246593c5f8ecfd5382232faf1ccd252dbb Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 17:50:18 -0400
Subject: [PATCH 12/14] Update README.md with formatting suggestions

Co-authored-by: Kevin Bonham
---
 README.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index c7d9cb1..d5a239c 100644
--- a/README.md
+++ b/README.md
@@ -10,8 +10,17 @@ by Kevin Bonham, PhD
   Particularly, this pipeline uses a database containing a reference human genome
   so that all human DNA is removed from the samples.
   Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
-- [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn): a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
-- [`HUMAnN`](https://github.com/biobakery/humann): a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data (typically millions of short DNA/RNA reads). This process, referred to as functional profiling, aims to describe the metabolic potential of a microbial community and its members. Link to more information here:(https://huttenhower.sph.harvard.edu/humann)
+- [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn):
+  a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses)
+  from metagenomic shotgun sequencing data.
+ Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
+- [`HUMAnN`](https://github.com/biobakery/humann):
+ a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways
+ in a community from metagenomic or metatranscriptomic sequencing data
+ (typically millions of short DNA/RNA reads).
+ This process, referred to as functional profiling,
+ aims to describe the metabolic potential of a microbial community and its members.
+ Link to more information here:(https://huttenhower.sph.harvard.edu/humann)
 ## Environment setup
 Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes).

From 4fd5d1e46d01cff8a6418780286dec0b2eb61839 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 17:52:18 -0400
Subject: [PATCH 13/14] Update README.md

Co-authored-by: Kevin Bonham
---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d5a239c..1093aa7 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,8 @@ Jobs on the Tufts HPC can be run in two different ways:
 current cluster load, and fairshare (have you recently used the cluster)
-- **Preempt**: this allows you to run your job preemptively using free nodes from another lab that paid for these compute resources. However, if they are already running a job, your job will be killed and you'll have to resubmit it.
+- **Preempt**: this allows you to run your job using free nodes from another lab that paid for these compute resources.
+ However, if they attempt to queue a job, your job will be preempted and killed, so you'll have to resubmit it.
 With how the HPC environment is currently defined in `nextflow.config`, jobs will first be submitted to the batch queue.
If there are not any available resources, it will be processed preemptively.

From 1eaf4f62630aeadca9e417c428f17009d5240bd1 Mon Sep 17 00:00:00 2001
From: Danielle Pinto <108756057+danielle-pinto@users.noreply.github.com>
Date: Wed, 25 Jun 2025 17:52:37 -0400
Subject: [PATCH 14/14] Update README.md

Co-authored-by: Kevin Bonham
---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 1093aa7..a885657 100644
--- a/README.md
+++ b/README.md
@@ -51,7 +51,8 @@ Jobs on the Tufts HPC can be run in two different ways:
 - **Preempt**: this allows you to run your job using free nodes from another lab that paid for these compute resources.
 However, if they attempt to queue a job, your job will be preempted and killed, so you'll have to resubmit it.
-With how the HPC environment is currently defined in `nextflow.config`, jobs will first be submitted to the batch queue. If there are not any available resources, it will be processed preemptively.
+With how the HPC environment is currently defined in `nextflow.config`,
+jobs will first be submitted to the `batch` or `preempt` queue, whichever is available first.
 - `nextflow run main.nf -profile tufts_hpc -params-file params.yaml`
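
Reviewer note: the "`batch` or `preempt`, whichever is available first" behaviour described in the README could be expressed in `nextflow.config` roughly as below. This is only a sketch, since the repository's actual `nextflow.config` is not part of this patch series. The `slurm` executor and the comma-separated queue value are assumptions; only the `tufts_hpc` profile name and the two queue names come from the README text.

```groovy
// Hypothetical excerpt of nextflow.config; the real file is not shown in these patches.
profiles {
    tufts_hpc {
        process {
            // Assumption: jobs on the Tufts HPC are submitted through Nextflow's Slurm executor.
            executor = 'slurm'
            // Assumption: both partitions are listed, comma-separated, so the scheduler
            // can start each job in whichever of the two has free resources first.
            queue = 'batch,preempt'
        }
    }
}
```

With a profile along these lines, the command from the README (`nextflow run main.nf -profile tufts_hpc -params-file params.yaml`) would need no extra queue flag, because the choice between `batch` and `preempt` is left to the scheduler.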