diff --git a/README.md b/README.md index 36db0f9..a885657 100644 --- a/README.md +++ b/README.md @@ -1,11 +1,116 @@ # Nextflow pipeline for running the bioBakery -by Kevin Bonham, PhD +by Kevin Bonham, PhD -- `KneadData` -- `MetaPhlAn` -- `HUMAnN` +[bioBakery](https://github.com/biobakery): software, documentation, and tutorials for microbial community profiling (created and mantained by the Huttenhower lab) -## Setup +- [`KneadData`](https://github.com/biobakery/kneaddata): + a data quality-control pipeline that trims low quality reads + and removes host genomic data within our metagenomic samples. + Particularly, this pipeline uses a database containing a reference human genome + so that all human DNA is removed from the samples. + Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/). +- [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn): + a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) + from metagenomic shotgun sequencing data. + Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan) +- [`HUMAnN`](https://github.com/biobakery/humann): + a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways + in a community from metagenomic or metatranscriptomic sequencing data + (typically millions of short DNA/RNA reads). + This process, referred to as functional profiling, + aims to describe the metabolic potential of a microbial community and its members. + Link to more information here:(https://huttenhower.sph.harvard.edu/humann) -**TODO** \ No newline at end of file +## Environment setup +Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes). + +Computing environments on the Tufts HPC and AWS should already be set-up with container-based (docker, apptainer) or conda environments. + +## Running the pipeline +This nextflow pipeline can be run on three different types of machines: +1) Locally +2) Tufts high performance cluster (HPC) +3) Amazon website services cloud (AWS) + +Based on the profiles described in `nextflow.config`, we can run the pipeline with the following Nextflow commands: + + +### Running locally +`nextflow run main.nf -profile local -params-file params.yaml` + +### Running on the HPC + +Jobs on the Tufts HPC can be run in two different ways: + +- **Batch**: the job will be sent to the queue + and it will be completed based on how many resources you have requested, + current cluster load, + and fairshare (have you recently used the cluster) + +- **Preempt**: this allows you to run your job using free nodes from another lab that paid for these compute resources. + However, if they attempt to queue a job, your job will be preempted and killed, so you'll have to resubmit it. + +With how the HPC environment is currently defined in `nextflow.config`, +jobs will first be submitted to the `batch` or `preempt` queue, whichever is available first. + + +- `nextflow run main.nf -profile tufts_hpc -params-file params.yaml` + +### Running on AWS +`nextflow main.nf -profile amazon -params-file params.yaml` + +> Kevin may want to add additional comments here about different ways to run the pipeline + +> Note: We can also process samples on the MIT `engaging` cluster, but that should probably not be used without permission + +## Databases +Several databases must be installed to run this pipeline. + +### Kneaddata +- A database containing a reference human genome so that unwanted human DNA can be removed from our metagenomic samples. + - The `Homo_sapiens_hg39_T2T_Bowtie2_v0.1` bowtie2 database can be downloaded from [here](https://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg39_T2T_Bowtie2_v0.1.tar.gz). + - This version of the database can be used for all analyses and there shouldn't be a big need to upgrade the database (unless we have an updated human genome!) +- Other reference databases can be added as well if other types of data want to be removed (eg. human transcriptome, mouse genome, etc.) + +### MetaPhlAn +- `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using. + - It can be found/downloaded manually from [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/). The easiest way to download is by running `metaphlan --install #any_other_args` +- Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it. + +### HUMAnN +- Database can be downloaded [here](http://cmprod1.cibio.unitn.it/databases/HUMAnN/). + + +## Information on software versions +This pipeline supports the following versions of MetaPhlAn and HUMAnN: + + ### MetaPhlAn +- MetaPhlAn 3.1.0 +- MetaPhlAn 4 + +### HUMAnN +- HUMAnN3 3.7 +- HUMAnN3 4 alpha + +## Testing the pipeline +There are some raw fastq files in `test/` which can be processed through the pipeline + +## Using the `template-params.yaml` file +The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should **not** be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, input data etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline. + +### Overview of parameters in `template-params.yaml` +- `input_data_type`: type of input data (either `fastq` or `bam`) +- `paired_end`: True or False, given the type of input data +- `filepattern`: regex describing sample naming convention (relative to the input data type) + +- `metaphlan_version`: MetaPhlAn software version (either `metaphlan_v3` or `metaphlan_v4`) +- `humann_version`: HUMAnN3 software version (either `humann_v37` or `humann_v4a`) +- `readsdir`: path to directory that contains raw data +- `outdir`: path to directory where processed results will be saved +- `human_genome`: path to directory that contains human reference database used during Kneaddata +- `metaphlan_db`: path to directory that contains metaphlan databases +- `metaphlan_index`: database version (database must exist within `metaphlan_db`) +- `humann_nucleotide_db`: path to directory containing chocophlan database +- `humann_protein_db`: path to directory containing UniRef database +- `humann_utility_db`: path to directory containing databases that have conversions between different protein annotations (eg UniRef90 to KO or EC), and names for all of the different annotations that have them diff --git a/main.nf b/main.nf index e99896d..d89a0a6 100755 --- a/main.nf +++ b/main.nf @@ -2,26 +2,27 @@ nextflow.enable.dsl=2 +include { bam2fastq } from './processes/bam2fastq.nf' include { kneaddata } from './processes/kneaddata.nf' -include { metaphlan; metaphlan_bzip } from './processes/metaphlan.nf' +include { metaphlan; metaphlan_bam } from './processes/metaphlan.nf' include { humann; humann_regroup; humann_rename } from './processes/humann.nf' workflow { - read_pairs_ch = Channel - .fromFilePairs("$params.readsdir/$params.filepattern", size: 2) - - human_genome = params.human_genome - metaphlan_db = params.metaphlan_db - humann_bowtie_db = params.humann_bowtie_db - humann_protein_db = params.humann_protein_db - humann_utility_db = params.humann_utility_db + read_ch = Channel + .fromPath("${params.readsdir}/${params.filepattern}") + .map { file -> + def sample = file.baseName // ERR3405856.bam -> ERR3405856 + return tuple(sample, file) + } + - knead_out = kneaddata(read_pairs_ch, human_genome) - metaphlan_out = metaphlan(knead_out[0], knead_out[1], metaphlan_db) - metaphlan_bzip = metaphlan_bzip(metaphlan_out[0], metaphlan_out[4]) - humann_out = humann(metaphlan_out[0], metaphlan_out[1], metaphlan_out[2], humann_bowtie_db, humann_protein_db) - regroup_out = humann_regroup(humann_out[0], humann_out[1], humann_utility_db) - humann_rename(regroup_out, humann_utility_db) + bam_out = bam2fastq(read_ch) + knead_out = kneaddata(bam_out) + metaphlan_out = metaphlan(knead_out.sample, knead_out.fastq) + // metaphlan_bam = metaphlan_bam(metaphlan_out.sample, metaphlan_out.sam) // not working because of headers + humann_out = humann(metaphlan_out.sample, knead_out.fastq, metaphlan_out.profile) + regroup_out = humann_regroup(humann_out.sample, humann_out.genefamilies) + humann_rename(regroup_out) } diff --git a/nextflow.config b/nextflow.config index 0097e86..0492127 100644 --- a/nextflow.config +++ b/nextflow.config @@ -7,6 +7,54 @@ profiles { process.executor = 'local' } + tufts_hpc { + process { + + executor = 'slurm' + queue = 'batch,preempt' + + // Default settings for all processes + memory = '4.G' + time = '2.h' + cpus = 2 + + withName: bam2fastq { + memory = '8.G' + time = '1.h' + cpus = 4 + + } + + withName: kneaddata { + memory = '8.G' + time = '8.h' + cpus = 8 + + } + + withName: metaphlan { + memory = '32.G' + time = '4.h' + cpus = 8 + + } + + withName: humann { + memory = '32.G' + time = '8.h' + cpus = 16 + } + + } + + apptainer { + enabled = true + autoMounts = true + runOptions = '--no-home --bind /cluster' + } + + } + engaging { process { diff --git a/processes/bam2fastq.nf b/processes/bam2fastq.nf new file mode 100644 index 0000000..0f076be --- /dev/null +++ b/processes/bam2fastq.nf @@ -0,0 +1,20 @@ +process bam2fastq { + tag "bam2fastq $sample" + + errorStrategy 'retry' + maxRetries 3 + + input: + tuple val(sample), path(reads) + + output: + tuple val(sample), path("${sample}.fastq") + + shell: + + """ + echo $sample + + samtools fastq -@ ${task.cpus} ${reads} > ${sample}.fastq + """ +} diff --git a/processes/humann.nf b/processes/humann.nf index a56073e..e2094a5 100644 --- a/processes/humann.nf +++ b/processes/humann.nf @@ -3,35 +3,34 @@ process humann { tag "humann on $sample" publishDir "$params.outdir/humann/main" - memory { workflow.profile == 'standard' ? null : memory * task.attempt } - cpus { workflow.profile == 'standard' ? null : cpus * task.attempt } + // memory { workflow.profile == 'standard' ? null : memory * task.attempt } + // cpus { workflow.profile == 'standard' ? null : cpus * task.attempt } - errorStrategy { task.exitStatus in 134..140 ? 'retry' : 'terminate' } - maxRetries 3 + // errorStrategy { task.exitStatus in 134..140 ? 'retry' : 'terminate' } + // maxRetries 3 input: val sample - path profile path catkneads - path humann_bowtie_db - path humann_protein_db + path profile output: val sample , emit: sample - path "${sample}_genefamilies.tsv" , emit: genefamilies - path "${sample}_pathabundance.tsv" - path "${sample}_pathcoverage.tsv" + path "${sample}_2_genefamilies.tsv" , emit: genefamilies + path "${sample}_0.log" + path "${sample}_3_reactions.tsv" + path "${sample}_4_pathabundance.tsv" script: """ - humann_config --update database_folders nucleotide `realpath $humann_bowtie_db` - humann_config --update database_folders protein `realpath $humann_protein_db` - humann --input $catkneads --taxonomic-profile $profile --output ./ \ - --threads ${task.cpus} --remove-temp-output --search-mode uniref90 \ - --output-basename $sample + --threads ${task.cpus} --remove-temp-output \ + --protein-database ${params.humann_protein_db} \ + --nucleotide-database ${params.humann_nucleotide_db} \ + --utility-database ${params.humann_utility_db} \ + --output-basename $sample """ } @@ -42,21 +41,20 @@ process humann_regroup { input: val sample path genefamilies - path humann_utility_db output: val sample , emit: sample - path "${sample}_ecs.tsv" - path "${sample}_kos.tsv" - path "${sample}_pfams.tsv" + path "${sample}_ecs.tsv", emit: ecs + path "${sample}_kos.tsv", emit: kos + path "${sample}_pfams.tsv", emit: pfams + script: """ - humann_config --update database_folders utility_mapping `realpath $humann_utility_db` - humann_regroup_table --input $genefamilies --output ${sample}_ecs.tsv --groups uniref90_level4ec - humann_regroup_table --input $genefamilies --output ${sample}_kos.tsv --groups uniref90_ko - humann_regroup_table --input $genefamilies --output ${sample}_pfams.tsv --groups uniref90_pfam + humann_regroup_table --input $genefamilies --output ${sample}_ecs.tsv --custom ${params.humann_utility_db}/map_level4ec_uniclust90.txt.gz + humann_regroup_table --input $genefamilies --output ${sample}_kos.tsv --custom ${params.humann_utility_db}/map_ko_uniref90.txt.gz + humann_regroup_table --input $genefamilies --output ${sample}_pfams.tsv --custom ${params.humann_utility_db}/map_pfam_uniref90.txt.gz """ } @@ -69,7 +67,6 @@ process humann_rename { path ecs path kos path pfams - path humann_utility_db output: val sample , emit: sample @@ -80,9 +77,8 @@ process humann_rename { script: """ - humann_config --update database_folders utility_mapping `realpath $humann_utility_db` - humann_rename_table --input $ecs --output ${sample}_ecs_rename.tsv --names ec - humann_rename_table --input $kos --output ${sample}_kos_rename.tsv --names kegg-orthology - humann_rename_table --input $pfams --output ${sample}_pfams_rename.tsv --names pfam + humann_rename_table --input $ecs --output ${sample}_ecs_rename.tsv --custom ${params.humann_utility_db}/map_level4ec_name.txt.gz + humann_rename_table --input $kos --output ${sample}_kos_rename.tsv --custom ${params.humann_utility_db}/map_ko_name.txt.gz + humann_rename_table --input $pfams --output ${sample}_pfams_rename.tsv --custom ${params.humann_utility_db}/map_pfam_name.txt.gz """ -} \ No newline at end of file +} diff --git a/processes/kneaddata.nf b/processes/kneaddata.nf index 6ec2c90..0c27509 100644 --- a/processes/kneaddata.nf +++ b/processes/kneaddata.nf @@ -4,16 +4,15 @@ process kneaddata { time { workflow.profile == 'standard' ? null : time * task.attempt } memory { workflow.profile == 'standard' ? null : memory * task.attempt } - errorStrategy 'retry' - maxRetries 3 + //errorStrategy 'retry' + //maxRetries 3 input: tuple val(sample), path(reads) - path human_genome output: - tuple val(sample), path("${sample}_kneaddata_paired_{1,2}.fastq.gz") - path "${sample}_kneaddata_unmatched_{1,2}.fastq.gz" + val(sample), emit: sample + path("${sample}_kneaddata.fastq.gz"), emit: fastq path "${sample}_kneaddata*.fastq.gz" , optional:true , emit: others path "${sample}_kneaddata.log" , emit: log @@ -22,11 +21,11 @@ process kneaddata { """ echo $sample - kneaddata --input ${reads[0]} --input ${reads[1]} \ - --reference-db $human_genome --output ./ \ + kneaddata --unpaired $reads \ + --reference-db ${params.human_genome} --output ./ \ --processes ${task.cpus} --output-prefix ${sample}_kneaddata \ - --trimmomatic /opt/conda/share/trimmomatic + --trimmomatic /cluster/tufts/bonhamlab/shared/conda-envs/metaphlan_v4.2/.CondaPkg/.pixi/envs/default/share/trimmomatic - gzip *.fastq + gzip ${sample}_kneaddata*.fastq """ -} \ No newline at end of file +} diff --git a/processes/metaphlan.nf b/processes/metaphlan.nf index e081e44..151b122 100644 --- a/processes/metaphlan.nf +++ b/processes/metaphlan.nf @@ -1,39 +1,44 @@ process metaphlan { tag "metaphlan on $sample" - publishDir "$params.outdir/metaphlan", pattern: "{*.tsv}" + // publishDir "$params.outdir/metaphlan", pattern: "*.tsv" // once fix for sam compression is found + publishDir "$params.outdir/metaphlan" // keeps sam file input: - tuple val(sample), path(kneads) - path unmatched - path metaphlan_db + val(sample) + path(kneads) output: val sample , emit: sample path "${sample}_profile.tsv" , emit: profile - path "${sample}_grouped.fastq.gz" - path "${sample}_bowtie2.tsv" - path "${sample}.sam" + path "${sample}_bowtie2.tsv" , emit: bowtie2 + path "${sample}.sam" , emit: sam - script: - def forward = kneads[0] - def reverse = kneads[1] - def unf = unmatched[0] - def unr = unmatched[1] - """ - cat $forward $reverse $unf $unr > ${sample}_grouped.fastq.gz + script: + // metphlan4 changed metaphlan db variable from bowtie2db to db_dir + // also changed from bowtie2out to mapout + if (params.metaphaln_ver == 'metaphlan4') { + db_arg = 'db_dir' + out_arg = 'mapout'} + else (params.metaphaln_ver == 'metaphlan3.1.0'){ + db_arg = 'bowtie2db' + out_arg = 'bowtie2out' + } - metaphlan ${sample}_grouped.fastq.gz ${sample}_profile.tsv \ - --bowtie2out ${sample}_bowtie2.tsv \ + """ + metaphlan $kneads -o ${sample}_profile.tsv \ + --${out_arg} ${sample}_bowtie2.tsv \ --samout ${sample}.sam \ --input_type fastq \ --nproc ${task.cpus} \ - --bowtie2db $metaphlan_db + --${dbarg} ${params.metaphlan_db} \ + --index ${params.metaphlan_index} \ + -t rel_ab_w_read_stats """ } - process metaphlan_bzip { - tag "metaphlan_bzip on $sample" + process metaphlan_bam { + tag "metaphlan_bam on $sample" publishDir "$params.outdir/metaphlan" stageInMode "copy" @@ -42,11 +47,19 @@ process metaphlan { path sam output: - val sample , emit: sample - path "${sample}.sam.bz2" + val sample , emit: sample + path "${sample}_markers.bam" , emit: bam + + when: script: """ - bzip2 -v $sam + # Duplicate headers in output - skipping header validation + samtools view -bS --no-PG ${sam} -o ${sample}_markers.bam + + # Alternative approach: strip and rebuild header + # samtools view -S ${sam} | samtools view -b -o ${sample}_markers.bam + + rm $sam """ -} \ No newline at end of file +} diff --git a/template-params.yaml b/template-params.yaml new file mode 100644 index 0000000..8c0696e --- /dev/null +++ b/template-params.yaml @@ -0,0 +1,56 @@ +### Data type +input_data_type: "bam" +input_data_type: "fastq" +# paired end data +paired_end: "True" + +filepattern: "*.bam" # need to adjust if bam or fastq +# filepattern: "*.fastq" +# filepattern: "*.fastq.gz" + + +### Metaphlan version +# metaphlan3.1.0 params +metaphlan_version : "metaphlan_v3" +# metaphlan4 params +metaphlan_version : "metaphlan_v4" + +# humann3.7 params +humann_version : "humann_v37" +# humann4alpha params +humann_version : "humann_v4a" + + + +### Computing environment +# local params (will need to fill out yourself based on the location of files on your personal computer) + +# readsdir: +# outdir: +# human_genome: +# metaphlan_db: +# metaphlan_index: +# humann_nucleotide_db: +# humann_protein_db: +# humann_utility_db: + +# Tufts HPC params +readsdir: "/cluster/tufts/bonhamlab/shared/sequencing/bam" +outdir: "/cluster/tufts/bonhamlab/shared/sequencing/processed" +human_genome: "/cluster/tufts/bonhamlab/shared/databases/biobakery/kneaddata" +metaphlan_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/metaphlan" +metaphlan_index: "mpa_vOct22_CHOCOPhlAnSGB_202403" +humann_nucleotide_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/chocophlan" +humann_protein_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/uniref" +humann_utility_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/utility_mapping" + + +# AWS params +readsdir: "s3://vkc-nextflow/rawfastq/" +outdir: "s3://vkc-nextflow/output/" +human_genome: "s3://biobakery-databases/kneaddata_databases/" +metaphlan_db: "s3://biobakery-databases/metaphlan_databases/" +humann_bowtie_db: "s3://biobakery-databases/humann_databases/chocophlan" +humann_protein_db: "s3://biobakery-databases/humann_databases/uniref" +humann_utility_db: "s3://biobakery-databases/humann_databases/utility_mapping" + diff --git a/tuftshpc-params.yaml b/tuftshpc-params.yaml new file mode 100644 index 0000000..9821466 --- /dev/null +++ b/tuftshpc-params.yaml @@ -0,0 +1,9 @@ +readsdir: "/cluster/tufts/bonhamlab/shared/sequencing/bam" +outdir: "/cluster/tufts/bonhamlab/shared/sequencing/nf-staging" +human_genome: "/cluster/tufts/bonhamlab/shared/databases/biobakery/kneaddata" +metaphlan_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/metaphlan" +metaphlan_index: "mpa_vOct22_CHOCOPhlAnSGB_202403" +humann_nucleotide_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/chocophlan" +humann_protein_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/uniref" +humann_utility_db: "/cluster/tufts/bonhamlab/shared/databases/biobakery/humann/utility_mapping" +filepattern: "SRR*.bam"