Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
37 commits
Select commit Hold shift + click to select a range
fb844b8
add tufts config
Jun 9, 2025
26076fd
paths for stuff on tufts hpc
Jun 9, 2025
479e8e5
start working on bam conversion
Jun 9, 2025
646b40a
Use emit better, rejigger some stuff
kescobo Jun 9, 2025
79bd96f
Bunch of changes for v4s
kescobo Jun 11, 2025
37292ef
fix params
kescobo Jun 11, 2025
62de8ec
Few more humann tweak
kescobo Jun 11, 2025
326c33d
fix mp stuff
Jun 11, 2025
a990d6d
Merge remote-tracking branch 'origin/tufts' into tufts
kescobo Jun 11, 2025
69f78c1
fix kneaddata inputs
kescobo Jun 11, 2025
bc78841
typos
Jun 11, 2025
a387885
wrap params in braces
Jun 11, 2025
ac550b5
trying stuff
Jun 11, 2025
3eabbd4
something works
Jun 11, 2025
21727da
claude-suggested
Jun 12, 2025
5fe9fd1
return explicit stuff for bam
kescobo Jun 12, 2025
b8c39a2
more tweaks from failures
Jun 12, 2025
2506c98
remove bam for now
Jun 12, 2025
3cf41cc
more fixing humann
Jun 12, 2025
72dda1a
it works
Jun 12, 2025
57d70e3
update params to run all
Jun 12, 2025
64c692b
trying to solve random issues
Jun 16, 2025
93e1409
scope out master-params file and update metaphlan.nf params based on …
danielle-pinto Jun 20, 2025
dd332d4
add documentation to the README
danielle-pinto Jun 22, 2025
b1da87c
add param for paired-end reads
danielle-pinto Jun 22, 2025
91064df
add more documentation to README
danielle-pinto Jun 23, 2025
5dd757e
final first draft of README
danielle-pinto Jun 24, 2025
0758a1c
rename master-params to template-params
danielle-pinto Jun 25, 2025
fe48fb8
update nextflow commands in README
danielle-pinto Jun 25, 2025
7242da6
update according to Kevin's github feedback
danielle-pinto Jun 25, 2025
9e6a84b
add line breaks
danielle-pinto Jun 25, 2025
6e97842
update README with Kevin's suggestions about containers
danielle-pinto Jun 25, 2025
d2d32a6
add line breaks to Kneaddata description
danielle-pinto Jun 25, 2025
2f1bb42
Update README.md with formatting suggestions
danielle-pinto Jun 25, 2025
4fd5d1e
Update README.md
danielle-pinto Jun 25, 2025
1eaf4f6
Update README.md
danielle-pinto Jun 25, 2025
ca73d43
Merge pull request #1 from danielle-pinto/tufts-danielle
danielle-pinto Jun 25, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
117 changes: 111 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,116 @@
# Nextflow pipeline for running the bioBakery

by Kevin Bonham, PhD
by Kevin Bonham, PhD

- `KneadData`
- `MetaPhlAn`
- `HUMAnN`
[bioBakery](https://github.com/biobakery): software, documentation, and tutorials for microbial community profiling (created and mantained by the Huttenhower lab)

## Setup
- [`KneadData`](https://github.com/biobakery/kneaddata):
a data quality-control pipeline that trims low quality reads
and removes host genomic data within our metagenomic samples.
Particularly, this pipeline uses a database containing a reference human genome
so that all human DNA is removed from the samples.
Link to more information here: (https://huttenhower.sph.harvard.edu/kneaddata/).
- [`MetaPhlAn`](https://github.com/biobakery/MetaPhlAn):
a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses)
from metagenomic shotgun sequencing data.
Link to more information here:(https://huttenhower.sph.harvard.edu/metaphlan)
- [`HUMAnN`](https://github.com/biobakery/humann):
a pipeline for efficiently and accurately profiling the presence/absence and abundance of microbial pathways
in a community from metagenomic or metatranscriptomic sequencing data
(typically millions of short DNA/RNA reads).
This process, referred to as functional profiling,
aims to describe the metabolic potential of a microbial community and its members.
Link to more information here:(https://huttenhower.sph.harvard.edu/humann)

**TODO**
## Environment setup
Instructions for setting up a local environment to run the pipeline can be found on Danielle's notebook [here](https://github.com/BonhamLab/daniellepinto/blob/main/PeriodicMeetings/2025-06-17.md#danielles-personal-notes).

Computing environments on the Tufts HPC and AWS should already be set-up with container-based (docker, apptainer) or conda environments.

## Running the pipeline
This nextflow pipeline can be run on three different types of machines:
1) Locally
2) Tufts high performance cluster (HPC)
3) Amazon website services cloud (AWS)

Based on the profiles described in `nextflow.config`, we can run the pipeline with the following Nextflow commands:


### Running locally
`nextflow run main.nf -profile local -params-file params.yaml`

### Running on the HPC

Jobs on the Tufts HPC can be run in two different ways:

- **Batch**: the job will be sent to the queue
and it will be completed based on how many resources you have requested,
current cluster load,
and fairshare (have you recently used the cluster)

- **Preempt**: this allows you to run your job using free nodes from another lab that paid for these compute resources.
However, if they attempt to queue a job, your job will be preempted and killed, so you'll have to resubmit it.

With how the HPC environment is currently defined in `nextflow.config`,
jobs will first be submitted to the `batch` or `preempt` queue, whichever is available first.


- `nextflow run main.nf -profile tufts_hpc -params-file params.yaml`

### Running on AWS
`nextflow main.nf -profile amazon -params-file params.yaml`

> Kevin may want to add additional comments here about different ways to run the pipeline

> Note: We can also process samples on the MIT `engaging` cluster, but that should probably not be used without permission

## Databases
Several databases must be installed to run this pipeline.

### Kneaddata
- A database containing a reference human genome so that unwanted human DNA can be removed from our metagenomic samples.
- The `Homo_sapiens_hg39_T2T_Bowtie2_v0.1` bowtie2 database can be downloaded from [here](https://huttenhower.sph.harvard.edu/kneadData_databases/Homo_sapiens_hg39_T2T_Bowtie2_v0.1.tar.gz).
- This version of the database can be used for all analyses and there shouldn't be a big need to upgrade the database (unless we have an updated human genome!)
- Other reference databases can be added as well if other types of data want to be removed (eg. human transcriptome, mouse genome, etc.)

### MetaPhlAn
- `mpa_vOct22_CHOCOPhlAnSGB_202403` is the most recent MetaPhlAn database that is compatible with the versions of HUMAnN we are using.
- It can be found/downloaded manually from [here](http://cmprod1.cibio.unitn.it/biobakery4/metaphlan_databases/). The easiest way to download is by running `metaphlan --install #any_other_args`
- Note: there is a more up-to-date version (released in January 2025) that we will probably eventually want to shift to once HUMAnN is able to support it.

### HUMAnN
- Database can be downloaded [here](http://cmprod1.cibio.unitn.it/databases/HUMAnN/).


## Information on software versions
This pipeline supports the following versions of MetaPhlAn and HUMAnN:

### MetaPhlAn
- MetaPhlAn 3.1.0
- MetaPhlAn 4

### HUMAnN
- HUMAnN3 3.7
- HUMAnN3 4 alpha

## Testing the pipeline
There are some raw fastq files in `test/` which can be processed through the pipeline

## Using the `template-params.yaml` file
The `template-params.yaml` file defines all input parameters that you may want to use to run the Nextflow pipeline. The file should **not** be used directly to run the pipeline. Rather, the user should select the params they need from the file based on how they would like to use the pipeline (software versions of MetaPhlAn or HUMAnN, computing environment, databases, input data etc. ), and paste these into a separate yaml file. This second yaml file can be used to run the Nextflow pipeline.

### Overview of parameters in `template-params.yaml`
- `input_data_type`: type of input data (either `fastq` or `bam`)
- `paired_end`: True or False, given the type of input data
- `filepattern`: regex describing sample naming convention (relative to the input data type)

- `metaphlan_version`: MetaPhlAn software version (either `metaphlan_v3` or `metaphlan_v4`)
- `humann_version`: HUMAnN3 software version (either `humann_v37` or `humann_v4a`)
- `readsdir`: path to directory that contains raw data
- `outdir`: path to directory where processed results will be saved
- `human_genome`: path to directory that contains human reference database used during Kneaddata
- `metaphlan_db`: path to directory that contains metaphlan databases
- `metaphlan_index`: database version (database must exist within `metaphlan_db`)
- `humann_nucleotide_db`: path to directory containing chocophlan database
- `humann_protein_db`: path to directory containing UniRef database
- `humann_utility_db`: path to directory containing databases that have conversions between different protein annotations (eg UniRef90 to KO or EC), and names for all of the different annotations that have them
31 changes: 16 additions & 15 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -2,26 +2,27 @@

nextflow.enable.dsl=2

include { bam2fastq } from './processes/bam2fastq.nf'
include { kneaddata } from './processes/kneaddata.nf'
include { metaphlan; metaphlan_bzip } from './processes/metaphlan.nf'
include { metaphlan; metaphlan_bam } from './processes/metaphlan.nf'
include { humann; humann_regroup; humann_rename } from './processes/humann.nf'


workflow {

read_pairs_ch = Channel
.fromFilePairs("$params.readsdir/$params.filepattern", size: 2)

human_genome = params.human_genome
metaphlan_db = params.metaphlan_db
humann_bowtie_db = params.humann_bowtie_db
humann_protein_db = params.humann_protein_db
humann_utility_db = params.humann_utility_db
read_ch = Channel
.fromPath("${params.readsdir}/${params.filepattern}")
.map { file ->
def sample = file.baseName // ERR3405856.bam -> ERR3405856
return tuple(sample, file)
}


knead_out = kneaddata(read_pairs_ch, human_genome)
metaphlan_out = metaphlan(knead_out[0], knead_out[1], metaphlan_db)
metaphlan_bzip = metaphlan_bzip(metaphlan_out[0], metaphlan_out[4])
humann_out = humann(metaphlan_out[0], metaphlan_out[1], metaphlan_out[2], humann_bowtie_db, humann_protein_db)
regroup_out = humann_regroup(humann_out[0], humann_out[1], humann_utility_db)
humann_rename(regroup_out, humann_utility_db)
bam_out = bam2fastq(read_ch)
knead_out = kneaddata(bam_out)
metaphlan_out = metaphlan(knead_out.sample, knead_out.fastq)
// metaphlan_bam = metaphlan_bam(metaphlan_out.sample, metaphlan_out.sam) // not working because of headers
humann_out = humann(metaphlan_out.sample, knead_out.fastq, metaphlan_out.profile)
regroup_out = humann_regroup(humann_out.sample, humann_out.genefamilies)
humann_rename(regroup_out)
}
48 changes: 48 additions & 0 deletions nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,54 @@ profiles {
process.executor = 'local'
}

tufts_hpc {
process {

executor = 'slurm'
queue = 'batch,preempt'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding here is that nextflow will first send the job in the batch partition, but if there is not available space in the queue, the job will be run preemptively. Is that correct?


// Default settings for all processes
memory = '4.G'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we technically don't need this "default settings" section because resources for every process is defined

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a vestige - I will double check

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the humann_regroup and humann_rename processes don't have their own. Can probably also remove the bam2fastq-specific params, I think the defaults should be fine for that

time = '2.h'
cpus = 2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[For my own curiousity] Were these resource and time limits set through experience running these pipelines? Or something else? I'm guessing we probably have a good sense of what it takes to run metagenomic samples through so we won't run into any limits

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, mostly experience. They're mostly over-estimates. There's some logic in some of the processes to re-try on failure asking for more resources, but I think those haven't worked quite as intended. Should be revisited at some point, though it's really more relevant for AWS (the cost for provisioning resources is higher there - on the HPC it just means our fairshare goes down faster, but I have not run into any queuing problems yet even with pretty heavy use)

withName: bam2fastq {
memory = '8.G'
time = '1.h'
cpus = 4

}

withName: kneaddata {
memory = '8.G'
time = '8.h'
cpus = 8

}

withName: metaphlan {
memory = '32.G'
time = '4.h'
cpus = 8

}

withName: humann {
memory = '32.G'
time = '8.h'
cpus = 16
}

}

apptainer {
enabled = true
autoMounts = true
runOptions = '--no-home --bind /cluster'
}

}

engaging {
process {

Expand Down
20 changes: 20 additions & 0 deletions processes/bam2fastq.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
process bam2fastq {
tag "bam2fastq $sample"

errorStrategy 'retry'
maxRetries 3

input:
tuple val(sample), path(reads)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: but should reads technically be bam? I was a bit confused at first why we are converting reads to a fastq.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick one-liner explaining what each process file does would be helpful too. I can add this to my version of the pipeline.


output:
tuple val(sample), path("${sample}.fastq")

shell:

"""
echo $sample

samtools fastq -@ ${task.cpus} ${reads} > ${sample}.fastq
"""
}
54 changes: 25 additions & 29 deletions processes/humann.nf
Original file line number Diff line number Diff line change
Expand Up @@ -3,35 +3,34 @@
process humann {
tag "humann on $sample"
publishDir "$params.outdir/humann/main"
memory { workflow.profile == 'standard' ? null : memory * task.attempt }
cpus { workflow.profile == 'standard' ? null : cpus * task.attempt }
// memory { workflow.profile == 'standard' ? null : memory * task.attempt }
// cpus { workflow.profile == 'standard' ? null : cpus * task.attempt }

errorStrategy { task.exitStatus in 134..140 ? 'retry' : 'terminate' }
maxRetries 3
// errorStrategy { task.exitStatus in 134..140 ? 'retry' : 'terminate' }
// maxRetries 3


input:
val sample
path profile
path catkneads
path humann_bowtie_db
path humann_protein_db
path profile

output:
val sample , emit: sample
path "${sample}_genefamilies.tsv" , emit: genefamilies
path "${sample}_pathabundance.tsv"
path "${sample}_pathcoverage.tsv"
path "${sample}_2_genefamilies.tsv" , emit: genefamilies
path "${sample}_0.log"
path "${sample}_3_reactions.tsv"
path "${sample}_4_pathabundance.tsv"

script:

"""
humann_config --update database_folders nucleotide `realpath $humann_bowtie_db`
humann_config --update database_folders protein `realpath $humann_protein_db`

humann --input $catkneads --taxonomic-profile $profile --output ./ \
--threads ${task.cpus} --remove-temp-output --search-mode uniref90 \
--output-basename $sample
--threads ${task.cpus} --remove-temp-output \
--protein-database ${params.humann_protein_db} \
--nucleotide-database ${params.humann_nucleotide_db} \
--utility-database ${params.humann_utility_db} \
--output-basename $sample
"""
}

Expand All @@ -42,21 +41,20 @@ process humann_regroup {
input:
val sample
path genefamilies
path humann_utility_db

output:
val sample , emit: sample
path "${sample}_ecs.tsv"
path "${sample}_kos.tsv"
path "${sample}_pfams.tsv"
path "${sample}_ecs.tsv", emit: ecs
path "${sample}_kos.tsv", emit: kos
path "${sample}_pfams.tsv", emit: pfams


script:

"""
humann_config --update database_folders utility_mapping `realpath $humann_utility_db`
humann_regroup_table --input $genefamilies --output ${sample}_ecs.tsv --groups uniref90_level4ec
humann_regroup_table --input $genefamilies --output ${sample}_kos.tsv --groups uniref90_ko
humann_regroup_table --input $genefamilies --output ${sample}_pfams.tsv --groups uniref90_pfam
humann_regroup_table --input $genefamilies --output ${sample}_ecs.tsv --custom ${params.humann_utility_db}/map_level4ec_uniclust90.txt.gz
humann_regroup_table --input $genefamilies --output ${sample}_kos.tsv --custom ${params.humann_utility_db}/map_ko_uniref90.txt.gz
humann_regroup_table --input $genefamilies --output ${sample}_pfams.tsv --custom ${params.humann_utility_db}/map_pfam_uniref90.txt.gz
"""
}

Expand All @@ -69,7 +67,6 @@ process humann_rename {
path ecs
path kos
path pfams
path humann_utility_db

output:
val sample , emit: sample
Expand All @@ -80,9 +77,8 @@ process humann_rename {
script:

"""
humann_config --update database_folders utility_mapping `realpath $humann_utility_db`
humann_rename_table --input $ecs --output ${sample}_ecs_rename.tsv --names ec
humann_rename_table --input $kos --output ${sample}_kos_rename.tsv --names kegg-orthology
humann_rename_table --input $pfams --output ${sample}_pfams_rename.tsv --names pfam
humann_rename_table --input $ecs --output ${sample}_ecs_rename.tsv --custom ${params.humann_utility_db}/map_level4ec_name.txt.gz
humann_rename_table --input $kos --output ${sample}_kos_rename.tsv --custom ${params.humann_utility_db}/map_ko_name.txt.gz
humann_rename_table --input $pfams --output ${sample}_pfams_rename.tsv --custom ${params.humann_utility_db}/map_pfam_name.txt.gz
"""
}
}
19 changes: 9 additions & 10 deletions processes/kneaddata.nf
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,15 @@ process kneaddata {
time { workflow.profile == 'standard' ? null : time * task.attempt }
memory { workflow.profile == 'standard' ? null : memory * task.attempt }

errorStrategy 'retry'
maxRetries 3
//errorStrategy 'retry'
//maxRetries 3

input:
tuple val(sample), path(reads)
path human_genome

output:
tuple val(sample), path("${sample}_kneaddata_paired_{1,2}.fastq.gz")
path "${sample}_kneaddata_unmatched_{1,2}.fastq.gz"
val(sample), emit: sample
path("${sample}_kneaddata.fastq.gz"), emit: fastq
path "${sample}_kneaddata*.fastq.gz" , optional:true , emit: others
path "${sample}_kneaddata.log" , emit: log

Expand All @@ -22,11 +21,11 @@ process kneaddata {
"""
echo $sample

kneaddata --input ${reads[0]} --input ${reads[1]} \
--reference-db $human_genome --output ./ \
kneaddata --unpaired $reads \
--reference-db ${params.human_genome} --output ./ \
--processes ${task.cpus} --output-prefix ${sample}_kneaddata \
--trimmomatic /opt/conda/share/trimmomatic
--trimmomatic /cluster/tufts/bonhamlab/shared/conda-envs/metaphlan_v4.2/.CondaPkg/.pixi/envs/default/share/trimmomatic
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting here that this path pointing to trimmomatic would only work if we are processing samples in the HPC environment. We should update this path based on the working environment. This can be another parameter we add as well.


gzip *.fastq
gzip ${sample}_kneaddata*.fastq
"""
}
}
Loading