diff --git a/.gitignore b/.gitignore
deleted file mode 100644
index a47bd10..0000000
--- a/.gitignore
+++ /dev/null
@@ -1,8 +0,0 @@
-!.snakemake/
-!benchmarks/
-!logs/
-!mapped/
-!qc/
-!trim_galore/
-!vcf/
-!report.html
\ No newline at end of file
diff --git a/README.md b/README.md
index 504a228..f20fea6 100644
--- a/README.md
+++ b/README.md
@@ -1,314 +1,100 @@
# human_genomics_pipeline
-A simple Snakemake workflow to process paired-end sequencing data (WGS or WES) using [bwa](http://bio-bwa.sourceforge.net/), [sambamba](https://lomereiter.github.io/sambamba/) and [GATK4](https://gatk.broadinstitute.org/hc/en-us).
+A Snakemake workflow to process single samples (unrelated individuals) or cohort samples (related individuals) of paired-end sequencing data (WGS or WES) using [bwa](http://bio-bwa.sourceforge.net/) and [GATK4](https://gatk.broadinstitute.org/hc/en-us). Quality control checks are also undertaken. The fastq files can be optionally trimmed with [Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) and the pipeline can run on [NVIDIA GPU's](https://www.nvidia.com/en-gb/graphics-cards/) where [nvidia clara parabricks software is available](https://www.nvidia.com/en-us/docs/parabricks/quickstart-guide/software-overview/) for *significant* speedups in analysis times. This workflow is designed to follow the [GATK best practice workflow for germline short variant discovery (SNPs + Indels)](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-). This pipeline is designed to be followed by [vcf_annotation_pipeline](https://github.com/ESR-NZ/vcf_annotation_pipeline) and the data ingested into [scout](https://github.com/Clinical-Genomics/scout) for clinical interpretation. However, this pipeline also stands on it's own, taking the data from fastq to vcf (raw sequencing data to called variants). This pipeline has been developed with human genetic data in mind, however we designed it to be species agnostic. Genetic data from other species can be analysed by setting a species-specific reference genome and variant databases in the configuration file (but not all situations have been tested).
- [human_genomics_pipeline](#human_genomics_pipeline)
- - [workflow diagram](#workflow-diagram)
- - [How to run human_genomics_pipeline](#how-to-run-human_genomics_pipeline)
- - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account)
- - [2. Take the pipeline to the data on your local machine](#2-take-the-pipeline-to-the-data-on-your-local-machine)
- - [3. Create a local copy of the reference human genome and variation databases (either GRCh37 or GRCh38)](#3-create-a-local-copy-of-the-reference-human-genome-and-variation-databases-either-grch37-or-grch38)
- - [GRCh37](#grch37)
- - [GRCh38](#grch38)
- - [4. Modify the configuration file](#4-modify-the-configuration-file)
- - [5. Configure to run on a HPC (optional)](#5-configure-to-run-on-a-hpc-optional)
- - [6. Modify the run scripts](#6-modify-the-run-scripts)
- - [HPC](#hpc)
- - [7. Create and activate a conda environment with python and snakemake installed](#7-create-and-activate-a-conda-environment-with-python-and-snakemake-installed)
- - [8. Run the pipeline](#8-run-the-pipeline)
- - [9. Evaluate the pipeline run](#9-evaluate-the-pipeline-run)
- - [10. Commit and push to your forked version of the repo](#10-commit-and-push-to-your-forked-version-of-the-repo)
- - [11. Repeat step 10 each time you re-run the analysis with different parameters](#11-repeat-step-10-each-time-you-re-run-the-analysis-with-different-parameters)
- - [12. Create a pull request with the upstream repo to merge any useful changes (optional)](#12-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-optional)
- - [Useful reading](#useful-reading)
+ - [Pipeline summary - single samples](#pipeline-summary---single-samples)
+ - [Pipeline summary - single samples - GPU accelerated](#pipeline-summary---single-samples---gpu-accelerated)
+ - [Pipeline summary - cohort samples](#pipeline-summary---cohort-samples)
+ - [Pipeline summary - cohort samples - GPU accelerated](#pipeline-summary---cohort-samples---gpu-accelerated)
+ - [Main output files](#main-output-files)
+ - [Prerequisites](#prerequisites)
+ - [Test human_genomics_pipeline](#test-human_genomics_pipeline)
+ - [Run human_genomics_pipeline](#run-human_genomics_pipeline)
+ - [Contribute back!](#contribute-back)
-## workflow diagram
+## Pipeline summary - single samples
-
+1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/))
+2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*)
+3. Alignment against reference genome ([Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/))
+4. Mark duplicates ([GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs))
+5. Base recalibration ([GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator) and [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR))
+6. Haplotype calling ([GATK HaplotypeCalller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller))
-## How to run human_genomics_pipeline
+
-- **Prerequisite software:** [Git](https://git-scm.com/), [Conda 4.8.2](https://docs.conda.io/projects/conda/en/latest/index.html), [wget](https://www.gnu.org/software/wget/), [gunzip](https://linux.die.net/man/1/gunzip), [bwa](http://bio-bwa.sourceforge.net/)
-- **OS:** Validated on Ubuntu 16.04
+## Pipeline summary - single samples - GPU accelerated
----
+1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/))
+2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*)
+3. Alignment against reference genome, mark duplicates, base recalibration and haplotype calling ([parabricks germline pipeline](https://docs.nvidia.com/clara/parabricks/v3.6.1/text/germline_pipeline.html))
+ - *Equivilant to [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/), [GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs), [GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator), [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR) and [GATK HaplotypeCalller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)*
-*This pipeline is in ongoing development. It is currently operational for single samples of whole exome sequencing data (not cohort data yet). It should theoretically be operational for whole genome sequencing data, but this has not yet been validated. It can be run against either the GRCh37 or GRCh38 reference genome. Benchmarking is currently underway.*
+
----
+## Pipeline summary - cohort samples
-### 1. Fork the pipeline repo to a personal or lab account
+1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/))
+2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*)
+3. Alignment against reference genome ([Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/))
+4. Mark duplicates ([GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs))
+5. Base recalibration ([GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator) and [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR))
+6. Haplotype calling ([GATK HaplotypeCalller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller))
+7. Combine GVCF into multi-sample GVCF ([GATK CombineGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360037593911-CombineGVCFs))
+8. Genotyping ([GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs))
-See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help
+
-### 2. Take the pipeline to the data on your local machine
+## Pipeline summary - cohort samples - GPU accelerated
-Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed. Required folder structure and file naming convention:
+1. Raw read QC ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and [MultiQC](https://multiqc.info/))
+2. Adapter trimming ([Trim Galore](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)) (*optional*)
+3. Alignment against reference genome, mark duplicates, base recalibration and haplotype calling ([parabricks germline pipeline](https://docs.nvidia.com/clara/parabricks/v3.6.1/text/germline_pipeline.html))
+ - *Equivilant to [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/), [GATK MarkDuplicates](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs), [GATK BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator), [GATK ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR) and [GATK HaplotypeCalller](https://gatk.broadinstitute.org/hc/en-us/articles/360037225632-HaplotypeCaller)*
+4. Combine GVCF into multi-sample GVCF ([parabricks trio combine gvcf](https://docs.nvidia.com/clara/parabricks/v3.6/text/joint_calling.html#trio-combine-gvcf))
+ - *Equivalent to [GATK CombineGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360037593911-CombineGVCFs)*
+5. Genotyping ([GATK GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/4414594430619-GenotypeGVCFs))
-```bash
+
-.
-|___fastq/
-| |___sample1_R1
-| |___sample1_R2
-| |___sample2_R1
-| |___sample2_R2
-| |___ ...
-|
-|___human_genomics_pipeline/
+## Main output files
-```
+Single samples:
-See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help
+- `results/qc/multiqc_report.html`
+- `results/mapped/sample1_recalibrated.bam`
+- `results/called/sample1_raw_snps_indels.vcf`
-### 3. Create a local copy of the reference human genome and variation databases (either GRCh37 or GRCh38)
+Cohort samples:
-#### GRCh37
+- `results/qc/multiqc_report.html`
+- `results/mapped/sample1_recalibrated.bam`
+- `results/mapped/sample2_recalibrated.bam`
+- `results/mapped/sample3_recalibrated.bam`
+- `results/called/proband1_raw_snps_indels.vcf`
-Download the reference human genome (GRCh37) and it's associated fasta sequence dictionary file (.dict) and fasta index file (.fai) files from the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle)
+## Prerequisites
-```bash
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.gz
-gunzip ucsc.hg19.fasta.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.dict.gz
-gunzip ucsc.hg19.dict.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/ucsc.hg19.fasta.fai.gz
-gunzip ucsc.hg19.fasta.fai.gz
-```
+- **Prerequisite hardware:** [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) (for GPU accelerated runs)
+- **Prerequisite software:** [NVIDIA CLARA parabricks and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/) (for GPU accelerated runs), [Git](https://git-scm.com/) (tested with version 2.7.4), [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) with [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2), [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52), [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6)
-Create index files for the genome sequence (.amb, .ann, .bwt, .pac, .sa)
+## Test human_genomics_pipeline
-```bash
-bwa index -a bwtsw ucsc.hg19.fasta
-```
+The provided [test dataset](./test) can be used to test running this pipeline on a new machine, or test pipeline developments/releases.
-Download dbSNP (build 151) from [NCBI](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
+## Run human_genomics_pipeline
-```bash
-wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz
-wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/GATK/All_20180423.vcf.gz.tbi
-```
+See the docs for a walkthrough guide for running [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) on:
-Download other variation databases from the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle)
+- [A single machine like a laptop or single server/computer](./docs/running_on_a_single_machine.md)
+- [A high performance cluster](./docs/running_on_a_hpc.md)
-```bash
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.idx.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.indels.hg19.sites.vcf.idx.gz
-```
+## Contribute back!
-#### GRCh38
+- Raise issues in [the issues page](https://github.com/ESR-NZ/human_genomics_pipeline/issues)
+- Create feature requests in [the issues page](https://github.com/ESR-NZ/human_genomics_pipeline/issues)
+- Start a discussion in [the discussion page](https://github.com/ESR-NZ/human_genomics_pipeline/discussions)
+- Contribute your code! Create your own branch from the [development branch](https://github.com/ESR-NZ/human_genomics_pipeline/tree/dev) and create a pull request to the [development branch](https://github.com/ESR-NZ/human_genomics_pipeline/tree/dev) once the code is on point!
-Download the reference human genome (GRCh38) and it's associated fasta sequence dictionary file (.dict) and fasta index file (.fai) files from the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle)
-
-```bash
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz
-gunzip Homo_sapiens_assembly38.fasta.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.dict
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.fai
-```
-
-Create index files for the genome sequence (.amb, .ann, .bwt, .pac, .sa)
-
-```bash
-bwa index -a bwtsw Homo_sapiens_assembly38.fasta
-```
-
-Download dbSNP (build 151) from [NCBI](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/)
-
-```bash
-wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/GATK/All_20180418.vcf.gz
-wget ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/GATK/All_20180418.vcf.gz.tbi
-```
-
-Download other variation databases from the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle)
-
-```bash
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
-wget ftp://gsapubftp-anonymous:password@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
-```
-
-### 4. Modify the configuration file
-
-Either edit the 'config_template.yaml' file from scratch, or edit a config file 'config_examples'.
-
-Specify whether you are running your analysis against the GRCh37 or GRCh38 build of the reference genome and whether the data is to be analysed on it's own ('Single') or as a part of a cohort ('Cohort'). For example:
-
-```yaml
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Single"
-```
-
-Set the the working directories in the config file to the reference human genome file (GRCh37 or GRCh38), dbSNP database file (GRCh37 or GRCh38) and a temporary directory. For example:
-
-```yaml
-# File directories to reference genome and dbSNP database
-REFGENOME: "/home/lkemp/publicData/referenceGenome/Homo_sapiens_assembly38.fasta"
-dbSNP: "/home/lkemp/publicData/dbSNP/All_20180418.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/home/lkemp/tmp/"
-```
-
-If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding. For example:
-
-*If NOT analysing WES data, leave these fields blank*
-
-```yaml
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: "-L /home/lkemp/publicData/sure_select_human_all_exon_V7/S31285117_AllTracks.bed"
- # Padding (in bp) to add to each interval
- PADDING: "-ip 100"
-```
-
-Pass the resources to be used to recalibrate bases with gatk BaseRecalibrator to the `--known-sites` flag. For example:
-
-```yaml
-RECALIBRATION:
- RESOURCES: "--known-sites /home/lkemp/publicData/dbSNP/All_20180418.vcf.gz
- --known-sites /home/lkemp/publicData/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
-```
-
-Save your modified config file with a descriptive name
-
-### 5. Configure to run on a HPC (optional)
-
-In theory, this cluster configuration should be adaptable to other job scheduler systems, but here I will demonstrate how to integrate this pipeline with slurm.
-
-Configure `account:`, `partition:` and `nodelist:` in default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation [here](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated) and [here](https://slurm.schedmd.com/)). For example:
-
-```json
-{
- "__default__" :
- {
- "account" : "lkemp",
- "nodes" : 1,
- "ntasks" : 4,
- "partition" : "prod",
- "nodelist" : "kscprod-bio4"
- }
-}
-```
-
-[This](https://hpc-carpentry.github.io/hpc-python/17-cluster/) is a good place to go for a good working example if you get stuck.
-
-*These variables will need to be passed to snakemake in the snakemake run script (see example in step 6).*
-
-### 6. Modify the run scripts
-
-Specify your config file to be used with the `--configfile` flag and modify the number of cores to be used with the `-j` flag. For example:
-
-Dry run (dryrun.sh):
-
-```bash
-snakemake \
--n -j 32 \
---use-conda \
---configfile config_37_single_WES.yaml
-```
-
-Full run (run.sh):
-
-```bash
-snakemake \
--j 32 \
---use-conda \
---configfile config_37_single_WES.yaml
-```
-
-Report (report.sh)
-
-```bash
-snakemake \
---report ESR_report.html \
---configfile config_37_single_WES.yaml \
---report-stylesheet ESR_stylesheet.css
-```
-
-See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html) for additional run parameters.
-
-#### HPC
-
-If you want to run the pipeline on a HPC, pass the cluster variables set in 'cluster.json' to the dry run and full run scripts. For example:
-
-Dry run (dryrun.sh):
-
-```bash
-snakemake \
--n -j 32 \
---use-conda \
---configfile config_37_single_WES.yaml \
---cluster-config cluster.json \
---cluster "sbatch -A {cluster.account} \
--p {cluster.partition} \
---nodes {cluster.nodes} \
---ntasks {cluster.ntasks} \
---nodelist {cluster.nodelist}"
-```
-
-Full run (run.sh):
-
-```bash
-snakemake \
--j 32 \
---use-conda \
---configfile config_37_single_WES.yaml \
---cluster-config cluster.json \
---cluster "sbatch -A {cluster.account} \
--p {cluster.partition} \
---nodes {cluster.nodes} \
---ntasks {cluster.ntasks} \
---nodelist {cluster.nodelist}"
-```
-
-### 7. Create and activate a conda environment with python and snakemake installed
-
-```bash
-conda env create -f pipeline_run_env.yml
-conda activate pipeline_run_env
-```
-
-### 8. Run the pipeline
-
-First carry out a dry run
-
-```bash
-bash dryrun.sh
-```
-
-If there are no issues, start a full run
-
-```bash
-bash run.sh
-```
-
-### 9. Evaluate the pipeline run
-
-Generate an interactive html report
-
-```bash
-bash report.sh
-```
-
-### 10. Commit and push to your forked version of the repo
-
-To maintain reproducibility, commit and push:
-
-- All configuration files
-- All run scripts
-- The final report
-
-### 11. Repeat step 10 each time you re-run the analysis with different parameters
-
-### 12. Create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes (optional)
-
-Contributions and feedback are more than welcome! :blush:
-
-See [here](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/creating-a-pull-request) for help
-
-## Useful reading
-
-Van der Auwera et al., (2013). *Current Protocols in Bioinformatics*. [From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243306/); 11(1110): 11.10.1–11.10.33.
+Contributions and feedback are always welcome! :blush:
diff --git a/Snakefile b/Snakefile
deleted file mode 100644
index 6a6348a..0000000
--- a/Snakefile
+++ /dev/null
@@ -1,59 +0,0 @@
-"""
-Author: Miles Benton and Leah Kemp
-Affiliation: ESR
-Aim: A simple Snakemake workflow to process paired-end sequencing data (WGS) using bwa/GATK4. Designed to be used before vcf_annotation_pipeline.
-Date created: 2019-08-21
-Modified: 2020-04-29
-Dry run: snakemake -n -j 24 --use-conda --configfile config.yaml
-Full run: snakemake -j 24 --use-conda --configfile config.yaml
-Report: snakemake --report report.html --configfile config.yaml --report-stylesheet stylesheet.css
-Rule diagram: snakemake --rulegraph --configfile config.yaml | dot -Tpng > rulegraph.png
-Workflow diagram (specific experiment): snakemake --dag --configfile config.yaml | dot -Tpng > dag.png
-"""
-
-##### Set up #####
-
-# Define single or cohort samples from fastq dir using wildcards
-if config['DATA'] == "Single":
- SAMPLES, = glob_wildcards("../fastq/{sample}_R1.fastq.gz")
-elif config['DATA'] == "Cohort":
- # Fix cohort wildcard
- SAMPLES, = glob_wildcards("../fastq/{cohort}_R1.fastq.gz")
-else: print("ERROR: Please check your configuration settings")
-
-##### Target rules #####
-
-rule all:
- input:
- expand("qc/multiqc/pre_trim_multiqc_report.html"),
- expand("qc/multiqc/post_trim_multiqc_report.html"),
- expand("vcf/{sample}_raw_snps_indels_AS_g.vcf", sample = SAMPLES)
-
-#### Set up report #####
-
-report: "report/workflow.rst"
-
-##### load rules #####
-
-localrules: multiqc_pre_trim, multiqc_post_trim, sambamba_index, sambamba_index_rgadd
-
-include: "rules/fastqc.smk"
-include: "rules/multiqc_pre_trim.smk"
-include: "rules/trim_galore_pe.smk"
-include: "rules/multiqc_post_trim.smk"
-include: "rules/bwa_map.smk"
-include: "rules/sambamba_sort.smk"
-include: "rules/sambamba_mkdups.smk"
-include: "rules/sambamba_index.smk"
-include: "rules/gatk_add_replace_read_groups.smk"
-include: "rules/sambamba_index_rgadd.smk"
-include: "rules/gatk_base_recalibrator.smk"
-include: "rules/gatk_apply_bqsr.smk"
-
-if config['DATA'] == "Single":
- include: "rules/gatk_haplotype_caller_single.smk"
-
-if config['DATA'] == "Cohort":
- include: "rules/gatk_haplotype_caller_gvcf.smk"
- include: "rules/gatk_combine_gvcf.smk"
- include: "rules/gatk_genotype_gvcf.smk"
\ No newline at end of file
diff --git a/cluster.json b/cluster.json
deleted file mode 100644
index 5f5100e..0000000
--- a/cluster.json
+++ /dev/null
@@ -1,54 +0,0 @@
-{
- "__default__" :
- {
- "account" : "",
- "nodes" : 1,
- "ntasks" : 4,
- "partition" : "",
- "nodelist" : ""
- },
- "__fastqc__" :
- {
- "nodes" : 2
- },
- "__trim_galore_pe__" :
- {
- "nodes" : 16
- },
- "__bwa_map__" :
- {
- "nodes" : 32
- },
- "__sambamba_sort__" :
- {
- "nodes" : 16
- },
- "__sambamba_mkdups__" :
- {
- "nodes" : 4
- },
- "__sambamba_index__" :
- {
- "nodes" : 8
- },
- "__gatk_add_replace_read_groups__" :
- {
- "nodes" : 1
- },
- "__sambamba_index_rgadd__" :
- {
- "nodes" : 8
- },
- "__gatk_base_recalibrator__" :
- {
- "nodes" : 1
- },
- "__gatk_apply_bqsr__" :
- {
- "nodes" : 1
- },
- "__gatk_haplotype_caller_single__" :
- {
- "nodes" : 1
- }
-}
\ No newline at end of file
diff --git a/ESR_stylesheet.css b/config/ESR_stylesheet.css
similarity index 100%
rename from ESR_stylesheet.css
rename to config/ESR_stylesheet.css
diff --git a/config/cluster.json b/config/cluster.json
new file mode 100644
index 0000000..0738a2e
--- /dev/null
+++ b/config/cluster.json
@@ -0,0 +1,8 @@
+{
+ "__default__" :
+ {
+ "account" : "",
+ "partition" : "",
+ "output" : "logs/slurm-%j_{rule}.out"
+ }
+}
\ No newline at end of file
diff --git a/config/config.yaml b/config/config.yaml
new file mode 100644
index 0000000..c4879fd
--- /dev/null
+++ b/config/config.yaml
@@ -0,0 +1,60 @@
+##############################
+###### Overall workflow ######
+##############################
+
+# Type of input data (either 'Single' or 'Cohort')
+DATA: ""
+
+# Should the pipeline be GPU accelerated where possible? (either 'Yes' or 'No')
+GPU_ACCELERATED: ""
+
+# File path to the reference genome (.fasta)
+REFGENOME: ""
+
+# File path to dbSNP database
+dbSNP: ""
+
+# Temporary file directory
+TEMPDIR: ""
+
+# Whole exome sequence settings (leave blank if analysing other data such as whole genome sequence data)
+WES:
+ # File path to the exome capture regions over which to operate
+ INTERVALS: ""
+ # Padding (in bp) to add to each region
+ PADDING: ""
+
+##############################
+##### Pipeline resources #####
+##############################
+
+# Number of threads to use per rule/sample for multithreaded rules, multithreading will significantly speed up these rules (diminishing speed gains beyond 8 threads)
+THREADS:
+
+# Maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
+MAXMEMORY: ""
+
+# Maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU)
+GPU:
+
+##############################
+########## Trimming ##########
+##############################
+
+# Whether or not to trim the raw fastq reads (either 'Yes' or 'No')
+TRIM: ""
+
+# If trimming, choose the adapter sequence to be trimmed (eg. `--illumina`, `--nextera` or `--small_rna`) or pass adapter sequences to the `-a` and `-a2` flags
+TRIMMING:
+ ADAPTERS: ""
+
+##############################
+##### Base recalibration #####
+##############################
+
+# List of resources to used for base recalibration
+RECALIBRATION:
+ RESOURCES:
+ -
+ -
+ -
\ No newline at end of file
diff --git a/config_examples/GRCh37_cohort_WES_config_template.yaml b/config_examples/GRCh37_cohort_WES_config_template.yaml
deleted file mode 100644
index 1fed879..0000000
--- a/config_examples/GRCh37_cohort_WES_config_template.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Cohort"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/ucsc.hg19.fasta"
-dbSNP: "/path/to/dbSNP/All_20180423.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: "-L /path/to/regions/.bed"
- # Padding (in bp) to add to each interval
- PADDING: "-ip 100"
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180423.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- --known-sites /path/to/1000G/indels/1000G_phase1.indels.hg19.sites.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh37_cohort_WGS_config_template.yaml b/config_examples/GRCh37_cohort_WGS_config_template.yaml
deleted file mode 100644
index aff45c4..0000000
--- a/config_examples/GRCh37_cohort_WGS_config_template.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Cohort"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/ucsc.hg19.fasta"
-dbSNP: "/path/to/dbSNP/All_20180423.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: ""
- # Padding (in bp) to add to each interval
- PADDING: ""
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180423.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- --known-sites /path/to/1000G/indels/1000G_phase1.indels.hg19.sites.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh37_single_WES_config_template.yaml b/config_examples/GRCh37_single_WES_config_template.yaml
deleted file mode 100644
index 8e03879..0000000
--- a/config_examples/GRCh37_single_WES_config_template.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Single"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/ucsc.hg19.fasta"
-dbSNP: "/path/to/dbSNP/All_20180423.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: "-L /path/to/regions/.bed"
- # Padding (in bp) to add to each interval
- PADDING: "-ip 100"
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180423.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- --known-sites /path/to/1000G/indels/1000G_phase1.indels.hg19.sites.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh37_single_WGS_config_template.yaml b/config_examples/GRCh37_single_WGS_config_template.yaml
deleted file mode 100644
index 819e3c3..0000000
--- a/config_examples/GRCh37_single_WGS_config_template.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Single"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/ucsc.hg19.fasta"
-dbSNP: "/path/to/dbSNP/All_20180423.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: ""
- # Padding (in bp) to add to each interval
- PADDING: ""
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180423.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
- --known-sites /path/to/1000G/indels/1000G_phase1.indels.hg19.sites.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh38_cohort_WES_config_template.yaml b/config_examples/GRCh38_cohort_WES_config_template.yaml
deleted file mode 100644
index 78da3df..0000000
--- a/config_examples/GRCh38_cohort_WES_config_template.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Cohort"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/Homo_sapiens_assembly38.fasta"
-dbSNP: "/path/to/dbSNP/All_20180418.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: "-L /path/to/regions/.bed"
- # Padding (in bp) to add to each interval
- PADDING: "-ip 100"
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180418.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh38_cohort_WGS_config_template.yaml b/config_examples/GRCh38_cohort_WGS_config_template.yaml
deleted file mode 100644
index b93dd6d..0000000
--- a/config_examples/GRCh38_cohort_WGS_config_template.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Cohort"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/Homo_sapiens_assembly38.fasta"
-dbSNP: "/path/to/dbSNP/All_20180418.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: ""
- # Padding (in bp) to add to each interval
- PADDING: ""
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180418.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh38_single_WES_config_template.yaml b/config_examples/GRCh38_single_WES_config_template.yaml
deleted file mode 100644
index c519ee9..0000000
--- a/config_examples/GRCh38_single_WES_config_template.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Single"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/Homo_sapiens_assembly38.fasta"
-dbSNP: "/path/to/dbSNP/All_20180418.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: "-L /path/to/regions/.bed"
- # Padding (in bp) to add to each interval
- PADDING: "-ip 100"
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180418.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/GRCh38_single_WGS_config_template.yaml b/config_examples/GRCh38_single_WGS_config_template.yaml
deleted file mode 100644
index 85540b0..0000000
--- a/config_examples/GRCh38_single_WGS_config_template.yaml
+++ /dev/null
@@ -1,29 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: "Single"
-
-# File directories to reference genome and dbSNP database
-REFGENOME: "/path/to/reference/human/genome/Homo_sapiens_assembly38.fasta"
-dbSNP: "/path/to/dbSNP/All_20180418.vcf.gz"
-
-# Temporary file directory
-TEMPDIR: "/file/path/to/tmp/"
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: ""
- # Padding (in bp) to add to each interval
- PADDING: ""
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: "--known-sites /path/to/dbSNP/All_20180418.vcf.gz
- --known-sites /path/to/mills/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz"
\ No newline at end of file
diff --git a/config_examples/README.md b/config_examples/README.md
deleted file mode 100644
index 4d47494..0000000
--- a/config_examples/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-This folder contains some example config files for a few analyses scenarios:
-
-- GRCh37 vs. GRCh38 build of the reference human genome
-- Analysing single samples vs. a cohort
-- Analysing WES vs. WGS
\ No newline at end of file
diff --git a/config_template.yaml b/config_template.yaml
deleted file mode 100644
index a85d48a..0000000
--- a/config_template.yaml
+++ /dev/null
@@ -1,28 +0,0 @@
-############################
-##### Overall workflow #####
-############################
-
-# Type of input data (either 'Single' or 'Cohort')
-DATA: ""
-
-# File directories to reference genome and dbSNP database
-REFGENOME: ""
-dbSNP: ""
-
-# Temporary file directory
-TEMPDIR: ""
-
-# WES settings (leave blank if analysing other data such as WGS)
-WES:
- # Genomic intervals over which to operate
- INTERVALS: ""
- # Padding (in bp) to add to each interval
- PADDING: ""
-
-##############################
-##### Base recalibration #####
-##############################
-
-# Resources used for base recalibration
-RECALIBRATION:
- RESOURCES: ""
\ No newline at end of file
diff --git a/docs/running_on_a_hpc.md b/docs/running_on_a_hpc.md
new file mode 100644
index 0000000..50d9b0d
--- /dev/null
+++ b/docs/running_on_a_hpc.md
@@ -0,0 +1,347 @@
+# Run human_genomics_pipeline on a high performance cluster
+
+## Table of contents
+
+- [Run human_genomics_pipeline on a high performance cluster](#run-human_genomics_pipeline-on-a-high-performance-cluster)
+ - [Table of contents](#table-of-contents)
+ - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account)
+ - [2. Take the pipeline to the data on the HPC](#2-take-the-pipeline-to-the-data-on-the-hpc)
+ - [3. Setup files and directories](#3-setup-files-and-directories)
+ - [Test data](#test-data)
+ - [4. Get prerequisite software/hardware](#4-get-prerequisite-softwarehardware)
+ - [5. Create a local copy of the GATK resource bundle (either b37 or hg38)](#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)
+ - [b37](#b37)
+ - [hg38](#hg38)
+ - [6. Modify the configuration file](#6-modify-the-configuration-file)
+ - [Overall workflow](#overall-workflow)
+ - [Pipeline resources](#pipeline-resources)
+ - [Trimming](#trimming)
+ - [Base recalibration](#base-recalibration)
+ - [7. Configure to run on a HPC](#7-configure-to-run-on-a-hpc)
+ - [8. Modify the run scripts](#8-modify-the-run-scripts)
+ - [9. Create and activate a conda environment with python and snakemake installed](#9-create-and-activate-a-conda-environment-with-python-and-snakemake-installed)
+ - [10. Run the pipeline](#10-run-the-pipeline)
+ - [11. Evaluate the pipeline run](#11-evaluate-the-pipeline-run)
+ - [12. Commit and push to your forked version of the github repo](#12-commit-and-push-to-your-forked-version-of-the-github-repo)
+ - [13. Repeat step 12 each time you re-run the analysis with different parameters](#13-repeat-step-12-each-time-you-re-run-the-analysis-with-different-parameters)
+ - [14. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#14-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional)
+
+## 1. Fork the pipeline repo to a personal or lab account
+
+See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help forking a repository
+
+## 2. Take the pipeline to the data on the HPC
+
+Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed.
+
+See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help cloning a repository
+
+## 3. Setup files and directories
+
+Required folder structure and file naming convention:
+
+```bash
+
+.
+|___fastq/
+| |___sample1_1.fastq.gz
+| |___sample1_2.fastq.gz
+| |___sample2_1.fastq.gz
+| |___sample2_2.fastq.gz
+| |___sample3_1.fastq.gz
+| |___sample3_2.fastq.gz
+| |___sample4_1.fastq.gz
+| |___sample4_2.fastq.gz
+| |___sample5_1.fastq.gz
+| |___sample5_2.fastq.gz
+| |___sample6_1.fastq.gz
+| |___sample6_2.fastq.gz
+| |___ ...
+|
+|___human_genomics_pipeline/
+
+```
+
+If you're analysing cohort's of samples, you will need an additional directory with a [pedigree file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) for each cohort/family using the following folder structure and file naming convention:
+
+```bash
+
+.
+|___fastq/
+| |___sample1_1.fastq.gz
+| |___sample1_2.fastq.gz
+| |___sample2_1.fastq.gz
+| |___sample2_2.fastq.gz
+| |___sample3_1.fastq.gz
+| |___sample3_2.fastq.gz
+| |___sample4_1.fastq.gz
+| |___sample4_2.fastq.gz
+| |___sample5_1.fastq.gz
+| |___sample5_2.fastq.gz
+| |___sample6_1.fastq.gz
+| |___sample6_2.fastq.gz
+| |___ ...
+|
+|___pedigrees/
+| |___proband1_pedigree.ped
+| |___proband2_pedigree.ped
+| |___ ...
+|
+|___human_genomics_pipeline/
+
+```
+
+Requirements:
+
+- Input paired end fastq files need to identified with `_1` and `_2` (not `_R1` and `_R2`)
+- Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement)
+- Singletons and cohorts need to be run in separate pipeline runs
+- It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file)
+
+### Test data
+
+The provided [test dataset](./test) can be used. Setup the test dataset before running the pipeline on this data - choose to setup to run either a single sample analysis or a cohort analysis with the `-a` flag. For example:
+
+```bash
+cd ./human_genomics_pipeline
+bash ./test/setup_test.sh -a cohort
+```
+
+## 4. Get prerequisite software/hardware
+
+For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/). Talk to your system administrator to see if the HPC has this hardware and software available.
+
+Other software required to get setup and run the pipeline:
+
+- [Git](https://git-scm.com/) (tested with version 2.7.4)
+- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2)
+- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) (note. [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install))
+- [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52)
+- [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6)
+
+Most of this software is commonly pre-installed on HPC's, likely available as modules that can be loaded. Talk to your system administrator if you need help with this.
+
+## 5. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38)
+
+### b37
+
+Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=)
+
+```bash
+gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
+```
+
+### hg38
+
+Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)
+
+```bash
+gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
+```
+
+## 6. Modify the configuration file
+
+Edit 'config.yaml' found within the config directory
+
+### Overall workflow
+
+Specify whether the data is to be analysed on it's own ('Single') or as a part of a cohort of samples ('Cohort'). For example:
+
+```yaml
+DATA: "Single"
+```
+
+Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS](https://www.nvidia.com/en-us/docs/parabricks/local-installation/))
+
+```yaml
+GPU_ACCELERATED: "No"
+```
+
+Set the the working directories to the reference human genome file (b37 or hg38). For example:
+
+```yaml
+REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta"
+```
+
+Set the the working directory to your dbSNP database file (b37 or hg38). For example:
+
+```yaml
+dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf"
+```
+
+Set the the working directory to a temporary file directory. Make sure this is a location with a fair amount of memory space for large intermediate analysis files. For example:
+
+```yaml
+TEMPDIR: "/scratch/tmp/"
+```
+
+If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. For example:
+
+*If NOT analysing WES data, leave these fields blank*
+
+```yaml
+WES:
+ # File path to the exome capture regions over which to operate
+ INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed"
+ # Padding (in bp) to add to each region
+ PADDING: "100"
+```
+
+### Pipeline resources
+
+These settings allow you to configure the resources per rule/sample
+
+Set the number of threads to use per sample/rule for multithreaded rules (`rule trim_galore_pe` and `rule bwa_mem`). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the `--set-threads` flag in the runscript (see [step 8](#8-modify-the-run-scripts)).
+
+```yaml
+THREADS: 8
+```
+
+Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
+
+```yaml
+MAXMEMORY: "40g"
+```
+
+Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU)
+
+```yaml
+GPU: 1
+```
+
+It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j/--cores 32` (see [step 8](#8-modify-the-run-scripts)), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j/--cores 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`.
+
+#### Trimming
+
+Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example:
+
+```yaml
+TRIM: "Yes"
+```
+
+*Note. if you'd like to use a different trimming tool that you feel is better for your use case/data, you can pre-trim your fastq files and pass the trimmed fastq's to this pipeline, turning off this pipelines internal trimming with as outlined above*
+
+If trimming the raw fastq reads, set the [trim galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with `--illumina`, `--nextera` or `--small_rna`. Alternatively, pass adapter sequences to the `-a` and `-a2` flags. If not set, trim galore will try to auto-detect the adapter based on the fastq reads.
+
+```yaml
+TRIMMING:
+ ADAPTERS: "--illumina"
+```
+
+### Base recalibration
+
+Pass the resources to be used to recalibrate bases with [gatk BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator) (this passes the resource files to the [`--known-sites`](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator#--known-sites) flag), these known polymorphic sites will be used to exclude regions around known polymorphisms from base recalibration. Note. you can include as many or as few resources as you like, but you'll need at least one recalibration resource file. For example:
+
+```yaml
+RECALIBRATION:
+ RESOURCES:
+ - /scratch/publicData/b37/dbsnp_138.b37.vcf
+ - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
+ - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf
+```
+
+## 7. Configure to run on a HPC
+
+*This will deploy the non-GPU accelerated rules to slurm and deploy the GPU accelerated rules locally (pbrun_fq2bam, pbrun_haplotypecaller_single, pbrun_haplotypecaller_cohort). Therefore, if running the pipeline gpu accelerated, the pipeline should be deployed from the machine with the GPU's.*
+
+In theory, this cluster configuration should be adaptable to other job scheduler systems. In fact, snakemake is designed to allow pipelines/workflows such as this to be portable to different job schedulers by having any [any cluster specific configured outside of the workflow/pipeline](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated). but here I will provide a minimal example for deploying this pipeline to [slurm](https://slurm.schedmd.com/).
+
+Configure `account:` and `partition:` in the default section of 'cluster.json' in order to set the parameters for slurm sbatch (see documentation [here](https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#cluster-configuration-deprecated) and [here](https://slurm.schedmd.com/)). For example:
+
+```json
+{
+ "__default__" :
+ {
+ "account" : "lkemp",
+ "partition" : "prod",
+ "output" : "logs/slurm-%j_{rule}.out"
+ }
+}
+```
+
+There are a plethora of additional slurm parameters that can be configured (and can be configured per rule). If you set additional slurm parameters, remember to pass them to the `--cluster` flag in the runscripts. See [here](https://snakemake-on-nesi.sschmeier.com/snake.html#slurm-and-nesi-specific-setup) for a good working example of deploying a snakemake workflow to [NeSi](https://www.nesi.org.nz/).
+
+## 8. Modify the run scripts
+
+Set the number maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPU's to be used with the `--resources gpu=` flag. For example:
+
+Dry run (dryrun_hpc.sh):
+
+```bash
+snakemake \
+--dryrun \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml \
+--cluster-config ../config/cluster.json \
+--cluster "sbatch -A {cluster.account} \
+-p {cluster.partition} \
+-o {cluster.output}"
+```
+
+Full run (run_hpc.sh):
+
+```bash
+snakemake \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml \
+--cluster-config ../config/cluster.json \
+--cluster "sbatch -A {cluster.account} \
+-p {cluster.partition} \
+-o {cluster.output}"
+```
+
+See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters.
+
+## 9. Create and activate a conda environment with python and snakemake installed
+
+```bash
+cd ./workflow/
+mamba env create -f pipeline_run_env.yml
+conda activate pipeline_run_env
+```
+
+## 10. Run the pipeline
+
+First carry out a dry run
+
+```bash
+bash dryrun_hpc.sh
+```
+
+If there are no issues, start a full run
+
+```bash
+bash run_hpc.sh
+```
+
+## 11. Evaluate the pipeline run
+
+Generate an interactive html report
+
+```bash
+bash report.sh
+```
+
+## 12. Commit and push to your forked version of the github repo
+
+To maintain reproducibility, commit and push:
+
+- All configuration files
+- All run scripts
+- The final report
+
+## 13. Repeat step 12 each time you re-run the analysis with different parameters
+
+## 14. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes to the pipeline (optional)
+
+See [the README](https://github.com/ESR-NZ/human_genomics_pipeline/blob/dev/README.md#contribute-back) for info on how to contribute back to the pipeline!
diff --git a/docs/running_on_a_single_machine.md b/docs/running_on_a_single_machine.md
new file mode 100644
index 0000000..0198092
--- /dev/null
+++ b/docs/running_on_a_single_machine.md
@@ -0,0 +1,314 @@
+# Run human_genomics_pipeline on a single machine like a laptop or single server/computer
+
+## Table of contents
+
+- [Run human_genomics_pipeline on a single machine like a laptop or single server/computer](#run-human_genomics_pipeline-on-a-single-machine-like-a-laptop-or-single-servercomputer)
+ - [Table of contents](#table-of-contents)
+ - [1. Fork the pipeline repo to a personal or lab account](#1-fork-the-pipeline-repo-to-a-personal-or-lab-account)
+ - [2. Take the pipeline to the data on your local machine](#2-take-the-pipeline-to-the-data-on-your-local-machine)
+ - [3. Setup files and directories](#3-setup-files-and-directories)
+ - [Test data](#test-data)
+ - [4. Get prerequisite software/hardware](#4-get-prerequisite-softwarehardware)
+ - [5. Create a local copy of the GATK resource bundle (either b37 or hg38)](#5-create-a-local-copy-of-the-gatk-resource-bundle-either-b37-or-hg38)
+ - [b37](#b37)
+ - [hg38](#hg38)
+ - [6. Modify the configuration file](#6-modify-the-configuration-file)
+ - [Overall workflow](#overall-workflow)
+ - [Pipeline resources](#pipeline-resources)
+ - [Trimming](#trimming)
+ - [Base recalibration](#base-recalibration)
+ - [7. Modify the run scripts](#7-modify-the-run-scripts)
+ - [8. Create and activate a conda environment with python and snakemake installed](#8-create-and-activate-a-conda-environment-with-python-and-snakemake-installed)
+ - [9. Run the pipeline](#9-run-the-pipeline)
+ - [10. Evaluate the pipeline run](#10-evaluate-the-pipeline-run)
+ - [11. Commit and push to your forked version of the github repo](#11-commit-and-push-to-your-forked-version-of-the-github-repo)
+ - [12. Repeat step 11 each time you re-run the analysis with different parameters](#12-repeat-step-11-each-time-you-re-run-the-analysis-with-different-parameters)
+ - [13. Raise issues, create feature requests or create a pull request with the upstream repo to merge any useful changes to the pipeline (optional)](#13-raise-issues-create-feature-requests-or-create-a-pull-request-with-the-upstream-repo-to-merge-any-useful-changes-to-the-pipeline-optional)
+
+## 1. Fork the pipeline repo to a personal or lab account
+
+See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) for help forking a repository
+
+## 2. Take the pipeline to the data on your local machine
+
+Clone the forked [human_genomics_pipeline](https://github.com/ESR-NZ/human_genomics_pipeline) repo into the same directory as your paired end fastq data to be processed.
+
+See [here](https://help.github.com/en/github/getting-started-with-github/fork-a-repo#keep-your-fork-synced) for help cloning a repository
+
+## 3. Setup files and directories
+
+Required folder structure and file naming convention:
+
+```bash
+
+.
+|___fastq/
+| |___sample1_1.fastq.gz
+| |___sample1_2.fastq.gz
+| |___sample2_1.fastq.gz
+| |___sample2_2.fastq.gz
+| |___sample3_1.fastq.gz
+| |___sample3_2.fastq.gz
+| |___sample4_1.fastq.gz
+| |___sample4_2.fastq.gz
+| |___sample5_1.fastq.gz
+| |___sample5_2.fastq.gz
+| |___sample6_1.fastq.gz
+| |___sample6_2.fastq.gz
+| |___ ...
+|
+|___human_genomics_pipeline/
+
+```
+
+If you're analysing cohort's of samples, you will need an additional directory with a [pedigree file](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format) for each cohort/family using the following folder structure and file naming convention:
+
+```bash
+
+.
+|___fastq/
+| |___sample1_1.fastq.gz
+| |___sample1_2.fastq.gz
+| |___sample2_1.fastq.gz
+| |___sample2_2.fastq.gz
+| |___sample3_1.fastq.gz
+| |___sample3_2.fastq.gz
+| |___sample4_1.fastq.gz
+| |___sample4_2.fastq.gz
+| |___sample5_1.fastq.gz
+| |___sample5_2.fastq.gz
+| |___sample6_1.fastq.gz
+| |___sample6_2.fastq.gz
+| |___ ...
+|
+|___pedigrees/
+| |___proband1_pedigree.ped
+| |___proband2_pedigree.ped
+| |___ ...
+|
+|___human_genomics_pipeline/
+
+```
+
+Requirements:
+
+- Input paired end fastq files need to identified with `_1` and `_2` (not `_R1` and `_R2`)
+- Currently, the filenames of the pedigree files need to be labelled with the name of the proband/individual affected with the disease phenotype in the cohort (we will be working towards removing this requirement)
+- Singletons and cohorts need to be run in separate pipeline runs
+- It is assumed that there is one proband/individual affected with the disease phenotype of interest in a given cohort (one individual with a value of 2 in the 6th column of a given pedigree file)
+
+### Test data
+
+The provided [test dataset](./test) can be used. Setup the test dataset before running the pipeline on this data - choose to setup to run either a single sample analysis or a cohort analysis with the `-a` flag. For example:
+
+```bash
+cd ./human_genomics_pipeline
+bash ./test/setup_test.sh -a cohort
+```
+
+## 4. Get prerequisite software/hardware
+
+For GPU accelerated runs, you'll need [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS and dependencies](https://www.nvidia.com/en-us/docs/parabricks/local-installation/). Talk to your system administrator to see if the HPC has this hardware and software available.
+
+Other software required to get setup and run the pipeline:
+
+- [Git](https://git-scm.com/) (tested with version 2.7.4)
+- [Conda](https://docs.conda.io/projects/conda/en/latest/index.html) (tested with version 4.8.2)
+- [Mamba](https://github.com/TheSnakePit/mamba) (tested with version 0.4.4) (note. [mamba can be installed via conda with a single command](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install))
+- [gsutil](https://pypi.org/project/gsutil/) (tested with version 4.52)
+- [gunzip](https://linux.die.net/man/1/gunzip) (tested with version 1.6)
+
+Most of this software is commonly pre-installed on HPC's, likely available as modules that can be loaded. Talk to your system administrator if you need help with this.
+
+## 5. Create a local copy of the [GATK resource bundle](https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle) (either b37 or hg38)
+
+### b37
+
+Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/gatk-legacy-bundles/b37?prefix=)
+
+```bash
+gsutil cp -r gs://gatk-legacy-bundles/b37 /where/to/download/
+```
+
+### hg38
+
+Download from [Google Cloud Bucket](https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0)
+
+```bash
+gsutil cp -r gs://genomics-public-data/resources/broad/hg38 /where/to/download/
+```
+
+## 6. Modify the configuration file
+
+Edit 'config.yaml' found within the config directory
+
+### Overall workflow
+
+Specify whether the data is to be analysed on it's own ('Single') or as a part of a cohort of samples ('Cohort'). For example:
+
+```yaml
+DATA: "Single"
+```
+
+Specify whether the pipeline should be GPU accelerated where possible (either 'Yes' or 'No', this requires [NVIDIA GPUs](https://www.nvidia.com/en-gb/graphics-cards/) and [NVIDIA CLARA PARABRICKS](https://www.nvidia.com/en-us/docs/parabricks/local-installation/))
+
+```yaml
+GPU_ACCELERATED: "No"
+```
+
+Set the the working directories to the reference human genome file (b37 or hg38). For example:
+
+```yaml
+REFGENOME: "/scratch/publicData/b37/human_g1k_v37_decoy.fasta"
+```
+
+Set the the working directory to your dbSNP database file (b37 or hg38). For example:
+
+```yaml
+dbSNP: "/scratch/publicData/b37/dbsnp_138.b37.vcf"
+```
+
+Set the the working directory to a temporary file directory. Make sure this is a location with a fair amount of memory space for large intermediate analysis files. For example:
+
+```yaml
+TEMPDIR: "/scratch/tmp/"
+```
+
+If analysing WES data, pass a design file (.bed) indicating the genomic regions that were sequenced (see [here](https://leahkemp.github.io/documentation/human_genomic_pipelines/design_files.html) for more information on accessing design files). Also set the level of padding by passing the amount of padding in base pairs. For example:
+
+*If NOT analysing WES data, leave these fields blank*
+
+```yaml
+WES:
+ # File path to the exome capture regions over which to operate
+ INTERVALS: "/scratch/publicData/sure_select_human_all_exon_V7/S31285117_Padded.bed"
+ # Padding (in bp) to add to each region
+ PADDING: "100"
+```
+
+### Pipeline resources
+
+These settings allow you to configure the resources per rule/sample
+
+Set the number of threads to use per sample/rule for multithreaded rules (`rule trim_galore_pe` and `rule bwa_mem`). Multithreading will significantly speed up these rules, however the improvements in speed will diminish beyond 8 threads. If desired, a different number of threads can be set for these multithreaded rules by utilising the `--set-threads` flag in the runscript (see [step 7](#7-modify-the-run-scripts)).
+
+```yaml
+THREADS: 8
+```
+
+Set the maximum memory usage per rule/sample (eg. '40g' for 40 gigabytes, this should suffice for exomes)
+
+```yaml
+MAXMEMORY: "40g"
+```
+
+Set the maximum number of GPU's to be used per rule/sample for gpu-accelerated runs (eg `1` for 1 GPU)
+
+```yaml
+GPU: 1
+```
+
+It is a good idea to consider the number of samples that you are processing. For example, if you set `THREADS: "8"` and set the maximum number of cores to be used by the pipeline in the run script to `-j/--cores 32` (see [step 7](#7-modify-the-run-scripts)), a maximum of 3 samples will be able to run at one time for these rules (if they are deployed at the same time), but each sample will complete faster. In contrast, if you set `THREADS: "1"` and `-j/--cores 32`, a maximum of 32 samples could be run at one time, but each sample will take longer to complete. This also needs to be considered when setting `MAXMEMORY` + `--resources mem_mb` and `GPU` + `--resources gpu`.
+
+#### Trimming
+
+Specify whether the raw fastq reads should be trimmed (either 'Yes' or 'No'). For example:
+
+```yaml
+TRIM: "Yes"
+```
+
+If trimming the raw fastq reads, set the [trim galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md) adapter trimming parameters. Choose one of the common adapters such as Illumina universal, Nextera transposase or Illumina small RNA with `--illumina`, `--nextera` or `--small_rna`. Alternatively, pass adapter sequences to the `-a` and `-a2` flags. If not set, trim galore will try to auto-detect the adapter based on the fastq reads.
+
+```yaml
+TRIMMING:
+ ADAPTERS: "--illumina"
+```
+
+### Base recalibration
+
+Pass the resources to be used to recalibrate bases with [gatk BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator) (this passes the resource files to the [`--known-sites`](https://gatk.broadinstitute.org/hc/en-us/articles/360036726891-BaseRecalibrator#--known-sites) flag), these known polymorphic sites will be used to exclude regions around known polymorphisms from base recalibration. Note. you can include as many or as few resources as you like, but you'll need at least one recalibration resource file. For example:
+
+```yaml
+RECALIBRATION:
+ RESOURCES:
+ - /scratch/publicData/b37/dbsnp_138.b37.vcf
+ - /scratch/publicData/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
+ - /scratch/publicData/b37/1000G_phase1.indels.b37.vcf
+```
+
+## 7. Modify the run scripts
+
+Set the number maximum number of cores to be used with the `--cores` flag and the maximum amount of memory to be used (in megabytes) with the `resources mem_mb=` flag. If running GPU accelerated, also set the maximum number of GPU's to be used with the `--resources gpu=` flag. For example:
+
+Dry run (dryrun.sh):
+
+```bash
+snakemake \
+--dryrun \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml
+```
+
+Full run (run.sh):
+
+```bash
+snakemake \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--latency-wait 120 \
+--configfile ../config/config.yaml
+```
+
+See the [snakemake documentation](https://snakemake.readthedocs.io/en/v4.5.1/executable.html#all-options) for additional run parameters.
+
+## 8. Create and activate a conda environment with python and snakemake installed
+
+```bash
+cd ./workflow/
+mamba env create -f pipeline_run_env.yml
+conda activate pipeline_run_env
+```
+
+## 9. Run the pipeline
+
+First carry out a dry run
+
+```bash
+bash dryrun.sh
+```
+
+If there are no issues, start a full run
+
+```bash
+bash run.sh
+```
+
+## 10. Evaluate the pipeline run
+
+Generate an interactive html report
+
+```bash
+bash report.sh
+```
+
+## 11. Commit and push to your forked version of the github repo
+
+To maintain reproducibility, commit and push:
+
+- All configuration files
+- All run scripts
+- The final report
+
+## 12. Repeat step 11 each time you re-run the analysis with different parameters
+
+## 13. Raise issues, create feature requests or create a pull request with the [upstream repo](https://github.com/ESR-NZ/human_genomics_pipeline) to merge any useful changes to the pipeline (optional)
+
+See [the README](https://github.com/ESR-NZ/human_genomics_pipeline/blob/dev/README.md#contribute-back) for info on how to contribute back to the pipeline!
diff --git a/dryrun.sh b/dryrun.sh
deleted file mode 100644
index 43fd481..0000000
--- a/dryrun.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/bin/bash -x
-
-snakemake \
--n -j 32 \
---use-conda \
---configfile config.yaml
\ No newline at end of file
diff --git a/envs/bwa.yaml b/envs/bwa.yaml
deleted file mode 100644
index 3427286..0000000
--- a/envs/bwa.yaml
+++ /dev/null
@@ -1,7 +0,0 @@
-channels:
- - bioconda
- - conda-forge
- - defaults
-dependencies:
- - bioconda::samtools =1.10
- - bioconda::bwa =0.7.17
diff --git a/envs/sambamba.yaml b/envs/sambamba.yaml
deleted file mode 100644
index 9b672a0..0000000
--- a/envs/sambamba.yaml
+++ /dev/null
@@ -1,6 +0,0 @@
-channels:
- - bioconda
- - conda-forge
- - defaults
-dependencies:
- - bioconda::sambamba =0.7.0
diff --git a/envs/trim_galore.yaml b/envs/trim_galore.yaml
deleted file mode 100644
index b384c03..0000000
--- a/envs/trim_galore.yaml
+++ /dev/null
@@ -1,6 +0,0 @@
-channels:
- - bioconda
- - conda-forge
- - defaults
-dependencies:
- - bioconda::trim-galore =0.6.5
diff --git a/images/jupyterLauncher.png b/images/jupyterLauncher.png
new file mode 100644
index 0000000..ecaeb61
Binary files /dev/null and b/images/jupyterLauncher.png differ
diff --git a/images/jupyter_login_labels_updated.png b/images/jupyter_login_labels_updated.png
new file mode 100644
index 0000000..a6b9c83
Binary files /dev/null and b/images/jupyter_login_labels_updated.png differ
diff --git a/images/nesi99991_screenshot.png b/images/nesi99991_screenshot.png
new file mode 100644
index 0000000..df1d426
Binary files /dev/null and b/images/nesi99991_screenshot.png differ
diff --git a/images/rulegraph_cohort.png b/images/rulegraph_cohort.png
new file mode 100644
index 0000000..87960f3
Binary files /dev/null and b/images/rulegraph_cohort.png differ
diff --git a/images/rulegraph_cohort_gpu.png b/images/rulegraph_cohort_gpu.png
new file mode 100644
index 0000000..aed8fb5
Binary files /dev/null and b/images/rulegraph_cohort_gpu.png differ
diff --git a/images/rulegraph_single.png b/images/rulegraph_single.png
new file mode 100644
index 0000000..30cc46a
Binary files /dev/null and b/images/rulegraph_single.png differ
diff --git a/images/rulegraph_single_gpu.png b/images/rulegraph_single_gpu.png
new file mode 100644
index 0000000..1b7d8e0
Binary files /dev/null and b/images/rulegraph_single_gpu.png differ
diff --git a/pipeline_run_env.yml b/pipeline_run_env.yml
deleted file mode 100644
index 4684503..0000000
--- a/pipeline_run_env.yml
+++ /dev/null
@@ -1,160 +0,0 @@
-name: pipeline_run_env
-channels:
- - bioconda
- - conda-forge
- - defaults
-dependencies:
- - _libgcc_mutex=0.1=conda_forge
- - _openmp_mutex=4.5=0_gnu
- - aioeasywebdav=2.4.0=py37_1000
- - aiohttp=3.6.2=py37h516909a_0
- - appdirs=1.4.3=py_1
- - async-timeout=3.0.1=py_1000
- - attrs=19.3.0=py_0
- - bcrypt=3.1.7=py37h8f50634_1
- - boto3=1.13.23=pyh9f0ad1d_0
- - botocore=1.16.23=pyh9f0ad1d_0
- - brotlipy=0.7.0=py37h8f50634_1000
- - bzip2=1.0.8=h516909a_2
- - ca-certificates=2020.4.5.1=hecc5488_0
- - cachetools=4.1.0=py_1
- - cairo=1.16.0=h3fc0475_1004
- - certifi=2020.4.5.1=py37hc8dfbb8_0
- - cffi=1.14.0=py37hd463f26_0
- - chardet=3.0.4=py37hc8dfbb8_1006
- - configargparse=1.2.3=pyh9f0ad1d_0
- - crc32c=2.0=py37h516909a_0
- - cryptography=2.9.2=py37hb09aad4_0
- - datrie=0.8.2=py37h8f50634_0
- - decorator=4.4.2=py_0
- - docutils=0.15.2=py37_0
- - dropbox=9.4.0=py_0
- - expat=2.2.9=he1b5a44_2
- - fftw=3.3.8=nompi_h7f3a6c3_1110
- - filechunkio=1.8=py_2
- - fontconfig=2.13.1=h1056068_1002
- - freetype=2.10.2=he06d7ca_0
- - ftputil=3.4=py_0
- - gettext=0.19.8.1=hc5be6a0_1002
- - ghostscript=9.18=1
- - giflib=5.1.9=h516909a_0
- - gitdb=4.0.5=py_0
- - gitpython=3.1.3=py_0
- - glib=2.64.3=h6f030ca_0
- - google-api-core=1.17.0=py37hc8dfbb8_0
- - google-auth=1.16.0=pyh9f0ad1d_0
- - google-cloud-core=1.3.0=py_0
- - google-cloud-storage=1.28.1=pyh9f0ad1d_0
- - google-resumable-media=0.5.0=py_1
- - googleapis-common-protos=1.51.0=py37hc8dfbb8_2
- - graphite2=1.3.13=he1b5a44_1001
- - graphviz=2.38.0=hf68f40c_1011
- - harfbuzz=2.4.0=hee91db6_4
- - icu=67.1=he1b5a44_0
- - idna=2.9=py_1
- - imagemagick=7.0.8_11=pl526hc610aec_0
- - importlib-metadata=1.6.0=py37hc8dfbb8_0
- - importlib_metadata=1.6.0=0
- - ipython_genutils=0.2.0=py_1
- - jbig=2.1=h516909a_2002
- - jinja2=2.11.2=pyh9f0ad1d_0
- - jmespath=0.10.0=pyh9f0ad1d_0
- - jpeg=9d=h516909a_0
- - jsonschema=3.2.0=py37hc8dfbb8_1
- - jupyter_core=4.6.3=py37hc8dfbb8_1
- - ld_impl_linux-64=2.34=h53a641e_4
- - libblas=3.8.0=14_openblas
- - libcblas=3.8.0=14_openblas
- - libffi=3.2.1=he1b5a44_1007
- - libgcc=7.2.0=h69d50b8_2
- - libgcc-ng=9.2.0=h24d8f2e_2
- - libgfortran-ng=7.5.0=hdf63c60_8
- - libgomp=9.2.0=h24d8f2e_2
- - libiconv=1.15=h516909a_1006
- - liblapack=3.8.0=14_openblas
- - libopenblas=0.3.7=h5ec1e0e_6
- - libpng=1.6.37=hed695b0_1
- - libprotobuf=3.12.3=h8b12597_0
- - libstdcxx-ng=9.2.0=hdf63c60_2
- - libtiff=4.1.0=hc3755c2_3
- - libtool=2.4.6=h14c3975_1002
- - libuuid=2.32.1=h14c3975_1000
- - libwebp=0.5.2=7
- - libxcb=1.13=h14c3975_1002
- - libxml2=2.9.10=h72b56ed_1
- - lz4-c=1.9.2=he1b5a44_1
- - markupsafe=1.1.1=py37h8f50634_1
- - multidict=4.7.5=py37h8f50634_1
- - nbformat=5.0.6=py_0
- - ncurses=6.1=hf484d3e_1002
- - networkx=2.4=py_1
- - numpy=1.18.4=py37h8960a57_0
- - openjpeg=2.3.1=h981e76c_3
- - openssl=1.1.1g=h516909a_0
- - pandas=1.0.4=py37h0da4684_0
- - pango=1.40.14=he7ab937_1005
- - paramiko=2.7.1=py37_0
- - pcre=8.44=he1b5a44_0
- - perl=5.26.2=h516909a_1006
- - pip=20.1.1=py_1
- - pixman=0.38.0=h516909a_1003
- - pkg-config=0.29.2=h516909a_1006
- - prettytable=0.7.2=py_3
- - protobuf=3.12.3=py37h3340039_0
- - psutil=5.7.0=py37h8f50634_1
- - pthread-stubs=0.4=h14c3975_1001
- - pyasn1=0.4.8=py_0
- - pyasn1-modules=0.2.7=py_0
- - pycparser=2.20=py_0
- - pygments=2.6.1=py_0
- - pygraphviz=1.5=py37h8f50634_1002
- - pynacl=1.3.0=py37h516909a_1001
- - pyopenssl=19.1.0=py_1
- - pyrsistent=0.16.0=py37h8f50634_0
- - pysftp=0.2.9=py_1
- - pysocks=1.7.1=py37hc8dfbb8_1
- - python=3.7.6=cpython_h8356626_6
- - python-dateutil=2.8.1=py_0
- - python-irodsclient=0.8.2=py_0
- - python_abi=3.7=1_cp37m
- - pytz=2020.1=pyh9f0ad1d_0
- - pyyaml=5.3.1=py37h8f50634_0
- - ratelimiter=1.2.0=py37_1000
- - readline=8.0=hf8c457e_0
- - requests=2.23.0=pyh8c360ce_2
- - rsa=4.0=py_0
- - s3transfer=0.3.3=py37hc8dfbb8_1
- - setuptools=47.1.1=py37hc8dfbb8_0
- - six=1.15.0=pyh9f0ad1d_0
- - smmap=3.0.4=pyh9f0ad1d_0
- - snakemake=5.14.0=1
- - snakemake-minimal=5.14.0=py_1
- - sqlite=3.30.1=hcee41ef_0
- - tk=8.6.10=hed695b0_0
- - toposort=1.5=py_3
- - traitlets=4.3.3=py37hc8dfbb8_1
- - urllib3=1.25.9=py_0
- - wheel=0.34.2=py_1
- - wrapt=1.12.1=py37h8f50634_1
- - xmlrunner=1.7.7=py_0
- - xorg-kbproto=1.0.7=h14c3975_1002
- - xorg-libice=1.0.10=h516909a_0
- - xorg-libsm=1.2.3=h84519dc_1000
- - xorg-libx11=1.6.9=h516909a_0
- - xorg-libxau=1.0.9=h14c3975_0
- - xorg-libxdmcp=1.1.3=h516909a_0
- - xorg-libxext=1.3.4=h516909a_0
- - xorg-libxpm=3.5.13=h516909a_0
- - xorg-libxrender=0.9.10=h516909a_1002
- - xorg-libxt=1.1.5=h516909a_1003
- - xorg-renderproto=0.11.1=h14c3975_1002
- - xorg-xextproto=7.3.0=h14c3975_1002
- - xorg-xproto=7.0.31=h14c3975_1007
- - xz=5.2.5=h516909a_0
- - yaml=0.2.5=h516909a_0
- - yarl=1.3.0=py37h516909a_1000
- - zipp=3.1.0=py_0
- - zlib=1.2.11=h516909a_1006
- - zstd=1.4.4=h6597ccf_3
-prefix: /home/lkemp/anaconda3/envs/pipeline_run_env
-
diff --git a/report.sh b/report.sh
deleted file mode 100644
index 1852b96..0000000
--- a/report.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/bin/bash -x
-
-snakemake \
---report report.html \
---configfile config.yaml \
---report-stylesheet ESR_stylesheet.css
\ No newline at end of file
diff --git a/report/coverage.rst b/report/coverage.rst
deleted file mode 100644
index b979ac6..0000000
--- a/report/coverage.rst
+++ /dev/null
@@ -1 +0,0 @@
-Coverage calculated with `mosdepth `_
\ No newline at end of file
diff --git a/report/quality_checks_pre_trim.rst b/report/quality_checks_pre_trim.rst
deleted file mode 100644
index 839fb8d..0000000
--- a/report/quality_checks_pre_trim.rst
+++ /dev/null
@@ -1 +0,0 @@
-`FastQC `_ report (compiled into a `MultiQC report `_) providing general statistics on the sequence data (before trimming)
\ No newline at end of file
diff --git a/report/trimming_R1.rst b/report/trimming_R1.rst
deleted file mode 100644
index e6bbec6..0000000
--- a/report/trimming_R1.rst
+++ /dev/null
@@ -1 +0,0 @@
-Run logs for trimming with `trim_galore `_ (R1)
\ No newline at end of file
diff --git a/report/trimming_R2.rst b/report/trimming_R2.rst
deleted file mode 100644
index 80d85ab..0000000
--- a/report/trimming_R2.rst
+++ /dev/null
@@ -1 +0,0 @@
-Run logs for trimming with `trim_galore `_ (R2)
\ No newline at end of file
diff --git a/report/workflow.rst b/report/workflow.rst
deleted file mode 100644
index e717e61..0000000
--- a/report/workflow.rst
+++ /dev/null
@@ -1,6 +0,0 @@
-human_genomics_pipeline processes paired-end sequencing data (WGS or WES) using `bwa `_, `sambamba `_ and `GATK4 `_. It is based on the GATK4 best practices for `data pre-processing for variant discovery `_
-
-Run parameters:
- * Input data type: {{ snakemake.config["DATA"] }}
- * Reference genome: {{ snakemake.config["REFGENOME"] }}
- * dbSNP database {{ snakemake.config["dbSNP"] }}
\ No newline at end of file
diff --git a/results/README.md b/results/README.md
new file mode 100644
index 0000000..208ba22
--- /dev/null
+++ b/results/README.md
@@ -0,0 +1 @@
+Pipeline outputs will reside here
\ No newline at end of file
diff --git a/rulegraph.png b/rulegraph.png
deleted file mode 100644
index b7a2274..0000000
Binary files a/rulegraph.png and /dev/null differ
diff --git a/rules/bwa_map.smk b/rules/bwa_map.smk
deleted file mode 100644
index c31b313..0000000
--- a/rules/bwa_map.smk
+++ /dev/null
@@ -1,18 +0,0 @@
-rule bwa_map:
- input:
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
- R1 = "trim_galore/{sample}_R1_val_1.fq.gz",
- R2 = "trim_galore/{sample}_R2_val_2.fq.gz"
- output:
- temp("mapped/{sample}_bwamem.bam")
- log:
- "logs/bwamem/{sample}.log"
- benchmark:
- "benchmarks/bwamem/{sample}.bwamem"
- conda:
- "../envs/bwa.yaml"
- threads: 32
- message:
- "Mapping sequences against a reference human genome with BWA-MEM"
- shell:
- "bwa mem {input.refgenome} {input.R1} {input.R2} -t {threads} -M | samtools view -@ {threads} -Sbh - > {output}"
\ No newline at end of file
diff --git a/rules/fastqc.smk b/rules/fastqc.smk
deleted file mode 100644
index 92e9990..0000000
--- a/rules/fastqc.smk
+++ /dev/null
@@ -1,18 +0,0 @@
-rule fastqc:
- input:
- R1 = "../fastq/{sample}_R1.fastq.gz",
- R2 = "../fastq/{sample}_R2.fastq.gz"
- output:
- html = ["qc/fastqc/{sample}_R1_fastqc.html", "qc/fastqc/{sample}_R2_fastqc.html"],
- zip = ["qc/fastqc/{sample}_R1_fastqc.zip", "qc/fastqc/{sample}_R2_fastqc.zip"]
- log:
- "logs/fastqc/{sample}.log"
- benchmark:
- "benchmarks/fastqc/{sample}.fastqc"
- conda:
- "../envs/fastqc.yaml"
- threads: 2
- message:
- "Undertaking quality control checks on raw sequence data"
- shell:
- "fastqc {input.R1} {input.R2} --outdir qc/fastqc 2> {log} --threads {threads}"
\ No newline at end of file
diff --git a/rules/gatk_add_replace_read_groups.smk b/rules/gatk_add_replace_read_groups.smk
deleted file mode 100644
index f59d142..0000000
--- a/rules/gatk_add_replace_read_groups.smk
+++ /dev/null
@@ -1,19 +0,0 @@
-rule gatk4_AddOrReplaceReadGroups:
- input:
- bams = "mapped/{sample}_bwamem_sorted_mkdups.bam",
- index = "mapped/{sample}_bwamem_sorted_mkdups.bam.bai"
- output:
- temp("mapped/{sample}_sorted_mkdups_rgreplaced.bam")
- params:
- tdir = expand("{tdir}", tdir = config['TEMPDIR']),
- other = "-ID 4 -LB lib1 -PL illumina -PU unit1 -SM {sample}"
- log:
- "logs/gatk_readgroup/{sample}.log"
- benchmark:
- "benchmarks/gatk_readgroup/{sample}.readgroup"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Assigning all reads to a single new read-group"
- shell:
- "gatk AddOrReplaceReadGroups -I {input.bams} -O {output} --TMP_DIR {params.tdir} {params.other}"
\ No newline at end of file
diff --git a/rules/gatk_apply_bqsr.smk b/rules/gatk_apply_bqsr.smk
deleted file mode 100644
index bcbacfd..0000000
--- a/rules/gatk_apply_bqsr.smk
+++ /dev/null
@@ -1,21 +0,0 @@
-rule gatk4_ApplyBQSR:
- input:
- bams = "mapped/{sample}_sorted_mkdups_rgreplaced.bam",
- index = "mapped/{sample}_sorted_mkdups_rgreplaced.bam.bai",
- recal = "mapped/{sample}_recalibration_report.grp",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
- output:
- "mapped/{sample}_bwa_recal.bam"
- params:
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS'])
- log:
- "logs/gatk_recal/{sample}.log"
- benchmark:
- "benchmarks/gatk_recal/{sample}.gatkrecal"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Applying base quality score recalibration and producing a recalibrated BAM file"
- shell:
- "gatk ApplyBQSR -I {input.bams} -bqsr {input.recal} -R {input.refgenome} -O {output} {params.padding} {params.intervals}"
\ No newline at end of file
diff --git a/rules/gatk_base_recalibrator.smk b/rules/gatk_base_recalibrator.smk
deleted file mode 100644
index a48cede..0000000
--- a/rules/gatk_base_recalibrator.smk
+++ /dev/null
@@ -1,21 +0,0 @@
-rule gatk4_BaseRecalibrator:
- input:
- bams = "mapped/{sample}_sorted_mkdups_rgreplaced.bam",
- index = "mapped/{sample}_sorted_mkdups_rgreplaced.bam.bai",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
- output:
- report("mapped/{sample}_recalibration_report.grp", caption = "../report/recalibration.rst", category = "Base recalibration")
- params:
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']),
- recalibration_resources = expand("{recalibration_resources}", recalibration_resources = config['RECALIBRATION']['RESOURCES'])
- log:
- "logs/gatk_recalrep/{sample}.log"
- benchmark:
- "benchmarks/gatk_recalrep/{sample}.gatkrecalrep"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Generating a recalibration table for the following rule (Base Quality Score Recalibration)"
- shell:
- "gatk BaseRecalibrator -I {input.bams} -R {input.refgenome} -O {output} {params.padding} {params.intervals} {params.recalibration_resources}"
\ No newline at end of file
diff --git a/rules/gatk_combine_gvcf.smk b/rules/gatk_combine_gvcf.smk
deleted file mode 100644
index a7d3815..0000000
--- a/rules/gatk_combine_gvcf.smk
+++ /dev/null
@@ -1,27 +0,0 @@
-rule gatk4_CombineGVCFs:
- input:
- # Fix wildcard naming of three input vcfs
- vcf1 = "vcf/{cohort}_{sample}_haplotype.vcf",
- vcf2 = "vcf/{cohort}_{sample}_haplotype.vcf",
- vcf3 = "vcf/{cohort}_{sample}_haplotype.vcf",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
- dbsnp = expand("{dbsnp}", dbsnp = config['dbSNP'])
- output:
- temp("vcf/{sample}_haplotype_gvcf_combined.vcf")
- params:
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']),
- other = "-G StandardAnnotation -G AS_StandardAnnotation"
- log:
- "logs/gatk_combinegvcf/{sample}.log"
- benchmark:
- "benchmarks/gatk_combinegvcf/{sample}.gatkcombinegvcf"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Merging of GVCFs"
- shell:
- """
- gatk --java-options "-Xmx64g -Xms64g" CombineGVCFs \
- -R {input.refgenome} --variant {input.vcf1} --variant {input.vcf2} --variant {input.vcf3} -O {output} {params.padding} {params.intervals} {params.other}
- """
\ No newline at end of file
diff --git a/rules/gatk_genotype_gvcf.smk b/rules/gatk_genotype_gvcf.smk
deleted file mode 100644
index f6e0eff..0000000
--- a/rules/gatk_genotype_gvcf.smk
+++ /dev/null
@@ -1,23 +0,0 @@
-rule gatk4_GenotypeGVCFs:
- input:
- vcf = "vcf/{sample}_haplotype_gvcf_combined.vcf",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
- output:
- "vcf/{sample}_raw_snps_indels_AS_g.vcf"
- params:
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']),
- other = "-G StandardAnnotation -G AS_StandardAnnotation"
- log:
- "logs/gatk_genotype_gvcf/{sample}.log"
- benchmark:
- "benchmarks/gatk_genotype_gvcf/{sample}.gatkgenotypegvcf"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Performing joint genotyping"
- shell:
- """
- gatk --java-options "-Xmx64g -Xms64g" GenotypeGVCFs \
- -R {input.refgenome} -V {input.vcf} -O {output} {params.padding} {params.intervals} {params.other}
- """
\ No newline at end of file
diff --git a/rules/gatk_haplotype_caller_gvcf.smk b/rules/gatk_haplotype_caller_gvcf.smk
deleted file mode 100644
index dbaa662..0000000
--- a/rules/gatk_haplotype_caller_gvcf.smk
+++ /dev/null
@@ -1,23 +0,0 @@
-rule gatk4_HaplotypeCaller_GVCF:
- input:
- bams = "mapped/{sample}_bwa_recal.bam",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
- dbsnp = expand("{dbsnp}", dbsnp = config['dbSNP'])
- output:
- "vcf/{sample}_haplotype_gvcf.vcf"
- params:
- tdir = expand("{tdir}", tdir = config['TEMPDIR']),
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS']),
- other = "-ERC GVCF"
- log:
- "logs/gatk_haplocall/{sample}.log"
- benchmark:
- "benchmarks/gatk_haplocall/{sample}.gatkhaplocall"
- conda:
- "../envs/gatk4.yaml"
- threads: 4
- message:
- "Calling germline SNPs and indels via local re-assembly of haplotypes"
- shell:
- "gatk HaplotypeCaller -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output} --tmp-dir {params.tdir} --native-pair-hmm-threads {threads} {params.padding} {params.intervals} {params.other}"
\ No newline at end of file
diff --git a/rules/gatk_haplotype_caller_single.smk b/rules/gatk_haplotype_caller_single.smk
deleted file mode 100644
index e9f04bd..0000000
--- a/rules/gatk_haplotype_caller_single.smk
+++ /dev/null
@@ -1,21 +0,0 @@
-rule gatk4_HaplotypeCaller_single:
- input:
- bams = "mapped/{sample}_bwa_recal.bam",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
- dbsnp = expand("{dbsnp}", dbsnp = config['dbSNP'])
- output:
- "vcf/{sample}_raw_snps_indels_AS_g.vcf"
- params:
- tdir = expand("{tdir}", tdir = config['TEMPDIR']),
- padding = expand("{padding}", padding = config['WES']['PADDING']),
- intervals = expand("{intervals}", intervals = config['WES']['INTERVALS'])
- log:
- "logs/gatk_haplocall/{sample}.log"
- benchmark:
- "benchmarks/gatk_haplocall/{sample}.gatkhaplocall"
- conda:
- "../envs/gatk4.yaml"
- message:
- "Calling germline SNPs and indels via local re-assembly of haplotypes"
- shell:
- "gatk HaplotypeCaller -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output} --tmp-dir {params.tdir} {params.padding} {params.intervals}"
\ No newline at end of file
diff --git a/rules/mosdepth.smk b/rules/mosdepth.smk
deleted file mode 100644
index 941a665..0000000
--- a/rules/mosdepth.smk
+++ /dev/null
@@ -1,20 +0,0 @@
-rule mosdepth:
- input:
- bams = "mapped/{sample}_bwa_recal.bam",
- refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
- output:
- "coverage/{sample}.per-base.bed.gz",
- "coverage/{sample}.per-base.bed.gz.csi",
- report("coverage/{sample}.mosdepth.summary.txt", caption = "../report/coverage.rst", category = "Coverage"),
- report("coverage/{sample}.mosdepth.global.dist.txt", caption = "../report/coverage.rst", category = "Coverage")
- log:
- "logs/mosdepth/{sample}.log"
- benchmark:
- "benchmarks/mosdepth/{sample}.mosdepth"
- conda:
- "../envs/mosdepth.yaml"
- threads: 16
- message:
- "Undertaking fast BAM/CRAM depth calculation"
- shell:
- "mosdepth -f {input.refgenome} -t {threads} coverage/{wildcards.sample} {input.bams}"
\ No newline at end of file
diff --git a/rules/multiqc_post_trim.smk b/rules/multiqc_post_trim.smk
deleted file mode 100644
index bb12192..0000000
--- a/rules/multiqc_post_trim.smk
+++ /dev/null
@@ -1,11 +0,0 @@
-rule multiqc_post_trim:
- input:
- expand(["trim_galore/{sample}_R1_val_1_fastqc.zip", "trim_galore/{sample}_R2_val_2_fastqc.zip"], sample = SAMPLES)
- output:
- report("qc/multiqc/post_trim_multiqc_report.html", caption = "../report/quality_checks_post_trim.rst", category = "Quality checks")
- conda:
- "../envs/multiqc.yaml"
- message:
- "Searching for analysis logs to compile a HTML report"
- shell:
- "multiqc {input} -o qc/multiqc/ -i post_trim"
\ No newline at end of file
diff --git a/rules/multiqc_pre_trim.smk b/rules/multiqc_pre_trim.smk
deleted file mode 100644
index 553a69f..0000000
--- a/rules/multiqc_pre_trim.smk
+++ /dev/null
@@ -1,11 +0,0 @@
-rule multiqc_pre_trim:
- input:
- expand(["qc/fastqc/{sample}_R1_fastqc.zip", "qc/fastqc/{sample}_R2_fastqc.zip"], sample = SAMPLES)
- output:
- report("qc/multiqc/pre_trim_multiqc_report.html", caption = "../report/quality_checks_pre_trim.rst", category = "Quality checks")
- conda:
- "../envs/multiqc.yaml"
- message:
- "Searching for analysis logs to compile a HTML report"
- shell:
- "multiqc {input} -o qc/multiqc/ -i pre_trim"
\ No newline at end of file
diff --git a/rules/sambamba_index.smk b/rules/sambamba_index.smk
deleted file mode 100644
index 0a678d4..0000000
--- a/rules/sambamba_index.smk
+++ /dev/null
@@ -1,16 +0,0 @@
-rule sambamba_index:
- input:
- "mapped/{sample}_bwamem_sorted_mkdups.bam"
- output:
- temp("mapped/{sample}_bwamem_sorted_mkdups.bam.bai")
- log:
- "logs/sambamba_index/{sample}.log"
- benchmark:
- "benchmarks/sambamba_index/{sample}.sambamba"
- conda:
- "../envs/sambamba.yaml"
- threads: 8
- message:
- "Building index files for BAM files"
- shell:
- "sambamba index -p {input} -t {threads}"
\ No newline at end of file
diff --git a/rules/sambamba_index_rgadd.smk b/rules/sambamba_index_rgadd.smk
deleted file mode 100644
index 23e70d9..0000000
--- a/rules/sambamba_index_rgadd.smk
+++ /dev/null
@@ -1,16 +0,0 @@
-rule sambamba_index_rgadd:
- input:
- "mapped/{sample}_sorted_mkdups_rgreplaced.bam"
- output:
- temp("mapped/{sample}_sorted_mkdups_rgreplaced.bam.bai")
- log:
- "logs/sambamba_index/{sample}_rg.log"
- benchmark:
- "benchmarks/sambamba_index/{sample}_rg.sambamba"
- conda:
- "../envs/sambamba.yaml"
- threads: 8
- message:
- "Building index files for BAM files"
- shell:
- "sambamba index -p {input} -t {threads}"
\ No newline at end of file
diff --git a/rules/sambamba_mkdups.smk b/rules/sambamba_mkdups.smk
deleted file mode 100644
index 9b051a6..0000000
--- a/rules/sambamba_mkdups.smk
+++ /dev/null
@@ -1,19 +0,0 @@
-rule sambamba_mkdups:
- input:
- "mapped/{sample}_bwamem_sorted.bam"
- output:
- temp("mapped/{sample}_bwamem_sorted_mkdups.bam")
- params:
- tdir = expand("{tdir}", tdir = config['TEMPDIR']),
- other = "--sort-buffer-size=6144 --overflow-list-size=600000 --hash-table-size=600000"
- log:
- "logs/sambamba_mkdups/{sample}.log"
- benchmark:
- "benchmarks/sambamba_mkdups/{sample}.sambamba"
- conda:
- "../envs/sambamba.yaml"
- threads: 4
- message:
- "Finding duplicate reads in BAM files"
- shell:
- "sambamba markdup -p {input} {output} --tmpdir={params.tdir} {params.other} -t {threads}"
\ No newline at end of file
diff --git a/rules/sambamba_sort.smk b/rules/sambamba_sort.smk
deleted file mode 100644
index 68591c0..0000000
--- a/rules/sambamba_sort.smk
+++ /dev/null
@@ -1,20 +0,0 @@
-rule sambamba_sort:
- input:
- "mapped/{sample}_bwamem.bam"
- output:
- bams = temp("mapped/{sample}_bwamem_sorted.bam"),
- index = temp("mapped/{sample}_bwamem_sorted.bam.bai")
- params:
- tdir = expand("{tdir}", tdir = config['TEMPDIR']),
- other = "-m 6G"
- log:
- "logs/sambamba_sort/{sample}.log"
- benchmark:
- "benchmarks/sambamba_sort/{sample}.sambamba"
- conda:
- "../envs/sambamba.yaml"
- threads: 16
- message:
- "Sorting BAM files"
- shell:
- "sambamba sort -p {input} -o {output.bams} --tmpdir={params.tdir} {params.other} -t {threads}"
\ No newline at end of file
diff --git a/rules/trim_galore_pe.smk b/rules/trim_galore_pe.smk
deleted file mode 100644
index 6b85067..0000000
--- a/rules/trim_galore_pe.smk
+++ /dev/null
@@ -1,24 +0,0 @@
-rule trim_galore_pe:
- input:
- R1 = "../fastq/{sample}_R1.fastq.gz",
- R2 = "../fastq/{sample}_R2.fastq.gz"
- output:
- "trim_galore/{sample}_R1_val_1.fq.gz",
- "trim_galore/{sample}_R2_val_2.fq.gz",
- "trim_galore/{sample}_R1_val_1_fastqc.zip",
- "trim_galore/{sample}_R2_val_2_fastqc.zip",
- report("trim_galore/{sample}_R1.fastq.gz_trimming_report.txt", caption = "../report/trimming_R1.rst", category = "Trimming"),
- report("trim_galore/{sample}_R2.fastq.gz_trimming_report.txt", caption = "../report/trimming_R2.rst", category = "Trimming")
- params:
- "--illumina --fastqc -q 20"
- log:
- "logs/trim_galore/{sample}.log"
- benchmark:
- "benchmarks/trim_galore/{sample}.trim"
- conda:
- "../envs/trim_galore.yaml"
- threads: 16
- message:
- "Applying quality and adapter trimming of input fastq files: {input.R1} and {input.R2}"
- shell:
- "trim_galore --paired {input.R1} {input.R2} -o trim_galore/ {params} --cores {threads}"
\ No newline at end of file
diff --git a/run.sh b/run.sh
deleted file mode 100644
index 69b6017..0000000
--- a/run.sh
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/bin/bash -x
-
-snakemake \
--j 32 \
---use-conda \
---configfile config.yaml
\ No newline at end of file
diff --git a/test/cohort/fastq/NA24631_1.fastq.gz b/test/cohort/fastq/NA24631_1.fastq.gz
new file mode 100644
index 0000000..a8dc360
Binary files /dev/null and b/test/cohort/fastq/NA24631_1.fastq.gz differ
diff --git a/test/cohort/fastq/NA24631_2.fastq.gz b/test/cohort/fastq/NA24631_2.fastq.gz
new file mode 100644
index 0000000..370972f
Binary files /dev/null and b/test/cohort/fastq/NA24631_2.fastq.gz differ
diff --git a/test/cohort/fastq/NA24694_1.fastq.gz b/test/cohort/fastq/NA24694_1.fastq.gz
new file mode 100644
index 0000000..a9a1901
Binary files /dev/null and b/test/cohort/fastq/NA24694_1.fastq.gz differ
diff --git a/test/cohort/fastq/NA24694_2.fastq.gz b/test/cohort/fastq/NA24694_2.fastq.gz
new file mode 100644
index 0000000..5141473
Binary files /dev/null and b/test/cohort/fastq/NA24694_2.fastq.gz differ
diff --git a/test/cohort/fastq/NA24695_1.fastq.gz b/test/cohort/fastq/NA24695_1.fastq.gz
new file mode 100644
index 0000000..de9ea50
Binary files /dev/null and b/test/cohort/fastq/NA24695_1.fastq.gz differ
diff --git a/test/cohort/fastq/NA24695_2.fastq.gz b/test/cohort/fastq/NA24695_2.fastq.gz
new file mode 100644
index 0000000..6b4e285
Binary files /dev/null and b/test/cohort/fastq/NA24695_2.fastq.gz differ
diff --git a/test/cohort/pedigrees/NA24631_pedigree.ped b/test/cohort/pedigrees/NA24631_pedigree.ped
new file mode 100644
index 0000000..db716cb
--- /dev/null
+++ b/test/cohort/pedigrees/NA24631_pedigree.ped
@@ -0,0 +1,3 @@
+NA24631 NA24631 NA24694 NA24695 1 2
+NA24631 NA24694 0 0 1 1
+NA24631 NA24695 0 0 2 1
diff --git a/test/setup_test.sh b/test/setup_test.sh
new file mode 100644
index 0000000..72f347d
--- /dev/null
+++ b/test/setup_test.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+# Move test data to the correct location
+
+helpFunction()
+{
+ echo ""
+ echo "Usage: $0 -a parameterA"
+ echo -e "\t-a choose to setup to run either a single sample analysis (-a single) or a cohort analysis (-a cohort)"
+ exit 1 # Exit script after printing help
+}
+
+while getopts "a:" opt
+do
+ case "$opt" in
+ a ) parameterA="$OPTARG" ;;
+ ? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
+ esac
+done
+
+# Print helpFunction in case parameters are empty
+if [ -z "$parameterA" ]
+then
+ echo "Value for parameter not provided";
+ helpFunction
+fi
+
+# Begin script in case all parameters are correct
+cp -r ./test/"$parameterA"/* ../
\ No newline at end of file
diff --git a/test/single/fastq/NA24631_1.fastq.gz b/test/single/fastq/NA24631_1.fastq.gz
new file mode 100644
index 0000000..a8dc360
Binary files /dev/null and b/test/single/fastq/NA24631_1.fastq.gz differ
diff --git a/test/single/fastq/NA24631_2.fastq.gz b/test/single/fastq/NA24631_2.fastq.gz
new file mode 100644
index 0000000..370972f
Binary files /dev/null and b/test/single/fastq/NA24631_2.fastq.gz differ
diff --git a/test/test.md b/test/test.md
new file mode 100644
index 0000000..894266f
--- /dev/null
+++ b/test/test.md
@@ -0,0 +1,5 @@
+# Test dataset
+
+This directory contains a reduced dataset of the publicly available [Han Chinese GIAB trio](https://github.com/genome-in-a-bottle/about_GIAB) that can be used to test running this pipeline on a new machine, or test pipeline developments.
+
+Detailed documentation for how this test dataset was created can be found at https://github.com/leahkemp/documentation/blob/master/human_genomic_pipelines/create_test_dataset.md
\ No newline at end of file
diff --git a/workflow/Snakefile b/workflow/Snakefile
new file mode 100644
index 0000000..ea9240f
--- /dev/null
+++ b/workflow/Snakefile
@@ -0,0 +1,200 @@
+"""
+Author: Miles Benton and Leah Kemp
+Affiliation: ESR
+Aim: A simple Snakemake workflow to process paired-end sequencing data (WGS) using bwa/GATK4. Designed to be used before vcf_annotation_pipeline.
+Date created: 2019-08-21
+Modified: 2020-10-09
+"""
+
+##### Set up wildcards #####
+
+# Define samples from fastq dir and families/cohorts from pedigree dir using wildcards
+SAMPLES, = glob_wildcards("../../fastq/{sample}_1.fastq.gz")
+FAMILIES, = glob_wildcards("../../pedigrees/{family}_pedigree.ped")
+
+##### Setup helper functions #####
+import csv
+import glob
+
+def get_input_fastq(command):
+ """Return a string which defines the input fastq files for the bwa_mem and pbrun_germline rules.
+ This changes based on the user configurable options for trimming
+ """
+
+ input_files = ""
+
+ if config['TRIM'] == "No" or config['TRIM'] == "no":
+ input_files = ["../../fastq/{sample}_1.fastq.gz", "../../fastq/{sample}_2.fastq.gz"]
+ if config['TRIM'] == "Yes" or config['TRIM'] == "yes":
+ input_files = ["../results/trimmed/{sample}_1_val_1.fq.gz", "../results/trimmed/{sample}_2_val_2.fq.gz"]
+
+ return input_files
+
+def get_output_vcf(config):
+ """Return a string which defines the output vcf files for the pbrun_germline rule. This changes based on the
+ user configurable options for running single samples or cohorts of samples
+ """
+
+ vcf = ""
+
+ if config['DATA'] == "Single" or config['DATA'] == 'single':
+ vcf = "../results/called/{sample}_raw_snps_indels.vcf"
+ if config['DATA'] == "Cohort" or config['DATA'] == 'cohort':
+ vcf = "../results/called/{sample}_raw_snps_indels_tmp.g.vcf"
+
+ return vcf
+
+def get_params(command):
+ """Return a string which defines some parameters for the pbrun_germline rule. This changes based on the
+ user configurable options for running single samples or cohorts of samples
+ """
+
+ params = ""
+
+ if config['DATA'] == "Single" or config['DATA'] == 'single':
+ params = ""
+ if config['DATA'] == "Cohort" or config['DATA'] == 'cohort':
+ params = "--gvcf"
+
+ return params
+
+def get_gatk_combinegvcf_command(family):
+ """Return a string, a portion of the gatk CombineGVCF command which defines individuals which should be combined. This
+ command is used by the gatk_CombineGVCFs rule. For a particular family, we construct the gatk command by adding
+ -V for each individual (defined by individual id column in the pedigree file)
+ """
+ filename = "../../pedigrees/" + str(family) + "_pedigree.ped"
+
+ command = ""
+ with open(filename, newline = '') as pedigree:
+
+ pedigree_reader = csv.DictReader(pedigree, fieldnames = ('family', 'individual_id', 'paternal_id', 'maternal_id', 'sex', 'phenotype'), delimiter='\t')
+ for individual in pedigree_reader:
+ command += "-V ../results/called/" + individual['individual_id'] + "_raw_snps_indels_tmp.g.vcf "
+
+ return command
+
+def get_parabricks_combinegvcf_command(family):
+ """Return a string, a portion of the gatk CombineGVCF command which defines individuals which should be combined. This
+ command is used by the gatk_triocombinegvcf rule. For a particular family, we construct the gatk command by adding
+ -V for each individual (defined by individual id column in the pedigree file)
+ """
+ filename = "../../pedigrees/" + str(family) + "_pedigree.ped"
+
+ command = ""
+ with open(filename, newline = '') as pedigree:
+
+ pedigree_reader = csv.DictReader(pedigree, fieldnames = ('family', 'individual_id', 'paternal_id', 'maternal_id', 'sex', 'phenotype'), delimiter='\t')
+ for individual in pedigree_reader:
+ command += "--in-gvcf ../results/called/" + individual['individual_id'] + "_raw_snps_indels_tmp.g.vcf "
+
+ return command
+
+def get_recal_resources_command(resource):
+ """Return a string, a portion of the gatk BaseRecalibrator command (used in the gatk_BaseRecalibrator and the
+ parabricks_germline rules) which dynamically includes each of the recalibration resources defined by the user
+ in the configuration file. For each recalibration resource (element in the list), we construct the command by
+ adding either --knownSites (for parabricks) or --known-sites (for gatk4)
+ """
+
+ command = ""
+
+ for resource in config['RECALIBRATION']['RESOURCES']:
+ if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes":
+ command += "--knownSites " + resource + " "
+
+ if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no":
+ command += "--known-sites " + resource + " "
+
+ return command
+
+def get_wes_intervals_command(resource):
+ """Return a string, a portion of the gatk command's (used in several gatk rules) which builds a flag
+ formatted for gatk based on the configuration file. We construct the command by adding either --interval-file
+ (for parabricks) or --L (for gatk4) . If the user provides nothing for these configurable
+ options, an empty string is returned
+ """
+
+ command = ""
+
+ if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes":
+ command = "--interval-file " + config['WES']['PADDING'] + " "
+ if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no":
+ command = "--L " + config['WES']['PADDING'] + " "
+ if config['WES']['INTERVALS'] == "":
+ command = ""
+
+ return command
+
+def get_wes_padding_command(resource):
+ """Return a string, a portion of the gatk command's (used in several gatk rules) which builds a flag
+ formatted for gatk based on the configuration file. We construct the command by adding either --interval
+ (for parabricks) or --ip (for gatk4) . If the user provides nothing for these configurable
+ options, an empty string is returned
+ """
+
+ command = ""
+
+ if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes":
+ command = "--interval " + config['WES']['PADDING'] + " "
+ if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no":
+ command = "--ip " + config['WES']['PADDING'] + " "
+ if config['WES']['PADDING'] == "":
+ command = ""
+
+ return command
+
+#### Set up report #####
+
+report: "report/workflow.rst"
+
+##### Target rules #####
+
+if config['DATA'] == "Single" or config['DATA'] == 'single':
+ rule all:
+ input:
+ "../results/qc/multiqc_report.html",
+ expand("../results/mapped/{sample}_recalibrated.bam", sample = SAMPLES),
+ expand("../results/called/{sample}_raw_snps_indels.vcf", sample = SAMPLES),
+
+
+if config['DATA'] == "Cohort" or config['DATA'] == 'cohort':
+ rule all:
+ input:
+ "../results/qc/multiqc_report.html",
+ expand("../results/mapped/{sample}_recalibrated.bam", sample = SAMPLES),
+ expand("../results/called/{family}_raw_snps_indels.vcf", family = FAMILIES)
+
+##### Load rules #####
+
+localrules: multiqc, pbrun_germline, pbrun_triocombinegvcf
+
+include: "rules/fastqc.smk"
+
+if config['TRIM'] == "No" or config['TRIM'] == "no":
+ include: "rules/multiqc.smk"
+
+if config['TRIM'] == "Yes" or config['TRIM'] == "yes":
+ include: "rules/trim_galore_pe.smk"
+ include: "rules/multiqc_trim.smk"
+
+if config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no":
+ include: "rules/bwa_mem.smk"
+ include: "rules/gatk_MarkDuplicates.smk"
+ include: "rules/gatk_BaseRecalibrator.smk"
+ include: "rules/gatk_ApplyBQSR.smk"
+
+if (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no") and (config['DATA'] == "Single" or config['DATA'] == 'single'):
+ include: "rules/gatk_HaplotypeCaller_single.smk"
+
+if (config['GPU_ACCELERATED'] == "No" or config['GPU_ACCELERATED'] == "no") and (config['DATA'] == "Cohort" or config['DATA'] == 'cohort'):
+ include: "rules/gatk_HaplotypeCaller_cohort.smk"
+ include: "rules/gatk_CombineGVCFs.smk"
+ include: "rules/gatk_GenotypeGVCFs.smk"
+
+if config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes":
+ include: "rules/pbrun_germline.smk"
+
+if (config['GPU_ACCELERATED'] == "Yes" or config['GPU_ACCELERATED'] == "yes") and (config['DATA'] == "Cohort" or config['DATA'] == 'cohort'):
+ include: "rules/pbrun_triocombinegvcf.smk"
+ include: "rules/gatk_GenotypeGVCFs.smk"
\ No newline at end of file
diff --git a/workflow/dryrun.sh b/workflow/dryrun.sh
new file mode 100644
index 0000000..4bdc112
--- /dev/null
+++ b/workflow/dryrun.sh
@@ -0,0 +1,11 @@
+#!/bin/bash -x
+
+snakemake \
+--dryrun \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml
\ No newline at end of file
diff --git a/workflow/dryrun_hpc.sh b/workflow/dryrun_hpc.sh
new file mode 100644
index 0000000..f5e38e9
--- /dev/null
+++ b/workflow/dryrun_hpc.sh
@@ -0,0 +1,15 @@
+#!/bin/bash -x
+
+snakemake \
+--dryrun \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml \
+--cluster-config ../config/cluster.json \
+--cluster "sbatch -A {cluster.account} \
+-p {cluster.partition} \
+-o {cluster.output}"
\ No newline at end of file
diff --git a/workflow/envs/bwa.yaml b/workflow/envs/bwa.yaml
new file mode 100644
index 0000000..d4ab52f
--- /dev/null
+++ b/workflow/envs/bwa.yaml
@@ -0,0 +1,8 @@
+channels:
+ - bioconda
+ - conda-forge
+ - defaults
+dependencies:
+ - bioconda::samtools =1.10
+ - bioconda::bwa =0.7.17
+ - bioconda::gatk4 =4.1.6.0
\ No newline at end of file
diff --git a/envs/mosdepth.yaml b/workflow/envs/fastqc.yaml
similarity index 66%
rename from envs/mosdepth.yaml
rename to workflow/envs/fastqc.yaml
index b258322..c9cbea6 100644
--- a/envs/mosdepth.yaml
+++ b/workflow/envs/fastqc.yaml
@@ -3,4 +3,4 @@ channels:
- conda-forge
- defaults
dependencies:
- - bioconda::mosdepth =0.2.9
\ No newline at end of file
+ - bioconda::fastqc =0.11.9
\ No newline at end of file
diff --git a/envs/gatk4.yaml b/workflow/envs/gatk4.yaml
similarity index 68%
rename from envs/gatk4.yaml
rename to workflow/envs/gatk4.yaml
index 34c38b1..7cac917 100644
--- a/envs/gatk4.yaml
+++ b/workflow/envs/gatk4.yaml
@@ -3,4 +3,4 @@ channels:
- conda-forge
- defaults
dependencies:
- - bioconda::gatk4 =4.1.6.0
+ - bioconda::gatk4 =4.1.0.0
diff --git a/envs/multiqc.yaml b/workflow/envs/multiqc.yaml
similarity index 67%
rename from envs/multiqc.yaml
rename to workflow/envs/multiqc.yaml
index c6d36b7..33eeacc 100644
--- a/envs/multiqc.yaml
+++ b/workflow/envs/multiqc.yaml
@@ -1,6 +1,6 @@
-channels:
- - bioconda
- - conda-forge
- - defaults
-dependencies:
- - bioconda::multiqc =1.8
+channels:
+ - bioconda
+ - conda-forge
+ - defaults
+dependencies:
+ - bioconda::multiqc =1.11
\ No newline at end of file
diff --git a/envs/fastqc.yaml b/workflow/envs/trim_galore.yaml
similarity index 67%
rename from envs/fastqc.yaml
rename to workflow/envs/trim_galore.yaml
index 4442771..17c049a 100644
--- a/envs/fastqc.yaml
+++ b/workflow/envs/trim_galore.yaml
@@ -3,4 +3,4 @@ channels:
- conda-forge
- defaults
dependencies:
- - bioconda::fastqc =0.11.9
+ - bioconda::trim-galore =0.6.5
\ No newline at end of file
diff --git a/workflow/logs/README.md b/workflow/logs/README.md
new file mode 100644
index 0000000..256d7b8
--- /dev/null
+++ b/workflow/logs/README.md
@@ -0,0 +1 @@
+Snakemake and slurm log files will reside here
\ No newline at end of file
diff --git a/workflow/pipeline_run_env.yml b/workflow/pipeline_run_env.yml
new file mode 100644
index 0000000..45126ca
--- /dev/null
+++ b/workflow/pipeline_run_env.yml
@@ -0,0 +1,172 @@
+name: pipeline_run_env
+channels:
+ - bioconda
+ - conda-forge
+ - defaults
+dependencies:
+ - _libgcc_mutex=0.1=conda_forge
+ - _openmp_mutex=4.5=1_gnu
+ - aioeasywebdav=2.4.0=py37_1000
+ - aiohttp=3.6.2=py37h516909a_0
+ - appdirs=1.4.4=py_0
+ - async-timeout=3.0.1=py_1000
+ - attrs=19.3.0=py_0
+ - bcrypt=3.1.7=py37h8f50634_1
+ - boto3=1.9.191=py_0
+ - botocore=1.12.191=py_0
+ - brotlipy=0.7.0=py37h516909a_1000
+ - bzip2=1.0.8=h516909a_2
+ - c-ares=1.16.1=h516909a_0
+ - ca-certificates=2020.6.24=0
+ - cachetools=4.1.1=py_0
+ - cairo=1.16.0=h3fc0475_1005
+ - certifi=2020.6.20=py37hc8dfbb8_0
+ - cffi=1.14.1=py37h2b28604_0
+ - chardet=3.0.4=py37hc8dfbb8_1006
+ - configargparse=1.2.3=pyh9f0ad1d_0
+ - cryptography=3.0=py37hb09aad4_0
+ - datrie=0.8.2=py37h8f50634_0
+ - decorator=4.4.2=py_0
+ - docutils=0.16=py37hc8dfbb8_1
+ - dropbox=10.2.0=py37_0
+ - expat=2.2.9=he1b5a44_2
+ - fftw=3.3.8=mpi_mpich_h3f9e1be_1011
+ - filechunkio=1.8=py_2
+ - fontconfig=2.13.1=h1056068_1002
+ - freetype=2.10.2=he06d7ca_0
+ - ftputil=4.0.0=py_0
+ - gdk-pixbuf=2.36.12=h3f25603_1005
+ - gettext=0.19.8.1=hc5be6a0_1002
+ - ghostscript=9.22=hf484d3e_1001
+ - giflib=5.1.7=h516909a_1
+ - gitdb=4.0.5=py_0
+ - gitpython=3.1.7=py_0
+ - glib=2.65.0=h6f030ca_0
+ - gobject-introspection=1.64.1=py37h619baee_1
+ - google-api-core=1.22.1=py37hc8dfbb8_0
+ - google-api-python-client=1.10.0=pyh9f0ad1d_0
+ - google-auth=1.20.1=py_0
+ - google-auth-httplib2=0.0.3=py_3
+ - google-cloud-core=1.4.1=pyh9f0ad1d_0
+ - google-cloud-storage=1.30.0=pyh9f0ad1d_0
+ - google-crc32c=1.0.0=py37h193935f_0
+ - google-resumable-media=0.7.0=pyh9f0ad1d_0
+ - googleapis-common-protos=1.51.0=py37hc8dfbb8_2
+ - graphite2=1.3.14=h23475e2_0
+ - graphviz=2.38.0=hf68f40c_1011
+ - grpcio=1.31.0=py37hb0870dc_0
+ - harfbuzz=2.4.0=hee91db6_5
+ - httplib2=0.18.1=pyh9f0ad1d_0
+ - icu=67.1=he1b5a44_0
+ - idna=2.10=pyh9f0ad1d_0
+ - imagemagick=7.0.8_54=pl526h39023e4_0
+ - importlib-metadata=1.7.0=py37hc8dfbb8_0
+ - importlib_metadata=1.7.0=0
+ - ipython_genutils=0.2.0=py_1
+ - jbig=2.1=h516909a_2002
+ - jinja2=2.11.2=pyh9f0ad1d_0
+ - jmespath=0.10.0=pyh9f0ad1d_0
+ - jpeg=9d=h516909a_0
+ - jsonschema=3.2.0=py37hc8dfbb8_1
+ - jupyter_core=4.6.3=py37hc8dfbb8_1
+ - ld_impl_linux-64=2.34=hc38a660_9
+ - libblas=3.8.0=17_openblas
+ - libcblas=3.8.0=17_openblas
+ - libcrc32c=1.1.1=he1b5a44_2
+ - libcroco=0.6.13=h8d621e5_1
+ - libffi=3.2.1=he1b5a44_1007
+ - libgcc-ng=9.3.0=h24d8f2e_14
+ - libgfortran-ng=7.5.0=hdf63c60_14
+ - libgomp=9.3.0=h24d8f2e_14
+ - libiconv=1.15=h516909a_1006
+ - liblapack=3.8.0=17_openblas
+ - libopenblas=0.3.10=pthreads_hb3c22a3_4
+ - libpng=1.6.37=hed695b0_2
+ - libprotobuf=3.12.4=h8b12597_0
+ - librsvg=2.44.14=h11c8777_0
+ - libsodium=1.0.18=h516909a_0
+ - libstdcxx-ng=9.3.0=hdf63c60_14
+ - libtiff=4.1.0=hc3755c2_3
+ - libtool=2.4.6=h516909a_1003
+ - libuuid=2.32.1=h14c3975_1000
+ - libwebp=1.0.2=hf4e8a37_4
+ - libxcb=1.14=h7b6447c_0
+ - libxml2=2.9.10=h72b56ed_2
+ - lz4-c=1.9.2=he1b5a44_1
+ - markupsafe=1.1.1=py37h8f50634_1
+ - mpi=1.0=mpich
+ - mpich=3.3.2=hc856adb_0
+ - multidict=4.7.5=py37h8f50634_1
+ - nbformat=5.0.7=py_0
+ - ncurses=6.2=he1b5a44_1
+ - networkx=2.4=py_1
+ - numpy=1.19.1=py37h8960a57_0
+ - oauth2client=4.1.3=py_0
+ - openjpeg=2.3.1=h981e76c_3
+ - openssl=1.1.1g=h516909a_1
+ - pandas=1.1.0=py37h3340039_0
+ - pango=1.40.14=he7ab937_1005
+ - paramiko=2.7.1=py37_0
+ - pcre=8.44=he1b5a44_0
+ - perl=5.26.2=h516909a_1006
+ - pip=20.2.2=py_0
+ - pixman=0.38.0=h516909a_1003
+ - pkg-config=0.29.2=h516909a_1006
+ - prettytable=0.7.2=py_3
+ - protobuf=3.12.4=py37h3340039_0
+ - psutil=5.7.2=py37h8f50634_0
+ - pyasn1=0.4.8=py_0
+ - pyasn1-modules=0.2.7=py_0
+ - pycparser=2.20=pyh9f0ad1d_2
+ - pygments=2.6.1=py_0
+ - pygraphviz=1.5=py37h8f50634_1002
+ - pynacl=1.4.0=py37h7b6447c_1
+ - pyopenssl=19.1.0=py37_0
+ - pyrsistent=0.16.0=py37h8f50634_0
+ - pysftp=0.2.9=py_1
+ - pysocks=1.7.1=py37hc8dfbb8_1
+ - python=3.7.6=cpython_he5300dc_6
+ - python-dateutil=2.8.1=py_0
+ - python-irodsclient=0.8.2=py_0
+ - python_abi=3.7=1_cp37m
+ - pytz=2020.1=pyh9f0ad1d_0
+ - pyyaml=5.3.1=py37h8f50634_0
+ - ratelimiter=1.2.0=py37hc8dfbb8_1001
+ - readline=8.0=he28a2e2_2
+ - requests=2.24.0=pyh9f0ad1d_0
+ - rsa=4.6=pyh9f0ad1d_0
+ - s3transfer=0.2.1=py37_0
+ - setuptools=49.5.0=py37hc8dfbb8_0
+ - simplejson=3.17.2=py37h8f50634_0
+ - six=1.15.0=pyh9f0ad1d_0
+ - slacker=0.14.0=py_0
+ - smmap=3.0.4=pyh9f0ad1d_0
+ - snakemake=5.22.0=0
+ - snakemake-minimal=5.22.0=py_0
+ - sqlite=3.32.3=hcee41ef_1
+ - tk=8.6.10=hed695b0_0
+ - toposort=1.5=py_3
+ - traitlets=4.3.3=py37hc8dfbb8_1
+ - uritemplate=3.0.1=py_0
+ - urllib3=1.25.10=py_0
+ - wheel=0.34.2=py37_0
+ - wrapt=1.12.1=py37h8f50634_1
+ - xmlrunner=1.7.7=py_0
+ - xorg-kbproto=1.0.7=h14c3975_1002
+ - xorg-libice=1.0.10=h516909a_0
+ - xorg-libsm=1.2.3=h84519dc_1000
+ - xorg-libx11=1.6.11=h516909a_0
+ - xorg-libxext=1.3.4=h516909a_0
+ - xorg-libxpm=3.5.13=h516909a_0
+ - xorg-libxrender=0.9.10=h516909a_1002
+ - xorg-libxt=1.1.5=h516909a_1003
+ - xorg-renderproto=0.11.1=h14c3975_1002
+ - xorg-xextproto=7.3.0=h14c3975_1002
+ - xorg-xproto=7.0.31=h14c3975_1007
+ - xz=5.2.5=h516909a_1
+ - yaml=0.2.5=h516909a_0
+ - yarl=1.4.2=py37h7b6447c_0
+ - zipp=3.1.0=py_0
+ - zlib=1.2.11=h516909a_1007
+ - zstd=1.4.5=h6597ccf_2
+prefix: /home/lkemp/anaconda3/envs/pipeline_run_env
diff --git a/workflow/report.sh b/workflow/report.sh
new file mode 100644
index 0000000..b5d23a0
--- /dev/null
+++ b/workflow/report.sh
@@ -0,0 +1,6 @@
+#!/bin/bash -x
+
+snakemake \
+--report ../results/report.html \
+--configfile ../config/config.yaml \
+--report-stylesheet ../config/ESR_stylesheet.css
\ No newline at end of file
diff --git a/report/quality_checks_post_trim.rst b/workflow/report/quality_checks.rst
similarity index 55%
rename from report/quality_checks_post_trim.rst
rename to workflow/report/quality_checks.rst
index b1c1f22..0fffa8a 100644
--- a/report/quality_checks_post_trim.rst
+++ b/workflow/report/quality_checks.rst
@@ -1 +1 @@
-`FastQC `_ report (compiled into a `MultiQC report `_) providing general statistics on the sequence data (after trimming)
\ No newline at end of file
+`FastQC `_ report (compiled into a `MultiQC report `_) providing general quality statistics on the sequence data and fastq trimming (if undertaken)
\ No newline at end of file
diff --git a/report/recalibration.rst b/workflow/report/recalibration.rst
similarity index 100%
rename from report/recalibration.rst
rename to workflow/report/recalibration.rst
diff --git a/workflow/report/workflow.rst b/workflow/report/workflow.rst
new file mode 100644
index 0000000..98a8a22
--- /dev/null
+++ b/workflow/report/workflow.rst
@@ -0,0 +1,6 @@
+`human_genomics_pipeline `_ processes single samples or cohorts of paired-end sequencing data (WGS or WES) using `bwa `_ and `GATK4 `_. This workflow is designed to follow the `GATK best practice workflow for germline short variant discovery (SNPs + Indels) `_ and is designed to be followed by `vcf_annotation_pipeline `_.
+
+Run parameters:
+ * Input data type: {{ snakemake.config["DATA"] }}
+ * Reference genome: {{ snakemake.config["REFGENOME"] }}
+ * dbSNP database {{ snakemake.config["dbSNP"] }}
diff --git a/workflow/rules/bwa_mem.smk b/workflow/rules/bwa_mem.smk
new file mode 100644
index 0000000..40b1062
--- /dev/null
+++ b/workflow/rules/bwa_mem.smk
@@ -0,0 +1,22 @@
+rule bwa_mem:
+ input:
+ fastq = get_input_fastq,
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ temp("../results/mapped/{sample}_sorted.bam")
+ params:
+ readgroup = "'@RG\\tID:{sample}_rg1\\tLB:lib1\\tPL:bar\\tSM:{sample}\\tPU:{sample}_rg1'",
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ sortsam = "--MAX_RECORDS_IN_RAM=5000000 --SORT_ORDER=coordinate -I=/dev/stdin",
+ tdir = expand("{tdir}", tdir = config['TEMPDIR'])
+ log:
+ "logs/bwa_mem/{sample}.log"
+ benchmark:
+ "benchmarks/bwa_mem/{sample}.tsv"
+ conda:
+ "../envs/bwa.yaml"
+ threads: config['THREADS']
+ message:
+ "Mapping sequences against a reference human genome with BWA-MEM for {input.fastq}"
+ shell:
+ "bwa mem -R {params.readgroup} -t {threads} -K 10000000 {input.refgenome} {input.fastq} | gatk SortSam --java-options {params.maxmemory} {params.sortsam} -O={output} --TMP_DIR={params.tdir} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/fastqc.smk b/workflow/rules/fastqc.smk
new file mode 100644
index 0000000..e36b906
--- /dev/null
+++ b/workflow/rules/fastqc.smk
@@ -0,0 +1,17 @@
+rule fastqc:
+ input:
+ ["../../fastq/{sample}_1.fastq.gz", "../../fastq/{sample}_2.fastq.gz"]
+ output:
+ html = ["../results/qc/fastqc/{sample}_1_fastqc.html", "../results/qc/fastqc/{sample}_2_fastqc.html"],
+ zip = ["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"]
+ log:
+ "logs/fastqc/{sample}.log"
+ benchmark:
+ "benchmarks/fastqc/{sample}.tsv"
+ threads: config['THREADS']
+ conda:
+ "../envs/fastqc.yaml"
+ message:
+ "Undertaking quality control checks on raw sequence data for {input}"
+ shell:
+ "fastqc {input} -o ../results/qc/fastqc/ -t {threads} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_ApplyBQSR.smk b/workflow/rules/gatk_ApplyBQSR.smk
new file mode 100644
index 0000000..9d3540a
--- /dev/null
+++ b/workflow/rules/gatk_ApplyBQSR.smk
@@ -0,0 +1,21 @@
+rule gatk_ApplyBQSR:
+ input:
+ bam = "../results/mapped/{sample}_sorted_mkdups.bam",
+ recal = "../results/mapped/{sample}_recalibration_report.grp",
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ bam = protected("../results/mapped/{sample}_recalibrated.bam")
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command
+ log:
+ "logs/gatk_ApplyBQSR/{sample}.log"
+ benchmark:
+ "benchmarks/gatk_ApplyBQSR/{sample}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Applying base quality score recalibration and producing a recalibrated BAM file for {input.bam}"
+ shell:
+ "gatk ApplyBQSR --java-options {params.maxmemory} -I {input.bam} -bqsr {input.recal} -R {input.refgenome} -O {output} {params.padding} {params.intervals}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_BaseRecalibrator.smk b/workflow/rules/gatk_BaseRecalibrator.smk
new file mode 100644
index 0000000..265ccb9
--- /dev/null
+++ b/workflow/rules/gatk_BaseRecalibrator.smk
@@ -0,0 +1,22 @@
+rule gatk_BaseRecalibrator:
+ input:
+ bams = "../results/mapped/{sample}_sorted_mkdups.bam",
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ report("../results/mapped/{sample}_recalibration_report.grp", caption = "../report/recalibration.rst", category = "Base recalibration")
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command,
+ recalibration_resources = get_recal_resources_command
+ log:
+ "logs/gatk_BaseRecalibrator/{sample}.log"
+ benchmark:
+ "benchmarks/gatk_BaseRecalibrator/{sample}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Generating a recalibration table for {input.bams}"
+ shell:
+ "gatk BaseRecalibrator --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -O {output} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.recalibration_resources} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_CombineGVCFs.smk b/workflow/rules/gatk_CombineGVCFs.smk
new file mode 100644
index 0000000..57d647c
--- /dev/null
+++ b/workflow/rules/gatk_CombineGVCFs.smk
@@ -0,0 +1,23 @@
+rule gatk_CombineGVCFs:
+ input:
+ vcf_dummy = expand("../results/called/{sample}_raw_snps_indels_tmp.g.vcf", sample = SAMPLES), # a dummy vcf to connect this rule to gatk_HaplotypeCaller
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ vcf = temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf"),
+ index = temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf.idx")
+ params:
+ command = get_gatk_combinegvcf_command,
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command,
+ other = "-G StandardAnnotation -G AS_StandardAnnotation"
+ log:
+ "logs/gatk_CombineGVCFs/{family}.log"
+ benchmark:
+ "benchmarks/gatk_CombineGVCFs/{family}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Merging one or more HaplotypeCaller GVCF files into a single GVCF"
+ shell:
+ "gatk CombineGVCFs -R {input.refgenome} {params.command} -O {output.vcf} --tmp-dir {params.tdir} {params.other} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_GenotypeGVCFs.smk b/workflow/rules/gatk_GenotypeGVCFs.smk
new file mode 100644
index 0000000..165b470
--- /dev/null
+++ b/workflow/rules/gatk_GenotypeGVCFs.smk
@@ -0,0 +1,22 @@
+rule gatk_GenotypeGVCFs:
+ input:
+ gvcf = "../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf",
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ protected("../results/called/{family}_raw_snps_indels.vcf")
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command,
+ other = "-G StandardAnnotation -G AS_StandardAnnotation"
+ log:
+ "logs/gatk_GenotypeGVCFs/{family}.log"
+ benchmark:
+ "benchmarks/gatk_GenotypeGVCFs/{family}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Performing joint genotyping on one or more samples pre-called with HaplotypeCaller for {input.gvcf}"
+ shell:
+ "gatk GenotypeGVCFs --java-options {params.maxmemory} -R {input.refgenome} -V {input.gvcf} -O {output} {params.padding} {params.intervals} {params.other} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_HaplotypeCaller_cohort.smk b/workflow/rules/gatk_HaplotypeCaller_cohort.smk
new file mode 100644
index 0000000..220d6a7
--- /dev/null
+++ b/workflow/rules/gatk_HaplotypeCaller_cohort.smk
@@ -0,0 +1,24 @@
+rule gatk_HaplotypeCaller_cohort:
+ input:
+ bams = "../results/mapped/{sample}_recalibrated.bam",
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
+ dbsnp = expand("{dbsnp}", dbsnp = config['dbSNP'])
+ output:
+ vcf = temp("../results/called/{sample}_raw_snps_indels_tmp.g.vcf"),
+ index = temp("../results/called/{sample}_raw_snps_indels_tmp.g.vcf.idx")
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command,
+ other = "-ERC GVCF"
+ log:
+ "logs/gatk_HaplotypeCaller_cohort/{sample}.log"
+ benchmark:
+ "benchmarks/gatk_HaplotypeCaller_cohort/{sample}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Calling germline SNPs and indels via local re-assembly of haplotypes for {input.bams}"
+ shell:
+ "gatk HaplotypeCaller --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output.vcf} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_HaplotypeCaller_single.smk b/workflow/rules/gatk_HaplotypeCaller_single.smk
new file mode 100644
index 0000000..7056bfe
--- /dev/null
+++ b/workflow/rules/gatk_HaplotypeCaller_single.smk
@@ -0,0 +1,22 @@
+rule gatk_HaplotypeCaller_single:
+ input:
+ bams = "../results/mapped/{sample}_recalibrated.bam",
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME']),
+ dbsnp = expand("{dbsnp}", dbsnp = config['dbSNP'])
+ output:
+ protected("../results/called/{sample}_raw_snps_indels.vcf")
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command
+ log:
+ "logs/gatk_HaplotypeCaller_single/{sample}.log"
+ benchmark:
+ "benchmarks/gatk_HaplotypeCaller_single/{sample}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Calling germline SNPs and indels via local re-assembly of haplotypes for {input.bams}"
+ shell:
+ "gatk HaplotypeCaller --java-options {params.maxmemory} -I {input.bams} -R {input.refgenome} -D {input.dbsnp} -O {output} --tmp-dir {params.tdir} {params.padding} {params.intervals} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/gatk_MarkDuplicates.smk b/workflow/rules/gatk_MarkDuplicates.smk
new file mode 100644
index 0000000..4992490
--- /dev/null
+++ b/workflow/rules/gatk_MarkDuplicates.smk
@@ -0,0 +1,19 @@
+rule gatk_MarkDuplicates:
+ input:
+ "../results/mapped/{sample}_sorted.bam"
+ output:
+ bam = temp("../results/mapped/{sample}_sorted_mkdups.bam"),
+ metrics = "../results/mapped/{sample}_sorted_mkdups_metrics.txt"
+ params:
+ maxmemory = expand('"-Xmx{maxmemory}"', maxmemory = config['MAXMEMORY']),
+ tdir = config['TEMPDIR']
+ log:
+ "logs/gatk_MarkDuplicates/{sample}.log"
+ benchmark:
+ "benchmarks/gatk_MarkDuplicates/{sample}.tsv"
+ conda:
+ "../envs/gatk4.yaml"
+ message:
+ "Locating and tagging duplicate reads in {input}"
+ shell:
+ "gatk MarkDuplicates --java-options {params.maxmemory} -I {input} -O {output.bam} -M {output.metrics} --TMP_DIR {params.tdir} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/multiqc.smk b/workflow/rules/multiqc.smk
new file mode 100644
index 0000000..2dd35d2
--- /dev/null
+++ b/workflow/rules/multiqc.smk
@@ -0,0 +1,13 @@
+rule multiqc:
+ input:
+ expand(["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"], sample = SAMPLES)
+ output:
+ report("../results/qc/multiqc_report.html", caption = "../report/quality_checks.rst", category = "Quality checks")
+ conda:
+ "../envs/multiqc.yaml"
+ log:
+ "logs/multiqc/multiqc.log"
+ message:
+ "Compiling a HTML report for quality control checks on raw sequence data"
+ shell:
+ "multiqc {input} -o ../results/qc/ &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/multiqc_trim.smk b/workflow/rules/multiqc_trim.smk
new file mode 100644
index 0000000..3303280
--- /dev/null
+++ b/workflow/rules/multiqc_trim.smk
@@ -0,0 +1,15 @@
+rule multiqc:
+ input:
+ fastqc = expand(["../results/qc/fastqc/{sample}_1_fastqc.zip", "../results/qc/fastqc/{sample}_2_fastqc.zip"], sample = SAMPLES),
+ trimming1 = expand("../results/trimmed/{sample}_1.fastq.gz_trimming_report.txt", sample = SAMPLES),
+ trimming2 = expand("../results/trimmed/{sample}_2.fastq.gz_trimming_report.txt", sample = SAMPLES)
+ output:
+ report("../results/qc/multiqc_report.html", caption = "../report/quality_checks.rst", category = "Quality checks")
+ conda:
+ "../envs/multiqc.yaml"
+ log:
+ "logs/multiqc/multiqc.log"
+ message:
+ "Compiling a HTML report for quality control checks on raw sequence data"
+ shell:
+ "multiqc {input.fastqc} {input.trimming1} {input.trimming2} -o ../results/qc/ &> {log}"
diff --git a/workflow/rules/pbrun_germline.smk b/workflow/rules/pbrun_germline.smk
new file mode 100644
index 0000000..3163d38
--- /dev/null
+++ b/workflow/rules/pbrun_germline.smk
@@ -0,0 +1,27 @@
+rule pbrun_germline:
+ input:
+ fastq = get_input_fastq,
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ bam = protected("../results/mapped/{sample}_recalibrated.bam"),
+ bam_index = protected("../results/mapped/{sample}_recalibrated.bam.bai"),
+ vcf = get_output_vcf(config),
+ recal = temp("../results/mapped/{sample}_recal.txt")
+ resources:
+ gpu = config['GPU']
+ params:
+ readgroup = "--read-group-sm {sample}",
+ tdir = config['TEMPDIR'],
+ padding = get_wes_padding_command,
+ intervals = get_wes_intervals_command,
+ recalibration_resources = get_recal_resources_command,
+ other_params = get_params
+ log:
+ "logs/pbrun_germline/{sample}.log"
+ benchmark:
+ "benchmarks/pbrun_germline/{sample}.tsv"
+ threads: config['THREADS']
+ message:
+ "Running GPU accelerated germline variant pipeline workflow to generate BAM, vcf and recal output for {input.fastq}"
+ shell:
+ "pbrun germline --ref {input.refgenome} --in-fq {input.fastq} {params.recalibration_resources} --out-bam {output.bam} --out-variants {output.vcf} --out-recal-file {output.recal} --num-gpus {resources.gpu} {params.readgroup} --tmp-dir {params.tdir} {params.padding} {params.intervals} {params.other_params} --num-cpu-threads {threads} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/pbrun_triocombinegvcf.smk b/workflow/rules/pbrun_triocombinegvcf.smk
new file mode 100644
index 0000000..5987edc
--- /dev/null
+++ b/workflow/rules/pbrun_triocombinegvcf.smk
@@ -0,0 +1,18 @@
+rule pbrun_triocombinegvcf:
+ input:
+ vcf_dummy = expand("../results/called/{sample}_raw_snps_indels_tmp.g.vcf", sample = SAMPLES), # a dummy vcf to connect this rule to gatk_HaplotypeCaller
+ refgenome = expand("{refgenome}", refgenome = config['REFGENOME'])
+ output:
+ temp("../results/called/{family}_raw_snps_indels_tmp_combined.g.vcf")
+ resources:
+ gpu = config['GPU']
+ params:
+ command = get_parabricks_combinegvcf_command
+ log:
+ "logs/pbrun_triocombinegvcf/{family}.log"
+ benchmark:
+ "benchmarks/pbrun_triocombinegvcf/{family}.tsv"
+ message:
+ "Merging one or more HaplotypeCaller GVCF files into a single GVCF"
+ shell:
+ "pbrun triocombinegvcf --ref {input.refgenome} {params.command} --out-variants {output} &> {log}"
\ No newline at end of file
diff --git a/workflow/rules/trim_galore_pe.smk b/workflow/rules/trim_galore_pe.smk
new file mode 100644
index 0000000..c3f0bb4
--- /dev/null
+++ b/workflow/rules/trim_galore_pe.smk
@@ -0,0 +1,23 @@
+rule trim_galore_pe:
+ input:
+ ["../../fastq/{sample}_1.fastq.gz", "../../fastq/{sample}_2.fastq.gz"]
+ output:
+ temp("../results/trimmed/{sample}_1_val_1.fq.gz"),
+ "../results/trimmed/{sample}_1.fastq.gz_trimming_report.txt",
+ temp("../results/trimmed/{sample}_2_val_2.fq.gz"),
+ "../results/trimmed/{sample}_2.fastq.gz_trimming_report.txt"
+ params:
+ adapters = config['TRIMMING']['ADAPTERS'],
+ threads = config['THREADS'],
+ other = "-q 20 --paired"
+ log:
+ "logs/trim_galore_pe/{sample}.log"
+ benchmark:
+ "benchmarks/trim_galore_pe/{sample}.tsv"
+ conda:
+ "../envs/trim_galore.yaml"
+ threads: config['THREADS']
+ message:
+ "Applying quality and adapter trimming to input fastq files: {input}"
+ shell:
+ "trim_galore {input} -o ../results/trimmed/ {params.adapters} {params.other} -j {threads} &> {log}"
diff --git a/workflow/run.sh b/workflow/run.sh
new file mode 100644
index 0000000..3b8b621
--- /dev/null
+++ b/workflow/run.sh
@@ -0,0 +1,10 @@
+#!/bin/bash -x
+
+snakemake \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml
\ No newline at end of file
diff --git a/workflow/run_hpc.sh b/workflow/run_hpc.sh
new file mode 100644
index 0000000..eabb0f6
--- /dev/null
+++ b/workflow/run_hpc.sh
@@ -0,0 +1,14 @@
+#!/bin/bash -x
+
+snakemake \
+--cores 32 \
+--resources mem_mb=150000 \
+--resources gpu=2 \
+--use-conda \
+--conda-frontend mamba \
+--latency-wait 120 \
+--configfile ../config/config.yaml \
+--cluster-config ../config/cluster.json \
+--cluster "sbatch -A {cluster.account} \
+-p {cluster.partition} \
+-o {cluster.output}"
\ No newline at end of file