diff --git a/CHANGELOG.md b/CHANGELOG.md
index 034988a27..c8b1cd362 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -21,6 +21,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - [[PR #493](https://github.com/nf-core/chipseq/pull/493)] - Follow up to #487.
 - [[#492](https://github.com/nf-core/chipseq/issues/492), [#417](https://github.com/nf-core/chipseq/issues/417)] - Refactor local modules to nf-core standard.
 - [[#416](https://github.com/nf-core/chipseq/issues/416)] - Moved the KHMER_UNIQUEKMERS logic to prepare_genome
+- [[#440](https://github.com/nf-core/chipseq/issues/440), [#510](https://github.com/nf-core/chipseq/issues/510)] - Fix
+  naming collisions when the sample and replicate combination is identical for multiple antibodies.
+- [[#467](https://github.com/nf-core/chipseq/issues/467), [#510](https://github.com/nf-core/chipseq/issues/510)] -
+  Restrict the usage to one IP against one control replicate.
 
 ### Parameters
 
diff --git a/README.md b/README.md
index d022f5fc5..40eb1447f 100644
--- a/README.md
+++ b/README.md
@@ -116,7 +116,13 @@ These scripts were originally written by Chuan Wang ([@chuan-wang](https://githu
 
 The pipeline workflow diagram was designed by Sarah Guinchard ([@G-Sarah](https://github.com/G-Sarah)).
 
-Many thanks to others who have helped out and contributed along the way too, including (but not limited to): [@apeltzer](https://github.com/apeltzer), [@bc2zb](https://github.com/bc2zb), [@bjlang](https://github.com/bjlang), [@crickbabs](https://github.com/crickbabs), [@drejom](https://github.com/drejom), [@houghtos](https://github.com/houghtos), [@KevinMenden](https://github.com/KevinMenden), [@mashehu](https://github.com/mashehu), [@pditommaso](https://github.com/pditommaso), [@Rotholandus](https://github.com/Rotholandus), [@sofiahaglund](https://github.com/sofiahaglund), [@tiagochst](https://github.com/tiagochst) and [@winni2k](https://github.com/winni2k).
+Many thanks to others who have helped out and contributed along the way too, including (but not limited to):
+[@apeltzer](https://github.com/apeltzer), [@bc2zb](https://github.com/bc2zb), [@bjlang](https://github.com/bjlang),
+[@crickbabs](https://github.com/crickbabs), [@drejom](https://github.com/drejom),
+[@houghtos](https://github.com/houghtos), [@KevinMenden](https://github.com/KevinMenden),
+[@mashehu](https://github.com/mashehu), [@pditommaso](https://github.com/pditommaso),
+[@Rotholandus](https://github.com/Rotholandus), [@sofiahaglund](https://github.com/sofiahaglund),
+[@tiagochst](https://github.com/tiagochst), [@winni2k](https://github.com/winni2k) and [@Kevin-Brockers](https://github.com/Kevin-Brockers).
 
 ## Contributions and Support
 
diff --git a/bin/check_samplesheet.py b/bin/check_samplesheet.py
index d34f42ed0..463c79270 100755
--- a/bin/check_samplesheet.py
+++ b/bin/check_samplesheet.py
@@ -212,9 +212,14 @@ def check_samplesheet(file_in, file_out):
                             sample,
                         )
 
+                    set_antibodies = set()
+                    set_control_replicates = set()
+
                     for idx, val in enumerate(sample_mapping_dict[sample][replicate]):
                         control = "_REP".join(val[-1].split("_REP")[:-1])
                         control_replicate = val[-1].split("_REP")[-1]
+                        set_control_replicates.add(control_replicate)
+
                         if control and (
                             control not in sample_mapping_dict.keys()
                             or int(control_replicate) not in sample_mapping_dict[control].keys()
@@ -225,6 +230,21 @@ def check_samplesheet(file_in, file_out):
                                 val[-1],
                             )
 
+                    for x in sample_mapping_dict[sample][replicate]:
+                        set_antibodies.add(x[4])
+
+                    # Check that a given sample replicate only uses one antibody
+                    if len(set_antibodies) > 1:
+                        print_error(
+                            f"Sample: {sample}, replicate {replicate} has more than one antibody specified!"
+                        )
+
+                    # Check that a given sample replicate has only one control replicate
+                    if len(set_control_replicates) > 1:
+                        print_error(
+                            f"Sample: {sample}, replicate {replicate} has more than one control replicate specified! Revise the experimental design, see: 'Note on IP and control replicates'"
+                        )
+
                     ## Write to file
                     for idx in range(len(sample_mapping_dict[sample][replicate])):
                         fastq_files = sample_mapping_dict[sample][replicate][idx]
diff --git a/docs/usage.md b/docs/usage.md
index d8eaf22bf..0ebc6cf91 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -47,6 +47,50 @@ WT_INPUT,BLA203A30_S21_L002_R1_001.fastq.gz,,2,,,
 WT_INPUT,BLA203A31_S21_L003_R1_001.fastq.gz,,3,,,
 ```
 
+### Note on IP and control replicates - Comparisons of one IP sample against multiple controls
+
+The pipeline is designed to handle one IP and one matching control replicate, see the section above. However, there can be
+situations where one might want to make multiple comparisons of the IP sample against several different controls. In
+those cases it is advisable to encode these comparisons either in the sample column or as another replicate. Since this is
+rather unusual in ChIP-Seq experiments, this feature is considered experimental. Please open a GitHub issue in case you
+need further assistance.
+
+- Encoding in sample names:
+
+```csv title="samplesheet.csv"
+sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,1
+WT_BCATENIN_IP_CONTROL_2,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,2
+WT_BCATENIN_IP_CONTROL_3,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,3
+WT_INPUT,BLA203A6_S32_L006_R1_001.fastq.gz,,1,,,
+WT_INPUT,BLA203A30_S21_L001_R1_001.fastq.gz,,2,,,
+WT_INPUT,BLA203A31_S21_L003_R1_001.fastq.gz,,3,,,
+```
+
+- Encoding as new biological replicates:
+
+```csv title="samplesheet.csv"
+sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,1
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,2,BCATENIN,WT_INPUT,2
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,3,BCATENIN,WT_INPUT,3
+WT_INPUT,BLA203A6_S32_L006_R1_001.fastq.gz,,1,,,
+WT_INPUT,BLA203A30_S21_L001_R1_001.fastq.gz,,2,,,
+WT_INPUT,BLA203A31_S21_L003_R1_001.fastq.gz,,3,,,
+```
+
+- The following design, one IP replicate against more than one control replicate, is not allowed:
+
+```csv title="samplesheet.csv"
+sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,1
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,2
+WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,3
+WT_INPUT,BLA203A6_S32_L006_R1_001.fastq.gz,,1,,,
+WT_INPUT,BLA203A30_S21_L001_R1_001.fastq.gz,,2,,,
+WT_INPUT,BLA203A31_S21_L003_R1_001.fastq.gz,,3,,,
+```
+
 ### Full design
 
 The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 7 columns to match those defined in the table below.
@@ -77,15 +121,15 @@ NAIVE_INPUT,BLA203A48_S39_L001_R1_001.fastq.gz,,2,,, NAIVE_INPUT,BLA203A49_S1_L006_R1_001.fastq.gz,,3,,, ``` -| Column | Description | -| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). | -| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | -| `replicate` | Integer representing replicate number. This will be identical for re-sequenced libraries. Must start from `1..`. | -| `antibody` | Antibody name. This is required to segregate downstream analysis for different antibodies. Required when `control` is specified. | -| `control` | Sample name for control sample. | -| `control_replicate` | Integer representing replicate number for control sample. | +| Column | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `sample` | Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (`_`). It should be unique per sample and contain sufficient informations, such as the antibody name. 
E.g: `{Treatment or cell type}_{antibody}_IP` -> `{WT/NAIVE}_{BCATENIN}_IP` | +| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | +| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". | +| `replicate` | Integer representing replicate number. This will be identical for re-sequenced libraries. Must start from `1..`. | +| `antibody` | Antibody name. This is required to segregate downstream analysis for different antibodies. Required when `control` is specified. | +| `control` | Sample name for control sample. | +| `control_replicate` | Integer representing replicate number for control sample. | Example design files have been provided with the pipeline for [paired-end](../assets/samplesheet_pe.csv) and [single-end](../assets/samplesheet_se.csv) data.
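
Note on the new checks: the two rules added to `bin/check_samplesheet.py` (one antibody and one control replicate per sample/replicate pair) can be tried out in isolation with a sketch like the one below. It is not the code from this diff. The function name `find_design_errors` and the constants `ALLOWED` and `NOT_ALLOWED` are hypothetical, the sheet is read with `csv.DictReader` instead of the script's `sample_mapping_dict`, and empty antibody or control values are skipped, which is a simplification that the patched script may not share.

```python
import csv
import io
from collections import defaultdict


def find_design_errors(samplesheet_text):
    """Return error messages for sample/replicate pairs that break the two rules."""
    antibodies = defaultdict(set)
    control_replicates = defaultdict(set)

    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        key = (row["sample"], row["replicate"])
        # Assumption: control rows leave these columns empty, so empty values
        # are not counted as a distinct antibody or control replicate.
        if row["antibody"]:
            antibodies[key].add(row["antibody"])
        if row["control_replicate"]:
            control_replicates[key].add(row["control_replicate"])

    errors = []
    for (sample, replicate), found in sorted(antibodies.items()):
        if len(found) > 1:
            errors.append(
                f"Sample: {sample}, replicate {replicate} has more than one antibody specified!"
            )
    for (sample, replicate), found in sorted(control_replicates.items()):
        if len(found) > 1:
            errors.append(
                f"Sample: {sample}, replicate {replicate} has more than one control replicate specified!"
            )
    return errors


HEADER = "sample,fastq_1,fastq_2,replicate,antibody,control,control_replicate\n"
INPUTS = (
    "WT_INPUT,BLA203A6_S32_L006_R1_001.fastq.gz,,1,,,\n"
    "WT_INPUT,BLA203A30_S21_L001_R1_001.fastq.gz,,2,,,\n"
    "WT_INPUT,BLA203A31_S21_L003_R1_001.fastq.gz,,3,,,\n"
)

# "Encoding as new biological replicates" from docs/usage.md
ALLOWED = HEADER + (
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,1\n"
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,2,BCATENIN,WT_INPUT,2\n"
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,3,BCATENIN,WT_INPUT,3\n"
) + INPUTS

# The design that docs/usage.md lists as not allowed
NOT_ALLOWED = HEADER + (
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,1\n"
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,2\n"
    "WT_BCATENIN_IP,BLA203A1_S27_L006_R1_001.fastq.gz,,1,BCATENIN,WT_INPUT,3\n"
) + INPUTS

if __name__ == "__main__":
    print(find_design_errors(ALLOWED))
    print(find_design_errors(NOT_ALLOWED))
```

Under these assumptions, the replicate-encoded sheet produces no errors, while the disallowed sheet produces a single control-replicate error for `WT_BCATENIN_IP` replicate 1.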