Pull request (closed): changes from all 53 commits
- `56fe854` Update README.md (JessvLS, Feb 28, 2020)
- `3e237a4` Import of original code/files (JessvLS, Apr 20, 2020)
- `49e0117` Delete README.md (JessvLS, Apr 20, 2020)
- `e6cee81` Delete genome.json (JessvLS, Apr 20, 2020)
- `a4b6756` Delete pipelines.json (JessvLS, Apr 20, 2020)
- `5bd285e` Delete primer_file.csv (JessvLS, Apr 20, 2020)
- `44bdddf` Delete primers.json (JessvLS, Apr 20, 2020)
- `649e72e` Delete protocol.json (JessvLS, Apr 20, 2020)
- `767b476` Delete references.fasta (JessvLS, Apr 20, 2020)
- `a320354` Delete Snakefile (JessvLS, Apr 20, 2020)
- `218bda7` Delete assign_amplicon.py (JessvLS, Apr 20, 2020)
- `d606b53` Delete config.yaml (JessvLS, Apr 20, 2020)
- `95120b7` Delete parse_noro_ref_and_depth.py (JessvLS, Apr 20, 2020)
- `b6fcc26` Delete summary_info_from_rampart.py (JessvLS, Apr 20, 2020)
- `029318b` Delete Snakefile (JessvLS, Apr 20, 2020)
- `560f638` Delete clean.py (JessvLS, Apr 20, 2020)
- `53fee0e` Delete generate_report.py (JessvLS, Apr 20, 2020)
- `39f4076` Delete map_polish.smk (JessvLS, Apr 20, 2020)
- `9fe3edb` Delete mask_low_coverage.py (JessvLS, Apr 20, 2020)
- `880ef3b` Delete mask_low_coverage.smk (JessvLS, Apr 20, 2020)
- `cbb55ec` Delete merge.py (JessvLS, Apr 20, 2020)
- `08c1f2c` Delete trim_primers.py (JessvLS, Apr 20, 2020)
- `df95b95` Delete variants.smk (JessvLS, Apr 20, 2020)
- `2a91521` Delete rampart_noro.png (JessvLS, Apr 20, 2020)
- `3cc4721` Delete run_configuration.json (JessvLS, Apr 20, 2020)
- `c3f2cd1` Delete barcodes.csv (JessvLS, Apr 20, 2020)
- `62af993` Real upload of original files (JessvLS, Apr 20, 2020)
- `bae476c` Update README.md (JessvLS, Apr 21, 2020)
- `571ab60` Update barcodes.csv (JessvLS, Apr 21, 2020)
- `8961009` Updated references.fasta (JessvLS, Apr 21, 2020)
- `1d9692d` Updated genome.json (JessvLS, Apr 22, 2020)
- `ff917b2` Updated pipelines.json (JessvLS, Apr 22, 2020)
- `7029605` Updated protocol.json (JessvLS, Apr 22, 2020)
- `9a9f6a1` Update environment.yml (JessvLS, Apr 22, 2020)
- `7caf3df` Delete sample_file.py (JessvLS, Apr 22, 2020)
- `287f70f` Update README.md (JessvLS, Apr 22, 2020)
- `9256790` Update README.md (JessvLS, Apr 22, 2020)
- `2cb6efa` Updated primer_file.csv (JessvLS, Apr 22, 2020)
- `b16a257` Update README.md (JessvLS, Apr 23, 2020)
- `df9eb18` Test upload of example fastq (JessvLS, Apr 23, 2020)
- `b205490` Complete sample data (JessvLS, Apr 23, 2020)
- `dce5463` Changed environment.yml to root dir (dannythedorito, May 5, 2020)
- `8fc3efe` Removed old environment.yml (dannythedorito, May 5, 2020)
- `36b0bd8` Update config.yml (JessvLS, May 7, 2020)
- `2bbf727` Update README.md (JessvLS, May 7, 2020)
- `eb6cd27` Delete README.md (JessvLS, May 7, 2020)
- `b9ef594` Update and rename sample_test.py to run_test.py (JessvLS, May 7, 2020)
- `9274dd0` Upload of meta.yaml (JessvLS, May 7, 2020)
- `a29899a` Rename run_test.py to run_test.txt (JessvLS, May 7, 2020)
- `e23696d` Delete config.yml (JessvLS, May 7, 2020)
- `7bd16a8` Add files via upload (JessvLS, May 7, 2020)
- `ab52945` Trying to fix circleci issues.. (JessvLS, May 7, 2020)
- `4960962` Update environment.yml (JessvLS, May 7, 2020)
13 changes: 0 additions & 13 deletions .circleci/config.yml

This file was deleted.

44 changes: 44 additions & 0 deletions Project_2020_Notebook_JvLS.py
@@ -0,0 +1,44 @@
#On April 16, 2020:

"Got approval to work on the RAMPART sequencing pipeline for norovirus, a project whose code was laid out by a collaborator, but would be updated for our lab's purposes"

#On April 20, 2020:

" * Original files were uploaded from https://github.com/aineniamh/realtime-noro"
" * README.md was updated to include the project goals we intended to accomplish"

#On April 21, 2020:

" * A new references.fasta was crafted to merge the existing reference sequences with a complete set of references for all uncommon noro strains"

#On April 22, 2020:

" * genome.json was edited to correct for overlapping ORFs - original file did not have this overlap"
" * genome.json was edited to have correct GI numbering to accommodate mapping noro genomes larger than GI.1"
" * pipelines.json was edited to drop 'min_read' coverage from 50 to 10 to assure complete genome coverage"
" * protocol.json was edited to drop 'min_identity' from 'annotationOptions' from 50 to 10 to assure complete genome coverage"
" * protocol.json was edited to remove 'require_two_barcodes' from 'annotationOptions' to account for poor barcoding during wetlab prep"
" * primer_file.csv was updated to include the primers our lab used to generate amplicon pools spanning each noro genotype"
" * RAN A TEST RUN OF DATA - FAIL"
" * Updated RAMPART and environment.yml used to run the current version of RAMPART"
" * RE-RAN TEST RUN - SUCCESS; Issue 1 fixed; Issues 2, 3, 4 outlined in our goals on the README.md still stand"
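
#The config edits described above can be sketched with a short Python helper. The key names ('annotationOptions', 'min_identity', 'require_two_barcodes') follow the notebook entries, but the exact schema of protocol.json is assumed here for illustration:

```python
import json

def relax_protocol_options(path="protocol.json"):
    """Apply the April 22 tweaks described above: lower 'min_identity'
    from 50 to 10 and drop 'require_two_barcodes'.  Key names follow
    the notebook entry; the exact schema of protocol.json is assumed."""
    with open(path) as fh:
        protocol = json.load(fh)
    opts = protocol.setdefault("annotationOptions", {})
    opts["min_identity"] = 10                # was 50
    opts.pop("require_two_barcodes", None)   # tolerate poor barcoding
    with open(path, "w") as fh:
        json.dump(protocol, fh, indent=2)
```

#In practice the files were edited by hand; this sketch just records the intent of the change.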

#On April 23, 2020:

" * README.md containing our goals and original text was moved to parent folder; new README.md was crafted under project_spring_2020/project_spring_2020"
" * Test fastq files containing sequences from a noro run were added to the 'tests/test-fastq' directory"

#On May 4, 2020:

" * environment.yml was moved to root directory"
" * setup.py was coded for, but unsure of how to finalize packaging - do we need to upload it to pypi?"

#On May 7, 2020:

" * config.yml from circleci folder was deleted; meta.yaml outlining conda config uploaded to root directory"
" * run_test.py changed to run_test.txt as per notation in meta.yaml"

#Current work:

" * Issue 2 - Visualization issues are coded within the RAMPART program and are coded in JavaScript - will need to contact original coder James Hadfield."
" * Issues 3 and 4 - Consensus pipelines have been updated following NCoV2019 work - we are working with original coding team to update python script in Snakemake files accordingly."
181 changes: 179 additions & 2 deletions README.md
@@ -1,3 +1,180 @@
# project_spring_2020
# universal-realtime-noro
Original code was laid out by our collaborator, Aine Niamh O'Toole. The goal of our project is to adjust the baseline code to suit our needs. The original package screened for common types of human norovirus. However, our group at the NIH studies less common genotypes and unique strains evolving within immunocompromised patients. When testing out the original code, many of our samples failed to be mapped, and we could not generate consensus sequences for the samples. Our goals are to:
* Update the code to allow for more primers and reference strains to be accommodated
* Reprogram the RAMPART visualization to allow references to be assigned to ORFs (instead of the whole genome) and to allow mapping to show recombination sites
* Fix the binning problems of each of the barcoded reads
* Generate consensus sequences for our samples

[![CircleCI](https://circleci.com/gh/biof309/project_spring_2020/tree/master.svg?style=shield)](https://circleci.com/gh/biof309/project_spring_2020/tree/master)
This pipeline complements [``RAMPART``](https://github.com/artic-network/rampart) and continues downstream analysis to consensus level.

<img src="https://github.com/aineniamh/realtime-noro/blob/master/rampart/figures/rampart_noro.png">

## Table of contents


* [Requirements](#requirements)
* [Installation](#installation)
* [Setting up your run](#setting-up-your-run)
* [Checklist](#checklist)
* [Running RAMPART](#running-rampart)
* [RAMPART command line options](#rampart-command-line-options)
* [Downstream analysis](#downstream-analysis)
* [Quick usage](#quick-usage)
* [Pipeline description](#pipeline-description)
* [Output](#output)
* [Reference FASTA](#reference-fasta)
* [License](#license)

## Requirements
This pipeline runs on macOS and Linux. Installing Miniconda will make setting up this pipeline on your local machine much more streamlined. To install Miniconda, visit https://conda.io/docs/user-guide/install/ in a browser, select your operating system (macOS or Linux) and follow the download instructions. We recommend installing the 64-bit Python 3.7 version of Miniconda.

## Installation
Clone this repository:

```
git clone https://github.com/jessvls/project_spring_2020.git
```

1. Create the conda environment.
This may take some time, but will only need to be done once. It allows the pipeline to access all the software it needs, including RAMPART.

```
cd project_spring_2020
conda env create -f environment.yml
```

2. Activate the conda environment.

```
conda activate universal-realtime-noro
```

## Setting up your run


If you have a ``run_configuration.json`` file and a ``barcodes.csv`` file, you can run RAMPART with very few command line options. A template of the configuration files needed to run both RAMPART and the downstream analysis pipeline is provided in the ``examples`` directory.

The ``run_configuration.json`` can specify the path to your basecalled reads, or you can supply that information on the command line. `basecalledPath` should be set to wherever MinKNOW/guppy will write its basecalled files. If you want to alter where the annotation files from RAMPART or the analysis files from the downstream pipeline are written, you can add the optional ``"annotatedPath"`` and ``"outputPath"`` options. By default the annotations are written to a directory called ``annotations`` and the analysis output to a directory called ``analysis``.

```
run_configuration.json

{
"title": "MinION_run_example",
"basecalledPath": "fastq_pass"
}
```

Optional for RAMPART, but required for the downstream analysis pipeline, the ``barcodes.csv`` file describes which barcode corresponds to which sample. Note that you can have more than one barcode for each sample, but they will be merged in the analysis.

```
barcodes.csv

sample,barcode
sample1,BC01
sample2,BC02
sample3,BC03
sample4,BC04
```
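
Since several barcodes can map to one sample, the sheet is effectively a sample-to-barcodes grouping. A minimal sketch of reading it (`read_barcode_map` is a hypothetical helper for illustration, not part of the pipeline):

```python
import csv
from collections import defaultdict

def read_barcode_map(path="barcodes.csv"):
    """Parse the sample/barcode sheet shown above into {sample: [barcodes]}.
    Several barcodes may map to one sample; their reads are merged
    downstream by the analysis pipeline."""
    samples = defaultdict(list)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            samples[row["sample"]].append(row["barcode"])
    return dict(samples)
```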

## Checklist

- The conda environment ``universal-realtime-noro`` is active.
- A ``barcodes.csv`` file mapping samples to barcodes is either in the current directory, or its path will be provided on the command line.
- An ``annotations`` directory with csv files from RAMPART exists (it will be generated when RAMPART starts).
- The path to basecalled ``.fastq`` files is provided either in ``run_configuration.json`` or will be specified on the command line.

## Running RAMPART

Create run folder:

```
mkdir [run_name]
cd [run_name]
```

Where `[run_name]` is whatever you are calling today's run (as specified in MinKNOW).


With this setup, to run RAMPART:

```
rampart --protocol path/to/universal-realtime-noro/rampart
```

Open a web browser to view [http://localhost:3000](http://localhost:3000)

More information about RAMPART can be found [here](https://github.com/artic-network/rampart).

## RAMPART command line options

```
usage: rampart [-h] [-v] [--verbose] [--ports PORTS PORTS]
[--protocol PROTOCOL] [--title TITLE]
[--basecalledPath BASECALLEDPATH]
[--annotatedPath ANNOTATEDPATH]
[--referencesPath REFERENCESPATH]
[--referencesLabel REFERENCESLABEL]
[--barcodeNames BARCODENAMES [BARCODENAMES ...]]
[--annotationOptions ANNOTATIONOPTIONS [ANNOTATIONOPTIONS ...]]
[--clearAnnotated] [--simulateRealTime SIMULATEREALTIME]
[--devClient] [--mockFailures]
```

## Downstream analysis

### Quick usage

Recommended: all samples can be analysed in parallel by editing the following command to give the path to universal-realtime-noro and then entering it on the command line:

```
postbox -p path/to/universal-realtime-noro
```

```
usage: postbox [-h] -p PROTOCOL [-q PIPELINE] [-d RUN_DIRECTORY]
[-r RUN_CONFIGURATION] [-c CSV] [-t THREADS]
```

Alternatively, for each sample, the downstream analysis can be performed within the RAMPART GUI by clicking on the button to 'Analyse to consensus'.

## Pipeline description

The bioinformatic pipeline was developed using [snakemake](https://snakemake.readthedocs.io/en/stable/).

1. The server process of ``RAMPART`` watches the directory where the reads will be produced.
2. This snakemake takes each file produced in real-time and identifies the barcodes using a custom version of [``porechop``](https://github.com/artic-network/Porechop).
3. Reads are mapped against a panel of references using [``minimap2``](https://github.com/lh3/minimap2).
4. This information is collected into a csv file corresponding to each read file and the information is visualised in a web-browser, with depth of coverage and composition for each sample shown.
5. Once sufficient depth is achieved, the analysis pipeline can be started for one sample at a time by clicking in the web browser or, to run analysis for all samples, by typing ``postbox -p path/to/realtime-noro`` on the command line, substituting in the relative path to the protocol directory.
6. The downstream analysis pipeline runs the following steps:
- [``binlorry``](https://github.com/rambaut/binlorry) parses through the fastq files with barcode labels, pulling out the relevant reads and binning them into a single fastq file for each sample. It also applies a read-length filter (pre-set in the config file to only include full length amplicons).
- Based on the mapping coordinates of the read, relative to the reference it maps against, the amplicon that each read corresponds to is identified.
- The number of reads mapping to distinct genotypes is assessed with a custom python script (``parse_noro_ref_and_depth.py``) and reports whether multiple types of viruses are present in the sample and the number of corresponding reads.
- The reads are binned for each virus identified, and split into Amplicon1234 and Amplicon45 bins to account for never-seen-before recombinants.
- For each bin, the primers are trimmed from the reads.
- An iterative neural-net based polishing cycle is performed per virus type to provide a consensus sequence in ``.fasta`` format. [``racon``](https://github.com/isovic/racon) and [``minimap2``](https://github.com/lh3/minimap2) are run iteratively four times, with gap removal in each round, against the fastq reads and then a final polishing consensus-generation step is performed using [``medaka consensus``](https://github.com/nanoporetech/medaka).
- Read coverage for each base is calculated and regions of low coverage are masked with N's.
- For each sample, all sequences are collected into a single ``.fasta`` file containing polished, masked consensus sequences.
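
The low-coverage masking step above can be sketched as follows. This is an illustrative stand-in for the pipeline's ``mask_low_coverage.py``, and the 20x threshold here is an assumption, not the pipeline's actual cutoff:

```python
def mask_low_coverage(seq, depths, min_depth=20):
    """Replace bases whose per-base read depth falls below min_depth
    with 'N'.  The threshold is illustrative; the real pipeline takes
    its cutoff from the run configuration."""
    assert len(seq) == len(depths), "one depth value per base expected"
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(seq, depths)
    )
```

For example, a consensus base covered by only a handful of reads is replaced by an ambiguity character rather than reported as a confident call.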

### Output

By default the downstream analysis output will be put in a directory called ``analysis``.

Within that directory will be:
- a ``consensus_sequences`` directory with ``.fasta`` files for each sample. If the sample contained a mixture of viruses, all virus sequences present at high enough levels in the sample will be in that file.
- ``sample_composition_summary.csv`` is a summary file that gives the raw read counts for each sample that have mapped to particular virus sequences.

These are the main output files with summary information and the consensus sequences can be taken for further analysis at this point (i.e. alignments and phylogenetic trees). This directory also contains detailed output of the different steps performed in the analysis.

- ``binned_sample.csv`` and ``binned_sample.fastq`` are present for each sample. These are the output of ``BinLorry``. The csv file contains the mapping information from ``RAMPART``.
- Within each ``binned_sample`` directory are many of the intermediate files produced during the analysis, including outputs of the rounds of racon polishing and medaka consensus generation.

## Reference FASTA

The ``references.fasta`` file was updated by Jessica van Loben Sels to represent the newly defined ORFs 1 and 2 of all noroviruses, regardless of whether they are partial or complete ORFs. These supplement the reference file from the original realtime-noro platform.


## License

[GNU General Public License, version 3](https://www.gnu.org/licenses/gpl-3.0.html)
31 changes: 31 additions & 0 deletions circleci/config.yaml
@@ -0,0 +1,31 @@
package:
name: universal-realtime-noro
version: "0.1.0"

source:
git_url: https://github.com/JessvLS/project_spring_2020.git

build:
noarch: python
number: 0
script: python -m pip install --no-deps --ignore-installed .

requirements:
build:
- git
- cmake
host:
- python
- pip
run:
- python

test:
imports:
- run_test.txt

about:
home: https://github.com/JessvLS/project_spring_2020
license: Apache-2.0
license_file: LICENSE
summary: 'Visualization, consensus building of norovirus genomes using Nanopore data'
39 changes: 39 additions & 0 deletions environment.yml
@@ -0,0 +1,39 @@
name: universal-realtime-noro
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3
- nodejs=12
- biopython=1.74
- bwa=0.7.17=pl5.22.0_2
- clint=0.5.1=py36_0
- eigen=3.2=3
- pysam=0.15.3
- pyvcf=0.6.8=py36_0
- ete3=3.1.1=py36_0
- goalign=0.2.8=0
- gotree=0.2.10=0
- libdeflate=1.3
- muscle=3.8.1551=2
- nanopolish=0.13.0
- medaka=0.12.1
- longshot=0.4.1
- phyml=3.3.20190909 # references etetoolkit build
- pandas=0.23.0=py36_1
- samtools=1.9
- mafft=7.407=0
- iqtree=1.6.12
- datrie=0.8
- snakemake-minimal=5.8.1
- minimap2=2.17
- seqtk=1.3
- bcftools=1.9
- artic-network::rampart=1.0
- pip
- pip:
- git+https://github.com/artic-network/fieldbioinformatics.git
- git+https://github.com/artic-network/Porechop.git@v0.3.2pre
- binlorry==1.3.0_alpha1
- ont-fast5-api
31 changes: 31 additions & 0 deletions meta.yaml
@@ -0,0 +1,31 @@
package:
name: universal-realtime-noro
version: "0.1.0"

source:
git_url: https://github.com/JessvLS/project_spring_2020.git

build:
noarch: python
number: 0
script: python -m pip install --no-deps --ignore-installed .

requirements:
build:
- git
- cmake
host:
- python
- pip
run:
- python

test:
imports:
- run_test.txt

about:
home: https://github.com/JessvLS/project_spring_2020
license: Apache-2.0
license_file: LICENSE
summary: 'Visualization, consensus building of norovirus genomes using Nanopore data'
25 changes: 25 additions & 0 deletions project_spring_2020/examples/barcodes.csv
@@ -0,0 +1,25 @@
sample,barcode
Sample1,BC01
Sample2,BC02
Sample3,BC03
Sample4,BC04
Sample5,BC05
Sample6,BC06
Sample7,BC07
Sample8,BC08
Sample9,BC09
Sample10,BC10
Sample11,BC11
Sample12,BC12
Sample13,BC13
Sample14,BC14
Sample15,BC15
Sample16,BC16
Sample17,BC17
Sample18,BC18
Sample19,BC19
Sample20,BC20
Sample21,BC21
Sample22,BC22
Sample23,BC23
Sample24,BC24
4 changes: 4 additions & 0 deletions project_spring_2020/examples/run_configuration.json
@@ -0,0 +1,4 @@
{
"title": "MinION_Run1",
"basecalledPath": "path/to/basecalled/fastq_pass"
}
3 changes: 3 additions & 0 deletions project_spring_2020/rampart/README.md
@@ -0,0 +1,3 @@
# RAMPART

This folder contains example files for running RAMPART. Modify and adapt these to the specific instance being developed.