This repository was archived by the owner on Feb 26, 2025. It is now read-only.

5. ARTIC sequencing analysis

mihinduk edited this page Jun 29, 2023 · 23 revisions

Introduction

This is a comprehensive step-by-step guide for ARTIC sequencing analysis in the Handley lab. It covers every step from transferring raw reads from MGI (via Globus) through phylogenetic tree generation for the analyzed sequences.

Practice data

https://www.dropbox.com/t/IWcZ5XAUJI2hbGLX


Installing Software Dependencies

Nextflow

wget -qO- https://get.nextflow.io | bash

OR

curl -s https://get.nextflow.io | bash

OR

conda install -c bioconda nextflow

Docker

Docker installation instructions vary by platform (e.g. CentOS, Debian, Ubuntu):

https://docs.docker.com/engine/install/centos/

Unix Cheatsheet

A helpful cheat sheet of common Unix commands:

http://evomics.org/learning/unix-tutorial/

Transferring Data from Globus to lab server

ARTIC sequencing data from MGI is uploaded to Globus, from which we transfer it to one of our lab servers (ex. atlantis, pathogen, or HTCF; preferably atlantis). The transfer process has three parts:

Globus installation

  • Skip this part if Globus is already installed on the server
  • insert how to check if globus connect personal is already installed
  1. cd into the directory you want to transfer the data to (usually under /mnt)
  2. Download Globus Connect Personal:
wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
  3. Extract the archive:
tar xzf globusconnectpersonal-latest.tgz
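Until the check mentioned above is documented, a minimal sketch for testing whether Globus Connect Personal is already present (assuming it is either on PATH or extracted in the current directory):

```shell
# Sketch: check for an existing Globus Connect Personal install.
# Assumes it is either on PATH or extracted in the current directory.
if command -v globusconnectpersonal >/dev/null 2>&1; then
    echo "Globus Connect Personal found on PATH"
elif ls -d globusconnectpersonal-*/ >/dev/null 2>&1; then
    echo "Extracted install directory found"
else
    echo "Globus Connect Personal not found; follow the install steps above"
fi
```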

Running CLI Globus on the server

  1. cd into the Globus folder (fill in x.y.z with the version number):
cd globusconnectpersonal-x.y.z
  2. Run Globus:
./globusconnectpersonal

Setup instructions will then be printed on the screen (insert image)

  3. Open the https://auth.globus.org link from the instructions to log in to GUI Globus in the browser

  4. In the browser, enter the authentication code shown in the command-line window (ex. sKScCcLBgcY2RCjcu0riLNsWbbzZL)

  5. Back in the command-line window, you will be asked to enter a value for the Endpoint name (ex. Handley_atlantis)

  6. Launch Globus Connect Personal:

./globusconnectpersonal -start
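The -start command keeps running in the foreground. One option (a convenience suggestion, not an official requirement; globus.log is a hypothetical log file name) is to background it under nohup so the endpoint stays up while you switch to the browser:

```shell
# Optional: run the endpoint in the background so the terminal stays usable.
# globus.log is a hypothetical log file name chosen here for illustration.
nohup ./globusconnectpersonal -start > globus.log 2>&1 &
echo "Globus Connect Personal started in the background (PID $!)"
```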

Graphical User Interface (GUI) Globus data storage link in the browser

  • insert image
  1. Click on Transfer or Sync to
  2. Enter the Endpoint name as the destination (ex. Handley_atlantis)
  3. Click Start

After the transfer has finished:

  • sometimes the transfer leaves many unnecessary duplicated .md5 files
  • remove them by running this command:
rm -rf *.md5.md5*
  • Shut down globus connect personal with this command:
./globusconnectpersonal -stop
  • optional:
  • Delete the old Globus Connect Personal install directory.
  • Delete the old Globus Connect Personal config with this command:
rm -r ~/.globusonline/
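A quick illustration (with throwaway files in a demo directory) of what the .md5 cleanup glob above does and does not remove:

```shell
# Demo with throwaway files: the *.md5.md5* glob removes only the
# duplicated checksum files, leaving reads and single .md5 files intact.
mkdir -p md5_demo && cd md5_demo
touch sample1.fastq.gz sample1.fastq.gz.md5 sample1.fastq.gz.md5.md5
rm -rf *.md5.md5*
ls    # sample1.fastq.gz and sample1.fastq.gz.md5 remain
cd ..
```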

Pre CZI file manipulation

Make sure the file names for each sample are informative and unique for downstream analysis and visualization.

eval "$(sed 's/^/mv /g' rename_guide.txt )"
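The one-liner above prepends mv to every line of rename_guide.txt and executes the result, so the guide file is expected to hold one "old_name new_name" pair per line. A hypothetical example (the sample names are made up for illustration):

```shell
# Hypothetical rename_guide.txt: one "old_name new_name" pair per line.
mkdir -p rename_demo && cd rename_demo
printf 'S1_R1.fastq.gz Patient01_R1.fastq.gz\nS1_R2.fastq.gz Patient01_R2.fastq.gz\n' > rename_guide.txt
touch S1_R1.fastq.gz S1_R2.fastq.gz
# sed turns each line into "mv old new"; eval runs the generated commands
eval "$(sed 's/^/mv /g' rename_guide.txt)"
ls    # Patient01_R1.fastq.gz  Patient01_R2.fastq.gz  rename_guide.txt
cd ..
```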

Running the CZI pipeline

The CZI pipeline generates consensus SARS-CoV-2 genomes as well as variant call format (VCF) files from raw fastq files.

Prerequisites:

  • kraken2 database

The kraken2_db can either be downloaded manually from https://benlangmead.github.io/aws-indexes/k2 OR fetched on the command line:

mkdir kraken2_db                                                           # Make a folder named kraken2_db
cd kraken2_db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20210517.tar.gz   # Download the viral Kraken2 database
tar -xf k2_viral_20210517.tar.gz                                           # Extract the archive

  • primer bed file
> https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019

To run the CZI pipeline

  1. cd into the run directory where the raw folder with the fastq reads resides

nextflow processes should be run in working directories, NOT home directories

cd {path_to_directory}
  2. Test that the pipeline is working:

nextflow run czbiohub/sc2-illumina-pipeline -profile docker,test

3. Example command:
> note that the paths to reads, kraken2_db and outdir must be adjusted by the user to match their own data paths

nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker --reads '{path_to_raw_reads}/*_R{1,2}.fastq.gz' --kraken2_db '/{path_to_kraken2_db}/kraken2_db' --outdir './outdir'

> if using specific primer file (ex. V4.1 bed file)

nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker --reads 'new_raw/*_R{1,2}.fastq.gz' --kraken2_db 'ARTIC/kraken2_db' --outdir './outdir' --primers 'ARTIC/primers/SARS-CoV-2.primer.V4.1.bed'
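Before launching, it can help to confirm that the reads glob actually matches paired files. This sketch uses a throwaway directory (raw_demo stands in for your real reads folder):

```shell
# Sanity check that R1/R2 file counts match before launching the pipeline.
# raw_demo is a stand-in for your real reads directory.
raw=raw_demo
mkdir -p "$raw"
touch "$raw"/sampleA_R1.fastq.gz "$raw"/sampleA_R2.fastq.gz
r1=$(ls "$raw"/*_R1.fastq.gz 2>/dev/null | wc -l)
r2=$(ls "$raw"/*_R2.fastq.gz 2>/dev/null | wc -l)
echo "R1 files: $r1, R2 files: $r2"   # the two counts should be equal
```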


Command explanation:

nextflow run \                                   # Tells nextflow to execute a pipeline project
  czbiohub/sc2-illumina-pipeline \               # Name of the pipeline to execute
  -profile artic,docker \                        # Specify profiles that preset different compute environments; the order of arguments is important
  --reads '/{path}/*_R{1,2}.fastq.gz' \          # Specify the location of your input SARS-CoV-2 read files
  --kraken2_db '/{path}/kraken2_db' \            # Specify the path to the folder containing the Kraken2 database
  --primers '' \                                 # (optional) Specify the BED file with the amplification primers used in the library prep
  --outdir './outdir'                            # Specify the output directory where the results can be stored

***



# Post CZI analysis

Using the VCF and consensus fasta file outputs from the CZI pipeline, we perform variant annotation using SnpEff and lineage assignment using [Pangolin](https://cov-lineages.org/resources/pangolin.html) and [NextClade](https://clades.nextstrain.org/). The filtered fasta from this pipeline can be used for manual tree generation in [Microreact](https://microreact.org/upload).
1. Pull latest version of pipeline 

nextflow pull anajung/CZI_addon # Pull the latest revision


2. Run pipeline (Example command): 

nextflow run anajung/CZI_addon --vcf '/{path}/*vcf.gz' --combinedfa '/{path}/combined.fa' --outdir '/{path}/out' -r main


Command explanation:

nextflow run \                                   # Tells nextflow to execute a pipeline project
  anajung/CZI_addon \                            # Name of the pipeline to execute
  --vcf '/{path}/*vcf.gz' \                      # Specify the location of your input VCF files
  --combinedfa '/{path}/combined.fa' \           # Specify the location of your input combined.fa file
  --outdir '/{path}/out' \                       # Specify the output directory where the results can be stored
  -r main                                        # Specify the version (revision) of the pipeline to run


**Output files**
- annotated_vcf.tsv
- joined_lineage.tsv

**Interpreting output columns**
- snpEFF: https://pcingola.github.io/SnpEff/se_inputoutput/
- Nextclade: https://docs.nextstrain.org/projects/nextclade/en/latest/user/output-files.html
- Pangolin: https://cov-lineages.org/resources/pangolin/output.html
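To get a quick feel for the outputs, you can peek at the TSVs from the command line. The column names and rows below are illustrative assumptions only; check the SnpEff, Nextclade, and Pangolin documentation linked above for the real ones:

```shell
# Fabricated demo rows purely for illustration; real column names come from
# the Pangolin/Nextclade documentation linked above.
printf 'taxon\tlineage\tclade\nPatient01\tB.1.617.2\t21A\n' > joined_lineage_demo.tsv
head -n 5 joined_lineage_demo.tsv    # inspect the header and first rows
cut -f 2 joined_lineage_demo.tsv     # pull just the lineage column
```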

**Further result analysis**
* SNP analysis
* visualization of missing bases (Ns)
* summary statistics

***