5. ARTIC sequencing analysis
This is a comprehensive step-by-step guideline for ARTIC sequencing analysis in the Handley lab. It covers everything from transferring raw reads from MGI (via Globus) to generating phylogenetic trees from the analyzed sequences.
- Software Installation
- Unix Cheatsheet
- Transferring Data from Globus to lab server
- Pre CZI file manipulation
- Running the CZI pipeline
- Post CZI analysis
# Software Installation
https://www.dropbox.com/t/IWcZ5XAUJI2hbGLX
Install Nextflow with any one of:

```shell
wget -qO- https://get.nextflow.io | bash
```

OR

```shell
curl -s https://get.nextflow.io | bash
```

OR

```shell
conda install -c bioconda nextflow
```

> The wget/curl installers drop a `nextflow` launcher into the current directory; move it onto your PATH to run it from any folder.
Docker installation instructions differ by platform (ex. CentOS, Debian, Ubuntu):
https://docs.docker.com/engine/install/centos/
# Unix Cheatsheet
Helpful cheat sheet on common unix commands:
http://evomics.org/learning/unix-tutorial/
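As a warm-up, here is the handful of commands this guide leans on most, chained end to end (the directory and file names are just for illustration):

```shell
mkdir -p demo_dir    # make a directory (-p: no error if it already exists)
cd demo_dir          # move into it
pwd                  # print the current location
touch notes.txt      # create an empty file
ls                   # list the directory contents
cd ..                # move back up one level
```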
# Transferring Data from Globus to lab server
ARTIC sequencing data from MGI is uploaded to Globus, from which we transfer it to our lab servers (ex. atlantis, pathogen or HTCF, preferably atlantis). The data transfer process comes in three parts:
- Globus installation
- Running command-line (CLI) Globus on the server
- Graphical User Interface (GUI) Globus data storage link in the browser
- insert image
- Skip this part if Globus is already installed on server
- insert how to check if globus connect personal is already installed
- cd into the directory where you want to transfer the data to (usually under /mnt)
- Install Globus Connect Personal:

```shell
wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
```

- Extract the archive:

```shell
tar xzf globusconnectpersonal-latest.tgz
```

- cd into the globus folder (fill x.y.z with the version number; `cd globusconnectpersonal-*/` also works if only one copy is extracted):

```shell
cd globusconnectpersonal-x.y.z
```

- Run globus:

```shell
./globusconnectpersonal
```

Instructions will then be prompted on the screen (insert image)
- Click the https://auth.globus.org link from the instructions to log in to GUI Globus in the browser
- In the browser, enter the authentication code shown in the command-line window (ex. sKScCcLBgcY2RCjcu0riLNsWbbzZL)
- In your command-line window, you will be asked to input a value for Endpoint name (ex. Handley_atlantis)
- Launch globus connect personal (leave it running for the duration of the transfer):

```shell
./globusconnectpersonal -start
```
- insert image
- Click on "Transfer or Sync to"
- Enter the Endpoint name as the destination (ex. Handley_atlantis)
- Click "Start"
After the transfer has finished:
- Sometimes there are many unnecessary .md5 files
- Remove them by running this command:

```shell
rm -rf *.md5.md5*
```

- Shut down globus connect personal with this command:

```shell
./globusconnectpersonal -stop
```

- Optional:
  - Delete the old Globus Connect Personal install directory.
  - Delete the old Globus Connect Personal config with this command:

```shell
rm -r ~/.globusonline/
```
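Glob deletes are unforgiving, so it is worth previewing what a pattern matches before removing anything. A self-contained sketch on throwaway files (the file names and the broad `*.md5*` pattern are illustrative, not the lab's exact layout):

```shell
tmp=$(mktemp -d) && cd "$tmp"    # throwaway sandbox
touch sample1.fastq.gz sample1.fastq.gz.md5 sample1.fastq.gz.md5.md5
ls *.md5*      # preview exactly what the glob will match
rm -f *.md5*   # then delete; only sample1.fastq.gz remains
ls
```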
# Pre CZI file manipulation
Make sure the file name for each sample is informative and unique for downstream analysis and visualization.

```shell
eval "$(sed 's/^/mv /g' rename_guide.txt)"
```
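A self-contained sketch of what the one-liner does, assuming rename_guide.txt holds one `old_name new_name` pair per line (all file names below are made up):

```shell
tmp=$(mktemp -d) && cd "$tmp"        # throwaway sandbox
touch V350018000_L01_read_1.fq.gz    # fake raw read
printf 'V350018000_L01_read_1.fq.gz Sample01_R1.fastq.gz\n' > rename_guide.txt
# sed prefixes every line with "mv ", turning the guide into shell commands;
# eval then executes them.
eval "$(sed 's/^/mv /g' rename_guide.txt)"
ls    # Sample01_R1.fastq.gz  rename_guide.txt
```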
# Running the CZI pipeline
The CZI pipeline generates consensus SARS-CoV-2 genomes as well as variant call format (VCF) files from raw fastq files.
Prerequisites:
- kraken2 database
The kraken2_db can either be downloaded from https://benlangmead.github.io/aws-indexes/k2 OR fetched by running:

```shell
mkdir kraken2_db    # Makes a folder named kraken2_db
cd kraken2_db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_viral_20210517.tar.gz    # Downloads the viral Kraken2 database
tar -xzf k2_viral_20210517.tar.gz    # Extract the archive
```
- primer bed file
> https://github.com/artic-network/artic-ncov2019/tree/master/primer_schemes/nCoV-2019
- cd into the run directory where the raw folder with fastq reads resides

> Nextflow processes should be run in working directories, NOT home directories

```shell
cd {path_to_directory}
```
- Test that the pipeline is working:

```shell
nextflow run czbiohub/sc2-illumina-pipeline -profile docker,test
```
- Example command:

> Note that the paths to reads, kraken2_db and outdir must be corrected by the user according to their data path

```shell
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker --reads '{path_to_raw_reads}/*_R{1,2}*.fastq.gz' --kraken2_db '/{path_to_kraken2_db}/kraken2_db' --outdir './outdir'
```

> If using a specific primer file (ex. V4.1 bed file):

```shell
nextflow run czbiohub/sc2-illumina-pipeline -profile artic,docker --reads 'new_raw/*_R{1,2}*.fastq.gz' --kraken2_db 'ARTIC/kraken2_db' --outdir './outdir' --primers 'ARTIC/primers/SARS-CoV-2.primer.V4.1.bed'
```
Command explanation:

```shell
nextflow run \                           # Tells nextflow to execute a pipeline project
  czbiohub/sc2-illumina-pipeline \       # Name of the pipeline to execute
  -profile artic,docker \                # Specify profiles that preset different compute environments; the order of arguments is important
  --reads '/{path}/*_R{1,2}*.fastq.gz' \ # Specify the location of your input SARS-CoV-2 read files
  --kraken2_db '/{path}/kraken2_db' \    # Specify the path to the folder containing the Kraken2 database
  --primers '' \                         # (optional) Specify the BED file with the amplification primers used in the library prep
  --outdir './outdir'                    # Specify the output directory where the results are stored
```

> The # comments are for explanation only; remove them before running.
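One detail worth calling out: the --reads pattern is single-quoted so that Nextflow, not the shell, expands the glob. A quick bash illustration (the data/ path is made up):

```shell
# Unquoted, bash brace-expands {1,2} before nextflow ever sees the pattern:
echo data/*_R{1,2}*.fastq.gz
# Single-quoted, the pattern reaches the program intact:
echo 'data/*_R{1,2}*.fastq.gz'
```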
***
# Post CZI analysis
Using VCFs and consensus fasta file outputs from the CZI pipeline, we perform variant annotation using SnpEff and phylogenetic assignment using [Pangolin](https://cov-lineages.org/resources/pangolin.html) and [NextClade](https://clades.nextstrain.org/). The filtered fasta from this pipeline can be used for manual tree generation in [Microreact](https://microreact.org/upload).
1. Pull the latest version of the pipeline:

```shell
nextflow pull anajung/CZI_addon    # Pull the latest revision
```
2. Run the pipeline (example command):

```shell
nextflow run anajung/CZI_addon --vcf '/{path}/*vcf.gz' --combinedfa '/{path}/combined.fa' --outdir '/{path}/out' -r main
```
Command explanation:

```shell
nextflow run \                         # Tells nextflow to execute a pipeline project
  anajung/CZI_addon \                  # Name of the pipeline to execute
  --vcf '/{path}/*vcf.gz' \            # Specify the location of your input VCF files
  --combinedfa '/{path}/combined.fa' \ # Specify the location of your input combined.fa file
  --outdir '/{path}/out' \             # Specify the output directory where the results are stored
  -r main                              # Specify the version of the pipeline to run
```
**Output files**
- annotated_vcf.tsv
- joined_lineage.tsv
**Interpreting output columns**
- snpEFF: https://pcingola.github.io/SnpEff/se_inputoutput/
- Nextclade: https://docs.nextstrain.org/projects/nextclade/en/latest/user/output-files.html
- Pangolin: https://cov-lineages.org/resources/pangolin/output.html
**Further result analysis**
* SNP analysis
* missing N's visualization
* summary statistics
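As an example of the missing-N check, the N fraction of a consensus genome can be computed straight from the fasta with awk. A sketch on a toy record (on real data, point it at the filtered fasta from the pipeline):

```shell
tmp=$(mktemp -d) && cd "$tmp"
printf '>sample01\nNNACGTACGTNN\n' > consensus.fa    # toy record: 4 N's / 12 bases
# Count N's and total bases on sequence lines only (headers start with ">").
awk '!/^>/ {n += gsub(/N/, "N"); total += length($0)}
     END {printf "%.2f%% N\n", 100 * n / total}' consensus.fa    # -> 33.33% N
```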
***