RNA sequencing

About the project

The following workflow is part of a project that examine how Toxoplasma gondii infection impact the expression of gene in mouse blood and lung tissues. The goal of this project is to identify genes differentially expressed between the different tissues and treatment and find GO terms enriched in these genes.

The dataset utilized for this project come from the study by Singhania et al. (2019) and can be found in Gene Expression Omnibus (GEO), accession GSE119855. The data consist of 3 biological replicates for the control samples and 5 replicates of the Toxoplasma infected mices (Case) in both lung and blood tissues (Table 1). Libraries were constructed using a strand-specific protocol and sequenced in high-resolution paired-end mode with the Illumina HiSeq 4000 platform

List of sample

Table 1: List of samples. The "Sample" column corresponds to the names of the FASTQ files. The "Case" group includes infected mice, while the "Control" group represents non-infected mice.

Sample	Group
SRR7821921	Lung Case
SRR7821922	Lung Case
SRR7821918	Lung Case
SRR7821919	Lung Case
SRR7821920	Lung Case
SRR7821937	Lung Control
SRR7821938	Lung Control
SRR7821939	Lung Control
SRR7821949	Blood Case
SRR7821950	Blood Case
SRR7821951	Blood Case
SRR7821952	Blood Case
SRR7821953	Blood Case
SRR7821968	Blood Control
SRR7821969	Blood Control
SRR7821970	Blood Control

Overview of the workflow

0. Getting started

This step consists of the script named 0_get_samplelist used for organizing and generating the file containing the list of the samples essential for the rest of the workflow. This script creates a FASTQ directory where links to all FASTQ files are created. Additionally, it makes a metadata folder containing the sample_list file. This file lists the name of each sample along with the corresponding paths to the Mate 1 and Mate 2 FASTQ files and also creates a file that recovers the table of the README file containing the names of all samples and their corresponding group.

1. Quality control

The Quality control step involves two scripts :

1a_FASTQC : Generate individual quality control reports for each sample.
1b_MultiQC : Combines all the individual reports in a single file, facilitating the assessment of data quality across all samples.

2. Mapping to the reference genome

The mapping step is carried out by the following scripts :

2a_get_reference : Download reference genome (Mus_musculus.GRCm39.dna.primary_assembly.fa) and annotation (Mus_musculus.GRCm39.113.gtf) from Ensembl and perform a checksum verification step was performed to confirm that no corruption occurred during the download process.
2b_hisat_index : Generate indexes of the reference genome using the tools hisat
2c_hisat_mapping : Maps the reads to the reference genome using hisat with the following parameter :
- -x : path to the index of the reference genome
- -1 : path to the first mate
- -2 : path to the second mate
- -S : path to output file
- --threads : number of threads used by hisat
- --rna-strandness RF : specify that the first read is the reverse strand
- --summary-file : allows creation of a summary file containing statistics about the mapping
2d_merging_summary_mapping : (Optional) Combine all the summary files generated during the mapping in one file
2e_samtools_conversion_bam: Convert the sam files generated during the last step to a bam file with the following parameter
- -h : include the header in the output file
- -b : convert the output file to the BAM format
- -S : inform that the input file is in SAM format
2f_samtools_sort : Sort the BAM files, this script uses the following parameters:
- -@ : specify the number of thread
- -m : specify the memory per thread
- -t : indicates the name of the temporary files
- -o : indicates the path and name of the output file
2g samtools_index : Index the BAM files
2h_samtools_flagstat : (Optional) Generate supplementary alignment statistics from the BAM files

3. Counting the number of reads per gene

This step uses the script 3_featurecount that use the tools FeatureCounts version v2.0.1, to produce a table of counts containing the number of reads per gene in each sample and a summary file. The featurecount tools use the following parameters :

-T : number of threads used by featurecount
-p : specify that the reads were sequenced pair-end
-s2 : specify that the first read is the reverse strand
-Q10 : only consider the aligned read with a quality equal or higher than 10
-t exon : specifies that feature should count reads based on the exon feature type in the input annotation file
-g gene_id : specifies which attribute in the annotation file should be used to group exons into genes for counting.
-a : indicates the path to the annotation file
-o : indicates the path and name of the output file

4. Differential expression analysis and Gene Ontology enrichment analysis

This step requires the file containing the names of all samples and their corresponding group generated in step 0 and the table of counts generated by FeatureCounts and use the following R scripts:

4a_DE_utils.R : this script sets the general constant, imports the common library utilised in the other R scripts and will be loaded by all the other R script used for the Differential expression analysis.
4b_DE_preprocessing.R : this script processes the count table data, prepares the sample metadata, performs the Differential expression analysis using DEseq2 and saves the result as an RDS file.
4c_DE_PCA.R : This script performs a Principal component analysis on the result of the Differential expression analysis and generates the corresponding graph.
4d_DE-pairwise_comparaison.R : this script performs pairwise differential expression analyses and Gene Ontology enrichment analysis, generates visualizations (volcano plots, barplots, and dotplots) and saves the results in Excel files and barplots as PNG files for further interpretation.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
0_get_samplelist.sh		0_get_samplelist.sh
1a_FASTQC.sh		1a_FASTQC.sh
1b_MultiQC.sh		1b_MultiQC.sh
2a_get_reference.sh		2a_get_reference.sh
2b_hisat_index.sh		2b_hisat_index.sh
2c_hisat_mapping.sh		2c_hisat_mapping.sh
2d_merging_summary_mapping.sh		2d_merging_summary_mapping.sh
2e_samtools_conversion_bam.sh		2e_samtools_conversion_bam.sh
2f_samtools_sort.sh		2f_samtools_sort.sh
2g_samtools_index.sh		2g_samtools_index.sh
2h_samtools_flagstat.sh		2h_samtools_flagstat.sh
3_featurecount.sh		3_featurecount.sh
4a_DE_utils.R		4a_DE_utils.R
4b_DE_preprocessing.R		4b_DE_preprocessing.R
4c_DE_PCA.R		4c_DE_PCA.R
4d_DE_pairwise_comparison.R		4d_DE_pairwise_comparison.R
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RNA sequencing

Table of Contents

About the project

List of sample

Overview of the workflow

0. Getting started

1. Quality control

2. Mapping to the reference genome

3. Counting the number of reads per gene

4. Differential expression analysis and Gene Ontology enrichment analysis

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RNA sequencing

Table of Contents

About the project

List of sample

Overview of the workflow

0. Getting started

1. Quality control

2. Mapping to the reference genome

3. Counting the number of reads per gene

4. Differential expression analysis and Gene Ontology enrichment analysis

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages