- Learn how to align with HISAT2 and kallisto
- Convert the output from the aligners to the BAM format.
- Explore the alignments
hisat2/2.0.5
kallisto/0.43.0
samtools/1.8
stringtie/1.3.4d- As before, Keep the record of which version of the software you use. Sometimes updates may change the output. To be consistent in the analysis, use always the same version within a experiment.
- You can write scripts with all your commands to run the same analysis again.
For this tutorial, create folders for the outputs of each program.
mkdir -p alignments
mkdir -p alignments/kallisto
mkdir -p alignments/HISAT2
mkdir -p stringtie_outputRun these exercises using batch scripts, rather than interactively. It is a good idea to specify STERR and STOUT files in the header using the SBATCH parameters, as we will use these later.
#SBATCH -o filename.out # STDOUT
#SBATCH -e filename.err # STDERRThe first step is to run the alignments on each sample individually, and then pipe the output through samtools to convert the SAM file to a BAM file.
module load hisat2/2.0.5
module load samtools/1.8
hisat2 -x hisat_index/transformed_coordinates_fasta_indx \
-p 2 --dta \
--known-splicesite-infile gtf_splice/transformed_coordinates_splices.txt \
-1 trim/104B_1P.fastq \
-2 trim/104B_2P.fastq \
| samtools view -bS - > alignments/HISAT2/104B_P.bam
- To run the alignments faster, you can increase the number of threads by changing the flag
-p 8inhisat2
All the bam files need to be sorted for downstream analysis/
module load samtools/1.8
samtools sort alignments/HISAT2/104B_P.bam \
-o alignments/HISAT2/sorted_104B_P.bamRepeat the mapping and subsequent sorting of the output BAM file for each sample.
- Does
alignments/HISAT2/contain all the BAM files you expect?
Look at the STDERR file:
- What was the overall mapping % for each sample?
- What percentage of reads mapped that as pairs, uniquely?
- What percentage of reads mapped that as pairs, more than once?
We use StringTie to quantify the gene expression in each sample. It takes a GFF annotation file as input, along with a BAM file containing RNA-seq read mappings, sorted by their genomic location.
module load stringtie/1.3.4d
stringtie \
alignments/HISAT2/sorted_104B_P.bam \
-p 8 -G References/transformed_coordinates.gff -l strg -o /stringtie_output/104B_P_string.gtfRun Stringtie for each of the 4 samples.
The output .gtf file for each sample contains the name of each transcript and its coordinates, along with FPKM values.
Next, merge the transcripts from all the samples using StringTie --merge, to create a single .gtf that we will use to re-quantify gene expression. By providing the original References/transformed_coordinates.gff at the same time, StringTie will use this as a guide when reconstructing novel transcripts.
First you need to use a text editor to create a .txt file containing a list of the .gtf output files generated by StringTie, including filepaths, as below. This file tells StringTie which gtf files you want it to include in the --merge and how to find them. Save this in your gtf_splice/ folder.
stringtie_output/104B_P_string.gtf
stringtie_output/105B_P_string.gtf
stringtie_output/119B_P_string.gtf
stringtie_output/120A_P_string.gtfmodule load stringtie/1.3.4d
stringtie --merge -p 2 -G References/transformed_coordinates.gff \
-l mrg \
-o stringtie_output/merge_stringtie_gtf \
gtf_splice/list_gtf.txt The last line of the command above tells StringTie where to find the .txt file you created earlier, containing the filepaths and names of the StringTie derived gtf files you wish to use with the merge command.
We use this stringtie_output/merge_stringtie_gtf to re-estimate transcript abundances and create tables of counts for Ballgown.
module load stringtie/1.3.4d
stringtie -e -B -p 2 -G stringtie_output/merge_stringtie_gtf \
-o ./ballgown_str_mrg/104B/104B_vs_mrg_gtf.gtf \
alignments/HISAT2/sorted_104B_P.bamThe option -e limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option (requires -G, recommended for -B/-b). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes.
This option -B enables the output of 'Ballgown' input table files *.ctab containing coverage data
for the reference transcripts given with the -G option.
If the option -o is given as a full path to the output transcript file, StringTie will write the *.ctab
files in the same directory as the output GTF.
The output .gtf file for each sample contains the name of each gene, its coordinates, along with FPKM values.
Repeat this using each sorted BAM file.
- Check that you have the correct folders within the
ballgown_str_mrgdirectory - Does each sample folder contain a
*.ctabfile? - Use
lessto look at the columns in the.gtffiles generated by StringTie
We will now repeat the mapping, but this time we will use kallisto to map reads to the transcriptome. Again, it is important to specify the STERR and STOUT files in your batch script.
#SBATCH -o filename.out # STDOUT
#SBATCH -e filename.err # STDERR
module load kallisto/0.43.0
kallisto quant --pseudobam -i kall_index/kall_selected_refseq1 -t 1 -b 10\
-o /alignments_kallisto/kall_104B \
/trim/104B_1P.fastq \
/trim/104B_2P.fastq \quant runs the quantification algorithm, using the index defined at -i that we created earlier.
Using the ---pseudobam option outputs pseudoalignments to the transcriptome as a BAM file.
In versions of kallisto prior to 0.44.0, this pseudobam is sent to the STDOUT file specified in the batch script, so we will use samtools to sort it and
save it as a bam file.
In this kallisto version, we can only specify -t 1 if we are using the --pseudobam option.
module load samtools/1.8
samtools sort alignments_kallisto/filename.out \
-o alignments_kallisto/kall_sorted_104B_P.bamRepeat this for the remaining samples. The STERR file contains information on the mapping process.
kallisto quant produces three output files by default:
abundances.h5is a HDF5 binary file containing run info, abundance estimates, bootstrap estimates, and transcript length information length. This file can be read in by sleuthabundances.tsvis a plaintext file of the abundance estimates. It does not contains bootstrap estimates. Please use the --plaintext mode to output plaintext abundance estimates. The first line contains a header for each column, including estimated counts, TPM, effective length.run_info.jsonis a json file containing information about the run
- Check that you have all the output files you expect from running kallisto, and sorting the
bamfiles - Use
lessto look at theabundances.tsvfile
- HISAT2 manual: https://ccb.jhu.edu/software/hisat2/manual.shtml
- StringTie manual: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual
- kallisto manual: https://pachterlab.github.io/kallisto/manual
- SAM Format specification: https://samtools.github.io/hts-specs/SAMv1.pdf
- SAM Flags calculator: https://broadinstitute.github.io/picard/explain-flags.html