Small RNA Alignment, Filtering, and Annotation

Trimming and Alignment Process

The program cutadapt trims adapters from both ends of RNA-seq reads.

for i in *.fastq
do
cutadapt -q 5 -b TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGACGACATCTCGTATGC -O 1 -m 16 -o $i.ft $i
done

I use the .ft extension to denote "fastq trimmed". I will give the read counts when I have time.

I concatenated the fastq.ft files for each sample and ran something like this to get counts:

awk '{getline seq; getline plus; getline score; print seq}'   Samplename.fastq.ft|sort -k1d,1  | uniq -c | awk '{print ">SampleID:"NR":"$1; print $2}' > Samplename.uniq.fa

which yields a file in this format:

>H1_GACGAC:1:1
AAAAAAAAAAAAAAAA
>H1_GACGAC:2:4
AAAAAAAAAAAAAAAAA

Fasta files for the individual samples are concatenated to look like this using the fastaConcatenator.py script:

>H1_GACGAC:9:3>H2_TAATCG:7:2
AAAAAAAACATTAGACT

To that fasta file I added phred-33 scores of 'D', which should be sufficiently low to allow an occasional mismatch. From a fasta file with format as in the above grey box, the following awk script yields a fastq file. Don't forget that the fastq name line needs to start with '@'.

awk '{getline seq; print "@"$0; print seq; print "+"; for (i=1; i<=length(seq); i++) printf "D"; printf "\n" }' All_Samples.fasta > All_Samples.fastq

Then I aligned the sequences with bowtie like this:

nohup bowtie2 --phred33 -p 20 -k 1000 --end-to-end --very-sensitive -x /data/helicos/reference/rn5/noRandom/rn5_No_Random -U All_Samples.concat.fq -S All_Samples.concat.sam

For this alignment, I used 20 threads and told the aligner to show up to 1,000 alignments for each read. See the bowtie2 manual for details on the settings.

For the alignments we actually use, I went through the above procedure separately for reads from the BNLX and SHRH strains, and aligned them to their respective strain-specific genomes.

Post-alignment Processing

Across all 6 samples, the small-RNA sequencing experiment yielded 96,330,384 trimmed reads prior to alignment.

Small RNA Read Counts
Strain	Trimmed Reads	Aligned Reads	Perfect Aligned	Perfect Uniquely Aligned
BNLX	59,677,100	55,554,442	41,801,988	16,650,698
SHRH	36,653,284	34,521,891	27,093,237	10,941,673
Total	96,330,384	90,076,333	68,895,225	27,592,371

Again, bowtie2 settings allowed up to 1000 alignments per read, and reports the first 1000 alignments for reads that aligned to more than 1000 locations (rather than discarding them, as I previously reported). A piece of the .sam file output:

>Lx3_CTCAGA:288764:1    256     chr13   95723976        0       48M     *       0       0       ATTTGTGGTTAAGAGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGG *       AS:i:-25        XS:i:-15        XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:2G0G6C2T29T4       YT:Z:UU
>Lx1_CTAGCT:409415:1>Lx2_CTATAC:1943563:1       16      chr7    32974132        0       22M     *       0       0       CACATAGGAATAAAAAGCCTTA   DDDDDDDDDDDDDDDDDDDDDD  AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:19A2       YT:Z:UU
>Lx1_CTAGCT:409415:1>Lx2_CTATAC:1943563:1       256     chr8    114212759       0       22M     *       0       0       TAAGGCTTTTTATTCCTATGTG   *       AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:2T19       YT:Z:UU
>Lx2_CTATAC:1029272:1   16      chr16   8150525 0       18M     *       0       0       TCCTGAGAAGATTGACTG      DDDDDDDDDDDDDDDDDD       AS:i:-5 XS:i:-5 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:14C3       YT:Z:UU

From the sam format above to the bed format below, I used the Python script samParser.py. The main reason for using Python rather than an awk/shell script was to be able to use a hash/dict easily. This script accepts the concatenation of the two sam files, catalogs the number of reads for each sample and gives all the locations that a read from either sample aligned to.

Here is a bit of the bed file (both_sorted.bed) of all aligned sequences after the samParser, sorted, with chromosome and location, a vector of counts by sample, the alignment score, strand, sequence, and the number of genomic locations to which the sequence aligned. The format of column 4 is leftover from a previous version of the parser.

chr1    26476   26506   :0:0:0:3:30:0:0 33      -       ATGTCCAATTTTCTGAGGAACCTCCAGACT  1000
chr1    28395   28415   :3:4:0:0:1:6:0  14      +       CAGTGGTAGAGGGCTTGCCT    171
chr1    31907   31942   :4:3:0:3:4:4:0  18      -       GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC     2000

In previous versions of the samParser, I included a column for the number of mismatches between the read and the reference, as extracted from the sam file. Since the reads are aligned to two different genomes, it is possible that the number of mismatches differs between the two strains, so what I am going to do is use the bedtools program fastaFromBed to get the reference sequence for the location from each of the genomes, then check it against the read sequence. Note that fastaFromBed grabs the stranded sequence if you ask it to so no need to reverse complement the read if it aligned to the - strand.

First we will filter by number of reads, and by the weighting from the samples. If there are fewer than 10 reads, or if one sample contributes the majority of the reads, that sequence is removed. You can pipe in and out of the various bedtools programs, but I broke things up to make it easier to see what is going on.

 awk '{ind=0; split($4,a,":"); for (i=2; i<length(a); i++) {if (a[i]>(.5*$5)) {ind=1; break;}} 
if ((ind==0) && ($5>9)) print }' both_sorted.bed | 

awk '{print $1"\t"$2"\t"$3"\t"$1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"\t"$5"\t"$6}'| 

fastaFromBed -s -name -tab -fi /data/helicos/reference/rn5/BNLXGenome/BNLX_rn5_Genome.fa -bed stdin -fo both_sorted_BNLX_ref.bed

Whence:

[flinks@phenogen realign]$ head both_sorted_BNLX_ref.bed
chr1|23311|23327|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293       tgtttaggtattggcc
chr1|24861|24878|:2:5:0:2:9:7:0|25|-|ATGTTTGGGTATTGGCC|231      atgtttaggtattggcc
chr1|24861|24877|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293       tgtttaggtattggcc
chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171   cagtggtagagggcttgcct
chr1|31907|31942|:4:3:0:3:4:4:0|18|-|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|2000   ggttggggatttagctcagtggtagagcgcttgcc
chr1|31908|31925|:3:2:0:0:4:1:0|10|-|AGTGGTAGAGCGCATGC|1837     agtggtagagcgcttgc
chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970   atttagctcagtggtagagcgcttgcct
chr1|42648|42664|:5:1:0:3:4:3:0|16|+|GAGCAATGATGACCCC|56        gagcaatgctgacccc
chr1|44982|45002|:1:16:0:1:7:20:0|45|-|CAGTGGTAGAGCGCTTGCCG|1847        cagtggtagagcgcttgcct
chr1|44998|45021|:4:1:0:1:2:5:0|13|-|AGGGGCTGGGGATTTAGCTCAGC|1680       Aggggctggggatttagctcagt

thence:

 awk '{s=0; refseq=toupper($2); split($1,a,"|"); if (a[7]==refseq) s=1; print a[1]"\t"a[2]"\t"a[3]"\t"a[1]"|"a[2]"|"a[3]"|"a[4]"|"a[5]"|"a[6]"|"a[7]"|"a[8]"|"refseq"|BNLX:"s"\t"0"\t"a[6]}' both_sorted_BNLX_ref.bed 
| fastaFromBed -s -name -tab -fi /data/helicos/reference/rn5/SHRHGenome/SHRH_rn5_Genome.fa -bed stdin -fo both_sorted_ref.bed
[[flinks@phenogen realign]$ head both_sorted_ref.bed
chr1|23311|23327|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293|TGTTTAGGTATTGGCC|BNLX:0       tgtttaggtattggcc
chr1|24861|24878|:2:5:0:2:9:7:0|25|-|ATGTTTGGGTATTGGCC|231|ATGTTTAGGTATTGGCC|BNLX:0     atgtttaggtattggcc
chr1|24861|24877|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293|TGTTTAGGTATTGGCC|BNLX:0       tgtttaggtattggcc
chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171|CAGTGGTAGAGGGCTTGCCT|BNLX:1       cagtggtagagggcttgcct
chr1|31907|31942|:4:3:0:3:4:4:0|18|-|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|2000|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|BNLX:1        ggttggggatttagctcagtggtagagcgcttgcc
chr1|31908|31925|:3:2:0:0:4:1:0|10|-|AGTGGTAGAGCGCATGC|1837|AGTGGTAGAGCGCTTGC|BNLX:0    agtggtagagcgcttgc
chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|BNLX:1       atttagctcagtggtagagcgcttgcct
chr1|42648|42664|:5:1:0:3:4:3:0|16|+|GAGCAATGATGACCCC|56|GAGCAATGCTGACCCC|BNLX:0        gagcaatgctgacccc
chr1|44982|45002|:1:16:0:1:7:20:0|45|-|CAGTGGTAGAGCGCTTGCCG|1847|CAGTGGTAGAGCGCTTGCCT|BNLX:0    cagtggtagagcgcttgcct
chr1|44998|45021|:4:1:0:1:2:5:0|13|-|AGGGGCTGGGGATTTAGCTCAGC|1680|AGGGGCTGGGGATTTAGCTCAGT|BNLX:0        Aggggctggggatttagctcagt

and hence:

awk '{s=0; refseq=toupper($2); split($1,a,"|"); if (a[7]==refseq) s=1; print a[1]"\t"a[2]"\t"a[3]"\t"a[4]"\t"a[5]"\t"a[6]"\t"a[7]"\t"a[8]"\t"a[9]"\t"a[10]"\t"refseq"\tSHRH:"s}' both_sorted_ref.bed 
> both_sorted_fullRef.bed

head both_sorted_fullRef.bed
chr1    23311   23327   :4:3:0:1:5:7:0  20      -       TGTTTGGGTATTGGCC        293     TGTTTAGGTATTGGCC        BNLX:0  TGTTTAGGTATTGGCC        SHRH:0
chr1    24861   24878   :2:5:0:2:9:7:0  25      -       ATGTTTGGGTATTGGCC       231     ATGTTTAGGTATTGGCC       BNLX:0  ATGTTTAGGTATTGGCC       SHRH:0
chr1    24861   24877   :4:3:0:1:5:7:0  20      -       TGTTTGGGTATTGGCC        293     TGTTTAGGTATTGGCC        BNLX:0  TGTTTAGGTATTGGCC        SHRH:0
chr1    28395   28415   :3:4:0:0:1:6:0  14      +       CAGTGGTAGAGGGCTTGCCT    171     CAGTGGTAGAGGGCTTGCCT    BNLX:1  CAGTGGTAGAGGGCTTGCCT    SHRH:1

Recall that 1=no mismatches. Now get rid of the pesky extra sequences and keep only those that matched at least one of the genomes perfectly (although how do you interpret those that match only one?) (script not shown).

head both_Strains_filtered_reads.bed
chr1    28395   28415   :3:4:0:0:1:6:0  14      +       CAGTGGTAGAGGGCTTGCCT    171     BNLX:1  SHRH:1
chr1    31907   31942   :4:3:0:3:4:4:0  18      -       GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC     2000    BNLX:1  SHRH:1
chr1    32148   32176   :100:87:0:4:67:252:0    510     +       ATTTAGCTCAGTGGTAGAGCGCTTGCCT    1970    BNLX:1  SHRH:1
chr1    62854   62883   :125:123:0:11:76:220:0  555     -       GATTTAGCTCAGTGGTAGAGCGCTTGCCT   2000    BNLX:1  SHRH:1

Here's the big trick: the bedtools program mergeBed merges overlapping genomic intervals into a single interval, and if you tell it to, it will report all of the names of the intervals that were merged into the single interval. Our plan then is to pack all of the information we want to preserve into the name (4th column) of the bed file above. Once we have these features, we want to sort them

awk '{split($10,B,":"); split($12,S,":"); if (B[2]+S[2] > 0) print $1"\t"$2"\t"$3"\t"$1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"|"$10"|"$12"\t"0"\t"$6}' both_Strains_filtered_reads.bed > both_Strains_precrammed.bed

Now we will use mergeBed in the bedtools package to concatenate overlapping reads and retain their information, via

mergeBed -s -nms -i both_Strains_precrammed.bed > both_Strains_crammed.win

which gives us this kind of thing:

chr1    28395   28415   chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171|SHRH:1|   +
chr1    32148   32176   chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970|SHRH:1|   +
chr1    159118  159151  chr1|159118|159151|:11:9:0:1:7:14:0|42|+|TGGGGATTTAGCTCAGTGGTAGAGCGCTTGCCT|2000|SHRH:1|;chr1|159129|159151|:25:33:0:2:24:68:0|152|+|CTCAGTGGTAGAGCGCTTGCCT|2000|SHRH:1|;chr1|159135|159151|:5:5:0:2:7:12:0|31|+|GGTAGAGCGCTTGCCT|1967|SHRH:1|

For each row, we split the fourth column on ";", and amalgamate the different characteristics of that feature (interval of concatenated read locations): chromosome, start, end, counts by sample, an indicator that tells if a single sample contributes more than 40% of the reads, the total read count for the feature, counts for the different reads, number of mapped locations for each read, mapping quality for each read by strain, . Here is the script:

awk '{
split("0,0,0,0,0,0,0",counts,",");
split("",seqCounts,"");
split("",seqs,"");
split("",uniq,"");
split("",Bqual,"");
split("",Squal,"");
split($4,a,";");
for (i=1; i<=length(a); i++) {
  split(a[i],b,"|");
  split(b[4],c,":");
  split(b[9],B,":")
  split(b[10],S,":")
  for (j=2; j<8; j++)
    counts[j-1]=counts[j-1]+c[j];
  counts[7]=counts[7]+b[5]
  seqCounts[i]=b[5];
  seqs[i]=b[7];
  uniq[i]=b[8];
  Bqual[i]=B[2];
  Squal[i]=S[2];
}
printf $1"\t"$2"\t"$3"\t";
for (i=1; i<7; i++)
  printf counts[i]":"
printf "\t"counts[7]"\t"
for (i in seqCounts)
  printf seqCounts[i]"|";
printf "\t";
for (i in uniq)
  printf uniq[i]"|";
printf "\t";
for (i in Bqual)
  printf Bqual[i]"|";
printf "\t";
for (i in Squal)
  printf Squal[i]"|";
printf "\t";
for (i in seqs)
  printf seqs[i]"|";
printf "\t"$5"\n"
}' both_Strains_crammed.win > both_Strains_decrammed.bed

Intersect the result with our various annotation sources:

intersectBed -a both_Strains_decrammed.bed -b NCRNA.NonRatRefSeqRN5.bed -wa -wb > bothHits/bothHit_NonRatRefSeq.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RefSeqRN5.bed -wa -wb > bothHits/bothHit_RefSeq.bed
intersectBed -a both_Strains_decrammed.bed -b piRNA.bed -wa -wb > bothHits/bothHit_piRNA.bed
intersectBed -a both_Strains_decrammed.bed -b mirDeepPredicted.bed -wa -wb > bothHits/bothHit_mirDeep.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RepeatMaskRNA.bed -wa -wb > bothHits/bothHit_RepeatMaskRNA.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RepeatMaskMisc.bed -wa -wb > bothHits/bothHit_RepeatMaskMisc.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.Ensembl.bed -wa -wb > bothHits/bothHit_Ensembl.bed

Small RNA Alignment, Filtering, and Annotation

Trimming and Alignment Process

Post-alignment Processing

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally