-
Notifications
You must be signed in to change notification settings - Fork 0
Small RNA Alignment, Filtering, and Annotation
The program cutadapt trims adapters from both ends of RNA-seq reads.
for i in *.fastq
do
cutadapt -q 5 -b TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACGACGACATCTCGTATGC -O 1 -m 16 -o $i.ft $i
doneI concatenated the fastq.ft files for each sample and ran something like this to get counts:
awk '{getline seq; getline plus; getline score; print seq}' Samplename.fastq.ft|sort -k1d,1 | uniq -c | awk '{print ">SampleID:"NR":"$1; print $2}' > Samplename.uniq.fa>H1_GACGAC:1:1
AAAAAAAAAAAAAAAA
>H1_GACGAC:2:4
AAAAAAAAAAAAAAAAAFasta files for the individual samples are concatenated to look like this using the fastaConcatenator.py script:
>H1_GACGAC:9:3>H2_TAATCG:7:2
AAAAAAAACATTAGACTawk '{getline seq; print "@"$0; print seq; print "+"; for (i=1; i<=length(seq); i++) printf "D"; printf "\n" }' All_Samples.fasta > All_Samples.fastq Then I aligned the sequences with bowtie like this:
nohup bowtie2 --phred33 -p 20 -k 1000 --end-to-end --very-sensitive -x /data/helicos/reference/rn5/noRandom/rn5_No_Random -U All_Samples.concat.fq -S All_Samples.concat.samFor this alignment, I used 20 threads and told the aligner to show up to 1,000 alignments for each read. See the bowtie2 manual for details on the settings.
For the alignments we actually use, I went through the above procedure separately for reads from the BNLX and SHRH strains, and aligned them to their respective strain-specific genomes.
Across all 6 samples, the small-RNA sequencing experiment yielded 96,330,384 trimmed reads prior to alignment.
| Strain | Trimmed Reads | Aligned Reads | Perfect Aligned | Perfect Uniquely Aligned |
|---|---|---|---|---|
| BNLX | 59,677,100 | 55,554,442 | 41,801,988 | 16,650,698 |
| SHRH | 36,653,284 | 34,521,891 | 27,093,237 | 10,941,673 |
| Total | 96,330,384 | 90,076,333 | 68,895,225 | 27,592,371 |
Again, bowtie2 settings allowed up to 1000 alignments per read, and reports the first 1000 alignments for reads that aligned to more than 1000 locations (rather than discarding them, as I previously reported). A piece of the .sam file output:
>Lx3_CTCAGA:288764:1 256 chr13 95723976 0 48M * 0 0 ATTTGTGGTTAAGAGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGG * AS:i:-25 XS:i:-15 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:2G0G6C2T29T4 YT:Z:UU
>Lx1_CTAGCT:409415:1>Lx2_CTATAC:1943563:1 16 chr7 32974132 0 22M * 0 0 CACATAGGAATAAAAAGCCTTA DDDDDDDDDDDDDDDDDDDDDD AS:i:-5 XS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19A2 YT:Z:UU
>Lx1_CTAGCT:409415:1>Lx2_CTATAC:1943563:1 256 chr8 114212759 0 22M * 0 0 TAAGGCTTTTTATTCCTATGTG * AS:i:-5 XS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:2T19 YT:Z:UU
>Lx2_CTATAC:1029272:1 16 chr16 8150525 0 18M * 0 0 TCCTGAGAAGATTGACTG DDDDDDDDDDDDDDDDDD AS:i:-5 XS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:14C3 YT:Z:UUHere is a bit of the bed file (both_sorted.bed) of all aligned sequences after the samParser, sorted, with chromosome and location, a vector of counts by sample, the alignment score, strand, sequence, and the number of genomic locations to which the sequence aligned. The format of column 4 is leftover from a previous version of the parser.
chr1 26476 26506 :0:0:0:3:30:0:0 33 - ATGTCCAATTTTCTGAGGAACCTCCAGACT 1000
chr1 28395 28415 :3:4:0:0:1:6:0 14 + CAGTGGTAGAGGGCTTGCCT 171
chr1 31907 31942 :4:3:0:3:4:4:0 18 - GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC 2000In previous versions of the samParser, I included a column for the number of mismatches between the read and the reference, as extracted from the sam file. Since the reads are aligned to two different genomes, it is possible that the number of mismatches differs between the two strains, so what I am going to do is use the bedtools program fastaFromBed to get the reference sequence for the location from each of the genomes, then check it against the read sequence. Note that fastaFromBed grabs the stranded sequence if you ask it to so no need to reverse complement the read if it aligned to the - strand.
First we will filter by number of reads, and by the weighting from the samples. If there are fewer than 10 reads, or if one sample contributes the majority of the reads, that sequence is removed. You can pipe in and out of the various bedtools programs, but I broke things up to make it easier to see what is going on.
awk '{ind=0; split($4,a,":"); for (i=2; i<length(a); i++) {if (a[i]>(.5*$5)) {ind=1; break;}}
if ((ind==0) && ($5>9)) print }' both_sorted.bed |
awk '{print $1"\t"$2"\t"$3"\t"$1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"\t"$5"\t"$6}'|
fastaFromBed -s -name -tab -fi /data/helicos/reference/rn5/BNLXGenome/BNLX_rn5_Genome.fa -bed stdin -fo both_sorted_BNLX_ref.bed[flinks@phenogen realign]$ head both_sorted_BNLX_ref.bed
chr1|23311|23327|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293 tgtttaggtattggcc
chr1|24861|24878|:2:5:0:2:9:7:0|25|-|ATGTTTGGGTATTGGCC|231 atgtttaggtattggcc
chr1|24861|24877|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293 tgtttaggtattggcc
chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171 cagtggtagagggcttgcct
chr1|31907|31942|:4:3:0:3:4:4:0|18|-|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|2000 ggttggggatttagctcagtggtagagcgcttgcc
chr1|31908|31925|:3:2:0:0:4:1:0|10|-|AGTGGTAGAGCGCATGC|1837 agtggtagagcgcttgc
chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970 atttagctcagtggtagagcgcttgcct
chr1|42648|42664|:5:1:0:3:4:3:0|16|+|GAGCAATGATGACCCC|56 gagcaatgctgacccc
chr1|44982|45002|:1:16:0:1:7:20:0|45|-|CAGTGGTAGAGCGCTTGCCG|1847 cagtggtagagcgcttgcct
chr1|44998|45021|:4:1:0:1:2:5:0|13|-|AGGGGCTGGGGATTTAGCTCAGC|1680 Aggggctggggatttagctcagtthence:
awk '{s=0; refseq=toupper($2); split($1,a,"|"); if (a[7]==refseq) s=1; print a[1]"\t"a[2]"\t"a[3]"\t"a[1]"|"a[2]"|"a[3]"|"a[4]"|"a[5]"|"a[6]"|"a[7]"|"a[8]"|"refseq"|BNLX:"s"\t"0"\t"a[6]}' both_sorted_BNLX_ref.bed
| fastaFromBed -s -name -tab -fi /data/helicos/reference/rn5/SHRHGenome/SHRH_rn5_Genome.fa -bed stdin -fo both_sorted_ref.bed
[[flinks@phenogen realign]$ head both_sorted_ref.bed
chr1|23311|23327|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293|TGTTTAGGTATTGGCC|BNLX:0 tgtttaggtattggcc
chr1|24861|24878|:2:5:0:2:9:7:0|25|-|ATGTTTGGGTATTGGCC|231|ATGTTTAGGTATTGGCC|BNLX:0 atgtttaggtattggcc
chr1|24861|24877|:4:3:0:1:5:7:0|20|-|TGTTTGGGTATTGGCC|293|TGTTTAGGTATTGGCC|BNLX:0 tgtttaggtattggcc
chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171|CAGTGGTAGAGGGCTTGCCT|BNLX:1 cagtggtagagggcttgcct
chr1|31907|31942|:4:3:0:3:4:4:0|18|-|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|2000|GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC|BNLX:1 ggttggggatttagctcagtggtagagcgcttgcc
chr1|31908|31925|:3:2:0:0:4:1:0|10|-|AGTGGTAGAGCGCATGC|1837|AGTGGTAGAGCGCTTGC|BNLX:0 agtggtagagcgcttgc
chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|BNLX:1 atttagctcagtggtagagcgcttgcct
chr1|42648|42664|:5:1:0:3:4:3:0|16|+|GAGCAATGATGACCCC|56|GAGCAATGCTGACCCC|BNLX:0 gagcaatgctgacccc
chr1|44982|45002|:1:16:0:1:7:20:0|45|-|CAGTGGTAGAGCGCTTGCCG|1847|CAGTGGTAGAGCGCTTGCCT|BNLX:0 cagtggtagagcgcttgcct
chr1|44998|45021|:4:1:0:1:2:5:0|13|-|AGGGGCTGGGGATTTAGCTCAGC|1680|AGGGGCTGGGGATTTAGCTCAGT|BNLX:0 Aggggctggggatttagctcagtand hence:
awk '{s=0; refseq=toupper($2); split($1,a,"|"); if (a[7]==refseq) s=1; print a[1]"\t"a[2]"\t"a[3]"\t"a[4]"\t"a[5]"\t"a[6]"\t"a[7]"\t"a[8]"\t"a[9]"\t"a[10]"\t"refseq"\tSHRH:"s}' both_sorted_ref.bed
> both_sorted_fullRef.bed
head both_sorted_fullRef.bed
chr1 23311 23327 :4:3:0:1:5:7:0 20 - TGTTTGGGTATTGGCC 293 TGTTTAGGTATTGGCC BNLX:0 TGTTTAGGTATTGGCC SHRH:0
chr1 24861 24878 :2:5:0:2:9:7:0 25 - ATGTTTGGGTATTGGCC 231 ATGTTTAGGTATTGGCC BNLX:0 ATGTTTAGGTATTGGCC SHRH:0
chr1 24861 24877 :4:3:0:1:5:7:0 20 - TGTTTGGGTATTGGCC 293 TGTTTAGGTATTGGCC BNLX:0 TGTTTAGGTATTGGCC SHRH:0
chr1 28395 28415 :3:4:0:0:1:6:0 14 + CAGTGGTAGAGGGCTTGCCT 171 CAGTGGTAGAGGGCTTGCCT BNLX:1 CAGTGGTAGAGGGCTTGCCT SHRH:1Recall that 1=no mismatches. Now get rid of the pesky extra sequences and keep only those that matched at least one of the genomes perfectly (although how do you interpret those that match only one?) (script not shown).
head both_Strains_filtered_reads.bed
chr1 28395 28415 :3:4:0:0:1:6:0 14 + CAGTGGTAGAGGGCTTGCCT 171 BNLX:1 SHRH:1
chr1 31907 31942 :4:3:0:3:4:4:0 18 - GGTTGGGGATTTAGCTCAGTGGTAGAGCGCTTGCC 2000 BNLX:1 SHRH:1
chr1 32148 32176 :100:87:0:4:67:252:0 510 + ATTTAGCTCAGTGGTAGAGCGCTTGCCT 1970 BNLX:1 SHRH:1
chr1 62854 62883 :125:123:0:11:76:220:0 555 - GATTTAGCTCAGTGGTAGAGCGCTTGCCT 2000 BNLX:1 SHRH:1Here's the big trick: the bedtools program mergeBed merges overlapping genomic intervals into a single interval, and if you tell it to, it will report all of the names of the intervals that were merged into the single interval. Our plan then is to pack all of the information we want to preserve into the name (4th column) of the bed file above. Once we have these features, we want to sort them
awk '{split($10,B,":"); split($12,S,":"); if (B[2]+S[2] > 0) print $1"\t"$2"\t"$3"\t"$1"|"$2"|"$3"|"$4"|"$5"|"$6"|"$7"|"$8"|"$10"|"$12"\t"0"\t"$6}' both_Strains_filtered_reads.bed > both_Strains_precrammed.bedNow we will use mergeBed in the bedtools package to concatenate overlapping reads and retain their information, via
mergeBed -s -nms -i both_Strains_precrammed.bed > both_Strains_crammed.winwhich gives us this kind of thing:
chr1 28395 28415 chr1|28395|28415|:3:4:0:0:1:6:0|14|+|CAGTGGTAGAGGGCTTGCCT|171|SHRH:1| +
chr1 32148 32176 chr1|32148|32176|:100:87:0:4:67:252:0|510|+|ATTTAGCTCAGTGGTAGAGCGCTTGCCT|1970|SHRH:1| +
chr1 159118 159151 chr1|159118|159151|:11:9:0:1:7:14:0|42|+|TGGGGATTTAGCTCAGTGGTAGAGCGCTTGCCT|2000|SHRH:1|;chr1|159129|159151|:25:33:0:2:24:68:0|152|+|CTCAGTGGTAGAGCGCTTGCCT|2000|SHRH:1|;chr1|159135|159151|:5:5:0:2:7:12:0|31|+|GGTAGAGCGCTTGCCT|1967|SHRH:1| For each row, we split the fourth column on ";", and amalgamate the different characteristics of that feature (interval of concatenated read locations): chromosome, start, end, counts by sample, an indicator that tells if a single sample contributes more than 40% of the reads, the total read count for the feature, counts for the different reads, number of mapped locations for each read, mapping quality for each read by strain, . Here is the script:
awk '{
split("0,0,0,0,0,0,0",counts,",");
split("",seqCounts,"");
split("",seqs,"");
split("",uniq,"");
split("",Bqual,"");
split("",Squal,"");
split($4,a,";");
for (i=1; i<=length(a); i++) {
split(a[i],b,"|");
split(b[4],c,":");
split(b[9],B,":")
split(b[10],S,":")
for (j=2; j<8; j++)
counts[j-1]=counts[j-1]+c[j];
counts[7]=counts[7]+b[5]
seqCounts[i]=b[5];
seqs[i]=b[7];
uniq[i]=b[8];
Bqual[i]=B[2];
Squal[i]=S[2];
}
printf $1"\t"$2"\t"$3"\t";
for (i=1; i<7; i++)
printf counts[i]":"
printf "\t"counts[7]"\t"
for (i in seqCounts)
printf seqCounts[i]"|";
printf "\t";
for (i in uniq)
printf uniq[i]"|";
printf "\t";
for (i in Bqual)
printf Bqual[i]"|";
printf "\t";
for (i in Squal)
printf Squal[i]"|";
printf "\t";
for (i in seqs)
printf seqs[i]"|";
printf "\t"$5"\n"
}' both_Strains_crammed.win > both_Strains_decrammed.bedintersectBed -a both_Strains_decrammed.bed -b NCRNA.NonRatRefSeqRN5.bed -wa -wb > bothHits/bothHit_NonRatRefSeq.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RefSeqRN5.bed -wa -wb > bothHits/bothHit_RefSeq.bed
intersectBed -a both_Strains_decrammed.bed -b piRNA.bed -wa -wb > bothHits/bothHit_piRNA.bed
intersectBed -a both_Strains_decrammed.bed -b mirDeepPredicted.bed -wa -wb > bothHits/bothHit_mirDeep.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RepeatMaskRNA.bed -wa -wb > bothHits/bothHit_RepeatMaskRNA.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.RepeatMaskMisc.bed -wa -wb > bothHits/bothHit_RepeatMaskMisc.bed
intersectBed -a both_Strains_decrammed.bed -b NCRNA.Ensembl.bed -wa -wb > bothHits/bothHit_Ensembl.bed