Skip to content

Rn6 Executing the Above Steps

Spencer Mahaffey edited this page Sep 18, 2015 · 1 revision
  1. Run blat
    ./blat -stepSize=5 -minScore=20 -minIdentity=1 -repMatch=2253 /data2/rn6/index/rn6.cleaned.fa RaEx-1_0-st-v1.probe.fa blatOutput.psl
    Note: It may seem like you can increase the minScore and minIdentity but this can cause blat to miss perfect matches. So leave it alone and filter the .psl file with the R script or some other method.

  2. Identify probes that hit the genome once and only once (findPerfectMatches_Exon.R - Exon array or findPerfectMatches_3Prime.R)

    • Modified by only doing the probe location output as a bed file. Skip combining SNPS to allow for different combinations of SNPs to be used for masking.
  3. Create BED files for SNPs/Indels for each strain.

  4. Combine SNPs and Indels into 1 VCF file.
    1. Concatenate:
    /usr/local/vcftools/bin/vcf-concat rn6.SNP.BNLx.SHR.sorted.filtered.vcf rn6.INDEL.BNLx.SHR.sorted.filtered.vcf > rn6.Snps.Indels.concat.filtered.vcf

1.  Sort:  
   `/usr/local/vcftools/bin/vcf-sort -c rn6.Snps.Indels.concat.filtered.vcf > rn6.Snps.Indels.concat.filtered.sorted.vcf`  
  1. run perl script to create BED file for each strain.
    perl extractStrainSNPs.pl /data2/rn6/snps/rn6.Snps.Indels.concat.filtered.sorted.vcf /data2/rn6/snps/BNLX.snpsIndels.bed BNLX CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT
`perl extractStrainSNPs.pl /data2/rn6/snps/rn6.Snps.Indels.concat.filtered.sorted.vcf /data2/rn6/snps/SHR.snpsIndels.bed SHR CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT`  
  1. If the output is not recognizing the chromosome column correctly and chromosomes don't start with chr you must add it.
    sed -i -e 's/^/chr/' BNLX.snpsIndels.bed

  2. Find intersection of Probe bed and SNP bed for each strain.

    • Use BEDtools to find overlap between SNPs and probes:
      intersectBed -a /data2/rn6/array/RaEx/Aligned/RaEx1_0st_probe.default.perfectMatches.wStrand.txt -b /data2/rn6/snps/BNLx.snpsIndels.bed -c > /data2/rn6/array/RaEx/SNPs_Probes/BNLx_snpsInProbes.txt

    intersectBed -a /data2/rn6/array/RaEx/Aligned/RaEx1_0st_probe.default.perfectMatches.wStrand.txt -b /data2/rn6/snps/SHR.snpsIndels.bed -c > /data2/rn6/array/RaEx/SNPs_Probes/SHR_snpsInProbes.txt

  3. Combine SNPs and Probes and perform filtering and output Masks (Exon arrays)
    perl CreateMaskExon.pl /data2/rn6/array/RaEx RaEx-1_0-st-v1.r2.pgf BNLx,SHR RaEx-1_0-st-v1.r2.dt1.rn4. RaEx-1_0-st-v1.r2.HXB.MASKED.perl.pgf rn6 MASKED.perl RaEx-1_0-st-v1.na35.1.rn5.probeset.csv

    • Arguments are:
      • Path to folder for the array type(Ex. RaEx)
      • PGF file name
      • Comma separated list of strain snps to include(just strain name prefix of file Strain_snpsInProbes.txt) Strain,Strain2,Strain3
      • Prefix including last . up to core, extended etc for the source files.(Downloaded from affy)
      • Output pgf file name
      • version of genome ex rn5,rn6
      • label to append to all files output(this can be anything avoid spaces)
      • The csv file linking Transcript clusters and probe sets(Downloaded from affy)

Output from creating mask:

reading:/data2/rn6/array/RaEx/Source/RaEx-1_0-st-v1.na35.1.rn5.probeset.csv
TC Size:380999
CSV PS Size:1064901
read in csv
try to open: RaEx-1_0-st-v1.r2.pgf
Processing PGF.....................
Initial 4104558 Total Probes
Initial 1064901 Total Probe Sets
2750 snp masked of 3593359 in BNLx
Masked probes:2750
Non-Masked probes:4101808
135787 snp masked of 3593359 in SHR
Masked probes:138537
Non-Masked probes:3966021
Unique location list size: 3454753
Removing PS/NonUnique........................................................................................................
464137 Non-Unique removed
Remove Probe sets <3 probes:
146988 removed Probes
194381 removed Probe sets
38 removed Chr Probe sets
406 removed Loc Probe sets
749662 Total masked probes
194381 Total masked probe sets
3966021 Total Probes
870521 Total Probe Sets

Output Masking Transcript Clusters:

TC size:380999
Masking TC 144 removed for length >300kbp
21504 removed for location problems chr/strand
6 removed for lack of probesets after QC
55840 removed all probesets were masked
TC size:303505
194381 removed masked Probesets
5384 removed for Chr/Strand
2785 removed for outlier location
465 core masked out of 8260
917 extended masked out of 18885
20063 full masked out of 186438
output tc all:303505

Clone this wiki locally