Skip to content

Latest commit

 

History

History
733 lines (619 loc) · 21.8 KB

File metadata and controls

733 lines (619 loc) · 21.8 KB

Mapping traits in biparental populations with genotyping-by-sequencing data

GBStutorial flowchart

1. Open a UNIX shell

  1. Connect with PuTTY to the server slurm-01.ipk-gatersleben.de.

  2. Obtain a Kerberos ticket to access our filer resources.

    kinit
    klist (1)
    1. A ticket is valid for seven days.

  3. You will see the prompt of the UNIX command-line interpreter, the shell. It is called "shell" because it executes commands that do not belong to the operating system, the "kernel".

  4. Check out your home directory.

    pwd
    ls

2. Set up your analysis directory

  1. Let’s have a look at the GBS data.

    cd /filer/projekte/gbs_course
    cd data
    ls
    ls -lhS gbs_reads | less (1)
    man ls
    cd gbs_reads
    ls | wc -l
    zcat Morex.fastq.gz | less -S // (2)
    cd ..
    du -ch gbs_reads
    1. To close less, press q.

    2. The reads are in FASTQ format.

  2. Take a look at the phenotypes. There are two. One is binary: lateral spikelet fertility (commonly referred to in barley as row type). The other is quantitative (plant height).

    less MxB_92_RILs_phenotypes.csv
  3. Create your working directory and change into it.

    cd /filer/projekte/gbs_course
    mkdir -p mascher/gbs // (1)
    cd mascher/gbs
    1. Change mascher to your name.

  4. Create symbolic links to the GBS reads and the phenotype data.

    pwd // (1)
    
    mkdir gbs_reads
    datadir="/filer/projekte/gbs_course/data/"
    ln -s $datadir/gbs_reads/*fastq.gz gbs_reads
    ls -l gbs_reads | less -S
    ls gbs_reads | wc -l
    
    ln -s $datadir/MxB_92_RILs_phenotypes.csv
    1. Check that you are in your working directory (/filer/projekte/gbs_course/USER/gbs).

  5. Count the reads. We use a for-loop to process all samples at once.

    zcat gbs_reads/Morex.fastq.gz | wc -l > Morex.len
    
    for i in gbs_reads/*fastq.gz; do
     name=`basename $i | cut -d . -f 1`
     count=`zcat $i | wc -l`
     echo $name $count
    done > fastq_line_counts.txt
    
    cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt // (1)
    1. We use AWK to divide the second column by 4.

3. Read mapping

  1. We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.

    module load bwa samtools
    module list
  2. Load the BWA index into shared memory. This step can be skipped, but pre-loadeding the index makes mapping a lot faster.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' // (1)
    bwa shm $ref
    bwa shm -l
    1. We use the MorexV3 reference genome sequence.

  3. We will first run a toy example with 10,000 reads of the Morex sample.

    zcat gbs_reads/Morex.fastq.gz | head -n 40000 > Morex_10k.fastq
    bwa mem $ref Morex_10k.fastq > Morex_10k.sam (1)
    samtools sort Morex_10k.sam > Morex_10k.bam (1)
    samtools view Morex_10k.bam | less -S
    1. Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent BAM.

  4. Combine mapping and sorting into one command.

    bwa mem $ref Morex_10k.fastq | samtools sort > Morex_10k.bam
  5. Open a new PuTTY session. Start tmux to keep your commands running in the background.

    k5reauth tmux (1)
    1. k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.

  6. Try out tmux functionality.

    tmux ls
    tmux rename mapping
    tmux ls
    tmux detach
    tmux ls
    tmux attach -t mapping
  7. Map all samples.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    for i in gbs_reads/*fastq.gz; do
     name=`echo $i | cut -d . -f 1` // (1)
     bwa mem -t 4 $ref $i | samtools sort > $name.bam
    done 2> bwa.err (2)
    1. Strip the extension: Morex.fastq.gz become Morex.

    2. To detach the tmux session, press Ctrl-B followed by D.

    If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shutdown your laptop. Also without k5reauth programs cannot access filer resource after a maximum of ten hours.
  8. Open a new terminal. Look at your jobs in the table of processes (top).

    find gbs_reads | grep -c 'bam$' (1)
    top -u mascher // (2)
    1. Show the number of BAM files created so far.

    2. Replace mascher with your username.

  9. When the mapping is done, calculate the mapping rates for all samples.

    for i in gbs_reads/*bam; do
     name=`basename $i | cut -d . -f 1`
     count=`samtools view -cq 20 $i`
     echo $name $count
    done > mapped_reads.txt
  10. Combine the raw read counts and the mapping rates into one table.

    LC_ALL=C sort fastq_read_counts.txt > tmp1 // (1)
    LC_ALL=C sort mapped_reads.txt > tmp2 // (1)
    
    join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt // (2)
    
    rm -f tmp1 tmp2
    
    column -t mapping_stats.txt  | less -S
    1. To combine two lists with join, both lists need to be sorted on the common ID column.

    2. column is used to align columns.

  11. Get a list of all BAM files.

    ls gbs_reads/*bam | sort > bam_list.txt

4. Variant calling and filtration

  1. Open a new tmux session and load bcftools.

    tmux // (1)
    tmux rename variant_call
    module load bcftools
    1. The variant calling will run for some time, so run it inside tmux.

  2. Run the variant calling.

    ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
    #bamlist='bam_list.txt'
    bamlist='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bam_list_Martin.txt' (1)
    
    vcf='bcftools_SNP_calling.vcf' // (2)
    
    bcftools mpileup --skip-indels --fasta-ref $ref --bam-list $bamlist --min-MQ 20 --annotate AD \#(3)
     | bcftools view -i 'INFO/DP > 100' \#(4)(5)
     | bcftools call --consensus-caller --variants-only --output $vcf
    1. List of pre-computed BAM files.

    2. Output file in variant call format (VCF). Here are the specifications of the VCF format.

    3. We ignore insertions and deletions (--skip-indels), consider only SNPs with a quality score no smaller than 20 (--min-MQ 20) and add allelic depth information (--annotate AD) for all genotype calls.

    4. Remove sites with fewer than 100 reads across all samples. We are only interested in sites that have at least two supporting reads in nine tenths of the samples.

    5. The backslash \ character is used to split long commands across multiple lines. When pasting the commands or editing them, make sure that no white space follows the backslash. Otherwise, the shell will interpret the lines as belonging to different commands. Also multi-line commands do not tolerate intervening command line (starting the hash sign #).

  3. Filter the variant calls.

    filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh'
    #vcf='bcftools_SNP_calling.vcf'
    vcf='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bcftools_SNP_calling.vcf' (1)
    fvcf='bcftools_SNP_calling_filtered.vcf'
    
    $filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf // (2)
    1. Path to pre-computed VCF file.

    2. We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 % or less than 90 % present calls or less than 90 % homozyous calls are discarded.

  4. Change the column names of the VCF files to match the row names in the phenotype table.

    less MxB_92_RILs_phenotypes.csv
    bcftools query -l bcftools_SNP_calling_filtered.vcf | less -S
    bcftools query -l $fvcf | cut -d / -f 2 | cut -d . -f 1 > new_sample_names.txt
    bcftools reheader -s new_sample_names.txt $fvcf > bcftools_SNP_calling_filtered_newNames.vcf
  5. Review the VCF file.

    grep -v '^##' bcftools_SNP_calling_filtered_newNames.vcf | column -t | less -S

5. Import the genetic variant matrix into R

  1. Start R.

    module load R/4.1.1
    R
  2. R is a widely used programming language in data science. There are very many tutorials, e.g. this one.

  3. Load the required libraries.

    .libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1)
    
    library("qtl") // (2)
    library("ASMap") // (3)
    library("utl") // (4)
    1. Set the path where the R libraries are located.

    2. R/qtl is package for QTL mapping. Several tutorials are available here.

    3. R/ASMap is package for linkage map constuction. It implements the MSTMAP algorithm. A detailed tutorial is available here.

    4. utl provides utility functions for R/qtl, one of which we use to convert VCF to R/qtl format.

  4. Convert the VCF to R/qtl format. Example files are found here.

    vcf <- 'bcftools_SNP_calling_filtered_newNames.vcf'
    ids <- read.table("new_sample_names.txt", head=F)[, 1]
    genfile <- 'bcftools_SNP_calling_geno.csv'
    founders <- c("Morex", "Barke")
    samples <- setdiff(ids, founders)
    
    convert_vcf_to_genfile(vcf, genfile, samples, founders) // (1)
    1. This function writes a text file in R/qtl’s "csvs" format to disk. The output filename is genfile.

  5. The conversion function does not take genomic coordinates into account, so markers are ordered correctly, but equidistant. We add a line to the CSV file to correct this.

    cmd <- "sed -Ei '1{p; s/id|chr.H://g; h; d}; 2G'" // (1)
    paste(cmd, genfile) |> system() // (2)(3)
    1. The sed command retrieves the genomic coordinates from the first line and inserts them as a new line after line 2. This could also be achieved with a text editor. The command, at the cost of arcanity, forgoes error-prone manual editing.

    2. The sed command is called from inside R and modifies the file in place.

    3. Alternatively, you write this command as system(paste(cmd, genfile)). Before R version 4.0 (or so), this was the only way to do it. The pipe-like syntax is a recent addition to R.

  6. Read the genotype and phenotype data into an R/qtl cross object.

    pheno <- 'MxB_92_RILs_phenotypes.csv'
    read.cross(format="csvs", genfile=genfile, phefile=pheno, crosstype= "f2",  genotypes=c("AA","AB","BB")) -> mxb
    convert2bcsft(mxb, BC.gen=0, F.gen=8,  estimate.map=F) -> mxb // (1)
    summary(mxb)
    summary.map(mxb)
    1. A population of recombinant inbred lines is read in as "f2". convert2bcsft() is used to set the correct number of selfing generations. See the documentation of that function. We disable the estimate.map option because we will rely on R/ASMap to construct a genetic linkage map.

  7. R/qtl offers several functions to extract basic information from "cross" objects.

    nind(mxb) # number of individuals
    nchr(mxb) # number of chromosomes (actually linkage groups)
    totmar(mxb) # total number of markers
    nmar(mxb) # number of markers per chromosome
    nphe(mxb) # number of phenotypes
  8. Plot a summary of the phenotypes. Copy the PDF file with WinSCP and take a look at it with the Acrobat Reader.

    pdf("plot_pheno.pdf") (1)
    plotPheno(mxb, pheno.col=1, xlab="phenotypes")
    plotPheno(mxb, pheno.col=2)
    dev.off() (2)
    1. Open a PDF file to plot to. The file is created in the current working directory.

    2. Close the file (switch off the plotting device). Don’t forget to call dev.off(). Otherwise the PDF file will be empty or invalid.

  9. Change the scale of the linkage map from base pairs to megabase pairs and plot the distribution of markers along chromosomes.

    rescalemap(mxb, 1/1e6) -> mxb
    summary.map(mxb)
    
    pdf("plot_map.pdf")
    plot.map(mxb, main="Physical map", ylab="Location (Mb)")
    dev.off()
  10. Create a copy of the cross object with the physical map for later comparison.

    copy(mxb) -> mxb_physical
    saveRDS(mxb_physical, "mxb_physical.Rds") // (1)
    readRDS("mxb_physical.Rds") -> mxb_physical
    1. Any R object can be saved to and read from disk, respectively, with saveRDS() and readRDS().

6. Quality control and linkage map construction

  1. Karl Broman pointed out that a reference genome sequence has obviated the need for linkage map construction in mouse; the same applies to barley. Although marker order is known, constructing a linkage map from scratch is a means of quality control. If there are issues with the data, the genetic map will be off the mark.

  2. A map off the cuff is not too bad. The only worry is that it’s about a fifth longer than expected.

    summary.map(mstmap(mxb, id="id"))
  3. Let’s see if we can do better. The usual suscepts are bad markers and odd individuals.

  4. Remove duplicated markers. If there are groups of markers that differ only in their patterns of missing data, keep only one representative.

    findDupMarkers(mxb, exact.only=FALSE) -> dups // (1)
    unlist(dups) |> length()
    
    mxb <- drop.markers(mxb, unlist(dups))
    summary(mxb)
    1. exact.only=FALSE ignores differences in missingness.

  5. Remove duplicated individuals.

    cg <- comparegeno(mxb)
    
    pdf("compare_geno.pdf")
    hist(cg[lower.tri(cg)], breaks=seq(0, 1, len=101), xlab="No. matching genotypes")
    dev.off()
    
    summary(cg, thresh=0.9) -> dups
    subset(mxb, ind=setdiff(samples, dups$ind2)) -> mxb
  6. Have a look at the graphical genotypes.

    pdf("plot.geno.pdf")
    geno.image(mxb, col=c("white", "red", "violet", "blue")) // (1)
    dev.off()
    1. The default colors are red and green, which puts color blind people at a disadvantage. In this case, the genotypes AA, BB, AB are displayed in the colors red, blue, and violet, respectively. Missing data will be displayed in white.

  7. We remove the odd individuals with lots of missing calls.

    ntyped(mxb) |> sort() |> head(n=1) -> rm.ind
    
    subset(mxb, ind=setdiff(names(ntyped(mxb)), names(rm.ind))) -> mxb // (1)
    summary(mxb)
    1. ntyped(mxb) |> names() outputs a list of all individuals.

  8. Thin the set of markers.

    set.seed(1) // (1)
    lapply(pull.map(mxb), function(i) pickMarkerSubset(i, 1)) |> unlist() -> keep
    
    drop.markers(mxb, setdiff(markernames(mxb), keep))  -> mxb
    1. Markers are picked at random to have one per megabase. To ensure you always get the results, set the seed for random number generator.

  9. Look at the graphical genotypes again.

    pdf("plot.geno_1Mb.pdf")
    geno.image(mxb, col=c("white", "red", "violet", "blue"))
    dev.off()
  10. Construct the map again.

    mstmap.cross(mxb, id="id") -> mxb
    summary.map(mxb)
  11. Align the genetic to the physical map.

    pdf("align_maps.pdf")
    alignCross(mxb, maps=list(mxb_physical), layout=c(1,1))
    dev.off()
  12. Compute the rank correlation.

    pull.map(mxb, as.table=T) -> a
    pull.map(mxb_physical, as.table=T) -> b
    merge(a, b, by=0) -> m // (1)
    sapply(split(m, m$chr.x), function(i) with(i, cor(pos.x, pos.y, method='s'))) -> cc // (2)
    1. by=0 means: merge by row names.

    2. Use method = 'p' to compute the Pearson (linear) correlation.

  13. Flip those linkage groups that are inverted relative to the reference orientation (short arm first).

    names(which(cc < 0)) -> flip.chr
    flip.order(mxb, flip.chr) -> mxb
  14. Plot the updated alignment and graphical genotypes.

    pdf("align_maps_2.pdf")
    alignCross(mxb, maps=list(mxb_physical), layout=c(1,1))
    dev.off()
    
    pdf("plot.geno.mstmap.pdf")
    geno.image(mxb, col=c("white", "red", "violet", "blue"))
    dev.off()
  15. Compare genetic and physical map with connector plots.

    rescalemap(mxb_physical, 1/5) -> mxb_physical // (1)
    
    pdf("plot_map_2.pdf")
    plotMap(mxb, mxb_physical)
    dev.off()
    1. Divide chromosome lengths by 5.

  16. Save the genetic map object.

    mxb -> mxb_genetic
    saveRDS(mxb_genetic, file="mxb_genetic.Rds")

7. Trait mapping

  1. Load the required package and the data.

    .libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1)
    
    library("qtl")
    readRDS("mxb_genetic.Rds") -> mxb
  2. Calculate genotype probabilities conditional on the marker data.

    calc.genoprob(mxb) -> mxb
  3. Run the "QTL" scan for the first phenotypes, row type.

    mxb$pheno$row_type <- ifelse(mxb$pheno$row_type == 2, 0, 1) // (1)
    
    scanone(mxb, pheno.col=1, method="mr", model="binary") -> out // (2)
    summary(out)
    
    pdf("plot_row_type.pdf") // (3)
    plot(out)
    dev.off()
    1. Values for binary traits have to be zero or one.

    2. Perform a single-QTL genome scan using the marker regression (mr) method. scanone() supports several other methods and models. The defaut is "normal" for a quantiative phenotype.

    3. Plot the LOD (logarithm of the odds) scores along the genome.

  4. Run a permutation test to get p-values and a significance threshold.

    operm <- scanone(mxb, method="mr", n.perm=1000, pheno.col=1)
    
    summary(operm, alpha=c(0.05, 0.2))
    
    pdf("plot_operm.pdf") (1)
    hist(as.numeric(operm))
    abline(v=summary(operm, alpha=c(0.05)), col='red')
    dev.off()
    
    summary(out, perms=operm, pvalues=TRUE)
    
    pdf("plot_row_type_threshold.pdf")
    plot(out)
    abline(h=3.5, col='red') // (2)
    dev.off()
    1. Plot a histogram of the maximum genome-wide LOD scores from the 1000 permutations.

    2. Add the significance threshold.

  5. Get interval estimates.

    lodint(out, chr="2H", drop=2)
  6. Plot the phenotypes against the genotypes at the most highly associated marker (genotype x phenotpe, pxg).

    rownames(max(out)) -> mar
    pdf("plot_pxg.pdf")
    plotPXG(mxb, pheno.col=1, marker=mar)
    dev.off()
  7. To close R, type quit(). To close the tmux session, type exit. This will finish the tmux session. If you just want to detach and keep the session running, use tmux detach.

  8. Check by BLAST how close the top marker is to the causal gene (VRS1). The sequence is available from GenBank. GrainGenes offers a web-based BLAST.

8. Mapping the KASP markers

  1. Load the required libraries.

    .libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1")
    library(data.table)
    library(xlsx)
    library("qtl")
  2. Add the KASP results to the phenotypes table.

    fread('MxB_92_RILs_phenotypes.csv')->p
    
    data.table(read.xlsx('/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/prep/KASP_results_PGR23.xlsx', 1))->k
    setnames(k, "Samples", "id")
    k[grepl("F11$", id)][, id := sub("MBRIL-", "07-", sub("_F11$", "", id))][] -> k
    
    k[p, on="id"] -> m
    m[, .(row_type, plant_height, RAW1_Henry, VRS1_Henry, RAW1_ZsaZsa, VRS1_ZsaZsa, RAW1_Tania, VRS1_Tania, id)] -> m
    
    unique(m[, .(row_type, VRS1_Tania)])
  3. Modify the mxb object.

    readRDS('mxb_genetic.Rds')
    
    data.frame(m[id %in% mxb$pheno$id]) -> mxb$pheno
    mxb
    colnames(mxb$pheno)
  4. Plot the new "phenotypes".

    pdf("plot_KASP.pdf")
    plotPheno(mxb, pheno.col=3)
    plotPheno(mxb, pheno.col=4)
    plotPheno(mxb, pheno.col=5)
    plotPheno(mxb, pheno.col=6)
    plotPheno(mxb, pheno.col=7)
    plotPheno(mxb, pheno.col=8)
    dev.off()
  5. Map VRS1_Tania.

    mxb$pheno$VRS1_Tania <- ifelse(mxb$pheno$VRS1_Tania == "C", 0, 1)
    scanone(mxb, pheno.col=8, method="mr", model="binary") -> out
    
    pdf("plot_VRS1_Tania.pdf")
    plot(out)
    dev.off()