Mapping traits in biparental populations with genotyping-by-sequencing data

Table of Contents

1. Open a UNIX shell
2. Set up your analysis directory
3. Read mapping
4. Variant calling and filtration
5. Import the genetic variant matrix into R
6. Quality control and linkage map construction
7. Trait mapping
8. Mapping the KASP markers

1. Open a UNIX shell

Connect with PuTTY to the server slurm-01.ipk-gatersleben.de.
Obtain a Kerberos ticket to access our filer resources.
```
kinit
klist (1)
```
1. A ticket is valid for seven days.
You will see the prompt of the UNIX command-line interpreter, the shell. It is called "shell" because it executes commands that do not belong to the operating system, the "kernel".
Check out your home directory.
```
pwd
ls
```

2. Set up your analysis directory

Let’s have a look at the GBS data.

cd /filer/projekte/gbs_course
cd data
ls
ls -lhS gbs_reads | less (1)
man ls
cd gbs_reads
ls | wc -l
zcat Morex.fastq.gz | less -S // (2)
cd ..
du -ch gbs_reads

To close less, press q.
The reads are in FASTQ format.

Take a look at the phenotypes. There are two. One is binary: lateral spikelet fertility (commonly referred to in barley as row type). The other is quantitative (plant height).
```
less MxB_92_RILs_phenotypes.csv
```

Create your working directory and change into it.

cd /filer/projekte/gbs_course
mkdir -p mascher/gbs // (1)
cd mascher/gbs

Change mascher to your name.

Create symbolic links to the GBS reads and the phenotype data.

pwd // (1)

mkdir gbs_reads
datadir="/filer/projekte/gbs_course/data/"
ln -s $datadir/gbs_reads/*fastq.gz gbs_reads
ls -l gbs_reads | less -S
ls gbs_reads | wc -l

ln -s $datadir/MxB_92_RILs_phenotypes.csv

Check that you are in your working directory (/filer/projekte/gbs_course/USER/gbs).

Count the reads. We use a for-loop to process all samples at once.

zcat gbs_reads/Morex.fastq.gz | wc -l > Morex.len

for i in gbs_reads/*fastq.gz; do
 name=`basename $i | cut -d . -f 1`
 count=`zcat $i | wc -l`
 echo $name $count
done > fastq_line_counts.txt

cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt // (1)

We use AWK to divide the second column by 4.

3. Read mapping

We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.
```
module load bwa samtools
module list
```
Load the BWA index into shared memory. This step can be skipped, but pre-loadeding the index makes mapping a lot faster.
```
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' // (1)
bwa shm $ref
bwa shm -l
```
1. We use the MorexV3 reference genome sequence.

We will first run a toy example with 10,000 reads of the Morex sample.

zcat gbs_reads/Morex.fastq.gz | head -n 40000 > Morex_10k.fastq
bwa mem $ref Morex_10k.fastq > Morex_10k.sam (1)
samtools sort Morex_10k.sam > Morex_10k.bam (1)
samtools view Morex_10k.bam | less -S

Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent BAM.

Combine mapping and sorting into one command.

bwa mem $ref Morex_10k.fastq | samtools sort > Morex_10k.bam

Open a new PuTTY session. Start tmux to keep your commands running in the background.
```
k5reauth tmux (1)
```
1. k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.

Try out tmux functionality.

tmux ls
tmux rename mapping
tmux ls
tmux detach
tmux ls
tmux attach -t mapping

Map all samples.

ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
for i in gbs_reads/*fastq.gz; do
 name=`echo $i | cut -d . -f 1` // (1)
 bwa mem -t 4 $ref $i | samtools sort > $name.bam
done 2> bwa.err (2)

Strip the extension: Morex.fastq.gz become Morex.
To detach the tmux session, press Ctrl-B followed by D.

❗	If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shutdown your laptop. Also without k5reauth programs cannot access filer resource after a maximum of ten hours.

Open a new terminal. Look at your jobs in the table of processes (top).
```
find gbs_reads | grep -c 'bam$' (1)
top -u mascher // (2)
```
1. Show the number of BAM files created so far.
2. Replace mascher with your username.

When the mapping is done, calculate the mapping rates for all samples.

for i in gbs_reads/*bam; do
 name=`basename $i | cut -d . -f 1`
 count=`samtools view -cq 20 $i`
 echo $name $count
done > mapped_reads.txt

Combine the raw read counts and the mapping rates into one table.

LC_ALL=C sort fastq_read_counts.txt > tmp1 // (1)
LC_ALL=C sort mapped_reads.txt > tmp2 // (1)

join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt // (2)

rm -f tmp1 tmp2

column -t mapping_stats.txt  | less -S

To combine two lists with join, both lists need to be sorted on the common ID column.
column is used to align columns.

Get a list of all BAM files.
```
ls gbs_reads/*bam | sort > bam_list.txt
```

4. Variant calling and filtration

Open a new tmux session and load bcftools.
```
tmux // (1)
tmux rename variant_call
module load bcftools
```
1. The variant calling will run for some time, so run it inside tmux.
Run the variant calling.
```
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta'
#bamlist='bam_list.txt'
bamlist='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bam_list_Martin.txt' (1)

vcf='bcftools_SNP_calling.vcf' // (2)

bcftools mpileup --skip-indels --fasta-ref $ref --bam-list $bamlist --min-MQ 20 --annotate AD \#(3)
 | bcftools view -i 'INFO/DP > 100' \#(4)(5)
 | bcftools call --consensus-caller --variants-only --output $vcf
```
1. List of pre-computed BAM files.
2. Output file in variant call format (VCF). Here are the specifications of the VCF format.
3. We ignore insertions and deletions (--skip-indels), consider only SNPs with a quality score no smaller than 20 (--min-MQ 20) and add allelic depth information (--annotate AD) for all genotype calls.
4. Remove sites with fewer than 100 reads across all samples. We are only interested in sites that have at least two supporting reads in nine tenths of the samples.
5. The backslash \ character is used to split long commands across multiple lines. When pasting the commands or editing them, make sure that no white space follows the backslash. Otherwise, the shell will interpret the lines as belonging to different commands. Also multi-line commands do not tolerate intervening command line (starting the hash sign #).
Filter the variant calls.
```
filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh'
#vcf='bcftools_SNP_calling.vcf'
vcf='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bcftools_SNP_calling.vcf' (1)
fvcf='bcftools_SNP_calling_filtered.vcf'

$filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf // (2)
```
1. Path to pre-computed VCF file.
2. We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 % or less than 90 % present calls or less than 90 % homozyous calls are discarded.

Change the column names of the VCF files to match the row names in the phenotype table.

less MxB_92_RILs_phenotypes.csv
bcftools query -l bcftools_SNP_calling_filtered.vcf | less -S
bcftools query -l $fvcf | cut -d / -f 2 | cut -d . -f 1 > new_sample_names.txt
bcftools reheader -s new_sample_names.txt $fvcf > bcftools_SNP_calling_filtered_newNames.vcf

Review the VCF file.

grep -v '^##' bcftools_SNP_calling_filtered_newNames.vcf | column -t | less -S

5. Import the genetic variant matrix into R

Start R.
```
module load R/4.1.1
R
```
R is a widely used programming language in data science. There are very many tutorials, e.g. this one.
Load the required libraries.
```
.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1)

library("qtl") // (2)
library("ASMap") // (3)
library("utl") // (4)
```
1. Set the path where the R libraries are located.
2. R/qtl is package for QTL mapping. Several tutorials are available here.
3. R/ASMap is package for linkage map constuction. It implements the MSTMAP algorithm. A detailed tutorial is available here.
4. utl provides utility functions for R/qtl, one of which we use to convert VCF to R/qtl format.

Convert the VCF to R/qtl format. Example files are found here.

vcf <- 'bcftools_SNP_calling_filtered_newNames.vcf'
ids <- read.table("new_sample_names.txt", head=F)[, 1]
genfile <- 'bcftools_SNP_calling_geno.csv'
founders <- c("Morex", "Barke")
samples <- setdiff(ids, founders)

convert_vcf_to_genfile(vcf, genfile, samples, founders) // (1)

This function writes a text file in R/qtl’s "csvs" format to disk. The output filename is genfile.

The conversion function does not take genomic coordinates into account, so markers are ordered correctly, but equidistant. We add a line to the CSV file to correct this.
```
cmd <- "sed -Ei '1{p; s/id|chr.H://g; h; d}; 2G'" // (1)
paste(cmd, genfile) |> system() // (2)(3)
```
1. The sed command retrieves the genomic coordinates from the first line and inserts them as a new line after line 2. This could also be achieved with a text editor. The command, at the cost of arcanity, forgoes error-prone manual editing.
2. The sed command is called from inside R and modifies the file in place.
3. Alternatively, you write this command as system(paste(cmd, genfile)). Before R version 4.0 (or so), this was the only way to do it. The pipe-like syntax is a recent addition to R.
Read the genotype and phenotype data into an R/qtl cross object.
```
pheno <- 'MxB_92_RILs_phenotypes.csv'
read.cross(format="csvs", genfile=genfile, phefile=pheno, crosstype= "f2",  genotypes=c("AA","AB","BB")) -> mxb
convert2bcsft(mxb, BC.gen=0, F.gen=8,  estimate.map=F) -> mxb // (1)
summary(mxb)
summary.map(mxb)
```
1. A population of recombinant inbred lines is read in as "f2". convert2bcsft() is used to set the correct number of selfing generations. See the documentation of that function. We disable the estimate.map option because we will rely on R/ASMap to construct a genetic linkage map.

R/qtl offers several functions to extract basic information from "cross" objects.

nind(mxb) # number of individuals
nchr(mxb) # number of chromosomes (actually linkage groups)
totmar(mxb) # total number of markers
nmar(mxb) # number of markers per chromosome
nphe(mxb) # number of phenotypes

Plot a summary of the phenotypes. Copy the PDF file with WinSCP and take a look at it with the Acrobat Reader.
```
pdf("plot_pheno.pdf") (1)
plotPheno(mxb, pheno.col=1, xlab="phenotypes")
plotPheno(mxb, pheno.col=2)
dev.off() (2)
```
1. Open a PDF file to plot to. The file is created in the current working directory.
2. Close the file (switch off the plotting device). Don’t forget to call dev.off(). Otherwise the PDF file will be empty or invalid.

Change the scale of the linkage map from base pairs to megabase pairs and plot the distribution of markers along chromosomes.

rescalemap(mxb, 1/1e6) -> mxb
summary.map(mxb)

pdf("plot_map.pdf")
plot.map(mxb, main="Physical map", ylab="Location (Mb)")
dev.off()

Create a copy of the cross object with the physical map for later comparison.
```
copy(mxb) -> mxb_physical
saveRDS(mxb_physical, "mxb_physical.Rds") // (1)
readRDS("mxb_physical.Rds") -> mxb_physical
```
1. Any R object can be saved to and read from disk, respectively, with saveRDS() and readRDS().

6. Quality control and linkage map construction

Karl Broman pointed out that a reference genome sequence has obviated the need for linkage map construction in mouse; the same applies to barley. Although marker order is known, constructing a linkage map from scratch is a means of quality control. If there are issues with the data, the genetic map will be off the mark.
A map off the cuff is not too bad. The only worry is that it’s about a fifth longer than expected.
```
summary.map(mstmap(mxb, id="id"))
```
Let’s see if we can do better. The usual suscepts are bad markers and odd individuals.
Remove duplicated markers. If there are groups of markers that differ only in their patterns of missing data, keep only one representative.
```
findDupMarkers(mxb, exact.only=FALSE) -> dups // (1)
unlist(dups) |> length()

mxb <- drop.markers(mxb, unlist(dups))
summary(mxb)
```
1. exact.only=FALSE ignores differences in missingness.

Remove duplicated individuals.

cg <- comparegeno(mxb)

pdf("compare_geno.pdf")
hist(cg[lower.tri(cg)], breaks=seq(0, 1, len=101), xlab="No. matching genotypes")
dev.off()

summary(cg, thresh=0.9) -> dups
subset(mxb, ind=setdiff(samples, dups$ind2)) -> mxb

Have a look at the graphical genotypes.
```
pdf("plot.geno.pdf")
geno.image(mxb, col=c("white", "red", "violet", "blue")) // (1)
dev.off()
```
1. The default colors are red and green, which puts color blind people at a disadvantage. In this case, the genotypes AA, BB, AB are displayed in the colors red, blue, and violet, respectively. Missing data will be displayed in white.

We remove the odd individuals with lots of missing calls.

ntyped(mxb) |> sort() |> head(n=1) -> rm.ind

subset(mxb, ind=setdiff(names(ntyped(mxb)), names(rm.ind))) -> mxb // (1)
summary(mxb)

ntyped(mxb) |> names() outputs a list of all individuals.

Thin the set of markers.

set.seed(1) // (1)
lapply(pull.map(mxb), function(i) pickMarkerSubset(i, 1)) |> unlist() -> keep

drop.markers(mxb, setdiff(markernames(mxb), keep))  -> mxb

Markers are picked at random to have one per megabase. To ensure you always get the results, set the seed for random number generator.

Look at the graphical genotypes again.

pdf("plot.geno_1Mb.pdf")
geno.image(mxb, col=c("white", "red", "violet", "blue"))
dev.off()

Construct the map again.

mstmap.cross(mxb, id="id") -> mxb
summary.map(mxb)

Align the genetic to the physical map.

pdf("align_maps.pdf")
alignCross(mxb, maps=list(mxb_physical), layout=c(1,1))
dev.off()

Compute the rank correlation.

pull.map(mxb, as.table=T) -> a
pull.map(mxb_physical, as.table=T) -> b
merge(a, b, by=0) -> m // (1)
sapply(split(m, m$chr.x), function(i) with(i, cor(pos.x, pos.y, method='s'))) -> cc // (2)

by=0 means: merge by row names.
Use method = 'p' to compute the Pearson (linear) correlation.

Flip those linkage groups that are inverted relative to the reference orientation (short arm first).
```
names(which(cc < 0)) -> flip.chr
flip.order(mxb, flip.chr) -> mxb
```

Plot the updated alignment and graphical genotypes.

pdf("align_maps_2.pdf")
alignCross(mxb, maps=list(mxb_physical), layout=c(1,1))
dev.off()

pdf("plot.geno.mstmap.pdf")
geno.image(mxb, col=c("white", "red", "violet", "blue"))
dev.off()

Compare genetic and physical map with connector plots.

rescalemap(mxb_physical, 1/5) -> mxb_physical // (1)

pdf("plot_map_2.pdf")
plotMap(mxb, mxb_physical)
dev.off()

Divide chromosome lengths by 5.

Save the genetic map object.

mxb -> mxb_genetic
saveRDS(mxb_genetic, file="mxb_genetic.Rds")

7. Trait mapping

Load the required package and the data.

.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1)

library("qtl")
readRDS("mxb_genetic.Rds") -> mxb

Calculate genotype probabilities conditional on the marker data.
```
calc.genoprob(mxb) -> mxb
```
Run the "QTL" scan for the first phenotypes, row type.
```
mxb$pheno$row_type <- ifelse(mxb$pheno$row_type == 2, 0, 1) // (1)

scanone(mxb, pheno.col=1, method="mr", model="binary") -> out // (2)
summary(out)

pdf("plot_row_type.pdf") // (3)
plot(out)
dev.off()
```
1. Values for binary traits have to be zero or one.
2. Perform a single-QTL genome scan using the marker regression (mr) method. scanone() supports several other methods and models. The defaut is "normal" for a quantiative phenotype.
3. Plot the LOD (logarithm of the odds) scores along the genome.

Run a permutation test to get p-values and a significance threshold.

operm <- scanone(mxb, method="mr", n.perm=1000, pheno.col=1)

summary(operm, alpha=c(0.05, 0.2))

pdf("plot_operm.pdf") (1)
hist(as.numeric(operm))
abline(v=summary(operm, alpha=c(0.05)), col='red')
dev.off()

summary(out, perms=operm, pvalues=TRUE)

pdf("plot_row_type_threshold.pdf")
plot(out)
abline(h=3.5, col='red') // (2)
dev.off()

Plot a histogram of the maximum genome-wide LOD scores from the 1000 permutations.
Add the significance threshold.

Get interval estimates.
```
lodint(out, chr="2H", drop=2)
```
Plot the phenotypes against the genotypes at the most highly associated marker (genotype x phenotpe, pxg).
```
rownames(max(out)) -> mar
pdf("plot_pxg.pdf")
plotPXG(mxb, pheno.col=1, marker=mar)
dev.off()
```
To close R, type quit(). To close the tmux session, type exit. This will finish the tmux session. If you just want to detach and keep the session running, use tmux detach.
Check by BLAST how close the top marker is to the causal gene (VRS1). The sequence is available from GenBank. GrainGenes offers a web-based BLAST.

8. Mapping the KASP markers

Load the required libraries.

.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1")
library(data.table)
library(xlsx)
library("qtl")

Add the KASP results to the phenotypes table.

fread('MxB_92_RILs_phenotypes.csv')->p

data.table(read.xlsx('/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/prep/KASP_results_PGR23.xlsx', 1))->k
setnames(k, "Samples", "id")
k[grepl("F11$", id)][, id := sub("MBRIL-", "07-", sub("_F11$", "", id))][] -> k

k[p, on="id"] -> m
m[, .(row_type, plant_height, RAW1_Henry, VRS1_Henry, RAW1_ZsaZsa, VRS1_ZsaZsa, RAW1_Tania, VRS1_Tania, id)] -> m

unique(m[, .(row_type, VRS1_Tania)])

Modify the mxb object.

readRDS('mxb_genetic.Rds')

data.frame(m[id %in% mxb$pheno$id]) -> mxb$pheno
mxb
colnames(mxb$pheno)

Plot the new "phenotypes".

pdf("plot_KASP.pdf")
plotPheno(mxb, pheno.col=3)
plotPheno(mxb, pheno.col=4)
plotPheno(mxb, pheno.col=5)
plotPheno(mxb, pheno.col=6)
plotPheno(mxb, pheno.col=7)
plotPheno(mxb, pheno.col=8)
dev.off()

Map VRS1_Tania.

mxb$pheno$VRS1_Tania <- ifelse(mxb$pheno$VRS1_Tania == "C", 0, 1)
scanone(mxb, pheno.col=8, method="mr", model="binary") -> out

pdf("plot_VRS1_Tania.pdf")
plot(out)
dev.off()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mapping traits in biparental populations with genotyping-by-sequencing data

1. Open a UNIX shell

2. Set up your analysis directory

3. Read mapping

4. Variant calling and filtration

5. Import the genetic variant matrix into R

6. Quality control and linkage map construction

7. Trait mapping

8. Mapping the KASP markers

FilesExpand file tree

tutorial.asciidoc

Latest commit

History

tutorial.asciidoc

File metadata and controls

Mapping traits in biparental populations with genotyping-by-sequencing data

1. Open a UNIX shell

2. Set up your analysis directory

3. Read mapping

4. Variant calling and filtration

5. Import the genetic variant matrix into R

6. Quality control and linkage map construction

7. Trait mapping

8. Mapping the KASP markers