-
Connect with PuTTY to the server slurm-01.ipk-gatersleben.de.
-
Obtain a Kerberos ticket to access our filer resources.
kinit klist (1)
-
A ticket is valid for seven days.
-
-
You will see the prompt of the UNIX command-line interpreter, the shell. It is called "shell" because it executes commands that do not belong to the operating system, the "kernel".
-
Check out your home directory.
pwd ls
-
Let’s have a look at the GBS data.
cd /filer/projekte/gbs_course cd data ls ls -lhS gbs_reads | less (1) man ls cd gbs_reads ls | wc -l zcat Morex.fastq.gz | less -S // (2) cd .. du -ch gbs_reads
-
To close
less, pressq. -
The reads are in FASTQ format.
-
-
Take a look at the phenotypes. There are two. One is binary: lateral spikelet fertility (commonly referred to in barley as row type). The other is quantitative (plant height).
less MxB_92_RILs_phenotypes.csv
-
Create your working directory and change into it.
cd /filer/projekte/gbs_course mkdir -p mascher/gbs // (1) cd mascher/gbs
-
Change
mascherto your name.
-
-
Create symbolic links to the GBS reads and the phenotype data.
pwd // (1) mkdir gbs_reads datadir="/filer/projekte/gbs_course/data/" ln -s $datadir/gbs_reads/*fastq.gz gbs_reads ls -l gbs_reads | less -S ls gbs_reads | wc -l ln -s $datadir/MxB_92_RILs_phenotypes.csv
-
Check that you are in your working directory (
/filer/projekte/gbs_course/USER/gbs).
-
-
Count the reads. We use a for-loop to process all samples at once.
zcat gbs_reads/Morex.fastq.gz | wc -l > Morex.len for i in gbs_reads/*fastq.gz; do name=`basename $i | cut -d . -f 1` count=`zcat $i | wc -l` echo $name $count done > fastq_line_counts.txt cat fastq_line_counts.txt | awk '{print $1,$2/4}' | sort -nk 2 > fastq_read_counts.txt // (1)
-
We use AWK to divide the second column by 4.
-
-
We will use the tools BWA and samtools for read mapping and processing of alignment records, respectively.
module load bwa samtools module list
-
Load the BWA index into shared memory. This step can be skipped, but pre-loadeding the index makes mapping a lot faster.
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' // (1) bwa shm $ref bwa shm -l
-
We use the MorexV3 reference genome sequence.
-
-
We will first run a toy example with 10,000 reads of the Morex sample.
zcat gbs_reads/Morex.fastq.gz | head -n 40000 > Morex_10k.fastq bwa mem $ref Morex_10k.fastq > Morex_10k.sam (1) samtools sort Morex_10k.sam > Morex_10k.bam (1) samtools view Morex_10k.bam | less -S
-
Alignments are stored in the Sequence Alignment Map format (SAM) or its binary equivalent BAM.
-
-
Combine mapping and sorting into one command.
bwa mem $ref Morex_10k.fastq | samtools sort > Morex_10k.bam
-
Open a new PuTTY session. Start tmux to keep your commands running in the background.
k5reauth tmux (1)
-
k5reauth is used to prolong the Kerberos ticket for as long as the tmux session is running.
-
-
Try out tmux functionality.
tmux ls tmux rename mapping tmux ls tmux detach tmux ls tmux attach -t mapping
-
Map all samples.
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' for i in gbs_reads/*fastq.gz; do name=`echo $i | cut -d . -f 1` // (1) bwa mem -t 4 $ref $i | samtools sort > $name.bam done 2> bwa.err (2)
-
Strip the extension: Morex.fastq.gz become Morex.
-
To detach the tmux session, press Ctrl-B followed by D.
❗If you forget to start bwa inside a tmux session, there is no way to prevent your job from aborting when you shutdown your laptop. Also without k5reauth programs cannot access filer resource after a maximum of ten hours. -
-
Open a new terminal. Look at your jobs in the table of processes (
top).find gbs_reads | grep -c 'bam$' (1) top -u mascher // (2)
-
Show the number of BAM files created so far.
-
Replace mascher with your username.
-
-
When the mapping is done, calculate the mapping rates for all samples.
for i in gbs_reads/*bam; do name=`basename $i | cut -d . -f 1` count=`samtools view -cq 20 $i` echo $name $count done > mapped_reads.txt
-
Combine the raw read counts and the mapping rates into one table.
LC_ALL=C sort fastq_read_counts.txt > tmp1 // (1) LC_ALL=C sort mapped_reads.txt > tmp2 // (1) join tmp1 tmp2 | awk '{print $0,$3/$2*100}' | sort -nk 4 | column -t > mapping_stats.txt // (2) rm -f tmp1 tmp2 column -t mapping_stats.txt | less -S
-
Get a list of all BAM files.
ls gbs_reads/*bam | sort > bam_list.txt
-
Open a new tmux session and load bcftools.
tmux // (1) tmux rename variant_call module load bcftools
-
The variant calling will run for some time, so run it inside
tmux.
-
-
Run the variant calling.
ref='/filer/projekte/gbs_course/data/reference/Hordeum_vulgare_MorexV3_assembly.fasta' #bamlist='bam_list.txt' bamlist='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bam_list_Martin.txt' (1) vcf='bcftools_SNP_calling.vcf' // (2) bcftools mpileup --skip-indels --fasta-ref $ref --bam-list $bamlist --min-MQ 20 --annotate AD \#(3) | bcftools view -i 'INFO/DP > 100' \#(4)(5) | bcftools call --consensus-caller --variants-only --output $vcf
-
List of pre-computed BAM files.
-
Output file in variant call format (VCF). Here are the specifications of the VCF format.
-
We ignore insertions and deletions (
--skip-indels), consider only SNPs with a quality score no smaller than 20 (--min-MQ 20) and add allelic depth information (--annotate AD) for all genotype calls. -
Remove sites with fewer than 100 reads across all samples. We are only interested in sites that have at least two supporting reads in nine tenths of the samples.
-
The backslash \ character is used to split long commands across multiple lines. When pasting the commands or editing them, make sure that no white space follows the backslash. Otherwise, the shell will interpret the lines as belonging to different commands. Also multi-line commands do not tolerate intervening command line (starting the hash sign #).
-
-
Filter the variant calls.
filter='/filer/projekte/gbs_course/scripts/filter_vcf.zsh' #vcf='bcftools_SNP_calling.vcf' vcf='/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/try_221216/bcftools_SNP_calling.vcf' (1) fvcf='bcftools_SNP_calling_filtered.vcf' $filter --vcf $vcf --dphom 2 --dphet 4 --minmaf 0.2 --minpresent 0.9 --minhomp 0.9 > $fvcf // (2)
-
Path to pre-computed VCF file.
-
We keep homozygous genotype calls if they have at least two supporting reads; heterozygous calls are accepted if they are supported by no fewer than four reads. SNPs with a minor allele frequency below 20 % or less than 90 % present calls or less than 90 % homozyous calls are discarded.
-
-
Change the column names of the VCF files to match the row names in the phenotype table.
less MxB_92_RILs_phenotypes.csv bcftools query -l bcftools_SNP_calling_filtered.vcf | less -S bcftools query -l $fvcf | cut -d / -f 2 | cut -d . -f 1 > new_sample_names.txt bcftools reheader -s new_sample_names.txt $fvcf > bcftools_SNP_calling_filtered_newNames.vcf
-
Review the VCF file.
grep -v '^##' bcftools_SNP_calling_filtered_newNames.vcf | column -t | less -S
-
Start R.
module load R/4.1.1 R
-
R is a widely used programming language in data science. There are very many tutorials, e.g. this one.
-
Load the required libraries.
.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1) library("qtl") // (2) library("ASMap") // (3) library("utl") // (4)
-
Set the path where the R libraries are located.
-
R/qtl is package for QTL mapping. Several tutorials are available here.
-
R/ASMap is package for linkage map constuction. It implements the MSTMAP algorithm. A detailed tutorial is available here.
-
utl provides utility functions for R/qtl, one of which we use to convert VCF to R/qtl format.
-
-
Convert the VCF to R/qtl format. Example files are found here.
vcf <- 'bcftools_SNP_calling_filtered_newNames.vcf' ids <- read.table("new_sample_names.txt", head=F)[, 1] genfile <- 'bcftools_SNP_calling_geno.csv' founders <- c("Morex", "Barke") samples <- setdiff(ids, founders) convert_vcf_to_genfile(vcf, genfile, samples, founders) // (1)
-
This function writes a text file in R/qtl’s "csvs" format to disk. The output filename is
genfile.
-
-
The conversion function does not take genomic coordinates into account, so markers are ordered correctly, but equidistant. We add a line to the CSV file to correct this.
cmd <- "sed -Ei '1{p; s/id|chr.H://g; h; d}; 2G'" // (1) paste(cmd, genfile) |> system() // (2)(3)
-
The sed command retrieves the genomic coordinates from the first line and inserts them as a new line after line 2. This could also be achieved with a text editor. The command, at the cost of arcanity, forgoes error-prone manual editing.
-
The
sedcommand is called from inside R and modifies the file in place. -
Alternatively, you write this command as
system(paste(cmd, genfile)). Before R version 4.0 (or so), this was the only way to do it. The pipe-like syntax is a recent addition to R.
-
-
Read the genotype and phenotype data into an R/qtl cross object.
pheno <- 'MxB_92_RILs_phenotypes.csv' read.cross(format="csvs", genfile=genfile, phefile=pheno, crosstype= "f2", genotypes=c("AA","AB","BB")) -> mxb convert2bcsft(mxb, BC.gen=0, F.gen=8, estimate.map=F) -> mxb // (1) summary(mxb) summary.map(mxb)
-
A population of recombinant inbred lines is read in as "f2".
convert2bcsft()is used to set the correct number of selfing generations. See the documentation of that function. We disable theestimate.mapoption because we will rely on R/ASMap to construct a genetic linkage map.
-
-
R/qtl offers several functions to extract basic information from "cross" objects.
nind(mxb) # number of individuals nchr(mxb) # number of chromosomes (actually linkage groups) totmar(mxb) # total number of markers nmar(mxb) # number of markers per chromosome nphe(mxb) # number of phenotypes
-
Plot a summary of the phenotypes. Copy the PDF file with WinSCP and take a look at it with the Acrobat Reader.
pdf("plot_pheno.pdf") (1) plotPheno(mxb, pheno.col=1, xlab="phenotypes") plotPheno(mxb, pheno.col=2) dev.off() (2)
-
Open a PDF file to plot to. The file is created in the current working directory.
-
Close the file (switch off the plotting device). Don’t forget to call
dev.off(). Otherwise the PDF file will be empty or invalid.
-
-
Change the scale of the linkage map from base pairs to megabase pairs and plot the distribution of markers along chromosomes.
rescalemap(mxb, 1/1e6) -> mxb summary.map(mxb) pdf("plot_map.pdf") plot.map(mxb, main="Physical map", ylab="Location (Mb)") dev.off()
-
Create a copy of the cross object with the physical map for later comparison.
copy(mxb) -> mxb_physical saveRDS(mxb_physical, "mxb_physical.Rds") // (1) readRDS("mxb_physical.Rds") -> mxb_physical
-
Any R object can be saved to and read from disk, respectively, with
saveRDS()andreadRDS().
-
-
Karl Broman pointed out that a reference genome sequence has obviated the need for linkage map construction in mouse; the same applies to barley. Although marker order is known, constructing a linkage map from scratch is a means of quality control. If there are issues with the data, the genetic map will be off the mark.
-
A map off the cuff is not too bad. The only worry is that it’s about a fifth longer than expected.
summary.map(mstmap(mxb, id="id"))
-
Let’s see if we can do better. The usual suscepts are bad markers and odd individuals.
-
Remove duplicated markers. If there are groups of markers that differ only in their patterns of missing data, keep only one representative.
findDupMarkers(mxb, exact.only=FALSE) -> dups // (1) unlist(dups) |> length() mxb <- drop.markers(mxb, unlist(dups)) summary(mxb)
-
exact.only=FALSE ignores differences in missingness.
-
-
Remove duplicated individuals.
cg <- comparegeno(mxb) pdf("compare_geno.pdf") hist(cg[lower.tri(cg)], breaks=seq(0, 1, len=101), xlab="No. matching genotypes") dev.off() summary(cg, thresh=0.9) -> dups subset(mxb, ind=setdiff(samples, dups$ind2)) -> mxb
-
Have a look at the graphical genotypes.
pdf("plot.geno.pdf") geno.image(mxb, col=c("white", "red", "violet", "blue")) // (1) dev.off()
-
The default colors are red and green, which puts color blind people at a disadvantage. In this case, the genotypes AA, BB, AB are displayed in the colors red, blue, and violet, respectively. Missing data will be displayed in white.
-
-
We remove the odd individuals with lots of missing calls.
ntyped(mxb) |> sort() |> head(n=1) -> rm.ind subset(mxb, ind=setdiff(names(ntyped(mxb)), names(rm.ind))) -> mxb // (1) summary(mxb)
-
ntyped(mxb) |> names()outputs a list of all individuals.
-
-
Thin the set of markers.
set.seed(1) // (1) lapply(pull.map(mxb), function(i) pickMarkerSubset(i, 1)) |> unlist() -> keep drop.markers(mxb, setdiff(markernames(mxb), keep)) -> mxb
-
Markers are picked at random to have one per megabase. To ensure you always get the results, set the seed for random number generator.
-
-
Look at the graphical genotypes again.
pdf("plot.geno_1Mb.pdf") geno.image(mxb, col=c("white", "red", "violet", "blue")) dev.off()
-
Construct the map again.
mstmap.cross(mxb, id="id") -> mxb summary.map(mxb)
-
Align the genetic to the physical map.
pdf("align_maps.pdf") alignCross(mxb, maps=list(mxb_physical), layout=c(1,1)) dev.off()
-
Compute the rank correlation.
pull.map(mxb, as.table=T) -> a pull.map(mxb_physical, as.table=T) -> b merge(a, b, by=0) -> m // (1) sapply(split(m, m$chr.x), function(i) with(i, cor(pos.x, pos.y, method='s'))) -> cc // (2)
-
by=0means: merge by row names. -
Use
method = 'p'to compute the Pearson (linear) correlation.
-
-
Flip those linkage groups that are inverted relative to the reference orientation (short arm first).
names(which(cc < 0)) -> flip.chr flip.order(mxb, flip.chr) -> mxb
-
Plot the updated alignment and graphical genotypes.
pdf("align_maps_2.pdf") alignCross(mxb, maps=list(mxb_physical), layout=c(1,1)) dev.off() pdf("plot.geno.mstmap.pdf") geno.image(mxb, col=c("white", "red", "violet", "blue")) dev.off()
-
Compare genetic and physical map with connector plots.
rescalemap(mxb_physical, 1/5) -> mxb_physical // (1) pdf("plot_map_2.pdf") plotMap(mxb, mxb_physical) dev.off()
-
Divide chromosome lengths by 5.
-
-
Save the genetic map object.
mxb -> mxb_genetic saveRDS(mxb_genetic, file="mxb_genetic.Rds")
-
Load the required package and the data.
.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") // (1) library("qtl") readRDS("mxb_genetic.Rds") -> mxb
-
Calculate genotype probabilities conditional on the marker data.
calc.genoprob(mxb) -> mxb
-
Run the "QTL" scan for the first phenotypes, row type.
mxb$pheno$row_type <- ifelse(mxb$pheno$row_type == 2, 0, 1) // (1) scanone(mxb, pheno.col=1, method="mr", model="binary") -> out // (2) summary(out) pdf("plot_row_type.pdf") // (3) plot(out) dev.off()
-
Values for binary traits have to be zero or one.
-
Perform a single-QTL genome scan using the marker regression (mr) method. scanone() supports several other methods and models. The defaut is "normal" for a quantiative phenotype.
-
Plot the LOD (logarithm of the odds) scores along the genome.
-
-
Run a permutation test to get p-values and a significance threshold.
operm <- scanone(mxb, method="mr", n.perm=1000, pheno.col=1) summary(operm, alpha=c(0.05, 0.2)) pdf("plot_operm.pdf") (1) hist(as.numeric(operm)) abline(v=summary(operm, alpha=c(0.05)), col='red') dev.off() summary(out, perms=operm, pvalues=TRUE) pdf("plot_row_type_threshold.pdf") plot(out) abline(h=3.5, col='red') // (2) dev.off()
-
Plot a histogram of the maximum genome-wide LOD scores from the 1000 permutations.
-
Add the significance threshold.
-
-
Get interval estimates.
lodint(out, chr="2H", drop=2)
-
Plot the phenotypes against the genotypes at the most highly associated marker (genotype x phenotpe, pxg).
rownames(max(out)) -> mar pdf("plot_pxg.pdf") plotPXG(mxb, pheno.col=1, marker=mar) dev.off()
-
To close R, type
quit(). To close the tmux session, typeexit. This will finish the tmux session. If you just want to detach and keep the session running, usetmux detach. -
Check by BLAST how close the top marker is to the causal gene (VRS1). The sequence is available from GenBank. GrainGenes offers a web-based BLAST.
-
Load the required libraries.
.libPaths("/filer/projekte/gbs_course/Rlibs/4.1.1") library(data.table) library(xlsx) library("qtl")
-
Add the KASP results to the phenotypes table.
fread('MxB_92_RILs_phenotypes.csv')->p data.table(read.xlsx('/filer-dg/agruppen/dg6/mascher/DG/mxb_course_221201/prep/KASP_results_PGR23.xlsx', 1))->k setnames(k, "Samples", "id") k[grepl("F11$", id)][, id := sub("MBRIL-", "07-", sub("_F11$", "", id))][] -> k k[p, on="id"] -> m m[, .(row_type, plant_height, RAW1_Henry, VRS1_Henry, RAW1_ZsaZsa, VRS1_ZsaZsa, RAW1_Tania, VRS1_Tania, id)] -> m unique(m[, .(row_type, VRS1_Tania)])
-
Modify the
mxbobject.readRDS('mxb_genetic.Rds') data.frame(m[id %in% mxb$pheno$id]) -> mxb$pheno mxb colnames(mxb$pheno)
-
Plot the new "phenotypes".
pdf("plot_KASP.pdf") plotPheno(mxb, pheno.col=3) plotPheno(mxb, pheno.col=4) plotPheno(mxb, pheno.col=5) plotPheno(mxb, pheno.col=6) plotPheno(mxb, pheno.col=7) plotPheno(mxb, pheno.col=8) dev.off()
-
Map VRS1_Tania.
mxb$pheno$VRS1_Tania <- ifelse(mxb$pheno$VRS1_Tania == "C", 0, 1) scanone(mxb, pheno.col=8, method="mr", model="binary") -> out pdf("plot_VRS1_Tania.pdf") plot(out) dev.off()
