
Post processing

Thomas Krannich edited this page Feb 25, 2021 · 3 revisions

Post-processing

Here we introduce a few common post-processing steps that help you get the most out of your PopIns2 analysis, give a better overview of the results and save some disk space (without data loss).

Constructing a multi-VCF file of your entire population

You can obtain a VCF file that summarizes the genotypes of all variants in all samples by following the steps below. They require the UNIX tool cat, bgzip and tabix (both shipped with HTSlib), and vcf-sort and vcf-merge from VCFtools. The steps are written as input/output/shell rules and can be translated into the workflow language of your choice.

  1. Sort the VCF files
input: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions.vcf
output: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions_sorted.vcf
shell: cat {input} | vcf-sort -c > {output}
  2. Compress the sorted VCF files
input: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions_sorted.vcf
output: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions_sorted.vcf.gz
shell: bgzip {input}
  3. Index the compressed VCF files
input: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions_sorted.vcf.gz
output: {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions_sorted.vcf.gz.tbi
shell: tabix -p vcf {input}
  4. Merge all VCF files
input: {PATH_TO_YOUR_PROJECT}/*/insertions_sorted.vcf.gz
output: {PATH_TO_YOUR_PROJECT}/insertions_all.vcf.gz
shell: vcf-merge {input} | bgzip -c > {output}
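
For readers who do not use a workflow manager, the four steps above can be chained in a small bash function. This is a minimal sketch, not part of PopIns2: the function name build_population_vcf is hypothetical, and it assumes that every sample lives in its own subdirectory of the project directory and that vcf-sort, vcf-merge, bgzip and tabix are on your PATH. The -f flag on bgzip is an addition that lets a rerun overwrite an existing compressed file.

```shell
#!/usr/bin/env bash
# Sketch: build a multi-sample VCF from per-sample insertions.vcf files.
# build_population_vcf is a hypothetical helper, not shipped with PopIns2.
# The body runs in a subshell so the strict shell options do not leak out.
build_population_vcf() (
    set -euo pipefail
    project="$1"

    for sample_dir in "$project"/*/; do
        vcf="${sample_dir}insertions.vcf"
        sorted="${sample_dir}insertions_sorted.vcf"
        [ -f "$vcf" ] || continue          # skip directories without calls

        # Step 1: sort in chromosomal order
        vcf-sort -c < "$vcf" > "$sorted"
        # Step 2: compress; bgzip replaces the file with insertions_sorted.vcf.gz
        bgzip -f "$sorted"
        # Step 3: index the compressed file (writes the .tbi next to it)
        tabix -p vcf "${sorted}.gz"
    done

    # Step 4: merge all samples into one compressed multi-sample VCF
    vcf-merge "$project"/*/insertions_sorted.vcf.gz \
        | bgzip -c > "$project/insertions_all.vcf.gz"
)
```

After sourcing the function, call it with your project directory as the only argument, i.e. with the path you would substitute for {PATH_TO_YOUR_PROJECT}.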

Each record (insertion) in the final insertions_all.vcf.gz file contains the genotypes of all samples. Once steps 1 and 2 have completed successfully, you can safely delete the original {PATH_TO_YOUR_PROJECT}/{SAMPLE}/insertions.vcf to save some disk space.
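
If you want a safety check before deleting the originals, one option is to confirm that each compressed file is readable first. The following is a sketch under assumptions: remove_original_vcfs is a hypothetical helper name, and it relies on bgzip output being gzip-compatible, so that the standard gzip -t integrity test applies. An original is only removed when its sorted, compressed counterpart exists and passes that test.

```shell
#!/usr/bin/env bash
# Sketch: delete per-sample insertions.vcf only if the compressed copy is intact.
# remove_original_vcfs is a hypothetical helper, not shipped with PopIns2.
remove_original_vcfs() {
    local project="$1"
    local sample_dir original compressed

    for sample_dir in "$project"/*/; do
        original="${sample_dir}insertions.vcf"
        compressed="${sample_dir}insertions_sorted.vcf.gz"
        [ -f "$original" ] || continue     # nothing to delete for this sample

        # BGZF files are valid gzip streams, so gzip -t can verify them.
        if [ -s "$compressed" ] && gzip -t "$compressed" 2>/dev/null; then
            rm -- "$original"
        else
            echo "keeping $original: $compressed is missing or unreadable" >&2
        fi
    done
}
```

Samples whose compressed file fails the check are reported on standard error and left untouched, so the function can be rerun after those samples have been fixed.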


