Skip to content

NIH-CARD/CARDlongread_data_standardization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

83 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Long read data standardization SOP for downstream analysis

Mike Nalls, Jon Moller, Rylee Genner, Melissa Meredith, and Kimberley Billingsley

This repository summarizes the CARD Applied Neurogenomics Group's long read sequencing data standardization SOP for downstream analyses like QTLs. It includes both examples of file formats for each downstream analysis input data type and preprocessing scripts to convert raw data files (e.g., VCFs, methylation BEDs) into tables suitable for downstream analysis.

File formats and examples

Metadata

Metadata files contain individual level data and potentially protected health information (PHI) so be cautious with sharing. In general these are sample level metrics not stratified by haplotype.

Mandatory columns are as follows:
SAMPLE = sample ID in alphanumeric string format.
MALE = binary indicator confirming XY sex chromosomes
STATUS = disease of interest, controls for the population are always designated by the string value “control”.
CHEMISTRY = an alphanumeric code for the chemistry used such as “R9” or “R10”.
Any additional columns are considered covariates generally.

An example of one such file is below:

SAMPLE,MALE,STATUS,CHEMISTRY,PC1,PC2,PC3,PC4,PC5
indiv1,1,control ,R9,0.001,-0.002,0.005.0.006,-0.1
indiv2,1,PD ,R9,0.01,-0.003,0.006.0.0068,-0.4,
indiv3,0,AD ,R10,0.3,-0.004,0.0065.0.08,0.002

Omics

All omics can be reduced to count tables.

Mandatory columns are as follows:
SAMPLE = sample ID in alphanumeric string format.
HAPLOTYPE = a string that denotes haplotype as “H1” or “H2”
Any additional columns are numeric columns corresponding to likely normalized values of omics sites or regions found in corresponding map files.

An example of one such file is below:

SAMPLE,HAPLOTYPE,APOE
indiv1,H1,5.0
indiv1,H2,0.0
indiv2,H1,99.5
indiv2,H2,55.25
indiv3,H1,96.2
indiv3,H2,90.0

Genetics

Genetics data is the count of alternate or non reference alleles for a SV or SNV per haplotype.

Mandatory columns are as follows:
SAMPLE = sample ID in alphanumeric string format.
HAPLOTYPE = a string that denotes haplotype as “H1” or “H2”
Any additional columns are numeric columns corresponding to alternate allele count values of genetic sites or regions found in corresponding map files.

An example of one such file is below:

SAMPLE,HAPLOTYPE,chr1_111000_A_G,chr1_111555_T_CCCC
indiv1,H1,0,1
indiv1,H2,0,1
indiv2,H1,1,1
indiv2,H2,1,0
indiv3,H1,0,0
indiv3,H2,1,0

Maps

Map files are genome-wide maps in a standard format. Note that for genetic variants all START values are incorporated into variant names and for SNVs, the START and STOP values should be identical.

Mandatory columns are as follows:
NAME = variant or region name as alphanumeric string.
CHROM = chromosome designation as described above prefixed by “chr”.
START = integer region start site.
STOP = integer region stop site, this will be the same as START for SNVs.
REF = reference allele for the variant (note, not applicable and can be omitted for omics maps, only exists in genetics maps).
ALT = non-reference allele for the variant (note, not applicable and can be omitted for omics maps, only exists in genetics maps).

Variants are named in the NAME column above differently depending upon whether they are structural variants (SVs) or single nucleotide variants (SNVs). SNVs are named SNV_CHROM_START_STOP_REF_ALT with each field defined in the columns above (e.g., SNV_chr16_53368720_C_CTGTT). SVs, on the other hand, are named SV_CHROM_SVTYPE_START_STOP_SVLEN where SVTYPE is the type of SV (e.g., INS, DEL) and SVLEN is the absolute value of the SV length (e.g., 94 for -94). An example SV name is SV_chr19_DEL_54471870_54471945_75.

Regarding positions and alleles, the reference genome used should always be specified in the cohort readme file.

Example genetics map file below:

NAME,CHROM,START,STOP,REF,ALT
chr1_111000_A_G,chr1,111000,111000,A,G
chr1_111555_T_CCCC,chr1,111555,111558,T,CCCC

Example omics map file below:

NAME,CHROM,START,STOP
APOE,chr19,44905796,44909393

Example regions of interest map file below:

NAME,CHROM,START,STOP
SORT1,chr1,109309574,109397918
PMVK,chr1,154924739,154942658
KRTCAP2,chr1,155169407,155173304
GBAP1,chr1,155213824,155227534

Preprocessing scripts and usage

Metadata

Metadata scripts are still in development. These scripts will standardize calculation of PCs for omics regions per sample and joining with standard sample metadata tables (age, sex, ancestry, PMI, etc.).

Included are scripts to guide the exploratory data analysis (EDA) process, particularly calculation of PCs, evaluation of PC contribution through stepwise regression, and merging with covariates. We have written two scripts to perform principal component analysis on different types of data (e.g., genetic variants, methylation, gene expression) and then join chosen principal components with a standard sample metadata table.

The first script (make_pcs_stepwise.py) runs PCA using the PCA method from the Python scikit-learn package on different input data types and creates scree plots to assist choice of PCs that explain most of the variation in the data. Missing values are imputed with the median of the column (methylation region or expression per gene/transcript) using the scikit-learn SimpleImputer method, while columns with entirely missing values are excluded from PCA calculation altogether. The scree plots generated (see below) show cumulative variance proportion explained relative to PC number with a user-adjusted cumulative variance explained cutoff. We set this cutoff to 0.7 (70%) based on commonly accepted practice for deciding how many PCs to retain. make_pcs_stepwise.py also includes options for each data type and for the PC prefix (e.g., GENETIC_ and thus GENETIC_PC for PCs from genetic variant data).

Below is an example scree plot for NABEC CpG island (CGI) normalized unphased methylation data:

image

The second script (choose_pcs_join_metadata.py) takes a metadata table (e.g., sample, age, sex, PMI, ancestry), list of input PC files generated with make_pcs_stepwise.py, and a list of values that indicate the number of PCs to include (starting from PC_1) from each PC file. These values are chosen based on inspection of the previous scree plots and output of the prior script indicating the minimum number of PCs starting with PC1 that together explain at least the cumulative variance cutoff (0.7 default). It then creates a single metadata/covariates/PCs table joined on common sample names ("SAMPLE" column) and a correlation matrix plus a corresponding heatmap showing relationships between included covariates.

Below is an example correlation matrix heatmap for NABEC CpG island (CGI) normalized unphased methylation data. The first 14 genetic PCs and first 5 methylation PCs are included in the full table.

image

usage: make_pcs_stepwise.py [-h] --input_type INPUT_TYPE --input INPUT --output_prefix OUTPUT_PREFIX --pc_prefix PC_PREFIX [--cumulative_variance_explained_cutoff CUMULATIVE_VARIANCE_EXPLAINED_CUTOFF] [--plot_title PLOT_TITLE]
                            [--new_sample_names NEW_SAMPLE_NAMES]

Perform exploratory data analysis (EDA) by running PCA for 20 PCs on input data (genetics, methylation, or expression) and generating a scree plot to assist selection of PCs.

optional arguments:
  -h, --help            show this help message and exit
  --input_type INPUT_TYPE
                        Input file type (currently genetics, methylation, or expression). Genetics input is Plink eigenvalue/eigenvector files generated with the --pca 20 option, while methylation input is a methylation BED file and
                        expression input is a normalized expression BED file.
  --input INPUT         Path to input genetics file prefix (for both Plink eigenvalue/eigenvector) or methylation/expression input file.
  --output_prefix OUTPUT_PREFIX
                        Prefix for PC and scree plot output files.
  --pc_prefix PC_PREFIX
                        Prefix for PC names (e.g., METH_PC1).
  --cumulative_variance_explained_cutoff CUMULATIVE_VARIANCE_EXPLAINED_CUTOFF
                        Horizontal cutoff line for cumulative variance explained in scree plot (default 0.70). PC that exceeds threshold printed to standard output.
  --plot_title PLOT_TITLE
                        Title for output scree plot.
  --new_sample_names NEW_SAMPLE_NAMES
                        Path to list of new sample names for output PC file. Must have same number of samples as input data file.
usage: choose_pcs_join_metadata.py [-h] --input_metadata INPUT_METADATA --input_pcs INPUT_PCS [INPUT_PCS ...] --pc_counts PC_COUNTS [PC_COUNTS ...] --output_prefix OUTPUT_PREFIX
                                   [--tensorQTL_transpose | --no-tensorQTL_transpose]

Choose PCs from multiple input PC files and merge chosen PCs with metadata/covariates table.

optional arguments:
  -h, --help            show this help message and exit
  --input_metadata INPUT_METADATA
                        Path to initial input metadata/covariates file to be joined with PCs of interest.
  --input_pcs INPUT_PCS [INPUT_PCS ...]
                        Path to input PC files generated or preprocessed with make_pcs_stepwise.py.
  --pc_counts PC_COUNTS [PC_COUNTS ...]
                        Number of PCs starting with PC1 to retain for input PC files, in order of inputs specified for --input_pcs.
  --output_prefix OUTPUT_PREFIX
                        Specify prefix for output merged metadata/covariates/PCs file and covariate correlation heatmap.
  --tensorQTL_transpose, --no-tensorQTL_transpose
                        Convert categorical variables to integer dummies, transpose joined output covariates, and output TSV file as required for tensorQTL input. (default: False)

Genetic PCs can be generated from a MAF filtered variant file with plink through the following command, which keeps the first 20 PCs:

plink2 --vcf example_variants.vcf.gz --pca 20 --out example_variants_PCA_20

This generates PCs in eigenvector file example_variants_PCA_20.eigenvec and a list of variance explained per included PC in the eigenvalue file example_variants_PCA_20.eigenval.

Omics

We have so far developed a script to convert methylation BED files from Napu into methylation data and map files as described above. This script takes methylation BED files for each haplotype as input and outputs methylation data and map files as CSV output. It is written to process CpG island (CGI), gene body (GB), and promoter (PROM) regions. It performs a full outer join on each haplotype (union of methylation regions) and fills in missing values in the respective haplotype as NA. An option is also provided to filter out regions with missing methylation information above a threshold proportion of samples (default 0.05 or 5%; missing methylation typing rate equivalent to missing genotyping rate). Sample methylation data and map files are provided as example_methylation_data.csv and example_methylation_map.csv.

usage: make_methylation_data_and_map.py [-h] -t {CGI,GB,PROM} -h1 HAPLOTYPE_1 -h2 HAPLOTYPE_2 -p REGION_PREFIX -o OUTPUT_PREFIX [-m MISSING_INFO_RATE]

Convert methylation BED files to methylation map and data matrix CSVs for downstream analysis (e.g., QTLs).

optional arguments:
  -h, --help            show this help message and exit
  -t {CGI,GB,PROM}, --region_type {CGI,GB,PROM}
                        Specifying region type for parsing and joining. Currently supported are CpG islands (CGI), gene bodies (GB), and promoters (PROM).
  -h1 HAPLOTYPE_1, --haplotype_1 HAPLOTYPE_1
                        Methylation BED file for the first haplotype (H1).
  -h2 HAPLOTYPE_2, --haplotype_2 HAPLOTYPE_2
                        Methylation BED file for the second haplotype (H2).
  -p REGION_PREFIX, --region_prefix REGION_PREFIX
                        Prefix for the output methylation region names (e.g., CGIs, gene bodies, promoters).
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output methylation data and genetic map files.
  -m MISSING_INFO_RATE, --missing_info_rate MISSING_INFO_RATE
                        Filter out regions lacking methylation information for higher than this proportion of samples (default 0.05 or 5%).

We have also recently added a cohort/population-wide aggregated methylation table processing script to the repository that filters for samples of interest, chromosomes of interest, and missing methylation rate per region, in addition to imputation of missing values with the mean per region (which we think is a fair assumption for missing methylation data) and formatting for tensorQTL.

usage: filter_input_methylation_table.py [-h] --input_methylation_table INPUT_METHYLATION_TABLE [--output_methylation_table OUTPUT_METHYLATION_TABLE] [--reference_name REFERENCE_NAME]
                                         [--included_chromosomes INCLUDED_CHROMOSOMES] [--excluded_chromosomes EXCLUDED_CHROMOSOMES] [--included_samples INCLUDED_SAMPLES] [--excluded_samples EXCLUDED_SAMPLES]
                                         [--na_definition NA_DEFINITION] [--print_input_samples_and_chromosomes_only | --no-print_input_samples_and_chromosomes_only]
                                         [--print_output_samples_and_chromosomes | --no-print_output_samples_and_chromosomes] [--impute_with_region_means | --no-impute_with_region_means]
                                         [--missing_methylation_rate_filter MISSING_METHYLATION_RATE_FILTER] [--clean_sample_names | --no-clean_sample_names] [--tensorQTL_format | --no-tensorQTL_format]
                                         [--remove_duplicate_regions_for_tensorQTL | --no-remove_duplicate_regions_for_tensorQTL] [--remove_duplicate_regions | --no-remove_duplicate_regions]

Filter input methylation tables (extended average methylation per region over regions and samples) by samples and chromosomes, amongst other features.

optional arguments:
  -h, --help            show this help message and exit
  --input_methylation_table INPUT_METHYLATION_TABLE
                        Path to input methylation table (methylation per sample per aggregated region).
  --output_methylation_table OUTPUT_METHYLATION_TABLE
                        Path to output methylation table after filtering (methylation per sample per aggregated region).
  --reference_name REFERENCE_NAME
                        Reference name used to separate sample from non-sample columns in table (default GRCh38).
  --included_chromosomes INCLUDED_CHROMOSOMES
                        Chromosomes to include in output (list in text file, one chromosome per line).
  --excluded_chromosomes EXCLUDED_CHROMOSOMES
                        Chromosomes to exclude in output (list in text file, one chromosome per line).
  --included_samples INCLUDED_SAMPLES
                        Samples to include in output (list in text file, one chromosome per line).
  --excluded_samples EXCLUDED_SAMPLES
                        Samples to exclude in output (list in text file, one chromosome per line).
  --na_definition NA_DEFINITION
                        String definition of NA if not nan/NA/NAN/NaN (default is period).
  --print_input_samples_and_chromosomes_only, --no-print_input_samples_and_chromosomes_only
                        Print samples and chromosomes in input table to files (input_methylation_table.samples, .chromosomes) only and exit. (default: False)
  --print_output_samples_and_chromosomes, --no-print_output_samples_and_chromosomes
                        Print samples and chromosomes in output table to files (output_methylation_table.samples, .chromosomes) as well. (default: False)
  --impute_with_region_means, --no-impute_with_region_means
                        Impute N/A methylation values as mean per region (default false). (default: False)
  --missing_methylation_rate_filter MISSING_METHYLATION_RATE_FILTER
                        Remove region if more than this proportion of samples have no methylation calls; default 0.05.
  --clean_sample_names, --no-clean_sample_names
                        Output table with sample name only for sample columns (no _[reference]_modFraction). (default: False)
  --tensorQTL_format, --no-tensorQTL_format
                        Output table in tensorQTL phenotype BED format (chromosome/start/end/phenotype ID/samples). (default: False)
  --remove_duplicate_regions_for_tensorQTL, --no-remove_duplicate_regions_for_tensorQTL
                        Only keep first non-null entry for duplicate regions from input table for tensorQTL output. (default: False)
  --remove_duplicate_regions, --no-remove_duplicate_regions
                        Only keep first non-null entry for duplicate regions from input table for regular output. (default: False)

Genetic data

Genetic data and maps are generated together in two steps. First, input SV or SNV VCF variant files from Napu are normalized to biallelic variants, subset by chromosome, and then converted to a CSV file with the necessary fields using bcftools query (vcf_preprocess_for_genetic_data_map.sh). Next, preprocessed CSV files are converted to per sample, per haplotype genetic data matrix and per variant genetic map files using make_genetic_data_and_map.py. Sample preprocessed data is included in the repository as example_sv_preprocess.csv and example_snv_preprocess.csv. Based on preliminary testing and available NIH HPC resources, we recommend parallelizing genetic map and data generation by chromosome. Tests on the HBCC cohort SNV VCF indicated chromosome 1 genetic map/data processing took 15 minutes on a 32GB RAM/32 CPU allocation. Example output genetic data are also provided as example_sv_data.csv and example_snv_data.csv. We have also included a helper script to perform initial sample inclusion or exclusion, variant filtering, and LD pruning before preprocessing (variant_initial_cleanup.sh).

Script for initial variant cleanup pre-allele specific QTL, using plink and bcftools. Splits multiallelic variants by default.
Usage: variant_initial_cleanup.sh -i input_prefix -s sample_list -a sample_list_action -m maf_cutoff -g missing_genotype_rate -h hwe_pvalue -p indep_pairwise_ld_pruning_values -t threads
	-i Input VCF file prefix
	-s Path to list of samples to include or exclude
	-a Action to take on sample list (either include or exclude)
	-m Minor allele frequency (MAF) cutoff (default 0.05)
	-g Missing genotype rate cutoff (default 0.05)
	-h Hardy-Weinberg equilibrium p-value cutoff (default 0.001)
	-p Independent pairwise LD pruning settings (default "1000 50 0.3"). Numbers indicate window, step size, and r2 value for indep-pairwise pruning.
	-t Threads to use for bcftools decompression (default 0; plink uses all available CPUs.)
Usage: vcf_preprocess_for_genetic_data_map.sh -v variant_type -c chromosome -i input.vcf(.gz) -o output.csv
	-v Variant type - structural variants (SV) or single nucleotide variants (SNV)
	-c Chromosome to subset (optional)
	-i Input VCF file
	-o Output CSV file for genetic data/map generation with helper Python script

Below is an example of preprocessed output from the above script:

ID,CHROM,POS,END,REF,ALT,HBCC_81935_FTX,HBCC_81931_FTX,HBCC_81929_FTX,HBCC_81934_FTX
chr1_10611_C_G,chr1,10611,10611,C,G,0/1,./.,./.,0/1
chr1_10622_T_G,chr1,10622,10622,T,G,./.,1/0,1/0,./.
chr1_10623_T_C,chr1,10623,10623,T,C,./.,1/1,1/1,0/1
chr1_10627_A_AG,chr1,10627,10627,A,AG,0/0,1/0,1/0,./.
chr1_10815_T_TC,chr1,10815,10815,T,TC,./.,./.,./.,1/0
chr1_10927_A_G,chr1,10927,10927,A,G,./.,1/0,1/0,./.
chr1_10936_G_C,chr1,10936,10936,G,C,./.,./.,./.,./.
chr1_10968_A_AG,chr1,10968,10968,A,AG,./.,./.,1/0,0/0
chr1_11002_A_C,chr1,11002,11002,A,C,./.,1/0,1/0,./.
usage: make_genetic_data_and_map.py [-h] -s {SV,SNV} -i INPUT_FILE -o OUTPUT_PREFIX [-c CHROMOSOME]

Convert preprocessed VCF input in CSV data table (from vcf_preprocess_for_genetic_data_map.sh) to genetic map and data matrix CSVs for downstream analysis (e.g., QTLs).

optional arguments:
  -h, --help            show this help message and exit
  -s {SV,SNV}, --file_type {SV,SNV}
                        Specify the file type: SVs for Structural Variants or SNVs for Single Nucleotide Variants.
  -i INPUT_FILE, --input_file INPUT_FILE
                        Path to the input preprocessed CSV file for parsing (from vcf_preprocess_for_genetic_data_map.sh).
  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the output genetic data and genetic map files.
  -c CHROMOSOME, --chromosome CHROMOSOME
                        Chromosome(s) to include (optional).

Input variant files can have variants relabeled accordingly with the following bcftools annotate commands:

# SNVs
bcftools annotate -I 'SNV\_%CHROM\_%POS\_%REF\_%ALT' input.vcf
# SVs
bcftools annotate -I 'SV\_%CHROM\_%SVTYPE\_%POS\_%END\_ABS(%SVLEN)' input.vcf

Genetic maps

Genetic maps are made together with genetic data above. Sample output genetic maps corresponding to above genetic data are included in the repository as example_sv_genetic_map.csv and example_snv_genetic_map.csv.

About

Long read data standardization SOP for downstream analyses like QTLs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors