-
Notifications
You must be signed in to change notification settings - Fork 5
Tutorials 4: Gene Set Analyses
[[Home|Home]] | [[Previous|Tutorials-3:-Correction-for-multiple-testing]] | [[Next|Tutorials-5:-Network-analyses]]
Gene-Set Analyses are a group of bioinformatic analyses aiming to interpret results from genome-wide or large scale experiments using predefined groups of genes. For example, in a gene-expression study after assessing the differential expression of each gene (each probe) you may want to explore if the differentially expressed genes cluster in a known biological pathway or genes know to be regulated by a transcription factor. There is extensive literature in gene-set analyses, also known as pathway analyses, and several software are freely available to run these analysis. However, few of them are design to take into account the complexities of genome-wide association studies as the Linkage Disequilibrium between genetic markers and thus genes’ signal. We decided to partially fill this gap by implementing a popular GSA algorithm and developing in the same software the necessary corrections to perform GSA on genome-wide association studies results. We have implemented the PAGE: parametric analysis of gene set enrichment introduced by Kim and Volsky (2005).
You can find our Perl script in the Download section in the [[Home Page|Home]] but before going into detail lets cover the files you will need for the analysis.
You will need as well a set gene-set definitions in GMT format. The GMT format is as follows: one pathway per line in a tab separated format with the two first fields being the pathway name and description followed by the gene names. Spaces are allowed inside the pathway name and description. For example, this is a gene-set from the BioCarta
no1Pathway Actions of Nitric Oxide in the Heart CHRM1 CHRNA1 CYCSP35 AKT1 FLT1 FLT4 HSP90AA1 KDR KNG1 NOS3 PDE2A PDE3A PDE3B PRKACB PRKACG PRKAR1A PRKAR1B PRKAR2A PRKAR2B PRKG1 PRKG2 ACTA1 BDKRB2 RYR2 SLC7A1 TNNI1 VEGFA CALM1 CALM2 CALM3 CAV1
In addition to your pathway definition you will need a set of statistics or scores per gene. This must be in tab separated file with two columns, like
EFNB1 9.129e-05 HYDIN 9.760e-05 PRUNE 1.037e-04 LAYN 2.275e-04 BNIPL 2.330e-04 SIK2 2.540e-04 OR1A1 2.606e-04 C11orf88 3.026e-04 TRIM9 3.358e-04 OR1A2 4.218e-04
This are the two basic input files you will need. To run the the analysis do $ perl gsa.pl -help or $ perl gsa.pl -man
which will spit something like
NAME Perl implementation of the PAGE: parametric analysis of gene set enrichment. As bonus includes correction for gene clusters. SYNOPSIS script [options] -h, --help print help message -m, --man print complete documentation -report how often to report advance -gmt gene-set definitions on GMT format -file gene p-value file -perm number of permutations -out, -o output file -max_size max gene-set size -min_size min gene-set size -recomb_intervals correct for genes that are in the same recombination interval -z_score inpute values are z_scores -append Append results to output file rather than overwrite it -add_file_name add the input file name to the result -ref_list set reference list for the analysis -set_stat statistics to calculate over the sub-networks OPTIONS -help Print help message -man print complete documentation -report how often to report advance. Provide an integer X and the program will report adnvance after X networks are analyzed. -gmt ggene-set definitions on GMT format -file gene p-value file. Tab separated file with at least two columns: gene_id and p-value. please make sure that not p-values equal to 0 are included, those genes will be excluded from the analysis. -perm number of permutations -out output file: ithe output file looks like GO0007156 9.032e-06 4.288 4.474e-06 4.441 GO0016339 1.656e-04 3.590 2.306e-04 3.502 columns are: 1) gene-set id 2) PAGE asymtotic p-value, z_score, 3) p-value calculated with the null distribution calculated via sampling 4) z score calculated with a null distribution calculated via sampling. the last 2 columsn will only be printed if permutatins are run. -max_size max gene-set size -min_size min gene-set size -recomb_intervals correct for genes that are in the same recombination interval. On GWAS analysis one would often derive one p-value per genes and these will be correlated if the genetic variants on different genes are in Linkage Desequilibrium, as for example when genes lay on the same recombination interval (pice of genome between two recombination hot-spots). The proble is that the statistics calculated across the sub-network asssume that the gene p-values are independent. With this option is possible to provide a set of genomic interval that group genes, e.g. recombination interval. This information will be use during the MC sampling. For example, if we have a network with 10 genes, 5 of which lay on the same recombination interval. With the -recomb_intervals option set on the montecarlo sampling 5 of the genes will be obtain from a single recombination interval (elsewhere in the genome), thus assuring the genetic structure in the subnetwotk is somehow preserved. -z_score Input values are z_score intead of p-values, for calculations the absolute value of the z-score will be taken. Permutations must be performed together with this option. -append Append results to output file rather than overwrite it -add_file_name Add the input file name to the result. Usefull is analysing different data sets that wil be concatenated in on single file -ref_list set reference list for the analysis -set_stat statistics to calculate over the sub-networks. options are stouffer_z_score and mean. mean if the default DESCRIPTION This program will read a performed the PAGE gene-set analysis. It reads a at least two files: a file with gene id's and p-values and a second file with the gene-set definitions. Please check the original paper Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 2005, 6:144. for details of the method.
To run the software do
perl gsa.pl -gmt example/example.gmt -file example/exampl.gene_pvals -o example/example.out Wed Oct 6 17:18:21 2010Running version [ 0.1 ] of GSA perl script Wed Oct 6 17:18:21 2010Will analyses Gene-set between [ 10 ] and [ 99999999 ] in size Wed Oct 6 17:18:21 2010Gene-set statistics set to [ mean ] Wed Oct 6 17:18:21 2010Reading gene stats [ example/example_gene_pvalues.txt ] Wed Oct 6 17:18:24 2010 '-> [ 36850 ] genes will be analysed Wed Oct 6 17:18:24 2010Reading gene-set definitions from [ example/example.gmt ] Wed Oct 6 17:18:24 2010 '-> [ 1 ] gene-sets will be analysed Wed Oct 6 17:18:24 2010 '-> [ 36850 ] genes will be used to calculate the patameters of null Wed Oct 6 17:18:24 2010Writing output to [ example/example.out ] Wed Oct 6 17:18:24 2010 '->Done with [ 0 ] Gene-Sets Wed Oct 6 17:18:24 2010 Well Done
This will print out the file “example/example.out” which contains
$ cat example/example.out nameraw_praw_zNdesc no1Pathway7.268e-01-0.60326Actions of Nitric Oxide in the Heart
The output present several columns: gene-set name, raw p-value, raw z-score, Number of genes in gene set and description of the gene-set. If permutations are performed the output will look like
$ name raw_p raw_z empi_p:mean empi_z:mean mean_set mean_null sd_null N desc no1Pathway 7.268e-01 -0.603 8.367e-01 -0.981 -0.083 0.027 0.11226 Actions of Nitric Oxide in the Heart
The additional columns have the information for the empirical p-value and empirical z-score.
[[Home|Home]] | [[Previous|Tutorials-3:-Correction-for-multiple-testing]] | [[Next|Tutorials-5:-Network-analyses]]