Skip to content

Tutorials 4: Gene Set Analyses

Inti Pedroso edited this page Jan 27, 2016 · 3 revisions

[[Home|Home]] | [[Previous|Tutorials-3:-Correction-for-multiple-testing]] | [[Next|Tutorials-5:-Network-analyses]]

Gene-Set Analyses are a group of bioinformatic analyses aiming to interpret results from genome-wide or large scale experiments using predefined groups of genes. For example, in a gene-expression study after assessing the differential expression of each gene (each probe) you may want to explore if the differentially expressed genes cluster in a known biological pathway or genes know to be regulated by a transcription factor. There is extensive literature in gene-set analyses, also known as pathway analyses, and several software are freely available to run these analysis. However, few of them are design to take into account the complexities of genome-wide association studies as the Linkage Disequilibrium between genetic markers and thus genes’ signal. We decided to partially fill this gap by implementing a popular GSA algorithm and developing in the same software the necessary corrections to perform GSA on genome-wide association studies results. We have implemented the PAGE: parametric analysis of gene set enrichment introduced by Kim and Volsky (2005).

You can find our Perl script in the Download section in the [[Home Page|Home]] but before going into detail lets cover the files you will need for the analysis.

You will need as well a set gene-set definitions in GMT format. The GMT format is as follows: one pathway per line in a tab separated format with the two first fields being the pathway name and description followed by the gene names. Spaces are allowed inside the pathway name and description. For example, this is a gene-set from the BioCarta

no1Pathway    Actions of Nitric Oxide in the Heart    CHRM1    CHRNA1      CYCSP35    AKT1    FLT1    FLT4    HSP90AA1    KDR    KNG1    NOS3    PDE2A     PDE3A    PDE3B    PRKACB    PRKACG    PRKAR1A    PRKAR1B    PRKAR2A     PRKAR2B    PRKG1    PRKG2    ACTA1    BDKRB2    RYR2    SLC7A1    TNNI1    VEGFA    CALM1    CALM2    CALM3    CAV1

In addition to your pathway definition you will need a set of statistics or scores per gene. This must be in tab separated file with two columns, like

EFNB1	9.129e-05
HYDIN	9.760e-05
PRUNE	1.037e-04
LAYN	2.275e-04
BNIPL	2.330e-04
SIK2	2.540e-04
OR1A1	2.606e-04
C11orf88	3.026e-04
TRIM9	3.358e-04
OR1A2	4.218e-04

This are the two basic input files you will need. To run the the analysis do $ perl gsa.pl -help or $ perl gsa.pl -man

which will spit something like

 NAME
     Perl implementation of the PAGE: parametric analysis of gene set enrichment. As bonus includes correction for gene clusters.
 SYNOPSIS
    script [options]
            -h, --help              print help message
            -m, --man               print complete documentation
            -report                 how often to report advance
            -gmt                    gene-set definitions on GMT format
            -file                   gene p-value file
            -perm                   number of permutations
            -out, -o                output file
            -max_size               max gene-set size
            -min_size               min gene-set size
            -recomb_intervals       correct for genes that are in the same recombination interval
            -z_score                inpute values are z_scores
            -append                 Append results to output file rather than overwrite it
            -add_file_name          add the input file name to the result
            -ref_list               set reference list for the analysis
            -set_stat               statistics to calculate over the sub-networks  
 OPTIONS
    -help   Print help message
    -man    print complete documentation
    -report how often to report advance. Provide an integer X and the
            program will report adnvance after X networks are analyzed.
    -gmt    ggene-set definitions on GMT format
    -file   gene p-value file. Tab separated file with at least two columns:
            gene_id and p-value. please make sure that not p-values equal to
            0 are included, those genes will be excluded from the analysis.
    -perm   number of permutations
    -out    output file: ithe output file looks like GO0007156 9.032e-06
            4.288 4.474e-06 4.441 GO0016339 1.656e-04 3.590 2.306e-04 3.502
            columns are: 1) gene-set id 2) PAGE asymtotic p-value, z_score,
            3) p-value calculated with the null distribution calculated via
            sampling 4) z score calculated with a null distribution
            calculated via sampling.
            the last 2 columsn will only be printed if permutatins are run.
    -max_size
            max gene-set size
    -min_size
            min gene-set size
    -recomb_intervals
            correct for genes that are in the same recombination interval.
            On GWAS analysis one would often derive one p-value per genes
            and these will be correlated if the genetic variants on
            different genes are in Linkage Desequilibrium, as for example
            when genes lay on the same recombination interval (pice of
            genome between two recombination hot-spots). The proble is that
            the statistics calculated across the sub-network asssume that
            the gene p-values are independent. With this option is possible
            to provide a set of genomic interval that group genes, e.g.
            recombination interval. This information will be use during the
            MC sampling. For example, if we have a network with 10 genes, 5
            of which lay on the same recombination interval. With the
            -recomb_intervals option set on the montecarlo sampling 5 of the
            genes will be obtain from a single recombination interval
            (elsewhere in the genome), thus assuring the genetic structure
            in the subnetwotk is somehow preserved.
    -z_score
            Input values are z_score intead of p-values, for calculations
            the absolute value of the z-score will be taken. Permutations
            must be performed together with this option.
    -append Append results to output file rather than overwrite it
    -add_file_name
            Add the input file name to the result. Usefull is analysing
            different data sets that wil be concatenated in on single file
    -ref_list
            set reference list for the analysis
    -set_stat
            statistics to calculate over the sub-networks. options are
            stouffer_z_score and mean. mean if the default
DESCRIPTION
    This program will read a performed the PAGE gene-set analysis. It reads
    a at least two files: a file with gene id's and p-values and a second
    file with the gene-set definitions. Please check the original paper Kim
    SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC
    Bioinformatics 2005, 6:144. for details of the method.

To run the software do

perl gsa.pl -gmt example/example.gmt -file example/exampl.gene_pvals -o example/example.out
Wed Oct  6 17:18:21 2010Running version  [ 0.1 ] of GSA perl script
Wed Oct  6 17:18:21 2010Will analyses Gene-set between [ 10 ] and [ 99999999 ] in size
Wed Oct  6 17:18:21 2010Gene-set statistics set to [ mean ]
Wed Oct  6 17:18:21 2010Reading gene stats [ example/example_gene_pvalues.txt ]
Wed Oct  6 17:18:24 2010   '-> [ 36850 ] genes will be analysed
Wed Oct  6 17:18:24 2010Reading gene-set definitions from [ example/example.gmt ]
Wed Oct  6 17:18:24 2010   '-> [ 1 ] gene-sets will be analysed
Wed Oct  6 17:18:24 2010   '-> [ 36850 ] genes will be used to calculate the patameters of null
Wed Oct  6 17:18:24 2010Writing output to [ example/example.out ]
Wed Oct  6 17:18:24 2010     '->Done with [ 0 ] Gene-Sets
Wed Oct  6 17:18:24 2010    Well Done

This will print out the file “example/example.out” which contains

$ cat example/example.out
nameraw_praw_zNdesc
no1Pathway7.268e-01-0.60326Actions of Nitric Oxide in the Heart

The output present several columns: gene-set name, raw p-value, raw z-score, Number of genes in gene set and description of the gene-set. If permutations are performed the output will look like

$ name    raw_p    raw_z    empi_p:mean    empi_z:mean    mean_set    mean_null    sd_null    N    desc
no1Pathway    7.268e-01    -0.603    8.367e-01    -0.981    -0.083    0.027    0.11226    Actions of Nitric Oxide in the Heart

The additional columns have the information for the empirical p-value and empirical z-score.

[[Home|Home]] | [[Previous|Tutorials-3:-Correction-for-multiple-testing]] | [[Next|Tutorials-5:-Network-analyses]]

Clone this wiki locally