Skip to content

Tutorials 2: Running FORGE

inti edited this page Jan 16, 2011 · 11 revisions

[[Home|Home]]

The first step is to make sure you have all files needed. The basic files are the FORGE Perl script, the snp-to-gene annotation and the SNP association p-values (see flow chart on the right). The first two are available at the Downloads page and you should provide the SNP association file. To get you started we distribute some example files, which we will use on this tutorial.

Download the FORGE folder, uncompress it and go into the folder. On the command line:

$ unzip FORGE.zip
$ cd FORGE

Start by check the files on the folder

 $ ls -1
CovMatrix.pm
GWAS_IO.pm
GWAS_STATS.pm
README.txt
affy_to_rsID.tab
example
forge.pl
utilities

An explanation of the files is on the doc.txt, the README.txt will have some instructions for running as well as interpreting the output. To get a list of available options type

$ perl forge.pl -h
Usage:
   script [options]
           -h, --help              print help message
           -m, --man               print complete documentation
           -report                 how often to report advance
           -verbose, -v            useful for debugging
           Input Files:
           -ped                    Genotype file in PLINK PED format
           -map                    SNP info file in PLINK MAP format
           -bfile                  Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
           -assoc, -a              SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
           -snpmap, -m             Snp-to-gene mapping file
           -affy_to_rsid           Affy id to rs id mapping file
           -gmt                    Gene-set definition file
           Output Files:
           -out, -o                Name of the output file
           -print_cor              print out SNP-SNP correlations
           Analsis modifiers:
           -gene_list, -g          Only analyse genes on the file provided
           -genes                  Provide some gene name in command line, only these genes will be analyzed
           -all_genes              Analyze all genes in the snp-to-gene mapping file
           -chr                    Anlyze a specific chromosome
           -gene_type, -type       Only analyses genes of this type, e.g. protein_coding
           -exclude_gene_type, -exclude_type       Exclude genes of this type from the analysis, e.g. pseudogenes
           -distance, -d           Max SNP-to-gene distance allowed (in kb)
           -correlation, -cor      SNP-SNP correlation file
           -pearson_genotypes      Calculate SNP-SNP correlation with a pearson correlation for categorical variables
           -use_ld                         Use Linkage Disequilibrium as measure of SNP-SNP correlation
           -lambda                 lambda value to correct SNP statistics, if genomic control is used
           -gc_correction          Determine the lamda for the genomic control correction from the input p-values
           -gmt_min_size           Min number of genes in gene-sets to be analysed. default = 2
           -gmt_max_size           Max number of genes in gene-sets to be analysed. default = 999999999
           Weigthing
           -weights, -w            File with SNP weights
           -w_header               Indicate if the SNP weight file has a header.
           -weight_by_maf, -w_maf  Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.

Obtain a more detailed description by typing

 $ perl forge.pl -man
    Usage:
    script [options]
            -h, --help              print help message
            -m, --man               print complete documentation
            -report                 how often to report advance
            -verbose, -v            useful for debugging
            Input Files:
            -ped                    Genotype file in PLINK PED format
            -map                    SNP info file in PLINK MAP format
            -bfile                  Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
            -assoc, -a              SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
            -snpmap, -m             Snp-to-gene mapping file
            -affy_to_rsid           Affy id to rs id mapping file
            -gmt                    Gene-set definition file
            Output Files:
            -out, -o                Name of the output file
            -print_cor              print out SNP-SNP correlations
            Analsis modifiers:
            -gene_list, -g          Only analyse genes on the file provided
            -genes                  Provide some gene name in command line, only these genes will be analyzed
            -all_genes              Analyze all genes in the snp-to-gene mapping file
            -chr                    Anlyze a specific chromosome
            -gene_type, -type       Only analyses genes of this type, e.g. protein_coding
            -exclude_gene_type, -exclude_type       Exclude genes of this type from the analysis, e.g. pseudogenes
            -distance, -d           Max SNP-to-gene distance allowed (in kb)
            -correlation, -cor      SNP-SNP correlation file
            -pearson_genotypes      Calculate SNP-SNP correlation with a pearson correlation for categorical variables
            -use_ld                         Use Linkage Disequilibrium as measure of SNP-SNP correlation
            -lambda                 lambda value to correct SNP statistics, if genomic control is used
            -gc_correction          Determine the lamda for the genomic control correction from the input p-values
            -gmt_min_size           Min number of genes in gene-sets to be analysed. default = 2
            -gmt_max_size           Max number of genes in gene-sets to be analysed. default = 999999999
            Weigthing
            -weights, -w            File with SNP weights
            -w_header               Indicate if the SNP weight file has a header.
            -weight_by_maf, -w_maf  Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.
Options:
    -help   Print help message
    -man    print complete documentation
    -report How often to report advance. Provide an integer X and the
            program will report adnvance after X genes are analyzed.
    -verbose, -v
            useful for debugging
    -ped    Genotype file in PLINK PED format. See
            http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for
            details on data format
    -map    SNP info file in PLINK MAP format
    -bfile  Files in PLINK binary format, with extension *.bed, *.bim and
            *.fam.
    -assoc, -a
            SNP association file, columns header is necessary and at leat
            columns with SNP and P names must be present.
    -snpmap, -m
            SNP-to-gene mapping file. Format is tab separated with
            fields:chromosome,start,end,Ensembl id, hugo id, description,
            SNP1, SNP2, ..., SNPN. Each SNP has 4 fields separated by
            colons: id,position,alleles and strand. This may look
            overcomplicated but allows to map any kind of variation and
            store its information without having to use additional files. An
            example look like:
            1 67632083 67725662 ENSG00000162594 IL23R KNOWN protein_coding
            Interleukin-23 receptor Precursor (IL-23R)
            [Source:UniProtKB/Swiss-Prot;Acc:Q5VWK5]
            rs7538938:67132262:T/C:1 rs72669476:67132582:C/T:1
            rs72019237:67132863:C/-:1 rs61197134:67132871:C/-:1
            rs11208941:67133111:T/C:1
    -affy_to_rsid
            Affy id to rs id mapping file. Tab separated file, looks like:
            SNP_A-8389091 rs7593668
    -gmt    Gene-set definition file. Format is (tab separated) NAME
            Description GENE1 GENE2 GENEN
    -out, -o
            Name of the output file. Output file will be tab separated with
            the following columns: Ensembl_ID,Hugo
            id,gene_type,chromosome,start,end,min p-value in the gene, Sidak
            corrected minimum p-value, FORGE p-value, FORGE chi-square,
            FORGE degrees of freedom, Eigen value ratio method gene p-value,
            EVR chi-square, EVR degrees freedom, number of snps in gene,
            number effective tests(Gao et al;PDMID:19434714) An example
            would look like:ENSG00000162594 IL23R protein_coding 1 67632083
            67725662 3.491e-02 4.132e-01 2.128e-01 8.42305 6.04941 1.119e-01
            28.17326 10.116917 15
    -print_cor
            print out SNP-SNP correlations
    -gene_list, -g
            Only analyse genes on the file provided. Ids must match either
            Ensembl Ids or Hugo ids present in the SNP-to-gene mapping file
    -genes  Provide some gene name in command line, only these genes will be
            analyzed. Use like -genes IL23R
    -all_genes
            Analyze all genes in the snp-to-gene mapping file
    -chr    Anlyze a specific chromosome, Chromosome code must match that of
            the SNP-to-gene mapping file
    -gene_type, -type
            Only analyses genes of this type, e.g. protein_coding
    -exclude_gene_type, -exclude_type
            Exclude genes of this type from the analysis, e.g. pseudogenes
    -distance, -d
            Max SNP-to-gene distance allowed (in kb)
    -correlation, -cor
            SNP-SNP correlation file. Space separated file with 3 columns,
            first 2 the SNP ids the the 3th the correlation between them.
            Like:
            rs6660226 rs11209018 0.089
    -pearson_genotypes
            Calculate SNP-SNP correlation with a pearson correlation for
            categorical variables as described on S. Wellek, A. Ziegler, Hum
            Hered 67, 128 (2009).
    -use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation.
            Default is Pearson's correlation.
    -lambda lambda value to correct SNP statistics, if genomic control is
            used
    -gc_correction
            Correct the SNP pvalues by the genomic control method. It will
            calculate the lambda from the data itself. PLEASE MAKE SURE YOU
            DO NOT FILTER THE SNPS BEFORE RUNNING FORGE OR THE CORRECTION
            MAY BE ERRONEOUS.
    -gmt_min_size
            Min number of genes in gene-sets to be analysed. default = 2
    -gmt_max_size
            Max number of genes in gene-sets to be analysed. default =
            999999999
    -weights, -w
            File with SNP weights. Multiple files with weights can be used
            by providing the -weights for each file. the file with weigths
            can be tab, space or comma separated. the first columns must be
            the snp id and the rest the weights. It is possible to have a
            header in the file, in which case you must use the option
            -w_header as well. SNP weights are use in the gene analysis and
            to calculate the sample scores. How to use different sources of
            information to make a simple weight for the SNP is a not trivial
            issue. To simplify things I: a) for each gene's SNPs I make the
            weight of each category sum 1, b) then I sum these re-sacel
            weights for each SNP and c) finally make sure the weight within
            the gene sum 1. Please note that SNPs without weights will
            internally get the weights set to zero.
    -w_header
            Indicate if the SNP weight file has a header.
    -weight_by_maf, -w_maf
            Weight each SNP its the minor allele frequency (MAF). The actual
            weight is 1/MAF.
  Description:
    TODO
   Examples:
    1. Basic Analysis
        To perform a basic gene-based testing with the example files run:
        >perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test
    2. Adding gene-sets to the analysis
        To perform a basic gene-based testing with the example files run:
        >perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test -gmt example/example.gmt

OK, so now that you are familiar with the options lets run it! To run FORGE with the example files do

 $ perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out my_output.txt
Sun Jan 16 11:22:28 2011	Check http://github.com/inti/FORGE/wiki for updates
Sun Jan 16 11:22:28 2011	LOG file will be written to [ my_output.txt.log ]
Sun Jan 16 11:22:28 2011	Max SNP-to-gene distance allowed [ 20 ] kb
Sun Jan 16 11:22:28 2011	Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise
Sun Jan 16 11:22:28 2011	Going to analyze all genes on [ example/example.snpmap ] file.
Sun Jan 16 11:22:28 2011	Reading association file: [ example/example.assoc ]
Sun Jan 16 11:22:28 2011	[ 17 ] SNPs with association data
Sun Jan 16 11:22:28 2011	[ 20 ] SNPs and [ 4806 ] samples in genotype file
Sun Jan 16 11:22:28 2011	Loading SNP-2-Gene mapping
Sun Jan 16 11:22:28 2011	   '-> Reading [ example/example.snpmap ]
Sun Jan 16 11:22:28 2011	  '->[ 1 ] Genes read from SNP-2-Gene Mapping files
Sun Jan 16 11:22:28 2011	  '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed
Sun Jan 16 11:22:28 2011	Will analyse [ 1 ] genes
Sun Jan 16 11:22:28 2011	Output file will be written to [ my_output.txt ]
Sun Jan 16 11:22:28 2011	Starting to Calculate gene p-values
Sun Jan 16 11:22:28 2011	Reading genotypes from [ example/example.bed ]
Sun Jan 16 11:22:28 2011	Binary file is on SNP-major format
Sun Jan 16 11:22:28 2011	Well Done!!

We did run FORGE with a binary file but it is possible to do it with PED and MAP files, for example:

 $ perl forge.pl -ped example/example.ped -map example/example.map --a example/example.assoc -m example/example.snpmap -out my_output.txt
Sun Jan 16 11:23:45 2011	Check http://github.com/inti/FORGE/wiki for updates
Sun Jan 16 11:23:45 2011	LOG file will be written to [ my_output.txt.log ]
Sun Jan 16 11:23:45 2011	Max SNP-to-gene distance allowed [ 20 ] kb
Sun Jan 16 11:23:45 2011	Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise
Sun Jan 16 11:23:45 2011	Going to analyze all genes on [ example/example.snpmap ] file.
Sun Jan 16 11:23:45 2011	Reading association file: [ example/example.assoc ]
Sun Jan 16 11:23:45 2011	[ 17 ] SNPs with association data
Sun Jan 16 11:23:46 2011	[ 20 ] SNPs and [ 4806 ] samples in genotype file
Sun Jan 16 11:23:46 2011	Loading SNP-2-Gene mapping
Sun Jan 16 11:23:46 2011	   '-> Reading [ example/example.snpmap ]
Sun Jan 16 11:23:46 2011	  '->[ 1 ] Genes read from SNP-2-Gene Mapping files
Sun Jan 16 11:23:46 2011	  '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed
Sun Jan 16 11:23:46 2011	Will analyse [ 1 ] genes
Sun Jan 16 11:23:46 2011	Output file will be written to [ my_output.txt ]
Sun Jan 16 11:23:46 2011	Starting to Calculate gene p-values
Sun Jan 16 11:23:46 2011	Well Done!!

OK so lets look at the output file, my_output.txt

 $ cat my_output.txt
Ensembl_ID	Hugo_id	gene_type	chromosome	start	end	min_p	min_p_SIDAK	FISHER	FISHER_chi-square	FISHER_df	B_fix	Var_fix	B_P_fix	B_random	Var_random	B_P_random	I-squared	Q	Q_p-value	tau_squared	n_effect_Galwey	n_effect_Gao	n_snps
ENSG00000162594	IL23R	protein_coding	1	67632083	67725662	0.03077	0.354381637678306	0.276952014834721	7.35420285105523	5.87531460523843	-0.202198269842706	0.192931236614231	0.677362932153149	-0.202198269842706	0.192931236614231	0.677362932153149	54.6674257749351	30.8828700759271	0.00575658931285827	1.20591929113765	9.73459071596588	14	15

In the manual there is description of the columns. The columns 9th is the FISHER p-value, 8th The Sidak correction on the minimum p-value per gene, columns 14th is the Fix-effect z-score and the 14th is the Random-effect z-score.

Note
Some times the p-values can be be equal to 0 due to lack of precision because Perl works with single precision at least you recompile it with double. A software like R will allow you to get around it and we provide the chi-square a degrees of freedom and z-score and its variance of the tests for that.

If you want, run a whole genome analysis now and while it runs check out the tutorials on System Biology Oriented approaches: network and pathway analysis.

[[Home|Home]]

Clone this wiki locally