Tutorials 2: Running FORGE

[[Home|Home]]

The first step is to make sure you have all files needed. The basic files are the FORGE Perl script, the snp-to-gene annotation and the SNP association p-values (see flow chart on the right). The first two are available at the Downloads page and you should provide the SNP association file. To get you started we distribute some example files, which we will use on this tutorial.

Download the FORGE folder, uncompress it and go into the folder. On the command line:

$ unzip FORGE.zip
$ cd FORGE

Start by check the files on the folder

 $ ls -1
CovMatrix.pm
GWAS_IO.pm
GWAS_STATS.pm
README.txt
affy_to_rsID.tab
example
forge.pl
utilities

An explanation of the files is on the doc.txt, the README.txt will have some instructions for running as well as interpreting the output. To get a list of available options type

$ perl forge.pl -h
Usage:
   script [options]
           -h, --help              print help message
           -m, --man               print complete documentation
           -report                 how often to report advance
           -verbose, -v            useful for debugging
           Input Files:
           -ped                    Genotype file in PLINK PED format
           -map                    SNP info file in PLINK MAP format
           -bfile                  Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
           -assoc, -a              SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
           -snpmap, -m             Snp-to-gene mapping file
           -affy_to_rsid           Affy id to rs id mapping file
           -gmt                    Gene-set definition file
           Output Files:
           -out, -o                Name of the output file
           -print_cor              print out SNP-SNP correlations
           Analsis modifiers:
           -gene_list, -g          Only analyse genes on the file provided
           -genes                  Provide some gene name in command line, only these genes will be analyzed
           -all_genes              Analyze all genes in the snp-to-gene mapping file
           -chr                    Anlyze a specific chromosome
           -gene_type, -type       Only analyses genes of this type, e.g. protein_coding
           -exclude_gene_type, -exclude_type       Exclude genes of this type from the analysis, e.g. pseudogenes
           -distance, -d           Max SNP-to-gene distance allowed (in kb)
           -correlation, -cor      SNP-SNP correlation file
           -pearson_genotypes      Calculate SNP-SNP correlation with a pearson correlation for categorical variables
           -use_ld                         Use Linkage Disequilibrium as measure of SNP-SNP correlation
           -lambda                 lambda value to correct SNP statistics, if genomic control is used
           -gc_correction          Determine the lamda for the genomic control correction from the input p-values
           -gmt_min_size           Min number of genes in gene-sets to be analysed. default = 2
           -gmt_max_size           Max number of genes in gene-sets to be analysed. default = 999999999
           Weigthing
           -weights, -w            File with SNP weights
           -w_header               Indicate if the SNP weight file has a header.
           -weight_by_maf, -w_maf  Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.

Obtain a more detailed description by typing

$ perl forge.pl -man
Usage:
script [options]
-h, --help print help message
-m, --man print complete documentation
-report how often to report advance
-verbose, -v useful for debugging
Input Files:
-ped Genotype file in PLINK PED format
-map SNP info file in PLINK MAP format
-bfile Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
-assoc, -a SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
-snpmap, -m Snp-to-gene mapping file
-affy_to_rsid Affy id to rs id mapping file
-gmt Gene-set definition file
Output Files:
-out, -o Name of the output file
-print_cor print out SNP-SNP correlations
Analsis modifiers:
-gene_list, -g Only analyse genes on the file provided
-genes Provide some gene name in command line, only these genes will be analyzed
-all_genes Analyze all genes in the snp-to-gene mapping file
-chr Anlyze a specific chromosome
-gene_type, -type Only analyses genes of this type, e.g. protein_coding
-exclude_gene_type, -exclude_type Exclude genes of this type from the analysis, e.g. pseudogenes
-distance, -d Max SNP-to-gene distance allowed (in kb)
-correlation, -cor SNP-SNP correlation file
-pearson_genotypes Calculate SNP-SNP correlation with a pearson correlation for categorical variables
-use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation
-lambda lambda value to correct SNP statistics, if genomic control is used
-gc_correction Determine the lamda for the genomic control correction from the input p-values
-gmt_min_size Min number of genes in gene-sets to be analysed. default = 2
-gmt_max_size Max number of genes in gene-sets to be analysed. default = 999999999
Weigthing
-weights, -w File with SNP weights
-w_header Indicate if the SNP weight file has a header.
-weight_by_maf, -w_maf Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.
Options:
-help Print help message
-man print complete documentation
-report How often to report advance. Provide an integer X and the
program will report adnvance after X genes are analyzed.
-verbose, -v
useful for debugging
-ped Genotype file in PLINK PED format. See
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for
details on data format
-map SNP info file in PLINK MAP format
-bfile Files in PLINK binary format, with extension *.bed, *.bim and
*.fam.
-assoc, -a
SNP association file, columns header is necessary and at leat
columns with SNP and P names must be present.
-snpmap, -m
SNP-to-gene mapping file. Format is tab separated with
fields:chromosome,start,end,Ensembl id, hugo id, description,
SNP1, SNP2, ..., SNPN. Each SNP has 4 fields separated by
colons: id,position,alleles and strand. This may look
overcomplicated but allows to map any kind of variation and
store its information without having to use additional files. An
example look like:
1 67632083 67725662 ENSG00000162594 IL23R KNOWN protein_coding
Interleukin-23 receptor Precursor (IL-23R)
[Source:UniProtKB/Swiss-Prot;Acc:Q5VWK5]
rs7538938:67132262:T/C:1 rs72669476:67132582:C/T:1
rs72019237:67132863:C/-:1 rs61197134:67132871:C/-:1
rs11208941:67133111:T/C:1
-affy_to_rsid
Affy id to rs id mapping file. Tab separated file, looks like:
SNP_A-8389091 rs7593668
-gmt Gene-set definition file. Format is (tab separated) NAME
Description GENE1 GENE2 GENEN
-out, -o
Name of the output file. Output file will be tab separated with
the following columns: Ensembl_ID,Hugo
id,gene_type,chromosome,start,end,min p-value in the gene, Sidak
corrected minimum p-value, FORGE p-value, FORGE chi-square,
FORGE degrees of freedom, Eigen value ratio method gene p-value,
EVR chi-square, EVR degrees freedom, number of snps in gene,
number effective tests(Gao et al;PDMID:19434714) An example
would look like:ENSG00000162594 IL23R protein_coding 1 67632083
67725662 3.491e-02 4.132e-01 2.128e-01 8.42305 6.04941 1.119e-01
28.17326 10.116917 15
-print_cor
print out SNP-SNP correlations
-gene_list, -g
Only analyse genes on the file provided. Ids must match either
Ensembl Ids or Hugo ids present in the SNP-to-gene mapping file
-genes Provide some gene name in command line, only these genes will be
analyzed. Use like -genes IL23R
-all_genes
Analyze all genes in the snp-to-gene mapping file
-chr Anlyze a specific chromosome, Chromosome code must match that of
the SNP-to-gene mapping file
-gene_type, -type
Only analyses genes of this type, e.g. protein_coding
-exclude_gene_type, -exclude_type
Exclude genes of this type from the analysis, e.g. pseudogenes
-distance, -d
Max SNP-to-gene distance allowed (in kb)
-correlation, -cor
SNP-SNP correlation file. Space separated file with 3 columns,
first 2 the SNP ids the the 3th the correlation between them.
Like:
rs6660226 rs11209018 0.089
-pearson_genotypes
Calculate SNP-SNP correlation with a pearson correlation for
categorical variables as described on S. Wellek, A. Ziegler, Hum
Hered 67, 128 (2009).
-use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation.
Default is Pearson's correlation.
-lambda lambda value to correct SNP statistics, if genomic control is
used
-gc_correction
Correct the SNP pvalues by the genomic control method. It will
calculate the lambda from the data itself. PLEASE MAKE SURE YOU
DO NOT FILTER THE SNPS BEFORE RUNNING FORGE OR THE CORRECTION
MAY BE ERRONEOUS.
-gmt_min_size
Min number of genes in gene-sets to be analysed. default = 2
-gmt_max_size
Max number of genes in gene-sets to be analysed. default =
999999999
-weights, -w
File with SNP weights. Multiple files with weights can be used
by providing the -weights for each file. the file with weigths
can be tab, space or comma separated. the first columns must be
the snp id and the rest the weights. It is possible to have a
header in the file, in which case you must use the option
-w_header as well. SNP weights are use in the gene analysis and
to calculate the sample scores. How to use different sources of
information to make a simple weight for the SNP is a not trivial
issue. To simplify things I: a) for each gene's SNPs I make the
weight of each category sum 1, b) then I sum these re-sacel
weights for each SNP and c) finally make sure the weight within
the gene sum 1. Please note that SNPs without weights will
internally get the weights set to zero.
-w_header
Indicate if the SNP weight file has a header.
-weight_by_maf, -w_maf
Weight each SNP its the minor allele frequency (MAF). The actual
weight is 1/MAF.
Description:
TODO
Examples:
1. Basic Analysis
To perform a basic gene-based testing with the example files run:
>perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test
2. Adding gene-sets to the analysis
To perform a basic gene-based testing with the example files run:
>perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test -gmt example/example.gmt

OK, so now that you are familiar with the options lets run it! To run FORGE with the example files do

 $ perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out my_output.txt
Sun Jan 16 11:22:28 2011	Check http://github.com/inti/FORGE/wiki for updates
Sun Jan 16 11:22:28 2011	LOG file will be written to [ my_output.txt.log ]
Sun Jan 16 11:22:28 2011	Max SNP-to-gene distance allowed [ 20 ] kb
Sun Jan 16 11:22:28 2011	Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise
Sun Jan 16 11:22:28 2011	Going to analyze all genes on [ example/example.snpmap ] file.
Sun Jan 16 11:22:28 2011	Reading association file: [ example/example.assoc ]
Sun Jan 16 11:22:28 2011	[ 17 ] SNPs with association data
Sun Jan 16 11:22:28 2011	[ 20 ] SNPs and [ 4806 ] samples in genotype file
Sun Jan 16 11:22:28 2011	Loading SNP-2-Gene mapping
Sun Jan 16 11:22:28 2011	   '-> Reading [ example/example.snpmap ]
Sun Jan 16 11:22:28 2011	  '->[ 1 ] Genes read from SNP-2-Gene Mapping files
Sun Jan 16 11:22:28 2011	  '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed
Sun Jan 16 11:22:28 2011	Will analyse [ 1 ] genes
Sun Jan 16 11:22:28 2011	Output file will be written to [ my_output.txt ]
Sun Jan 16 11:22:28 2011	Starting to Calculate gene p-values
Sun Jan 16 11:22:28 2011	Reading genotypes from [ example/example.bed ]
Sun Jan 16 11:22:28 2011	Binary file is on SNP-major format
Sun Jan 16 11:22:28 2011	Well Done!!

We did run FORGE with a binary file but it is possible to do it with PED and MAP files, for example:

 $ perl forge.pl -ped example/example.ped -map example/example.map --a example/example.assoc -m example/example.snpmap -out my_output.txt
Sun Jan 16 11:23:45 2011	Check http://github.com/inti/FORGE/wiki for updates
Sun Jan 16 11:23:45 2011	LOG file will be written to [ my_output.txt.log ]
Sun Jan 16 11:23:45 2011	Max SNP-to-gene distance allowed [ 20 ] kb
Sun Jan 16 11:23:45 2011	Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise
Sun Jan 16 11:23:45 2011	Going to analyze all genes on [ example/example.snpmap ] file.
Sun Jan 16 11:23:45 2011	Reading association file: [ example/example.assoc ]
Sun Jan 16 11:23:45 2011	[ 17 ] SNPs with association data
Sun Jan 16 11:23:46 2011	[ 20 ] SNPs and [ 4806 ] samples in genotype file
Sun Jan 16 11:23:46 2011	Loading SNP-2-Gene mapping
Sun Jan 16 11:23:46 2011	   '-> Reading [ example/example.snpmap ]
Sun Jan 16 11:23:46 2011	  '->[ 1 ] Genes read from SNP-2-Gene Mapping files
Sun Jan 16 11:23:46 2011	  '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed
Sun Jan 16 11:23:46 2011	Will analyse [ 1 ] genes
Sun Jan 16 11:23:46 2011	Output file will be written to [ my_output.txt ]
Sun Jan 16 11:23:46 2011	Starting to Calculate gene p-values
Sun Jan 16 11:23:46 2011	Well Done!!

OK so lets look at the output file, my_output.txt

 $ cat my_output.txt
Ensembl_ID	Hugo_id	gene_type	chromosome	start	end	min_p	min_p_SIDAK	FISHER	FISHER_chi-square	FISHER_df	B_fix	Var_fix	B_P_fix	B_random	Var_random	B_P_random	I-squared	Q	Q_p-value	tau_squared	n_effect_Galwey	n_effect_Gao	n_snps
ENSG00000162594	IL23R	protein_coding	1	67632083	67725662	0.03077	0.354381637678306	0.276952014834721	7.35420285105523	5.87531460523843	-0.202198269842706	0.192931236614231	0.677362932153149	-0.202198269842706	0.192931236614231	0.677362932153149	54.6674257749351	30.8828700759271	0.00575658931285827	1.20591929113765	9.73459071596588	14	15

In the manual there is description of the columns. The columns 9th is the FISHER p-value, 8th The Sidak correction on the minimum p-value per gene, columns 14th is the Fix-effect z-score and the 14th is the Random-effect z-score.

Note	Some times the p-values can be be equal to 0 due to lack of precision because Perl works with single precision at least you recompile it with double. A software like R will allow you to get around it and we provide the chi-square a degrees of freedom and z-score and its variance of the tests for that.

If you want, run a whole genome analysis now and while it runs check out the tutorials on System Biology Oriented approaches: network and pathway analysis.

[[Home|Home]]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorials 2: Running FORGE

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally