-
Notifications
You must be signed in to change notification settings - Fork 5
Tutorials 2: Running FORGE
[[Home|Home]]
The first step is to make sure you have all files needed. The basic files are the FORGE Perl script, the snp-to-gene annotation and the SNP association p-values (see flow chart on the right). The first two are available at the Downloads page and you should provide the SNP association file. To get you started we distribute some example files, which we will use on this tutorial.
Download the FORGE folder, uncompress it and go into the folder. On the command line:
$ unzip FORGE.zip $ cd FORGE
Start by check the files on the folder
$ ls -1 CovMatrix.pm GWAS_IO.pm GWAS_STATS.pm README.txt affy_to_rsID.tab example forge.pl utilities
An explanation of the files is on the doc.txt, the README.txt will have some instructions for running as well as interpreting the output. To get a list of available options type
$ perl forge.pl -h
Usage:
script [options]
-h, --help print help message
-m, --man print complete documentation
-report how often to report advance
-verbose, -v useful for debugging
Input Files:
-ped Genotype file in PLINK PED format
-map SNP info file in PLINK MAP format
-bfile Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
-assoc, -a SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
-snpmap, -m Snp-to-gene mapping file
-affy_to_rsid Affy id to rs id mapping file
-gmt Gene-set definition file
Output Files:
-out, -o Name of the output file
-print_cor print out SNP-SNP correlations
Analsis modifiers:
-gene_list, -g Only analyse genes on the file provided
-genes Provide some gene name in command line, only these genes will be analyzed
-all_genes Analyze all genes in the snp-to-gene mapping file
-chr Anlyze a specific chromosome
-gene_type, -type Only analyses genes of this type, e.g. protein_coding
-exclude_gene_type, -exclude_type Exclude genes of this type from the analysis, e.g. pseudogenes
-distance, -d Max SNP-to-gene distance allowed (in kb)
-correlation, -cor SNP-SNP correlation file
-pearson_genotypes Calculate SNP-SNP correlation with a pearson correlation for categorical variables
-use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation
-lambda lambda value to correct SNP statistics, if genomic control is used
-gc_correction Determine the lamda for the genomic control correction from the input p-values
-gmt_min_size Min number of genes in gene-sets to be analysed. default = 2
-gmt_max_size Max number of genes in gene-sets to be analysed. default = 999999999
Weigthing
-weights, -w File with SNP weights
-w_header Indicate if the SNP weight file has a header.
-weight_by_maf, -w_maf Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.
Obtain a more detailed description by typing
$ perl forge.pl -man
Usage:
script [options]
-h, --help print help message
-m, --man print complete documentation
-report how often to report advance
-verbose, -v useful for debugging
Input Files:
-ped Genotype file in PLINK PED format
-map SNP info file in PLINK MAP format
-bfile Files in PLINK binary format, corresping file with extension *.bed, *.bim and *.fam.
-assoc, -a SNP association file, columns header is necessary and at leat columns with SNP and P names must be present
-snpmap, -m Snp-to-gene mapping file
-affy_to_rsid Affy id to rs id mapping file
-gmt Gene-set definition file
Output Files:
-out, -o Name of the output file
-print_cor print out SNP-SNP correlations
Analsis modifiers:
-gene_list, -g Only analyse genes on the file provided
-genes Provide some gene name in command line, only these genes will be analyzed
-all_genes Analyze all genes in the snp-to-gene mapping file
-chr Anlyze a specific chromosome
-gene_type, -type Only analyses genes of this type, e.g. protein_coding
-exclude_gene_type, -exclude_type Exclude genes of this type from the analysis, e.g. pseudogenes
-distance, -d Max SNP-to-gene distance allowed (in kb)
-correlation, -cor SNP-SNP correlation file
-pearson_genotypes Calculate SNP-SNP correlation with a pearson correlation for categorical variables
-use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation
-lambda lambda value to correct SNP statistics, if genomic control is used
-gc_correction Determine the lamda for the genomic control correction from the input p-values
-gmt_min_size Min number of genes in gene-sets to be analysed. default = 2
-gmt_max_size Max number of genes in gene-sets to be analysed. default = 999999999
Weigthing
-weights, -w File with SNP weights
-w_header Indicate if the SNP weight file has a header.
-weight_by_maf, -w_maf Weight each SNP its the minor allele frequency (MAF). The actual weight is 1/MAF.
Options:
-help Print help message
-man print complete documentation
-report How often to report advance. Provide an integer X and the
program will report adnvance after X genes are analyzed.
-verbose, -v
useful for debugging
-ped Genotype file in PLINK PED format. See
http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml for
details on data format
-map SNP info file in PLINK MAP format
-bfile Files in PLINK binary format, with extension *.bed, *.bim and
*.fam.
-assoc, -a
SNP association file, columns header is necessary and at leat
columns with SNP and P names must be present.
-snpmap, -m
SNP-to-gene mapping file. Format is tab separated with
fields:chromosome,start,end,Ensembl id, hugo id, description,
SNP1, SNP2, ..., SNPN. Each SNP has 4 fields separated by
colons: id,position,alleles and strand. This may look
overcomplicated but allows to map any kind of variation and
store its information without having to use additional files. An
example look like:
1 67632083 67725662 ENSG00000162594 IL23R KNOWN protein_coding
Interleukin-23 receptor Precursor (IL-23R)
[Source:UniProtKB/Swiss-Prot;Acc:Q5VWK5]
rs7538938:67132262:T/C:1 rs72669476:67132582:C/T:1
rs72019237:67132863:C/-:1 rs61197134:67132871:C/-:1
rs11208941:67133111:T/C:1
-affy_to_rsid
Affy id to rs id mapping file. Tab separated file, looks like:
SNP_A-8389091 rs7593668
-gmt Gene-set definition file. Format is (tab separated) NAME
Description GENE1 GENE2 GENEN
-out, -o
Name of the output file. Output file will be tab separated with
the following columns: Ensembl_ID,Hugo
id,gene_type,chromosome,start,end,min p-value in the gene, Sidak
corrected minimum p-value, FORGE p-value, FORGE chi-square,
FORGE degrees of freedom, Eigen value ratio method gene p-value,
EVR chi-square, EVR degrees freedom, number of snps in gene,
number effective tests(Gao et al;PDMID:19434714) An example
would look like:ENSG00000162594 IL23R protein_coding 1 67632083
67725662 3.491e-02 4.132e-01 2.128e-01 8.42305 6.04941 1.119e-01
28.17326 10.116917 15
-print_cor
print out SNP-SNP correlations
-gene_list, -g
Only analyse genes on the file provided. Ids must match either
Ensembl Ids or Hugo ids present in the SNP-to-gene mapping file
-genes Provide some gene name in command line, only these genes will be
analyzed. Use like -genes IL23R
-all_genes
Analyze all genes in the snp-to-gene mapping file
-chr Anlyze a specific chromosome, Chromosome code must match that of
the SNP-to-gene mapping file
-gene_type, -type
Only analyses genes of this type, e.g. protein_coding
-exclude_gene_type, -exclude_type
Exclude genes of this type from the analysis, e.g. pseudogenes
-distance, -d
Max SNP-to-gene distance allowed (in kb)
-correlation, -cor
SNP-SNP correlation file. Space separated file with 3 columns,
first 2 the SNP ids the the 3th the correlation between them.
Like:
rs6660226 rs11209018 0.089
-pearson_genotypes
Calculate SNP-SNP correlation with a pearson correlation for
categorical variables as described on S. Wellek, A. Ziegler, Hum
Hered 67, 128 (2009).
-use_ld Use Linkage Disequilibrium as measure of SNP-SNP correlation.
Default is Pearson's correlation.
-lambda lambda value to correct SNP statistics, if genomic control is
used
-gc_correction
Correct the SNP pvalues by the genomic control method. It will
calculate the lambda from the data itself. PLEASE MAKE SURE YOU
DO NOT FILTER THE SNPS BEFORE RUNNING FORGE OR THE CORRECTION
MAY BE ERRONEOUS.
-gmt_min_size
Min number of genes in gene-sets to be analysed. default = 2
-gmt_max_size
Max number of genes in gene-sets to be analysed. default =
999999999
-weights, -w
File with SNP weights. Multiple files with weights can be used
by providing the -weights for each file. the file with weigths
can be tab, space or comma separated. the first columns must be
the snp id and the rest the weights. It is possible to have a
header in the file, in which case you must use the option
-w_header as well. SNP weights are use in the gene analysis and
to calculate the sample scores. How to use different sources of
information to make a simple weight for the SNP is a not trivial
issue. To simplify things I: a) for each gene's SNPs I make the
weight of each category sum 1, b) then I sum these re-sacel
weights for each SNP and c) finally make sure the weight within
the gene sum 1. Please note that SNPs without weights will
internally get the weights set to zero.
-w_header
Indicate if the SNP weight file has a header.
-weight_by_maf, -w_maf
Weight each SNP its the minor allele frequency (MAF). The actual
weight is 1/MAF.
Description:
TODO
Examples:
1. Basic Analysis
To perform a basic gene-based testing with the example files run:
>perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test
2. Adding gene-sets to the analysis
To perform a basic gene-based testing with the example files run:
>perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out test -gmt example/example.gmt
OK, so now that you are familiar with the options lets run it! To run FORGE with the example files do
$ perl forge.pl -bfile example/example -assoc example/example.assoc -snpmap example/example.snpmap -out my_output.txt Sun Jan 16 11:22:28 2011 Check http://github.com/inti/FORGE/wiki for updates Sun Jan 16 11:22:28 2011 LOG file will be written to [ my_output.txt.log ] Sun Jan 16 11:22:28 2011 Max SNP-to-gene distance allowed [ 20 ] kb Sun Jan 16 11:22:28 2011 Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise Sun Jan 16 11:22:28 2011 Going to analyze all genes on [ example/example.snpmap ] file. Sun Jan 16 11:22:28 2011 Reading association file: [ example/example.assoc ] Sun Jan 16 11:22:28 2011 [ 17 ] SNPs with association data Sun Jan 16 11:22:28 2011 [ 20 ] SNPs and [ 4806 ] samples in genotype file Sun Jan 16 11:22:28 2011 Loading SNP-2-Gene mapping Sun Jan 16 11:22:28 2011 '-> Reading [ example/example.snpmap ] Sun Jan 16 11:22:28 2011 '->[ 1 ] Genes read from SNP-2-Gene Mapping files Sun Jan 16 11:22:28 2011 '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed Sun Jan 16 11:22:28 2011 Will analyse [ 1 ] genes Sun Jan 16 11:22:28 2011 Output file will be written to [ my_output.txt ] Sun Jan 16 11:22:28 2011 Starting to Calculate gene p-values Sun Jan 16 11:22:28 2011 Reading genotypes from [ example/example.bed ] Sun Jan 16 11:22:28 2011 Binary file is on SNP-major format Sun Jan 16 11:22:28 2011 Well Done!!
We did run FORGE with a binary file but it is possible to do it with PED and MAP files, for example:
$ perl forge.pl -ped example/example.ped -map example/example.map --a example/example.assoc -m example/example.snpmap -out my_output.txt Sun Jan 16 11:23:45 2011 Check http://github.com/inti/FORGE/wiki for updates Sun Jan 16 11:23:45 2011 LOG file will be written to [ my_output.txt.log ] Sun Jan 16 11:23:45 2011 Max SNP-to-gene distance allowed [ 20 ] kb Sun Jan 16 11:23:45 2011 Note: You did not provide an option for the set of genes to be analyzed. I will analyze all genes covered by the SNP association file. Check documentation for options -genes and -gene_list otherwise Sun Jan 16 11:23:45 2011 Going to analyze all genes on [ example/example.snpmap ] file. Sun Jan 16 11:23:45 2011 Reading association file: [ example/example.assoc ] Sun Jan 16 11:23:45 2011 [ 17 ] SNPs with association data Sun Jan 16 11:23:46 2011 [ 20 ] SNPs and [ 4806 ] samples in genotype file Sun Jan 16 11:23:46 2011 Loading SNP-2-Gene mapping Sun Jan 16 11:23:46 2011 '-> Reading [ example/example.snpmap ] Sun Jan 16 11:23:46 2011 '->[ 1 ] Genes read from SNP-2-Gene Mapping files Sun Jan 16 11:23:46 2011 '->[ 15 ] SNPs mapped to Genes and with association results will be analyzed Sun Jan 16 11:23:46 2011 Will analyse [ 1 ] genes Sun Jan 16 11:23:46 2011 Output file will be written to [ my_output.txt ] Sun Jan 16 11:23:46 2011 Starting to Calculate gene p-values Sun Jan 16 11:23:46 2011 Well Done!!
OK so lets look at the output file, my_output.txt
$ cat my_output.txt Ensembl_ID Hugo_id gene_type chromosome start end min_p min_p_SIDAK FISHER FISHER_chi-square FISHER_df B_fix Var_fix B_P_fix B_random Var_random B_P_random I-squared Q Q_p-value tau_squared n_effect_Galwey n_effect_Gao n_snps ENSG00000162594 IL23R protein_coding 1 67632083 67725662 0.03077 0.354381637678306 0.276952014834721 7.35420285105523 5.87531460523843 -0.202198269842706 0.192931236614231 0.677362932153149 -0.202198269842706 0.192931236614231 0.677362932153149 54.6674257749351 30.8828700759271 0.00575658931285827 1.20591929113765 9.73459071596588 14 15
In the manual there is description of the columns. The columns 9th is the FISHER p-value, 8th The Sidak correction on the minimum p-value per gene, columns 14th is the Fix-effect z-score and the 14th is the Random-effect z-score.
|
Note
|
Some times the p-values can be be equal to 0 due to lack of precision because Perl works with single precision at least you recompile it with double. A software like R will allow you to get around it and we provide the chi-square a degrees of freedom and z-score and its variance of the tests for that. |
If you want, run a whole genome analysis now and while it runs check out the tutorials on System Biology Oriented approaches: network and pathway analysis.
[[Home|Home]]