107 changes: 106 additions & 1 deletion README.md
@@ -1,3 +1,108 @@
# project_spring_2020
# Run SuSiE Run!
## Data processing for fine-mapping with summary statistics using SuSiE and DAP-G
### Python Project for BIOF309

[![CircleCI](https://circleci.com/gh/biof309/project_spring_2020/tree/master.svg?style=shield)](https://circleci.com/gh/biof309/project_spring_2020/tree/master)

This package helps process GWAS summary statistics and LD reference panel data for fine-mapping a selected region with the programs SuSiE and DAP-G. These are the main goals of this package:
1. Pull GWAS summary statistics and reference panel data (limited to those with access)
2. Make a linkage disequilibrium (LD) matrix with the reference panel data (PLINK format)
3. Align and clean the summary stats and LD matrix
4. Fine-map the region using 2 different programs:
- A program written in R named SuSiE
- A program written in C++ named DAP-G

To clone this repository locally, enter the following at the command line:

```
git clone https://github.com/hsowards/project_spring_2020
```

## 1. Pull GWAS summary statistics and reference panel data (limited to those with access)
*Important Note: the instructions for this step are specific to data that **is not** publicly accessible*

1. If you don't have biowulf access, [request it](https://hpc.nih.gov/docs/accounts.html)
2. Request access to data files (ask your PI/supervisor)
3. [Connect to Biowulf](https://hpc.nih.gov/docs/connect.html)

Once you're connected, edit the file **scripts/config.sh** to match your region of interest. Don't forget to include the path to the project directory so that the files don't pile up in shared folders.

After your config.sh file is set up and saved, run the following script:

```
sh scripts/data.sh
```

Check that your files are present: there should be .bim/.fam/.bed files named with the rsID from your config file, plus a summary statistics file (.txt.gz) whose name contains the chromosome number:

```
source scripts/config.sh
cd $dir
ls $dir
```

## 2. Make a linkage disequilibrium (LD) matrix with the reference panel data
If you haven't yet, edit the file **scripts/config.sh** to match your region of interest and the directory where your data is.

The following functions will:
1. read in plink data, specifically the genotype matrix (X)
2. make sure that SNPs are in columns rather than rows
3. set missing genotypes to the sample mean
4. standardize the genotypes
5. compute the LD matrix as `np.transpose(X) @ (X / N)`
- X is the genotype matrix
- N is the sample size (N individuals included in the LD reference population)
- @ is numpy's matrix multiplication symbol
6. check that the matrix is positive semi-definite (PSD)
	- Specifically, checking that all eigenvalues are greater than -1E-10, SuSiE's threshold for PSD
7. write out the matrix as rsid.ld

```
python scripts/ld.py
```
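The LD computation in steps 4–6 above can be sketched directly in NumPy with a toy genotype matrix (the variable names here are illustrative, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5  # N individuals, P SNPs
X = rng.integers(0, 3, size=(N, P)).astype(float)  # toy 0/1/2 genotypes

# standardize each SNP column, then compute LD = transpose(X) @ (X / N)
X = (X - X.mean(axis=0)) / X.std(axis=0)
ld = np.transpose(X) @ (X / N)

# PSD check: all eigenvalues above SuSiE's -1e-10 threshold
# (eigvalsh is used here since the LD matrix is symmetric)
psd = bool(np.all(np.linalg.eigvalsh(ld) > -1e-10))
print(ld.shape, psd)
```

Because the columns are standardized, the result is a correlation matrix: its diagonal is 1 and it is positive semi-definite up to floating-point error, which is exactly why the -1e-10 tolerance is used rather than 0.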

## 3. Align and clean the summary statistics and LD matrix
The following functions will:
1. read in the summary statistics and new LD matrix
2. pare the datasets down to the region
3. check that ref/alt alleles are aligned between the datasets
4. check that the number of SNPs in the summary stats matches the LD matrix
5. if there is a mismatch of SNPs:
	- the bim file will be used to align the summary stats with the LD matrix
	- SNPs missing from either file will be removed
6. write out the clean summary stats and LD matrix
- Files for analysis in SuSiE will be named rsid_susie_sumstats.csv and rsid_susie.ld
- Files for analysis in DAP-G will be named rsid_dapg_sumstats.csv and rsid_dapg.ld

```
python scripts/processing.py
```
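The alignment in the steps above is essentially an inner join on SNP ID, which can be sketched with pandas (toy data frames; the column names are illustrative, the real scripts take them from the bim file and summary statistics):

```python
import pandas as pd

# toy inputs: one SNP is missing from each side (rs4 and rs3)
sumstats = pd.DataFrame({"id": ["rs1", "rs2", "rs4"], "Z": [1.2, -0.4, 2.1]})
bim = pd.DataFrame({"id": ["rs1", "rs2", "rs3"], "pos": [101, 102, 103]})

# keep only SNPs present in both files, preserving the bim (LD matrix) order
aligned = bim.merge(sumstats, on="id", how="inner")
print(aligned["id"].tolist())  # ['rs1', 'rs2']
```

The row indices of the kept SNPs also say which rows and columns of the LD matrix to retain, so the cleaned summary stats and LD matrix stay the same length.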

## 4. Fine-map the region with SuSiE
The following script will:
1. read in the summary stats and LD matrix
2. run the fine-mapping algorithm with L number of predicted causals
3. write out the credible sets produced by SuSiE as rsid_susie_L_cs.csv

```
Rscript scripts/SuSiE.R
```
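SuSiE computes credible sets internally, but the idea of a 95% credible set can be illustrated with plain Python: sort the posterior inclusion probabilities (PIPs) for one signal and keep the smallest set of SNPs whose probabilities sum to at least 0.95 (toy numbers; an illustration only, not the package's code):

```python
# toy posterior probabilities for one causal signal, summing to 1
pips = {"rs1": 0.60, "rs2": 0.25, "rs3": 0.12, "rs4": 0.03}

credible_set = []
total = 0.0
for snp, p in sorted(pips.items(), key=lambda kv: -kv[1]):  # highest PIP first
    credible_set.append(snp)
    total += p
    if total >= 0.95:  # stop once 95% of the posterior mass is covered
        break

print(credible_set)  # ['rs1', 'rs2', 'rs3']
```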

## 5. Fine-map the region with DAP-G

- The DAP-G specific files produced by processing.py will be run in DAP-G using the included script dapg.sh
- DAP-G must be compiled and run in a virtual machine running Ubuntu; here are the rough steps:
1. [Download Oracle VM VirtualBox](https://www.virtualbox.org/wiki/Downloads)
2. Download Ubuntu Desktop [here](https://ubuntu.com/download/desktop)
3. [Follow this guide](https://brb.nci.nih.gov/seqtools/installUbuntu.html) to set up your virtual machine
4. You can clone this repository into your VirtualBox to access the config.sh and dapg.sh files, but to access the data you will have to either:
	- create a shared folder between your host OS and Ubuntu, OR
	- upload the data to Google Drive
5. Once the data is uploaded, make sure the config.sh file is updated; you can do this from the terminal:
```
nano scripts/config.sh
```
6. Run dapg.sh:
```
sh scripts/dapg.sh
```
22 changes: 22 additions & 0 deletions RSR/.ipynb_checkpoints/__init__-checkpoint.py
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
long_description = fh.read()

setuptools.setup(
    name="RSR",
    version="0.0.1",
    author="Hayley Sowards",
    author_email="hayley.sowards@gmail.com",
    description="This package aims to help process data for analysis using the fine-mapping programs SuSiE and DAP-G",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/hsowards/project_spring_2020",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
)
30 changes: 30 additions & 0 deletions RSR/PLINKtoLD.py
@@ -0,0 +1,30 @@
import numpy as np

def PLINKtoLD(x, n):
    """
    LD from plink data

    Creates a linkage disequilibrium matrix from a plink genotype matrix. Checks that SNPs
    are in columns (transposing if needed), fills missing genotypes with the sample mean,
    and standardizes the genotypes before computing LD.

    Parameters:
    x (ndarray): genotype matrix
    n (int): sample size

    Returns:
    ndarray: linkage disequilibrium matrix
    """
    x = np.asarray(x, dtype=float)
    if x.shape[1] == n and x.shape[0] != n: #this indicates SNPs are rows, so the matrix needs to be transposed
        x = np.transpose(x)
    #setting NAs to the column mean
    col_mean = np.nanmean(x, axis=0)
    inds = np.where(np.isnan(x))
    x[inds] = np.take(col_mean, inds[1])
    #standardizing the matrix (z-score)
    x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
    #creating the LD matrix: R = X'X / N
    ld = np.transpose(x) @ (x / n)
    return ld
22 changes: 22 additions & 0 deletions RSR/__init__.py
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
long_description = fh.read()

setuptools.setup(
    name="RSR",
    version="0.0.1",
    author="Hayley Sowards",
    author_email="hayley.sowards@gmail.com",
    description="This package aims to help process data for analysis using the fine-mapping programs SuSiE and DAP-G",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/hsowards/project_spring_2020",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
)
17 changes: 17 additions & 0 deletions RSR/eig_check.py
@@ -0,0 +1,17 @@
import numpy as np

def eig_check(ld):
    """
    Check eigenvalues of LD matrix

    Check that all eigenvalues of the LD matrix are > -1E-10, meaning the matrix is positive semi-definite

    Parameters:
    ld (ndarray): linkage disequilibrium matrix

    Returns:
    boolean: True if matrix is positive semi-definite, False if not
    """
    eig = np.linalg.eigvals(ld) #taking the eigenvalues of the LD matrix
    check = np.all(eig > -1e-10) #checking that the values are all above -1e-10
    print(check) #prints the boolean
    return check
9 changes: 9 additions & 0 deletions RSR/test/test_ld.py
@@ -0,0 +1,9 @@
import numpy as np
from RSR.PLINKtoLD import PLINKtoLD

def test_PLINKtoLD():
    n = 50
    bed = np.random.rand(n, 5) #toy genotype matrix with n individuals in rows
    assert bed.shape[0] == n or bed.shape[1] == n, "One of the dimensions of the bed file should be n"
    ld = PLINKtoLD(bed, n)
    assert isinstance(ld, np.ndarray), "The LD matrix should be an ndarray"

def test_eig_check():
    eig = np.linalg.eigvals(np.eye(3))
    assert isinstance(eig, np.ndarray), "The eigenvalues should be in an array"

17 changes: 17 additions & 0 deletions RSR/write_ld.py
@@ -0,0 +1,17 @@
import numpy as np
from RSR.eig_check import eig_check

def write_ld(ld, rsID):
    """
    Write LD matrix

    Writes out the LD matrix, after checking that it is positive semi-definite, as a comma-delimited file named rsID.ld

    Parameters:
    ld (ndarray): linkage disequilibrium matrix
    rsID (str): rsID used to name the output file
    """
    if eig_check(ld): #only save if the matrix is positive semi-definite
        filename = dir + "/" + rsID + ".ld" #dir comes from the sourced config file
        np.savetxt(filename, ld, delimiter=",") #write out file
    else: #if matrix is not positive semi-definite, the LD matrix is not saved
        print("LD matrix is not positive semi-definite")

1 change: 0 additions & 1 deletion project_spring_2020/sample_file.py

This file was deleted.

45 changes: 45 additions & 0 deletions scripts/SuSiE.R
@@ -0,0 +1,45 @@
library(tidyverse)
source("scripts/config.sh") #load in configs (the file's plain var=value lines also parse as R)

x <- read_csv(paste0(dir, "/", rsID, "_susie.ld"), col_names = FALSE)
sumstats <- read_csv(paste0(dir, "/", rsID, "_susie_sumstats.csv"))
#SNP info from the plink bim file (chr, id, cM, pos, allele1, allele2)
ld_info <- read_tsv(paste0(dir, "/", rsID, ".bim"),
                    col_names = c("chr", "id", "cm", "pos", "alt", "ref"))

crediblesets <- function(causals){
  susie_output <- susieR::susie_rss(z = sumstats$Z_fixed, #p vector of z scores
                                    R = as.matrix(x), #LD matrix
                                    L = causals, #number of causals
                                    check_z = F, #saving computation since we check PSD earlier
                                    estimate_residual_variance = T,
                                    estimate_prior_variance = F,
                                    max_iter = 100000) #iterations based on CAVIAR
  cs <- map_df(susie_output$sets$cs, ~as.data.frame(.x), .id = "susie_set")
  if (nrow(sumstats) == nrow(ld_info)) {
    cs <- ld_info %>%
      select(pos) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = c("row" = ".x"))
    crediblesets <- as.data.frame(susie_output$pip) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = "row") %>%
      inner_join(ld_info, by = "pos") %>%
      select(susie_set, row, `susie_output$pip`, pos, chr, id, ref, alt)
    colnames(crediblesets) <- paste(colnames(crediblesets), causals, sep = "_")
    print(crediblesets)
  } else {
    cs <- ld_info %>%
      filter(pos > start & pos < end) %>%
      select(pos) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = c("row" = ".x"))
    crediblesets <- as.data.frame(susie_output$pip) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = "row") %>%
      inner_join(ld_info, by = "pos") %>%
      select(susie_set, row, `susie_output$pip`, pos, chr, id, ref, alt)
    colnames(crediblesets) <- paste(colnames(crediblesets), causals, sep = "_")
    print(crediblesets)
  }
}

crediblesets(causals) #causals (the L parameter) must be set before running
7 changes: 7 additions & 0 deletions scripts/config.sh
@@ -0,0 +1,7 @@
#set each of these variables to reflect your region of interest
chr=21 #chr
start=42643496 #starting bp
end=42843496 #ending bp
rsID="rs408825" #rsID for filenames
n=5000 #sample size
dir="/Users/sowardsha/Documents/data" #your data directory path
3 changes: 3 additions & 0 deletions scripts/dapg.sh
@@ -0,0 +1,3 @@
#!/bin/bash
source scripts/config.sh #sourcing the config file so that configs are used in file creation
dap-g -d_z ${rsID}_dapg_sumstats.csv -d_ld ${rsID}_dapg.ld
11 changes: 11 additions & 0 deletions scripts/data.sh
@@ -0,0 +1,11 @@
#!/bin/bash
source scripts/config.sh #sourcing the config file so that configs are used in file creation
unzip /data/Brown_lab/vQTL/LD_reference/UKBB/transfer_1599580_files_b3b70061.zip #unzip (decompress into multiple file components); this and the next step must be done for plink to run
gunzip UKBB* #gunzip (decompress files)
module load plink #load plink
#running plink to pull $rsID.bim/.fam/.bed files on the relevant chr:start-end region
plink --bfile UKBB_LD_bestguess_ref_panel_maf1e3_rsq_point3_HRC_SNPs_only --chr $chr --from-bp $start --to-bp $end --make-bed --out $rsID
mv $rsID.* $dir #moving files to the indicated project directory
sumstats="MEL_all_RSQ0.5_${chr}_for_TWAS.txt.gz" #initializing filename for sumstats
mv /data/Brown_lab/PAINTOR/dataset/GWAS_Meta/$sumstats $dir #moving sumstats to the project directory
echo "The configured region is on chr $chr, starting at position $start, and ending at position $end. $rsID.bed/.fam/.bim and MEL_all_RSQ0.5_${chr}_for_TWAS.txt.gz are now available in $dir."
15 changes: 15 additions & 0 deletions scripts/ld.py
@@ -0,0 +1,15 @@
exec(open("scripts/source.py").read()) #shared imports and configs (run from the repository root)
from RSR.PLINKtoLD import PLINKtoLD #package functions
from RSR.eig_check import eig_check
from RSR.write_ld import write_ld

#reads in data, pulling from the config file
prefix = dir + "/" + rsID
(bim, fam, bed) = read_plink(prefix)

#creates the LD matrix from the bed file
ld = PLINKtoLD(bed, n)

#checks that the matrix is positive semi-definite
eig_check(ld)

#writes out the LD matrix
write_ld(ld, rsID)
6 changes: 6 additions & 0 deletions scripts/source.py
@@ -0,0 +1,6 @@
import pandas as pd
import numpy as np
from pandas_plink import read_plink
from statistics import *

exec(open("scripts/config.sh").read()) #this reads in the config file
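The exec trick works because the config file contains only `var=value` assignments and `#` comments, which happen to be valid Python as well as valid shell. A self-contained illustration, with config text inlined instead of read from disk (values taken from scripts/config.sh):

```python
# a few lines in the same shape as scripts/config.sh
config_text = '''
chr=21 #chr
start=42643496 #starting bp
rsID="rs408825" #rsID for filenames
'''

namespace = {}
exec(config_text, namespace)  # each var=value line runs as a Python assignment
print(namespace["rsID"], namespace["start"])  # rs408825 42643496
```

The same property is what lets SuSiE.R load the file via `source()`; it holds only as long as there are no spaces around `=` and the values stay simple literals.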
9 changes: 0 additions & 9 deletions tests/sample_test.py

This file was deleted.