107 changes: 106 additions & 1 deletion README.md
@@ -1,3 +1,108 @@
# project_spring_2020
# Run SuSiE Run!
## Data processing for fine-mapping with summary statistics using SuSiE and DAP-G
### Python Project for BIOF309

[![CircleCI](https://circleci.com/gh/biof309/project_spring_2020/tree/master.svg?style=shield)](https://circleci.com/gh/biof309/project_spring_2020/tree/master)

This package helps process GWAS summary statistics and LD reference panel data for fine-mapping a selected region with the programs SuSiE and DAP-G. These are the main goals of this package:
1. Pull GWAS summary statistics and reference panel data (limited to those with access)
2. Make a linkage disequilibrium (LD) matrix with the reference panel data (PLINK format)
3. Align and clean the summary stats and LD matrix
4. Fine-map the region using 2 different programs:
- A program written in R named SuSiE
- A program written in C++ named DAP-G

To clone this repository locally, enter the following at the command line:

```
git clone https://github.com/hsowards/project_spring_2020
```

## 1. Pull GWAS summary statistics and reference panel data (limited to those with access)
*Important Note: the instructions for this step are specific to data that **is not** publicly accessible*

1. If you don't have biowulf access, [request it](https://hpc.nih.gov/docs/accounts.html)
2. Request access to data files (ask your PI/supervisor)
3. [Connect to Biowulf](https://hpc.nih.gov/docs/connect.html)

Once you're connected, edit the file **scripts/config.sh** to match your region of interest. Don't forget to include the path to the project directory so that the files don't pile up in shared folders.

After your config.sh file is set up and saved, run the following script:

```
sh scripts/data.sh
```

Check that your files are present: there should be .bim/.fam/.bed files named with the rsID from your config file, plus a summary statistics file (.txt.gz) whose name contains the chromosome number:

```
source scripts/config.sh
cd $dir
ls $dir
```

## 2. Make a linkage disequilibrium (LD) matrix with the reference panel data
If you haven't yet, edit the file **scripts/config.sh** to match your region of interest and the directory where your data is.

The following functions will:
1. read in plink data, specifically the genotype matrix (X)
2. make sure that SNPs are in columns rather than rows
3. set missing genotypes to the sample mean
4. standardize the genotypes
5. compute the LD matrix as `np.transpose(X) @ (X / N)`
- X is the genotype matrix
- N is the sample size (N individuals included in the LD reference population)
- @ is numpy's matrix multiplication symbol
6. check that the matrix is positive semi-definite (PSD)
	- Specifically, checking that all eigenvalues are greater than -1E-10, SuSiE's threshold for PSD
7. write out the matrix as rsid.ld

```
python scripts/ld.py
```
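The LD computation in steps 4–6 above can be sketched directly in NumPy with a toy genotype matrix (the variable names here are illustrative, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5  # N individuals, P SNPs
X = rng.integers(0, 3, size=(N, P)).astype(float)  # toy 0/1/2 genotypes

# standardize each SNP column, then compute LD = transpose(X) @ (X / N)
X = (X - X.mean(axis=0)) / X.std(axis=0)
ld = np.transpose(X) @ (X / N)

# PSD check: all eigenvalues above SuSiE's -1e-10 threshold
# (eigvalsh is used here since the LD matrix is symmetric)
psd = bool(np.all(np.linalg.eigvalsh(ld) > -1e-10))
print(ld.shape, psd)
```

Because the columns are standardized, the result is a correlation matrix: its diagonal is 1 and it is positive semi-definite up to floating-point error, which is exactly why the -1e-10 tolerance is used rather than 0.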

## 3. Align and clean the summary statistics and LD matrix
The following functions will:
1. read in the summary statistics and new LD matrix
2. pare the datasets down to the region
3. check that ref/alt alleles are aligned between the datasets
4. check that the number of SNPs in the summary stats matches the LD matrix
5. if there is a mismatch of SNPs:
	- the bim file will be used to align the summary stats with the LD matrix
	- SNPs missing from either file will be removed
6. write out the clean summary stats and LD matrix
- Files for analysis in SuSiE will be named rsid_susie_sumstats.csv and rsid_susie.ld
- Files for analysis in DAP-G will be named rsid_dapg_sumstats.csv and rsid_dapg.ld

```
python scripts/processing.py
```
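The alignment in the steps above is essentially an inner join on SNP ID, which can be sketched with pandas (toy data frames; the column names are illustrative, the real scripts take them from the bim file and summary statistics):

```python
import pandas as pd

# toy inputs: one SNP is missing from each side (rs4 and rs3)
sumstats = pd.DataFrame({"id": ["rs1", "rs2", "rs4"], "Z": [1.2, -0.4, 2.1]})
bim = pd.DataFrame({"id": ["rs1", "rs2", "rs3"], "pos": [101, 102, 103]})

# keep only SNPs present in both files, preserving the bim (LD matrix) order
aligned = bim.merge(sumstats, on="id", how="inner")
print(aligned["id"].tolist())  # ['rs1', 'rs2']
```

The row indices of the kept SNPs also say which rows and columns of the LD matrix to retain, so the cleaned summary stats and LD matrix stay the same length.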

## 4. Fine-map the region with SuSiE
The following script will:
1. read in the summary stats and LD matrix
2. run the fine-mapping algorithm with L number of predicted causals
3. write out the credible sets produced by SuSiE as rsid_susie_L_cs.csv

```
Rscript scripts/SuSiE.R
```
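SuSiE computes credible sets internally, but the idea of a 95% credible set can be illustrated with plain Python: sort the posterior inclusion probabilities (PIPs) for one signal and keep the smallest set of SNPs whose probabilities sum to at least 0.95 (toy numbers; an illustration only, not the package's code):

```python
# toy posterior probabilities for one causal signal, summing to 1
pips = {"rs1": 0.60, "rs2": 0.25, "rs3": 0.12, "rs4": 0.03}

credible_set = []
total = 0.0
for snp, p in sorted(pips.items(), key=lambda kv: -kv[1]):  # highest PIP first
    credible_set.append(snp)
    total += p
    if total >= 0.95:  # stop once 95% of the posterior mass is covered
        break

print(credible_set)  # ['rs1', 'rs2', 'rs3']
```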

## 5. Fine-map the region with DAP-G

- The DAP-G specific files produced by processing.py will be run in DAP-G using the included script dapg.sh
- DAP-G must be compiled and run in a virtual machine running Ubuntu; here are the rough steps:
1. [Download Oracle VM VirtualBox](https://www.virtualbox.org/wiki/Downloads)
2. Download Ubuntu Desktop [here](https://ubuntu.com/download/desktop)
3. [Follow this guide](https://brb.nci.nih.gov/seqtools/installUbuntu.html) to set up your virtual machine
4. You can clone this repository into your VirtualBox to access the config.sh and dapg.sh files, but to access the data you will have to either:
	- create a shared folder between your host OS and Ubuntu, OR
	- upload the data to Google Drive
5. Once the data is uploaded, make sure the config.sh file is updated; you can do this from the terminal:
```
nano scripts/config.sh
```
6. Run dapg.sh:
```
sh scripts/dapg.sh
```
22 changes: 22 additions & 0 deletions RSR/.ipynb_checkpoints/__init__-checkpoint.py
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
long_description = fh.read()

setuptools.setup(
    name="RSR",
    version="0.0.1",
    author="Hayley Sowards",
    author_email="hayley.sowards@gmail.com",
    description="This package aims to help process data for analysis using the fine-mapping programs SuSiE and DAP-G",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/hsowards/project_spring_2020",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
)
30 changes: 30 additions & 0 deletions RSR/PLINKtoLD.py
@@ -0,0 +1,30 @@
import numpy as np

def PLINKtoLD(x, n):
    """
    LD from plink data

    Creates a linkage disequilibrium matrix from a plink genotype matrix. Checks that SNPs
    are in columns (transposing if needed), fills missing genotypes with the sample mean,
    and standardizes the genotypes before computing LD.

    Parameters:
    x (ndarray): genotype matrix
    n (int): sample size

    Returns:
    ndarray: linkage disequilibrium matrix
    """
    x = np.asarray(x, dtype=float)
    if x.shape[1] == n and x.shape[0] != n: #this indicates SNPs are rows, so the matrix needs to be transposed
        x = np.transpose(x)
    #setting NAs to the column mean
    col_mean = np.nanmean(x, axis=0)
    inds = np.where(np.isnan(x))
    x[inds] = np.take(col_mean, inds[1])
    #standardizing the matrix (z-score)
    x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
    #creating the LD matrix: R = X'X / N
    ld = np.transpose(x) @ (x / n)
    return ld
22 changes: 22 additions & 0 deletions RSR/__init__.py
@@ -0,0 +1,22 @@
import setuptools

with open("README.md", "r") as fh:
long_description = fh.read()

setuptools.setup(
    name="RSR",
    version="0.0.1",
    author="Hayley Sowards",
    author_email="hayley.sowards@gmail.com",
    description="This package aims to help process data for analysis using the fine-mapping programs SuSiE and DAP-G",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/hsowards/project_spring_2020",
    packages=setuptools.find_packages(),
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
    ],
    python_requires='>=3.6',
)
17 changes: 17 additions & 0 deletions RSR/eig_check.py
@@ -0,0 +1,17 @@
import numpy as np

def eig_check(ld):
    """
    Check eigenvalues of LD matrix

    Check that all eigenvalues of the LD matrix are > -1E-10, meaning the matrix is positive semi-definite

    Parameters:
    ld (ndarray): linkage disequilibrium matrix

    Returns:
    boolean: True if matrix is positive semi-definite, False if not
    """
    eig = np.linalg.eigvals(ld) #taking the eigenvalues of the LD matrix
    check = np.all(eig > -1e-10) #checking that the values are all above -1e-10
    print(check) #prints the boolean
    return check
9 changes: 9 additions & 0 deletions RSR/test/test_ld.py
@@ -0,0 +1,9 @@
import numpy as np
from RSR.PLINKtoLD import PLINKtoLD

def test_PLINKtoLD():
    n = 50
    bed = np.random.rand(n, 5) #toy genotype matrix with n individuals in rows
    assert bed.shape[0] == n or bed.shape[1] == n, "One of the dimensions of the bed file should be n"
    ld = PLINKtoLD(bed, n)
    assert isinstance(ld, np.ndarray), "The LD matrix should be an ndarray"

def test_eig_check():
    eig = np.linalg.eigvals(np.eye(3))
    assert isinstance(eig, np.ndarray), "The eigenvalues should be in an array"

17 changes: 17 additions & 0 deletions RSR/write_ld.py
@@ -0,0 +1,17 @@
import numpy as np
from RSR.eig_check import eig_check

def write_ld(ld, rsID):
    """
    Write LD matrix

    Writes out the LD matrix, after checking that it is positive semi-definite, as a comma-delimited file named rsID.ld

    Parameters:
    ld (ndarray): linkage disequilibrium matrix
    rsID (str): rsID used to name the output file
    """
    if eig_check(ld): #only save if the matrix is positive semi-definite
        filename = dir + "/" + rsID + ".ld" #dir comes from the sourced config file
        np.savetxt(filename, ld, delimiter=",") #write out file
    else: #if matrix is not positive semi-definite, the LD matrix is not saved
        print("LD matrix is not positive semi-definite")

1 change: 0 additions & 1 deletion project_spring_2020/sample_file.py

This file was deleted.

45 changes: 45 additions & 0 deletions scripts/SuSiE.R
@@ -0,0 +1,45 @@
library(tidyverse)
source("scripts/config.sh") #load in configs (the file's plain var=value lines also parse as R)

x <- read_csv(paste0(dir, "/", rsID, "_susie.ld"), col_names = FALSE)
sumstats <- read_csv(paste0(dir, "/", rsID, "_susie_sumstats.csv"))
#SNP info from the plink bim file (chr, id, cM, pos, allele1, allele2)
ld_info <- read_tsv(paste0(dir, "/", rsID, ".bim"),
                    col_names = c("chr", "id", "cm", "pos", "alt", "ref"))

crediblesets <- function(causals){
  susie_output <- susieR::susie_rss(z = sumstats$Z_fixed, #p vector of z scores
                                    R = as.matrix(x), #LD matrix
                                    L = causals, #number of causals
                                    check_z = F, #saving computation since we check PSD earlier
                                    estimate_residual_variance = T,
                                    estimate_prior_variance = F,
                                    max_iter = 100000) #iterations based on CAVIAR
  cs <- map_df(susie_output$sets$cs, ~as.data.frame(.x), .id = "susie_set")
  if (nrow(sumstats) == nrow(ld_info)) {
    cs <- ld_info %>%
      select(pos) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = c("row" = ".x"))
    crediblesets <- as.data.frame(susie_output$pip) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = "row") %>%
      inner_join(ld_info, by = "pos") %>%
      select(susie_set, row, `susie_output$pip`, pos, chr, id, ref, alt)
    colnames(crediblesets) <- paste(colnames(crediblesets), causals, sep = "_")
    print(crediblesets)
  } else {
    cs <- ld_info %>%
      filter(pos > start & pos < end) %>%
      select(pos) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = c("row" = ".x"))
    crediblesets <- as.data.frame(susie_output$pip) %>%
      rowid_to_column("row") %>%
      inner_join(cs, by = "row") %>%
      inner_join(ld_info, by = "pos") %>%
      select(susie_set, row, `susie_output$pip`, pos, chr, id, ref, alt)
    colnames(crediblesets) <- paste(colnames(crediblesets), causals, sep = "_")
    print(crediblesets)
  }
}

crediblesets(causals) #causals (the L parameter) must be set before running
7 changes: 7 additions & 0 deletions scripts/config.sh
@@ -0,0 +1,7 @@
#set each of these variables to reflect your region of interest
chr=21 #chr
start=42643496 #starting bp
end=42843496 #ending bp
rsID="rs408825" #rsID for filenames
n=5000 #sample size
dir="/Users/sowardsha/Documents/data" #your data directory path
3 changes: 3 additions & 0 deletions scripts/dapg.sh
@@ -0,0 +1,3 @@
#!/bin/bash
source scripts/config.sh #sourcing the config file so that configs are used in file creation
dap-g -d_z ${rsID}_dapg_sumstats.csv -d_ld ${rsID}_dapg.ld
11 changes: 11 additions & 0 deletions scripts/data.sh
@@ -0,0 +1,11 @@
#!/bin/bash
source scripts/config.sh #sourcing the config file so that configs are used in file creation
unzip /data/Brown_lab/vQTL/LD_reference/UKBB/transfer_1599580_files_b3b70061.zip #unzip (decompress into multiple file components); this and the next step must be done for plink to run
gunzip UKBB* #gunzip (decompress files)
module load plink #load plink
#running plink to pull $rsID.bim/.fam/.bed files on the relevant chr:start-end region
plink --bfile UKBB_LD_bestguess_ref_panel_maf1e3_rsq_point3_HRC_SNPs_only --chr $chr --from-bp $start --to-bp $end --make-bed --out $rsID
mv $rsID.* $dir #moving files to the indicated project directory
sumstats="MEL_all_RSQ0.5_${chr}_for_TWAS.txt.gz" #initializing filename for sumstats
mv /data/Brown_lab/PAINTOR/dataset/GWAS_Meta/$sumstats $dir #moving sumstats to the project directory
echo "The configured region is on chr $chr, starting at position $start, and ending at position $end. $rsID.bed/.fam/.bim and MEL_all_RSQ0.5_${chr}_for_TWAS.txt.gz are now available in $dir."
15 changes: 15 additions & 0 deletions scripts/ld.py
@@ -0,0 +1,15 @@
exec(open("scripts/source.py").read()) #shared imports and configs (run from the repository root)
from RSR.PLINKtoLD import PLINKtoLD #package functions
from RSR.eig_check import eig_check
from RSR.write_ld import write_ld

#reads in data, pulling from the config file
prefix = dir + "/" + rsID
(bim, fam, bed) = read_plink(prefix)

#creates the LD matrix from the bed file
ld = PLINKtoLD(bed, n)

#checks that the matrix is positive semi-definite
eig_check(ld)

#writes out the LD matrix
write_ld(ld, rsID)
6 changes: 6 additions & 0 deletions scripts/source.py
@@ -0,0 +1,6 @@
import pandas as pd
import numpy as np
from pandas_plink import read_plink
from statistics import *

exec(open("scripts/config.sh").read()) #this reads in the config file
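The exec trick works because the config file contains only `var=value` assignments and `#` comments, which happen to be valid Python as well as valid shell. A self-contained illustration, with config text inlined instead of read from disk (values taken from scripts/config.sh):

```python
# a few lines in the same shape as scripts/config.sh
config_text = '''
chr=21 #chr
start=42643496 #starting bp
rsID="rs408825" #rsID for filenames
'''

namespace = {}
exec(config_text, namespace)  # each var=value line runs as a Python assignment
print(namespace["rsID"], namespace["start"])  # rs408825 42643496
```

The same property is what lets SuSiE.R load the file via `source()`; it holds only as long as there are no spaces around `=` and the values stay simple literals.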
9 changes: 0 additions & 9 deletions tests/sample_test.py

This file was deleted.