HMM for learning chromatin organization

Hidden Markov models for learning chromatin organization.

Features

Discrete and continuous Hidden Markov Model
Score functions to evaluate genome segmentation, based on different assumptions:
enrichment of chromatin modifications
consistency between different samples
gene breaks - number of genes that fall between domains
Various scripts for downloading, transforming and integrating data from different sources
Performance and portability: works under 64-bit Linux and Windows. The project includes some Cython code for critical code paths: a parser for bedGraph files (parsing a ~500 MB file in ~10 sec) and some HMM algorithms (Viterbi)
Create custom UCSC hubs for trained models
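
As a rough illustration of the Viterbi decoding mentioned above, here is a minimal pure-Python sketch for a discrete HMM. It is not the project's Cython implementation: the two states ("closed", "open"), the two observation symbols ("low", "high") and all probabilities are made-up values chosen only for the example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path for a discrete-emission HMM (log space)."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this step
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical two-state model: values are illustrative, not trained parameters
states = ("closed", "open")
log_start = {"closed": math.log(0.5), "open": math.log(0.5)}
log_trans = log_table({"closed": {"closed": 0.8, "open": 0.2},
                       "open": {"closed": 0.2, "open": 0.8}})
log_emit = log_table({"closed": {"low": 0.8, "high": 0.2},
                      "open": {"low": 0.2, "high": 0.8}})

signal = ["low", "low", "high", "high", "high", "low", "low"]
segmentation = viterbi(signal, states, log_start, log_trans, log_emit)
print(segmentation)
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale tracks.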

Installation and usage

The project is written in Python 3, with some Cython and C in critical code paths. It has no UI, only some command-line tools.

Installation:

Clone the repository
Copy config_default.py to config.py and define the directories to work with
pyx - compile the Cython code using the following command:

python setup.py build_ext --inplace

Fill directories with specific project data/external dependencies:
data - Create a data directory with your data. Recommended public data repositories:
* http://www.ncbi.nlm.nih.gov/epigenomics - Roadmap Epigenomics
* http://hgdownload.cse.ucsc.edu/goldenPath/ - UCSC Genome Browser
bin - Fill with UCSC programs such as bedToBigBed, bigWigToBedGraph and wigToBigWig (these can be downloaded, for example, from http://hgdownload.cse.ucsc.edu/admin/exe/)
results - Create a directory for storing results

The data and results directories may require a lot of disk space, so you may find it convenient to keep them on another drive and link them into the project (ln -s ...). See also Installation FAQ

Getting data

It is recommended to use rsync for getting data from public repositories. To fetch data (bigWig format) from multiple sources you can use:

	cd src
	python3 -m data_provider.dataDownloader download_sources SOURCES_FILE

(where SOURCES_FILE is a file with links to rsync repositories, one repository per line)
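
To make the sources file format concrete, the sketch below reads one repository link per line and builds the corresponding rsync commands without running them. It does not reproduce what dataDownloader does internally; the helper names, the example.org URLs and the data/ destination are placeholders.

```python
import shlex


def read_sources(text):
    """Return one repository link per non-empty line of a sources file."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rsync_command(source, dest_dir):
    """Build (but do not run) an rsync command that copies source into dest_dir."""
    return ["rsync", "-av", source, dest_dir]


# Placeholder links, not real repositories
sources_text = """
rsync://example.org/epigenomics/sample_a/

rsync://example.org/epigenomics/sample_b/
"""

commands = [rsync_command(src, "data/") for src in read_sources(sources_text)]
for cmd in commands:
    print(shlex.join(cmd))
```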

The project uses compressed NumPy arrays (npz format) for storing and manipulating the data. To serialize bigWig files to npz format, run:

    python3 -m data_provider.dataDownloader serialize_dir
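
For readers unfamiliar with npz, here is a small sketch of writing and reading a compressed NumPy archive. The per-chromosome key names and the values are hypothetical; the layout of the files that serialize_dir actually produces is not described here.

```python
import io

import numpy as np

# Hypothetical per-chromosome signal tracks
tracks = {
    "chr1": np.array([0.0, 1.5, 3.0], dtype=np.float32),
    "chr2": np.zeros(4, dtype=np.float32),
}

# Write a compressed archive (an in-memory buffer stands in for a file on disk)
buffer = io.BytesIO()
np.savez_compressed(buffer, **tracks)

# Read it back: each array is stored under its keyword name
buffer.seek(0)
with np.load(buffer) as archive:
    restored = {name: archive[name] for name in archive.files}

print(sorted(restored))
```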

Usage

~"Use the source, Luke" (Obi-Wan Kenobi)

You are welcome to use the source and extend it. Some common tasks are provided as command line tools:

data_provider directory
dataDownloader - Script for downloading and transforming data.
createMeanMarkers - Script for averaging different samples from the same experiment or the same cell type
dnase_classify - Train a model and classify chromatin into open and closed regions
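
The idea behind averaging samples can be sketched in a few lines. This is only an illustration, assuming each sample is an equal-length list of signal values; mean_track is a hypothetical helper, not a function from createMeanMarkers.

```python
from statistics import fmean


def mean_track(samples):
    """Average equal-length signal tracks position by position."""
    lengths = {len(sample) for sample in samples}
    if len(lengths) != 1:
        raise ValueError("need at least one sample, all of the same length")
    return [fmean(values) for values in zip(*samples)]


# Two made-up replicates of the same experiment
replicate_a = [0.0, 2.0, 4.0]
replicate_b = [2.0, 2.0, 0.0]
averaged = mean_track([replicate_a, replicate_b])
print(averaged)
```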

Installation FAQ

Here I keep some annoying problems I encountered and their solutions. You are welcome to suggest more (you can use issues on GitHub).

In general: since the project handles huge files, it is intended to be used in a 64-bit environment. A 32-bit environment may cause memory errors or at least slowness. Verify that your working environment is 64-bit (OS, Python and, for development, also the IDE).
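
One way to check the Python side of this is with the standard library alone. The function names below are just for the example.

```python
import platform
import struct
import sys


def pointer_bits():
    """Size of a C pointer in this Python build, in bits."""
    return struct.calcsize("P") * 8


def is_64bit_python():
    """True when the interpreter itself is a 64-bit build."""
    return sys.maxsize > 2**32


print(f"Python build: {pointer_bits()}-bit")
print(f"Machine type reported by the OS: {platform.machine()}")
```

Note that a 32-bit Python can run on a 64-bit OS, so the interpreter is worth checking separately from the operating system.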

Bin files

Bin files are UCSC programs, and are used for transforming file formats. The required programs can be downloaded using install.sh or manually from http://hgdownload.cse.ucsc.edu/admin/exe/

If you run into problems getting them to work, use this checklist:

"cannot execute binary file"
Is the program compatible with your system? (64-bit?) install.sh downloads the 64-bit versions. If they don't work, find 32-bit versions of the programs.
"error while loading shared libraries: libssl.so.10: cannot open shared object file: No such file or directory"
Get libssl. On Debian-based Linux (such as Ubuntu):

	sudo apt-get install libssl1.0.0 libssl-dev
	sudo ln -s /lib/x86_64-linux-gnu/libssl.so.1.0.0 /usr/lib/libssl.so.10
	sudo ln -s /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 /usr/lib/libcrypto.so.10
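
A quick preliminary check, separate from the two errors above, is whether each program is present in the bin directory and has its executable bit set. The sketch below is a hypothetical helper, not part of the project; the tool names are the ones listed in the installation steps.

```python
import os

# Tool names taken from the installation steps
REQUIRED_TOOLS = ("bedToBigBed", "bigWigToBedGraph", "wigToBigWig")


def check_tools(bin_dir, tools=REQUIRED_TOOLS):
    """Map each tool name to 'ok', 'missing' or 'not executable'."""
    status = {}
    for tool in tools:
        path = os.path.join(bin_dir, tool)
        if not os.path.isfile(path):
            status[tool] = "missing"
        elif not os.access(path, os.X_OK):
            status[tool] = "not executable"
        else:
            status[tool] = "ok"
    return status


if __name__ == "__main__":
    for tool, state in check_tools("bin").items():
        print(f"{tool}: {state}")
```

A file reported as "not executable" can usually be fixed with chmod +x.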

pyx

The pyx directory contains Cython code for optimizing critical code paths.

You can build it using the following:

python setup.py build_ext --inplace

or under MS Windows:

python.exe setup.py build_ext --inplace --compiler=msvc --plat-name=win-amd64

(Tested with VS 2012. To use a newer version of the MSVC compiler you may have to modify msvc9compiler.py in distutils, specifically get_build_version and PLAT_TO_VCVARS, as described in http://www.xavierdupre.fr/blog/2013-07-07_nojs.html)
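
For orientation, the kind of work the Cython bedGraph parser speeds up can be sketched in pure Python. This is not the project's parser, only an illustration of the standard bedGraph layout (chromosome, start, end, value, separated by whitespace, with optional track/browser/comment lines); parse_bedgraph is a hypothetical name.

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples from bedGraph lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and header/comment lines
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


# Made-up example data
example_lines = [
    "track type=bedGraph name=example",
    "chr1 0 100 0.5",
    "chr1 100 250 2.0",
]
records = list(parse_bedgraph(example_lines))
print(records)
```

A line-by-line loop like this is simple but slow on files of several hundred megabytes, which is why this step is a natural candidate for compiled code.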
