Skip to content

awfderry/PARSE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PARSE

Protein Annotation by Residue-Specific Enrichment: a site-based statistical method for interpretable protein function annotation

📦 Install Dependencies

We recommend installing all dependencies using the provided environment.yml file in a conda environment. This ensures version compatibility and avoids issues related to binary incompatibilities like GLIBCXX or CXXABI.

✅ Step 1: Create and activate the environment

conda env create -f environment.yml
conda activate parse

If you're on a system with an older version of libstdc++.so.6, you may need to load a newer GCC version to avoid runtime errors (e.g., GLIBCXX_3.4.26 not found):

module load gcc/10.3.0
export LD_LIBRARY_PATH=/path/to/gcc/10.3.0/lib64:$LD_LIBRARY_PATH

You can find the correct path via:

find $(dirname $(which gcc))/.. -name "libstdc++.so.6"

This environment supports CUDA 11.7 for GPU execution. If you are running on CPU only, you can comment out or remove the cudatoolkit=11.7 line in environment.yml.

Check if GPU is available:

python -c "import torch; print(torch.cuda.is_available())"

If you prefer a scripted setup:

./install_dependencies.sh

This handles environment creation and optional compiler module loading.

To generate embeddings, GVP-PyTorch is also necessary. To ensure compatibility,

git clone https://github.com/drorlab/gvp-pytorch.git

upon entering the credentials, navigate to

cd gvp-pytorch
vim setup.py

and change 'sklearn' to 'scikit-learn'. Then

pip install .

to complete the setup.

Download data

Download CSA reference data and precomputed embeddings for AlphaFoldDB datasets (human and dark proteomes) from Zenodo.

CSA data (required):

  • csa_function_sets_nn.pkl: function sets dictionary
  • csa_site_db_nn.pkl: embeddings database
  • function_score_dists.pkl: function-specific background distributions

AlphaFoldDB precomputed embeddings (optional):

  • af2_human_lmdb.zip: AF2 human proteome embeddings in LMDB format
  • af2_dark_hernandez_lmdb.zip: AF2 dark proteome embeddings from Barrio-Hernandez et al. in LMDB format
  • af2_dark_durairaj_lmdb.zip: AF2 human proteome embeddings from Durairaj et al. in LMDB format

Run PARSE on a single PDB

Run the following to annotate a protein using standard CSA reference database.

From PDB file:

python predict.py --pdb PDB_PATH

From pre-computed embedding database, given a PDB id:

python predict.py --precomputed_id PDB_ID --precomputed_lmdb LMDB_PATH

Optional arguments:

--chain CHAIN: annotate only a specified chain
--db DB_PATH: reference database embeddings, in pickle format
--function_sets FN_PATH: reference database function sets, in pickle format
--background BKG_PATH: function-specific background distributions, in pickle format
--cutoff CUTOFF: FDR cutoff for reporting results (default 0.001)
--use_gpu : flag for running with GPU

Create embedding database from directory of PDB files

To run PARSE on a large number of PDB files, you can create a pre-computed embedding database in LMDB format for all pdb files in a directory (including subdirectories). Valid filetypes include pdb, pdb.gz, ent, ent.gz, cif

For small datasets, run without optional split arguments:

python embed_pdb_dataset.py PDB_DIR OUT_LMDB_DIR --filetype=pdb

For large datasets (e.g. Swissprot, AlphaFoldDB), we recommend processing in parallel using the num_splits argument.

python embed_pdb_dataset.py PDB_DIR OUT_LMDB_DIR --split_id=$i --num_splits=NUM_SPLITS --filetype=pdb

This produces NUM_SPLITS (e.g. 20) tmp files in OUT_LMDB_DIR. To combine all into the full dataset, run the following:

python -m atom3d.datasets.scripts.combine_lmdb OUT_LMDB_DIR/tmp_* OUT_LMDB_DIR/full

To run PARSE on each protein in the embedding database:

python run_parse_lmdb.py --dataset=LMDB_DIR

Optional arguments are the same as predict.py, with additional split arguments for parallel processing: --db DB_PATH: reference database embeddings, in pickle format --function_sets FN_PATH: reference database function sets, in pickle format --background BKG_PATH: function-specific background distributions, in pickle format --cutoff CUTOFF: FDR cutoff for reporting results (default 0.001) --split_id SPLIT: split id (int) representing index from 0 to NUM_SPLITS-1 --num_splits NUM_SPLITS: number of splits for parallel processing

Create new reference database

To create a new reference database of functional sites, first generate a csv file with the following columns (see data/csa_functional_sites.csv for example):

site: function ID from source database (e.g. M-CSA 993)
pdb: pdb and chain (e.g. 2j9hA)
locs: list of residue ids which are functionally important
source: database source (e.g. M-CSA; used in case of more than one source database)
description: text description of function (e.g. Glutathione S-transferase A)

Then, create embedding database using the following, where EMBEDDING_OUT is the generated .pkl file containing DB embeddings and FUNCSET_OUT is the generated .pkl file containing function sets:

python create_reference_database.py DATABASE.csv EMBEDDING_OUT FUNCSET_OUT

About

Protein Annotation by Residue-Specific Enrichment: a site-based statistical method for interpretable protein function annotation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors