Protein Annotation by Residue-Specific Enrichment: a site-based statistical method for interpretable protein function annotation
We recommend installing all dependencies using the provided environment.yml file in a conda environment. This ensures version compatibility and avoids issues related to binary incompatibilities like GLIBCXX or CXXABI.
conda env create -f environment.yml
conda activate parse
If you are on a system with an older version of libstdc++.so.6, you may need to load a newer GCC version to avoid runtime errors (e.g., GLIBCXX_3.4.26 not found):
module load gcc/10.3.0
export LD_LIBRARY_PATH=/path/to/gcc/10.3.0/lib64:$LD_LIBRARY_PATH
You can find the correct path via:
find $(dirname $(which gcc))/.. -name "libstdc++.so.6"
This environment supports CUDA 11.7 for GPU execution. If you are running on CPU only, you can comment out or remove the cudatoolkit=11.7 line in environment.yml.
Check if GPU is available:
python -c "import torch; print(torch.cuda.is_available())"
If you prefer a scripted setup:
./install_dependencies.sh
This handles environment creation and optional compiler module loading.
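If the GLIBCXX runtime error mentioned above appears, it can help to confirm which GLIBCXX versions a given libstdc++.so.6 actually provides. The following is a minimal sketch, not part of PARSE: it scans the library file for version strings, much like running strings on it and filtering for GLIBCXX. The function names are hypothetical, and the path passed in is whatever the find command above reports.

```python
import re
from pathlib import Path


def glibcxx_versions(lib_path):
    """Return the sorted GLIBCXX versions (as int tuples) found in a libstdc++ file."""
    data = Path(lib_path).read_bytes()
    # Version symbols are stored as plain ASCII, e.g. b"GLIBCXX_3.4.26".
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)+)", data))
    return sorted(tuple(int(part) for part in v.decode().split(".")) for v in found)


def provides_glibcxx(lib_path, required="3.4.26"):
    """Check whether the library advertises the required GLIBCXX version."""
    needed = tuple(int(part) for part in required.split("."))
    return needed in glibcxx_versions(lib_path)
```

For example, provides_glibcxx("/path/to/gcc/10.3.0/lib64/libstdc++.so.6") returning False would suggest that LD_LIBRARY_PATH still points at a library that is too old.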
To generate embeddings, GVP-PyTorch is also required. To ensure compatibility, clone the repository:
git clone https://github.com/drorlab/gvp-pytorch.git
After entering your credentials (if prompted), navigate to the repository and open setup.py:
cd gvp-pytorch
vim setup.py
Change 'sklearn' to 'scikit-learn', then run
pip install .
to complete the setup.
Download CSA reference data and precomputed embeddings for AlphaFoldDB datasets (human and dark proteomes) from Zenodo.
CSA data (required):
csa_function_sets_nn.pkl: function sets dictionary
csa_site_db_nn.pkl: embeddings database
function_score_dists.pkl: function-specific background distributions
AlphaFoldDB precomputed embeddings (optional):
af2_human_lmdb.zip: AF2 human proteome embeddings in LMDB format
af2_dark_hernandez_lmdb.zip: AF2 dark proteome embeddings from Barrio-Hernandez et al. in LMDB format
af2_dark_durairaj_lmdb.zip: AF2 dark proteome embeddings from Durairaj et al. in LMDB format
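After downloading, the CSA pickle files can be opened to confirm they are intact. The sketch below is only an assumption about how one might inspect them: the helper names are hypothetical, and it reports just the top-level type and size because the inner layout of each object is not documented here. Loading may also require the project environment to be active if the pickles contain library-specific objects, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load one pickled reference object from disk."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Describe an object by type name and length, without assuming its contents."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size


# Example usage (file name taken from the list above):
# function_sets = load_pickle("csa_function_sets_nn.pkl")
# print(summarize(function_sets))
```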
Run one of the following commands to annotate a protein using the standard CSA reference database.
From PDB file:
python predict.py --pdb PDB_PATH
From a pre-computed embedding database, given a PDB ID:
python predict.py --precomputed_id PDB_ID --precomputed_lmdb LMDB_PATH
Optional arguments:
--chain CHAIN: annotate only a specified chain
--db DB_PATH: reference database embeddings, in pickle format
--function_sets FN_PATH: reference database function sets, in pickle format
--background BKG_PATH: function-specific background distributions, in pickle format
--cutoff CUTOFF: FDR cutoff for reporting results (default 0.001)
--use_gpu : flag for running with GPU
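To call predict.py from another script, the options above can be assembled into an argument list. This is a minimal sketch using only flags listed in this README; the helper name build_predict_command and the file name in the usage comment are hypothetical, and the command is expected to be run from the repository root.

```python
import sys


def build_predict_command(pdb_path, chain=None, cutoff=None, use_gpu=False):
    """Assemble a predict.py invocation as an argument list for subprocess."""
    cmd = [sys.executable, "predict.py", "--pdb", str(pdb_path)]
    if chain is not None:
        cmd += ["--chain", chain]          # annotate only this chain
    if cutoff is not None:
        cmd += ["--cutoff", str(cutoff)]   # FDR cutoff for reporting
    if use_gpu:
        cmd.append("--use_gpu")            # flag takes no value
    return cmd


# Example usage (hypothetical input file):
# import subprocess
# subprocess.run(build_predict_command("example.pdb", chain="A"), check=True)
```

Passing the command as a list avoids shell quoting issues with unusual file names.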
To run PARSE on a large number of PDB files, you can create a pre-computed embedding database in LMDB format for all structure files of a given type in a directory (including subdirectories). Valid filetypes are pdb, pdb.gz, ent, ent.gz, and cif.
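Before embedding a large directory, it can be useful to count how many files of the chosen type it contains. The sketch below is an illustration only and is not how embed_pdb_dataset.py itself collects files; the helper name find_structure_files is hypothetical, and matching is by file name suffix.

```python
from pathlib import Path

# Filetypes listed as valid in this README.
VALID_FILETYPES = ("pdb", "pdb.gz", "ent", "ent.gz", "cif")


def find_structure_files(root, filetype="pdb"):
    """Recursively list files under root whose names end with the given filetype."""
    if filetype not in VALID_FILETYPES:
        raise ValueError(f"filetype must be one of {VALID_FILETYPES}")
    suffix = "." + filetype
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(suffix)
    )


# Example usage:
# print(len(find_structure_files("PDB_DIR", filetype="pdb")))
```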
For small datasets, run without optional split arguments:
python embed_pdb_dataset.py PDB_DIR OUT_LMDB_DIR --filetype=pdb
For large datasets (e.g. Swissprot, AlphaFoldDB), we recommend processing in parallel using the num_splits argument, running one job per split with $i set to the split index (0 to NUM_SPLITS-1):
python embed_pdb_dataset.py PDB_DIR OUT_LMDB_DIR --split_id=$i --num_splits=NUM_SPLITS --filetype=pdb
This produces NUM_SPLITS (e.g. 20) tmp files in OUT_LMDB_DIR. To combine all into the full dataset, run the following:
python -m atom3d.datasets.scripts.combine_lmdb OUT_LMDB_DIR/tmp_* OUT_LMDB_DIR/full
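The per-split embedding commands described above can be generated programmatically, for example to submit one job per split to a scheduler. This is a minimal sketch under the assumption that each split is launched as an independent process; the helper name build_embed_commands is hypothetical, while the script name and flags are those shown above.

```python
def build_embed_commands(pdb_dir, out_lmdb_dir, num_splits, filetype="pdb"):
    """Return one embed_pdb_dataset.py argument list per split index."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    return [
        [
            "python", "embed_pdb_dataset.py", str(pdb_dir), str(out_lmdb_dir),
            f"--split_id={i}",              # split indices run from 0 to num_splits - 1
            f"--num_splits={num_splits}",
            f"--filetype={filetype}",
        ]
        for i in range(num_splits)
    ]


# Example usage: print the commands for 20 splits, one per line.
# for cmd in build_embed_commands("PDB_DIR", "OUT_LMDB_DIR", 20):
#     print(" ".join(cmd))
```

Once every split has finished, the combine_lmdb command above merges the tmp files into the full dataset.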
To run PARSE on each protein in the embedding database:
python run_parse_lmdb.py --dataset=LMDB_DIR
Optional arguments are the same as for predict.py, with additional split arguments for parallel processing:
--db DB_PATH: reference database embeddings, in pickle format
--function_sets FN_PATH: reference database function sets, in pickle format
--background BKG_PATH: function-specific background distributions, in pickle format
--cutoff CUTOFF: FDR cutoff for reporting results (default 0.001)
--split_id SPLIT: split id (int) representing index from 0 to NUM_SPLITS-1
--num_splits NUM_SPLITS: number of splits for parallel processing
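The --cutoff option is described as an FDR cutoff. How PARSE computes its FDR values is not specified in this README. Purely as an illustration of what thresholding at a false discovery rate means, the sketch below applies the standard Benjamini-Hochberg procedure to a list of p-values. The function name and inputs are hypothetical and are not part of PARSE.

```python
def benjamini_hochberg(p_values, alpha=0.001):
    """Return a list of booleans marking which p-values pass FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    accepted = set(order[:max_rank])
    return [i in accepted for i in range(m)]


# Example usage with made-up p-values:
# benjamini_hochberg([0.0001, 0.5, 0.0004, 0.9], alpha=0.001)
```

A smaller alpha reports fewer, more confident results; a larger one reports more results at the cost of more expected false positives.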
To create a new reference database of functional sites, first generate a CSV file with the following columns (see data/csa_functional_sites.csv for an example):
site: function ID from source database (e.g. M-CSA 993)
pdb: pdb and chain (e.g. 2j9hA)
locs: list of residue ids which are functionally important
source: database source (e.g. M-CSA; used in case of more than one source database)
description: text description of function (e.g. Glutathione S-transferase A)
Then, create the embedding database using the following command, where EMBEDDING_OUT is the generated .pkl file containing DB embeddings and FUNCSET_OUT is the generated .pkl file containing function sets:
python create_reference_database.py DATABASE.csv EMBEDDING_OUT FUNCSET_OUT
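Before running create_reference_database.py, it may be worth checking that the CSV header contains every column listed above. The following is a minimal sketch, assuming a standard comma-separated file with a header row; the helper name missing_columns is hypothetical, and the exact value formats (for example how locs is serialized) should follow data/csa_functional_sites.csv.

```python
import csv

# Column names required by the reference database CSV, as listed in this README.
REQUIRED_COLUMNS = ("site", "pdb", "locs", "source", "description")


def missing_columns(csv_path):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Example usage:
# problems = missing_columns("DATABASE.csv")
# if problems:
#     raise SystemExit(f"Missing columns: {problems}")
```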