Associated data files are available on Zenodo: https://doi.org/10.5281/zenodo.18332051
This repository contains code for GATSBI, a framework for learning protein embeddings using Graph Attention Networks (GATs) and biologically informed data splits. The learned embeddings are evaluated across multiple downstream biological tasks, including interaction prediction, function prediction, and pathway-level inference.
Overview
GATSBI learns node embeddings on protein–protein interaction (PPI) graphs using:
- Graph Attention Networks (GATs) for message passing
- ESM protein language model embeddings for initialization
- Degree-matched negative sampling for robust link prediction
- Leakage-aware node and edge splits, including sequence-similarity constraints

The pipeline supports both node-split and edge-split training, followed by task-specific evaluation.
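To make the degree-matched negative sampling idea listed above concrete, here is a minimal sketch. It is not taken from the GATSBI code: the function name `degree_matched_negatives`, the `tolerance` parameter, and the choice to corrupt only the second endpoint of each positive edge are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def degree_matched_negatives(edges, seed=0, tolerance=1):
    """For each positive edge (u, v), draw a non-edge (u, w) with deg(w) close to deg(v).

    Hypothetical helper for illustration; the repository may match degrees differently.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    nodes = sorted(degree)

    negatives = []
    for u, v in edges:
        # Candidate partners: any node other than u that is not already linked to u.
        candidates = [w for w in nodes if w != u and w not in neighbors[u]]
        if not candidates:
            continue  # u is connected to every other node, so no negative exists
        # Prefer candidates whose degree is within `tolerance` of deg(v).
        close = [w for w in candidates if abs(degree[w] - degree[v]) <= tolerance]
        if not close:
            # Fall back to the candidates with the nearest degree.
            best = min(abs(degree[w] - degree[v]) for w in candidates)
            close = [w for w in candidates if abs(degree[w] - degree[v]) == best]
        negatives.append((u, rng.choice(close)))
    return negatives
```

Matching on degree keeps a link predictor from separating positives and negatives simply by noticing that hub proteins appear more often in true edges. This sketch scans all nodes for every edge and may return duplicate negatives, so it is only suitable for small graphs.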
Repository Structure
├── GATSBI_data_split.py # Node splitting with sequence-similarity constraints
├── GATSBI_node_embed.py # Node-split GAT training
├── GATSBI_edge_embed.py # Edge-split GAT training
├── eval_node_pred.py # Protein function (EC) prediction
├── eval_interaction_pred.py # Protein–protein interaction prediction
├── eval_set_prediction.py # Pathway / protein-set prediction
├── pinnacle.py # PINNACLE embedding post-processing
├── requirements.yml # Conda environment specification
├── README.md
└── temp/ # Intermediate outputs (optional)
We recommend using Conda.
conda env create -f requirements.yml
conda activate <env_name>
You will need the following inputs:
1. Protein–protein interaction graph
   - NetworkX .gpickle format (for node splitting)
2. Protein sequence embeddings
   - Pickle file mapping UniProt ID → ESM embedding (1280-dim)
3. Sequence similarity matrix
   - NumPy .npy matrix for similarity-aware node splits
4. Downstream task annotations
   - Enzyme Commission (EC) annotations (TSV)
   - BioGRID interaction file (TSV)
   - Reactome pathway → protein-set mappings (pickle)
The script GATSBI_data_split.py performs leakage-aware graph splitting and produces both node splits and edge splits from a single input graph.
Key features:
- Sequence-similarity–aware node splitting
- Deterministic hashing for reproducibility
- Induced subgraph generation
- Standard train/val/test edge splits
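The combination of sequence-similarity-aware splitting and deterministic hashing can be sketched as follows. This is an illustration under stated assumptions, not the logic of GATSBI_data_split.py: it assumes proteins whose similarity is at or above the threshold are grouped by single linkage, that each group is assigned as a whole to one split, and that the split fractions are 80/10/10. The function names are hypothetical.

```python
import hashlib

def similarity_clusters(ids, sim, threshold=0.30):
    """Group proteins linked by similarity >= threshold (single linkage, union-find)."""
    parent = list(range(len(ids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sim[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i, pid in enumerate(ids):
        clusters.setdefault(find(i), []).append(pid)
    return list(clusters.values())

def assign_split(cluster, val_frac=0.1, test_frac=0.1):
    """Hash the smallest ID in the cluster to [0, 1) and bucket it (assumed fractions)."""
    key = min(cluster).encode("utf-8")
    h = int(hashlib.sha256(key).hexdigest(), 16) / 16**64
    if h < test_frac:
        return "test"
    if h < test_frac + val_frac:
        return "val"
    return "train"

def split_nodes(ids, sim, threshold=0.30):
    """Assign every similarity cluster, as a unit, to train, val, or test."""
    splits = {"train": [], "val": [], "test": []}
    for cluster in similarity_clusters(ids, sim, threshold):
        splits[assign_split(cluster)].extend(cluster)
    return splits
```

Because the assignment depends only on a hash of the protein IDs, rerunning the split yields the same partition without storing a random seed, and similar sequences never straddle the train/test boundary.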
Running GATSBI_data_split.py generates:
- Node split
  - node_split_train_induced.edgelist.gz
  - node_split_val_induced.edgelist.gz
  - node_split_test_induced.edgelist.gz
- Edge split
  - edge_split_val.edgelist.gz
  - edge_split_test.edgelist.gz
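For readers unfamiliar with induced subgraphs, the sketch below shows how the edges of a node split could be derived and stored as a gzipped edge list. The helper names are hypothetical, and the two-column, whitespace-separated file layout is an assumption; the files written by GATSBI_data_split.py may carry extra columns or edge attributes.

```python
import gzip

def induced_edges(edges, nodes):
    """Keep only edges whose two endpoints are both in `nodes`."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def write_edgelist_gz(path, edges):
    """Write one 'u v' pair per line (assumed layout) to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for u, v in edges:
            fh.write(f"{u} {v}\n")

def read_edgelist_gz(path):
    """Read the first two whitespace-separated fields of each non-empty line."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [tuple(line.split()[:2]) for line in fh if line.strip()]
```

Taking the induced subgraph of each node partition drops every edge that crosses partitions, which is what prevents a test protein from being seen through its training neighbors.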
Running the Split Script
python GATSBI_data_split.py \
    --graph_path data/ppi_graph.gpickle \
    --seq_matrix data/seq_similarity.npy \
    --protein_list data/protein_ids.npy \
    --out_dir splits/ \
    --similarity_threshold 0.30
Choose one of the following training modes.
Node-Split Training (Inductive)
python GATSBI_node_embed.py \
    --split_dir splits/ \
    --esm_path data/esm_uniprot_vec.pkl \
    --out_dir outputs/node_split
Edge-Split Training (Transductive)
python GATSBI_edge_embed.py \
    --split_dir splits/ \
    --esm_path data/esm_uniprot_vec.pkl \
    --out_dir outputs/edge_split
Protein Function (EC) Prediction
python eval_node_pred.py \
    --embeddings outputs/node_split/gat_node_embeddings.pkl \
    --ec_tsv data/uniprot_ec.tsv \
    --out_dir results/function_pred
- Multi-label classification
- Macro ROC-AUC and AUPRC
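As a reminder of what macro averaging means in the multi-label setting, the following sketch computes macro ROC-AUC in plain Python using the pairwise (Mann-Whitney) definition. It is illustrative only and covers ROC-AUC, not AUPRC; the evaluation script may rely on a library implementation and may treat labels with a single class differently. Skipping such labels is an assumption made here.

```python
def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_roc_auc(Y_true, Y_score):
    """Average per-label AUC; rows are proteins, columns are labels (e.g. EC classes)."""
    aucs = []
    for col_true, col_score in zip(zip(*Y_true), zip(*Y_score)):
        auc = roc_auc(col_true, col_score)
        if auc is not None:  # assumption: labels lacking both classes are skipped
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else float("nan")
```

Macro averaging gives every label the same weight, so rare EC classes influence the score as much as common ones. The pairwise loop is quadratic in the number of proteins and is meant for explanation rather than large-scale evaluation.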
Protein–Protein Interaction Prediction
python eval_interaction_pred.py \
    --biogrid data/BIOGRID-ALL.tsv \
    --embeddings outputs/node_split/gat_node_embeddings.pkl \
    --out_dir results/ppi_pred
- Binary edge classification
- ROC and precision–recall curves
Pathway / Protein-Set Prediction
python eval_set_prediction.py \
    --embeddings outputs/node_split/gat_node_embeddings.pkl \
    --pathways data/pathway_to_proteins.pkl \
    --out_dir results/pathway_pred
- Attention-based pooling over protein sets
- Corrupted-positive negative sampling
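The two ideas named above can be sketched in a few lines. This is a hedged illustration, not the code in eval_set_prediction.py: it assumes dot-product attention against a single query vector (which would be a learned parameter in a real model) and assumes a negative set is built by swapping a fraction of a true pathway's members for proteins outside it. Both function names and the `frac` parameter are hypothetical.

```python
import math
import random

def attention_pool(embeddings, query):
    """Softmax-weighted average of member embeddings, scored by dot product with `query`."""
    scores = [sum(e_d * q_d for e_d, q_d in zip(e, query)) for e in embeddings]
    top = max(scores)
    exp = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exp)
    weights = [x / total for x in exp]
    pooled = [
        sum(w * e[d] for w, e in zip(weights, embeddings))
        for d in range(len(query))
    ]
    return pooled, weights

def corrupt_set(protein_set, universe, frac=0.5, seed=0):
    """Build a negative set by replacing a fraction of members with outside proteins."""
    rng = random.Random(seed)
    members = sorted(protein_set)
    outsiders = sorted(set(universe) - set(protein_set))
    k = min(max(1, round(frac * len(members))), len(outsiders), len(members))
    dropped = set(rng.sample(members, k))
    added = rng.sample(outsiders, k)
    return (set(members) - dropped) | set(added)
```

Corrupted positives keep the set size and part of the true membership, so the classifier has to judge whether the set is coherent rather than rely on size or on a few well-known proteins.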
Key methods:
- Multi-head Graph Attention Networks
- Degree-matched negative sampling
- Label smoothing for link prediction
- Sequence-similarity–aware splitting
- Centroid-normalized ESM initialization
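Label smoothing for link prediction, listed above, can be illustrated with a small sketch. The smoothing value `eps=0.1` and the helper names are assumptions; the training scripts may use a different value or a framework-provided loss.

```python
import math

def smooth_labels(labels, eps=0.1):
    """Pull hard 0/1 targets toward 0.5: 1 -> 1 - eps/2 and 0 -> eps/2."""
    return [y * (1.0 - eps) + 0.5 * eps for y in labels]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, written in a numerically stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

For example, `bce_with_logits(edge_logits, smooth_labels(edge_labels))` penalizes extremely confident edge scores. That can help when some sampled negatives are in fact unobserved true interactions.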
Released under the MIT License.