GATSBI-embedding

Associated data files are available on Zenodo: https://doi.org/10.5281/zenodo.18332051

This repository contains code for GATSBI, a framework for learning protein embeddings using Graph Attention Networks (GATs) and biologically informed data splits. The learned embeddings are evaluated across multiple downstream biological tasks, including interaction prediction, function prediction, and pathway-level inference.

Overview

GATSBI learns node embeddings on protein–protein interaction (PPI) graphs using:

Graph Attention Networks (GATs) for message passing
ESM protein language model embeddings for initialization
Degree-matched negative sampling for robust link prediction
Leakage-aware node and edge splits, including sequence-similarity constraints

The pipeline supports both node-split and edge-split training, followed by task-specific evaluation.
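
To make the message-passing component concrete, the sketch below computes single-head attention weights in the standard GAT formulation (LeakyReLU scoring followed by a softmax over each node's neighbors). It is an illustration only, not the code in GATSBI_node_embed.py: the function name `gat_attention`, the toy protein IDs, and all parameter values are hypothetical, and the feature vectors are assumed to be already linearly transformed.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_attention(h, neighbors, a_src, a_dst):
    """Return {node: {neighbor: attention weight}} for one attention head.

    h         : dict node -> transformed feature vector (list of floats)
    neighbors : dict node -> list of neighbor nodes
    a_src, a_dst : the two halves of the attention vector a = [a_src || a_dst]
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    alpha = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue  # isolated node: nothing to attend to
        # Unnormalized scores e_ij = LeakyReLU(a^T [h_i || h_j])
        scores = [leaky_relu(dot(a_src, h[i]) + dot(a_dst, h[j])) for j in nbrs]
        # Numerically stable softmax over the neighborhood of i
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha[i] = {j: e / z for j, e in zip(nbrs, exps)}
    return alpha

# Toy graph with made-up features (hypothetical values)
h = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 1.0]}
neighbors = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"]}
alpha = gat_attention(h, neighbors, a_src=[0.5, 0.5], a_dst=[1.0, -1.0])
```

Each node's weights sum to one, so the aggregated message is a convex combination of its neighbors' features. A multi-head layer would repeat this with separate parameters per head.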

Repository Structure

├── GATSBI_data_split.py        # Node and edge splitting with sequence-similarity constraints
├── GATSBI_node_embed.py        # Node-split GAT training
├── GATSBI_edge_embed.py        # Edge-split GAT training
├── eval_node_pred.py           # Protein function (EC) prediction
├── eval_interaction_pred.py    # Protein–protein interaction prediction
├── eval_set_prediction.py      # Pathway / protein-set prediction
├── pinnacle.py                 # PINNACLE embedding post-processing
├── requirements.yml            # Conda environment specification
├── README.md
└── temp/                       # Intermediate outputs (optional)

Installation

We recommend using Conda.

conda env create -f requirements.yml
conda activate <env_name>

Data Requirements

You will need the following inputs:

1. Protein–protein interaction graph

NetworkX .gpickle format (for node splitting)

2. Protein sequence embeddings

Pickle file mapping UniProt ID → ESM embedding (1280-dim)

3. Sequence similarity matrix

NumPy .npy matrix for similarity-aware node splits

4. Downstream task annotations

Enzyme Commission (EC) annotations (TSV)
BioGRID interaction file (TSV)
Reactome Pathway → protein-set mappings (Pickle)
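
Before training, it can help to confirm that the embedding pickle has the expected shape. The sketch below checks that every vector is 1280-dimensional, as stated above. The helper `find_bad_embeddings` and the toy IDs are hypothetical, and the file is assumed to be a plain dict of UniProt ID → sequence of floats; the real file layout may differ.

```python
import pickle

ESM_DIM = 1280  # embedding width stated in the data requirements

def find_bad_embeddings(embeddings, dim=ESM_DIM):
    """Return the sorted IDs whose vectors do not have the expected length."""
    return sorted(pid for pid, vec in embeddings.items() if len(vec) != dim)

# Toy stand-in for the real pickle file; IDs and values are made up.
toy = {"P12345": [0.0] * ESM_DIM, "Q67890": [0.0] * 640}

# Round trip through pickle, mirroring pickle.load(open(path, "rb")) on a real file.
loaded = pickle.loads(pickle.dumps(toy))
bad_ids = find_bad_embeddings(loaded)
```

An empty `bad_ids` list means every protein has a vector of the expected width.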

Step 1: Graph Splitting (Node and Edge Splits)

The script GATSBI_data_split.py performs leakage-aware graph splitting and produces both node splits and edge splits from a single input graph.

Key features:

Sequence-similarity–aware node splitting
Deterministic hashing for reproducibility
Induced subgraph generation
Standard train/val/test edge splits
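
The first two features can be sketched together: proteins are grouped into similarity clusters, and each cluster is assigned to a split by hashing a stable key. This is a minimal illustration under stated assumptions, not the logic of GATSBI_data_split.py. Single-linkage clustering, the `>=` comparison, the split fractions, and the names `cluster_by_similarity` and `assign_split` are all hypothetical choices.

```python
import hashlib

def cluster_by_similarity(ids, sim, threshold):
    """Union-find: proteins with similarity >= threshold end up in one cluster."""
    parent = list(range(len(ids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, pid in enumerate(ids):
        clusters.setdefault(find(i), []).append(pid)
    return list(clusters.values())

def assign_split(cluster, val_frac=0.1, test_frac=0.1):
    """Hash the smallest ID in the cluster so the assignment is stable across runs.

    hashlib is used because Python's built-in hash() of strings is salted per process.
    """
    key = min(cluster).encode("utf-8")
    bucket = (int(hashlib.sha256(key).hexdigest(), 16) % 1000) / 1000.0
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"

# Toy similarity matrix (made-up values): A~B and C~D exceed the 0.30 threshold.
ids = ["A", "B", "C", "D"]
sim = [[1.0, 0.9, 0.1, 0.0],
       [0.9, 1.0, 0.2, 0.0],
       [0.1, 0.2, 1.0, 0.35],
       [0.0, 0.0, 0.35, 1.0]]

split_of = {}
for cluster in cluster_by_similarity(ids, sim, threshold=0.30):
    label = assign_split(cluster)
    for pid in cluster:
        split_of[pid] = label
```

Because whole clusters move together, no two proteins above the similarity threshold can land in different splits, which is the leakage the split is meant to prevent.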

Running GATSBI_data_split.py generates:

Node split

node_split_train_induced.edgelist.gz
node_split_val_induced.edgelist.gz
node_split_test_induced.edgelist.gz

Edge split

edge_split_val.edgelist.gz
edge_split_test.edgelist.gz
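
For orientation, the sketch below shows what an induced edge list is (only edges with both endpoints inside one node split) and how a gzipped edge list can be written and read back with the standard library. The on-disk format is an assumption: one whitespace-separated "u v" pair per line. The helper names are hypothetical, and the actual files may carry extra columns.

```python
import gzip
import os
import tempfile

def induced_edges(edges, nodes):
    """Keep only edges whose two endpoints are both in `nodes`."""
    nodes = set(nodes)
    return [(u, v) for u, v in edges if u in nodes and v in nodes]

def write_edgelist_gz(path, edges):
    # One "u v" pair per line, written as gzip-compressed text.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for u, v in edges:
            fh.write(f"{u} {v}\n")

def read_edgelist_gz(path):
    # Take the first two columns and ignore blank lines.
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [tuple(line.split()[:2]) for line in fh if line.strip()]

# Toy graph (made-up node names): D is not a training node, so C-D is dropped.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
train_nodes = {"A", "B", "C"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "node_split_train_induced.edgelist.gz")
    write_edgelist_gz(path, induced_edges(edges, train_nodes))
    round_trip = read_edgelist_gz(path)
```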

Running the Split Script

python GATSBI_data_split.py \
  --graph_path data/ppi_graph.gpickle \
  --seq_matrix data/seq_similarity.npy \
  --protein_list data/protein_ids.npy \
  --out_dir splits/ \
  --similarity_threshold 0.30

Step 2: GAT Embedding Training

Choose one of the following training modes.

Node-Split Training (Inductive)

python GATSBI_node_embed.py \
  --split_dir splits/ \
  --esm_path data/esm_uniprot_vec.pkl \
  --out_dir outputs/node_split

Edge-Split Training (Transductive)

python GATSBI_edge_embed.py \
  --split_dir splits/ \
  --esm_path data/esm_uniprot_vec.pkl \
  --out_dir outputs/edge_split

Step 3: Downstream Evaluation Tasks

Protein Function Prediction (EC Level-1)

python eval_node_pred.py \
  --embeddings outputs/node_split/gat_node_embeddings.pkl \
  --ec_tsv data/uniprot_ec.tsv \
  --out_dir results/function_pred

Multi-label classification

Macro ROC-AUC and AUPRC
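
As a reference for how the macro-averaged metric is defined, the sketch below computes ROC-AUC per label as the probability that a positive example outranks a negative one, then averages over labels. It is a pure-Python illustration with made-up scores, not the evaluation code in eval_node_pred.py. Skipping labels that lack either class is an assumption about how degenerate labels are handled.

```python
def roc_auc(labels, scores):
    """ROC-AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a label has only one class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_roc_auc(y_true, y_score):
    """Unweighted mean of the per-label ROC-AUC values that are defined."""
    aucs = []
    for c in range(len(y_true[0])):
        auc = roc_auc([row[c] for row in y_true], [row[c] for row in y_score])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)

# Four proteins, two labels (toy values)
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
macro = macro_roc_auc(y_true, y_score)  # label 0 -> 1.0, label 1 -> 0.75, mean 0.875
```

The pairwise loop is quadratic and only meant for reading. For real data, a library routine such as scikit-learn's `roc_auc_score` with `average="macro"` is the usual choice.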

Protein–Protein Interaction Prediction

python eval_interaction_pred.py \
  --biogrid data/BIOGRID-ALL.tsv \
  --embeddings outputs/node_split/gat_node_embeddings.pkl \
  --out_dir results/ppi_pred

Binary edge classification

ROC and precision–recall curves
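
One common way to turn node embeddings into inputs for a binary edge classifier is to combine the two endpoint vectors with an element-wise (Hadamard) product, which is symmetric in the two proteins. Whether eval_interaction_pred.py uses this operator is not stated here, so treat the sketch below as an assumption. The names `hadamard` and `build_pair_dataset` and the toy embeddings are hypothetical.

```python
def hadamard(u, v):
    """Element-wise product; symmetric, so (a, b) and (b, a) give the same features."""
    return [a * b for a, b in zip(u, v)]

def build_pair_dataset(embeddings, positives, negatives):
    """Return (X, y): one feature row per usable pair, label 1 = interacting."""
    X, y = [], []
    for label, pairs in ((1, positives), (0, negatives)):
        for a, b in pairs:
            # Skip pairs where either protein has no embedding.
            if a in embeddings and b in embeddings:
                X.append(hadamard(embeddings[a], embeddings[b]))
                y.append(label)
    return X, y

# Toy embeddings and pairs (made-up values); P9 has no embedding and is skipped.
emb = {"P1": [1.0, 2.0], "P2": [3.0, 0.5], "P3": [0.0, 1.0]}
X, y = build_pair_dataset(
    emb,
    positives=[("P1", "P2")],
    negatives=[("P1", "P3"), ("P2", "P9")],
)
```

The resulting `X` and `y` can be passed to any binary classifier, whose scores then feed the ROC and precision–recall curves.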

Functional Set Prediction

python eval_set_prediction.py \
  --embeddings outputs/node_split/gat_node_embeddings.pkl \
  --pathways data/pathway_to_proteins.pkl \
  --out_dir results/pathway_pred

Attention-based pooling over protein sets

Corrupted-positive negative sampling
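
The two ideas above can be sketched as follows. Attention pooling summarizes a protein set as a softmax-weighted average of its member embeddings, and a corrupted-positive negative is a real set with some members swapped for proteins outside it. This is a minimal illustration, not the code in eval_set_prediction.py: the query vector would normally be learned rather than fixed, and the corruption fraction, function names, and toy IDs are hypothetical.

```python
import math
import random

def attention_pool(vectors, query):
    """Softmax-weighted average of `vectors`, scored by dot product with `query`."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def corrupt_set(members, universe, frac, rng):
    """Replace roughly `frac` of the members with proteins from outside the set."""
    members = list(members)
    outsiders = sorted(set(universe) - set(members))
    k = min(max(1, round(frac * len(members))), len(outsiders))
    dropped = set(rng.sample(members, k))
    kept = [m for m in members if m not in dropped]
    return kept + rng.sample(outsiders, k)

# Attention pooling on two toy embeddings
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])

# Corrupted-positive negative for a toy pathway (made-up IDs)
pathway = ["P1", "P2", "P3", "P4"]
universe = ["P%d" % i for i in range(1, 9)]
negative = corrupt_set(pathway, universe, frac=0.5, rng=random.Random(0))
```

Corrupted sets keep the size and part of the composition of a real pathway, which makes them harder negatives than sets drawn entirely at random.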

Key Methodological Features

Multi-head Graph Attention Networks
Degree-matched negative sampling
Label smoothing for link prediction
Sequence-similarity–aware splitting
Centroid-normalized ESM initialization
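
Two of the features listed above are easy to show in isolation. The sketch below draws, for each positive edge (u, v), a non-edge (u, w) where w has the same degree as v, and applies standard binary label smoothing to the targets. It illustrates the general techniques under stated assumptions and is not the training code in this repository. Exact degree matching (rather than degree bins), the smoothing value 0.1, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def degree_matched_negatives(edges, rng):
    """Return [((u, v), (u, w)), ...] with deg(w) == deg(v) and (u, w) not an edge.

    Positives with no valid degree-matched partner are skipped.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    by_degree = defaultdict(list)
    for node in sorted(adj):  # sorted for reproducible candidate order
        by_degree[len(adj[node])].append(node)

    pairs = []
    for u, v in edges:
        valid = [w for w in by_degree[len(adj[v])] if w != u and w not in adj[u]]
        if valid:
            pairs.append(((u, v), (u, rng.choice(valid))))
    return pairs

def smooth_label(y, eps=0.1):
    """Binary label smoothing: 1 -> 1 - eps/2, 0 -> eps/2."""
    return y * (1.0 - eps) + 0.5 * eps

# Toy graph (made-up nodes); degrees: A=2, B=2, C=3, D=2, E=2, F=1
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
pairs = degree_matched_negatives(edges, random.Random(0))
targets = [smooth_label(1), smooth_label(0)]
```

Matching on degree stops a classifier from separating positives and negatives by node popularity alone. On sparse degree classes, such as the single degree-3 node here, exact matching can fail, which is why binning degrees is a common relaxation.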

License

Released under the MIT License.
