🕸️ ECLIPSE

ECLIPSE (Exploration of Complex Ligand-Protein Interactions through Learning from Systems-level Heterogeneous Biomedical Knowledge Graphs) is an AI-powered framework for predicting the bioactivity of compound–protein interactions (CPIs). By combining advanced graph modeling, comprehensive biomedical knowledge, and pre-trained embeddings, it uncovers hidden relationships within complex biological networks, offering a practical tool for researchers in drug discovery and computational biology.

ECLIPSE combines:

  • Large-scale heterogeneous biomedical knowledge graphs (KGs): We built an integrated KG capturing entities, including genes, proteins, drugs, compounds, pathways, diseases, and phenotypes, and their multi-layered interactions.
  • Feature embeddings from language and graph models: Each biological entity is represented using learned embeddings, enabling richer context and better predictions.
  • Heterogeneous Graph Transformer (HGT): Unlike standard GNNs, HGT leverages node and edge types with type-specific attention, effectively modeling complex and diverse relationships.
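The type-specific attention described above can be illustrated with a heavily simplified, single-head sketch in plain Python. This is not the ECLIPSE implementation: the function name, the parameter layout, and the relation name `targets` are hypothetical, and the target-specific output projection that a full HGT layer applies before the residual connection is omitted for brevity.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hetero_attention(target, neighbors, params):
    """Update one target node from its typed neighbors.

    target:    (node_type, feature_vector)
    neighbors: list of (node_type, relation, feature_vector)
    params:    'query', 'key', 'value' matrices keyed by node type;
               'rel_att', 'rel_msg' matrices keyed by relation name.
    """
    t_type, t_feat = target
    query = matvec(params["query"][t_type], t_feat)
    scale = math.sqrt(len(query))
    scores, messages = [], []
    for s_type, relation, s_feat in neighbors:
        # Keys and values use the projection of the *source* node type.
        key = matvec(params["key"][s_type], s_feat)
        value = matvec(params["value"][s_type], s_feat)
        # The relation-specific matrix modulates the key before scoring.
        scores.append(dot(matvec(params["rel_att"][relation], key), query) / scale)
        messages.append(matvec(params["rel_msg"][relation], value))
    weights = softmax(scores)  # normalized over all neighbors of the target
    aggregated = [sum(w * m[i] for w, m in zip(weights, messages))
                  for i in range(len(query))]
    # Residual connection back to the target's own representation.
    return [a + h for a, h in zip(aggregated, t_feat)], weights

identity = [[1.0, 0.0], [0.0, 1.0]]
params = {
    "query": {"protein": identity},
    "key": {"compound": identity},
    "value": {"compound": identity},
    "rel_att": {"targets": identity},  # hypothetical relation name
    "rel_msg": {"targets": identity},
}
updated, weights = hetero_attention(
    ("protein", [1.0, 0.0]),
    [("compound", "targets", [0.0, 2.0]), ("compound", "targets", [0.0, 2.0])],
    params,
)
```

The point of the sketch is that every projection is looked up by node type or relation, which is what distinguishes this scheme from a homogeneous GNN that shares one set of weights across all nodes and edges.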

Figure 1: Schematic representation of the ECLIPSE framework. ECLIPSE is a systems-level framework for predicting compound–protein bioactivity. The Integrated Biomedical KG module provides a multi-relational biomedical graph of proteins, compounds, drugs, pathways, phenotypes, and diseases, serving as the structural foundation for representation learning. From this graph, sampled subgraphs are processed in the Node Projection on Sampled Subgraphs module, where type-specific MLP layers project heterogeneous input node features into fixed-size representations. These embeddings are then passed into stacked Heterogeneous Graph Transformer Layers, which apply heterogeneous mutual attention, message passing, and target-specific aggregation with residual connections to generate contextualized node embeddings. Finally, the compound–protein interaction (CPI) Prediction Layer refines the updated compound and protein embeddings through separate MLPs and then combines them, either by vector concatenation followed by a fully connected network or by a dot product, to predict bioactivity values.
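As a rough illustration of the first and last stages of this pipeline, the sketch below projects node features of different sizes into a shared dimension and then scores a compound–protein pair with either prediction head. It is a toy under stated assumptions, not the code in src/model.py: the dimensions, function names, and random weights are hypothetical, a single linear layer stands in for each MLP, and the HGT layers that sit between projection and prediction are left out.

```python
import random

HIDDEN_DIM = 4                               # hypothetical shared embedding size
INPUT_DIMS = {"compound": 6, "protein": 10}  # hypothetical raw feature sizes

def init_linear(in_dim, out_dim, rng):
    """Create a small random weight matrix and a zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def linear(features, layer):
    """Compute W x + b for one feature vector."""
    weight, bias = layer
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weight, bias)]

rng = random.Random(0)
# One projection per node type, so inputs of different sizes share one space.
projections = {t: init_linear(d, HIDDEN_DIM, rng) for t, d in INPUT_DIMS.items()}
# Weights for the concatenation head: 2 * HIDDEN_DIM inputs, one output.
fc_layer = init_linear(2 * HIDDEN_DIM, 1, rng)

def project(node_type, features):
    return linear(features, projections[node_type])

def dot_product_head(compound_vec, protein_vec):
    """Score a pair by the dot product of the two embeddings."""
    return sum(c * p for c, p in zip(compound_vec, protein_vec))

def concat_fc_head(compound_vec, protein_vec):
    """Score a pair by concatenating embeddings and applying a linear layer."""
    return linear(compound_vec + protein_vec, fc_layer)[0]

compound = project("compound", [1.0] * 6)
protein = project("protein", [0.5] * 10)
print(dot_product_head(compound, protein), concat_fc_head(compound, protein))
```

The dot-product head has no parameters of its own and requires both embeddings to share a dimension, whereas the concatenation head learns its own weights over the joined vector.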


📁 Repository Structure

The ECLIPSE repository is organized as follows:

ECLIPSE/
│
├── data/                     # Input datasets and knowledge graph resources
│   ├── node_index/             # Node indexing files
│   ├── train_test_samples/     # Train/test splits for CPI benchmark datasets
│   └── crossbar_kg/            # Preprocessed knowledge graph and feature tensors
│
├── saved_models/                    # Trained ECLIPSE models
│   └── dcs_eclipse_dp_selformer.pt    # Dot-product ECLIPSE model with SELFormer embeddings, trained on dissimilar-compound split 
│
├── configs/                  # Configuration files with optimized hyperparameters and training settings
│   ├── rs_config.yaml          # Config for random-split based ECLIPSE and baseline models
│   ├── dcs_config.yaml         # Config for dissimilar-compound-split based ECLIPSE and baseline models
│   └── fds_config.yaml         # Config for fully-dissimilar-split based ECLIPSE and baseline models
│
├── src/                      # Source code
│   ├── data_loader.py          # Data loading & preprocessing functions
│   ├── model.py                # HGT-based model architecture
│   ├── train.py                # Training pipeline script
│   ├── predict.py              # Prediction script
│   └── utils.py                # Utility/helper functions
│
├── outputs/                 # Model outputs (predictions, performance scores etc.)
│
├── requirements_cpu.txt           # Python dependencies for CPU version
├── requirements_cuda.txt          # Python dependencies for CUDA version
├── workflow.png               # Workflow diagram of the ECLIPSE framework
├── README.md                  # Project documentation (this file)
└── LICENSE                    # License information

🚀 Getting Started

1. Clone the repository

git clone https://github.com/HUBioDataLab/ECLIPSE.git
cd ECLIPSE

2. Set up the environment

Create and activate a new Conda environment with Python 3.9:

conda create -n eclipse python=3.9
conda activate eclipse

You can now install either the CPU or CUDA version of ECLIPSE dependencies:

  • CPU version:
pip install -r requirements_cpu.txt
  • CUDA version:
pip install -r requirements_cuda.txt

⚠️ Reminder: Ensure that your local CUDA driver is compatible with the version of PyTorch defined in requirements_cuda.txt. If not, modify the corresponding lines in requirements_cuda.txt to match your installed CUDA version before running the command.
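One way to check the result after installation is a small script like the following. It is only a convenience sketch (the function name is hypothetical) and uses standard PyTorch attributes; it also runs when PyTorch is missing, in which case it simply reports that.

```python
def describe_torch_install():
    """Return a one-line summary of the local PyTorch/CUDA setup."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."
    if torch.cuda.is_available():
        return (f"PyTorch {torch.__version__} built for CUDA "
                f"{torch.version.cuda}; a GPU is usable.")
    return (f"PyTorch {torch.__version__} is installed, "
            "but no usable CUDA device was found.")

print(describe_torch_install())
```

If the CUDA requirements were installed but the script reports no usable device, the driver and the PyTorch build are likely mismatched.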

⚙️ Training the ECLIPSE Model

⚠️ Ensure that the graph files are correctly placed in the data/crossbar_kg/ directory before starting training. For detailed instructions, see data/README.md.
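A quick pre-flight check along these lines can catch a missing download before a training run starts. The helper name is hypothetical, and it only tests that the directory exists and is non-empty; which files are actually required is described in data/README.md.

```python
from pathlib import Path

def graph_files_present(repo_root="."):
    """Return True if data/crossbar_kg/ exists and contains at least one entry."""
    kg_dir = Path(repo_root) / "data" / "crossbar_kg"
    return kg_dir.is_dir() and any(kg_dir.iterdir())

print(graph_files_present())
```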

To train the ECLIPSE model, run the train.py script. For example:

python src/train.py -s dcs -pl dp -cr selformer -sm -sp

Arguments:

  • -s, --split: Data split -> fds (fully_dissimilar_split), dcs (dissimilar_compound_split), or rs (random_split)
  • -pl, --prediction-layer: Prediction layer -> dp (dot_product) or fc (fully_connected)
  • -cr, --compound-representation: Compound representation -> ecfp4 or selformer
  • -nw, --num-workers: Number of data loading workers (default: 2)
  • -nt, --num-threads: Number of CPU threads (default: 2)
  • -o, --output-dir: Output directory (default: outputs/)
  • -c, --config: Path to config file (default: generated from other args)
  • -sm, --save-model: Save trained model to saved_models/ if flagged
  • -sp, --save-predictions: Save test set predictions to --output-dir if flagged
  • -b, --baseline: Use baseline model (no HGT layers, only linear layers) if flagged

Test set performance results will be saved to the specified --output-dir.
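For readers scripting around the CLI, the documented options can be mirrored with argparse roughly as follows. This is a reconstruction from the argument list above, not the parser in src/train.py. Two things here are assumptions: that the three model-selection options are required, and that the default config path follows the configs/<split>_config.yaml naming shown in the repository structure.

```python
import argparse

def build_train_parser():
    """Mirror the documented train.py options (sketch, not the real parser)."""
    parser = argparse.ArgumentParser(description="ECLIPSE training options (sketch)")
    # Assumption: the three model-selection options are required.
    parser.add_argument("-s", "--split", choices=["fds", "dcs", "rs"], required=True)
    parser.add_argument("-pl", "--prediction-layer", choices=["dp", "fc"], required=True)
    parser.add_argument("-cr", "--compound-representation",
                        choices=["ecfp4", "selformer"], required=True)
    parser.add_argument("-nw", "--num-workers", type=int, default=2)
    parser.add_argument("-nt", "--num-threads", type=int, default=2)
    parser.add_argument("-o", "--output-dir", default="outputs/")
    parser.add_argument("-c", "--config", default=None)
    parser.add_argument("-sm", "--save-model", action="store_true")
    parser.add_argument("-sp", "--save-predictions", action="store_true")
    parser.add_argument("-b", "--baseline", action="store_true")
    return parser

def resolve_config(args):
    """Pick the config path; the fallback naming is an assumption."""
    return args.config or f"configs/{args.split}_config.yaml"

args = build_train_parser().parse_args("-s dcs -pl dp -cr selformer -sm -sp".split())
print(resolve_config(args), args.save_model, args.baseline)
```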

🎯 Making Predictions

To generate bioactivity value predictions using a trained ECLIPSE model, run the predict.py script with the desired split, prediction layer, and compound representation.

An example command:

python src/predict.py -s dcs -pl dp -cr selformer -pid P11309

Arguments:

  • -s, --split: Data split -> fds (fully_dissimilar_split), dcs (dissimilar_compound_split), or rs (random_split)
  • -pl, --prediction-layer: Prediction layer -> dp (dot_product) or fc (fully_connected)
  • -cr, --compound-representation: Compound representation -> ecfp4 or selformer
  • -o, --output-dir: Output directory (default: outputs/)

Use only one of the following options:

  • -pid, --protein_id: UniProt ID for protein-centric prediction (predict bioactivity values for the given protein against all compounds in the KG)
  • -cid, --compound_id: Compound ID for compound-centric prediction (predict bioactivity values for the given compound against all proteins in the KG)
  • -c, --custom: Path to a CSV file for a custom set (predict bioactivity values for the specified protein-compound pairs in the KG). The file must have two columns with headers: compound_id, protein_id
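The "use only one" rule for these three options maps naturally onto an argparse mutually exclusive group. The sketch below shows that pattern; it is an illustration based on the option list above, not the parser in src/predict.py, and treating the group and the model-selection options as required is an assumption.

```python
import argparse

def build_predict_parser():
    """Mirror the documented predict.py options (sketch, not the real parser)."""
    parser = argparse.ArgumentParser(description="ECLIPSE prediction options (sketch)")
    parser.add_argument("-s", "--split", choices=["fds", "dcs", "rs"], required=True)
    parser.add_argument("-pl", "--prediction-layer", choices=["dp", "fc"], required=True)
    parser.add_argument("-cr", "--compound-representation",
                        choices=["ecfp4", "selformer"], required=True)
    parser.add_argument("-o", "--output-dir", default="outputs/")
    # Exactly one prediction mode may be given per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-pid", "--protein_id")
    mode.add_argument("-cid", "--compound_id")
    mode.add_argument("-c", "--custom")
    return parser

args = build_predict_parser().parse_args(
    "-s dcs -pl dp -cr selformer -pid P11309".split()
)
print(args.protein_id, args.compound_id, args.custom)
```

With this setup, passing two of the mode options together makes argparse exit with a usage error instead of silently picking one.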

Predictions will be saved as a CSV file in the specified --output-dir.
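For the custom-set option, a pair file with the required two headers can be written and sanity-checked with the standard csv module. The helper names below are hypothetical, and the check accepts the two headers in either order, which is an assumption about how strictly predict.py reads the file.

```python
import csv

REQUIRED_COLUMNS = ["compound_id", "protein_id"]

def write_pairs(path, pairs):
    """Write (compound_id, protein_id) pairs with the documented header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(REQUIRED_COLUMNS)
        writer.writerows(pairs)

def read_pairs(path):
    """Read pairs back, rejecting files whose headers do not match."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if sorted(reader.fieldnames or []) != sorted(REQUIRED_COLUMNS):
            raise ValueError(f"expected headers {REQUIRED_COLUMNS}, "
                             f"got {reader.fieldnames}")
        return [(row["compound_id"], row["protein_id"]) for row in reader]
```

A call such as `write_pairs("pairs.csv", [("YOUR_COMPOUND_ID", "P11309")])` would produce a file suitable for passing to the custom-set option, where the compound identifier is a placeholder for an ID that exists in the KG.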

📄 License

Copyright (C) 2026 HUBioDataLab

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
