ECLIPSE (Exploration of Complex Ligand-Protein Interactions through Learning from Systems-level Heterogeneous Biomedical Knowledge Graphs) is an AI-powered framework for predicting the bioactivity of compound–protein interactions (CPIs). By combining advanced graph modeling, comprehensive biomedical knowledge, and pre-trained embeddings, it uncovers hidden relationships within complex biological networks, offering a practical tool for researchers in drug discovery and computational biology.
ECLIPSE combines:
- Large-scale heterogeneous biomedical knowledge graphs (KGs): We built an integrated KG capturing entities, including genes, proteins, drugs, compounds, pathways, diseases, and phenotypes, and their multi-layered interactions.
- Feature embeddings from language and graph models: Each biological entity is represented using learned embeddings, enabling richer context and better predictions.
- Heterogeneous Graph Transformer (HGT): Unlike standard GNNs, HGT leverages node and edge types with type-specific attention, effectively modeling complex and diverse relationships.
Schematic representation of the ECLIPSE framework. ECLIPSE is a systems-level framework for predicting compound–protein bioactivity. The Integrated Biomedical KG module provides a multi-relational biomedical graph of proteins, compounds, drugs, pathways, phenotypes, and diseases, serving as the structural foundation for representation learning. From this graph, sampled subgraphs are processed in the Node Projection on Sampled Subgraphs module, where type-specific MLP layers project heterogeneous input node features into fixed-size representations. These embeddings are then passed into stacked Heterogeneous Graph Transformer Layers, which apply heterogeneous mutual attention, message passing, and target-specific aggregation with residual connections to generate contextualized node embeddings. Finally, the CPI Prediction Layer refines the updated compound and protein embeddings through separate MLPs and then combines them, either by vector concatenation followed by a fully connected network or by dot product, to predict bioactivity values.
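The projection and prediction steps described above can be illustrated with a small pure-Python sketch. It is not the code in src/model.py: the HGT layers between projection and prediction are omitted, the weights are random rather than trained, and all function names and feature sizes are assumptions chosen for illustration.

```python
import random


def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def random_layer(rng, n_in, n_out):
    """Illustrative random weights; a trained model would load learned ones."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def project(features, layers_by_type, node_type):
    """Type-specific projection: each node type owns its own layer."""
    weights, bias = layers_by_type[node_type]
    return relu(linear(features, weights, bias))


def predict_dot(compound_vec, protein_vec):
    """Dot-product head: the score is the inner product of the two embeddings."""
    return sum(c * p for c, p in zip(compound_vec, protein_vec))


def predict_fc(compound_vec, protein_vec, fc_layer):
    """Concatenation head: the joined vector goes through a fully connected layer."""
    weights, bias = fc_layer
    return linear(compound_vec + protein_vec, weights, bias)[0]


rng = random.Random(0)
hidden = 4  # hypothetical shared embedding size
# Hypothetical input sizes: compound and protein features differ in length.
layers_by_type = {
    "compound": random_layer(rng, 6, hidden),
    "protein": random_layer(rng, 3, hidden),
}
compound = project([0.2, 0.1, 0.9, 0.0, 0.4, 0.7], layers_by_type, "compound")
protein = project([0.5, 0.3, 0.8], layers_by_type, "protein")
fc_layer = random_layer(rng, 2 * hidden, 1)

score_dot = predict_dot(compound, protein)
score_fc = predict_fc(compound, protein, fc_layer)
print(len(compound), len(protein))
```

Because both node types are projected to the same size, the dot product is well defined even though the raw feature vectors have different lengths.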
The ECLIPSE repository is organized as follows:
ECLIPSE/
│
├── data/ # Input datasets and knowledge graph resources
│ ├── node_index/ # Node indexing files
│ ├── train_test_samples/ # Train/test splits for CPI benchmark datasets
│ └── crossbar_kg/ # Preprocessed knowledge graph and feature tensors
│
├── saved_models/ # Trained ECLIPSE models
│ └── dcs_eclipse_dp_selformer.pt # Dot-product ECLIPSE model with SELFormer embeddings, trained on dissimilar-compound split
│
├── configs/ # Configuration files with optimized hyperparameters and training settings
│ ├── rs_config.yaml # Config for random-split based ECLIPSE and baseline models
│ ├── dcs_config.yaml # Config for dissimilar-compound-split based ECLIPSE and baseline models
│ └── fds_config.yaml # Config for fully-dissimilar-split based ECLIPSE and baseline models
│
├── src/ # Source code
│ ├── data_loader.py # Data loading & preprocessing functions
│ ├── model.py # HGT-based model architecture
│ ├── train.py # Training pipeline script
│ ├── predict.py # Prediction script
│ └── utils.py # Utility/helper functions
│
├── outputs/ # Model outputs (predictions, performance scores etc.)
│
├── requirements_cpu.txt # Python dependencies for CPU version
├── requirements_cuda.txt # Python dependencies for CUDA version
├── workflow.png # Workflow diagram of the ECLIPSE framework
├── README.md # Project documentation (this file)
└── LICENSE # License information
1. Clone the repository
git clone https://github.com/HUBioDataLab/ECLIPSE.git
cd ECLIPSE
2. Set up the environment
Create and activate a new Conda environment with Python 3.9, then install all required packages using pip:
conda create -n eclipse python=3.9
conda activate eclipse
You can now install either the CPU or CUDA version of ECLIPSE dependencies:
- CPU version:
pip install -r requirements_cpu.txt
- CUDA version:
pip install -r requirements_cuda.txt
Note: Make sure your installed CUDA version matches the one specified in requirements_cuda.txt. If not, modify the corresponding lines in requirements_cuda.txt to match your installed CUDA version before running the command.
Make sure the preprocessed knowledge graph and feature tensors are in the data/crossbar_kg/ directory before starting training. For detailed instructions, see data/README.md.
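Before launching training, it can help to confirm that the data directories are populated. The helper below is a hypothetical convenience script, not part of the repository; it only checks that each directory listed in the repository tree exists and is non-empty, since the exact file names are documented in data/README.md rather than here.

```python
from pathlib import Path

# Directories taken from the repository tree; their contents are not checked.
REQUIRED_DIRS = ["data/node_index", "data/train_test_samples", "data/crossbar_kg"]


def missing_inputs(repo_root="."):
    """Return the required data directories that are absent or empty."""
    missing = []
    for rel in REQUIRED_DIRS:
        path = Path(repo_root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


problems = missing_inputs(".")
if problems:
    print("Missing or empty:", ", ".join(problems))
else:
    print("All data directories are populated.")
```

Run it from the repository root so that the relative paths resolve correctly.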
To train the ECLIPSE model, run the train.py script. An example command:
python src/train.py -s dcs -pl dp -cr selformer -sm -sp
Arguments:
- -s, --split: Data split -> fds (fully_dissimilar_split), dcs (dissimilar_compound_split), or rs (random_split)
- -pl, --prediction-layer: Prediction layer -> dp (dot_product) or fc (fully_connected)
- -cr, --compound-representation: Compound representation -> ecfp4 or selformer
- -nw, --num-workers: Number of data loading workers (default: 2)
- -nt, --num-threads: Number of CPU threads (default: 2)
- -o, --output-dir: Output directory (default: outputs/)
- -c, --config: Path to config file (default: generated from other args)
- -sm, --save-model: Save trained model to saved_models/ if flagged
- -sp, --save-predictions: Save test set predictions to --output-dir if flagged
- -b, --baseline: Use baseline model (no HGT layers, only linear layers) if flagged
Test set performance results will be saved to the specified --output-dir.
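To make the training options concrete, here is a minimal argparse sketch that mirrors the flags listed above. It is an illustration, not the parser in src/train.py. Two points are assumptions: that the three selection flags are mandatory, and that the default config path follows the configs/<split>_config.yaml naming seen in the repository tree.

```python
import argparse


def build_parser():
    """Hypothetical mirror of the documented train.py options."""
    parser = argparse.ArgumentParser(description="Sketch of the train.py CLI")
    # Assumption: the three selection flags are treated as required here.
    parser.add_argument("-s", "--split", choices=["fds", "dcs", "rs"], required=True)
    parser.add_argument("-pl", "--prediction-layer", choices=["dp", "fc"], required=True)
    parser.add_argument("-cr", "--compound-representation",
                        choices=["ecfp4", "selformer"], required=True)
    parser.add_argument("-nw", "--num-workers", type=int, default=2)
    parser.add_argument("-nt", "--num-threads", type=int, default=2)
    parser.add_argument("-o", "--output-dir", default="outputs/")
    parser.add_argument("-c", "--config", default=None)
    parser.add_argument("-sm", "--save-model", action="store_true")
    parser.add_argument("-sp", "--save-predictions", action="store_true")
    parser.add_argument("-b", "--baseline", action="store_true")
    return parser


def resolve_config(args):
    """Assumed rule: fall back to the split-specific file under configs/."""
    return args.config or f"configs/{args.split}_config.yaml"


# Parse the example command from the text instead of the real sys.argv.
args = build_parser().parse_args("-s dcs -pl dp -cr selformer -sm -sp".split())
print(resolve_config(args))
```

Passing -c with an explicit path would bypass the assumed naming rule entirely.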
To generate bioactivity value predictions using a trained ECLIPSE model, run the predict.py script with the desired split, prediction layer, and compound representation.
An example command:
python src/predict.py -s dcs -pl dp -cr selformer -pid P11309
Arguments:
- -s, --split: Data split -> fds (fully_dissimilar_split), dcs (dissimilar_compound_split), or rs (random_split)
- -pl, --prediction-layer: Prediction layer -> dp (dot_product) or fc (fully_connected)
- -cr, --compound-representation: Compound representation -> ecfp4 or selformer
- -o, --output-dir: Output directory (default: outputs/)
Use only one of the following options:
- -pid, --protein_id: UniProt ID for protein-centric prediction (predict bioactivity values for the given protein against all compounds in the KG)
- -cid, --compound_id: Compound ID for compound-centric prediction (predict bioactivity values for the given compound against all proteins in the KG)
- -c, --custom: Path to a CSV file for a custom set (predict bioactivity values for the specified protein-compound pairs in the KG). The file must have two columns with headers: compound_id,protein_id
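The "use only one" rule for the three target options can be expressed with an argparse mutually exclusive group. This is a sketch of one way to enforce it, assuming nothing about how src/predict.py actually validates its arguments.

```python
import argparse

parser = argparse.ArgumentParser(description="Sketch of the predict.py target options")
# Exactly one of the three targets must be supplied.
target = parser.add_mutually_exclusive_group(required=True)
target.add_argument("-pid", "--protein_id")
target.add_argument("-cid", "--compound_id")
target.add_argument("-c", "--custom")

# Parse the protein-centric example from the text.
args = parser.parse_args(["-pid", "P11309"])
print(args.protein_id)  # P11309
```

With this setup, supplying two targets at once, or none, makes argparse exit with a usage error.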
Predictions will be saved as a CSV file in the specified --output-dir.
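For the custom option, the input file needs the compound_id,protein_id header. The sketch below writes such a file and reads it back with a header check. The helper names are hypothetical, and the compound identifiers are placeholders because the ID scheme used in the KG is not specified here; P11309 is the UniProt ID from the example command.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ["compound_id", "protein_id"]


def write_pairs(path, pairs):
    """Write compound-protein pairs with the two required column headers."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(EXPECTED_HEADER)
        writer.writerows(pairs)


def read_pairs(path):
    """Read pairs back, rejecting files whose header does not match."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_HEADER:
            raise ValueError(f"expected header {EXPECTED_HEADER}, got {reader.fieldnames}")
        return [(row["compound_id"], row["protein_id"]) for row in reader]


# Placeholder compound IDs; replace them with identifiers present in the KG.
example_pairs = [("COMPOUND_ID_1", "P11309"), ("COMPOUND_ID_2", "P11309")]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "custom_pairs.csv")
    write_pairs(csv_path, example_pairs)
    loaded = read_pairs(csv_path)

print(len(loaded))
```

A file written this way can then be passed to predict.py through the -c option.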
Copyright (C) 2026 HUBioDataLab
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
