Graph + Domain + Global (GDG) fusion for link prediction
This repository contains the minimal, reproducible code used in our paper to evaluate link prediction on Hetionet and SuppKG using a lightweight fusion model that combines:
- Text embeddings (e.g., PubMedBERT sentence/entity representations)
- Graph/Knowledge embeddings (e.g., Poincaré/MuRP)
- Global/context embeddings (Clustered by hierarchical leigen clustering.)
The implementation is intentionally compact for ease of reproduction and ablation.
- Single‑command training & evaluation
- Works with precomputed embeddings (no heavy model training here)
- Supports MRR, MR, Hits@K with filtered evaluation
- CUDA 12.4 ready (PyTorch & DGL pinned)
FUSE_GDG/
├─ hetionet/
│ ├─ entity_with_definition/
│ │ ├─ train_aligned.json
│ │ ├─ valid_aligned.json
│ │ └─ test_aligned.json
│ ├─ entity2index.pkl
│ ├─ index2entity.pkl
│ ├─ index2relation.pkl
│ ├─ relation2index.pkl
│ ├─ pubmedbert_embeddings_768.npy # Files downloaded from Google Drive
│ ├─ poincare_embeddings.npy
│ ├─ global_embeddings.npy # Files downloaded from Google Drive
│ └─ train.tsv | valid.tsv | test.tsv # Triples: head \t relation \t tail
├─ suppkg/
│ ├─ (same layout as hetionet/)
├─ data_loader.py
├─ model.py
├─ myutils.py
├─ main.py
└─ requirements.txt
Note: The
entity_with_definition/The folder is optional during runtime and you can see how the names, types, and definitions of the entities are configured.
Due to file size limitations, precomputed embedding files (.npy and .pth) are not included directly in this repository. You can download them from Google Drive:
🔗 Download Embeddings from Google Drive
We also provide trained checkpoints for direct evaluation or warm start.
🔗 Download Model Checkpoints (Google Drive)
- Python: 3.10
- CUDA: 12.4
- Key packages (pinned in
requirements.txt):torch==2.4.0+cu124(pypi)dgl==2.4.0+cu124(pypi)
conda create -n fuse_gdg python=3.10 -y
conda activate fuse_gdg
# If you already have CUDA 12.4 drivers/toolkit
pip install -r requirements.txtIf you use a different CUDA version, install the matching wheels for both PyTorch and DGL.
train.tsv,valid.tsv,test.tsvcontain triples in TSV:head\trelation\ttail(no header).entity2index.pkl,relation2index.pklmap string IDs → integer indices.- Embedding
.npyfiles are arrays whose row order matches the integer indices.
pubmedbert_embeddings_768.npy→--text_embedding_filepoincare_embeddings.npy→--graph_embedding_fileglobal_embeddings.npy→--global_embedding_file
Single command to run training + periodic evaluation:
python main.py \
--data hetionet \
--text_embedding_file pubmedbert_embeddings_768.npy \
--knowledge_embedding_file poincare_embeddings.npy \
--global_embedding_file global_embeddings.npy \
--w_text 0.3 --w_domain 0.5 --w_global 0.2 \
--num_hidden_layers 2 \
--iterations 40000 \
--evaluate_every 1000 \
--neg_sample_size_eval 100 \
--model_state_file hetionet_model_state352.pthSwitching to SuppKG only changes the --data folder and output filename:
python main.py --data suppkg ... --model_state_file suppkg_model_stateXXXX.pth--w_text,--w_domain,--w_global: fusion weights (must sum to 1.0 is not required, but recommended).--iterations: total training steps.--evaluate_every: evaluation frequency (steps). Set high to evaluate only at the end.--neg_sample_size_eval: negatives per query during evaluation.--model_state_file: output checkpoint filename.
- MRR, MR, Hits@1/3/10 (filtered protocol).
- Logs are printed to stdout; checkpoints are saved to
--model_state_file.
- Fix seeds if needed (see
myutils.py). - Keep
entity2index.pkland embedding row orders in sync. - Use the same CUDA build for both PyTorch and DGL (see Troubleshooting).
DGL/PyTorch CUDA mismatch
- Symptom: import errors or runtime CUDA failures
- Fix: install wheels that target the same CUDA (e.g.,
+cu124for both)
OOM during eval
- Reduce
--neg_sample_size_eval - Evaluate less frequently via larger
--evaluate_every
@misc{fuse_gdg_2025,
title = {Fuse-GDG: Leveraging graph structural, domain knowledge, global context to enhance GNN-based link prediction on Biomedical knowledge graphs},
author = {DaeHo Kim, TaeHeon Seong, SoYeop Yoo and OkRan Jeong},
year = {2025},
url = {https://github.com/PlantInGreenhouse/Fuse-GDG}
}