RAG-GFM: Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation
This repository contains the official implementation of RAG-GFM, proposed in the paper RAG-GFM: Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation (WWW 2026).
RAG-GFM addresses the fundamental in-memory bottleneck of Graph Foundation Models (GFMs) by externalizing graph knowledge into a unified retrieval system. Instead of compressing heterogeneous semantic and structural knowledge into model parameters, RAG-GFM leverages retrieval-augmented generation to enable scalable, interpretable, and efficient cross-domain graph learning.
- **Dual-Modal Knowledge Externalization**
  - Semantic Store: Prefix-structured node texts stored in a vector database for controllable semantic retrieval.
  - Structural Store: Centrality-based graph motifs encoded via Walk-Spectrum Encoding (WSE) to capture higher-order structural patterns.
- **Cross-View Knowledge Alignment**
  - Self-supervised alignment between semantic and structural views during multi-domain pre-training to learn transferable priors.
- **In-Context Retrieval Augmentation**
  - Retrieved texts and motifs are injected as contextual evidence during few-shot adaptation, enabling efficient downstream learning without updating backbone parameters.
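As a rough illustration of the in-context augmentation step (all names here are illustrative, not the repository's actual API): retrieved texts and motif summaries are concatenated into the input context alongside the query, so the frozen backbone conditions on external evidence rather than parameter-stored knowledge.

```python
# Hedged sketch: compose retrieved evidence into an in-context prompt.
# build_context is a hypothetical helper, not part of this repository.
def build_context(query_text, retrieved_texts, retrieved_motifs):
    """Concatenate retrieved texts and motif summaries before the query."""
    evidence = "\n".join(
        [f"[text] {t}" for t in retrieved_texts]
        + [f"[motif] {m}" for m in retrieved_motifs]
    )
    return f"{evidence}\n[query] {query_text}"

ctx = build_context(
    "Paper about graph neural networks",
    retrieved_texts=["Neighboring paper on GNN pre-training"],
    retrieved_motifs=["triangle-rich ego-graph around a hub node"],
)
```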
Install dependencies:

    pip install -r requirements.txt

We recommend Python ≥ 3.9 and a CUDA-enabled environment for efficient training and retrieval.
The overall workflow follows three stages:
(1) Knowledge Externalization → (2) Cross-View Pre-training → (3) Retrieval-Augmented Adaptation

    python build_nano_db.py

Creates a semantic vector database from prefix-structured node texts using dense embeddings.
This database supports top-k semantic retrieval during both pre-training and fine-tuning.
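To make the semantic store concrete, here is a minimal sketch of a flat in-memory vector index with cosine top-k search. The `NanoVectorDB` class and random toy vectors are illustrative only; the actual `build_nano_db.py` uses learned text embeddings and its own storage format.

```python
import numpy as np

class NanoVectorDB:
    """Illustrative flat vector store with cosine top-k retrieval."""

    def __init__(self, dim: int):
        self.vecs = np.empty((0, dim), dtype=np.float32)
        self.texts: list[str] = []

    def add(self, vec: np.ndarray, text: str) -> None:
        v = vec / (np.linalg.norm(vec) + 1e-12)  # L2-normalize once at insert
        self.vecs = np.vstack([self.vecs, v.astype(np.float32)])
        self.texts.append(text)

    def topk(self, query: np.ndarray, k: int = 3) -> list[str]:
        q = query / (np.linalg.norm(query) + 1e-12)
        sims = self.vecs @ q                      # cosine similarity scores
        idx = np.argsort(-sims)[:k]               # indices of k best matches
        return [self.texts[i] for i in idx]

db = NanoVectorDB(dim=4)
db.add(np.array([1.0, 0.0, 0.0, 0.0]), "[domain: citation] node 0 text")
db.add(np.array([0.0, 1.0, 0.0, 0.0]), "[domain: citation] node 1 text")
db.add(np.array([0.9, 0.1, 0.0, 0.0]), "[domain: citation] node 2 text")
hits = db.topk(np.array([1.0, 0.05, 0.0, 0.0]), k=2)
```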

    python train_all_motif_finders.py

Trains motif encoders based on Walk-Spectrum Encoding (WSE) to identify structurally important nodes and subgraphs.
These motifs form the basis of the structural retrieval store.
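One plausible reading of a walk-based structural encoding (this is a hedged sketch, not the paper's exact WSE formulation): describe each node by the return probabilities of random walks of lengths 1..K, i.e. the diagonals of powers of the row-normalized adjacency matrix.

```python
import numpy as np

def walk_spectrum_encoding(adj: np.ndarray, k_max: int = 4) -> np.ndarray:
    """Per-node structural features from k-step random-walk return probabilities."""
    deg = adj.sum(axis=1, keepdims=True)
    P = adj / np.maximum(deg, 1)       # row-stochastic transition matrix
    feats = []
    Pk = np.eye(adj.shape[0])
    for _ in range(k_max):
        Pk = Pk @ P                     # P^k
        feats.append(np.diag(Pk))       # probability of returning in k steps
    return np.stack(feats, axis=1)      # shape: (n_nodes, k_max)

# Triangle graph: all three nodes are structurally identical.
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
wse = walk_spectrum_encoding(adj, k_max=3)
```

Structurally equivalent nodes receive identical encodings, which is what makes such features transferable across graphs from different domains.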

    python build_all_motif_dbs.py

Constructs motif-level vector databases from trained motif finders, enabling efficient retrieval of transferable structural patterns.

    python execute_cora.py

Evaluates RAG-GFM on node classification under few-shot and cross-domain (LODO) settings using the Cora dataset.

    python execute_graph_cora.py

Evaluates graph-level classification by reformulating node-centered ego-graphs as graph instances.
- **Tasks**
  - Few-shot Node Classification
  - Few-shot Graph Classification
- **Evaluation Protocols**
  - Leave-One-Dataset-Out (LODO-dataset)
  - Leave-One-Domain-Out (LODO-domain)
- **Domains**
  - Citation Networks (Cora, CiteSeer, PubMed)
  - E-Commerce Graphs (Ogbn-Products)
  - Web Link Graphs (Wiki-CS)
Download the required datasets, then place all of them in the project root directory before running experiments.
If you find this work useful, please cite:
@inproceedings{yuan2026rag,
author = {Haonan Yuan and Qingyun Sun and Jiacheng Tao and Xingcheng Fu and Jianxin Li},
title = {RAG-GFM: Overcoming In-Memory Bottlenecks in Graph Foundation Models via Retrieval-Augmented Generation},
booktitle = {Proceedings of the ACM Web Conference 2026 (WWW '26)},
year = {2026},
publisher = {ACM},
address = {New York, NY, USA},
doi = {10.1145/3774904.3792139},
url = {https://doi.org/10.1145/3774904.3792139}
}
For questions or discussions, please contact Haonan Yuan or open an issue in this repository.
Enjoy exploring retrieval-augmented graph foundation models! 🚀