This repository contains the implementation and research associated with a pipeline for detecting vulnerabilities in smart contracts. The pipeline integrates function-level analysis using CodeBERT embeddings and graph-based classification using Graph Neural Networks (GNNs).
The aim of this project is to develop an effective and scalable system for detecting vulnerabilities in smart contracts. The process involves:
**Function-Level Vulnerability Classification:**
- Functions from smart contracts are analyzed for specific vulnerability types (e.g., Re-Entrancy, Timestamp Dependency, Unhandled Exceptions).
- A dataset is created with functions labeled as vulnerable or non-vulnerable.
- CodeBERT is fine-tuned for each vulnerability type to generate embeddings representing the functional semantics.
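The per-vulnerability labeling step above can be sketched in plain Python. This is an illustrative sketch only; the helper name and data layout are not the repository's actual API:

```python
def make_function_datasets(functions):
    """Split multi-label function data into one binary dataset per
    vulnerability type, as used to fine-tune one CodeBERT model each.

    functions: list of (source_code, {vulnerability_type: bool}) pairs.
    """
    datasets = {}
    for code, labels in functions:
        for vuln, is_vulnerable in labels.items():
            datasets.setdefault(vuln, []).append(
                {"code": code, "label": int(is_vulnerable)}
            )
    return datasets
```

Each resulting per-type dataset can then be fed to a separate binary fine-tuning run.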
**Graph-Based Classification:**
- Source code is analyzed to construct Function Call Graphs (FCGs), where:
- Nodes represent functions.
- Edges represent function call relationships.
- Node embeddings are generated using the fine-tuned CodeBERT model.
- These graphs are used as input to GNN models (e.g., GCN, GraphSAGE, GAT) for classification of smart contracts as vulnerable or non-vulnerable.
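The FCG construction and a single graph-convolution step can be sketched with NumPy. This is a simplified stand-in for the actual DGL-based GNN models; the function names, shapes, and the stand-in embeddings are illustrative assumptions:

```python
import numpy as np

def build_fcg_adjacency(functions, calls):
    """Adjacency matrix of a Function Call Graph:
    nodes are functions, edges are caller -> callee relationships."""
    idx = {name: i for i, name in enumerate(functions)}
    A = np.zeros((len(functions), len(functions)))
    for caller, callee in calls:
        A[idx[caller], idx[callee]] = 1.0
    return A

def gcn_layer(A, H, W):
    """One GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 H W).
    H holds per-function node embeddings (from CodeBERT in the pipeline)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)  # ReLU
```

A graph-level readout (e.g., mean-pooling the node outputs) then yields the contract-level vulnerable/non-vulnerable prediction.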
**Baseline Comparisons:**
- Results are compared against existing methods such as CBGRU, Peculiar, VulBERTa, TMP, AME, MANDO, and MANDO-HGT, among others, to demonstrate the performance improvements of the proposed approach.
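Comparisons of this kind typically report precision, recall, and F1 per vulnerability type; a minimal, self-contained sketch of those metrics (pure Python, illustrative only):

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision/recall/F1 for vulnerable (1) vs. non-vulnerable (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```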
**Data Collection:**
- Utilizes datasets such as `smartbugs/smartbugs-wild` and `mwritescode/slither-audited-smart-contracts`.
- Functions and source code are pre-processed using tools like Slither, SmartCheck, and Mythril.
**Modeling Techniques:**
- Fine-tuned CodeBERT for generating embeddings.
- GNNs for graph-based classification.
- Comparison with alternative architectures, including hybrid models (e.g., CBGRU).
**Optimization:**
- Handles limitations of input token lengths (e.g., CodeBERT's 512-token limit).
- Investigates strategies such as selecting key tokens (first 128 + last 382 tokens).
- Significant improvements were observed in detecting vulnerabilities, especially for Re-Entrancy, compared to traditional models.
- Experimentation with different graph construction techniques and feature enhancements.
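The head+tail token selection above (128 + 382 = 510 tokens, leaving room for CodeBERT's two special tokens) can be sketched as follows; the function name is illustrative:

```python
def truncate_head_tail(token_ids, head=128, tail=382, max_len=512):
    """Keep the first `head` and last `tail` tokens when a function
    exceeds the model's budget (512 for CodeBERT, minus 2 special tokens).

    The intuition: a function's signature/declarations (head) and its
    final statements (tail) often carry the most vulnerability signal.
    """
    if len(token_ids) <= max_len:
        return token_ids
    return token_ids[:head] + token_ids[-tail:]
```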
**Repository Structure:**
- `datasets/`: Datasets for functions and graphs.
- `core/`: Core implementation of the pipeline components, including:
  - `finetune_llm/`: CodeBERT fine-tuning, dataset, pipeline, and components.
  - `GFD/`: GFD model pipeline, data processing, and modules.
  - `preprocessing/`: Scripts for dataset preprocessing and analysis.
  - `utils/`: Utility functions for the pipeline.
- `notebooks/`: Jupyter notebooks for data processing, model training, and evaluation.
- `docs/`: Additional project documentation and baseline comparisons.
**Prerequisites:**
- Python 3.8+
- PyTorch
- Hugging Face Transformers
- DGL (Deep Graph Library)
- Pandas, NumPy, Scikit-learn
**Setup:** Clone the repository and install dependencies:

```shell
git clone https://github.com/QuangNguyen2910/GraphFusionVulDetect.git
cd GraphFusionVulDetect
pip install -r requirements.txt
```
**Data Preparation:**
- Follow the instructions in `docs/data_preparation.md` to preprocess datasets and create FCGs.
**Training & Evaluation (using Makefile):** Use the provided Makefile to run training and evaluation commands easily:
- Fine-tune CodeBERT: `make finetune-train`
- Evaluate CodeBERT: `make finetune-eval`
- Train GFD model: `make gfd-train`
- Evaluate GFD model: `make gfd-eval`

These commands execute the corresponding Python modules for training and evaluation.
**Citation:** If you use this project in your research, please cite the accompanying paper:
```bibtex
@article{GraphFusionVulDetect,
  title={GraphFusionVulDetect: Smart Contract Vulnerability Detection Using CodeBERT and GNNs},
  author={Quang Nguyen and Tuyen Vu and Minh Pham and Kien Nguyen and Cong Tran},
  journal={None},
  year={2024}
}
```
For questions or collaborations, contact Quang Nguyen.