Abstract:
The rapid advancement of pre-trained language models (PLMs) has demonstrated promising results for various code-related tasks. However, their effectiveness in detecting real-world vulnerabilities remains a critical challenge. While existing empirical studies evaluate PLMs for vulnerability detection (VD), they suffer from data leakage, limited scope, and superficial analysis, hindering the accuracy and comprehensiveness of evaluations. This paper begins by revisiting the common issues in existing research on PLMs for VD through the evaluation pipeline. It then proceeds with an accurate and extensive evaluation of 18 PLMs, spanning model parameters from millions to billions, on high-quality datasets that feature accurate labeling, diverse vulnerability types, and various projects. Specifically, we compare the performance of PLMs under both fine-tuning and prompt engineering, assess their effectiveness and generalizability across various training and testing settings, and analyze their robustness to perturbations such as code normalization, abstraction, and semantic-preserving transformations.
Our findings reveal that, for function-level VD, PLMs incorporating pre-training tasks designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and those solely pre-trained or fine-tuned on large code corpora. However, these models face notable challenges in real-world scenarios, such as difficulty detecting vulnerabilities with complex dependencies, handling perturbations introduced by code normalization and abstraction, and identifying semantic-preserving transformations of vulnerable code. Moreover, truncation caused by the limited context windows of PLMs can produce a non-negligible number of labeling errors, an issue overlooked by previous work. This study underscores the importance of thorough evaluations of model performance in practical scenarios and outlines future directions to help enhance the effectiveness of PLMs for realistic VD applications.
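To make the robustness setting above concrete, the following is a minimal sketch of one semantic-preserving transformation: consistently renaming identifiers, which changes the surface form a model sees while leaving program behavior intact. The helper `rename_identifiers` and the sample snippet are illustrative assumptions, not the paper's actual transformation suite.

```python
import re

def rename_identifiers(code: str, mapping: dict) -> str:
    """Rename whole-word identifiers in `code` according to `mapping`.

    This is semantic-preserving: the transformed function computes the
    same result, so a robust detector should give both versions the
    same vulnerability label.
    """
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

# A classic unsafe-copy example (hypothetical test sample).
vulnerable = "void copy(char *dst, char *src) { strcpy(dst, src); }"
perturbed = rename_identifiers(vulnerable, {"dst": "a", "src": "b"})
print(perturbed)
# -> void copy(char *a, char *b) { strcpy(a, b); }
```

Evaluating a model on both the original and perturbed variants, and comparing its predictions, is one way to quantify the robustness gap discussed above.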
- We revisit the true capabilities of PLMs for VD by addressing multiple critical shortcomings of existing research: data leakage caused by flawed dataset partitioning and by overlooking the temporal overlap between evaluation data and the pre-training knowledge of LLMs; limited scope resulting from constrained evaluation setups and unrepresentative experimental settings; and superficial analysis that neglects factors such as model generalization and robustness in practical use.
- We evaluate a wide range of 18 PLMs, from small to large parameter scales, and compare their performance under two adaptation techniques: fine-tuning and prompt engineering. These comprehensive comparisons, which also include structure- and semantics-aware prompts and CoT reasoning models, offer a realistic estimation of PLMs' capabilities for VD.
- We implement an extensible evaluation framework built on newly collected, high-quality datasets that extend beyond the pre-training knowledge cutoffs of LLMs, allowing future evaluation of stronger models on more recent data. We reveal several key insights into the capabilities and weaknesses of PLMs, along with effective strategies for enhancing their performance in practical settings.
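The leakage-avoidance idea behind such datasets can be sketched as a temporal split: only samples committed after a model's pre-training cutoff are eligible for testing. The sample schema (`commit_date` field) and cutoff date below are hypothetical placeholders, not the framework's actual data format.

```python
from datetime import date

def temporal_split(samples, cutoff):
    """Partition samples by commit date relative to a model's
    pre-training knowledge cutoff, so the test set cannot overlap
    with what the model may have memorized during pre-training."""
    train = [s for s in samples if s["commit_date"] <= cutoff]
    test = [s for s in samples if s["commit_date"] > cutoff]
    return train, test

# Hypothetical samples with fix-commit dates.
samples = [
    {"id": 1, "commit_date": date(2021, 5, 1)},
    {"id": 2, "commit_date": date(2024, 2, 1)},
]
train, test = temporal_split(samples, cutoff=date(2023, 1, 1))
print([s["id"] for s in train], [s["id"] for s in test])
# -> [1] [2]
```

Re-running the same split with a later cutoff is what allows the framework to evaluate newer models on data beyond their respective knowledge cutoffs.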
For usage instructions, please refer to the readme.md in each folder (i.e., finetune and inference).
