This repository contains the replication code for the research paper "Semantic Source Code Segmentation using Small and Large Language Models". CodeSeg introduces an automated, domain-specific approach for segmenting research code into functionally coherent units. It targets the low-resource language R, with additional experiments on Python.
Title: Semantic Source Code Segmentation using Small and Large Language Models
Abstract: Source code segmentation, the process of dividing code into functionally coherent segments, is vital for knowledge retrieval and maintenance in software development. Traditional manual and syntactic analysis methods become impractical as code repositories grow, especially for low-resource languages such as R. This paper proposes an automated, domain-specific solution for research R code segmentation leveraging Large and Small Language Models (LLMs/SLMs). It presents two novel approaches: line-by-line analysis with context and range-based segment determination, and introduces a human-annotated dataset called StatCodeSeg. Experiments also include Python code from the computer science domain to support generalizability. The findings indicate that context-based line-by-line analysis outperforms range-based methods, and smaller language models like CodeBERT and an encoder-only version of CodeT5+ demonstrate superior performance compared to their LLM counterparts, even without prior R code pre-training.
The paper explores two primary approaches for semantic source code segmentation:
- Line-by-line analysis with context: This method classifies the code one line at a time, using the surrounding lines as context to decide where segment boundaries fall. The paper reports that this approach outperforms the range-based one.
- Range-based segment determination: This approach determines each segment as a range of lines, rather than classifying lines individually.
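The two approaches describe the same segmentation in different forms: one label per line versus a list of line ranges. The sketch below is only an illustration of how the two representations relate. It is not taken from the repository; the function names, the boolean "starts a segment" labelling, and the R snippet are assumptions made for the example.

```python
from typing import List, Tuple


def labels_to_ranges(is_start: List[bool]) -> List[Tuple[int, int]]:
    """Turn per-line boundary labels into inclusive, 1-indexed (start, end) ranges.

    is_start[i] is True when line i + 1 opens a new segment (assumed labelling).
    The first line always opens a segment, whatever its label says.
    """
    ranges: List[Tuple[int, int]] = []
    start = 1
    for line_no, flag in enumerate(is_start, start=1):
        if flag and line_no > 1:
            ranges.append((start, line_no - 1))  # close the previous segment
            start = line_no
    if is_start:
        ranges.append((start, len(is_start)))  # close the final segment
    return ranges


def ranges_to_labels(ranges: List[Tuple[int, int]], n_lines: int) -> List[bool]:
    """Inverse mapping: mark the first line of every range as a segment start."""
    labels = [False] * n_lines
    for start, _end in ranges:
        labels[start - 1] = True
    return labels


# Hypothetical R script, one entry per line.
code = [
    "library(ggplot2)",
    "df <- read.csv('data.csv')",
    "df <- na.omit(df)",
    "model <- lm(y ~ x, data = df)",
    "summary(model)",
]
line_labels = [True, True, False, True, False]  # line-by-line view
segments = labels_to_ranges(line_labels)        # range-based view
for start, end in segments:
    print(f"lines {start}-{end}:", code[start - 1:end])
```

Because the two views are interchangeable, predictions from either approach can be converted to a common form before being compared.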
The study uses both Large Language Models (LLMs) and fine-tuned Small Language Models (SLMs) for segmentation. Notably, smaller models such as CodeBERT and an encoder-only version of CodeT5+ outperformed their LLM counterparts, even though they were not pre-trained on R code.
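For an encoder model to classify a line "with context", each line has to be paired with some of its neighbours before it is tokenized. The following is a minimal sketch of one way to build such inputs. The window size, the `[SEP]` separator string, and the function name `build_context_inputs` are all assumptions for illustration; the paper's actual input format may differ.

```python
from typing import List


def build_context_inputs(lines: List[str], window: int = 2, sep: str = " [SEP] ") -> List[str]:
    """Build one text input per line: preceding context, target line, following context.

    `window` is the number of neighbouring lines kept on each side (assumed value),
    and `sep` is a placeholder separator, not a specific tokenizer's special token.
    """
    inputs = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - window):i]      # up to `window` lines before
        after = lines[i + 1:i + 1 + window]       # up to `window` lines after
        inputs.append(sep.join(["\n".join(before), line, "\n".join(after)]))
    return inputs


# Hypothetical R snippet, one entry per line.
script = [
    "df <- read.csv('data.csv')",
    "df <- na.omit(df)",
    "model <- lm(y ~ x, data = df)",
]
for text in build_context_inputs(script, window=1):
    print(repr(text))
```

Each resulting string would then be tokenized and passed to a classifier that predicts whether the target line starts a new segment.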
The research introduces StatCodeSeg and PyCodeSeg, two human-annotated datasets for R and Python code segmentation, respectively. These datasets were used to fine-tune the SLMs and to evaluate the proposed approaches.
The GitHub repository Dahouabdelhalim/CodeSeg contains the replication code for the paper. Based on the file structure, it likely includes:
- Data/: Contains data related to the project, including the StatCodeSeg and PyCodeSeg datasets and processed versions of them.
- Predictions/: Stores the output or predictions generated by the segmentation models.
- Scripts/: Houses the Python and R scripts used for implementing the segmentation approaches, model training, and evaluation.
- LICENSE: The licensing information for the project (MIT License).
- requirements.txt: Lists the necessary Python packages and their versions required to run the code.
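Since the layout above is inferred rather than confirmed, a quick check after cloning can show which of the listed entries are actually present. This is a small sketch using only the standard library; the helper name `missing_entries` is hypothetical, and the expected names are taken from the list above.

```python
from pathlib import Path
from typing import List

# Top-level entries named in the layout description above.
EXPECTED_ENTRIES = ["Data", "Predictions", "Scripts", "LICENSE", "requirements.txt"]


def missing_entries(repo_root: str, expected: List[str] = EXPECTED_ENTRIES) -> List[str]:
    """Return the expected top-level entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Run from inside the cloned repository directory.
    missing = missing_entries(".")
    print("missing:", missing if missing else "none")
```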
To set up the environment and run the code, you would typically follow these steps:
- Clone the repository:
    git clone https://github.com/Dahouabdelhalim/CodeSeg.git
    cd CodeSeg
- Install dependencies:
pip install -r requirements.txt
- Run the scripts: Refer to the Scripts/ directory for specific instructions on how to execute the segmentation models and reproduce the results. Further details might be available within the scripts themselves or in additional documentation within the repository.
This project is licensed under the MIT License - see the LICENSE file for details.