CodeSeg: Semantic Source Code Segmentation

This repository contains the replication code for the research paper "Semantic Source Code Segmentation using Small and Large Language Models". CodeSeg introduces an automated, domain-specific approach for segmenting research code into functionally coherent units. It targets low-resource languages such as R and includes experiments on Python code to test generalizability.

Research Paper

Title: Semantic Source Code Segmentation using Small and Large Language Models

Abstract: Source code segmentation, the process of dividing code into functionally coherent segments, is vital for knowledge retrieval and maintenance in software development. Traditional manual and syntactic analysis methods become impractical as code repositories grow, especially for low-resource languages such as R. This paper proposes an automated, domain-specific solution for research R code segmentation leveraging Large and Small Language Models (LLMs/SLMs). It presents two novel approaches: line-by-line analysis with context and range-based segment determination, and introduces a human-annotated dataset called StatCodeSeg. Experiments also include Python code from the computer science domain to support generalizability. The findings indicate that context-based line-by-line analysis outperforms range-based methods, and smaller language models like CodeBERT and an encoder-only version of CodeT5+ demonstrate superior performance compared to their LLM counterparts, even without prior R code pre-training.

Key Contributions and Approaches

The paper explores two primary approaches for semantic source code segmentation:

Line-by-line analysis with context: This method classifies the code one line at a time, using the surrounding lines as context to decide where segment boundaries fall. The paper finds that this approach outperforms the range-based one.
Range-based segment determination: This approach determines each segment as a range of lines within the code, rather than labeling lines individually.
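
The two output formats above can be related with a small sketch. This is not the repository's implementation; the function names, the context size `k`, and the label convention (1 marks a line that starts a new segment) are assumptions made for illustration only. It shows how each line can be paired with its neighboring lines as context, and how per-line boundary labels convert to and from line ranges.

```python
from typing import List, Tuple


def context_windows(lines: List[str], k: int = 2) -> List[Tuple[List[str], str, List[str]]]:
    """Pair each line with up to k preceding and k following lines (hypothetical helper)."""
    windows = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - k):i]
        after = lines[i + 1:i + 1 + k]
        windows.append((before, line, after))
    return windows


def labels_to_ranges(labels: List[int]) -> List[Tuple[int, int]]:
    """Convert per-line labels (1 = line starts a new segment) into
    1-indexed, inclusive (start, end) line ranges."""
    ranges = []
    start = 1
    for i, label in enumerate(labels, start=1):
        if label == 1 and i > 1:
            ranges.append((start, i - 1))  # close the previous segment
            start = i
    if labels:
        ranges.append((start, len(labels)))  # close the final segment
    return ranges


def ranges_to_labels(ranges: List[Tuple[int, int]], n_lines: int) -> List[int]:
    """Convert 1-indexed (start, end) ranges back into per-line labels."""
    labels = [0] * n_lines
    for start, _end in ranges:
        labels[start - 1] = 1
    return labels


# Illustrative R snippet and assumed labels (not taken from the datasets).
code = [
    "library(ggplot2)",
    "df <- read.csv('data.csv')",
    "df <- na.omit(df)",
    "model <- lm(y ~ x, data = df)",
    "summary(model)",
]
labels = [1, 1, 0, 1, 0]

segments = labels_to_ranges(labels)
windows = context_windows(code, k=1)
```

In a line-by-line setup, each `(before, line, after)` window would be the input for one classification decision, while a range-based setup would produce the `(start, end)` pairs directly.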

The study uses both Large Language Models (LLMs) and fine-tuned Small Language Models (SLMs) for segmentation. Notably, smaller models such as CodeBERT and an encoder-only version of CodeT5+ outperformed their LLM counterparts, despite not being pre-trained on R code.

Dataset

The research introduces StatCodeSeg and PyCodeSeg, human-annotated datasets designed for R and Python code segmentation, respectively. These datasets were crucial for fine-tuning the SLMs and evaluating the proposed approaches.
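
As a rough illustration of how predictions could be compared against human annotations, the sketch below scores predicted segment boundaries with precision, recall, and F1. This metric choice is an assumption for illustration; the paper and the scripts in this repository may evaluate differently, and `boundary_scores` is a hypothetical helper, not part of the codebase.

```python
from typing import Dict, Iterable


def boundary_scores(gold: Iterable[int], pred: Iterable[int]) -> Dict[str, float]:
    """Precision, recall, and F1 over boundary line numbers (assumed metric)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Lines at which a new segment starts: annotated vs. predicted (made-up values).
scores = boundary_scores(gold=[1, 2, 4], pred=[1, 3, 4])
```

Here two of the three predicted boundaries match the annotation, so precision and recall are both 2/3.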

Repository Structure

The GitHub repository Dahouabdelhalim/CodeSeg contains the replication code for the paper. Based on the file structure, it likely includes:

Data/: Contains data related to the project, including the StatCodeSeg and PyCodeSeg datasets and processed versions of them.
Predictions/: Stores the output or predictions generated by the segmentation models.
Scripts/: Houses the Python and R scripts used for implementing the segmentation approaches, model training, and evaluation.
LICENSE: The licensing information for the project (MIT License).
requirements.txt: Lists the necessary Python packages and their versions required to run the code.

Installation and Usage

To set up the environment and run the code, you would typically follow these steps:

Clone the repository:

```
git clone https://github.com/Dahouabdelhalim/CodeSeg.git
cd CodeSeg
```

Install dependencies:
```
pip install -r requirements.txt
```
Run the scripts: Refer to the Scripts/ directory for specific instructions on how to execute the segmentation models and reproduce the results. Further details might be available within the scripts themselves or in additional documentation within the repository.

License

This project is licensed under the MIT License - see the LICENSE file for details.
