Dahouabdelhalim/CodeSeg

CodeSeg: Semantic Source Code Segmentation

This repository contains the replication code for the research paper "Semantic Source Code Segmentation using Small and Large Language Models". CodeSeg introduces an automated, domain-specific approach for segmenting research code into functionally coherent units. It focuses on R, a low-resource language, and includes additional experiments on Python code.

Research Paper

Title: Semantic Source Code Segmentation using Small and Large Language Models

Abstract: Source code segmentation, the process of dividing code into functionally coherent segments, is vital for knowledge retrieval and maintenance in software development. Traditional manual and syntactic analysis methods become impractical as code repositories grow, especially for low-resource languages such as R. This paper proposes an automated, domain-specific solution for research R code segmentation leveraging Large and Small Language Models (LLMs/SLMs). It presents two novel approaches: line-by-line analysis with context and range-based segment determination, and introduces a human-annotated dataset called StatCodeSeg. Experiments also include Python code from the computer science domain to support generalizability. The findings indicate that context-based line-by-line analysis outperforms range-based methods, and smaller language models like CodeBERT and an encoder-only version of CodeT5+ demonstrate superior performance compared to their LLM counterparts, even without prior R code pre-training.

Key Contributions and Approaches

The paper explores two primary approaches for semantic source code segmentation:

  1. Line-by-line analysis with context: This method classifies the code one line at a time, using the surrounding lines as context to decide where segment boundaries fall. In the paper's experiments it outperforms the range-based approach.
  2. Range-based segment determination: This approach identifies each segment directly as a range of lines, rather than making a decision for every line.
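
To make the line-by-line formulation concrete, here is a minimal Python sketch. It is not the paper's implementation: the window size `k`, the helper names `context_windows` and `segments_from_flags`, and the convention that a flag of 1 marks the first line of a new segment are all assumptions made for illustration. A classifier (not shown) would map each window to a flag.

```python
from typing import List, Tuple


def context_windows(lines: List[str], k: int = 2) -> List[Tuple[List[str], str, List[str]]]:
    """For each line, return (up to k preceding lines, the line, up to k following lines)."""
    windows = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - k):i]
        after = lines[i + 1:i + 1 + k]
        windows.append((before, line, after))
    return windows


def segments_from_flags(flags: List[int]) -> List[Tuple[int, int]]:
    """Turn per-line boundary flags into 1-indexed, inclusive (start, end) line ranges.

    Assumed convention: flags[i] == 1 means line i + 1 starts a new segment.
    The first line always starts a segment, whatever its flag says.
    """
    if not flags:
        return []
    starts = [i for i, f in enumerate(flags) if f == 1 or i == 0]
    # Each segment ends right before the next one starts; the last ends at EOF.
    ends = starts[1:] + [len(flags)]
    return [(s + 1, e) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    code = [
        "library(ggplot2)",
        "df <- read.csv('data.csv')",
        "df <- na.omit(df)",
        "model <- lm(y ~ x, data = df)",
        "summary(model)",
    ]
    for before, line, after in context_windows(code, k=1):
        print(before, "|", line, "|", after)
    # Hypothetical classifier output: new segments start at lines 1, 2 and 4.
    print(segments_from_flags([1, 1, 0, 1, 0]))
```

The second helper shows how the two formulations relate: per-line decisions can always be collapsed into line ranges, so both approaches can be compared on the same segment representation.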

The study uses both Large Language Models (LLMs) and fine-tuned Small Language Models (SLMs) for segmentation. Notably, smaller models such as CodeBERT and an encoder-only version of CodeT5+ outperformed their LLM counterparts, even without prior pre-training on R code.

Dataset

The research introduces two human-annotated datasets for code segmentation: StatCodeSeg (R) and PyCodeSeg (Python). These datasets were used to fine-tune the SLMs and to evaluate the proposed approaches.
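
The storage format of the annotations is not described here, so the following is only a sketch under an assumed representation: each annotated segment is a 1-indexed, inclusive `(start, end)` line range. It shows how such ranges could be converted into per-line flags suitable for a line-by-line classifier, while checking that the ranges tile the file without gaps or overlaps. The function name `flags_from_ranges` is hypothetical.

```python
from typing import List, Tuple


def flags_from_ranges(ranges: List[Tuple[int, int]], n_lines: int) -> List[int]:
    """Convert (start, end) ranges into per-line flags (1 = first line of a segment).

    Raises ValueError if the ranges do not cover lines 1..n_lines exactly once.
    """
    flags = [0] * n_lines
    expected_start = 1
    for start, end in sorted(ranges):
        # Every range must begin right after the previous one ended.
        if start != expected_start or end < start or end > n_lines:
            raise ValueError(
                f"range ({start}, {end}) leaves a gap, overlaps, or is out of bounds"
            )
        flags[start - 1] = 1
        expected_start = end + 1
    if expected_start != n_lines + 1:
        raise ValueError("ranges do not cover the whole file")
    return flags


if __name__ == "__main__":
    # Three segments over a five-line file (made-up example, not dataset content).
    print(flags_from_ranges([(1, 1), (2, 3), (4, 5)], n_lines=5))
```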

Repository Structure

The GitHub repository Dahouabdelhalim/CodeSeg contains the replication code for the paper. Based on the file structure, it likely includes:

  • Data/: Contains data related to the project, including the StatCodeSeg and PyCodeSeg datasets and processed versions of them.
  • Predictions/: Stores the output or predictions generated by the segmentation models.
  • Scripts/: Houses the Python and R scripts used for implementing the segmentation approaches, model training, and evaluation.
  • LICENSE: The licensing information for the project (MIT License).
  • requirements.txt: Lists the necessary Python packages and their versions required to run the code.
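
As an illustration of how stored predictions might be scored against the annotations, the sketch below computes precision, recall, and F1 over segment-start flags. This is an assumption for illustration only: the metrics used in the paper are not stated here, and neither the function name `boundary_prf` nor the per-line flag format is taken from the repository.

```python
from typing import Dict, List


def boundary_prf(gold: List[int], pred: List[int]) -> Dict[str, float]:
    """Precision, recall and F1 for per-line flags where 1 marks a segment start."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of lines")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Guard against division by zero when no boundaries are predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = [1, 0, 1, 0, 1]
    pred = [1, 1, 1, 0, 0]
    print(boundary_prf(gold, pred))
```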

Installation and Usage

To set up the environment and run the code, you would typically follow these steps:

  1. Clone the repository:
    git clone https://github.com/Dahouabdelhalim/CodeSeg.git
    cd CodeSeg
  2. Install dependencies:
    pip install -r requirements.txt
  3. Run the scripts: Refer to the Scripts/ directory for specific instructions on how to execute the segmentation models and reproduce the results. Further details might be available within the scripts themselves or in additional documentation within the repository.
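
Since the individual entry points are not listed here, one quick way to see what is available is to enumerate the files under Scripts/. The sketch below assumes it is run from the repository root and that the scripts use `.py` or `.R` extensions; both are assumptions, and `list_scripts` is a hypothetical helper rather than part of the repository.

```python
from pathlib import Path
from typing import List, Tuple


def list_scripts(root: str = "Scripts", suffixes: Tuple[str, ...] = (".py", ".r")) -> List[str]:
    """Return sorted relative paths of script files found under root.

    Suffix matching is case-insensitive, so both .R and .r are picked up.
    Returns an empty list if root does not exist.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    for name in list_scripts():
        print(name)
```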

License

This project is licensed under the MIT License - see the LICENSE file for details.
