This repository contains the implementation of AUTOPATCH, a framework that optimizes program runtime performance through context-aware Retrieval-Augmented Generation (RAG). AUTOPATCH combines Control Flow Graph (CFG) analysis, retrieval of historical optimization examples, and in-context LLM prompting to automate optimizations that are traditionally applied by hand. On IBM Project CodeNet, it improves execution efficiency by 7.3% over zero-shot generation.
- CFG-Based Optimization: Leverages Control Flow Graph analysis to identify inefficiencies in code structure and guide optimizations.
- Context-Aware Learning: Integrates historical examples and optimization patterns through a retrieval-augmented pipeline.
- Unified RAG Framework: Embeds CFG differences and optimization rationales into structured prompts for precise and effective code refinement.
- Comprehensive Evaluation: Tested on IBM Project CodeNet, demonstrating a 7.3% improvement in execution efficiency over zero-shot generation.
- Modular Design: Structured for easy experimentation, scalability, and integration with additional datasets and optimization techniques.
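The retrieval and prompting steps described above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function names and the record fields (`embedding`, `original`, `optimized`, `cfg_diff`, `rationale`) are hypothetical, and cosine similarity is assumed as the retrieval metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, examples, k=1):
    """Return the k stored examples whose embeddings are closest to the query."""
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, ex["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(target_code, retrieved):
    """Assemble a structured prompt from retrieved examples (hypothetical layout)."""
    parts = ["You are optimizing C++ code for runtime performance.", ""]
    for i, ex in enumerate(retrieved, start=1):
        parts += [
            f"### Example {i}",
            "Original code:", ex["original"],
            "Optimized code:", ex["optimized"],
            "CFG differences:", ex["cfg_diff"],
            "Rationale:", ex["rationale"],
            "",
        ]
    parts += ["### Target", "Code to optimize:", target_code]
    return "\n".join(parts)
```

The resulting string would then be sent to an LLM of your choice; the model call is left out here because the source does not specify it.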
project_root/
├── cfg_conversion/
│ └── cfg_conversion.py # Script for CFG generation and analysis
├── data/
│ ├── cfg_dataset/
│ │ ├── train_cfg_dataset.json # Training dataset
│ │ └── test_cfg_dataset.json # Testing dataset
│ ├── generated_code/
│ │ ├── context_generation.json # Optimized code using contextual examples
│ │ ├── naive_generation.json # Optimized code using naive retrieval
│ │ ├── zero_shot_generation.json # Optimized code without context
│ │ ├── code_analysis.json # Analysis results
│ │ ├── train_dataset.json # Original training dataset
│ │ └── test_dataset.json # Original testing dataset
├── embeddings/
│ ├── embeddings_script.py # Script for generating code embeddings
│ ├── train_embeddings.csv # Embeddings for training data
│ └── test_embeddings.csv # Embeddings for testing data
├── evaluation/
│ └── analysis.ipynb # Jupyter notebook for evaluation and visualization
├── .env # Environment configuration
├── .gitignore # Git ignore file
├── code_analysis.py # Code analysis pipeline
├── context_aware.py # Context-aware optimization script
├── naive_embeddings.py # Embedding-based naive retrieval script
├── zero_shot.py # Zero-shot optimization script
└── requirements.txt # List of required Python packages
git clone https://github.com/manishacharya60/rag-optimization.git
cd rag-optimization

Install the required Python packages using pip:
pip install -r requirements.txt

Ensure all datasets and embeddings are in the correct directory structure as outlined above.
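A quick way to confirm the layout before running anything is a small check script. The sketch below is not part of the repository; the file list is taken from the directory tree above, and `find_missing` is a hypothetical helper name.

```python
from pathlib import Path

# Dataset and embedding files listed in the directory tree above.
EXPECTED_FILES = [
    "data/cfg_dataset/train_cfg_dataset.json",
    "data/cfg_dataset/test_cfg_dataset.json",
    "embeddings/train_embeddings.csv",
    "embeddings/test_embeddings.csv",
]


def find_missing(root="."):
    """Return the expected files that do not exist under the given project root."""
    root_path = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root_path / rel).is_file()]
```

Calling `find_missing()` from the project root returns an empty list when every listed file is in place.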
To convert code into CFGs for analysis, run:
python cfg_conversion/cfg_conversion.py

To create embeddings for the training and testing datasets, run:
python embeddings/embeddings_script.py

- Context-Aware Optimization:
python context_aware.py
- Naive Retrieval-Based Optimization:
python naive_embeddings.py
- Zero-Shot Optimization:
python zero_shot.py
To evaluate the model performance, open the Jupyter notebook:
jupyter notebook evaluation/analysis.ipynb

This repository includes datasets derived from the IBM Project CodeNet, featuring pairs of original and optimized C++ code.
- Control Flow Graphs (CFGs) are generated using Clang's static analyzer, which extracts and formalizes the code structure for optimization.
- Embeddings for retrieval and context-aware learning are generated using microsoft/codebert-base, a pretrained transformer model for programming and natural languages.
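For readers who want to inspect a CFG by hand, one option is the Clang static analyzer's `debug.DumpCFG` debug checker. The sketch below is an assumption about how such a dump could be obtained; `cfg_conversion/cfg_conversion.py` may invoke Clang differently, and `build_cfg_command` and `dump_cfg` are hypothetical helper names.

```python
import subprocess


def build_cfg_command(source_path, compiler="clang++"):
    """Build a Clang invocation that runs the analyzer's CFG dump debug checker."""
    return [
        compiler,
        "--analyze",
        "-Xclang",
        "-analyzer-checker=debug.DumpCFG",
        source_path,
    ]


def dump_cfg(source_path, compiler="clang++"):
    """Run Clang on one source file and return the textual CFG dump.

    Requires Clang to be installed. Both output streams are returned,
    since the stream used for the dump is not assumed here.
    """
    result = subprocess.run(
        build_cfg_command(source_path, compiler),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stderr + result.stdout
```

Keeping command construction separate from execution makes the invocation easy to inspect or adapt without having Clang installed.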
AUTOPATCH has been rigorously tested against baseline methods, demonstrating:
- 7.3% improvement in runtime efficiency over zero-shot generation.
- Higher lexical similarity metrics compared to naive retrieval-based methods.
- Robust performance across various optimization types, including loop unrolling, branch simplification, and memory optimization.
If you use AUTOPATCH in your research, please cite our paper:
@misc{acharya2025optimizingcoderuntimeperformance,
title={Optimizing Code Runtime Performance through Context-Aware Retrieval-Augmented Generation},
author={Manish Acharya and Yifan Zhang and Yu Huang and Kevin Leach},
year={2025},
eprint={2501.16692},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2501.16692},
}