wsr2002/code-risk-rank-predictor

ML-Based Code Risk Predictor

A machine learning-powered system for predicting code file risk and test priority by analyzing Git history and code quality metrics.

🎯 Overview

This project implements a Machine Learning-based Code Risk Prediction System that identifies high-risk code files in software repositories. By analyzing historical commit patterns, code complexity, and maintainability metrics, the system predicts which files are most likely to require bug fixes, helping prioritize testing efforts.

✨ Features

  • ML-Powered Prediction: Uses a Random Forest classifier to predict file risk scores
  • Comprehensive Feature Extraction:
    • Git metrics (commit count, churn, author count)
    • Code quality metrics (cyclomatic complexity, maintainability index, LOC)
  • Automatic Label Generation: Identifies bug-fix commits by parsing commit messages
  • Visual Analytics: Generates confusion matrix heatmap and ROC curve
  • Multi-Project Support: Can process multiple repositories and generate unified datasets

🚀 Quick Start

Installation

pip install -r requirements.txt

Usage

1. Prepare Dataset from Multiple Projects

cd data_prepare
python prepare_dataset.py -f python_projects_list.txt -o dataset.csv

2. Train Model and Predict Risk

cd ..
python hotspot_detection.py data_prepare/dataset.csv

📊 Output

The system generates:

  1. Model Evaluation: AUC-ROC score, classification report, feature importance
  2. Visualizations:
    • confusion_matrix.png - Confusion matrix heatmap
    • roc_curve.png - ROC curve visualization
  3. Risk Predictions: hotspot_predictions.csv with risk scores for all files
  4. Top High-Risk Files: Console output showing top 20 high-risk files
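
As an illustration of the last two outputs, the sketch below ranks files by risk score and keeps the top entries. The README does not document the schema of hotspot_predictions.csv, so the column names `file` and `risk_score` are assumptions, as is the helper name `top_risky_files`; adjust them to the real header.

```python
import csv
import io


def top_risky_files(csv_text, n=20, path_col="file", score_col="risk_score"):
    """Return the n rows with the highest risk score as (path, score) pairs.

    path_col and score_col are hypothetical column names; adjust them to
    match the header of the real predictions file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row[path_col], float(row[score_col])) for row in reader]
    # Sort by score, highest first; ties keep their original file order.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Made-up sample data, not real output from this project.
sample = """file,risk_score
src/parser.py,0.91
src/utils.py,0.12
src/api.py,0.67
"""

for path, score in top_risky_files(sample, n=2):
    print(f"{score:.2f}  {path}")
```

To rank a real predictions file, read it with `open("hotspot_predictions.csv").read()` and pass the text to the same function.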

📈 Model Performance

  • Algorithm: Random Forest Classifier
  • Train/Test Split: 80/20
  • Typical AUC-ROC: ~0.93-0.94
  • Features: commit_count, churn, author_count, cc, mi, loc
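
To make the AUC-ROC figure concrete: the metric equals the probability that a randomly chosen bug-prone file (label 1) receives a higher risk score than a randomly chosen clean file (label 0). The sketch below computes it by direct pair counting on made-up data. It is meant for intuition only; for real evaluation a library routine such as `sklearn.metrics.roc_auc_score` is the usual choice.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data, made up for illustration (not results from this project).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.3, 0.1]
print(auc_roc(labels, scores))  # 5 of 6 pairs are ordered correctly
```

A score near 0.93 therefore means that in roughly 93 of 100 such pairs the model ranks the bug-prone file above the clean one.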

🔧 Feature Descriptions

Feature        Description
commit_count   Number of commits for the file
churn          Total lines changed (added + deleted)
author_count   Number of unique authors who modified the file
cc             Average cyclomatic complexity
mi             Maintainability Index (0-100)
loc            Lines of Code
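
The three Git metrics in the table can be derived from `git log` output. The sketch below is one possible approach, not necessarily what prepare_dataset.py does: it assumes the log was produced with `git log --numstat --format="--%H|%an"`, and the function name `git_metrics` is hypothetical. It parses text only, so it runs without a repository.

```python
from collections import defaultdict


def git_metrics(log_text):
    """Compute commit_count, churn and author_count per file.

    Expects text shaped like the output of
    git log --numstat --format="--%H|%an" (an assumed invocation):
    one "--<hash>|<author>" line per commit, followed by
    "<added> TAB <deleted> TAB <path>" lines.
    """
    commits = defaultdict(int)
    churn = defaultdict(int)
    authors = defaultdict(set)
    author = None
    for line in log_text.splitlines():
        if line.startswith("--"):
            author = line[2:].split("|", 1)[1]
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines between header and stats
        added, deleted, path = parts
        commits[path] += 1
        authors[path].add(author)
        # Binary files report "-" for both counts; treat them as zero churn.
        if added.isdigit() and deleted.isdigit():
            churn[path] += int(added) + int(deleted)
    return {
        path: {
            "commit_count": commits[path],
            "churn": churn[path],
            "author_count": len(authors[path]),
        }
        for path in commits
    }


# Made-up log excerpt for illustration.
sample_log = (
    "--a1b2c3|Alice\n"
    "\n"
    "10\t2\tsrc/parser.py\n"
    "3\t0\tsrc/utils.py\n"
    "--d4e5f6|Bob\n"
    "\n"
    "5\t5\tsrc/parser.py\n"
)
print(git_metrics(sample_log)["src/parser.py"])
```

This sketch does not follow renames: a renamed file appears in numstat output under an "old => new" path and would be counted as a separate entry.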

📝 Label Generation

Labels are automatically generated by detecting bug-fix commits:

  • Keywords: fix, bug, error, patch, hotfix, resolve, issue, defect
  • Label = 1: File appears in bug-fix commits
  • Label = 0: File does not appear in bug-fix commits
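
The labeling rule above can be sketched as follows. The README lists the keywords but not the matching rule, so this example assumes case-insensitive matching at the start of a word (so "Fixed" and "issues" match, while "prefix" and "debug" do not). The names `is_bugfix` and `label_files` are hypothetical.

```python
import re

BUGFIX_KEYWORDS = (
    "fix", "bug", "error", "patch", "hotfix", "resolve", "issue", "defect",
)
# Assumption: a keyword must begin at a word boundary; suffixes are allowed.
BUGFIX_PATTERN = re.compile(
    r"\b(?:" + "|".join(BUGFIX_KEYWORDS) + r")", re.IGNORECASE
)


def is_bugfix(message):
    """True if the commit message contains any bug-fix keyword."""
    return BUGFIX_PATTERN.search(message) is not None


def label_files(commits):
    """Map each path to 1 if it appears in any bug-fix commit, else 0.

    commits is an iterable of (message, list_of_paths) pairs.
    """
    labels = {}
    for message, paths in commits:
        fix = is_bugfix(message)
        for path in paths:
            # Once a file is labeled 1 it stays 1.
            labels[path] = 1 if fix or labels.get(path) == 1 else 0
    return labels


# Made-up commit history for illustration.
commits = [
    ("Fix crash in parser", ["src/parser.py"]),
    ("Add prefix option", ["src/cli.py", "src/parser.py"]),
    ("Update docs", ["README.md"]),
]
print(label_files(commits))
```

Keyword matching is a heuristic: a message such as "Add error messages" would also be flagged, so some label noise is to be expected.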

🛠️ Model Configuration

  • Algorithm: Random Forest
  • n_estimators: 200
  • max_depth: 8
  • class_weight: "balanced" (handles class imbalance)
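
For reference, the configuration above maps onto scikit-learn roughly as shown below. This is a sketch on synthetic data rather than the project's actual training script: the random data, the `stratify` argument, and both `random_state` values are assumptions added to keep the example reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["commit_count", "churn", "author_count", "cc", "mi", "loc"]

# Synthetic stand-in for dataset.csv: 200 files, 6 features, made-up labels.
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))
y = (X[:, 0] + X[:, 1] > 1.2).astype(int)

# 80/20 train/test split; stratify keeps the class ratio equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    class_weight="balanced",  # reweights classes inversely to their frequency
    random_state=42,
)
model.fit(X_train, y_train)

# Risk score = predicted probability of the positive (bug-prone) class.
risk_scores = model.predict_proba(X_test)[:, 1]
print("AUC-ROC:", roc_auc_score(y_test, risk_scores))
for name, importance in zip(FEATURES, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Limiting `max_depth` to 8 keeps individual trees shallow, which tends to reduce overfitting on small datasets.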

📁 Project Structure

hotspot_detection/
├── hotspot_detection.py      # Main ML model training and prediction
├── data_prepare/
│   ├── prepare_dataset.py    # Feature extraction from multiple projects
│   └── python_projects_list.txt
├── requirements.txt
└── README.md
