ML-Based Code Risk Predictor

A machine learning-powered system for predicting code file risk and test priority by analyzing Git history and code quality metrics.

🎯 Overview

This project implements a Machine Learning-based Code Risk Prediction System that identifies high-risk code files in software repositories. By analyzing historical commit patterns, code complexity, and maintainability metrics, the system predicts which files are most likely to require bug fixes, helping prioritize testing efforts.

✨ Features

ML-Powered Prediction: Uses a Random Forest classifier to predict a risk score for each file
Comprehensive Feature Extraction:
- Git metrics (commit count, churn, author count)
- Code quality metrics (cyclomatic complexity, maintainability index, LOC)
Automatic Label Generation: Identifies bug-fix commits by parsing commit messages
Visual Analytics: Generates confusion matrix heatmap and ROC curve
Multi-Project Support: Can process multiple repositories and generate unified datasets

🚀 Quick Start

Installation

pip install -r requirements.txt

Usage

1. Prepare Dataset from Multiple Projects

cd data_prepare
python prepare_dataset.py -f python_projects_list.txt -o dataset.csv

2. Train Model and Predict Risk

python hotspot_detection.py data_prepare/dataset.csv

📊 Output

The system generates:

Model Evaluation: AUC-ROC score, classification report, feature importance
Visualizations:
- confusion_matrix.png - Confusion matrix heatmap
- roc_curve.png - ROC curve visualization
Risk Predictions: hotspot_predictions.csv with risk scores for all files
Top High-Risk Files: Console output showing top 20 high-risk files
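
As a rough illustration of how the top high-risk files could be read back out of the predictions file, the sketch below sorts rows by score and keeps the first N. The column names `file` and `risk_score` are assumptions made for this example; check the header of your own hotspot_predictions.csv before reusing it.

```python
import csv
import io


def top_risky_files(csv_text, n=20):
    """Return the n highest-scoring (file, risk_score) pairs from CSV text.

    Assumes hypothetical columns named "file" and "risk_score".
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Highest risk first.
    rows.sort(key=lambda r: float(r["risk_score"]), reverse=True)
    return [(r["file"], float(r["risk_score"])) for r in rows[:n]]


# Tiny made-up sample standing in for hotspot_predictions.csv.
sample_csv = "file,risk_score\nsrc/a.py,0.91\nsrc/b.py,0.12\nsrc/c.py,0.57\n"
top = top_risky_files(sample_csv, n=2)
print(top)
```

To use it on a real run, pass the contents of the generated file, for example `top_risky_files(open("hotspot_predictions.csv").read())`.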

📈 Model Performance

Algorithm: Random Forest Classifier
Train/Test Split: 80/20
Typical AUC-ROC: ~0.93-0.94
Features: commit_count, churn, author_count, cc, mi, loc
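
For readers unfamiliar with the metric, AUC-ROC is the probability that a randomly chosen positive file (label 1) receives a higher risk score than a randomly chosen negative file (label 0), with ties counted as half. The sketch below computes it directly from that definition on made-up data. It is meant to explain the number, not to reproduce how this project evaluates its model.

```python
def auc_roc(labels, scores):
    """AUC-ROC computed from its pairwise-ranking definition."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    # Count positive/negative pairs where the positive is ranked higher;
    # a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and risk scores for four files.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.1]
example_auc = auc_roc(example_labels, example_scores)
print(example_auc)  # 3 of 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives some context for the ~0.93-0.94 figure above.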

🔧 Features Description

| Feature | Description |
|---|---|
| `commit_count` | Number of commits for the file |
| `churn` | Total lines changed (added + deleted) |
| `author_count` | Number of unique authors who modified the file |
| `cc` | Average cyclomatic complexity |
| `mi` | Maintainability Index (0-100) |
| `loc` | Lines of Code |
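
The three Git metrics can be derived from `git log --numstat` output. The following is a minimal sketch under stated assumptions, not the code in prepare_dataset.py: it assumes the log was produced with `git log --numstat --format="@@%H|%an"`, where the `@@` marker and the `|` separator are arbitrary choices made for this example. Renamed files are not merged, and binary files (reported by Git as `-`) add to the commit count but not to churn.

```python
from collections import defaultdict

COMMIT_MARKER = "@@"  # hypothetical prefix chosen for this sketch


def parse_numstat_log(text):
    """Map each path to its commit_count, churn, and author_count."""
    stats = defaultdict(lambda: {"commit_count": 0, "churn": 0, "authors": set()})
    author = None
    for line in text.splitlines():
        if line.startswith(COMMIT_MARKER):
            # Header line: "@@<hash>|<author name>"
            _, _, author = line[len(COMMIT_MARKER):].partition("|")
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator or unexpected line
        added, deleted, path = parts
        entry = stats[path]
        entry["commit_count"] += 1
        entry["authors"].add(author)
        if added != "-":  # binary files report "-" for both counts
            entry["churn"] += int(added) + int(deleted)
    return {
        path: {
            "commit_count": e["commit_count"],
            "churn": e["churn"],
            "author_count": len(e["authors"]),
        }
        for path, e in stats.items()
    }


# Made-up log text in the assumed format: two commits by two authors.
sample_log = (
    "@@a1|Alice\n\n3\t1\tsrc/a.py\n10\t0\tsrc/b.py\n"
    "@@b2|Bob\n\n2\t2\tsrc/a.py\n"
)
metrics = parse_numstat_log(sample_log)
print(metrics["src/a.py"])
```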

📝 Label Generation

Labels are automatically generated by detecting bug-fix commits:

Keywords: fix, bug, error, patch, hotfix, resolve, issue, defect
Label = 1: File appears in at least one bug-fix commit
Label = 0: File does not appear in any bug-fix commit
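
The labeling rule above can be sketched as follows. The keyword list is taken from this README; everything else is an assumption for illustration, in particular the case-insensitive matching and the choice to match words that start with a keyword (so "fixed" and "resolves" count, while "prefix" and "debug" do not). The actual matching in prepare_dataset.py may differ.

```python
import re

# Keywords listed in this README.
BUG_FIX_KEYWORDS = ("fix", "bug", "error", "patch", "hotfix", "resolve", "issue", "defect")

# Assumption: match any word that begins with one of the keywords.
_PATTERN = re.compile(r"\b(?:" + "|".join(BUG_FIX_KEYWORDS) + r")\w*", re.IGNORECASE)


def is_bug_fix(message):
    """Return True if the commit message looks like a bug fix."""
    return _PATTERN.search(message) is not None


def label_files(commits):
    """commits: iterable of (message, [paths]). Returns {path: 0 or 1}."""
    labels = {}
    for message, paths in commits:
        fix = int(is_bug_fix(message))
        for path in paths:
            # Once a file is seen in a bug-fix commit it stays labeled 1.
            labels[path] = max(labels.get(path, 0), fix)
    return labels


# Made-up commit history for illustration.
example_commits = [
    ("Fix crash in parser", ["a.py", "b.py"]),
    ("Add docs", ["b.py", "c.py"]),
]
example_labels = label_files(example_commits)
print(example_labels)
```

Keyword matching is a heuristic: a message such as "Improve error messages" is flagged as a bug fix even though it may not be one, so the resulting labels are noisy.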

🛠️ Model Configuration

Algorithm: Random Forest
n_estimators: 200
max_depth: 8
class_weight: "balanced" (handles class imbalance)
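
A minimal scikit-learn sketch of this configuration, trained on synthetic data so that it runs on its own. The hyperparameters and the 80/20 split come from this README; the `random_state` values, the stratified split, and the synthetic labels are assumptions for the example and are not taken from hotspot_detection.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["commit_count", "churn", "author_count", "cc", "mi", "loc"]

# Synthetic stand-in for dataset.csv: 200 files, 6 features, made-up labels.
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 80/20 split; stratifying on y is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    class_weight="balanced",  # reweights classes to offset imbalance
    random_state=42,          # assumption: fixed seed for repeatability
)
model.fit(X_train, y_train)

# The probability of class 1 serves as the per-file risk score.
risk_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, risk_scores)
print(f"AUC-ROC on synthetic data: {auc:.3f}")
```

The AUC printed here reflects only the synthetic labels and says nothing about performance on real repositories.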

📁 Project Structure

hotspot_detection/
├── hotspot_detection.py      # Main ML model training and prediction
├── data_prepare/
│   ├── prepare_dataset.py    # Feature extraction from multiple projects
│   └── python_projects_list.txt
├── requirements.txt
└── README.md
