A machine learning-powered system for predicting code file risk and test priority by analyzing Git history and code quality metrics.
This project implements a Machine Learning-based Code Risk Prediction System that identifies high-risk code files in software repositories. By analyzing historical commit patterns, code complexity, and maintainability metrics, the system predicts which files are most likely to require bug fixes, helping prioritize testing efforts.
- ML-Powered Prediction: Uses a Random Forest classifier to predict file risk scores
- Comprehensive Feature Extraction:
  - Git metrics (commit count, churn, author count)
  - Code quality metrics (cyclomatic complexity, maintainability index, LOC)
- Automatic Label Generation: Identifies bug-fix commits by parsing commit messages
- Visual Analytics: Generates confusion matrix heatmap and ROC curve
- Multi-Project Support: Can process multiple repositories and generate unified datasets
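
Install the dependencies:

    pip install -r requirements.txt

Build the dataset:

    cd data_prepare
    python prepare_dataset.py -f python_projects_list.txt -o dataset.csv

Train the model and generate predictions (run from the repository root):

    python hotspot_detection.py data_prepare/dataset.csv

The system generates: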
- Model Evaluation: AUC-ROC score, classification report, feature importance
- Visualizations:
  - `confusion_matrix.png` - Confusion matrix heatmap
  - `roc_curve.png` - ROC curve visualization
- Risk Predictions: `hotspot_predictions.csv` with risk scores for all files
- Top High-Risk Files: Console output showing top 20 high-risk files
- Algorithm: Random Forest Classifier
- Train/Test Split: 80/20
- Typical AUC-ROC: ~0.93-0.94
- Features: commit_count, churn, author_count, cc, mi, loc
| Feature | Description |
|---|---|
| `commit_count` | Number of commits for the file |
| `churn` | Total lines changed (added + deleted) |
| `author_count` | Number of unique authors who modified the file |
| `cc` | Average cyclomatic complexity |
| `mi` | Maintainability Index (0-100) |
| `loc` | Lines of Code |
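
The three Git-derived features in the table can be computed from `git log --numstat` output. The sketch below is one possible approach, not necessarily what `prepare_dataset.py` does: the `@@` commit marker, the function names, and the choice to count binary files (which Git reports as `-`) with zero churn are all assumptions made for illustration. Renamed files are not tracked across renames.

```python
import subprocess
from collections import defaultdict

COMMIT_MARKER = "@@"  # hypothetical prefix used to recognize commit header lines


def read_git_log(repo_path: str) -> str:
    """Run `git log --numstat` in repo_path and return the raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat",
         f"--pretty=format:{COMMIT_MARKER}%H|%an"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def git_metrics(log_text: str) -> dict:
    """Aggregate commit_count, churn, and author_count per file path."""
    commits = defaultdict(int)
    churn = defaultdict(int)
    authors = defaultdict(set)
    author = None
    for line in log_text.splitlines():
        if line.startswith(COMMIT_MARKER):
            # Header line: "<marker><hash>|<author name>"
            _, author = line[len(COMMIT_MARKER):].split("|", 1)
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separator lines between commits
        added, deleted, path = parts
        commits[path] += 1
        authors[path].add(author)
        if added.isdigit() and deleted.isdigit():  # binary files report "-"
            churn[path] += int(added) + int(deleted)
    return {
        path: {
            "commit_count": commits[path],
            "churn": churn[path],
            "author_count": len(authors[path]),
        }
        for path in commits
    }
```

Calling `git_metrics(read_git_log("path/to/repo"))` would return one dictionary of metrics per file; the parser is kept separate from the `git` call so it can be exercised on canned log text.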
Labels are automatically generated by detecting bug-fix commits:
- Keywords: fix, bug, error, patch, hotfix, resolve, issue, defect
- Label = 1: File appears in bug-fix commits
- Label = 0: File does not appear in bug-fix commits
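
The labeling rule above can be sketched with a keyword regex. This is an illustrative version only: case-insensitive matching at word starts (so that "fixed" and "bugs" also match) is an assumption, and `label_files` with its `(message, files)` input shape is a hypothetical helper rather than the project's actual function.

```python
import re

# Keywords taken from the list above. Matching at a word boundary with any
# suffix is an assumption; it also matches words such as "fixture".
BUGFIX_KEYWORDS = ("fix", "bug", "error", "patch", "hotfix", "resolve", "issue", "defect")
BUGFIX_PATTERN = re.compile(r"\b(" + "|".join(BUGFIX_KEYWORDS) + r")\w*", re.IGNORECASE)


def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message contains a bug-fix keyword."""
    return BUGFIX_PATTERN.search(message) is not None


def label_files(commits) -> dict:
    """Map each file path to 1 if it appears in any bug-fix commit, else 0.

    `commits` is a hypothetical iterable of (message, [file paths]) pairs.
    """
    labels = {}
    for message, files in commits:
        buggy = is_bugfix_commit(message)
        for path in files:
            # Once a file is labeled 1 it stays 1, regardless of later commits.
            labels[path] = 1 if buggy or labels.get(path) == 1 else 0
    return labels
```

Keyword matching of this kind is noisy: a message such as "Add error handling" is flagged even though it is not a bug fix, so the resulting labels are best treated as approximate.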
- Algorithm: Random Forest
- n_estimators: 200
- max_depth: 8
- class_weight: "balanced" (handles class imbalance)
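
A minimal sketch of this configuration with scikit-learn, using the 80/20 split and the six features listed earlier. The data here is synthetic and exists only to make the example runnable; the stratified split and the fixed `random_state` values are assumptions, not settings confirmed by the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES = ["commit_count", "churn", "author_count", "cc", "mi", "loc"]

# Synthetic stand-in for dataset.csv (assumption: the real data has one row per file).
rng = np.random.default_rng(0)
X = rng.random((500, len(FEATURES)))
y = (X[:, 0] + X[:, 1] + 0.2 * rng.standard_normal(500) > 1.0).astype(int)

# 80/20 train/test split; stratifying on the label is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    class_weight="balanced",  # reweights classes inversely to their frequency
    random_state=42,          # assumption, for reproducibility only
)
model.fit(X_train, y_train)

# The predicted probability of class 1 serves as the file's risk score.
risk_scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, risk_scores)
print(f"AUC-ROC: {auc:.3f}")

for name, importance in zip(FEATURES, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

Capping `max_depth` at 8 limits how closely each tree can fit the training files, which tends to matter when the positive class is small.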

Project structure:

    hotspot_detection/
    ├── hotspot_detection.py # Main ML model training and prediction
    ├── data_prepare/
    │ ├── prepare_dataset.py # Feature extraction from multiple projects
    │ └── python_projects_list.txt
    ├── requirements.txt
    └── README.md