End-to-end credit risk ML pipeline simulating a production-grade Probability of Default (PD) model used by banks and NBFCs: a Logistic Regression scorecard plus a Gradient Boosting classifier, evaluated with industry-standard credit risk metrics. Gini Coefficient 0.521 · KS Statistic 0.395 · ROC-AUC 0.761 · 5-Fold CV AUC 0.759 ± 0.009
This project builds a credit risk scorecard, the kind of model that banks and NBFCs (for example HDFC Bank, ICICI Bank, and Bajaj Finance) use to decide whether to approve or reject a loan application. It mirrors the full credit analytics workflow:
- Synthetic loan portfolio generation (German Credit / Lending Club inspired)
- Feature engineering with FOIR, LTV proxy, credit bands, WoE encoding
- Model training: Logistic Regression (interpretable scorecard) + Gradient Boosting (predictive power)
- Credit risk evaluation: Gini, KS Statistic, Information Value (IV), Weight of Evidence (WoE)
- Portfolio scoring with 5 credit grades (A to E)
- Interactive dashboard with risk distribution, feature importance, IV table
Relevance: Directly applicable to Credit Risk Analyst, Risk Analytics, and Data Analyst (Finance) roles at HDFC Bank, ICICI Bank, Bajaj Finance, Moody's, CRISIL, Acuity Knowledge Partners, EXL Service.
loan-default-prediction/
│
├── src/
│   ├── generate_data.py          # Synthetic loan portfolio generator (5,000 loans)
│   ├── train_model.py            # Full ML pipeline: LR + Decision Tree + Gradient Boosting
│   ├── predict.py                # Inference script: score new loan applications
│   └── utils.py                  # Gini, KS, IV/WoE helper functions
│
├── data/
│   ├── loan_data.csv             # Raw synthetic loan portfolio (generated)
│   └── loans_scored.csv          # Enriched with PD scores + credit grades (generated)
│
├── notebooks/
│   └── EDA_CreditRisk.ipynb      # Exploratory data analysis notebook
│
├── tests/
│   └── test_pipeline.py          # Unit tests for data generation + model pipeline
│
├── docs/
│   └── credit_risk_concepts.md   # IV, WoE, Gini, KS explained in plain English
│
├── credit_risk_dashboard.html    # Interactive browser dashboard (no server needed)
├── model_results.json            # Metrics, feature importances, IV table (generated)
├── requirements.txt              # Python dependencies
├── .gitignore                    # Standard Python gitignore
└── README.md
| Property | Value |
|---|---|
| Total Loans | 5,000 |
| Default Rate | ~33.7% (realistic for unsecured/MSME lending) |
| Loan Types | Personal, Home Improvement, Medical, Wedding, Education, Business, Vehicle |
| Employment Types | Salaried, Self-Employed, Business, Freelancer |
| Loan Range | ₹20,000 – ₹20,00,000 |
| Income Range | ₹10,000 – ₹1,50,000/month |
| Credit Score Range | 300 – 850 |
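A portfolio with these properties could be generated roughly as follows. This is a minimal sketch, not the code in `src/generate_data.py`: the function name `generate_loans`, the column names, the uniform distributions, and the logistic default mechanism are all assumptions chosen only to respect the ranges in the table.

```python
import numpy as np
import pandas as pd


def generate_loans(n_loans: int = 5000, seed: int = 42) -> pd.DataFrame:
    """Generate a toy loan portfolio (all distributions are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        # rng.integers excludes the upper bound, hence the +1 on each range
        "monthly_income": rng.integers(10_000, 150_001, size=n_loans),
        "loan_amount": rng.integers(20_000, 2_000_001, size=n_loans),
        "credit_score": rng.integers(300, 851, size=n_loans),
        "employment_type": rng.choice(
            ["Salaried", "Self-Employed", "Business", "Freelancer"], size=n_loans
        ),
    })
    # Assumed default mechanism: lower credit scores raise the default probability.
    score_scaled = (df["credit_score"] - 575) / 275  # roughly in [-1, 1]
    logit = -0.7 - 1.5 * score_scaled
    p_default = 1.0 / (1.0 + np.exp(-logit))
    df["default"] = (rng.random(n_loans) < p_default).astype(int)
    return df


loans = generate_loans()
print(loans.shape, round(loans["default"].mean(), 3))
```

Fixing the seed keeps the generated portfolio reproducible between runs, which makes downstream metrics comparable.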
| Feature | Description | Credit Risk Rationale |
|---|---|---|
| `foir_pct` | Fixed Obligation-to-Income Ratio | EMI / Income: capacity to repay |
| `ltv_proxy` | Loan-to-Income ratio | Leverage indicator |
| `credit_band` | Bucketed credit score (0=best, 4=worst) | Bureau score risk tier |
| `high_foir` | Binary: FOIR > 55% | Industry threshold for over-leverage |
| `high_util` | Binary: Credit utilisation > 80% | Stress indicator |
| `delinq_flag` | Binary: Any 30-day past due | Behavioural default predictor |
| `combined_risk` | Sum of all binary risk flags | Composite alert score |
| `log_income` | Log-transformed monthly income | Corrects right-skewed distribution |
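The engineered features in the table can be sketched with pandas as below. The raw column names (`monthly_emi`, `credit_utilisation_pct`, `past_due_30d_count`), the credit band edges, and the use of annualised income in `ltv_proxy` are assumptions for illustration; only the 55% and 80% thresholds come from the table.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add scorecard-style features to a raw loan table (input column names are assumed)."""
    out = df.copy()
    # FOIR: share of monthly income already committed to EMIs, in percent
    out["foir_pct"] = 100 * out["monthly_emi"] / out["monthly_income"]
    # Loan-to-income proxy; dividing by annual income is an assumption
    out["ltv_proxy"] = out["loan_amount"] / (12 * out["monthly_income"])
    # Assumed band edges; 0 = best bureau score tier, 4 = worst
    out["credit_band"] = pd.cut(
        out["credit_score"],
        bins=[299, 549, 649, 699, 749, 850],
        labels=[4, 3, 2, 1, 0],
    ).astype(int)
    out["high_foir"] = (out["foir_pct"] > 55).astype(int)
    out["high_util"] = (out["credit_utilisation_pct"] > 80).astype(int)
    out["delinq_flag"] = (out["past_due_30d_count"] > 0).astype(int)
    out["combined_risk"] = out[["high_foir", "high_util", "delinq_flag"]].sum(axis=1)
    out["log_income"] = np.log(out["monthly_income"])
    return out


sample = pd.DataFrame({
    "monthly_income": [50_000, 20_000],
    "monthly_emi": [10_000, 15_000],
    "loan_amount": [600_000, 480_000],
    "credit_score": [780, 520],
    "credit_utilisation_pct": [30, 95],
    "past_due_30d_count": [0, 2],
})
features = engineer_features(sample)
print(features[["foir_pct", "credit_band", "combined_risk"]])
```

The second applicant trips all three binary flags, so `combined_risk` reaches its maximum of 3.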
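Three models were trained and compared. Logistic Regression is the industry standard for interpretable, regulator-friendly scorecards; Gradient Boosting was included as a higher-capacity challenger. On the metrics below, Logistic Regression scores slightly higher than Gradient Boosting (ROC-AUC 0.7607 vs 0.7447), and the two are nearly tied on 5-fold CV AUC.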
| Metric | Gradient Boosting | Logistic Regression | Decision Tree |
|---|---|---|---|
| ROC-AUC | 0.7447 | 0.7607 | 0.6812 |
| Gini Coefficient | 0.4894 | 0.5213 | 0.3624 |
| KS Statistic | 0.3710 | 0.3948 | 0.2981 |
| PR-AUC | 0.631 | 0.644 | 0.541 |
| 5-Fold CV AUC | 0.759 ± 0.009 | 0.758 ± 0.011 | n/a |
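A comparison like the one in the table could be produced along these lines. This is a hedged sketch rather than the contents of `src/train_model.py`: it uses scikit-learn's `make_classification` as stand-in data, the hyperparameters are assumptions, and `average_precision_score` is used as one common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with roughly a one-third positive (default) rate
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.66, 0.34], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pd_hat = model.predict_proba(X_test)[:, 1]  # predicted probability of default
    auc = roc_auc_score(y_test, pd_hat)
    results[name] = {
        "roc_auc": auc,
        "gini": 2 * auc - 1,
        "pr_auc": average_precision_score(y_test, pd_hat),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Stratifying the split keeps the default rate similar in the training and test sets, so the metrics are not distorted by a lucky or unlucky split.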
- Gini Coefficient = 2 × AUC − 1. Industry benchmark: >0.3 = acceptable, >0.4 = good, >0.6 = excellent. Discriminatory-power measures such as Gini are commonly reported when validating PD models under Basel-aligned (including RBI) frameworks.
- KS Statistic = Maximum separation between cumulative Good and Bad distributions. >0.3 = deployable scorecard.
- Information Value (IV) = Measures each feature's predictive power. Used for feature selection in scorecards.
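The Gini and KS definitions above translate directly into code. The sketch below is one possible implementation built on scikit-learn's `roc_auc_score` and `roc_curve`; the function names are illustrative and not necessarily those in `src/utils.py`. As a quick check of the formula, the Logistic Regression AUC of 0.7607 gives 2 × 0.7607 − 1 ≈ 0.521.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def gini_coefficient(y_true, y_score) -> float:
    """Gini = 2 * AUC - 1, where y_true uses 1 = default and higher scores mean higher risk."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


def ks_statistic(y_true, y_score) -> float:
    """Maximum gap between the cumulative bad rate (TPR) and cumulative good rate (FPR)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(gini_coefficient(y_true, y_score), ks_statistic(y_true, y_score))
```

Reading KS off the ROC curve works because TPR and FPR at each threshold are exactly the cumulative shares of bads and goods scored above that threshold.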
| Feature | IV Score | Predictive Strength |
|---|---|---|
| Delinquency (30-day past due) | 0.7288 | Very Strong (>0.5) |
| Existing Loans | 0.2386 | Medium (0.1–0.3) |
| Credit Score | 0.1840 | Medium (0.1–0.3) |
| FOIR % | 0.0298 | Weak (<0.1) |
| Loan Amount | 0.0140 | Weak (<0.1) |
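IV scores like those in the table are typically built from per-bin Weight of Evidence. The sketch below assumes a feature that is already binned, the sign convention WoE = ln(share of goods / share of bads), and an optional smoothing constant `eps` to avoid division by zero in empty bins. The function name `woe_iv` and the smoothing choice are assumptions, not necessarily what `src/utils.py` does.

```python
import numpy as np
import pandas as pd


def woe_iv(feature: pd.Series, target: pd.Series, eps: float = 0.5):
    """Return a per-bin WoE table and the total IV (target: 1 = default, 0 = non-default)."""
    counts = pd.crosstab(feature, target).reindex(columns=[0, 1], fill_value=0)
    good = counts[0] + eps  # smoothed non-default counts per bin
    bad = counts[1] + eps   # smoothed default counts per bin
    dist_good = good / good.sum()
    dist_bad = bad / bad.sum()
    woe = np.log(dist_good / dist_bad)
    iv_part = (dist_good - dist_bad) * woe
    table = pd.DataFrame(
        {"good": counts[0], "bad": counts[1], "woe": woe, "iv": iv_part}
    )
    return table, float(iv_part.sum())


# Toy example: a binary flag where flagged loans default far more often
flag = pd.Series([0] * 100 + [1] * 100)
default = pd.Series([0] * 80 + [1] * 20 + [0] * 20 + [1] * 80)
table, iv = woe_iv(flag, default, eps=0.0)
print(table)
print("IV:", round(iv, 4))
```

Each bin contributes a non-negative amount to IV, so a feature whose bins all have the same good/bad mix scores zero.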
| Grade | PD Score Range | Count | Interpretation |
|---|---|---|---|
| A: Very Low Risk | 0–20 | 1,180 | Approve immediately |
| B: Low Risk | 20–35 | 1,420 | Approve with standard terms |
| C: Medium Risk | 35–50 | 1,025 | Approve with conditions |
| D: High Risk | 50–65 | 622 | Decline or high rate |
| E: Very High Risk | 65–100 | 753 | Decline |
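Mapping a PD score to one of the five grades can be sketched with a sorted list of cutoffs. The table does not say which grade a boundary score such as 20 belongs to, so this example assumes the lower bound of each band is inclusive (a score of exactly 20 is grade B); the names `GRADE_CUTOFFS` and `assign_grade` are illustrative.

```python
import numpy as np
import pandas as pd

GRADE_CUTOFFS = [20, 35, 50, 65]  # upper limits of grades A to D, from the table
GRADE_LABELS = np.array(["A", "B", "C", "D", "E"])


def assign_grade(pd_score: pd.Series) -> pd.Series:
    """Map PD scores on a 0-100 scale to credit grades A to E."""
    # side="right" counts cutoffs <= score, so a score of exactly 20 lands in grade B
    idx = np.searchsorted(GRADE_CUTOFFS, pd_score.to_numpy(), side="right")
    return pd.Series(GRADE_LABELS[idx], index=pd_score.index, name="credit_grade")


scores = pd.Series([5.0, 20.0, 49.9, 64.0, 92.5])
grades = assign_grade(scores)
print(grades.tolist())
```

Keeping the cutoffs in a single list makes it easy to recalibrate the bands without touching the mapping logic.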
- 8 KPI cards, including total loans, default rate, Gini, KS, AUC, CV AUC, and high-risk count
- Credit grade distribution: doughnut chart (A to E)
- Default rate by employment type: Salaried vs Freelancer vs Business
- Default rate by credit score band: visual risk gradient
- Feature importance: Gradient Boosting top drivers
- Information Value table: IV scores with predictive strength labels
- Model comparison: GB vs LR with confusion matrix
| Layer | Technology |
|---|---|
| Data Generation | Python · Pandas · NumPy |
| ML Models | scikit-learn: LogisticRegression, DecisionTreeClassifier, GradientBoostingClassifier |
| Credit Risk Metrics | Custom Gini, KS, IV/WoE functions (Basel III aligned) |
| Feature Engineering | FOIR, LTV proxy, WoE bins, log transforms, credit bands |
| Cross-Validation | StratifiedKFold (5-fold), which preserves the default rate in each fold |
| Evaluation | ROC-AUC · Gini · KS Statistic · PR-AUC · IV |
| Dashboard | Chart.js · HTML/CSS · JetBrains Mono |
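The stratified 5-fold cross-validation listed in the table can be sketched as follows. The data here is a `make_classification` stand-in and the model settings are assumptions; the point is only to show how `StratifiedKFold` keeps the default rate stable across folds and how a "mean ± std" AUC figure is obtained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with roughly a one-third positive (default) rate
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.66, 0.34], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(f"5-fold CV AUC: {scores.mean():.3f} ± {scores.std():.3f}")

# Default rate in each validation fold; stratification keeps these nearly identical
fold_rates = [float(y[test_idx].mean()) for _, test_idx in cv.split(X, y)]
print([round(rate, 3) for rate in fold_rates])
```

With an imbalanced target, plain K-fold can produce folds with noticeably different default rates, which adds noise to the per-fold AUC.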
# 1. Clone the repository
git clone https://github.com/ukishore33/loan-default-prediction.git
cd loan-default-prediction
# 2. Install dependencies
pip install -r requirements.txt
# 3. Generate synthetic data
python src/generate_data.py
# 4. Train models and generate metrics
python src/train_model.py
# 5. Score a new loan application
python src/predict.py
# 6. Open interactive dashboard (macOS; use xdg-open on Linux or start on Windows)
open credit_risk_dashboard.html
# 7. Run tests
python -m pytest tests/

| File | Purpose |
|---|---|
| `src/generate_data.py` | Generates 5,000 synthetic loans with realistic default patterns |
| `src/train_model.py` | Full training pipeline: 3 models, all metrics, portfolio scoring |
| `src/utils.py` | Standalone Gini, KS, IV, WoE functions, reusable in any project |
| `src/predict.py` | Takes new loan data and outputs PD score + credit grade |
| `model_results.json` | All metrics, feature importances, IV table in structured JSON |
| `credit_risk_dashboard.html` | Interactive dashboard: open in any browser, no server |
| `docs/credit_risk_concepts.md` | Plain-English explanation of Gini, KS, IV for non-technical readers |
"I built a credit risk scorecard that mirrors Basel III PD model requirements. I used Logistic Regression as the primary scorecard model, because it's interpretable and regulatory-compliant, and compared it against Gradient Boosting for predictive power. The Gini of 0.52 and KS of 0.39 both exceed industry thresholds for a deployable scorecard. I also computed Information Value for all features; delinquency history was the strongest predictor at IV = 0.73, which is Very Strong by the industry scale. The model assigns every loan a credit grade from A to E, similar to the graded decisioning that lenders such as HDFC Bank and Bajaj Finance use in their credit systems."
Kishore U. | AML/KYC Compliance Analyst | Credit Risk Analytics | Data Analytics | 📱 6303308133 | Bengaluru, Karnataka | Immediate Joiner | 🔗 LinkedIn · GitHub
Skills demonstrated: Credit Risk · Probability of Default · Scorecard Development · Gini · KS Statistic · Information Value · Python · scikit-learn · Gradient Boosting · Logistic Regression · Basel III PD Model Concepts
All data is 100% synthetic and generated programmatically. No real loan, customer, or financial data was used. Built purely for portfolio demonstration.