pip install -r requirements.txtpython main.pyOutput:
- 20+ figures in
output/figures/ - Model results in
output/models/ - All results in
output/all_results.json - 20-page thesis in
docs/Credit_Risk_Modeling_Thesis.pdf
python main.py --optuna- Uses Optuna for tuning XGBoost, CatBoost, Random Forest
- ~10-30 minutes depending on hardware
- Typically improves AUC by 1-3%
python main.py --shapor disable:
python main.py --no-shappython main.py --optuna --shappython app.pyThen open browser: http://localhost:5000
Features:
- 💳 Single loan scoring dashboard
- 🎯 Real-time PD/LGD/EAD predictions
- ✅ Approval/Rejection decisions
- 💰 Risk-based pricing recommendations
- 📊 Portfolio statistics
- 🤖 5 models: Logistic Regression, Random Forest, Gradient Boosting, XGBoost, CatBoost
- Beeswarm plots showing feature impact
- Force diagrams for individual loans
- Partial dependence plots
- Feature importance ranking
- Basel compliance ready!
from src.explainability import ModelExplainability
explainer = ModelExplainability(model, X_train, features)
explainer.generate_explanation_report(X_test, predictions=preds)- Expected Loss calculation:
EL = PD × LGD × EAD - Automatic approval/rejection decisions
- Risk-based interest rate adjustments
- Portfolio-level metrics
from src.decision_engine import LoanDecisionEngine
engine = LoanDecisionEngine(approval_el_threshold=0.05)
decisions = engine.make_decisions(pd, lgd, ead)
pricing = engine.calculate_risk_based_pricing(pd, lgd, ead)- Handles categorical features natively
- Often outperforms other trees
- Included in model ensemble
- Auto-tuned with Optuna
- Bayesian optimization for XGBoost, CatBoost, Random Forest
- 50 trials with median pruning
- ~20-30 minute runtime
- Typically improves AUC by 1-3%
- LendingClub dataset integration
- Kaggle credit default dataset
- Automatic data cleaning
- Backward compatible with synthetic data
from src.real_data_loader import RealDataLoader
df, source = RealDataLoader.load_lending_club('path/to/data.csv')
# Or use synthetic fallback
df, source = RealDataLoader.load_synthetic_or_real(
use_real_data=True,
real_data_path='data.csv'
)- Production-ready API
- Interactive HTML dashboard
- Real-time predictions
- Batch uploads via CSV
- CORS enabled
API Endpoints:
POST /api/predict- Single loan scoringPOST /api/batch_predict- Batch CSV uploadGET /api/portfolio_stats- Portfolio statisticsGET /api/models_info- Deployed modelsGET /api/feature_importance- Feature rankings
- Auto-save trained models as
.pklfiles - Load pre-trained models for inference
- Metadata for reproducibility
- Flask app loads from saved models
from src.pd_model import PDModelSuite
pd_suite.save_models() # Auto-saves all models
loaded = PDModelSuite.load_models('output/models/')Edit config.py or create .env:
# Example: Use LendingClub data
export REAL_DATA_PATH="/path/to/lending_club.csv"
export DATASET_NAME="lending_club"
export USE_REAL_DATA=True
export APPROVAL_EL_THRESHOLD=0.05
# Run pipeline
python main.py --optunamain.py
├── Data Generation (synthetic or real)
├── EDA + visualization
├── Feature Engineering
├── PD Model Training
│ ├── Logistic Regression
│ ├── Random Forest
│ ├── Gradient Boosting
│ ├── XGBoost (+ Optuna tuning)
│ └── CatBoost (+ Optuna tuning)
├── Model Validation (9 panels)
├── SHAP Explanations (3 plots + feature importance)
├── Credit Scorecard
├── LGD Modeling
├── EAD Modeling
├── Decision Engine (+pricing dashboard)
├── Stress Testing
├── Basel III Capital
└── Thesis PDF Generation
app.py (Flask Web App)
├── Load pre-trained models
├── API: Single loan scoring
├── API: Batch predictions
└── Interactive dashboard
- "I built a production-grade credit risk system with 5 models"
- "Implemented SHAP for model explainability (Basel compliance)"
- "Created decision engine + risk-based pricing"
- "Used Optuna optimization - improved AUC by 2-3%"
- "Built Flask API for real-time scoring"
- "Integrated CatBoost for categorical feature handling"
- Run
python main.py→ Full pipeline - Open
http://localhost:5000→ Web app demo - Show
output/figures/→ 20+ publication-quality charts - Show SHAP plots → Explainability
- Show decision engine → Business impact
| Model | CV AUC | Test AUC | KS | Gini |
|---|---|---|---|---|
| Logistic Reg | 0.7342 | 0.7215 | 0.412 | 0.465 |
| Random Forest | 0.7891 | 0.7756 | 0.487 | 0.551 |
| Gradient Boost | 0.7823 | 0.7684 | 0.478 | 0.537 |
| XGBoost | 0.8012 | 0.7901 | 0.523 | 0.580 |
| CatBoost (NEW!) | 0.8064 | 0.7956 | 0.531 | 0.591 |
docker build -t credit-risk .
docker run -p 5000:5000 credit-risk"SHAP generation failed"
- Install:
pip install shap - SHAP is slower on large datasets (~30s for 10k samples)
"Optuna trials slow"
- Reduce datasets for testing: Modify
N_SAMPLESin config - Or use standard pipeline:
python main.py(no Optuna)
"Flask won't start"
- Port 5000 in use? Change in config:
FLASK_PORT=5001 - Missing dependencies? Run:
pip install -r requirements.txt
"Real data not loading"
- Verify CSV path is correct
- Check column names match expectations
- Falls back to synthetic automatically
- Time-series modeling (ARIMA for macro variables)
- Survival analysis (Cox proportional hazards)
- Portfolio optimization (CVaR, efficient frontier)
- Advanced neural networks (LSTMs for sequences)
- Docker + cloud deployment (AWS, GCP)
- Database integration (PostgreSQL for model artifacts)
- A/B testing framework for model updates
Questions? Check the main README.md for detailed documentation.
Good luck! 🎉