Skip to content

Latest commit

 

History

History
277 lines (216 loc) · 6.84 KB

File metadata and controls

277 lines (216 loc) · 6.84 KB

🚀 Quick Start Guide

Installation

pip install -r requirements.txt

1️⃣ Run Full Pipeline (Standard)

python main.py

Output:

  • 20+ figures in output/figures/
  • Model results in output/models/
  • All results in output/all_results.json
  • 20-page thesis in docs/Credit_Risk_Modeling_Thesis.pdf

2️⃣ Advanced Pipeline Options

With Hyperparameter Optimization (Slower, Better Accuracy)

python main.py --optuna
  • Uses Optuna for tuning XGBoost, CatBoost, Random Forest
  • ~10-30 minutes depending on hardware
  • Typically improves AUC by 1-3%

With SHAP Explanations (Enabled by Default)

python main.py --shap

or disable:

python main.py --no-shap

Both Optimizations

python main.py --optuna --shap

3️⃣ Launch Flask Web Application (NEW!)

python app.py

Then open browser: http://localhost:5000

Features:

  • 💳 Single loan scoring dashboard
  • 🎯 Real-time PD/LGD/EAD predictions
  • ✅ Approval/Rejection decisions
  • 💰 Risk-based pricing recommendations
  • 📊 Portfolio statistics
  • 🤖 5 models: Logistic Regression, Random Forest, Gradient Boosting, XGBoost, CatBoost

📊 What's New (v2.0)?

1. SHAP Model Explainability

  • Beeswarm plots showing feature impact
  • Force diagrams for individual loans
  • Partial dependence plots
  • Feature importance ranking
  • Basel compliance ready!
from src.explainability import ModelExplainability
explainer = ModelExplainability(model, X_train, features)
explainer.generate_explanation_report(X_test, predictions=preds)

2. Decision Engine 🎯

  • Expected Loss calculation: EL = PD × LGD × EAD
  • Automatic approval/rejection decisions
  • Risk-based interest rate adjustments
  • Portfolio-level metrics
from src.decision_engine import LoanDecisionEngine
engine = LoanDecisionEngine(approval_el_threshold=0.05)
decisions = engine.make_decisions(pd, lgd, ead)
pricing = engine.calculate_risk_based_pricing(pd, lgd, ead)

3. CatBoost Model 🚀

  • Handles categorical features natively
  • Often outperforms other trees
  • Included in model ensemble
  • Auto-tuned with Optuna

4. Optuna Hyperparameter Tuning 🔧

  • Bayesian optimization for XGBoost, CatBoost, Random Forest
  • 50 trials with median pruning
  • ~20-30 minute runtime
  • Typically improves AUC by 1-3%

5. Real Data Support 📥

  • LendingClub dataset integration
  • Kaggle credit default dataset
  • Automatic data cleaning
  • Backward compatible with synthetic data
from src.real_data_loader import RealDataLoader
df, source = RealDataLoader.load_lending_club('path/to/data.csv')
# Or use synthetic fallback
df, source = RealDataLoader.load_synthetic_or_real(
    use_real_data=True, 
    real_data_path='data.csv'
)

6. Flask Web Application 🌐

  • Production-ready API
  • Interactive HTML dashboard
  • Real-time predictions
  • Batch uploads via CSV
  • CORS enabled

API Endpoints:

  • POST /api/predict - Single loan scoring
  • POST /api/batch_predict - Batch CSV upload
  • GET /api/portfolio_stats - Portfolio statistics
  • GET /api/models_info - Deployed models
  • GET /api/feature_importance - Feature rankings

7. Model Persistence 💾

  • Auto-save trained models as .pkl files
  • Load pre-trained models for inference
  • Metadata for reproducibility
  • Flask app loads from saved models
from src.pd_model import PDModelSuite
pd_suite.save_models()  # Auto-saves all models
loaded = PDModelSuite.load_models('output/models/')

🔧 Configuration

Edit config.py or create .env:

# Example: Use LendingClub data
export REAL_DATA_PATH="/path/to/lending_club.csv"
export DATASET_NAME="lending_club"
export USE_REAL_DATA=True
export APPROVAL_EL_THRESHOLD=0.05

# Run pipeline
python main.py --optuna

📈 Pipeline Workflow

main.py
├── Data Generation (synthetic or real)
├── EDA + visualization
├── Feature Engineering
├── PD Model Training
│   ├── Logistic Regression
│   ├── Random Forest
│   ├── Gradient Boosting
│   ├── XGBoost (+ Optuna tuning)
│   └── CatBoost (+ Optuna tuning)
├── Model Validation (9 panels)
├── SHAP Explanations (3 plots + feature importance)
├── Credit Scorecard
├── LGD Modeling
├── EAD Modeling
├── Decision Engine (+pricing dashboard)
├── Stress Testing
├── Basel III Capital
└── Thesis PDF Generation

app.py (Flask Web App)
├── Load pre-trained models
├── API: Single loan scoring
├── API: Batch predictions
└── Interactive dashboard

🎓 For Interview/Portfolio

Talking Points:

  1. "I built a production-grade credit risk system with 5 models"
  2. "Implemented SHAP for model explainability (Basel compliance)"
  3. "Created decision engine + risk-based pricing"
  4. "Used Optuna optimization - improved AUC by 2-3%"
  5. "Built Flask API for real-time scoring"
  6. "Integrated CatBoost for categorical feature handling"

Show:

  • Run python main.py → Full pipeline
  • Open http://localhost:5000 → Web app demo
  • Show output/figures/ → 20+ publication-quality charts
  • Show SHAP plots → Explainability
  • Show decision engine → Business impact

📚 Model Comparison (Example Results)

Model CV AUC Test AUC KS Gini
Logistic Reg 0.7342 0.7215 0.412 0.465
Random Forest 0.7891 0.7756 0.487 0.551
Gradient Boost 0.7823 0.7684 0.478 0.537
XGBoost 0.8012 0.7901 0.523 0.580
CatBoost (NEW!) 0.8064 0.7956 0.531 0.591

🐋 Docker (Optional)

docker build -t credit-risk .
docker run -p 5000:5000 credit-risk

📞 Troubleshooting

"SHAP generation failed"

  • Install: pip install shap
  • SHAP is slower on large datasets (~30s for 10k samples)

"Optuna trials slow"

  • Reduce datasets for testing: Modify N_SAMPLES in config
  • Or use standard pipeline: python main.py (no Optuna)

"Flask won't start"

  • Port 5000 in use? Change in config: FLASK_PORT=5001
  • Missing dependencies? Run: pip install -r requirements.txt

"Real data not loading"

  • Verify CSV path is correct
  • Check column names match expectations
  • Falls back to synthetic automatically

🚀 Next Steps (Further Enhancements)

  • Time-series modeling (ARIMA for macro variables)
  • Survival analysis (Cox proportional hazards)
  • Portfolio optimization (CVaR, efficient frontier)
  • Advanced neural networks (LSTMs for sequences)
  • Docker + cloud deployment (AWS, GCP)
  • Database integration (PostgreSQL for model artifacts)
  • A/B testing framework for model updates

Questions? Check the main README.md for detailed documentation.

Good luck! 🎉