All Lending Club Data Analysis

This comprehensive project compiles scripts for decoding and analyzing data from the Lending Club loan dataset, available here.

Introduction

This project performs an Exploratory Data Analysis (EDA) of the Lending Club dataset and trains a model to predict the likelihood of loan default. With the data available at the time of writing, the model achieves an accuracy of approximately 80%. By identifying the probability threshold that maximizes profit, it projects an increase in profitability of about 130% ($706 million across all historical loans).

The workflow is segmented into six primary steps:

01 - CSV to SQLite: Transforms the data from CSV to a more efficient SQLite file format.
02 - Exploratory Data Analysis: Generates relevant EDA plots to uncover patterns in the data and key insights. It also produces an automated report.
03 - Data Preprocessing: Prepares the data for ML optimization and training.
04 - Machine Learning Hyperparameter Optimization: Conducts a Bayesian Search to fine-tune model hyperparameters, aiming to maximize the recall score.
05 - Machine Learning Training: Trains the optimal machine learning model from step 04 on the entire dataset and generates plots related to training and predictability.
06 - Machine Learning Results and Final Analysis: Performs the final analysis of the model, calculating key performance metrics.
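As an illustration of step 01, here is a minimal sketch that uses only the Python standard library. It is not the project's script: the function name `csv_to_sqlite`, the table name `loans`, and the batch size are hypothetical, and every column is stored as TEXT for simplicity.

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path, table="loans", batch_size=10_000):
    """Copy a CSV file with a header row into a SQLite table.

    Returns the number of data rows written. All names here are
    illustrative assumptions, not the project's actual schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        # "with conn" commits on success and rolls back on error.
        with conn, open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Quote identifiers so column names with spaces still work.
            cols = ", ".join('"{}" TEXT'.format(c.replace('"', '""')) for c in header)
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({cols})')
            placeholders = ", ".join("?" for _ in header)
            insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
            total, batch = 0, []
            for row in reader:
                if len(row) != len(header):
                    continue  # skip rows that do not match the header width
                batch.append(row)
                if len(batch) >= batch_size:
                    conn.executemany(insert_sql, batch)
                    total += len(batch)
                    batch = []
            if batch:
                conn.executemany(insert_sql, batch)
                total += len(batch)
        return total
    finally:
        conn.close()
```

Inserting in batches keeps memory use bounded for a large CSV, and the resulting SQLite file can then be queried column by column instead of re-parsing the whole CSV each time.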

All configurations are specified in the config/config.yml file. The scripts can be executed using the following commands:

python "scripts/01 - csv_to_sqlite.py" --config "config/config.yml" && \
python "scripts/02 - exploratory_data_analysis.py" --config "config/config.yml" && \
python "scripts/03 - data_preprocessing.py" --config "config/config.yml" && \
python "scripts/04 - machine_learning_hyperparameter_optimization_BayesSearchCV.py" --config "config/config.yml" && \
python "scripts/05 - machine_learning_training.py" --config "config/config.yml" && \
python "scripts/06 - machine_learning_results_and_final_analysis.py" --config "config/config.yml"
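
The same sequence can also be driven from Python. The sketch below is an assumption about how one might do that, not part of the repository: `build_commands` and `run_pipeline` are hypothetical helpers, while the script paths and the `--config` flag are taken from the commands above.

```python
import subprocess
import sys

# Script paths as listed in the commands above, in execution order.
SCRIPTS = [
    "scripts/01 - csv_to_sqlite.py",
    "scripts/02 - exploratory_data_analysis.py",
    "scripts/03 - data_preprocessing.py",
    "scripts/04 - machine_learning_hyperparameter_optimization_BayesSearchCV.py",
    "scripts/05 - machine_learning_training.py",
    "scripts/06 - machine_learning_results_and_final_analysis.py",
]


def build_commands(config="config/config.yml", python=sys.executable):
    """Return one argument list per pipeline step."""
    return [[python, script, "--config", config] for script in SCRIPTS]


def run_pipeline(config="config/config.yml"):
    """Run the steps in order, stopping at the first failure (like `&&`)."""
    for cmd in build_commands(config):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root would run all six steps. Passing arguments as lists avoids shell quoting problems with the spaces in the script names.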

Project Showcase

Machine Learning Performance

The three plots below show the model's predictive capability (confusion matrix), the contribution of each feature to the prediction (feature importances), and a learning curve used to check how well the model fits.
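
To make the link between a confusion matrix and the accuracy and recall figures concrete, here is a small sketch in plain Python. The labels are made-up toy data (1 = defaulted, 0 = fully paid), not results from the project, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means default."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Share of all loans that were classified correctly."""
    return (tp + tn) / (tn + fp + fn + tp)


def recall(tn, fp, fn, tp):
    """Share of actual defaults that the model caught."""
    return tp / (tp + fn)


# Toy data for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)             # (3, 1, 1, 3)
print(accuracy(*counts))  # 0.75
print(recall(*counts))    # 0.75
```

Recall is the metric that the hyperparameter search in step 04 aims to maximize, since a missed default (a false negative) is usually the costly error for a lender.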

Key Insights

The model not only classifies loans but also estimates the probability of default (ranging from 0 to 1, or 0% to 100%). For any chosen threshold, one can then measure the proportion of defaulted loans among those whose predicted probability falls below it.

This indicates that lowering the threshold filters out proportionally more defaulted loans than fully paid ones.
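
The effect of the threshold can be sketched as follows. The probabilities and outcomes are invented toy values, and `default_share_below` is a hypothetical helper, not a function from the project.

```python
def default_share_below(probs, defaulted, threshold):
    """Share of defaulted loans among loans with predicted probability below threshold.

    Returns None when no loan falls below the threshold.
    """
    kept = [d for p, d in zip(probs, defaulted) if p < threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Toy data for illustration only: predicted default probability and outcome (1 = defaulted).
probs = [0.05, 0.10, 0.20, 0.35, 0.50, 0.60, 0.80, 0.90]
defaulted = [0, 0, 0, 1, 0, 1, 1, 1]

print(default_share_below(probs, defaulted, 1.01))  # 0.5 (every loan kept)
print(default_share_below(probs, defaulted, 0.56))  # 0.2
print(default_share_below(probs, defaulted, 0.30))  # 0.0
```

In this toy example the share of defaults among the retained loans falls as the threshold is lowered, which is the pattern described above.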

A crucial question arises: At what probability threshold should loan applications be rejected to maximize profits? The following figures demonstrate the total and relative profitability at different thresholds, with an optimal point identified.

At the optimal threshold of 56%, approving loans selectively yields a profit approximately 130% higher than the historical profit.
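
A threshold search of this kind can be sketched as a simple sweep. This is an assumption about the general approach, not the project's implementation: `best_threshold` is a hypothetical helper, the per-loan profits are invented toy values, and a loan is treated as approved when its predicted default probability is below the threshold.

```python
def best_threshold(probs, profits, thresholds):
    """Return (threshold, total_profit) with the highest summed profit of approved loans.

    A loan is approved when its predicted default probability is below the
    threshold. On ties, the first threshold in the given order is kept.
    """
    best = None
    for t in thresholds:
        total = sum(profit for p, profit in zip(probs, profits) if p < t)
        if best is None or total > best[1]:
            best = (t, total)
    return best


# Toy data for illustration only: predicted default probability and realized profit per loan.
probs = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
profits = [100, 120, -50, 80, -100, -120]

historical = sum(profits)  # profit if every loan is approved
threshold, profit = best_threshold(probs, profits, [0.25, 0.45, 0.55, 0.75, 1.01])
print(historical)         # 30
print(threshold, profit)  # 0.55 250
```

The relative gain is then the ratio of the best total to the approve-everything total, which is the kind of comparison behind the figures above.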

Exploratory Data Analysis

You can find the Tableau Dashboard here
