Salary Prediction Portfolio

Introduction

This project aims to develop and deploy a salary prediction model. A business's HR and talent functions can use its salary estimates to optimize compensation strategy, acquire the best talent, and improve retention in a competitive labor market.
The model also provides information that job seekers can use to maximize their salary and to identify the most common domains, the highest-paying jobs, and in-demand career paths.
The model is trained on a data set of more than 1 million records with 8 features.

Data

The data folder of this repository contains three CSV files:

'train_features'
'train_salaries': This file contains the target variable 'salary'
'test_features'

Jupyter Notebooks:

SalaryPredictionEDA: contains code for data wrangling, exploratory data analysis (EDA), and baseline creation.
Salary_Prediction_Modelling: contains code for model training, hyperparameter tuning, model evaluation, and model deployment.

Data Wrangling

Load all the train and test data files into pandas DataFrames.
Merge train_features and train_salaries into a single DataFrame called train_df.
Clean the data:

Check for duplicates
- No duplicates were found.
Check for missing values
- No missing values were found in the dataset.
Check for outliers
- We used the IQR rule to find outliers. We found some lower-bound outliers and removed them from the training set. Upper-bound outliers were also detected, but we kept them because they appear to be legitimate data.
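
The wrangling steps above can be sketched on a tiny in-memory example. This is a minimal illustration, not the notebook's code: the join key `jobId`, the sample values, and the conventional 1.5 x IQR multiplier are assumptions, since none of them is stated above.

```python
import pandas as pd

# Tiny stand-ins for the train_features and train_salaries files.
# The join key "jobId" and all values are assumptions for illustration.
train_features = pd.DataFrame({
    "jobId": [f"J{i}" for i in range(8)],
    "yearsExperience": [1, 3, 5, 7, 9, 11, 13, 24],
})
train_salaries = pd.DataFrame({
    "jobId": [f"J{i}" for i in range(8)],
    "salary": [0, 90, 100, 105, 110, 115, 120, 200],
})

# Merge features and target; validate raises if the key is not unique.
train_df = train_features.merge(
    train_salaries, on="jobId", how="inner", validate="one_to_one"
)

# Duplicate and missing-value checks.
n_duplicates = int(train_df.duplicated().sum())
n_missing = int(train_df.isnull().sum().sum())

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = train_df["salary"].quantile(0.25)
q3 = train_df["salary"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

low_outliers = train_df[train_df["salary"] < lower_bound]
high_outliers = train_df[train_df["salary"] > upper_bound]

# Drop only the lower-bound outliers; upper-bound ones are kept.
cleaned_df = train_df[train_df["salary"] >= lower_bound]
```

In this toy data the salary of 0 falls below the lower bound and is dropped, while the salary of 200 exceeds the upper bound but stays in `cleaned_df`, mirroring the decision described above.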

Exploratory Data Analysis

Exploring each feature

CompanyID
- The number of employees per company ranges from 15635 to 16114, which suggests that the dataset covers only large companies. As a result, the model might not predict salaries well for small or mid-size organizations.
- All companies have a similar average salary and a similar salary distribution.
Job Type
- All job types have the same count.
- Some Vice Presidents, CEOs, CFOs, and Managers have no degree.
- Either this is a case of missing data, or a degree is not necessarily required for a C-suite role.
Degree
- High School and non-degree holders outnumber the other degree levels in the dataset.
Major
- There are 9 majors listed in the dataset.
- The majority of people in the dataset do not have their major listed.
Industry
- There are 7 industries in this dataset, and all of them have the same count.
Years Experience
- Years of experience is evenly distributed in the dataset.
Miles from metropolis

Target variable - salary

- The target variable "salary" is approximately normally distributed.

Correlation with the Target Variable

i. yearsExperience and salary

- Salary is positively correlated with years of experience.
- The more years of job experience, the higher the salary.

ii. milesFromMetropolis and salary

- Salary is negatively correlated with milesFromMetropolis.
- The further you are from a metropolitan city, the lower your salary tends to be.

iii. companyId and salary

- There is no correlation between salary and companyId.
- The plot is a flat line, which means that all companies have roughly the same average salary.

iv. jobType and salary

- Salary rises with the seniority of jobType.
- The higher the job position, the higher the salary.

v. degree and salary

- More advanced degrees like Doctorate and Masters tend to correspond to higher salaries.
- Surprisingly, even non-degree holders earn a decent salary. The plot suggests that a degree is not always required for a good salary.

vi. major and salary

- People with Engineering, Business and Math majors have the highest salaries.
- Surprisingly, even people with no major earn a decent salary. This may be a case of missing data.

vii. industry and salary

- Even in the lowest-paying industries, Education and Services, employees can reach upper-bound salaries if they have 15+ years of experience.

Multivariate Analysis

Based on the correlation heatmap above, we can note the following:
- jobType is the most strongly correlated with salary, followed by degree, major, and yearsExperience.
- milesFromMetropolis has a negative correlation with salary.

Baseline

For our performance baseline, we selected a linear regression model scored with negative mean squared error (MSE). The average MSE from five-fold cross-validation on the training dataset is 400.02.

Our goal is to train and deploy a model with an MSE below 360.
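
A baseline of this kind can be sketched with scikit-learn as shown below. The data here is synthetic and the coefficients are made up, so the resulting number has no relation to the 400.02 reported above. The sketch only illustrates how negative-MSE scoring and five-fold cross-validation fit together.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption); the real dataset is not used here.
rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(200, 2))
y = 60 + 2.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0, 5, size=200)

# scikit-learn treats higher scores as better, so MSE is exposed
# as its negative under the name "neg_mean_squared_error".
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to report a conventional, positive MSE averaged over folds.
baseline_mse = -neg_mse.mean()
```

Because the scorer returns negated values, a "better" model has a negative MSE closer to zero, which is the same as a lower positive MSE.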

Hypothesizing solutions

Given what we know about the data, these are the models I propose for predicting salary:

Linear Regression: The EDA showed that the numerical variables and most of the categorical variables have a roughly linear relationship with the target variable 'salary', so a linear model is a reasonable fit for this dataset.
Random Forest: Random Forest suits this dataset because it handles categorical data well. Averaging many trees also reduces overfitting and helps improve accuracy. It works for both classification and regression problems.
Gradient Boosting: Gradient boosting is highly flexible. It can optimize different loss functions and offers several hyperparameters for tuning the fit.

The steps I am planning to take to improve the model accuracy and decrease the MSE are as follows:

- Apply feature engineering, such as one-hot encoding, to the categorical variables
- Normalize the numerical variables so that they are on a common scale
- Tune the hyperparameters to improve accuracy
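
The first two steps could be sketched with a scikit-learn `ColumnTransformer`, as below. The sample rows are invented, and `StandardScaler` is only one possible choice of scaling; the text does not say which scaler the notebook uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented sample rows using column names mentioned in the EDA section.
df = pd.DataFrame({
    "jobType": ["CEO", "JANITOR", "MANAGER", "CEO"],
    "industry": ["OIL", "EDUCATION", "OIL", "SERVICE"],
    "yearsExperience": [10, 3, 7, 20],
    "milesFromMetropolis": [5, 80, 30, 10],
})

categorical = ["jobType", "industry"]
numerical = ["yearsExperience", "milesFromMetropolis"]

# One-hot encode the categorical columns and standardize the numerical ones.
# sparse_threshold=0 forces a dense array, which is easier to inspect.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("scale", StandardScaler(), numerical),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(df)
```

Here `X` has 8 columns: 3 for the job types, 3 for the industries, and 2 scaled numerical columns. Wrapping the transformer and the model in a single `Pipeline` would keep the encoding and scaling inside each cross-validation fold.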

Model Training

We performed 5-fold cross-validation with negative MSE scoring on each of the selected models.
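
A comparison loop of this shape might look as follows. It runs on synthetic data, and the hyperparameters (`n_estimators=50`, fixed seeds) are placeholders rather than the tuned values from the notebook, so the ranking it produces says nothing about the real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 25, size=(300, 3))
y = 50 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=300)

# The three candidate models, with placeholder hyperparameters.
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

# Score each model with 5-fold cross-validation and store the positive MSE.
mse_by_model = {}
for name, model in models.items():
    neg_mse = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    mse_by_model[name] = -neg_mse.mean()

# The best model is the one with the lowest mean MSE.
best_model = min(mse_by_model, key=mse_by_model.get)
```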

Results

Gradient Boosting clearly outperformed the other models, with the lowest MSE.

We achieved our goal of reducing the MSE (mean squared error) below 360.

The plot below visualizes actual versus predicted salaries on the training dataset.

Top 10 Important Features of the Model (GradientBoosting)

We can see that the most important feature is jobType.
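
A ranking like this can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data with invented coefficients and a deliberately useless `noise` column, so its ranking is illustrative only and does not reproduce the jobType result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (an assumption); "noise" carries no signal.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "yearsExperience": rng.uniform(0, 25, 300),
    "milesFromMetropolis": rng.uniform(0, 100, 300),
    "noise": rng.normal(0, 1, 300),
})
y = (
    50
    + 3.0 * X["yearsExperience"]
    - 0.1 * X["milesFromMetropolis"]
    + rng.normal(0, 2, 300)
)

model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep the ten largest (here there are only three features).
top_features = importances.sort_values(ascending=False).head(10)
```

With one-hot encoded inputs, each category becomes its own column, so a top-10 list would contain individual categories rather than whole original features.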

Deployment

Finally, we deploy the model by making salary predictions on the test dataset. The predictions are saved in the file 'predictions.csv', which can be found in the repository.
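
The prediction step could be sketched as below. A plain `LinearRegression` stands in for the tuned model, the `jobId` column and all values are assumptions, and the file is written to a temporary directory so the example does not overwrite the repository's predictions.csv.

```python
import os
import tempfile

import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented training and test rows; "jobId" is an assumed identifier column.
train = pd.DataFrame({
    "yearsExperience": [1, 5, 10, 20],
    "salary": [60, 80, 105, 155],
})
test = pd.DataFrame({
    "jobId": ["T1", "T2"],
    "yearsExperience": [3, 15],
})

# Stand-in model; the project itself selected gradient boosting.
model = LinearRegression().fit(train[["yearsExperience"]], train["salary"])

# Pair each test identifier with its predicted salary.
predictions = pd.DataFrame({
    "jobId": test["jobId"],
    "salary": model.predict(test[["yearsExperience"]]),
})

# Write to a temporary location rather than the repository root.
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
predictions.to_csv(out_path, index=False)
```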
