- This project aims to develop and deploy a salary prediction model whose estimates a business's HR and talent functions can use to optimize their compensation strategy, acquire the best talent, and improve the company's retention rate in a competitive labor market.
- The model also gives job seekers information they can use to maximize their salary and to identify the most common domains, the highest-paying jobs, and in-demand career paths.
- The model is trained on a dataset of more than 1 million records with 8 features.
The data folder of this repository contains three CSV files:
- 'train_features'
- 'train_salaries': This file contains the target variable 'salary'
- 'test_features'
- SalaryPredictionEDA: contains code for data wrangling, exploratory data analysis (EDA), and baseline creation.
- Salary_Prediction_Modelling: contains code for model training, hyperparameter tuning, model evaluation, and model deployment.
'predictions.csv' contains the salaries the model predicted for the test dataset.
- Load all the train and test data files into pandas DataFrames.
- Merge the files train_features and train_salaries into a single DataFrame called train_df.
- Clean the data.
- Check for duplicates
  - No duplicates were found.
- Check for missing values
  - No missing values were found in the dataset.
- Check for outliers
  - We used the IQR rule to find outliers. We found some lower-bound outliers and removed them from the training set. Upper-bound outliers were also detected, but we kept them because they appear to be legitimate data.
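The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the join key `jobId`, the helper name `clean_training_data`, and the conventional 1.5 × IQR multiplier are all assumptions; only the `salary` column name comes from the data description.

```python
import pandas as pd


def clean_training_data(features: pd.DataFrame, salaries: pd.DataFrame,
                        key: str = "jobId", k: float = 1.5) -> pd.DataFrame:
    """Merge features with salaries, then drop duplicates and low-salary outliers.

    `key` (the join column) and `k` (the IQR multiplier) are assumptions.
    """
    train_df = features.merge(salaries, on=key, how="inner")

    # Duplicate and missing-value checks.
    train_df = train_df.drop_duplicates()
    n_missing = int(train_df.isna().sum().sum())
    print(f"missing values: {n_missing}")

    # IQR rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = train_df["salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    print(f"outlier bounds: [{lower:.2f}, {upper:.2f}]")

    # Remove only lower-bound outliers; upper-bound ones are kept.
    return train_df[train_df["salary"] >= lower].reset_index(drop=True)


# Tiny synthetic example (made-up values, not the project data).
features = pd.DataFrame({"jobId": [1, 2, 3, 4, 5, 6],
                         "yearsExperience": [1, 5, 10, 3, 8, 20]})
salaries = pd.DataFrame({"jobId": [1, 2, 3, 4, 5, 6],
                         "salary": [0, 100, 110, 105, 115, 300]})
cleaned = clean_training_data(features, salaries)
```

In this toy example the salary of 0 falls below the lower bound and is dropped, while the salary of 300 exceeds the upper bound but is kept, mirroring the decision described above.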
Exploring every feature
- CompanyID
  - The number of employees per company ranges from 15635 to 16114, which suggests that the dataset covers only large companies. Our predictive model might therefore not predict salaries well for mid-size or small organizations.
  - All the companies seem to have a similar average salary and a similar salary distribution.
- Job Type
  - All job types have the same count.
  - There are some Vice Presidents, CEOs, CFOs and Managers without a degree.
  - Either this is a case of missing data, or a degree is not necessarily needed for a C-suite role.
- Degree
  - There are more High School and non-degree holders in the dataset than holders of other degrees.
- Major
  - There are 9 majors listed in the dataset.
  - The majority of people in the dataset do not have their major listed.
- Industry
  - There are 7 industries in this dataset, and all of them have the same count.
- Years Experience
  - Years of experience is evenly distributed in the dataset.
- Miles from metropolis
- The target variable "Salary" has a normal distribution.
i. yearsExperience and salary
- We can see a positive correlation between Salary and years of experience.
- The more years of job experience, the higher the salary.
ii. milesFromMetropolis and salary
- We can see a negative correlation between salary and milesFromMetropolis.
- So, the further away you are from the metropolitan city, the lower your salary will be.
iii. companyId and salary
- There is no correlation between salary and companyId.
- We can see a flat line which means all the companies have the same average salaries.
iv. jobType and salary
- We can see a positive correlation between salary and jobType.
- The higher the job position, the higher the salary.
v. degree and salary
- More advanced degrees like Doctorate and Masters tend to correspond to higher salaries.
- Surprisingly, even non-degree holders have a decent salary. This plot suggests that a degree is not always required to earn a good salary.
vi. major and salary
- People with Engineering, Business and Math majors have the highest salaries.
- Surprisingly, even people with no major have a decent salary. This might be a case of missing data.
vii. industry and salary
- Employees in the lowest-paying industries, Education and Services, can still reach upper-bound salaries if they have 15+ years of experience.
Based on the correlation heatmap above, we can note the following:
- jobType is the most strongly correlated with salary, followed by degree, major, and yearsExperience.
- milesFromMetropolis has a negative correlation with salary.
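Pearson correlation is only defined for numeric columns, so categorical features such as jobType have to be encoded before they can appear in a heatmap. One possible approach, assumed here rather than taken from the notebook, is to replace each category with the mean salary of that category and then call `DataFrame.corr`. The helper name `correlation_with_target` and the sample rows are made up for illustration.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "salary") -> pd.Series:
    """Correlate every column with `target`, mean-encoding categorical columns."""
    encoded = df.copy()
    for col in encoded.columns:
        if col != target and not pd.api.types.is_numeric_dtype(encoded[col]):
            # Replace each category with the mean target value of that category.
            encoded[col] = encoded.groupby(col)[target].transform("mean")
    return encoded.corr()[target].drop(target).sort_values(ascending=False)


# Made-up rows for illustration only.
demo = pd.DataFrame({
    "jobType": ["MANAGER", "MANAGER", "VICE_PRESIDENT",
                "VICE_PRESIDENT", "CEO", "CEO"],
    "milesFromMetropolis": [90, 70, 50, 40, 10, 5],
    "salary": [50, 60, 110, 120, 190, 200],
})
corr = correlation_with_target(demo)
print(corr)
```

Because the encoding is derived from the target itself, it can never produce a negative correlation and tends to flatter categorical features, so it is best treated as an exploratory aid rather than a modelling input.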
For our performance baseline, we selected a linear regression model scored with negative mean squared error (MSE). Its average MSE under five-fold cross-validation on the training dataset is 400.02.
Our goal is to train and deploy a model with an MSE below 360.
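As a rough illustration of how such a baseline can be computed, the sketch below runs five-fold cross-validation with scikit-learn's `neg_mean_squared_error` scorer. The data is synthetic and stands in for the real training set, so the number it prints is unrelated to the 400.02 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): two numeric features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 2))
y = 100 + 2.0 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(0, 20, size=500)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")

# Flip the sign to report a conventional, positive MSE.
baseline_mse = -neg_mse.mean()
print(f"baseline MSE: {baseline_mse:.2f}")
```

The sign convention matters when comparing models: the best model has the highest negative MSE, which is the same as the lowest MSE.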
Given what we know about the data, these are the models I propose for predicting salary:
- Linear Regression: The EDA showed that the numerical variables and most of the categorical variables have a roughly linear relationship with the target variable 'salary', so a linear model is a reasonable fit for this dataset.
- Random Forest: Random Forest should suit this dataset because it handles categorical data well. Averaging over many trees also reduces overfitting and helps improve accuracy. It is a flexible model that works for both classification and regression problems.
- Gradient Boosting: Gradient boosting is highly flexible. It can optimize different loss functions and offers several hyperparameters for tuning how closely the model fits the data.
The steps I plan to take to improve model accuracy and decrease the MSE are as follows:
- Apply feature engineering, such as one-hot encoding of the categorical variables
- Normalize the numerical features to put them on a common scale
- Tune the hyperparameters to improve accuracy
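The three steps above could be combined in a single scikit-learn pipeline, sketched below under several assumptions: only four of the feature columns are used, `StandardScaler` is chosen as the scaling method, the category labels and data are made up, and the search grid is hypothetical because the tuned values are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["jobType", "degree"]  # subset of columns, for brevity
numerical = ["yearsExperience", "milesFromMetropolis"]

# One-hot encode the categorical columns and scale the numerical ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numerical),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Hypothetical search grid (assumption).
param_grid = {"model__n_estimators": [20, 40], "model__max_depth": [2, 3]}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")

# Made-up data so the sketch runs end to end.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "jobType": rng.choice(["MANAGER", "VICE_PRESIDENT", "CEO"], n),
    "degree": rng.choice(["HIGH_SCHOOL", "MASTERS", "DOCTORAL"], n),
    "yearsExperience": rng.integers(0, 25, n),
    "milesFromMetropolis": rng.integers(0, 100, n),
})
y = (80 + 2 * X["yearsExperience"] - 0.4 * X["milesFromMetropolis"]
     + rng.normal(0, 5, n))

search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Keeping the encoder and scaler inside the pipeline means they are refit on each training fold, which avoids leaking information from the validation fold into the preprocessing.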
We performed 5-fold cross-validation with negative MSE scoring on each of the selected models.
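A comparison of this kind might look like the sketch below. The data is synthetic and the models are small and untuned (both assumptions made to keep the example fast), so the ranking it prints says nothing about the results on the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic numeric data (assumption) standing in for the encoded training set.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))
y = 100 + 50 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 5, size=300)

# Small, untuned models to keep the sketch fast.
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=30, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

mean_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    mean_mse[name] = -scores.mean()  # flip the sign back to a plain MSE
    print(f"{name}: MSE = {mean_mse[name]:.2f}")

# The best model is the one with the lowest mean MSE.
best = min(mean_mse, key=mean_mse.get)
print("best:", best)
```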
Results
Gradient Boosting clearly outperformed all the other models, achieving the lowest MSE.
We achieved the goal of reducing the mean squared error (MSE) below 360.
The plot below visualizes actual versus predicted salaries on the training dataset.

- The most important feature is jobType.
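Feature importances from a tree ensemble are reported per input column, so after one-hot encoding a feature like jobType is spread over several dummy columns. One way to get a per-feature ranking, sketched here on made-up data with an untuned model (both assumptions), is to sum the importances of the dummies that share an original column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Made-up data in which jobType drives most of the salary variation.
rng = np.random.default_rng(0)
n = 300
raw = pd.DataFrame({
    "jobType": rng.choice(["MANAGER", "VICE_PRESIDENT", "CEO"], n),
    "yearsExperience": rng.integers(0, 25, n),
})
base = raw["jobType"].map({"MANAGER": 100, "VICE_PRESIDENT": 150, "CEO": 200})
y = base + raw["yearsExperience"] + rng.normal(0, 5, n)

# One-hot encode; "=" separates the original column name from the category.
X = pd.get_dummies(raw, columns=["jobType"], prefix_sep="=", dtype=float)
model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Sum the importances of all dummy columns that share an original feature.
importances = pd.Series(model.feature_importances_, index=X.columns)
by_feature = importances.groupby(lambda c: c.split("=")[0]).sum()
by_feature = by_feature.sort_values(ascending=False)
print(by_feature)
```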
Finally, we deploy the model by making salary predictions on the test dataset. The predictions are saved in the file 'predictions.csv', which can be found in the repository.
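The prediction step could be sketched as below. The helper `save_predictions`, the `jobId` identifier column, and the two-column output layout are assumptions, and a plain linear model fitted on made-up data stands in for the tuned model.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def save_predictions(model, test_features: pd.DataFrame, ids: pd.Series,
                     path: str) -> pd.DataFrame:
    """Predict salaries for `test_features` and write them to `path` as CSV."""
    out = pd.DataFrame({"jobId": ids.to_numpy(),
                        "salary": model.predict(test_features)})
    out.to_csv(path, index=False)
    return out


# Made-up train/test data; a plain linear model stands in for the tuned one.
rng = np.random.default_rng(0)
train_X = pd.DataFrame({"yearsExperience": rng.integers(0, 25, 100)})
train_y = 80 + 2 * train_X["yearsExperience"] + rng.normal(0, 5, 100)
model = LinearRegression().fit(train_X, train_y)

test_X = pd.DataFrame({"yearsExperience": [1, 10, 20]})
test_ids = pd.Series([101, 102, 103])

# Write to a temporary directory so the sketch does not touch the repository.
path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
saved = save_predictions(model, test_X, test_ids, path)
```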














