Welcome to the Income Prediction project! This project leverages Apache Spark and various machine learning algorithms to predict whether an individual's income exceeds $50K based on features like age, workclass, education, occupation, and more.
The goal is to predict whether an individual's income exceeds $50K from personal features. The dataset, adult.csv, includes details such as:
- Age 👶👴
- Workclass 🧑‍💼
- Education 🎓
- Occupation 💼
- Income 💰 (target variable)
The prediction task is a binary classification: predicting whether the income is greater than $50K.
- Apache Spark: A fast, general-purpose cluster-computing system for big data processing.
- PySpark: Python API for Apache Spark, used for distributed data processing.
- ngrok (via pyngrok): Exposes local servers to the internet (if required for the project).
- Matplotlib & Seaborn: For data visualization and plotting.
- Scikit-learn: For machine learning models, evaluation metrics, and performance analysis.
The dataset used in this project is adult.csv, which contains the following features:
- Age: Age of the individual.
- Workclass: Type of employment.
- Education: Level of education.
- Occupation: Type of occupation.
- Income: Whether the individual earns more than $50K or not (target variable).
Data Cleaning 🧹:
- Handle missing values and duplicates.
- Apply necessary transformations to prepare the data for modeling.
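The cleaning step above could look roughly like the sketch below. It is not the notebook's actual code: the `clean` helper and the `?` missing-value marker are assumptions (the UCI Adult data commonly encodes unknown values as `?`, but check your copy of adult.csv first).

```python
def clean(df, missing_marker="?"):
    """Return a cleaned copy of a Spark DataFrame.

    `missing_marker` is an assumption about how adult.csv encodes
    unknown values; adjust it to match your file.
    """
    return (
        df.replace(missing_marker, None)  # turn the marker into null (string columns only)
        .dropna()                         # drop rows that contain any null
        .dropDuplicates()                 # drop exact duplicate rows
    )
```

With the `df` loaded from adult.csv, `cleaned = clean(df)` returns the cleaned DataFrame; comparing `df.count()` with `cleaned.count()` shows how many rows were removed.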
Feature Engineering 🔧:
- Create new features like age_category to enhance model performance.
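One possible way to derive `age_category` is to bucket the age column with a SQL `CASE` expression. This is only a sketch: the cut-offs, the labels, the `age` column name, and both helper functions are illustrative assumptions, not values taken from the notebook.

```python
# Hypothetical cut-offs: (exclusive upper bound, label), in ascending order.
AGE_BINS = [(25, "young"), (45, "adult"), (65, "middle_aged")]


def age_category_sql(column="age", bins=AGE_BINS, last_label="senior"):
    """Build a Spark SQL CASE expression that maps ages to category labels."""
    branches = " ".join(
        f"WHEN {column} < {upper} THEN '{label}'" for upper, label in bins
    )
    return f"CASE {branches} ELSE '{last_label}' END"


def add_age_category(df):
    """Append an age_category column to a Spark DataFrame.

    Rows with a null age fall through to the ELSE label,
    so run this after the cleaning step.
    """
    return df.selectExpr("*", f"{age_category_sql()} AS age_category")
```

Keeping the bins in one list makes it easy to try different cut-offs and compare their effect on model performance.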
Modeling 💡:
- Implement various machine learning models:
  - Logistic Regression (LR) 🤖
  - Decision Tree (DT) 🌳
  - Support Vector Machine (SVM) 🧑‍💼
  - Random Forest (RF) 🌲
  - Naive Bayes (NB) 📈
  - Gradient Boosting (GBT) 🚀
Model Evaluation 📉:
- Evaluate model performance using metrics such as:
  - Accuracy ✅
  - Precision 🎯
  - Recall 🔍
  - F1 Score 🏆
  - ROC-AUC 📊
  - Confusion Matrix 📉
You can install the necessary libraries with the following command:
pip install pyspark findspark pyngrok matplotlib seaborn scikit-learn
- Initialize PySpark (Python):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("IncomePrediction").getOrCreate()
- Load the dataset (Python):
df = spark.read.csv('adult.csv', header=True, inferSchema=True)
- Run the project:
jupyter notebook income_prediction.ipynb

This project evaluates the performance of different machine learning models using the following metrics:
- Accuracy: Measures the proportion of correctly classified instances.
- Precision: The ratio of correctly predicted positive observations to the total predicted positives.
- Recall: The ratio of correctly predicted positive observations to all actual positives.
- F1 Score: The harmonic mean of Precision and Recall, providing a balance between them.
- ROC-AUC: The Area Under the Receiver Operating Characteristic Curve, indicating the model's ability to distinguish between classes.
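To make the definitions above concrete, here is a small sketch that computes the first four metrics from confusion-matrix counts. The `classification_metrics` helper and the example counts are made up for illustration and are not results from this project. ROC-AUC is left out because it needs predicted scores, not just counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not measured on adult.csv.
example = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print({name: round(value, 3) for name, value in example.items()})
```

In this made-up example accuracy is high while recall is noticeably lower, which is why accuracy alone can be misleading when one income class is much more common than the other.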
- Clone the repository:
git clone https://github.com/ujwalakopparthi/Income-Prediction-Using-Pyspark.git
- Navigate to the project directory:
cd Income-Prediction-Using-Pyspark
- Open the Jupyter notebook:
jupyter notebook income_prediction.ipynb

Feel free to fork this project and contribute to improving it! 🚀
If you have any suggestions, improvements, or bug fixes, please:
- Open an issue if you encounter any problems.
- Submit a pull request with your changes.
All contributions are welcome, and any help is greatly appreciated!
This project is licensed under the MIT License - see the LICENSE file for details.