Big Data Analysis using PySpark on BigMart Dataset with ML model and business insights.

pavithralanalytics/Big-Data-Analysis-PySpark

Big Data Analysis Using PySpark

📌 Project Overview

This project demonstrates big data processing with PySpark on the BigMart Sales dataset. The analysis covers data preprocessing, aggregation, feature engineering, and a linear regression model, built with Spark MLlib, that predicts item outlet sales.

🛠 Tools Used

  • Python
  • PySpark
  • Spark MLlib
  • Jupyter Notebook

📊 Dataset

The BigMart Sales dataset, which includes the following fields:

  • Item Identifier
  • Item Type
  • Item MRP
  • Outlet Type
  • Location Tier
  • Item Outlet Sales

🔍 Analysis Performed

  • Data Cleaning
  • Missing Value Handling
  • Aggregation & GroupBy Operations
  • Sales Trend Analysis
  • Linear Regression Model

📈 Key Insights

  • Supermarket-type outlets generate higher revenue than other outlet types
  • Tier 3 cities show strong sales trends
  • Item MRP significantly impacts sales
  • PySpark efficiently processes large-scale data

🎯 Conclusion

This project demonstrates scalable data processing using distributed computing with PySpark, suitable for large datasets in real-world business environments.
