Siddhesh100711/Uber_Data_Analysis

🚖 Uber Booking Analysis

📌 Project Overview

This project analyzes Uber ride booking data using Data Science and Machine Learning techniques. The goal is to extract insights from the ride data and build models that predict whether a ride will be completed.


🎯 Problem Statement

Uber booking systems generate large volumes of data related to rides, customers, and operations. Analyzing this data helps improve:

  • Ride completion rates
  • Customer satisfaction
  • Operational efficiency

This project aims to:

  • Clean and preprocess the dataset
  • Perform Exploratory Data Analysis (EDA)
  • Handle class imbalance
  • Build and evaluate machine learning models

📊 Dataset Description

The dataset contains ride-related attributes such as:

| Column Name | Description |
| --- | --- |
| Booking ID | Unique ride identifier |
| Booking Value | Total fare of ride |
| Ride Distance | Distance traveled |
| Driver Ratings | Rating given to driver |
| Customer Rating | Rating given by customer |
| Payment Method | Cash / UPI / Card |
| Vehicle Type | Type of vehicle |
| Booking Status | Completed / Cancelled |
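
As a rough illustration of how these columns might be turned into a modeling target, the sketch below builds a tiny DataFrame with the same column names and derives a binary label from Booking Status. Every row value (including the vehicle type names) and the `is_completed` column name are invented for illustration and are not taken from the project's dataset.

```python
import pandas as pd

# Synthetic rows for illustration only; none of these values come from the dataset.
rides = pd.DataFrame({
    "Booking ID": ["B001", "B002", "B003", "B004"],
    "Booking Value": [250.0, 120.0, 480.0, 90.0],
    "Ride Distance": [12.5, 5.0, 24.0, 3.2],
    "Driver Ratings": [4.5, None, 4.8, None],
    "Customer Rating": [4.0, None, 5.0, None],
    "Payment Method": ["UPI", "Cash", "Card", "Cash"],
    "Vehicle Type": ["Sedan", "Auto", "SUV", "Bike"],
    "Booking Status": ["Completed", "Cancelled", "Completed", "Cancelled"],
})

# Hypothetical binary target: 1 if the ride was completed, 0 otherwise.
rides["is_completed"] = (rides["Booking Status"] == "Completed").astype(int)

print(rides.dtypes)
print(rides["is_completed"].value_counts())
```

With the real data, the DataFrame would presumably come from `pd.read_csv` on the project's CSV file instead of being typed in by hand.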

🛠️ Technologies Used

  • Python 3
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Scikit-learn
  • PySpark

🔧 Data Preprocessing

The dataset was cleaned using the following steps:

  • Handling Missing Values

    • Numerical → Mean
    • Categorical → Mode
  • Removing Duplicates

  • Outlier Detection

    • IQR Method
    • Z-score Method
  • Feature Scaling

    • Min-Max Scaling
    • Standardization
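
The steps listed above could be sketched with pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper names (`impute`, `iqr_mask`, `zscore_mask`, `min_max`, `standardize`), the 1.5 IQR multiplier, the z-score threshold of 3, and the sample values are all assumptions.

```python
import numpy as np
import pandas as pd

def impute(df):
    """Fill numeric gaps with the column mean and categorical gaps with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().all():
            continue  # nothing to estimate a fill value from
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def iqr_mask(s, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

def zscore_mask(s, threshold=3.0):
    """True for values whose absolute z-score does not exceed the threshold."""
    std = s.std(ddof=0)
    if std == 0:
        return pd.Series(True, index=s.index)
    return ((s - s.mean()) / std).abs() <= threshold

def min_max(s):
    """Rescale to [0, 1]; assumes the column is not constant."""
    return (s - s.min()) / (s.max() - s.min())

def standardize(s):
    """Rescale to zero mean and unit variance; assumes the column is not constant."""
    return (s - s.mean()) / s.std(ddof=0)

# Invented sample: one missing row, one duplicate row, one extreme fare.
raw = pd.DataFrame({
    "Booking Value": [100.0, 120.0, np.nan, 110.0, 5000.0, 100.0],
    "Payment Method": ["UPI", "Cash", None, "UPI", "Card", "UPI"],
})

clean = impute(raw).drop_duplicates().reset_index(drop=True)
clean = clean[iqr_mask(clean["Booking Value"])].reset_index(drop=True)
clean["value_minmax"] = min_max(clean["Booking Value"])
clean["value_std"] = standardize(clean["Booking Value"])
print(clean)
```

One thing the sample makes visible: mean imputation runs before outlier removal here, so the extreme fare pulls the imputed value upward. Filling with the median, or removing outliers first, are common ways around that.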

📈 Exploratory Data Analysis (EDA)

Techniques used:

  • Histogram
  • Scatter Plot
  • Boxplot
  • Heatmap
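
The four plot types above might be produced along these lines. The data is randomly generated so the sketch runs on its own, and it uses plain Matplotlib for the heatmap to keep dependencies small; `seaborn.heatmap(corr, annot=True)` would be the Seaborn equivalent. Column choices and figure layout are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not the project's dataset.
rng = np.random.default_rng(0)
n = 200
distance = rng.gamma(shape=2.0, scale=5.0, size=n)
df = pd.DataFrame({
    "Ride Distance": distance,
    "Booking Value": 50 + 12 * distance + rng.normal(0, 15, size=n),
    "Driver Ratings": rng.uniform(3, 5, size=n),
})

corr = df.corr()   # numbers behind the heatmap
skew = df.skew()   # numeric check of the skewness seen in the histogram

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].hist(df["Booking Value"].to_numpy(), bins=20)
axes[0, 0].set_title("Histogram: Booking Value")

axes[0, 1].scatter(df["Ride Distance"].to_numpy(), df["Booking Value"].to_numpy(), s=8)
axes[0, 1].set_title("Scatter: Distance vs. Value")

axes[1, 0].boxplot(df["Ride Distance"].to_numpy())
axes[1, 0].set_title("Boxplot: Ride Distance")

im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 1].set_yticks(range(len(corr)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Heatmap: Correlation")
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
print(corr.round(2))
print(skew.round(2))
```

In a notebook the `matplotlib.use("Agg")` line can be dropped so the figure renders inline; `fig.savefig("eda_overview.png")` would write it to disk.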

🔍 Key Insights

  • Most bookings are Completed
  • Booking value increases with ride distance
  • Data shows slight skewness
  • Ratings influence ride completion

⚖️ Handling Class Imbalance

Problem:

  • Completed bookings >> Non-completed bookings

Solution:

  • SMOTE (Synthetic Minority Oversampling Technique)
  • Training on the SMOTE-balanced dataset improved model performance
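
To show what SMOTE does under the hood, here is a hand-rolled NumPy sketch of the core idea: each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbours. It is not the notebook's code; in practice the `SMOTE` class from imbalanced-learn (with `fit_resample`) is the usual choice. The function name `smote`, the value `k=5`, and the generated data are assumptions.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly picked minority sample and one of its k nearest minority neighbours."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, n, size=n_new)                        # samples to start from
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    gap = rng.random((n_new, 1))                                 # weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Generated stand-ins: 90 "completed" rides versus 10 "non-completed" rides.
rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))

synthetic = smote(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, synthetic])
y_bal = np.array([1] * len(X_major) + [0] * (len(X_minor) + len(synthetic)))
print(np.bincount(y_bal))
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class ratio and contains no synthetic rows.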

🤖 Machine Learning Models

The following models were implemented:

1. Logistic Regression

  • Used for binary classification
  • Predicts probability of ride completion

2. K-Nearest Neighbors (KNN)

  • Based on similarity between data points

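3. Naive Bayes

  • Probabilistic classifier
  • Fast and efficient

4. Decision Tree

  • Learns a hierarchy of feature-based splits
  • Selected as the best model (see Results)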
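
A minimal scikit-learn sketch of fitting the classifiers described in this README (Logistic Regression, KNN, Naive Bayes, and the Decision Tree named under Results) is shown below. The data is generated with `make_classification`, and the hyperparameters, the Gaussian variant of Naive Bayes, and the scaling pipelines are assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Generated stand-in for the ride features and the completed / not-completed label.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

models = {
    # Distance- and gradient-based models usually benefit from scaled inputs.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X, y)

# Logistic Regression exposes class probabilities; column 1 is P(label == 1).
proba = models["Logistic Regression"].predict_proba(X[:3])[:, 1]
print(proba)
```

Fitting and predicting on the same rows, as done here for brevity, says nothing about generalization; a held-out split is needed for that.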

📊 Model Evaluation

  • Train-Test Split (70-30)
  • Cross Validation
  • Confusion Matrix
  • Accuracy Score
  • Z-Test for validation
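
The evaluation steps above can be sketched as follows, again on generated data. The README does not say how the Z-test was applied; one plausible reading, shown here purely as an assumption, is a one-sided one-proportion z-test of test accuracy against the majority-class baseline. The model choice, the number of folds, and the random seeds are likewise assumptions.

```python
from math import erfc, sqrt

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0, random_state=42
)

# 70-30 split, stratified so both parts keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class

# 5-fold cross-validation on the training part only.
cv_scores = cross_val_score(GaussianNB(), X_train, y_train, cv=5, scoring="accuracy")

# Assumed z-test: is accuracy higher than always predicting the majority class?
n_test = len(y_test)
p0 = max(y_test.mean(), 1 - y_test.mean())        # majority-class baseline accuracy
z = (acc - p0) / sqrt(p0 * (1 - p0) / n_test)
p_value = 0.5 * erfc(z / sqrt(2))                 # upper-tail probability P(Z > z)

print("accuracy:", round(acc, 3))
print("confusion matrix:\n", cm)
print("cv mean:", round(cv_scores.mean(), 3))
print("z:", round(z, 2), "p-value:", p_value)
```

On an imbalanced test set, accuracy alone can look high even for a model that rarely predicts the minority class, so the confusion matrix is worth reading alongside it.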

🏆 Results

  • Multiple models were compared

  • Best Model: Decision Tree Classifier

    • Highest accuracy
    • Captures complex patterns

⚠️ Limitation

  • The Decision Tree may overfit the training data

✅ Alternative

  • Naive Bayes provides more stable performance
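
One simple way to check for the overfitting mentioned above is to compare training and test accuracy for each model: a large gap suggests the model has memorized the training set. The sketch below does this for a Decision Tree and Naive Bayes on generated, deliberately noisy data; the helper name `train_test_accuracy` and all settings are assumptions, and the numbers it prints say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Generated data with 10% of labels flipped to mimic noisy real-world records.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, n_redundant=2,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

def train_test_accuracy(model):
    """Fit on the training split and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# An unrestricted tree keeps splitting until it fits the training rows.
tree_train, tree_test = train_test_accuracy(DecisionTreeClassifier(random_state=0))
nb_train, nb_test = train_test_accuracy(GaussianNB())

print(f"Decision Tree  train={tree_train:.3f}  test={tree_test:.3f}")
print(f"Naive Bayes    train={nb_train:.3f}  test={nb_test:.3f}")
```

Limiting tree growth, for example with the `max_depth` or `min_samples_leaf` parameters of `DecisionTreeClassifier`, is a common way to narrow the gap.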

📌 Conclusion

This project demonstrates how data science techniques can be applied to real-world ride booking systems. The analysis helps in:

  • Improving ride completion prediction
  • Understanding customer behavior
  • Enhancing operational decisions

📂 Project Structure

Uber-Booking-Analysis/
│── UberBookingAnalysis.ipynb
│── README.md
│── dataset.csv (optional)

🚀 How to Run

  1. Clone the repository
  2. Open the notebook in Google Colab or Jupyter
  3. Run all cells

👨‍💻 Authors

  • Roll No: 13, 14, 15

📅 Date

April 2026

About

Uber Data Analysis & Machine Learning Project: Data Cleaning, EDA, Outlier Detection, Normalization, SMOTE, Hypothesis Testing, Classification Models (Logistic Regression, KNN, Naive Bayes, Decision Tree), and Clustering using K-Means.
