This project analyzes Uber ride booking data using data science and machine learning techniques. The goal is to extract insights from the ride data and build predictive models that classify whether a ride will be completed.
Uber booking systems generate large volumes of data related to rides, customers, and operations. Analyzing this data helps improve:
- Ride completion rates
- Customer satisfaction
- Operational efficiency
This project aims to:
- Clean and preprocess the dataset
- Perform Exploratory Data Analysis (EDA)
- Handle class imbalance
- Build and evaluate machine learning models
The dataset contains ride-related attributes such as:
| Column Name | Description |
|---|---|
| Booking ID | Unique ride identifier |
| Booking Value | Total fare of ride |
| Ride Distance | Distance traveled |
| Driver Ratings | Rating given to driver |
| Customer Rating | Rating given by customer |
| Payment Method | Cash / UPI / Card |
| Vehicle Type | Type of vehicle |
| Booking Status | Completed / Cancelled |
- Python 3
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- PySpark
The dataset was cleaned using the following steps:
- Handling Missing Values
  - Numerical → Mean
  - Categorical → Mode
- Removing Duplicates
- Outlier Detection
  - IQR Method
  - Z-score Method
- Feature Scaling
  - Min-Max Scaling
  - Standardization
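The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the toy values are invented, the helper names `iqr_mask` and `zscore_mask` are hypothetical, and the 1.5 × IQR and |z| ≤ 3 cut-offs are common defaults assumed here rather than taken from the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ride data; the values are invented for illustration.
df = pd.DataFrame({
    "Booking Value": [250.0, 300.0, np.nan, 280.0, 5000.0, 300.0],
    "Ride Distance": [5.0, 7.5, 6.0, np.nan, 6.5, 7.5],
    "Payment Method": ["UPI", "Cash", None, "UPI", "Card", "Cash"],
})

# 1. Missing values: mean for numerical columns, mode for categorical ones.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# 2. Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)


def iqr_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)


def zscore_mask(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """True for values whose absolute z-score is within the threshold."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() <= threshold


# 3. Outlier detection: keep rows that pass the IQR rule on Booking Value.
#    zscore_mask is the alternative rule and is used the same way.
clean = df[iqr_mask(df["Booking Value"])].copy()

# 4. Feature scaling on one column, both variants.
dist = clean["Ride Distance"]
clean["distance_minmax"] = (dist - dist.min()) / (dist.max() - dist.min())
clean["distance_standard"] = (dist - dist.mean()) / dist.std(ddof=0)

print(clean)
```

On this toy frame the IQR rule drops the single extreme fare. scikit-learn's `MinMaxScaler` and `StandardScaler` perform the same two scalings and can be fitted on the training split only.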
Techniques used:
- Histogram
- Scatter Plot
- Boxplot
- Heatmap
Key observations:
- Most bookings have the status Completed
- Booking value increases with ride distance
- The data shows slight skewness
- Ratings influence ride completion
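The four plot types listed above could be produced roughly as follows. This is a sketch on synthetic data, so the figures it draws say nothing about the real dataset. The column names follow the table in this README, the generated values are assumptions, and plain Matplotlib is used throughout for simplicity.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter/Colab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: fare grows with distance plus noise (assumed relationship).
rng = np.random.default_rng(0)
n = 200
distance = rng.gamma(shape=2.0, scale=4.0, size=n)
df = pd.DataFrame({
    "Ride Distance": distance,
    "Booking Value": 40 + 12 * distance + rng.normal(0, 15, size=n),
    "Customer Rating": rng.uniform(3, 5, size=n).round(1),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: distribution (and skew) of a single numeric column.
axes[0, 0].hist(df["Booking Value"].to_numpy(), bins=20)
axes[0, 0].set_title("Booking Value histogram")

# Scatter plot: relationship between two numeric columns.
axes[0, 1].scatter(df["Ride Distance"], df["Booking Value"], s=10)
axes[0, 1].set_xlabel("Ride Distance")
axes[0, 1].set_ylabel("Booking Value")

# Boxplot: spread and potential outliers.
axes[1, 0].boxplot(df["Ride Distance"].to_numpy())
axes[1, 0].set_title("Ride Distance boxplot")

# Heatmap: pairwise Pearson correlations.
corr = df.corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()

# A numeric check of skewness to accompany the histogram.
skewness = df["Booking Value"].skew()
print(f"Booking Value skewness: {skewness:.2f}")
```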
Problem:
- Completed bookings far outnumber non-completed bookings
Solution:
- SMOTE (Synthetic Minority Oversampling Technique)
- Training on the balanced dataset improved model performance
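To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic row is a random interpolation between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the class sizes, and `k=5` are assumptions for illustration. In practice the `SMOTE` class from `imblearn.over_sampling` (with `fit_resample`) is the usual choice, and the README does not state which implementation the notebook uses.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Return n_new synthetic rows interpolated between minority neighbours.

    Simplified illustration of the SMOTE idea; requires k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances among minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for every new row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base toward the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 90 "completed" rows (label 1) vs 10 "not completed" (label 0).
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(90, 2))
X_minor = rng.normal(3.0, 1.0, size=(10, 2))

X_new = smote_sketch(X_minor, n_new=len(X_major) - len(X_minor))
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.array([1] * len(X_major) + [0] * (len(X_minor) + len(X_new)))

print(np.bincount(y_bal))  # class counts after oversampling
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class ratio.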
The following models were implemented:
- Used for binary classification
- Predicts probability of ride completion
- Based on similarity between data points
- Probabilistic classifier
- Fast and efficient
- Train-Test Split (70-30)
- Cross Validation
- Confusion Matrix
- Accuracy Score
- Z-Test for validation
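The evaluation steps listed above might look like this with scikit-learn. This is a hedged sketch on synthetic data, not the project's pipeline. The model, the 5-fold setting, and the stratified split are assumptions. The README also does not say which z-test was used, so the one-sample proportion test against the majority-class baseline shown here is only one plausible reading.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for "completed" (1) vs "not completed" (0).
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4,
    weights=[0.3, 0.7], random_state=0,
)

# 70-30 train-test split, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Cross-validation on the training split only (5 folds assumed).
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Fit once on the full training split, then score on held-out data.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

# One-sample z-test: is test accuracy above the majority-class baseline p0?
n = len(y_test)
p0 = max(np.mean(y_test), 1 - np.mean(y_test))
z = (acc - p0) / np.sqrt(p0 * (1 - p0) / n)

print("CV accuracy per fold:", np.round(cv_scores, 3))
print("Test accuracy:", round(acc, 3))
print("Confusion matrix:\n", cm)
print("z statistic vs. baseline:", round(z, 2))
```

A z statistic well above about 1.64 would indicate, at the one-sided 5% level, that the model beats always predicting the majority class.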
- Multiple models were compared
- Best Model: Decision Tree Classifier
  - Highest accuracy
  - Captures complex patterns
  - May overfit the training data
- Naive Bayes provides more stable performance
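One way to check the overfitting concern noted above is to compare training and test accuracy for each model: a large gap suggests the model has memorized the training data. The sketch below uses synthetic data with deliberately noisy labels and default hyperparameters, all of which are assumptions. It does not reproduce the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, which an unpruned tree tends to memorize.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    flip_y=0.1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),  # unpruned
    "Naive Bayes": GaussianNB(),
}

gaps = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    gaps[name] = train_acc - test_acc  # larger gap = more overfitting
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```

Limiting `max_depth` or `min_samples_leaf` on the tree is a common way to narrow such a gap.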
This project demonstrates how data science techniques can be applied to real-world ride booking systems. The analysis helps in:
- Improving ride completion prediction
- Understanding customer behavior
- Enhancing operational decisions
Uber-Booking-Analysis/
├── UberBookingAnalysis.ipynb
├── README.md
└── dataset.csv (optional)
- Clone the repository
- Open the notebook in Google Colab or Jupyter
- Run all cells
- Roll No: 13, 14, 15
April 2026