This project demonstrates a complete data analysis workflow using a real-world sales dataset. The objective is to transform raw transactional data into structured insights through data cleaning, preprocessing, exploratory analysis, and visualization.
The project follows a structured analytical pipeline:
- Data inspection and merging of three source tables
- Data cleaning and validation
- Feature engineering (Revenue, Total Cost, Profit)
- Exploratory Data Analysis (EDA)
- Business insight extraction
The notebook walks through the analysis end to end, the way a data analyst would work on a real dataset, rather than presenting isolated visualizations.
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn
The project uses three source datasets that are merged into a single analytical dataframe:
Events (transactional sales records):
- `order_id`: unique order identifier
- `order_date`: date of purchase
- `ship_date`: shipping date
- `order_priority`: order priority level
- `country_code`: country of sale
- `product_id`: product identifier
- `sales_channel`: sales channel (Online / Offline)
- `units_sold`: number of units sold
- `unit_price`: price per unit
- `unit_cost`: cost per unit

Products: product catalog with `id` and `item_type`

Countries: country reference with `alpha-3`, `region`, `sub-region`
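As a rough illustration of how the three tables could be combined, here is a minimal sketch using small stand-in dataframes. The variable names, the sample values, and the exact column labels of the reference tables (`id`, `alpha-3`, `sub-region`) are assumptions based on the descriptions above; the notebook's actual loading and merge code may differ.

```python
import pandas as pd

# Tiny stand-in tables (made-up values); the real data comes from the project's source files.
events = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country_code": ["ROU", "AND", "Unknown"],
    "product_id": [10, 11, 10],
    "units_sold": [5, 2, 7],
})
products = pd.DataFrame({"id": [10, 11], "item_type": ["Cosmetics", "Fruits"]})
countries = pd.DataFrame({
    "alpha-3": ["ROU", "AND"],
    "region": ["Europe", "Europe"],
    "sub-region": ["Eastern Europe", "Southern Europe"],
})

# Left joins keep every order, even when a reference row is missing.
df = (
    events
    .merge(products, how="left", left_on="product_id", right_on="id")
    .merge(countries, how="left", left_on="country_code", right_on="alpha-3")
    .drop(columns=["id", "alpha-3"])  # drop the duplicated join keys
)

print(df)
```

A left join is used here so that orders with an unmatched country code (such as "Unknown") stay in the dataframe with empty region fields instead of being dropped.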
The following preprocessing steps were performed:
- Data structure inspection across all three tables
- Data type corrections (date conversion for `order_date` and `ship_date`)
- Missing value handling (country codes filled with "Unknown", `units_sold` filled with the mean)
- Cyrillic-to-Latin character normalization and duplicate check
- Merging tables via `product_id` and `country_code`
- Creation of calculated metrics: Revenue, Total Cost, Profit
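The date conversion, missing-value handling, and metric steps above can be sketched roughly as follows. The sample values are made up, and the metric formulas (revenue as units times price, total cost as units times cost, profit as their difference) and the lowercase metric column names are assumptions; the notebook may define them differently.

```python
import pandas as pd

# Stand-in events table with made-up values and deliberate gaps.
events = pd.DataFrame({
    "order_date": ["2016-01-05", "2016-02-10", "2016-02-10"],
    "ship_date": ["2016-01-15", "2016-02-12", "2016-03-01"],
    "country_code": ["ROU", None, "AND"],
    "units_sold": [10.0, None, 20.0],
    "unit_price": [5.0, 2.0, 3.0],
    "unit_cost": [3.0, 1.0, 2.5],
})

# Convert the date columns from strings to datetime.
for col in ("order_date", "ship_date"):
    events[col] = pd.to_datetime(events[col])

# Fill missing values: a placeholder for country codes, the column mean for units sold.
events["country_code"] = events["country_code"].fillna("Unknown")
events["units_sold"] = events["units_sold"].fillna(events["units_sold"].mean())

# Calculated metrics (assumed formulas).
events["revenue"] = events["units_sold"] * events["unit_price"]
events["total_cost"] = events["units_sold"] * events["unit_cost"]
events["profit"] = events["revenue"] - events["total_cost"]

print(events[["units_sold", "revenue", "total_cost", "profit"]])
```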
- The most profitable category is Cosmetics, the least profitable is Fruits.
- The highest revenue and costs are in the Office Supplies category.
- Profit distribution between Online and Offline channels is relatively close in value across categories.
- Top countries by order volume: San Marino, Andorra, Romania
- Revenue by sub-region: Southern Europe leads (33.1%), followed by Eastern Europe (22.2%) and Northern Europe (17.4%)
- Average shipping interval varies by category; the longest delays are in Office Supplies and Cereal
- By country: longest shipping time in Hungary, shortest in Croatia
- By region: delivery in Asia takes longer than in Europe
- No direct linear relationship between shipping time and profit
- Profit is distributed evenly across the entire range (1–50 days), meaning delivery time does not significantly affect profitability
- Revenue dynamics differ across categories and countries, with no single general trend
- Europe shows significantly higher revenue than Asia, with more volatile dynamics
- Most orders are placed on Sunday (207), fewest on Thursday (167)
- Orders are spread fairly evenly across weekdays for most categories, with some exceptions: Household sales dip on Tuesday, Fruits sales peak on weekends
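For readers who want to reproduce this kind of breakdown, the sketch below shows one way the shipping interval, revenue share, weekday counts, and shipping-versus-profit correlation could be computed with pandas. It runs on a four-row stand-in dataframe with made-up numbers, and the derived names (`shipping_days`, `revenue_share`, and so on) are hypothetical, so it does not reproduce the figures reported above.

```python
import pandas as pd

# Stand-in for the merged dataframe (made-up values).
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2016-01-03", "2016-01-07", "2016-01-10", "2016-01-12"]),
    "ship_date": pd.to_datetime(["2016-01-13", "2016-01-12", "2016-01-30", "2016-01-15"]),
    "item_type": ["Cosmetics", "Fruits", "Cosmetics", "Fruits"],
    "sub-region": ["Southern Europe", "Eastern Europe", "Southern Europe", "Eastern Europe"],
    "revenue": [100.0, 50.0, 300.0, 50.0],
    "profit": [40.0, 10.0, 90.0, 20.0],
})

# Shipping interval in whole days between order and shipment.
df["shipping_days"] = (df["ship_date"] - df["order_date"]).dt.days

# Average shipping interval per product category.
avg_shipping = df.groupby("item_type")["shipping_days"].mean()

# Share of total revenue per sub-region, in percent.
revenue_share = (
    df.groupby("sub-region")["revenue"].sum()
    .div(df["revenue"].sum())
    .mul(100)
    .round(1)
)

# Number of orders placed on each day of the week.
orders_by_weekday = df["order_date"].dt.day_name().value_counts()

# Pearson correlation as a quick check for a linear shipping/profit relationship.
corr = df["shipping_days"].corr(df["profit"])

print(avg_shipping, revenue_share, orders_by_weekday, corr, sep="\n\n")
```

A correlation coefficient near zero would be consistent with the observation that shipping time and profit have no direct linear relationship, though a scatter plot remains the better way to inspect the full distribution.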
- The most profitable category is Cosmetics; the least profitable is Fruits
- Online and Offline channels show relatively similar profit distribution across categories
- Southern Europe generates the largest share of revenue (33.1%)
- Shipping interval does not significantly affect profitability
- Delivery in Asia takes longer than in Europe
- Sunday has the highest order volume; Thursday has the lowest
- Revenue dynamics vary across categories and regions with no single universal trend
- Clone this repository
- Install dependencies: `pip install -r requirements.txt`
- Open `data_analysis_workflow.ipynb` in Jupyter Notebook or Google Colab
full-data-analysis-workflow/
├── images/
│ ├── plots.png
│ ├── profit_category.png
│ ├── profit_vs_shipping.png
│ └── order_day_week.png
├── data_analysis_workflow.ipynb
├── requirements.txt
└── README.md