Skip to content

Harsh-Belekar/Samsung-Data-Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

77 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🏒 Samsung β€” Data Engineering Project

Project Banner

End-to-End Data Engineering Project using Medallion Architecture

A fully End-to-End Data Engineering project β€” from Synthetic data generation to a production-grade three-layer Data Warehouse β€” built to demonstrate real-world data engineering skills using Python, SQL, and PostgreSQL.

Python Pandas NumPy Tool Process Process Feature Domain Type Project Status Open Source


πŸ“Œ Table of Contents

  1. Project Overview
  2. Data Warehouse Architecture
  3. Project Folder Structure
  4. Data Domains
  5. Dataset Overview
  6. Tech Stack
  7. Project Workflow β€” Step by Step
  8. Naming Conventions
  9. Pipeline Execution Guide
  10. Project Documentation
  11. Logs
  12. Author

1. πŸš€ Project Overview

This project builds a production-grade Data Warehouse for Samsung India β€” from scratch β€” using the Medallion Architecture (Bronze β†’ Silver β†’ Gold). The pipeline begins with synthetic data generation simulating Samsung India's real business operations across Sales, Finance, Customer Relations, After-Sales Service, Supply Chain, and Marketing, then processes that data through three structured warehouse layers into analytics-ready Gold views.

The project is implemented twice β€” once using Python Scripts and once using SQL Scripts β€” deliberately, to showcase proficiency in both approaches on the same problem.

What This Project Covers

  • βœ… Synthetic data generation β€” 2,000,000+ rows across 14 tables using Python
  • βœ… Multi-format raw data β€” CSV, JSON, and XLSX files organised by business domain
  • βœ… Bronze Layer β€” raw data ingestion with no transformations (14 tables)
  • βœ… Silver Layer β€” cleaned, typed, standardised, and normalised data (14 tables)
  • βœ… Gold Layer β€” business-ready Star Schema views for analytics (14 views)
  • βœ… Dual implementation β€” complete Python Scripts + complete SQL Scripts
  • βœ… Structured logging β€” every pipeline stage produces its own log file
  • βœ… Full documentation β€” Data Catalog, Data Generation Guide, Naming Conventions

2. πŸ—οΈ Data Warehouse Architecture

Data Warehouse Architecture

The architecture follows the Medallion (Multi-Hop) pattern with three progressive layers inside a single PostgreSQL database (Samsung_Data_Warehouse):

Layer Schema Object Type Load Strategy Transformations Data Model
Bronze bronze Tables Batch Β· Full Load Β· Truncate & Insert None β€” raw as-is No model
Silver silver Tables Batch Β· Full Load Β· Truncate & Insert Cleaning Β· Standardisation Β· Normalisation Β· Enrichment No model
Gold gold Views No load β€” live reads from Silver Data Integration Β· Aggregations Β· Business Logic Star Schema

Data Flow

Source Files (CSV / JSON / XLSX)
        β”‚
        β–Ό  File_converter.py (JSON & XLSX β†’ CSV)
        β”‚
        β–Ό  Load_Bronze  (Python) / Proc_Load_Bronze  (SQL)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         BRONZE LAYER            β”‚  ← Raw ingestion, all columns TEXT
β”‚   14 Tables Β· bronze schema     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό  Load_Silver  (Python) / Proc_Load_Silver  (SQL)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         SILVER LAYER            β”‚  ← Typed, cleaned, deduplicated
β”‚   14 Tables Β· silver schema     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
                 β–Ό  DDL_Gold  (Python / SQL) β€” Views only
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚          GOLD LAYER             β”‚  ← Business-ready Star Schema
β”‚   14 Views  Β· gold schema       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β–Ό                  β–Ό
   Power BI           SQL Analysis
   QuickBI             Tableau Β· ML

3. πŸ“ Project Folder Structure

Samsung Data Engineering/
β”‚
β”œβ”€β”€ Data/                              ← Raw source data organised by business domain
β”‚   β”œβ”€β”€ AS2/                           ← After-Sales Service
β”‚   β”‚   β”œβ”€β”€ complaints.csv
β”‚   β”‚   β”œβ”€β”€ returns.csv
β”‚   β”‚   β”œβ”€β”€ returns.xlsx
β”‚   β”‚   β”œβ”€β”€ service_centers.csv
β”‚   β”‚   └── service_centers.json
β”‚   β”œβ”€β”€ CRM/                           ← Customer Relationship Management
β”‚   β”‚   β”œβ”€β”€ customers.csv
β”‚   β”‚   └── product_reviews.csv
β”‚   β”œβ”€β”€ FIP/                           ← Finance & Payments
β”‚   β”‚   └── financial_transactions.csv
β”‚   β”œβ”€β”€ HRM/                           ← Human Resources & Marketing
β”‚   β”‚   β”œβ”€β”€ campaigns.csv
β”‚   β”‚   β”œβ”€β”€ campaigns.xlsx
β”‚   β”‚   β”œβ”€β”€ employees.csv
β”‚   β”‚   β”œβ”€β”€ employees.xlsx
β”‚   β”‚   β”œβ”€β”€ products.csv
β”‚   β”‚   └── products.json
β”‚   β”œβ”€β”€ SCI/                           ← Supply Chain & Inventory
β”‚   β”‚   β”œβ”€β”€ inventory.csv
β”‚   β”‚   β”œβ”€β”€ inventory.xlsx
β”‚   β”‚   β”œβ”€β”€ suppliers.csv
β”‚   β”‚   β”œβ”€β”€ warehouses.csv
β”‚   β”‚   └── warehouses.json
β”‚   └── SND/                           ← Sales & Distribution
β”‚       β”œβ”€β”€ dealers.csv
β”‚       └── sales_transactions.csv
β”‚
β”œβ”€β”€ Data_Generation/                   ← Synthetic data generation pipeline
β”‚   β”œβ”€β”€ Config.py                      ← All tunable parameters (rows, dates, quality)
β”‚   β”œβ”€β”€ Main.py                        ← Entry point β€” run this to generate all data
β”‚   β”œβ”€β”€ Master_data.py                 ← Static lookup data (products, cities, names)
β”‚   β”œβ”€β”€ Utils.py                       ← Shared helper functions
β”‚   └── Generators/                    ← One generator class per domain pair
β”‚       β”œβ”€β”€ __init__.py
β”‚       β”œβ”€β”€ Customers_Dealers.py
β”‚       β”œβ”€β”€ Employees_Inventory.py
β”‚       β”œβ”€β”€ Products_Warehouses.py
β”‚       β”œβ”€β”€ Service_Campaigns_Suppliers.py
β”‚       └── Transactions.py
β”‚
β”œβ”€β”€ Docs/                                ← Project documentation
β”‚   β”œβ”€β”€ Data_Catalog.md                  ← Gold layer table & column reference
β”‚   β”œβ”€β”€ Data_Generation_Documentation.md ← Data generation guide & configuration
β”‚   └── Naming_Conventions.md            ← Schema, table, column naming standards
β”‚
β”œβ”€β”€ Images/                            ← Architecture and schema diagrams
β”‚   β”œβ”€β”€ Data_Warehouse_Architecture.png
β”‚   └── Schema.png
β”‚
β”œβ”€β”€ Logs/                              ← Auto-generated logs for every pipeline stage
β”‚   β”œβ”€β”€ Data_generation.log
β”‚   β”œβ”€β”€ Init_database.log
β”‚   β”œβ”€β”€ DDL_Bronze.log
β”‚   β”œβ”€β”€ Load_Bronze.log
β”‚   β”œβ”€β”€ DDL_Silver.log
β”‚   β”œβ”€β”€ Helper_func.log
β”‚   β”œβ”€β”€ Load_Silver.log
β”‚   └── DDL_Gold.log
β”‚
β”œβ”€β”€ Python Scripts/                    ← Full Python implementation of the pipeline
β”‚   β”œβ”€β”€ Init_database.py               ← Creates bronze, silver, gold schemas
β”‚   β”œβ”€β”€ File_converter.py              ← Converts JSON & XLSX files to CSV
β”‚   β”œβ”€β”€ Bronze/
β”‚   β”‚   β”œβ”€β”€ DDL_Bronze.py              ← Creates 14 Bronze tables
β”‚   β”‚   └── Load_Bronze.py             ← Loads CSV data into Bronze tables
β”‚   β”œβ”€β”€ Silver/
β”‚   β”‚   β”œβ”€β”€ DDL_Silver.py              ← Creates 14 Silver tables (typed columns)
β”‚   β”‚   β”œβ”€β”€ Helper_func.py             ← Shared data cleaning functions
β”‚   β”‚   └── Load_Silver.py             ← Cleans & loads Bronze β†’ Silver
β”‚   └── Gold/
β”‚       └── DDL_Gold.py                ← Creates 14 Gold views (Star Schema)
β”‚
β”œβ”€β”€ SQL Scripts/                       ← Full SQL implementation of the pipeline
β”‚   β”œβ”€β”€ Init_database.sql              ← Creates bronze, silver, gold schemas
β”‚   β”œβ”€β”€ Bronze/
β”‚   β”‚   β”œβ”€β”€ DDL_Bronze.sql             ← Creates 14 Bronze tables
β”‚   β”‚   └── Proc_Load_Bronze.sql       ← Stored procedure: load_bronze
β”‚   β”œβ”€β”€ Silver/
β”‚   β”‚   β”œβ”€β”€ DDL_Silver.sql             ← Creates 14 Silver tables (typed columns)
β”‚   β”‚   β”œβ”€β”€ Helper_function.sql        ← Shared SQL cleaning functions
β”‚   β”‚   └── Proc_Load_Silver.sql       ← Stored procedure: load_silver
β”‚   └── Gold/
β”‚       └── DDL_Gold.sql               ← Creates 14 Gold views (Star Schema)
β”‚
β”œβ”€β”€ README.md                          ← This file
β”œβ”€β”€ LICENSE                            ← MIT License
└── requirements.txt                   ← Python dependencies

4. πŸ—‚οΈ Data Domains

The 14 source tables are organised into 6 business domains, each stored in its own subfolder inside Data/. The domain prefix is carried through all three warehouse layers as part of the naming convention.

Domain Prefix Business Area Tables
After-Sales Service AS2 Complaints, Returns, Service Centres complaints, returns, service_centers
Customer Relationship CRM Customers, Reviews customers, product_reviews
Finance & Payments FIP Payments, GST, EMI financial_transactions
Human Resources & Marketing HRM Products, Employees, Campaigns products, employees, campaigns
Supply Chain & Inventory SCI Warehouses, Suppliers, Inventory warehouses, suppliers, inventory
Sales & Distribution SND Dealers, Sales Transactions dealers, sales_transactions

5. πŸ“Š Dataset Overview

All data is synthetically generated using the custom Python pipeline in the Data Generation/ folder. It simulates Samsung India's real-world business operations across 2022–2025.

# Table Domain Format Approx. Rows Description
1 products HRM JSON / CSV 2,000 Samsung India product catalogue
2 warehouses SCI JSON / CSV 25 National warehouse master
3 service_centers AS2 JSON / CSV 1,200 Authorised service centres
4 customers CRM CSV 200,000 Registered customer master
5 dealers SND CSV 10,000 Retail dealer and partner master
6 suppliers SCI CSV 500 Component and logistics suppliers
7 campaigns HRM XLSX / CSV 1,000 Marketing campaign master
8 employees HRM XLSX / CSV 15,000 Employee HR master
9 inventory SCI XLSX / CSV 100,000 Daily inventory snapshots
10 sales_transactions SND CSV 750,000 Primary sales fact table
11 complaints AS2 CSV 200,000 After-sales complaint cases
12 returns AS2 XLSX / CSV ~77,250 Product returns (+3% duplicates)
13 financial_transactions FIP CSV ~663,000 Payment ledger (+2% duplicates)
14 product_reviews CRM CSV 50,000 Customer product ratings

Total: ~2,070,000 rows across 14 tables spanning 6 business domains and 4 years (2022–2025)

The raw data intentionally contains real-world data quality issues β€” mixed date formats, null values, duplicate rows, casing errors, and invalid values β€” to make the Silver layer cleaning pipeline meaningful and realistic.


6. πŸ› οΈ Tech Stack

Tool / Technology Version Purpose
Python 3.10+ Data generation, pipeline orchestration, data cleaning
PostgreSQL 14+ Data warehouse database engine
Pandas 2.0+ DataFrame manipulation and CSV I/O
NumPy 1.24+ Vectorised data generation
openpyxl 3.1+ Reading and writing XLSX files
psycopg2 2.9+ PostgreSQL connection from Python
SQLAlchemy 2.0+ ORM and database connection pooling
SQL PostgreSQL dialect DDL, stored procedures, helper functions
Power BI Latest Dashboard and reporting (Gold layer consumer)

Install all Python dependencies with:

pip install -r requirements.txt

7. πŸ”„ Project Workflow β€” Step by Step

Phase 1 β€” Synthetic Data Generation

The project begins with generating a realistic, messy dataset that simulates Samsung India's operational data.

Entry point: Data Generation/Main.py

cd "Data Generation"
python Main.py

What happens:

  • Config.py sets the date range (2022–2025), row counts per table, and data quality parameters
  • Master_data.py provides all static lookup data β€” Indian names, cities, Samsung product SKUs, payment modes, etc.
  • Utils.py provides shared helper functions β€” rnd_phone(), rnd_date(), add_messy(), fn_to_boolean()
  • Each generator class in Generators/ produces one or two tables and exposes its generated ID pool for use as foreign keys in downstream tables
  • All 14 files are generated in order of dependency and saved to the Data/ domain folders

Dependency chain (generation order must be preserved):

Products β†’ Warehouses β†’ Customers β†’ Dealers β†’
Service Centers β†’ Campaigns β†’ Suppliers β†’ Employees β†’
Inventory β†’ Sales Transactions β†’ Complaints β†’
Returns β†’ Financial Transactions β†’ Product Reviews

For full configuration options and parameter reference, see Docs/Data_Generation_Documentation.md


Phase 2 β€” Database Setup

File Conversion

Some source files are in JSON and XLSX format. File_converter.py converts all of them to CSV before loading β€” ensuring a single, consistent ingestion format for the warehouse pipeline.

python "Python Scripts/File_converter.py"

Database and Schema Initialisation

Creates the Samsung_Data_Warehouse PostgreSQL database and the three schemas: bronze, silver, and gold.

# Python
python "Python Scripts/Init_database.py"

# SQL equivalent
psql -U <user> -f "SQL Scripts/Init_database.sql"

Log: Logs/Init_database.log


Phase 3 β€” Bronze Layer

The Bronze layer ingests raw data exactly as it is β€” no type casting, no cleaning, no transformations. All columns are stored as TEXT. This preserves the original data integrity and provides a full audit trail.

Step 1 β€” Create Bronze Tables (DDL)

Creates 14 tables in the bronze schema. All columns are TEXT to accept any raw value without rejection.

# Python
python "Python Scripts/Bronze/DDL_Bronze.py"

# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Bronze/DDL_Bronze.sql"

Log: Logs/DDL_Bronze.log

Step 2 β€” Load Bronze Tables

Reads the CSV files from each domain folder and inserts them into the Bronze tables using a Truncate & Insert (Full Load) strategy.

# Python
python "Python Scripts/Bronze/Load_Bronze.py"

# SQL equivalent (Stored Procedure)
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Bronze/Proc_Load_Bronze.sql"
CALL load_bronze();

Log: Logs/Load_Bronze.log

Bronze Table Naming: <sourcesystem>_<entity> Example: bronze.crm_customers, bronze.snd_sales_transactions


Phase 4 β€” Silver Layer

The Silver layer transforms Bronze data into clean, typed, and standardised tables. This is where all data quality issues are resolved.

Cleaning operations performed:

  • Mixed date formats β†’ DATE type (6 format variants handled)
  • Phone prefixes (+91, leading 0) β†’ 10-digit standard format
  • Mixed boolean encodings (Yes/No/1/0/True/False) β†’ BOOLEAN
  • Salary formats ("18 LPA" / "150000" monthly / "18L") β†’ Annual INR NUMERIC
  • GST strings ("18%" / "18" / "GST@18") β†’ Numeric percentage
  • Category casing errors β†’ Title Case canonical values
  • Duplicate rows in returns and financial_transactions β†’ Deduplicated
  • Out-of-range values (negative stock, invalid ratings) β†’ Cleaned or excluded
  • Null standardisation β†’ Genuine NULLs preserved, bad values removed

Step 1 β€” Create Silver Tables (DDL)

Creates 14 tables in the silver schema with proper data types, NOT NULL constraints, CHECK constraints, and foreign keys.

# Python
python "Python Scripts/Silver/DDL_Silver.py"

# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/DDL_Silver.sql"

Log: Logs/DDL_Silver.log

Step 2 β€” Create Helper Functions

Creates shared data cleaning functions used by the Silver loading process.

# Python (module imported by Load_Silver.py)
# Helper_func.py is not run standalone β€” it is imported by Load_Silver.py

# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/Helper_function.sql"

Log: Logs/Helper_func.log

Key functions created:

Function Purpose
fn_to_boolean(raw_val) Converts Yes/No/1/0/True/False β†’ BOOLEAN
fn_clean_phone(raw_phone) Strips +91 / leading 0 β†’ 10-digit CHAR
fn_parse_date(raw_date) Parses 6 mixed date formats β†’ DATE
fn_parse_epoch_or_date(val) Converts Unix epoch strings β†’ DATE
fn_parse_gst_pct(raw_gst) Parses all GST string formats β†’ NUMERIC
fn_salary_to_annual(raw_salary) Converts LPA / monthly / shorthand β†’ Annual INR

Step 3 β€” Load Silver Tables

Reads from Bronze tables, applies all cleaning transformations, deduplicates where needed, and inserts into Silver tables.

# Python
python "Python Scripts/Silver/Load_Silver.py"

# SQL equivalent (Stored Procedure)
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/Proc_Load_Silver.sql"
CALL load_silver();

Log: Logs/Load_Silver.log

Silver Table Naming: <sourcesystem>_<entity> Example: silver.crm_customers, silver.snd_sales_transactions


Phase 5 β€” Gold Layer

The Gold layer exposes business-ready data as PostgreSQL VIEWS β€” no physical storage, no load process. Every Gold view reads live, cleaned data from the Silver layer and presents it in a Star Schema structure optimised for analytics and reporting.

Star Schema structure:

Type Count Examples
Dimension Views (dim_) 8 dim_products, dim_customers, dim_employees
Fact Views (fact_) 6 fact_sales_transactions, fact_complaints, fact_returns

Create Gold Views (DDL)

# Python
python "Python Scripts/Gold/DDL_Gold.py"

# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Gold/DDL_Gold.sql"

Log: Logs/DDL_Gold.log

Gold View Naming: <category>_<entity> Examples: gold.dim_customers, gold.fact_sales_transactions

Gold layer consumers:

Tool Use Case
Power BI Interactive dashboards and KPI reports
QuickBI Business intelligence and self-service analytics
Tableau Advanced data visualisation
Python / Jupyter Data analysis and machine learning
SQL Clients Ad-hoc analysis by data and business analysts

For a full reference of all 14 Gold views β€” their purpose, column definitions, data types, and join relationships β€” see Docs/Data_Catalog.md


8. πŸ“ Naming Conventions

All database objects follow consistent naming standards documented in Docs/Naming_Conventions.md.

Table / View Naming

Layer Pattern Example
Bronze <sourcesystem>_<entity> bronze.crm_customers
Silver <sourcesystem>_<entity> silver.hrm_employees
Gold <category>_<entity> gold.dim_customers

Gold Category Prefixes

Prefix Meaning Example
dim_ Dimension table dim_products, dim_customers
fact_ Fact table fact_sales_transactions
agg_ Aggregated table agg_monthly_revenue

Column Naming

Type Pattern Example
Surrogate keys <tablename>_key customer_key
Technical / metadata columns dwh_<column_name> dwh_load_date

Stored Procedures

Pattern Example
load_<layer> load_bronze, load_silver

All names use snake_case with lowercase letters. No SQL reserved words used as object names.


9. ▢️ Pipeline Execution Guide

Prerequisites

  • Python 3.10 or higher
  • PostgreSQL 14 or higher (running locally or on a server)
  • A PostgreSQL user with CREATE DATABASE privileges

Step 1 β€” Clone the Repository

git clone https://github.com/Harsh-Belekar/Samsung-Data-Engineering.git
cd "Samsung Data Engineering"

Step 2 β€” Install Python Dependencies

pip install -r requirements.txt

Step 3 β€” Configure Database Connection

Update the connection settings in the Python scripts or create a .env file:

DB_HOST=localhost
DB_PORT=5432
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=Samsung_Data_Warehouse

Step 4 β€” Generate the Data

cd "Data Generation"
python Main.py
cd ..

Step 5 β€” Convert JSON & XLSX to CSV

python "Python Scripts/File_converter.py"

Step 6 β€” Run the Full Pipeline

Option A β€” Python Scripts (recommended)

python "Python Scripts/Init_database.py"
python "Python Scripts/Bronze/DDL_Bronze.py"
python "Python Scripts/Bronze/Load_Bronze.py"
python "Python Scripts/Silver/DDL_Silver.py"
python "Python Scripts/Silver/Load_Silver.py"
python "Python Scripts/Gold/DDL_Gold.py"

Option B β€” SQL Scripts

-- Run in order in your PostgreSQL client or psql
\i 'SQL Scripts/Init_database.sql'
\i 'SQL Scripts/Bronze/DDL_Bronze.sql'
\i 'SQL Scripts/Bronze/Proc_Load_Bronze.sql'
CALL load_bronze();
\i 'SQL Scripts/Silver/DDL_Silver.sql'
\i 'SQL Scripts/Silver/Helper_function.sql'
\i 'SQL Scripts/Silver/Proc_Load_Silver.sql'
CALL load_silver();
\i 'SQL Scripts/Gold/DDL_Gold.sql'

Step 7 β€” Verify the Pipeline

-- Check row counts across all three layers
SELECT 'bronze' AS layer, 'crm_customers' AS table_name, COUNT(*) FROM bronze.crm_customers
UNION ALL
SELECT 'silver', 'crm_customers', COUNT(*) FROM silver.crm_customers
UNION ALL
SELECT 'gold',   'dim_customers', COUNT(*) FROM gold.dim_customers;

10. πŸ“„ Project Documentation

Document Location Description
Data Generation Guide Docs/Data_Generation_Documentation.md How to configure and run the data generator β€” row counts, date ranges, quality parameters, and extending the project
Gold Layer Data Catalog Docs/Data_Catalog.md Full reference for all 14 Gold views β€” purpose, grain, column definitions, data types, example values, and join relationships
Naming Conventions Docs/Naming_Conventions.md Standards for all schema, table, view, column, and stored procedure naming across all three layers
Architecture Diagram Images/Data_Warehouse_Architecture.png Visual overview of the full Bronze β†’ Silver β†’ Gold pipeline with source systems and consumers
Schema Diagram Images/Schema.png Entity-relationship diagram showing how all 14 tables connect through primary and foreign keys

11. πŸ“‹ Logs

Every stage of the pipeline produces a dedicated log file in the Logs/ folder. Each log captures start time, completion status, row counts, and any errors encountered.

Log File Pipeline Stage Script
Data_generation.log Synthetic data generation Data Generation/Main.py
Init_database.log Schema creation Python Scripts/Init_database.py
DDL_Bronze.log Bronze table creation Python Scripts/Bronze/DDL_Bronze.py
Load_Bronze.log Bronze data loading Python Scripts/Bronze/Load_Bronze.py
DDL_Silver.log Silver table creation Python Scripts/Silver/DDL_Silver.py
Helper_func.log Cleaning function creation Python Scripts/Silver/Helper_func.py
Load_Silver.log Silver data cleaning & loading Python Scripts/Silver/Load_Silver.py
DDL_Gold.log Gold view creation Python Scripts/Gold/DDL_Gold.py

SQL Scripts produce the same transformations as Python Scripts but do not write to the Logs/ folder β€” the Python implementation serves as the canonical log source.


⚠️ Dataset Disclaimer

All datasets used in this project are dummy, synthetic, or public β€” generated programmatically using Python for learning and portfolio demonstration purposes only.

No real customer data, restaurant data, or proprietary Samsung information has been used. This project is not affiliated with, endorsed by, or connected to Samsung in any way.


πŸ§‘β€πŸ’» Author

πŸ‘€ Harsh Belekar
πŸ“ Data Analyst | Python Developer | SQL | Power BI | Excel | Data Visualization
πŸ“¬ LinkedIn | πŸ”—GitHub

πŸ“§ harshbelekar74@gmail.com


⭐ If you found this project helpful, feel free to star the repo and connect with me for collaboration!

Made with ❀️ and a lot of β˜• by Harsh Belekar

About

πŸ—οΈ End-to-End Data Engineering project for Samsung India β€” Synthetic Data Generation β†’ Medallion Architecture (Bronze Β· Silver Β· Gold) β†’ Star Schema | Built with Python & SQL on PostgreSQL | 2M+ rows Β· 14 tables Β· 6 business domains

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors