A fully End-to-End Data Engineering project β from Synthetic data generation to a production-grade three-layer Data Warehouse β built to demonstrate real-world data engineering skills using Python, SQL, and PostgreSQL.
- Project Overview
- Data Warehouse Architecture
- Project Folder Structure
- Data Domains
- Dataset Overview
- Tech Stack
- Project Workflow β Step by Step
- Naming Conventions
- Pipeline Execution Guide
- Project Documentation
- Logs
- Author
This project builds a production-grade Data Warehouse for Samsung India β from scratch β using the Medallion Architecture (Bronze β Silver β Gold). The pipeline begins with synthetic data generation simulating Samsung India's real business operations across Sales, Finance, Customer Relations, After-Sales Service, Supply Chain, and Marketing, then processes that data through three structured warehouse layers into analytics-ready Gold views.
The project is implemented twice β once using Python Scripts and once using SQL Scripts β deliberately, to showcase proficiency in both approaches on the same problem.
- β Synthetic data generation β 2,000,000+ rows across 14 tables using Python
- β Multi-format raw data β CSV, JSON, and XLSX files organised by business domain
- β Bronze Layer β raw data ingestion with no transformations (14 tables)
- β Silver Layer β cleaned, typed, standardised, and normalised data (14 tables)
- β Gold Layer β business-ready Star Schema views for analytics (14 views)
- β Dual implementation β complete Python Scripts + complete SQL Scripts
- β Structured logging β every pipeline stage produces its own log file
- β Full documentation β Data Catalog, Data Generation Guide, Naming Conventions
The architecture follows the Medallion (Multi-Hop) pattern with three progressive layers inside a single PostgreSQL database (Samsung_Data_Warehouse):
| Layer | Schema | Object Type | Load Strategy | Transformations | Data Model |
|---|---|---|---|---|---|
| Bronze | bronze |
Tables | Batch Β· Full Load Β· Truncate & Insert | None β raw as-is | No model |
| Silver | silver |
Tables | Batch Β· Full Load Β· Truncate & Insert | Cleaning Β· Standardisation Β· Normalisation Β· Enrichment | No model |
| Gold | gold |
Views | No load β live reads from Silver | Data Integration Β· Aggregations Β· Business Logic | Star Schema |
Source Files (CSV / JSON / XLSX)
β
βΌ File_converter.py (JSON & XLSX β CSV)
β
βΌ Load_Bronze (Python) / Proc_Load_Bronze (SQL)
βββββββββββββββββββββββββββββββββββ
β BRONZE LAYER β β Raw ingestion, all columns TEXT
β 14 Tables Β· bronze schema β
ββββββββββββββββββ¬βββββββββββββββββ
β
βΌ Load_Silver (Python) / Proc_Load_Silver (SQL)
βββββββββββββββββββββββββββββββββββ
β SILVER LAYER β β Typed, cleaned, deduplicated
β 14 Tables Β· silver schema β
ββββββββββββββββββ¬βββββββββββββββββ
β
βΌ DDL_Gold (Python / SQL) β Views only
βββββββββββββββββββββββββββββββββββ
β GOLD LAYER β β Business-ready Star Schema
β 14 Views Β· gold schema β
ββββββββββββββββββ¬βββββββββββββββββ
β
ββββββββββ΄ββββββββββ
βΌ βΌ
Power BI SQL Analysis
QuickBI Tableau Β· ML
Samsung Data Engineering/
β
βββ Data/ β Raw source data organised by business domain
β βββ AS2/ β After-Sales Service
β β βββ complaints.csv
β β βββ returns.csv
β β βββ returns.xlsx
β β βββ service_centers.csv
β β βββ service_centers.json
β βββ CRM/ β Customer Relationship Management
β β βββ customers.csv
β β βββ product_reviews.csv
β βββ FIP/ β Finance & Payments
β β βββ financial_transactions.csv
β βββ HRM/ β Human Resources & Marketing
β β βββ campaigns.csv
β β βββ campaigns.xlsx
β β βββ employees.csv
β β βββ employees.xlsx
β β βββ products.csv
β β βββ products.json
β βββ SCI/ β Supply Chain & Inventory
β β βββ inventory.csv
β β βββ inventory.xlsx
β β βββ suppliers.csv
β β βββ warehouses.csv
β β βββ warehouses.json
β βββ SND/ β Sales & Distribution
β βββ dealers.csv
β βββ sales_transactions.csv
β
βββ Data_Generation/ β Synthetic data generation pipeline
β βββ Config.py β All tunable parameters (rows, dates, quality)
β βββ Main.py β Entry point β run this to generate all data
β βββ Master_data.py β Static lookup data (products, cities, names)
β βββ Utils.py β Shared helper functions
β βββ Generators/ β One generator class per domain pair
β βββ __init__.py
β βββ Customers_Dealers.py
β βββ Employees_Inventory.py
β βββ Products_Warehouses.py
β βββ Service_Campaigns_Suppliers.py
β βββ Transactions.py
β
βββ Docs/ β Project documentation
β βββ Data_Catalog.md β Gold layer table & column reference
β βββ Data_Generation_Documentation.md β Data generation guide & configuration
β βββ Naming_Conventions.md β Schema, table, column naming standards
β
βββ Images/ β Architecture and schema diagrams
β βββ Data_Warehouse_Architecture.png
β βββ Schema.png
β
βββ Logs/ β Auto-generated logs for every pipeline stage
β βββ Data_generation.log
β βββ Init_database.log
β βββ DDL_Bronze.log
β βββ Load_Bronze.log
β βββ DDL_Silver.log
β βββ Helper_func.log
β βββ Load_Silver.log
β βββ DDL_Gold.log
β
βββ Python Scripts/ β Full Python implementation of the pipeline
β βββ Init_database.py β Creates bronze, silver, gold schemas
β βββ File_converter.py β Converts JSON & XLSX files to CSV
β βββ Bronze/
β β βββ DDL_Bronze.py β Creates 14 Bronze tables
β β βββ Load_Bronze.py β Loads CSV data into Bronze tables
β βββ Silver/
β β βββ DDL_Silver.py β Creates 14 Silver tables (typed columns)
β β βββ Helper_func.py β Shared data cleaning functions
β β βββ Load_Silver.py β Cleans & loads Bronze β Silver
β βββ Gold/
β βββ DDL_Gold.py β Creates 14 Gold views (Star Schema)
β
βββ SQL Scripts/ β Full SQL implementation of the pipeline
β βββ Init_database.sql β Creates bronze, silver, gold schemas
β βββ Bronze/
β β βββ DDL_Bronze.sql β Creates 14 Bronze tables
β β βββ Proc_Load_Bronze.sql β Stored procedure: load_bronze
β βββ Silver/
β β βββ DDL_Silver.sql β Creates 14 Silver tables (typed columns)
β β βββ Helper_function.sql β Shared SQL cleaning functions
β β βββ Proc_Load_Silver.sql β Stored procedure: load_silver
β βββ Gold/
β βββ DDL_Gold.sql β Creates 14 Gold views (Star Schema)
β
βββ README.md β This file
βββ LICENSE β MIT License
βββ requirements.txt β Python dependencies
The 14 source tables are organised into 6 business domains, each stored in its own subfolder inside Data/. The domain prefix is carried through all three warehouse layers as part of the naming convention.
| Domain | Prefix | Business Area | Tables |
|---|---|---|---|
| After-Sales Service | AS2 |
Complaints, Returns, Service Centres | complaints, returns, service_centers |
| Customer Relationship | CRM |
Customers, Reviews | customers, product_reviews |
| Finance & Payments | FIP |
Payments, GST, EMI | financial_transactions |
| Human Resources & Marketing | HRM |
Products, Employees, Campaigns | products, employees, campaigns |
| Supply Chain & Inventory | SCI |
Warehouses, Suppliers, Inventory | warehouses, suppliers, inventory |
| Sales & Distribution | SND |
Dealers, Sales Transactions | dealers, sales_transactions |
All data is synthetically generated using the custom Python pipeline in the Data Generation/ folder. It simulates Samsung India's real-world business operations across 2022β2025.
| # | Table | Domain | Format | Approx. Rows | Description |
|---|---|---|---|---|---|
| 1 | products |
HRM | JSON / CSV | 2,000 | Samsung India product catalogue |
| 2 | warehouses |
SCI | JSON / CSV | 25 | National warehouse master |
| 3 | service_centers |
AS2 | JSON / CSV | 1,200 | Authorised service centres |
| 4 | customers |
CRM | CSV | 200,000 | Registered customer master |
| 5 | dealers |
SND | CSV | 10,000 | Retail dealer and partner master |
| 6 | suppliers |
SCI | CSV | 500 | Component and logistics suppliers |
| 7 | campaigns |
HRM | XLSX / CSV | 1,000 | Marketing campaign master |
| 8 | employees |
HRM | XLSX / CSV | 15,000 | Employee HR master |
| 9 | inventory |
SCI | XLSX / CSV | 100,000 | Daily inventory snapshots |
| 10 | sales_transactions |
SND | CSV | 750,000 | Primary sales fact table |
| 11 | complaints |
AS2 | CSV | 200,000 | After-sales complaint cases |
| 12 | returns |
AS2 | XLSX / CSV | ~77,250 | Product returns (+3% duplicates) |
| 13 | financial_transactions |
FIP | CSV | ~663,000 | Payment ledger (+2% duplicates) |
| 14 | product_reviews |
CRM | CSV | 50,000 | Customer product ratings |
Total: ~2,070,000 rows across 14 tables spanning 6 business domains and 4 years (2022β2025)
The raw data intentionally contains real-world data quality issues β mixed date formats, null values, duplicate rows, casing errors, and invalid values β to make the Silver layer cleaning pipeline meaningful and realistic.
| Tool / Technology | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Data generation, pipeline orchestration, data cleaning |
| PostgreSQL | 14+ | Data warehouse database engine |
| Pandas | 2.0+ | DataFrame manipulation and CSV I/O |
| NumPy | 1.24+ | Vectorised data generation |
| openpyxl | 3.1+ | Reading and writing XLSX files |
| psycopg2 | 2.9+ | PostgreSQL connection from Python |
| SQLAlchemy | 2.0+ | ORM and database connection pooling |
| SQL | PostgreSQL dialect | DDL, stored procedures, helper functions |
| Power BI | Latest | Dashboard and reporting (Gold layer consumer) |
Install all Python dependencies with:
pip install -r requirements.txtThe project begins with generating a realistic, messy dataset that simulates Samsung India's operational data.
Entry point: Data Generation/Main.py
cd "Data Generation"
python Main.pyWhat happens:
Config.pysets the date range (2022β2025), row counts per table, and data quality parametersMaster_data.pyprovides all static lookup data β Indian names, cities, Samsung product SKUs, payment modes, etc.Utils.pyprovides shared helper functions βrnd_phone(),rnd_date(),add_messy(),fn_to_boolean()- Each generator class in
Generators/produces one or two tables and exposes its generated ID pool for use as foreign keys in downstream tables - All 14 files are generated in order of dependency and saved to the
Data/domain folders
Dependency chain (generation order must be preserved):
Products β Warehouses β Customers β Dealers β
Service Centers β Campaigns β Suppliers β Employees β
Inventory β Sales Transactions β Complaints β
Returns β Financial Transactions β Product Reviews
For full configuration options and parameter reference, see
Docs/Data_Generation_Documentation.md
File Conversion
Some source files are in JSON and XLSX format. File_converter.py converts all of them to CSV before loading β ensuring a single, consistent ingestion format for the warehouse pipeline.
python "Python Scripts/File_converter.py"Database and Schema Initialisation
Creates the Samsung_Data_Warehouse PostgreSQL database and the three schemas: bronze, silver, and gold.
# Python
python "Python Scripts/Init_database.py"
# SQL equivalent
psql -U <user> -f "SQL Scripts/Init_database.sql"Log: Logs/Init_database.log
The Bronze layer ingests raw data exactly as it is β no type casting, no cleaning, no transformations. All columns are stored as TEXT. This preserves the original data integrity and provides a full audit trail.
Step 1 β Create Bronze Tables (DDL)
Creates 14 tables in the bronze schema. All columns are TEXT to accept any raw value without rejection.
# Python
python "Python Scripts/Bronze/DDL_Bronze.py"
# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Bronze/DDL_Bronze.sql"Log: Logs/DDL_Bronze.log
Step 2 β Load Bronze Tables
Reads the CSV files from each domain folder and inserts them into the Bronze tables using a Truncate & Insert (Full Load) strategy.
# Python
python "Python Scripts/Bronze/Load_Bronze.py"
# SQL equivalent (Stored Procedure)
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Bronze/Proc_Load_Bronze.sql"
CALL load_bronze();Log: Logs/Load_Bronze.log
Bronze Table Naming: <sourcesystem>_<entity>
Example: bronze.crm_customers, bronze.snd_sales_transactions
The Silver layer transforms Bronze data into clean, typed, and standardised tables. This is where all data quality issues are resolved.
Cleaning operations performed:
- Mixed date formats β
DATEtype (6 format variants handled) - Phone prefixes (+91, leading 0) β 10-digit standard format
- Mixed boolean encodings (Yes/No/1/0/True/False) β
BOOLEAN - Salary formats ("18 LPA" / "150000" monthly / "18L") β Annual INR
NUMERIC - GST strings ("18%" / "18" / "GST@18") β Numeric percentage
- Category casing errors β Title Case canonical values
- Duplicate rows in
returnsandfinancial_transactionsβ Deduplicated - Out-of-range values (negative stock, invalid ratings) β Cleaned or excluded
- Null standardisation β Genuine NULLs preserved, bad values removed
Step 1 β Create Silver Tables (DDL)
Creates 14 tables in the silver schema with proper data types, NOT NULL constraints, CHECK constraints, and foreign keys.
# Python
python "Python Scripts/Silver/DDL_Silver.py"
# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/DDL_Silver.sql"Log: Logs/DDL_Silver.log
Step 2 β Create Helper Functions
Creates shared data cleaning functions used by the Silver loading process.
# Python (module imported by Load_Silver.py)
# Helper_func.py is not run standalone β it is imported by Load_Silver.py
# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/Helper_function.sql"Log: Logs/Helper_func.log
Key functions created:
| Function | Purpose |
|---|---|
fn_to_boolean(raw_val) |
Converts Yes/No/1/0/True/False β BOOLEAN |
fn_clean_phone(raw_phone) |
Strips +91 / leading 0 β 10-digit CHAR |
fn_parse_date(raw_date) |
Parses 6 mixed date formats β DATE |
fn_parse_epoch_or_date(val) |
Converts Unix epoch strings β DATE |
fn_parse_gst_pct(raw_gst) |
Parses all GST string formats β NUMERIC |
fn_salary_to_annual(raw_salary) |
Converts LPA / monthly / shorthand β Annual INR |
Step 3 β Load Silver Tables
Reads from Bronze tables, applies all cleaning transformations, deduplicates where needed, and inserts into Silver tables.
# Python
python "Python Scripts/Silver/Load_Silver.py"
# SQL equivalent (Stored Procedure)
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Silver/Proc_Load_Silver.sql"
CALL load_silver();Log: Logs/Load_Silver.log
Silver Table Naming: <sourcesystem>_<entity>
Example: silver.crm_customers, silver.snd_sales_transactions
The Gold layer exposes business-ready data as PostgreSQL VIEWS β no physical storage, no load process. Every Gold view reads live, cleaned data from the Silver layer and presents it in a Star Schema structure optimised for analytics and reporting.
Star Schema structure:
| Type | Count | Examples |
|---|---|---|
Dimension Views (dim_) |
8 | dim_products, dim_customers, dim_employees |
Fact Views (fact_) |
6 | fact_sales_transactions, fact_complaints, fact_returns |
Create Gold Views (DDL)
# Python
python "Python Scripts/Gold/DDL_Gold.py"
# SQL equivalent
psql -U <user> -d Samsung_Data_Warehouse -f "SQL Scripts/Gold/DDL_Gold.sql"Log: Logs/DDL_Gold.log
Gold View Naming: <category>_<entity>
Examples: gold.dim_customers, gold.fact_sales_transactions
Gold layer consumers:
| Tool | Use Case |
|---|---|
| Power BI | Interactive dashboards and KPI reports |
| QuickBI | Business intelligence and self-service analytics |
| Tableau | Advanced data visualisation |
| Python / Jupyter | Data analysis and machine learning |
| SQL Clients | Ad-hoc analysis by data and business analysts |
For a full reference of all 14 Gold views β their purpose, column definitions, data types, and join relationships β see
Docs/Data_Catalog.md
All database objects follow consistent naming standards documented in Docs/Naming_Conventions.md.
| Layer | Pattern | Example |
|---|---|---|
| Bronze | <sourcesystem>_<entity> |
bronze.crm_customers |
| Silver | <sourcesystem>_<entity> |
silver.hrm_employees |
| Gold | <category>_<entity> |
gold.dim_customers |
| Prefix | Meaning | Example |
|---|---|---|
dim_ |
Dimension table | dim_products, dim_customers |
fact_ |
Fact table | fact_sales_transactions |
agg_ |
Aggregated table | agg_monthly_revenue |
| Type | Pattern | Example |
|---|---|---|
| Surrogate keys | <tablename>_key |
customer_key |
| Technical / metadata columns | dwh_<column_name> |
dwh_load_date |
| Pattern | Example |
|---|---|
load_<layer> |
load_bronze, load_silver |
All names use
snake_casewith lowercase letters. No SQL reserved words used as object names.
- Python 3.10 or higher
- PostgreSQL 14 or higher (running locally or on a server)
- A PostgreSQL user with
CREATE DATABASEprivileges
git clone https://github.com/Harsh-Belekar/Samsung-Data-Engineering.git
cd "Samsung Data Engineering"pip install -r requirements.txtUpdate the connection settings in the Python scripts or create a .env file:
DB_HOST=localhost
DB_PORT=5432
DB_USER=your_username
DB_PASSWORD=your_password
DB_NAME=Samsung_Data_Warehouse
cd "Data Generation"
python Main.py
cd ..python "Python Scripts/File_converter.py"Option A β Python Scripts (recommended)
python "Python Scripts/Init_database.py"
python "Python Scripts/Bronze/DDL_Bronze.py"
python "Python Scripts/Bronze/Load_Bronze.py"
python "Python Scripts/Silver/DDL_Silver.py"
python "Python Scripts/Silver/Load_Silver.py"
python "Python Scripts/Gold/DDL_Gold.py"Option B β SQL Scripts
-- Run in order in your PostgreSQL client or psql
\i 'SQL Scripts/Init_database.sql'
\i 'SQL Scripts/Bronze/DDL_Bronze.sql'
\i 'SQL Scripts/Bronze/Proc_Load_Bronze.sql'
CALL load_bronze();
\i 'SQL Scripts/Silver/DDL_Silver.sql'
\i 'SQL Scripts/Silver/Helper_function.sql'
\i 'SQL Scripts/Silver/Proc_Load_Silver.sql'
CALL load_silver();
\i 'SQL Scripts/Gold/DDL_Gold.sql'-- Check row counts across all three layers
SELECT 'bronze' AS layer, 'crm_customers' AS table_name, COUNT(*) FROM bronze.crm_customers
UNION ALL
SELECT 'silver', 'crm_customers', COUNT(*) FROM silver.crm_customers
UNION ALL
SELECT 'gold', 'dim_customers', COUNT(*) FROM gold.dim_customers;| Document | Location | Description |
|---|---|---|
| Data Generation Guide | Docs/Data_Generation_Documentation.md |
How to configure and run the data generator β row counts, date ranges, quality parameters, and extending the project |
| Gold Layer Data Catalog | Docs/Data_Catalog.md |
Full reference for all 14 Gold views β purpose, grain, column definitions, data types, example values, and join relationships |
| Naming Conventions | Docs/Naming_Conventions.md |
Standards for all schema, table, view, column, and stored procedure naming across all three layers |
| Architecture Diagram | Images/Data_Warehouse_Architecture.png |
Visual overview of the full Bronze β Silver β Gold pipeline with source systems and consumers |
| Schema Diagram | Images/Schema.png |
Entity-relationship diagram showing how all 14 tables connect through primary and foreign keys |
Every stage of the pipeline produces a dedicated log file in the Logs/ folder. Each log captures start time, completion status, row counts, and any errors encountered.
| Log File | Pipeline Stage | Script |
|---|---|---|
Data_generation.log |
Synthetic data generation | Data Generation/Main.py |
Init_database.log |
Schema creation | Python Scripts/Init_database.py |
DDL_Bronze.log |
Bronze table creation | Python Scripts/Bronze/DDL_Bronze.py |
Load_Bronze.log |
Bronze data loading | Python Scripts/Bronze/Load_Bronze.py |
DDL_Silver.log |
Silver table creation | Python Scripts/Silver/DDL_Silver.py |
Helper_func.log |
Cleaning function creation | Python Scripts/Silver/Helper_func.py |
Load_Silver.log |
Silver data cleaning & loading | Python Scripts/Silver/Load_Silver.py |
DDL_Gold.log |
Gold view creation | Python Scripts/Gold/DDL_Gold.py |
SQL Scripts produce the same transformations as Python Scripts but do not write to the
Logs/folder β the Python implementation serves as the canonical log source.
All datasets used in this project are dummy, synthetic, or public β generated programmatically using Python for learning and portfolio demonstration purposes only.
No real customer data, restaurant data, or proprietary Samsung information has been used. This project is not affiliated with, endorsed by, or connected to Samsung in any way.
π€ Harsh Belekar
π Data Analyst | Python Developer | SQL | Power BI | Excel | Data Visualization
π¬ LinkedIn | πGitHub
β If you found this project helpful, feel free to star the repo and connect with me for collaboration!
Made with β€οΈ and a lot of β by Harsh Belekar

