Skip to content

pranaviyay/SQL-Data-Warehouse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQL Data Warehouse Project

A full end-to-end data warehousing project built with SQL Server, implementing a Medallion Architecture (Bronze → Silver → Gold) to ingest, clean, and model data from two source systems (CRM and ERP) into an analytics-ready star schema.


Project Overview

This project simulates a real-world scenario where customer, product, and sales data lives across two separate source systems — a CRM and an ERP. The goal is to consolidate and transform that data into a single, trusted layer that can power business reporting and analytics.

Key things built here:

  • Database and schema initialisation
  • Raw ingestion layer (Bronze) via bulk load from CSV
  • Cleaned and standardised layer (Silver) with business logic transformations
  • Analytical model (Gold) using dimensional modelling (star schema)
  • Data quality checks for both Silver and Gold layers

Architecture

Source Systems (CSV)
      │
      ▼
┌─────────────┐
│   BRONZE    │  Raw ingestion — no transformations, exact copy of source
└─────┬───────┘
      │
      ▼
┌─────────────┐
│   SILVER    │  Cleaned, deduplicated, standardised data
└─────┬───────┘
      │
      ▼
┌─────────────┐
│    GOLD     │  Star schema — fact + dimension views ready for analytics
└─────────────┘

Repository Structure

sql-data-warehouse-project/
│
├── datasets/
│   ├── source_crm/          # CRM source files (customers, products, sales)
│   │   ├── cust_info.csv
│   │   ├── prd_info.csv
│   │   └── sales_details.csv
│   └── source_erp/          # ERP source files (locations, demographics, categories)
│       ├── CUST_AZ12.csv
│       ├── LOC_A101.csv
│       └── PX_CAT_G1V2.csv
│
├── scripts/
│   ├── init_database.sql    # Creates the DataWarehouse DB and schemas
│   ├── bronze/
│   │   ├── ddl_bronze.sql          # Table definitions for the Bronze layer
│   │   └── proc_load_bronze.sql    # Stored procedure: bulk load CSV → Bronze
│   ├── silver/
│   │   ├── ddl_silver.sql          # Table definitions for the Silver layer
│   │   └── proc_load_silver.sql    # Stored procedure: ETL Bronze → Silver
│   └── gold/
│       └── ddl_gold.sql            # View definitions for Gold (star schema)
│
├── tests/
│   ├── quality_checks_silver.sql   # QA queries for Silver tables
│   └── quality_checks_gold.sql     # QA queries for Gold views
│
└── docs/
    └── data_model.md        # Notes on the data model and design decisions

Data Sources

Source Table Description
CRM crm_cust_info Customer master data
CRM crm_prd_info Product catalogue with pricing history
CRM crm_sales_details Order-level sales transactions
ERP erp_cust_az12 Customer demographics (DOB, gender)
ERP erp_loc_a101 Customer country/location data
ERP erp_px_cat_g1v2 Product category and subcategory mappings

Gold Layer (Star Schema)

The final analytical model consists of two dimension tables and one fact table:

  • gold.dim_customers — unified customer profile combining CRM + ERP data, with a generated surrogate key
  • gold.dim_products — current product catalogue (historical records filtered out), joined with category data
  • gold.fact_sales — transactional sales records linked to both dimensions via surrogate keys

Transformations Applied (Silver Layer)

Table Transformation
crm_cust_info Deduplicate by most recent record per customer; normalise gender (M/FMale/Female) and marital status (S/MSingle/Married); trim whitespace
crm_prd_info Extract cat_id from product key; derive prd_end_dt using LEAD() window function; normalise product line codes
crm_sales_details Parse integer date fields (YYYYMMDD) to DATE; recalculate sls_sales where sales ≠ qty × price; derive missing price from sales/quantity
erp_cust_az12 Strip NAS prefix from customer IDs; null out future birthdates; normalise gender values
erp_loc_a101 Remove hyphens from customer IDs; expand country codes (DEGermany, US/USAUnited States)

How to Run

Step 1 — Initialise the database

scripts/init_database.sql

Step 2 — Bronze layer

scripts/bronze/ddl_bronze.sql
EXEC bronze.load_bronze;

Update the file paths in proc_load_bronze.sql to match your local CSV location before running.

Step 3 — Silver layer

scripts/silver/ddl_silver.sql
EXEC silver.load_silver;

Step 4 — Gold layer

scripts/gold/ddl_gold.sql

Step 5 — Run quality checks

tests/quality_checks_silver.sql
tests/quality_checks_gold.sql

Requirements

  • SQL Server 2016+ (uses LEAD(), ROW_NUMBER(), BULK INSERT)
  • CSV source files placed at the paths referenced in proc_load_bronze.sql

Key Design Decisions

  • Used ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC) to handle duplicate customer records — keeping the most recent snapshot rather than arbitrarily dropping rows.
  • Derived product end dates using LEAD() on the start date, which handles historical price versioning naturally without hardcoding.
  • CRM is treated as the primary source for gender; ERP data is only used as a fallback via COALESCE in the Gold view.
  • Gold layer is implemented as views rather than physical tables, keeping it always in sync with Silver without a separate load step.

About

End-to-end SQL Server data warehouse built on a medallion architecture (Bronze → Silver → Gold). Ingests raw CRM and ERP data, applies transformation and cleaning logic, and exposes a star schema for analytics.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages