A full end-to-end data warehousing project built with SQL Server, implementing a Medallion Architecture (Bronze → Silver → Gold) to ingest, clean, and model data from two source systems (CRM and ERP) into an analytics-ready star schema.
This project simulates a real-world scenario where customer, product, and sales data lives across two separate source systems — a CRM and an ERP. The goal is to consolidate and transform that data into a single, trusted layer that can power business reporting and analytics.
Key things built here:
- Database and schema initialisation
- Raw ingestion layer (Bronze) via bulk load from CSV
- Cleaned and standardised layer (Silver) with business logic transformations
- Analytical model (Gold) using dimensional modelling (star schema)
- Data quality checks for both Silver and Gold layers
Source Systems (CSV)
│
▼
┌─────────────┐
│ BRONZE │ Raw ingestion — no transformations, exact copy of source
└─────┬───────┘
│
▼
┌─────────────┐
│ SILVER │ Cleaned, deduplicated, standardised data
└─────┬───────┘
│
▼
┌─────────────┐
│ GOLD │ Star schema — fact + dimension views ready for analytics
└─────────────┘
sql-data-warehouse-project/
│
├── datasets/
│ ├── source_crm/ # CRM source files (customers, products, sales)
│ │ ├── cust_info.csv
│ │ ├── prd_info.csv
│ │ └── sales_details.csv
│ └── source_erp/ # ERP source files (locations, demographics, categories)
│ ├── CUST_AZ12.csv
│ ├── LOC_A101.csv
│ └── PX_CAT_G1V2.csv
│
├── scripts/
│ ├── init_database.sql # Creates the DataWarehouse DB and schemas
│ ├── bronze/
│ │ ├── ddl_bronze.sql # Table definitions for the Bronze layer
│ │ └── proc_load_bronze.sql # Stored procedure: bulk load CSV → Bronze
│ ├── silver/
│ │ ├── ddl_silver.sql # Table definitions for the Silver layer
│ │ └── proc_load_silver.sql # Stored procedure: ETL Bronze → Silver
│ └── gold/
│ └── ddl_gold.sql # View definitions for Gold (star schema)
│
├── tests/
│ ├── quality_checks_silver.sql # QA queries for Silver tables
│ └── quality_checks_gold.sql # QA queries for Gold views
│
└── docs/
└── data_model.md # Notes on the data model and design decisions
| Source | Table | Description |
|---|---|---|
| CRM | crm_cust_info |
Customer master data |
| CRM | crm_prd_info |
Product catalogue with pricing history |
| CRM | crm_sales_details |
Order-level sales transactions |
| ERP | erp_cust_az12 |
Customer demographics (DOB, gender) |
| ERP | erp_loc_a101 |
Customer country/location data |
| ERP | erp_px_cat_g1v2 |
Product category and subcategory mappings |
The final analytical model consists of two dimension tables and one fact table:
gold.dim_customers— unified customer profile combining CRM + ERP data, with a generated surrogate keygold.dim_products— current product catalogue (historical records filtered out), joined with category datagold.fact_sales— transactional sales records linked to both dimensions via surrogate keys
| Table | Transformation |
|---|---|
crm_cust_info |
Deduplicate by most recent record per customer; normalise gender (M/F → Male/Female) and marital status (S/M → Single/Married); trim whitespace |
crm_prd_info |
Extract cat_id from product key; derive prd_end_dt using LEAD() window function; normalise product line codes |
crm_sales_details |
Parse integer date fields (YYYYMMDD) to DATE; recalculate sls_sales where sales ≠ qty × price; derive missing price from sales/quantity |
erp_cust_az12 |
Strip NAS prefix from customer IDs; null out future birthdates; normalise gender values |
erp_loc_a101 |
Remove hyphens from customer IDs; expand country codes (DE → Germany, US/USA → United States) |
Step 1 — Initialise the database
scripts/init_database.sqlStep 2 — Bronze layer
scripts/bronze/ddl_bronze.sql
EXEC bronze.load_bronze;Update the file paths in
proc_load_bronze.sqlto match your local CSV location before running.
Step 3 — Silver layer
scripts/silver/ddl_silver.sql
EXEC silver.load_silver;Step 4 — Gold layer
scripts/gold/ddl_gold.sqlStep 5 — Run quality checks
tests/quality_checks_silver.sql
tests/quality_checks_gold.sql- SQL Server 2016+ (uses
LEAD(),ROW_NUMBER(),BULK INSERT) - CSV source files placed at the paths referenced in
proc_load_bronze.sql
- Used
ROW_NUMBER() OVER (PARTITION BY cst_id ORDER BY cst_create_date DESC)to handle duplicate customer records — keeping the most recent snapshot rather than arbitrarily dropping rows. - Derived product end dates using
LEAD()on the start date, which handles historical price versioning naturally without hardcoding. - CRM is treated as the primary source for gender; ERP data is only used as a fallback via
COALESCEin the Gold view. - Gold layer is implemented as views rather than physical tables, keeping it always in sync with Silver without a separate load step.