Skip to content

teevirta/Data_Warehouse_ETL

Repository files navigation

Data Warehouse & Mart Build using SQL & DuckDB

An end-to-end data engineering pipeline that transforms raw CSV files from Google Cloud Storage into a normalized star schema data warehouse, then builds analytical data marts.

Summary

  • Pipeline scope: Built a complete ETL pipeline from raw CSVs to star schema warehouse to analytical marts
  • Data modeling: Designed a star schema with fact tables, dimensions, and bridge tables for many-to-many relationships
  • ETL development: Implemented extract, transform, load processes with idempotent operations and data quality checks
  • Mart architecture: Created specialized data marts (flat, skills, priority) with additive measures and incremental update patterns

Problem & Context

Raw job posting data arrives as flat CSV files in Google Cloud Storage—not structured for analytical queries. Analysts need to answer:

  • Which skills are most in-demand over time?
  • What are hiring trends by company and location?
  • How do salary patterns vary by role and skill?

Challenge: Data teams need a single source of truth system—a data warehouse—to enable consistent, reliable analysis across the organization. Additionally, specialized data marts are required to optimize resources by pre-aggregating data for specific business use cases, reducing query complexity and improving performance for common analytical patterns.

Solution: End-to-end ETL pipeline that extracts CSVs from cloud storage, normalizes them into a star schema warehouse (separating facts from dimensions), and creates specialized data marts optimized for specific use cases (flat queries, skill demand analysis, priority role tracking).

Tech Stack

  • Database: DuckDB
  • Language: SQL (DDL for schema design, DML for data loading and transformation)
  • Data Model: Star schema (fact + dimension + bridge tables)
  • Development: VS Code for SQL editing + Terminal for DuckDB CLI execution
  • Automation: Master SQL script for pipeline orchestration
  • Version Control: Git/GitHub for versioned pipeline scripts
  • Storage: Google Cloud Storage for source CSV files

Pipeline Architecture

Data Pipeline Architecture

The pipeline transforms job posting CSVs from Google Cloud Storage into a normalized star schema data warehouse, then builds specialized analytical data marts. BI tools (Excel, Power BI, Tableau, Python) consume from both the warehouse and marts.

Data Warehouse

The data warehouse implements a star schema with company_dim, skills_dim, job_postings_fact, and skills_job_dim tables.

Data Warehouse Schema

  • SQL Files:
  • Purpose: Star schema serving as single source of truth for analytical queries
  • Grain: One row per job posting in the fact table (job_postings_fact)

Flat Mart

Denormalized table with all dimensions for ad-hoc queries.

Flat Mart Schema

  • SQL File: 03_create_flat_mart.sql – Builds denormalized table with all dimensions joined
  • Purpose: Denormalized table for quick ad-hoc queries
  • Grain: One row per job posting with all dimensions joined

Skills Mart

Time-series skill demand analysis with additive measures.

Skills Mart Schema

  • SQL File: 04_create_skills_mart.sql – Builds time-series skill demand mart
  • Purpose: Time-series analysis of skill demand over time with additive measures
  • Grain: skill_id + month_start_date + job_title_short
  • Key Features: All measures are additive (counts/sums) for safe re-aggregation

Priority Mart

Priority role tracking with incremental updates using MERGE operations.

Priority Mart Schema

  • SQL Files:
  • Purpose: Track priority roles and job snapshots with incremental update capabilities
  • Grain: One row per job posting with priority level assignment
  • Key Features: MERGE operations for incremental updates - demonstrates production-ready upsert patterns (INSERT, UPDATE, DELETE in single statement)

Dimensional Modeling

  • Star Schema Design: Fact table (job_postings_fact) with dimension tables (company_dim, skills_dim)
  • Bridge Tables: Many-to-many relationship handling (skills_job_dim, bridge_company_location, bridge_job_title)
  • Grain Definition: Proper fact table granularity (skill+month, company+title+location+month)
  • Additive Measures: Counts and sums that can be safely re-aggregated at any level
  • Surrogate Keys: Sequential ID generation using CTEs with self-joins (optional company_mart build only)

About

End-to-end SQL data engineering pipeline — raw CSVs → star schema data warehouse → analytical data marts, built with DuckDB.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors