ShxradJadhav/Adventure-Works-Data-Engineering-Project

🚀 AdventureWorks: End-to-End Azure Data Engineering Project

📌 Project Overview

This project implements a Medallion Architecture (Bronze, Silver, Gold) to transform raw transactional data into a cloud-based Logical Data Lakehouse. The solution automates the ingestion, cleaning, and aggregation of AdventureWorks sales data to provide actionable business intelligence.


πŸ— Architecture Diagram

Project Architecture

The data flow follows these stages:

  1. Ingestion (Bronze): Raw CSV files are moved from source to ADLS Gen2 via Azure Data Factory.
  2. Transformation (Silver): Data is cleaned, standardized, and converted to Parquet format using PySpark in Databricks.
  3. Serving (Gold): Business-level aggregates are created in Azure Synapse Analytics via Serverless SQL Pools for reporting.
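The three stages map onto three locations in the lake. A minimal sketch of that layout follows, assuming one container per layer (`bronze`, `silver`, `gold`) and a hypothetical storage account named `adventureworksdl`. Neither name is stated in this README, so treat both as placeholders.

```python
# Hypothetical names: the README does not state the storage account
# or the container layout, so these are placeholders.
STORAGE_ACCOUNT = "adventureworksdl"
LAYERS = ("bronze", "silver", "gold")


def layer_path(layer: str, dataset: str, account: str = STORAGE_ACCOUNT) -> str:
    """Build the abfss:// URI for a dataset folder in one medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    # ADLS Gen2 URI form: abfss://<container>@<account>.dfs.core.windows.net/<path>
    return f"abfss://{layer}@{account}.dfs.core.windows.net/{dataset.strip('/')}"


if __name__ == "__main__":
    for layer in LAYERS:
        print(layer_path(layer, "sales"))
```

Keeping the layer name in the container position means each stage can be granted access separately, which is one common way to organise a medallion lake.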

🛠 Tech Stack

  • Orchestration: Azure Data Factory (ADF)
  • Data Lake: Azure Data Lake Storage (ADLS) Gen2
  • Compute: Azure Databricks (Spark 3.x)
  • Data Warehouse: Azure Synapse Analytics (Serverless SQL)
  • Visualization: Power BI
  • Security: Azure Key Vault & Managed Identities (system-assigned, SMI)

✅ Key Solutions Provided

1. Schema Drift & Evolution

Implemented recursive reads in PySpark to pick up the sales files for 2015 through 2017, whose schemas vary from year to year. Used the mergeSchema option to keep DataFrame writes consistent across those schemas.
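A minimal sketch of reads of this kind, not the notebooks' actual code. It assumes the yearly CSV files sit in nested folders under one Bronze root and that the Silver output is Parquet. `recursiveFileLookup` is a Spark 3.x file-source option and `mergeSchema` is shown here as a Parquet read option; the function names and paths are hypothetical.

```python
# Reader options kept as plain dicts so they can be inspected without a cluster.
CSV_READ_OPTIONS = {
    "header": "true",               # first row of each file holds column names
    "recursiveFileLookup": "true",  # descend into nested folders under the root
}
PARQUET_READ_OPTIONS = {
    "mergeSchema": "true",          # union the column sets found across Parquet files
}


def read_bronze_sales(spark, root_path: str):
    """Read every sales CSV under root_path, including subfolders."""
    return spark.read.options(**CSV_READ_OPTIONS).csv(root_path)


def read_silver_sales(spark, silver_path: str):
    """Read Parquet files whose column sets may differ between writes."""
    return spark.read.options(**PARQUET_READ_OPTIONS).parquet(silver_path)
```

In this sketch the schema reconciliation happens on the Parquet side. The CSV read only discovers the files, so columns that differ between years would still need to be aligned before writing.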

2. Performance Optimization

Converted large CSV files into Snappy-compressed Parquet files in the Silver layer. This reduced the storage footprint and made Gold-layer queries roughly 10x faster.
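The conversion step could look roughly like the sketch below. It assumes a PySpark DataFrame has already been loaded from the Bronze CSV files; the function name and the `overwrite` default are illustrative choices, not taken from the notebooks.

```python
def write_silver_parquet(df, target_path: str, mode: str = "overwrite") -> None:
    """Write a PySpark DataFrame as Snappy-compressed Parquet."""
    (
        df.write
        .mode(mode)
        # Snappy is already Spark's default Parquet codec; it is set
        # explicitly here so the intent is visible in the code.
        .option("compression", "snappy")
        .parquet(target_path)
    )
```

Parquet is columnar, so a query that touches a few columns reads far less data than it would from row-oriented CSV, which is where most of the speed-up would typically come from.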

3. "Secret-less" Architecture

Configured Service Principals and Azure Key Vault for secure authentication between Databricks and ADLS, eliminating the need for hard-coded access keys.
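A hedged sketch of how such a setup is commonly wired in a Databricks notebook. The configuration keys follow the documented ABFS OAuth pattern for service principals; the helper names, the secret scope `kv-scope`, and the secret key names are hypothetical and do not come from this repository.

```python
def adls_oauth_conf(account: str, client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Return Spark conf entries for service-principal (OAuth) access to ADLS Gen2."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(spark, conf: dict) -> None:
    """Apply each entry to the active Spark session."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# Usage inside a Databricks notebook (scope and key names are assumptions):
#   secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
#   apply_conf(spark, adls_oauth_conf("<account>", "<app-id>", secret, "<tenant-id>"))
```

Because the secret is fetched from a Key Vault-backed scope at run time, no credential needs to appear in the notebook source.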


📂 Repository Structure

  • /pipelines/: ADF JSON exports for ingestion logic.
  • /notebooks/: PySpark notebooks for Silver & Gold transformations.
  • /sql/: Synapse Serverless SQL scripts for views and CTAS.
  • /docs/: Architecture diagrams and data dictionary.

📈 Final Result

The final pipeline serves a refined Gold layer accessible via Synapse Serverless SQL, enabling real-time Power BI dashboarding with zero infrastructure management.


πŸ‘¨β€πŸ’» Author

Sharad Jadhav
Data Engineer | Azure Specialist
LinkedIn | Portfolio
