This project implements a Medallion Architecture (Bronze, Silver, Gold) to build a cloud-based data lakehouse from raw transactional data. The solution automates the ingestion, cleaning, and aggregation of AdventureWorks sales data to deliver actionable business intelligence.
The data flow follows these stages:
- Ingestion (Bronze): Raw CSV files are moved from source to ADLS Gen2 via Azure Data Factory.
- Transformation (Silver): Data is cleaned, standardized, and converted to Parquet format using PySpark in Databricks.
- Serving (Gold): Business-level aggregates are created in Azure Synapse Analytics via Serverless SQL Pools for reporting.
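The Bronze stage maps naturally to an ADF Copy activity. A trimmed, illustrative pipeline export in the spirit of this project (pipeline and dataset names here are hypothetical, not taken from the repo):

```json
{
  "name": "pl_ingest_sales_bronze",
  "properties": {
    "activities": [
      {
        "name": "CopySalesCsvToBronze",
        "type": "Copy",
        "inputs":  [{ "referenceName": "ds_source_sales_csv", "type": "DatasetReference" }],
        "outputs": [{ "referenceName": "ds_adls_bronze_csv", "type": "DatasetReference" }],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink":   { "type": "DelimitedTextSink" }
        }
      }
    ]
  }
}
```

The Copy activity lands the raw CSVs in ADLS Gen2 unmodified, which keeps the Bronze layer a faithful, replayable record of the source.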
- Orchestration: Azure Data Factory (ADF)
- Data Lake: Azure Data Lake Storage (ADLS) Gen2
- Compute: Azure Databricks (Spark 3.x)
- Data Warehouse: Azure Synapse Analytics (Serverless SQL)
- Visualization: Power BI
- Security: Azure Key Vault & Managed Identities (SMI)
Implemented recursive file reads in PySpark to handle the varying sales data schemas from 2015 to 2017, and used the mergeSchema option to produce a consistent DataFrame across years.
Converted the large CSV files into Snappy-compressed Parquet files in the Silver layer, reducing the storage footprint and improving query performance in the Gold layer by roughly 10x.
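A minimal sketch of a Gold-layer serving object in Synapse Serverless, querying the Silver Parquet files directly with OPENROWSET (the storage URL, schema, and view name are placeholders):

```sql
-- Assumes a `gold` schema already exists in the serverless database.
CREATE VIEW gold.vw_sales AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/silver/sales/',
    FORMAT = 'PARQUET'
) AS sales;
```

Because serverless SQL pools store no data of their own, the Parquet files are scanned on demand whenever Power BI queries the view, which is why the columnar Parquet format matters for Gold-layer performance.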
Configured Service Principals and Azure Key Vault for secure authentication between Databricks and ADLS, eliminating the need for hard-coded access keys.
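That wiring can be sketched as a small helper; the secret key names, storage account, and tenant placeholders are hypothetical, and in a real notebook `conf` would be `spark.conf` while `get_secret` would wrap `dbutils.secrets.get` against a Key Vault-backed scope:

```python
# Sketch of service-principal (OAuth) auth from Databricks to ADLS Gen2.
# These fs.azure.* keys are the standard Spark configs for abfss:// access.

def configure_adls_oauth(conf, get_secret, account: str, tenant_id: str) -> None:
    """Set the Spark OAuth configs for one ADLS Gen2 storage account."""
    suffix = f"{account}.dfs.core.windows.net"
    conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
             "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    conf.set(f"fs.azure.account.oauth2.client.id.{suffix}",
             get_secret("sp-client-id"))        # pulled from Azure Key Vault
    conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}",
             get_secret("sp-client-secret"))    # never hard-coded in the notebook
    conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
             f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```

In a notebook the call would look like `configure_adls_oauth(spark.conf, lambda key: dbutils.secrets.get("kv-scope", key), "<account>", "<tenant-id>")`, keeping every credential in Key Vault rather than in source control.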
- /pipelines/: ADF JSON exports for ingestion logic.
- /notebooks/: PySpark notebooks for Silver & Gold transformations.
- /sql/: Synapse Serverless SQL scripts for views and CETAS.
- /docs/: Architecture diagrams and data dictionary.
The final pipeline serves a refined Gold layer accessible via Synapse Serverless SQL, enabling interactive Power BI dashboarding with zero infrastructure management.
Sharad Jadhav Data Engineer | Azure Specialist LinkedIn | Portfolio
