This repository contains an end-to-end Azure Data Engineering project that demonstrates how to:
- Provision core Azure resources (via ARM template export)
- Orchestrate ingestion using Azure Data Factory (ADF) pipelines
- Store data in ADLS Gen2 using a Medallion architecture (Bronze → Silver → Gold)
- Transform and curate data using Azure Databricks (PySpark / Spark SQL)
- Publish analytics-ready dimension and fact tables in Unity Catalog
The project includes:
- ADF pipeline assets (factory JSON)
- Infrastructure export (ARM template.json + parameters.json)
- Databricks project/notebooks and ETL structure (Spotify-themed dataset)
- Curated Silver and Gold tables in Unity Catalog
Services/Resources used:
- Azure Data Factory (ADF): orchestration + pipeline scheduling
- Azure Data Lake Storage Gen2 (ADLS): storage for bronze/silver/gold zones
- Azure Databricks: transformations, lakehouse processing, curated tables
- Unity Catalog: governance + table management (silver/gold schemas)
- Azure SQL Database: relational source / curated outputs (as used in pipelines)
- Databricks Access Connector: secure connectivity to ADLS
- ARM Template Export: reproducible infrastructure definition
The storage account containers are organized like this:
- bronze/ → raw/ingested files (minimal/no transformation)
- silver/ → cleaned, standardized, conformed tables
- gold/ → curated, business-ready tables (dimensions/facts)
- databricksmetastore/ → Unity Catalog / metastore-related storage
- $logs/ → platform logs
- test/ → experiments / sandbox
This structure supports scalable, auditable ingestion and transformation.
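The Bronze → Silver hop across these containers can be exercised directly from Databricks. Below is a minimal PySpark sketch, assuming a hypothetical storage account name (spotifydatalake), illustrative file and column names, and ADLS authentication already configured through the Access Connector:

```python
# Minimal sketch of the Bronze -> Silver hop. The storage account name,
# file layout, and column names are illustrative, and ADLS auth is assumed
# to be configured (e.g. via the Databricks Access Connector).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

BRONZE = "abfss://bronze@spotifydatalake.dfs.core.windows.net"
SILVER = "abfss://silver@spotifydatalake.dfs.core.windows.net"

# Bronze: raw files as landed by the ADF copy activities (no transformation).
raw_tracks = spark.read.json(f"{BRONZE}/tracks/")

# Silver: cleaned, standardized, conformed.
clean_tracks = (
    raw_tracks
    .dropDuplicates(["track_id"])          # assumed key column
    .withColumn("ingested_at", F.current_timestamp())
)

clean_tracks.write.format("delta").mode("overwrite").save(f"{SILVER}/dimtrack/")
```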
The Unity Catalog structure captured in the environment:
- Catalog: spotify_catalog
- Schemas: silver, gold
- Silver tables: silver.dimartist, silver.dimdate, silver.dimtrack, silver.dimuser, silver.factstream
- Gold tables: gold.dimdate, gold.dimtrack, gold.dimuser, gold.factstream
- Staging tables (used for incremental loads / CDC patterns): gold.dimdate_stg, gold.dimtrack_stg, gold.dimuser_stg, gold.fact_stg
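As a hedged illustration of how the *_stg tables can feed their gold counterparts, a Delta Lake MERGE from staging into the target keeps a gold table incrementally updated; the merge key (track_id) is an assumption and the project's notebooks may implement this differently:

```python
# Hedged sketch: upsert a gold dimension from its staging table with a Delta
# MERGE, then clear the staging table. The merge key (track_id) is assumed;
# `spark` is the session provided by the Databricks notebook.
spark.sql("""
    MERGE INTO spotify_catalog.gold.dimtrack AS tgt
    USING spotify_catalog.gold.dimtrack_stg AS src
    ON tgt.track_id = src.track_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Staging stays an intermediate layer: it can be emptied once merged.
spark.sql("TRUNCATE TABLE spotify_catalog.gold.dimtrack_stg")
```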
This pipeline implements a loop-based incremental strategy using a ForEach activity.
Inside ForEach1:
- Lookup (Lookup1) - Reads metadata / control info required to drive incremental processing.
- Set variable (current) - Stores the current loop item / entity being processed.
- Copy data (AzureSQLToLake) - Copies data from Azure SQL into the lake storage path.
- If Condition (IfIncrementaldata) - Applies CDC/incremental logic:
  - True path
    - Script (max_cdc) → fetches the max incremental key / watermark
    - Copy data (cdc_key_update) → updates the watermark/control record
  - False path
    - Delete (Deleteemptyfile) → cleans up empty outputs when no data is present
This design allows ingestion to be incremental, repeatable, and operationally clean.
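For orientation, the same watermark pattern (look up the last processed key, copy only rows newer than it, then advance the watermark) is sketched below in PySpark rather than as ADF activities; the server, control table, entity name, and cdc_key column are placeholders, not the project's actual objects:

```python
# Hedged PySpark sketch of the watermark/CDC pattern the ForEach pipeline
# drives. Server, control-table location, entity name, and the cdc_key column
# are placeholders, not the project's actual names.
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
control_path = "abfss://bronze@<account>.dfs.core.windows.net/control/watermarks/"
target_path = "abfss://bronze@<account>.dfs.core.windows.net/factstream/"

# 1. Lookup: read the last processed watermark for this entity.
control = spark.read.format("delta").load(control_path)
last_key = control.filter(F.col("entity") == "factstream").select("last_cdc_key").first()[0]

# 2. Copy data: pull only rows newer than the watermark from Azure SQL.
incremental = spark.read.jdbc(
    url=jdbc_url,
    table=f"(SELECT * FROM dbo.factstream WHERE cdc_key > {last_key}) AS src",
    properties=jdbc_props,
)

if incremental.head(1):  # IfIncrementaldata: new rows exist
    # 3. Land the new rows in the bronze zone.
    incremental.write.format("delta").mode("append").save(target_path)
    # 4. max_cdc + cdc_key_update: advance the watermark to the new maximum key.
    new_key = incremental.agg(F.max("cdc_key")).first()[0]
    DeltaTable.forPath(spark, control_path).update(
        condition=F.col("entity") == "factstream",
        set={"last_cdc_key": F.lit(new_key)},
    )
# else: nothing is written, mirroring the Deleteemptyfile cleanup branch.
```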
Repository layout:
- factory/ - ADF factory assets exported to JSON (pipelines, datasets, linked services, triggers, etc.)
- infra/arm/resource-group-export/
  - template.json → ARM template for resource deployment
  - parameters.json → parameter values (some may require manual secure inputs)
- spotify_dub_project/ - Databricks-oriented project folder (ETL / notebooks / configs)
  - Implements transformations to produce Silver and Gold outputs
- publish_config.json - Publishing configuration used by ADF export/publish processes
The infra/arm/resource-group-export/ directory contains an export of the Azure resource group's resources, including (as applicable):
- Databricks workspace
- Storage account (ADLS Gen2 / HNS enabled)
- Access connector
- Azure SQL Server + Database
- ADF instance
- Containers (bronze/silver/gold/metastore/test)
Note: ARM exports often include values that should be treated carefully (tenant IDs, principal IDs, firewall rules, etc.).
If recreating the environment:
- Use the ARM template:
  - infra/arm/resource-group-export/template.json
  - infra/arm/resource-group-export/parameters.json
- Deploy via Azure Portal or CLI.
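For example, the CLI route could look like this (the resource group name is a placeholder):

```bash
# Hedged example: deploy the exported template into an existing resource group
# with the Azure CLI; the resource group name is a placeholder.
az deployment group create \
  --resource-group <your-resource-group> \
  --template-file infra/arm/resource-group-export/template.json \
  --parameters @infra/arm/resource-group-export/parameters.json
```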
In ADF:
- Configure Linked Services for:
- Azure SQL
- ADLS Gen2
- Databricks (if used by activities)
Ensure:
- Storage permissions are granted (via Access Connector / managed identity role assignments).
- Networking/firewall rules allow access (SQL firewall, private endpoints if applicable).
Execute:
- incremental_ingestion (single incremental run)
- incremental_loop_ingestion (ForEach-driven incremental run)
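If you prefer to trigger these runs programmatically rather than from ADF Studio, a hedged sketch using the azure-mgmt-datafactory SDK might look like this (subscription, resource group, and factory names are placeholders, and this is not necessarily how the project runs them):

```python
# Hedged sketch: trigger the incremental pipelines with azure-mgmt-datafactory
# instead of the ADF UI. Subscription, resource group, and factory names are
# placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

for pipeline in ["incremental_ingestion", "incremental_loop_ingestion"]:
    run = adf.pipelines.create_run("<resource-group>", "<factory-name>", pipeline)
    # Status is read immediately after triggering, so it will typically be
    # Queued or InProgress at this point.
    status = adf.pipeline_runs.get("<resource-group>", "<factory-name>", run.run_id).status
    print(f"{pipeline}: run_id={run.run_id}, status={status}")
```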
In Databricks:
- Confirm Silver tables are refreshed:
  silver.dimartist, silver.dimdate, silver.dimtrack, silver.dimuser, silver.factstream
- Confirm Gold tables are curated and ready for analytics:
  gold.dimdate, gold.dimtrack, gold.dimuser, gold.factstream
To monitor and validate runs:
- Use Monitor → Pipeline runs to validate execution, duration, and activity status.
- Inspect activity output for:
- rows read/written
- copy performance
- script outputs (max_cdc)
- cleanup confirmation (Deleteemptyfile)
- Verify files land in correct zones:
- Bronze → raw
- Silver → cleaned
- Gold → curated
- Query Unity Catalog tables:
- row counts, schema checks, null checks, incremental changes
- Validate that staging tables are used only as intermediate layers.
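A minimal sketch covering these validation steps from a Databricks notebook, where spark and dbutils are provided; the storage account name and the track_id key column are assumptions:

```python
# Minimal validation sketch: zone spot-checks plus row count / schema / null
# checks on the Unity Catalog tables. Storage account name and the track_id
# key column are assumptions.
from pyspark.sql import functions as F

# Verify files land in the expected zones.
account = "<storage-account>"
for zone in ["bronze", "silver", "gold"]:
    for entry in dbutils.fs.ls(f"abfss://{zone}@{account}.dfs.core.windows.net/"):
        print(zone, entry.path, entry.size)

# Row counts and schema checks on curated tables.
for table in [
    "spotify_catalog.silver.factstream",
    "spotify_catalog.gold.dimtrack",
    "spotify_catalog.gold.factstream",
]:
    df = spark.table(table)
    print(table, "rows:", df.count())
    df.printSchema()

# Null check on the assumed fact-table key.
nulls = (
    spark.table("spotify_catalog.gold.factstream")
    .filter(F.col("track_id").isNull())
    .count()
)
print("gold.factstream null track_id rows:", nulls)
```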
Key concepts demonstrated:
- Medallion architecture implementation (Bronze/Silver/Gold)
- Incremental ingestion patterns (CDC watermarking + control updates)
- Pipeline orchestration using ADF ForEach + conditional routing
- Lakehouse transformation and curation in Databricks
- Infrastructure export / reproducibility using ARM templates
- Clear separation of concerns: orchestration, storage, transformation, and serving
Future enhancements:
- Add CI/CD using GitHub Actions (ADF + Databricks deployment automation)
- Add data quality checks (Great Expectations / Deequ / custom checks)
- Add Delta Lake optimizations (ZORDER, OPTIMIZE, VACUUM scheduling); a brief sketch follows this list
- Add unit tests for transformation logic
- Add parameterized environments (dev/test/prod) and secret handling via Key Vault
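As a hedged preview of the Delta Lake optimization item above, the maintenance commands themselves are simple to schedule from a Databricks job; the table name and ZORDER column are illustrative:

```python
# Hedged preview of the Delta maintenance item above; table name and ZORDER
# column are illustrative, and `spark` is the notebook/job-provided session.
spark.sql("OPTIMIZE spotify_catalog.gold.factstream ZORDER BY (track_id)")
spark.sql("VACUUM spotify_catalog.gold.factstream RETAIN 168 HOURS")
```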