This repository contains an end-to-end Azure Data Engineering project that demonstrates how to:
- Provision core Azure resources (via ARM template export)
- Orchestrate ingestion using Azure Data Factory (ADF) pipelines
- Store data in ADLS Gen2 using a Medallion architecture (Bronze → Silver → Gold)
- Transform and curate data using Azure Databricks (PySpark / Spark SQL)
- Publish analytics-ready dimension and fact tables in Unity Catalog
The project includes:
- ADF pipeline assets (factory JSON)
- Infrastructure export (ARM template.json + parameters.json)
- Databricks project/notebooks and ETL structure (Spotify-themed dataset)
- Curated Silver and Gold tables in Unity Catalog
Services/Resources used:
- Azure Data Factory (ADF): orchestration + pipeline scheduling
- Azure Data Lake Storage Gen2 (ADLS): storage for bronze/silver/gold zones
- Azure Databricks: transformations, lakehouse processing, curated tables
- Unity Catalog: governance + table management (silver/gold schemas)
- Azure SQL Database: relational source / curated outputs (as used in pipelines)
- Databricks Access Connector: secure connectivity to ADLS
- ARM Template Export: reproducible infrastructure definition
The storage account containers are organized like this:
- bronze/ → raw/ingested files (minimal/no transformation)
- silver/ → cleaned, standardized, conformed tables
- gold/ → curated, business-ready tables (dimensions/facts)
- databricksmetastore/ → Unity Catalog / metastore-related storage
- $logs/ → platform logs
- test/ → experiments / sandbox
This structure supports scalable, auditable ingestion and transformation.
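The Bronze → Silver hop across these containers can be exercised directly from Databricks. Below is a minimal PySpark sketch, assuming a hypothetical storage account name (spotifydatalake), illustrative file and column names, and ADLS authentication already configured through the Access Connector:

```python
# Minimal sketch of the Bronze -> Silver hop. The storage account name,
# file layout, and column names are illustrative, and ADLS auth is assumed
# to be configured (e.g. via the Databricks Access Connector).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

BRONZE = "abfss://bronze@spotifydatalake.dfs.core.windows.net"
SILVER = "abfss://silver@spotifydatalake.dfs.core.windows.net"

# Bronze: raw files as landed by the ADF copy activities (no transformation).
raw_tracks = spark.read.json(f"{BRONZE}/tracks/")

# Silver: cleaned, standardized, conformed.
clean_tracks = (
    raw_tracks
    .dropDuplicates(["track_id"])          # assumed key column
    .withColumn("ingested_at", F.current_timestamp())
)

clean_tracks.write.format("delta").mode("overwrite").save(f"{SILVER}/dimtrack/")
```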
The Unity Catalog structure captured in the environment:
- Catalog: spotify_catalog
- Schemas: silver, gold
- Silver tables: silver.dimartist, silver.dimdate, silver.dimtrack, silver.dimuser, silver.factstream
- Gold tables: gold.dimdate, gold.dimtrack, gold.dimuser, gold.factstream
- Staging tables (used for incremental loads / CDC patterns): gold.dimdate_stg, gold.dimtrack_stg, gold.dimuser_stg, gold.fact_stg
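As a hedged illustration of how the *_stg tables can feed their gold counterparts, a Delta Lake MERGE from staging into the target keeps a gold table incrementally updated; the merge key (track_id) is an assumption and the project's notebooks may implement this differently:

```python
# Hedged sketch: upsert a gold dimension from its staging table with a Delta
# MERGE, then clear the staging table. The merge key (track_id) is assumed;
# `spark` is the session provided by the Databricks notebook.
spark.sql("""
    MERGE INTO spotify_catalog.gold.dimtrack AS tgt
    USING spotify_catalog.gold.dimtrack_stg AS src
    ON tgt.track_id = src.track_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Staging stays an intermediate layer: it can be emptied once merged.
spark.sql("TRUNCATE TABLE spotify_catalog.gold.dimtrack_stg")
```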
This pipeline implements a loop-based incremental strategy using a ForEach activity.
Inside ForEach1:
- Lookup (Lookup1) - Reads metadata / control info required to drive incremental processing.
- Set variable (current) - Stores the current loop item / entity being processed.
- Copy data (AzureSQLToLake) - Copies data from Azure SQL into the lake storage path.
- If Condition (IfIncrementaldata) - Applies CDC/incremental logic:
  - True path
    - Script (max_cdc) → fetches the max incremental key / watermark
    - Copy data (cdc_key_update) → updates the watermark/control record
  - False path
    - Delete (Deleteemptyfile) → cleans up empty outputs when no data is present
This design allows ingestion to be incremental, repeatable, and operationally clean.
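For orientation, the same watermark pattern (look up the last processed key, copy only rows newer than it, then advance the watermark) is sketched below in PySpark rather than as ADF activities; the server, control table, entity name, and cdc_key column are placeholders, not the project's actual objects:

```python
# Hedged PySpark sketch of the watermark/CDC pattern the ForEach pipeline
# drives. Server, control-table location, entity name, and the cdc_key column
# are placeholders, not the project's actual names.
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>"
jdbc_props = {"user": "<user>", "password": "<password>",
              "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
control_path = "abfss://bronze@<account>.dfs.core.windows.net/control/watermarks/"
target_path = "abfss://bronze@<account>.dfs.core.windows.net/factstream/"

# 1. Lookup: read the last processed watermark for this entity.
control = spark.read.format("delta").load(control_path)
last_key = control.filter(F.col("entity") == "factstream").select("last_cdc_key").first()[0]

# 2. Copy data: pull only rows newer than the watermark from Azure SQL.
incremental = spark.read.jdbc(
    url=jdbc_url,
    table=f"(SELECT * FROM dbo.factstream WHERE cdc_key > {last_key}) AS src",
    properties=jdbc_props,
)

if incremental.head(1):  # IfIncrementaldata: new rows exist
    # 3. Land the new rows in the bronze zone.
    incremental.write.format("delta").mode("append").save(target_path)
    # 4. max_cdc + cdc_key_update: advance the watermark to the new maximum key.
    new_key = incremental.agg(F.max("cdc_key")).first()[0]
    DeltaTable.forPath(spark, control_path).update(
        condition=F.col("entity") == "factstream",
        set={"last_cdc_key": F.lit(new_key)},
    )
# else: nothing is written, mirroring the Deleteemptyfile cleanup branch.
```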
Repository layout:
- factory/ - ADF factory assets exported to JSON (pipelines, datasets, linked services, triggers, etc.)
- infra/arm/resource-group-export/
  - template.json → ARM template for resource deployment
  - parameters.json → parameter values (some may require manual secure inputs)
- spotify_dub_project/ - Databricks-oriented project folder (ETL / notebooks / configs)
  - Implements transformations to produce Silver and Gold outputs
- publish_config.json - Publishing configuration used by ADF export/publish processes
The infra/arm/resource-group-export/ directory contains an export of the Azure resource group's resources, including (as applicable):
- Databricks workspace
- Storage account (ADLS Gen2 / HNS enabled)
- Access connector
- Azure SQL Server + Database
- ADF instance
- Containers (bronze/silver/gold/metastore/test)
Note: ARM exports often include values that should be treated carefully (tenant IDs, principal IDs, firewall rules, etc.).
If recreating the environment:
- Use the ARM template:
  - infra/arm/resource-group-export/template.json
  - infra/arm/resource-group-export/parameters.json
- Deploy via Azure Portal or CLI.
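For example, the CLI route could look like this (the resource group name is a placeholder):

```bash
# Hedged example: deploy the exported template into an existing resource group
# with the Azure CLI; the resource group name is a placeholder.
az deployment group create \
  --resource-group <your-resource-group> \
  --template-file infra/arm/resource-group-export/template.json \
  --parameters @infra/arm/resource-group-export/parameters.json
```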
In ADF:
- Configure Linked Services for:
- Azure SQL
- ADLS Gen2
- Databricks (if used by activities)
Ensure:
- Storage permissions are granted (via Access Connector / managed identity role assignments).
- Networking/firewall rules allow access (SQL firewall, private endpoints if applicable).
Execute:
- incremental_ingestion (single incremental run)
- incremental_loop_ingestion (ForEach-driven incremental run)
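If you prefer to trigger these runs programmatically rather than from ADF Studio, a hedged sketch using the azure-mgmt-datafactory SDK might look like this (subscription, resource group, and factory names are placeholders, and this is not necessarily how the project runs them):

```python
# Hedged sketch: trigger the incremental pipelines with azure-mgmt-datafactory
# instead of the ADF UI. Subscription, resource group, and factory names are
# placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

for pipeline in ["incremental_ingestion", "incremental_loop_ingestion"]:
    run = adf.pipelines.create_run("<resource-group>", "<factory-name>", pipeline)
    # Status is read immediately after triggering, so it will typically be
    # Queued or InProgress at this point.
    status = adf.pipeline_runs.get("<resource-group>", "<factory-name>", run.run_id).status
    print(f"{pipeline}: run_id={run.run_id}, status={status}")
```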
In Databricks:
- Confirm Silver tables are refreshed:
  silver.dimartist, silver.dimdate, silver.dimtrack, silver.dimuser, silver.factstream
- Confirm Gold tables are curated and ready for analytics:
  gold.dimdate, gold.dimtrack, gold.dimuser, gold.factstream
To monitor and validate runs:
- Use Monitor → Pipeline runs to validate execution, duration, and activity status.
- Inspect activity output for:
- rows read/written
- copy performance
- script outputs (max_cdc)
- cleanup confirmation (Deleteemptyfile)
- Verify files land in correct zones:
- Bronze → raw
- Silver → cleaned
- Gold → curated
- Query Unity Catalog tables:
- row counts, schema checks, null checks, incremental changes
- Validate that staging tables are used only as intermediate layers.
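A minimal sketch covering these validation steps from a Databricks notebook, where spark and dbutils are provided; the storage account name and the track_id key column are assumptions:

```python
# Minimal validation sketch: zone spot-checks plus row count / schema / null
# checks on the Unity Catalog tables. Storage account name and the track_id
# key column are assumptions.
from pyspark.sql import functions as F

# Verify files land in the expected zones.
account = "<storage-account>"
for zone in ["bronze", "silver", "gold"]:
    for entry in dbutils.fs.ls(f"abfss://{zone}@{account}.dfs.core.windows.net/"):
        print(zone, entry.path, entry.size)

# Row counts and schema checks on curated tables.
for table in [
    "spotify_catalog.silver.factstream",
    "spotify_catalog.gold.dimtrack",
    "spotify_catalog.gold.factstream",
]:
    df = spark.table(table)
    print(table, "rows:", df.count())
    df.printSchema()

# Null check on the assumed fact-table key.
nulls = (
    spark.table("spotify_catalog.gold.factstream")
    .filter(F.col("track_id").isNull())
    .count()
)
print("gold.factstream null track_id rows:", nulls)
```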
Key concepts demonstrated:
- Medallion architecture implementation (Bronze/Silver/Gold)
- Incremental ingestion patterns (CDC watermarking + control updates)
- Pipeline orchestration using ADF ForEach + conditional routing
- Lakehouse transformation and curation in Databricks
- Infrastructure export / reproducibility using ARM templates
- Clear separation of concerns: orchestration, storage, transformation, and serving
Future enhancements:
- Add CI/CD using GitHub Actions (ADF + Databricks deployment automation)
- Add data quality checks (Great Expectations / Deequ / custom checks)
- Add Delta Lake optimizations (ZORDER, OPTIMIZE, VACUUM scheduling); a brief sketch follows this list
- Add unit tests for transformation logic
- Add parameterized environments (dev/test/prod) and secret handling via Key Vault
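As a hedged preview of the Delta Lake optimization item above, the maintenance commands themselves are simple to schedule from a Databricks job; the table name and ZORDER column are illustrative:

```python
# Hedged preview of the Delta maintenance item above; table name and ZORDER
# column are illustrative, and `spark` is the notebook/job-provided session.
spark.sql("OPTIMIZE spotify_catalog.gold.factstream ZORDER BY (track_id)")
spark.sql("VACUUM spotify_catalog.gold.factstream RETAIN 168 HOURS")
```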