
AzureDataFactoryProject (Azure Data Engineering End-to-End)

Overview

This repository contains an end-to-end Azure Data Engineering project that demonstrates how to:

  • Provision core Azure resources (via ARM template export)
  • Orchestrate ingestion using Azure Data Factory (ADF) pipelines
  • Store data in ADLS Gen2 using a Medallion architecture (Bronze → Silver → Gold)
  • Transform and curate data using Azure Databricks (PySpark / Spark SQL)
  • Publish analytics-ready dimension and fact tables in Unity Catalog

The project includes:

  • ADF pipeline assets (factory JSON)
  • Infrastructure export (ARM template.json + parameters.json)
  • Databricks project/notebooks and ETL structure (Spotify-themed dataset)
  • Curated Silver and Gold tables in Unity Catalog

Architecture (High Level)

Services/Resources used

  • Azure Data Factory (ADF): orchestration + pipeline scheduling
  • Azure Data Lake Storage Gen2 (ADLS): storage for bronze/silver/gold zones
  • Azure Databricks: transformations, lakehouse processing, curated tables
  • Unity Catalog: governance + table management (silver/gold schemas)
  • Azure SQL Database: relational source / curated outputs (as used in pipelines)
  • Databricks Access Connector: secure connectivity to ADLS
  • ARM Template Export: reproducible infrastructure definition

Storage Layout (ADLS Gen2)

The storage account's containers are organized as follows:

  • bronze/ → raw/ingested files (minimal/no transformation)
  • silver/ → cleaned, standardized, conformed tables
  • gold/ → curated, business-ready tables (dimensions/facts)
  • databricksmetastore/ → Unity Catalog / metastore-related storage
  • $logs/ → platform logs
  • test/ → experiments / sandbox

This structure supports scalable, auditable ingestion and transformation.
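
For orientation, the sketch below shows how a Databricks notebook might read raw files from the bronze zone and write a cleaned Delta table to the silver zone. It is illustrative only: the storage account name, source folder, and the user_id key column are placeholders, not names taken from this repo.

```python
# Minimal Bronze → Silver sketch (placeholder names throughout).
storage_account = "<storage_account>"

bronze_path = f"abfss://bronze@{storage_account}.dfs.core.windows.net/users/"
silver_path = f"abfss://silver@{storage_account}.dfs.core.windows.net/dimuser/"

raw_df = spark.read.format("json").load(bronze_path)            # Bronze: raw files as ingested
clean_df = raw_df.dropDuplicates().na.drop(subset=["user_id"])  # basic standardization
clean_df.write.format("delta").mode("overwrite").save(silver_path)  # Silver: cleaned Delta table
```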


Lakehouse Catalog & Tables (Unity Catalog)

The Unity Catalog structure captured in this environment:

Catalog

  • spotify_catalog

Schemas

  • silver
  • gold

Silver tables (cleaned & standardized)

  • silver.dimartist
  • silver.dimdate
  • silver.dimtrack
  • silver.dimuser
  • silver.factstream

Gold tables (curated / business-ready)

  • gold.dimdate
  • gold.dimtrack
  • gold.dimuser
  • gold.factstream

Staging / intermediate tables (Gold schema)

Used for incremental loads / CDC patterns (see the MERGE sketch after this list):

  • gold.dimdate_stg
  • gold.dimtrack_stg
  • gold.dimuser_stg
  • gold.fact_stg
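
The staging-to-target step is typically a Delta MERGE. A minimal sketch, run from a Databricks notebook, assuming user_id is the business key (the key and match columns are assumptions, not the repo's actual schema):

```python
# Hypothetical upsert from a staging table into its Gold target.
spark.sql("""
    MERGE INTO spotify_catalog.gold.dimuser AS tgt
    USING spotify_catalog.gold.dimuser_stg AS src
    ON tgt.user_id = src.user_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```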

ADF Pipelines (Orchestration)

1) incremental_loop_ingestion

This pipeline implements a loop-based incremental strategy using a ForEach activity.

Key flow (as implemented)

Inside ForEach1:

  1. Lookup (Lookup1)
    • Reads metadata / control info required to drive incremental processing.
  2. Set variable (current)
    • Stores the current loop item / entity being processed.
  3. Copy data (AzureSQLToLake)
    • Copies data from Azure SQL into the lake storage path.
  4. If Condition (IfIncrementaldata)
    • Applies CDC/incremental logic:
      • True path
        • Script (max_cdc) → fetches the max incremental key / watermark
        • Copy data (cdc_key_update) → updates the watermark/control record
      • False path
        • Delete (Deleteemptyfile) → cleans up empty outputs when no new data is present

This design allows ingestion to be incremental, repeatable, and operationally clean.
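
To make the watermark pattern concrete, here is the same logic written imperatively in PySpark. ADF expresses this declaratively through the Lookup, Copy, Script, and If Condition activities above; every name below (the cdc_control table, cdc_key column, dbo.users source, and connection details) is a hypothetical stand-in.

```python
# Watermark-driven incremental pull, sketched in PySpark (all names assumed).
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
jdbc_opts = {"url": jdbc_url, "user": "<user>", "password": "<password>"}

# 1. Lookup: read the stored watermark for the current entity.
control = (spark.read.format("jdbc").options(**jdbc_opts)
           .option("query", "SELECT last_cdc_key FROM dbo.cdc_control WHERE entity = 'users'")
           .load())
watermark = control.collect()[0]["last_cdc_key"]  # assumed numeric key

# 2. Copy: pull only rows newer than the watermark.
incremental = (spark.read.format("jdbc").options(**jdbc_opts)
               .option("query", f"SELECT * FROM dbo.users WHERE cdc_key > {watermark}")
               .load())

# 3. If Condition: when new rows exist, land them in Bronze; in the pipeline,
#    the cdc_key_update activity then advances the watermark, and the False
#    path deletes empty output files instead.
if incremental.count() > 0:
    incremental.write.format("delta").mode("append").save(
        "abfss://bronze@<storage_account>.dfs.core.windows.net/users/")
```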


Repo Structure (What’s in this GitHub repo)

  • factory/
    • ADF factory assets exported to JSON (pipelines, datasets, linked services, triggers, etc.)
  • infra/arm/resource-group-export/
    • template.json → ARM template for resource deployment
    • parameters.json → parameter values (some may require manual secure inputs)
  • spotify_dub_project/
    • Databricks-oriented project folder (ETL / notebooks / configs)
    • Implements transformations to produce Silver and Gold outputs
  • publish_config.json
    • Publishing configuration used by ADF export/publish processes

Infrastructure as Code (ARM Export)

The infra/arm/resource-group-export/ directory contains an export of the project's Azure resource group, including (as applicable):

  • Databricks workspace
  • Storage account (ADLS Gen2 / HNS enabled)
  • Access connector
  • Azure SQL Server + Database
  • ADF instance
  • Containers (bronze/silver/gold/metastore/test)

Note: ARM exports often include values that should be treated carefully (tenant IDs, principal IDs, firewall rules, etc.).


How to Use / Run (Typical Flow)

Step 1 — Provision Infrastructure (optional)

If recreating the environment:

  1. Use the ARM template:
    • infra/arm/resource-group-export/template.json
    • infra/arm/resource-group-export/parameters.json
  2. Deploy via Azure Portal or CLI.
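
If you prefer to script the deployment instead of using the portal, the sketch below uses the Azure SDK for Python (azure-identity and azure-mgmt-resource). Subscription, resource group, and deployment names are placeholders; the az deployment group create CLI command achieves the same result.

```python
# Hedged sketch: deploy the exported ARM template programmatically.
import json
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription_id>")

with open("infra/arm/resource-group-export/template.json") as f:
    template = json.load(f)
with open("infra/arm/resource-group-export/parameters.json") as f:
    parameters = json.load(f)["parameters"]  # portal exports wrap values under "parameters"

client.deployments.begin_create_or_update(
    "<resource_group>",
    "lakehouse-deployment",
    {"properties": {"template": template, "parameters": parameters, "mode": "Incremental"}},
).result()  # block until the deployment completes
```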

Step 2 — Configure Access & Linked Services

In ADF:

  • Configure Linked Services for:
    • Azure SQL
    • ADLS Gen2
    • Databricks (if used by activities)

Ensure:

  • Storage permissions are granted (via Access Connector / managed identity role assignments).
  • Networking/firewall rules allow access (SQL firewall, private endpoints if applicable).
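
A quick smoke test of those permissions from a Databricks notebook: if the Access Connector / managed identity role assignments are in place, listing a container succeeds (the storage account name is a placeholder).

```python
# Should list the container contents rather than raise an authorization error.
display(dbutils.fs.ls("abfss://bronze@<storage_account>.dfs.core.windows.net/"))
```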

Step 3 — Run ADF Pipelines

Execute:

  • incremental_ingestion (single incremental run)
  • incremental_loop_ingestion (foreach-driven incremental run)

Step 4 — Validate Silver/Gold Outputs

In Databricks:

  • Confirm Silver tables are refreshed:
    • silver.dimartist, silver.dimdate, silver.dimtrack, silver.dimuser, silver.factstream
  • Confirm Gold tables are curated and ready for analytics:
    • gold.dimdate, gold.dimtrack, gold.dimuser, gold.factstream
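
A simple confirmation query, run in a Databricks notebook after a pipeline run:

```python
# Row counts for the curated Gold tables (non-zero counts indicate a successful refresh).
for table in ["spotify_catalog.gold.dimdate", "spotify_catalog.gold.dimtrack",
              "spotify_catalog.gold.dimuser", "spotify_catalog.gold.factstream"]:
    print(table, spark.table(table).count())
```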

Monitoring & Validation Tips

ADF Monitoring

  • Use Monitor → Pipeline runs to validate execution, duration, and activity status.
  • Inspect activity output for:
    • rows read/written
    • copy performance
    • script outputs (max_cdc)
    • cleanup confirmation (Deleteemptyfile)

Storage Validation

  • Verify files land in correct zones:
    • Bronze → raw
    • Silver → cleaned
    • Gold → curated

Databricks Validation

  • Query Unity Catalog tables:
    • row counts, schema checks, null checks, incremental changes
  • Validate that staging tables are used only as intermediate layers.
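
For example, a schema check and a null check on one Silver table (user_id is an assumed key column for illustration):

```python
from pyspark.sql import functions as F

df = spark.table("spotify_catalog.silver.dimuser")
df.printSchema()  # schema check

# Null check on the assumed key column.
df.select(F.count(F.when(F.col("user_id").isNull(), 1)).alias("null_keys")).show()
```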

What This Project Demonstrates (Key Skills)

  • Medallion architecture implementation (Bronze/Silver/Gold)
  • Incremental ingestion patterns (CDC watermarking + control updates)
  • Pipeline orchestration using ADF ForEach + conditional routing
  • Lakehouse transformation and curation in Databricks
  • Infrastructure export / reproducibility using ARM templates
  • Clear separation of concerns: orchestration, storage, transformation, and serving

Suggested Enhancements (Future Work)

  • Add CI/CD using GitHub Actions (ADF + Databricks deployment automation)
  • Add data quality checks (Great Expectations / Deequ / custom checks)
  • Add Delta Lake optimizations (ZORDER, OPTIMIZE, VACUUM scheduling; see the sketch after this list)
  • Add unit tests for transformation logic
  • Add parameterized environments (dev/test/prod) and secret handling via Key Vault
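
For the Delta optimization item, the commands look like this when run from Databricks (the ZORDER column track_id is an assumption):

```python
# Compact small files and co-locate data for faster reads, then clean up old files.
spark.sql("OPTIMIZE spotify_catalog.gold.factstream ZORDER BY (track_id)")
spark.sql("VACUUM spotify_catalog.gold.factstream RETAIN 168 HOURS")  # 7-day retention
```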

Screenshots / Evidence

Screenshots in the repository cover:

  • ADF Pipeline: incremental_loop_ingestion
  • Resource Group (ADF)
  • Storage Containers (ADLS)
  • Unity Catalog Silver/Gold pipelines
