Data Engineering Project – ELT Pipeline with dbt & Snowflake

Overview

This project is a Data Engineering initiative focused on building an ELT pipeline using dbt and Snowflake, based on a real e-commerce dataset (Olist, Brazil).

The objective is to design a scalable and maintainable data platform oriented to analytical use cases, applying best practices in data modeling, transformation, data quality, and AI-powered enrichment.

Current Status

✅ Data ingestion completed in Snowflake
✅ Staging layer (stg) implemented with data cleaning and standardization
✅ Intermediate layer implemented for reusable transformations
✅ Core layer implemented (fact and dimension models)
✅ Analytical layer implemented with multiple business-oriented marts
✅ Snowflake Cortex AI integration (sentiment analysis + translation)
🚧 Ongoing improvements and extensions

Architecture

Data Warehouse: Snowflake
Transformation approach: ELT
Transformation tool: dbt

Data Layers

RAW (Bronze)
Source data ingested into Snowflake from AWS S3
STAGING (stg) — Silver
Data cleaning, normalization and type casting
INTERMEDIATE (int) — Silver
Reusable transformations to simplify downstream models.
Follows the DRY principle (Don't Repeat Yourself): complex or expensive calculations are performed once here and inherited by all dependent marts.
CORE — Gold
Fact and dimension models:
- fct_sales
- dim_customers
- dim_products
- dim_date
- dim_seller
MARTS — Gold
Business-oriented analytical models

Snowflake Cortex AI Integration

This project integrates Snowflake Cortex AI functions directly into the dbt pipeline to enrich customer reviews with NLP capabilities.

Problem

The Olist dataset contains 99,224 customer reviews written in Portuguese. Analyzing sentiment required both translation and scoring within the pipeline.

Solution

Two Cortex functions are chained in the int_order_reviews_enriched intermediate model:

AI_TRANSLATE — translates each review from Portuguese to English
SNOWFLAKE.CORTEX.SENTIMENT — returns a sentiment score (float between -1 and +1)

NULL values are handled explicitly with CASE WHEN to avoid unnecessary Cortex API calls and reduce credit consumption.

-- Step 1: Translate
CASE
    WHEN review_comment_message IS NOT NULL
    THEN AI_TRANSLATE(review_comment_message, 'pt', 'en')
    ELSE NULL
END AS review_comment_message_translated

-- Step 2: Score sentiment on translated text
CASE
    WHEN review_comment_message_translated IS NOT NULL
    THEN SNOWFLAKE.CORTEX.SENTIMENT(review_comment_message_translated)
    ELSE NULL
END AS review_sentiment

Key decisions

Materialized as table (not view) — Cortex is called once per dbt run, not on every query
Enrichment placed in intermediate layer — all sentiment marts inherit from a single source of truth
Cross-region inference required for EU accounts: ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_EU'

Data Marts

Sales

mart_customer_base — customer classification (new vs repeat) by month
mart_customer_rfm — RFM analysis (Recency, Frequency, Monetary)
mart_sales_daily — daily aggregation of sales metrics
mart_sales_by_state — sales performance by geographic location
mart_top_products — product-level sales analysis
mart_price_history — price evolution tracking via SCD2 snapshot

Sentiment

mart_review_sentiment — sentiment metrics per product, seller and month
- avg_sentiment — average Cortex sentiment score
- avg_review_score — average star rating (1–5)
- total_orders — distinct orders count
- total_items — total items sold
mart_category_sentiment — sentiment metrics aggregated by product category
- avg_sentiment — average sentiment per category
- avg_review_score — average star rating per category
- reviews_text — count of reviews with text comments
- total_orders — distinct orders per category
- total_products — total items sold per category

Project Structure

models/
  staging/
  marts/
    core/
      analytics/
        sales/
        sentiment/
      intermediate/
      sales/
tests/
macros/
seeds/
snapshots/

Data Quality & Testing

The project includes:

Schema Tests

not_null
unique
accepted_values

Singular Tests

Duplicate detection (e.g. customer + month grain)
Negative revenue checks
Invalid order counts
Business rule validations (e.g. customer classification)

Key Design Decisions

Separation of facts and dimensions
Use of intermediate models following the DRY principle
Cortex AI enrichment materialized as table to minimize credit consumption
Explicit NULL handling before Cortex function calls
Use of window functions (ROW_NUMBER) for latest state logic
LEFT JOIN when enriching facts to preserve data completeness
Sentiment enrichment in intermediate layer to serve multiple marts from a single calculation

How to Run

⚠️ Note: Source data is loaded in a private Snowflake account connected to AWS S3 and is not publicly available. To run this project you will need to set up your own Snowflake account and load the Olist dataset into your RAW layer.

dbt deps
dbt run
dbt test

For EU Snowflake accounts using Cortex AI functions, enable cross-region inference first:

ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_EU';

Next Steps

Add dbt tests for sentiment marts
Extend mart_category_sentiment with time dimension (month/year)
Build Power BI dashboards consuming Gold layer marts
Explore incremental models for scalability
Add model descriptions and dbt docs

Name		Name	Last commit message	Last commit date
Latest commit History 89 Commits
Snowflake		Snowflake
analyses		analyses
macros		macros
models		models
seeds		seeds
snapshots		snapshots
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
dbt_project.yml		dbt_project.yml
package-lock.yml		package-lock.yml
packages.yml		packages.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Engineering Project – ELT Pipeline with dbt & Snowflake

Overview

Current Status

Architecture

Data Layers

Snowflake Cortex AI Integration

Problem

Solution

Key decisions

Data Marts

Sales

Sentiment

Project Structure

Data Quality & Testing

Schema Tests

Singular Tests

Key Design Decisions

How to Run

Next Steps

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Data Engineering Project – ELT Pipeline with dbt & Snowflake

Overview

Current Status

Architecture

Data Layers

Snowflake Cortex AI Integration

Problem

Solution

Key decisions

Data Marts

Sales

Sentiment

Project Structure

Data Quality & Testing

Schema Tests

Singular Tests

Key Design Decisions

How to Run

Next Steps

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages