This project is a Data Engineering initiative focused on building an ELT pipeline using dbt and Snowflake, based on a real e-commerce dataset (Olist, Brazil).
The objective is to design a scalable and maintainable data platform oriented to analytical use cases, applying best practices in data modeling, transformation, data quality, and AI-powered enrichment.
✅ Data ingestion completed in Snowflake
✅ Staging layer (stg) implemented with data cleaning and standardization
✅ Intermediate layer implemented for reusable transformations
✅ Core layer implemented (fact and dimension models)
✅ Analytical layer implemented with multiple business-oriented marts
✅ Snowflake Cortex AI integration (sentiment analysis + translation)
🚧 Ongoing improvements and extensions
- Data Warehouse: Snowflake
- Transformation approach: ELT
- Transformation tool: dbt
-
RAW (Bronze)
Source data ingested into Snowflake from AWS S3 -
STAGING (
stg) — Silver
Data cleaning, normalization and type casting -
INTERMEDIATE (
int) — Silver
Reusable transformations to simplify downstream models.
Follows the DRY principle (Don't Repeat Yourself): complex or expensive calculations are performed once here and inherited by all dependent marts. -
CORE — Gold
Fact and dimension models:fct_salesdim_customersdim_productsdim_datedim_seller
-
MARTS — Gold
Business-oriented analytical models
This project integrates Snowflake Cortex AI functions directly into the dbt pipeline to enrich customer reviews with NLP capabilities.
The Olist dataset contains 99,224 customer reviews written in Portuguese. Analyzing sentiment required both translation and scoring within the pipeline.
Two Cortex functions are chained in the int_order_reviews_enriched intermediate model:
AI_TRANSLATE— translates each review from Portuguese to EnglishSNOWFLAKE.CORTEX.SENTIMENT— returns a sentiment score (float between -1 and +1)
NULL values are handled explicitly with CASE WHEN to avoid unnecessary Cortex API calls and reduce credit consumption.
-- Step 1: Translate
CASE
WHEN review_comment_message IS NOT NULL
THEN AI_TRANSLATE(review_comment_message, 'pt', 'en')
ELSE NULL
END AS review_comment_message_translated
-- Step 2: Score sentiment on translated text
CASE
WHEN review_comment_message_translated IS NOT NULL
THEN SNOWFLAKE.CORTEX.SENTIMENT(review_comment_message_translated)
ELSE NULL
END AS review_sentiment- Materialized as table (not view) — Cortex is called once per
dbt run, not on every query - Enrichment placed in intermediate layer — all sentiment marts inherit from a single source of truth
- Cross-region inference required for EU accounts:
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_EU'
mart_customer_base— customer classification (new vs repeat) by monthmart_customer_rfm— RFM analysis (Recency, Frequency, Monetary)mart_sales_daily— daily aggregation of sales metricsmart_sales_by_state— sales performance by geographic locationmart_top_products— product-level sales analysismart_price_history— price evolution tracking via SCD2 snapshot
-
mart_review_sentiment— sentiment metrics per product, seller and monthavg_sentiment— average Cortex sentiment scoreavg_review_score— average star rating (1–5)total_orders— distinct orders counttotal_items— total items sold
-
mart_category_sentiment— sentiment metrics aggregated by product categoryavg_sentiment— average sentiment per categoryavg_review_score— average star rating per categoryreviews_text— count of reviews with text commentstotal_orders— distinct orders per categorytotal_products— total items sold per category
models/
staging/
marts/
core/
analytics/
sales/
sentiment/
intermediate/
sales/
tests/
macros/
seeds/
snapshots/
The project includes:
not_nulluniqueaccepted_values
- Duplicate detection (e.g. customer + month grain)
- Negative revenue checks
- Invalid order counts
- Business rule validations (e.g. customer classification)
- Separation of facts and dimensions
- Use of intermediate models following the DRY principle
- Cortex AI enrichment materialized as table to minimize credit consumption
- Explicit NULL handling before Cortex function calls
- Use of window functions (
ROW_NUMBER) for latest state logic - LEFT JOIN when enriching facts to preserve data completeness
- Sentiment enrichment in intermediate layer to serve multiple marts from a single calculation
⚠️ Note: Source data is loaded in a private Snowflake account connected to AWS S3 and is not publicly available. To run this project you will need to set up your own Snowflake account and load the Olist dataset into your RAW layer.
dbt deps
dbt run
dbt testFor EU Snowflake accounts using Cortex AI functions, enable cross-region inference first:
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_EU';- Add dbt tests for sentiment marts
- Extend
mart_category_sentimentwith time dimension (month/year) - Build Power BI dashboards consuming Gold layer marts
- Explore incremental models for scalability
- Add model descriptions and dbt docs