This dbt project implements a modern ELT pipeline to transform raw job posting data into actionable insights. It focuses on scoring job listings based on title relevance and required technical skills (SQL, Python, dbt).
**jobsjumble_sliced**: Raw job postings data including titles, companies, locations, and full descriptions.
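As a minimal sketch, the seed can be declared in `dbt_project.yml` so dbt loads the CSV with explicit column types (the project name `job_scoring` and the column types shown are assumptions; adjust to your project):

```yaml
# dbt_project.yml (fragment) — project name and column types are assumptions
seeds:
  job_scoring:
    jobsjumble_sliced:
      +column_types:
        job_title: varchar
        company_name: varchar
        job_description: varchar
```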
**stg_job_postings**
- Description: Standardizes raw seed data.
- Operations: Renames columns for clarity, cleans whitespace.
- Tests: `not_null` checks on `job_title` and `company_name`.
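The `not_null` tests above can be declared in a schema file next to the model; a minimal sketch (the file path and `version: 2` layout follow dbt conventions, the model/column names come from this README):

```yaml
# models/staging/schema.yml — minimal sketch
version: 2

models:
  - name: stg_job_postings
    columns:
      - name: job_title
        tests:
          - not_null
      - name: company_name
        tests:
          - not_null
```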
**fct_job_scoring**
- Description: The core fact table that calculates a `relevance_score` for each posting.
- Logic:
  - Title Scoring: Assigns points for specific keywords like 'Analyst', 'Engineer', and 'Senior'.
  - Skill Scoring: Parses the `job_description` for technical keywords like 'SQL', 'Python', and 'dbt'.
- Materialization: Table.
- Tests: `not_null` check on `relevance_score`.
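The scoring logic could look roughly like the following (an illustrative sketch, not the project's exact model: the point values, `ilike` operator — Postgres/Snowflake syntax — and any column names beyond `job_title`, `job_description`, and `relevance_score` are assumptions):

```sql
-- models/marts/fct_job_scoring.sql — illustrative sketch; point values are assumptions
{{ config(materialized='table') }}

with staged as (
    select * from {{ ref('stg_job_postings') }}
)

select
    *,
    -- Title scoring: points for role keywords
      (case when job_title ilike '%analyst%'  then 10 else 0 end)
    + (case when job_title ilike '%engineer%' then 10 else 0 end)
    + (case when job_title ilike '%senior%'   then 5  else 0 end)
    -- Skill scoring: points for technical keywords in the description
    + (case when job_description ilike '%sql%'    then 5 else 0 end)
    + (case when job_description ilike '%python%' then 5 else 0 end)
    + (case when job_description ilike '%dbt%'    then 5 else 0 end)
    as relevance_score
from staged
```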
**job_postings_snapshot**
- Strategy: `check` on `relevance_score`.
- Purpose: Tracks how job scoring changes over time as data is updated or refined.
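A snapshot with the `check` strategy on `relevance_score` can be sketched as follows (the `unique_key`, `target_schema`, and source column `job_id` are assumptions):

```sql
-- snapshots/job_postings_snapshot.sql — sketch; unique_key and target_schema are assumptions
{% snapshot job_postings_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='job_id',
        strategy='check',
        check_cols=['relevance_score']
    )
}}

select * from {{ ref('fct_job_scoring') }}

{% endsnapshot %}
```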
**select_state**: A utility macro for filtering data by state (included for legacy/utility demonstration).
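A macro like this might look as follows (a sketch only: the argument names and the `state` column are assumptions, since the README does not show the macro body):

```sql
-- macros/select_state.sql — sketch; signature and column name are assumptions
{% macro select_state(relation, state_value) %}
    select *
    from {{ relation }}
    where state = '{{ state_value }}'
{% endmacro %}
```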
- Automated Testing: Implemented schema tests to ensure data integrity.
- SCD Type 2 Modeling: Used snapshots to capture history of transformed data.
- Complex Transformations: Logic-driven scoring system using SQL `CASE` statements and string parsing.
- Layered Architecture: Separation of concerns between staging (cleaning) and marts (business logic).
Run the full pipeline using Docker:
```shell
# Seed, Run, Test, Snapshot
dbt seed && dbt run && dbt test && dbt snapshot
```