diff --git a/README.md b/README.md index ce11692..6e45c16 100644 --- a/README.md +++ b/README.md @@ -31,6 +31,7 @@ Type `/lfx-skills:lfx` and describe what you want in plain language: - **"Where does the meeting data flow live?"** — the router classifies the task and points at the owning repos plus the relevant central skill. - **"I'm adding a new V2 resource service"** — routes you to `/lfx-skills:lfx-platform-architecture` for platform flow, service class, and cross-service handoff points; the owning repo's path-scoped guidance handles Go conventions. - **"Does this API already exist?"** — `/lfx-skills:lfx` runs a read-only research pass to verify owning repos, contracts, examples, and blockers before implementation. +- **"Generate a new silver dbt model"** — routes to `/lfx-skills:lfx-data-engineer` for medallion-layer conventions, sqlfluff formatting, tests, and dbt validation guidance. - **"Add or fix Intercom in this app"** — routes to `/lfx-skills:lfx-intercom`. - **"Add a CDP Snowflake connector"** — routes to `/lfx-skills:lfx-cdp-snowflake-connectors`. - **"Catch me up on my open PRs"** — routes to `/lfx-skills:lfx-pr-catchup`. @@ -50,7 +51,7 @@ Canonical LFX knowledge that lives in this plugin and is referenced by every LFX | `/lfx-skills:lfx-itx-integration` | ITX wrapper patterns: OAuth2 M2M tokens, v1 KV sync, NATS ID mapping via `lfx.lookup_v1_mapping`. | | `/lfx-skills:lfx-intercom` | Retained central Intercom workflow from `main`, plus Fin AI optimization: Fin Guidance, Help Center content quality, and resolution rate. | -### Workflow skills (7) +### Workflow skills (8) Cross-repo developer workflows that apply across every LFX repo. @@ -63,6 +64,7 @@ Cross-repo developer workflows that apply across every LFX repo. | `/lfx-skills:lfx-test-journey` | Combine feature branches across repos into git worktrees for end-to-end journey testing. 
| | `/lfx-skills:lfx-snowflake-access` | Request Snowflake access or service accounts via the `lfx-snowflake-terraform` repo. | | `/lfx-skills:lfx-cdp-snowflake-connectors` | Scaffold a CDP snowflake-connector data source in `crowd.dev`; retained centrally from `main`. | +| `/lfx-skills:lfx-data-engineer` | Generate PR-ready dbt models, SQL transformations, and tests for `lf-dbt`, including medallion architecture, sqlfluff conventions, macros, and validation workflow guidance. | ### Platform skill (1) @@ -111,6 +113,8 @@ Each agent locates its owning repo at runtime and uses repo-qualified paths for │ ├── lfx-test-journey/ │ ├── lfx-snowflake-access/ │ ├── lfx-cdp-snowflake-connectors/ +│ ├── lfx-data-engineer/ # dbt model + SQL transformation skill +│ │ └── references/ # dbt setup, style, macros, testing, debugging │ └── lfx-v2-ticket-writer/ ├── agents/ │ ├── lfx-committee-service-code-reviewer.md diff --git a/skills/lfx-data-engineer/SKILL.md b/skills/lfx-data-engineer/SKILL.md new file mode 100644 index 0000000..0a55f93 --- /dev/null +++ b/skills/lfx-data-engineer/SKILL.md @@ -0,0 +1,497 @@ +--- +name: lfx-data-engineer +description: > + Guide non-dbt developers through building PR-ready data models, tests, and + transformations in the lf-dbt repo. Encodes the medallion architecture + (bronze/silver/gold/platinum), Snowflake SQL conventions, sqlfluff formatting, + dbt testing patterns, key macros, and data governance rules. Use this skill + any time someone asks about writing dbt models, adding data tests, creating + SQL transformations, fixing pipeline failures, or contributing to the lf-dbt + repository. +allowed-tools: Bash, Read, Write, Edit, Glob, Grep, AskUserQuestion +--- + + + + + +# LFX Data Engineering + +You are generating dbt models and SQL transformations that must be PR-ready. This skill encodes all conventions for the `lf-dbt` repository, which implements a medallion architecture data warehouse on Snowflake. 
+ +**Prerequisites:** Snowflake access must be provisioned first (via `/lfx-snowflake-access`). + +## Input Validation + +Before generating any code, verify your args include: + +| Required | If Missing | +|----------|------------| +| Specific task (what to build/modify) | Stop and ask — do not guess | +| Which medallion layer (bronze/silver/gold/platinum) | Infer from task, but confirm | +| Data source name (for bronze) or upstream model (for silver+) | Stop and ask — never assume | +| Target file path(s) | Infer from naming conventions, but verify they exist | +| Example pattern to follow | Find one yourself (see Read Before Generating) | + +**If invoked with a FIX: prefix**, this is an error correction. Read the error, find the file, apply the targeted fix, and re-validate. + +## Read Before Generating — MANDATORY + +Before writing ANY code, you MUST: + +1. **Read the target file** (if modifying) — understand what's already there +2. **Read one example file** in the same layer and domain — match the exact patterns +3. **Read the relevant YML test file** — ensure your model will be tested consistently + +Do NOT generate code from memory alone. The codebase may have evolved since your training data. + +```bash +# Example: before creating a new bronze model, read an existing one in the same source +cat models/bronze/fivetran_platform/bronze_fivetran_platform_events.sql +# And read the test file +cat models/bronze/fivetran_platform/bronze_fivetran_platform_tests.yml +``` + +## License Header + +Every new `.sql` file MUST start with this header: + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT +``` + +Every new `.yml` file MUST start with: + +```yaml +# Copyright The Linux Foundation and each contributor to LFX. 
+# SPDX-License-Identifier: MIT +``` + +## Completion Report + +When you finish, output a clear summary: + +``` +═══════════════════════════════════════════ +/lfx-data-engineer COMPLETE +═══════════════════════════════════════════ +Files created: + - models/bronze/fivetran_platform/bronze_fivetran_platform_new_table.sql + +Files modified: + - models/bronze/fivetran_platform/bronze_fivetran_platform_tests.yml — added new_table tests + +Validation: + - Ran: sqlfluff lint models/bronze/fivetran_platform/bronze_fivetran_platform_new_table.sql + - Result: ✓ passed / ✗ failed with: + - Ran: dbt compile --select bronze_fivetran_platform_new_table + - Result: ✓ passed / ✗ failed with: + +Notes: + - Source table 'new_table' must exist in the fivetran_platform source definition + +Errors: + - (none) +═══════════════════════════════════════════ +``` + +**Always include the Validation section.** Run `sqlfluff lint` and `dbt compile` after creating or modifying files. Report the result. + +--- + +## Medallion Architecture Quick Reference + +| Layer | Materialization | Schema | Purpose | +|-------|----------------|--------|---------| +| **Bronze** | `view` (default) | `bronze_*` (per source) | 1:1 with source data — column renames, type casting, filter deletes/test data | +| **Silver** | `table` | `silver_dim`, `silver_fact` | Business logic, joins, reusable business objects | +| **Gold** | `table` | `gold_*` (per domain) | Aggregated metrics for specific business use cases | +| **Platinum** | `table` | `platinum*` (per product) | Pre-computed reports with time windows for dashboards | + +### References + +| Task | Reference | +|------|-----------| +| Environment setup, dbt commands, clone workflow | [references/getting-started.md](references/getting-started.md) | +| Detailed layer guide with SQL examples and decision tree | [references/medallion-architecture.md](references/medallion-architecture.md) | +| SQL formatting, keyword casing, indentation, CTEs, JOINs | 
[references/sql-style-guide.md](references/sql-style-guide.md) | +| dbt test conventions, PII tagging, primary key tests | [references/testing-patterns.md](references/testing-patterns.md) | +| Project macros: smart_source, format_timestamp, date ranges, deltas | [references/key-macros.md](references/key-macros.md) | +| Troubleshooting build failures, sqlfluff, incremental issues | [references/debugging-pipelines.md](references/debugging-pipelines.md) | + +--- + +## Creating a Model by Layer + +### Bronze — Source Extraction + +Bronze models are 1:1 with source tables. They rename columns, cast types, and filter out deleted/test records. No business logic. + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT + +SELECT + id AS event_id, + event_title AS event_name, + event_start_date, + event_end_date, + created_date AS event_created_ts, + lastmodified_date AS event_updated_ts + +FROM {{ source('fivetran_platform', 'event') }} +WHERE + NOT _fivetran_deleted + AND NOT is_test +``` + +**Bronze rules:** +- Use `source()` to reference raw tables (or `smart_source()` for dev lookback) +- Rename columns to snake_case with business-friendly names +- Timestamps: suffix `_ts`; Dates: suffix `_date`; Booleans: prefix `is_` or `has_` +- Filter `_fivetran_deleted` and test data rows +- No JOINs — one source table per model +- Use `get_warehouse('hourly')` in config if the source is large + +### Silver — Business Logic + +Silver models join bronze models, apply business rules, and create reusable objects. Split into `dim/` (dimensions) and `fact/` (facts). + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT + +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +/* +Purpose: + Create a reusable project dimension with core Salesforce project attributes + and the latest project health score for downstream analytics. 
+ +Questions answered: + - What are the canonical identifiers and names for each project? + - What is the current health score associated with each project? + +Data sources: + - bronze_fivetran_salesforce_projects + - silver_fact_crowd_dev_project_health_metrics +*/ + +WITH source_data AS ( + SELECT + project_id, + project_name, + project_slug, + project_status + FROM {{ ref('bronze_fivetran_salesforce_projects') }} +), + +enriched AS ( + SELECT + s.project_id, + s.project_name, + s.project_slug, + s.project_status, + h.health_score + FROM source_data s + LEFT JOIN {{ ref('silver_fact_crowd_dev_project_health_metrics') }} h + ON s.project_slug = h.project_slug +) + +SELECT + project_id, + project_name, + project_slug, + project_status, + health_score +FROM enriched +``` + +**Silver rules:** +- Use `ref()` to reference bronze or other silver models +- CTEs for each logical step (one unit of work per CTE) +- Verbose CTE names that describe what they do +- Include a block comment at the top explaining purpose, questions answered, and data sources +- `dim/` for slowly-changing attributes; `fact/` for events and transactions + +### Gold — Aggregated Metrics + +Gold models combine silver models into purpose-built datasets for specific use cases. + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. 
+-- SPDX-License-Identifier: MIT + +{{ config(unique_key=["_key", "project_id"]) }} + +SELECT + ({{ dbt_utils.generate_surrogate_key(["c._key", "p.mapped_project_id"]) }}) AS activity_project_id, + c._key, + c.activity_id, + c.activity_ts, + p.mapped_project_id AS project_id, + p.mapped_project_slug AS project_slug + +FROM {{ ref("silver_fact_crowd_dev_activities") }} c +LEFT JOIN {{ ref("_silver_dim_project_spine") }} p + ON c.project_id = p.base_project_id +WHERE + p.mapped_project_id IS NOT NULL + AND {{ filter_code_contributions_non_bot('c') }} +``` + +**Gold rules:** +- Use `dbt_utils.generate_surrogate_key()` for composite primary keys +- Always specify `unique_key` in config for incremental models +- Reference silver models via `ref()`, apply domain-specific macros +- Final SELECT should explicitly list all columns — no `SELECT *` + +### Platinum — Pre-Computed Reports + +Platinum models produce dashboard-ready data with time-windowed aggregations. + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT + +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +WITH base AS ( + SELECT + user_id, + event_id, + event_name, + event_start_date + FROM {{ ref('silver_fact_event_registrations') }} + WHERE event_name IS NOT NULL +) + +SELECT + ({{ dbt_utils.generate_surrogate_key(['user_id', 'event_id']) }}) AS _key, + user_id, + event_id, + event_name, + event_start_date +FROM base +QUALIFY ROW_NUMBER() OVER ( + PARTITION BY user_id, event_id + ORDER BY event_start_date +) = 1 +``` + +**Platinum rules:** +- Use date range macros (`is_last_30_days`, `is_year_to_date`, etc.) 
for time windows +- Use `get_warehouse()` for resource-intensive models +- `GROUP BY ALL` is acceptable for complex aggregations +- `QUALIFY` with `ROW_NUMBER()` for deduplication +- Purpose-built for specific dashboards (PCC, Individual Dashboard, Org Dashboard) + +--- + +## Writing Tests (YML) + +Every model needs a corresponding entry in a `*_tests.yml` file. Use `data_tests:` (not the deprecated `tests:`). Parameterized tests require the `arguments:` wrapper. + +```yaml +# Copyright The Linux Foundation and each contributor to LFX. +# SPDX-License-Identifier: MIT + +version: 2 +models: + - name: my_new_model + description: "What this model contains and its purpose." + columns: + - name: _key + description: "The unique primary key for the table." + data_tests: + - unique + - not_null + - dbt_utils.not_empty_string + + - name: status + description: "The current status." + data_type: string + data_tests: + - not_null + - accepted_values: + arguments: + values: ["active", "inactive", "pending"] + + - name: project_id + description: "Foreign key to the projects dimension." + data_type: string + data_tests: + - not_null + - relationships: + arguments: + to: ref('silver_dim_projects') + field: project_id + + - name: email + description: "User email address" + data_type: string + config: + meta: + contains_pii: true + data_retention: "undefined" +``` + +See [references/testing-patterns.md](references/testing-patterns.md) for full conventions. + +--- + +## SQL Style Rules (Summary) + +| Rule | Example | +|------|---------| +| Uppercase SQL keywords | `SELECT`, `FROM`, `WHERE`, `LEFT JOIN` | +| Lowercase identifiers | `event_id`, `project_name` | +| 4-space indentation | Indent columns under `SELECT`, conditions under `WHERE` | +| Trailing commas | `event_id,` (not `, event_id`) | +| CTEs over subqueries | Use `WITH ... 
AS (...)` instead of nested `SELECT` | +| Default to `INNER JOIN` | Use `LEFT JOIN` only when right side may have no matches | +| No `RIGHT JOIN` | Rewrite as `LEFT JOIN` | +| No `SELECT DISTINCT` | Requires architect approval | +| `GROUP BY` by number | `GROUP BY 1, 2` preferred over column names | +| Explicit column lists | No `SELECT *` in final SELECT | +| Pre-filter in CTEs | Complex filtering on joined tables belongs in a CTE before the join | + +See [references/sql-style-guide.md](references/sql-style-guide.md) for full formatting rules. + +--- + +## Key Macros + +| Macro | Purpose | When to Use | +|-------|---------|-------------| +| `smart_source()` | Dev-friendly source wrapper with lookback | Bronze models reading from source tables | +| `format_timestamp()` | Generate UTC `_ts` and local `_ts_local` columns | Bronze models normalizing timestamps | +| `to_utc_timestamp()` | Convert local timestamp to UTC with dynamic timezone | When timezone is a column, not a constant | +| `get_warehouse()` | Select warehouse by size (`default`, `hourly`, `medium`) | Large models needing specific compute | +| `generate_alias_name` | Strips schema prefix from table name (e.g., `silver_dim_` → table name) | Automatic — configured in macros | +| `is_last_7_days()`, `is_last_30_days()`, etc. | Date range filters for time windows | Platinum models with pre-computed periods | +| `is_prev_7_days()`, `is_prev_30_days()`, etc. 
| Previous period for period-over-period comparison | Delta/change calculations | +| `add_delta_columns()` | Generate `_prev`, `_diff`, `_delta` columns | Period-over-period metric comparisons | +| `get_month()`, `get_quarter()` | Human-readable date labels | Display-friendly date columns | +| `gdpr_filter_email()` | Exclude GDPR-suppressed emails | Any model exposing email addresses | +| `filter_code_contributions_non_bot()` | Exclude bot code contributions | Code contribution models | +| `format_country()` | Normalize country names to canonical values | Models with user-entered country data | +| `comprehensive_email_filter()` | Validate email format + exclude test emails | Email-based models | + +See [references/key-macros.md](references/key-macros.md) for full documentation and usage examples. + +--- + +## Data Governance + +### PII Tagging + +Columns containing personally identifiable information (names, emails, addresses, etc.) must be tagged in the YML file. Use `config.meta` — not top-level `meta`. 
+ +```yaml +columns: + - name: email + description: "User email address" + config: + meta: + contains_pii: true + data_retention: "undefined" +``` + +### Timestamp Normalization + +All timestamps must be normalized to UTC in the bronze layer: +- Timestamps: `_ts` suffix, stored as `TIMESTAMP_NTZ` in UTC +- Dates: `_date` suffix, stored as `DATE` +- Use `format_timestamp()` macro for consistent conversion +- Use `convert_timezone()` for explicit timezone conversion + +### Primary Key Convention + +- Use `_key` suffix for primary key columns +- Always add unique, not_null, and not_empty_string tests + +--- + +## Common Anti-Patterns — DO NOT DO THESE + +| Anti-Pattern | Correct Pattern | +|-------------|-----------------| +| Missing license header | Always add `-- Copyright The Linux Foundation...` | +| `tests:` in YML | Use `data_tests:` (dbt v1.10.5+) | +| `meta:` at top level in YML | Nest under `config:` → `meta:` | +| Missing `arguments:` on parameterized tests | `accepted_values:` → `arguments:` → `values:` | +| `tags:` at top level in YML | Nest under `config:` → `tags:` | +| Duplicate `config:` keys in YML | Combine into a single `config:` block | +| Custom keys directly in `config:` | Nest under `config:` → `meta:` | +| `SELECT DISTINCT` | Use `GROUP BY` or `QUALIFY ROW_NUMBER()` | +| `RIGHT JOIN` | Rewrite as `LEFT JOIN` | +| Filtering right side of LEFT JOIN in `WHERE` | Filter in the `ON` clause or in a CTE | +| `SELECT *` in final select | Explicitly list all columns | +| Subqueries in `FROM` or `JOIN` | Use CTEs | +| Raw `source()` in dev (large tables) | Use `smart_source()` with lookback | +| Hardcoded warehouse name | Use `get_warehouse()` macro | +| `console.log` / `print` debugging | Use `dbt compile` and `dbt show` | +| Committing without `--signoff` or `-S` | Always use signed commits with DCO | + +--- + +## Pre-PR Checklist + +### All Models +- [ ] License header on all new `.sql` and `.yml` files +- [ ] Model documented in corresponding 
`*_tests.yml` file +- [ ] Primary key column(s) have `unique`, `not_null`, `dbt_utils.not_empty_string` tests +- [ ] PII columns tagged with `config.meta.contains_pii: true` and `data_retention: "undefined"` +- [ ] `sqlfluff lint` passes on all new/modified `.sql` files +- [ ] `dbt compile --select +model_name` succeeds +- [ ] Column naming follows conventions (`_ts`, `_date`, `is_`, `has_`, `_key`) +- [ ] No `SELECT *` in final select statements +- [ ] All timestamps normalized to UTC + +### Bronze Models +- [ ] 1:1 with source table — no joins +- [ ] Filters `_fivetran_deleted` and test data +- [ ] Column renames to snake_case with business-friendly names +- [ ] Uses `source()` or `smart_source()` + +### Silver Models +- [ ] Uses `ref()` to reference upstream models +- [ ] CTEs for each logical unit of work +- [ ] Block comment explaining purpose and data sources +- [ ] Placed in correct subfolder (`dim/` or `fact/`) + +### Gold Models +- [ ] Surrogate key generated for composite keys +- [ ] `unique_key` specified in config for incremental models +- [ ] Final SELECT explicitly lists all columns + +### Platinum Models +- [ ] Uses date range macros for time windows +- [ ] `get_warehouse()` configured if resource-intensive +- [ ] Purpose-built for a specific dashboard or use case + +--- + +## Scope Boundaries + +**This skill DOES:** +- Generate/modify dbt SQL models following medallion architecture +- Create/update YML test files with proper data_tests format +- Add source definitions for new data sources +- Apply project macros (smart_source, format_timestamp, date ranges, etc.) 
+- Run sqlfluff lint/fix validation after changes +- Run dbt compile to verify model correctness + +**This skill does NOT:** +- Run dbt build/test against the warehouse (use the `running-dbt-commands` skill) +- Modify existing macros without architect review +- Make architectural decisions about layer placement (ask the user) +- Generate semantic layer definitions (use the `building-dbt-semantic-layer` skill) +- Troubleshoot dbt Cloud job failures (use the `troubleshooting-dbt-job-errors` skill) +- Modify protected infrastructure files (`dbt_project.yml`, `profiles.yml`, `packages.yml`) — flag for code owner diff --git a/skills/lfx-data-engineer/references/debugging-pipelines.md b/skills/lfx-data-engineer/references/debugging-pipelines.md new file mode 100644 index 0000000..9099262 --- /dev/null +++ b/skills/lfx-data-engineer/references/debugging-pipelines.md @@ -0,0 +1,328 @@ + + + +# Debugging Pipelines + +Common failure patterns in the lf-dbt project and how to resolve them. + +--- + +## dbt compile Failures + +`dbt compile` validates SQL and Jinja syntax without executing against +Snowflake. Always run it before `dbt build`. + +```bash +dbt compile --select +model_name +``` + +### Missing `ref()` or `source()` Target + +**Error:** `Compilation Error: ... depends on a node named '...' which was not found` + +**Cause:** The model references a table or source that doesn't exist. + +**Fix:** +1. Check spelling — model names must match exactly (case-sensitive in YAML) +2. Verify the upstream model exists: `find models -name '*model_name*'` +3. For sources, check the source definition YAML file exists +4. Run `dbt deps` if the reference is to a package model + +### Jinja Syntax Error + +**Error:** `Compilation Error: ... unexpected '}'` or `expected token 'end of statement block'` + +**Cause:** Malformed Jinja template syntax. + +**Fix:** +1. Check for unmatched `{% %}` or `{{ }}` blocks +2. Verify macro calls have the right number of arguments +3. 
Look for missing commas in macro arguments +4. Check that `config()` blocks have proper Python dict syntax + +### Undefined Macro + +**Error:** `Compilation Error: 'macro_name' is undefined` + +**Cause:** The macro doesn't exist or packages aren't installed. + +**Fix:** +1. Run `dbt deps` to install packages +2. Check macro spelling — search `macros/` for the correct name +3. Verify the macro is defined in `macros/` or in a package + +--- + +## dbt build Failures + +### SQL Compilation Error in Snowflake + +**Error:** `Database Error: ... SQL compilation error` + +**Cause:** The generated SQL is invalid Snowflake syntax. + +**Fix:** +1. Inspect the compiled SQL: `dbt compile --select model_name` +2. Open the compiled file: `target/compiled/core_warehouse/models/.../model_name.sql` +3. Copy the compiled SQL into a Snowflake worksheet and run it directly +4. The Snowflake error message will point to the exact line + +### Missing Source Table + +**Error:** `Database Error: ... Object 'DATABASE.SCHEMA.TABLE' does not exist` + +**Cause:** The source table hasn't been cloned to your dev schema. + +**Fix:** +```bash +# Clone production tables to dev +dbt run-operation clone_production_tables + +# Then rebuild excluding cloned data +dbt build --select +model_name --exclude tag:cloned_data +``` + +### Permission Error + +**Error:** `Database Error: ... Insufficient privileges to operate on ...` + +**Cause:** Your Snowflake role doesn't have access to the source data. + +**Fix:** +1. Verify your role in `.env` matches your provisioned access +2. Check if the source table requires a specific role +3. 
Contact CloudOps if you need additional permissions + +--- + +## sqlfluff Lint Errors + +### Running the Linter + +```bash +# Lint a file and see errors +sqlfluff lint path/to/file.sql + +# Auto-fix what it can +sqlfluff fix path/to/file.sql + +# Lint all staged files +make lint-staged-files +``` + +### Common Lint Errors + +| Error Code | Description | Fix | +|------------|-------------|-----| +| `CP01` | Keyword not uppercase | Change `select` to `SELECT` | +| `CP02` | Identifier not lowercase | Change `COLUMN_NAME` to `column_name` | +| `CP03` | Function not uppercase | Change `count()` to `COUNT()` | +| `CP04` | Literal not uppercase | Change `null` to `NULL`, `true` to `TRUE` | +| `CP05` | Type cast not lowercase | Change `::INT` to `::int` | +| `CV09` | Blocked data type | Use `INT` not `INTEGER`, `DECIMAL` not `NUMBER` | +| `CV11` | Non-shorthand cast | Use `::int` not `CAST(x AS INT)` | +| `ST05` | Subquery in FROM/JOIN | Extract to a CTE | +| `RF03` | Qualified single-table ref | Remove table alias prefix when only one table | +| `AL01` | Implicit table alias style | `FROM users u` is fine (project allows implicit) | + +### Ignoring Specific Rules + +If a specific lint rule must be violated with good reason: + +```sql +-- Example: using a blocked type because the source requires it +column_name::NUMBER -- noqa: CV09 - source returns NUMBER type +``` + +### Jinja Template Errors in sqlfluff + +**Error:** `WARNING: Could not parse ... Traceback ...` + +**Cause:** sqlfluff can't parse a Jinja expression. + +**Fix:** +1. Ensure `dbt deps` has been run (macros from packages are needed) +2. Check that `load_macros_from_path = macros` is in `.sqlfluff` +3. 
Complex Jinja may need `-- noqa` to skip that line + +--- + +## Incremental Model Issues + +### Full Refresh + +If an incremental model has bad data or schema changes: + +```bash +# Rebuild from scratch (drops and recreates) +dbt build --select model_name --full-refresh +``` + +### Unique Key Conflicts + +**Error:** `Database Error: ... Duplicate row detected during DML action` + +**Cause:** The `unique_key` in the model config doesn't produce unique rows. + +**Fix:** +1. Check the `unique_key` in the model's `config()` block +2. Run `dbt show` to inspect for duplicates: + +```bash +dbt show --select model_name --limit 20 +``` + +3. Add `QUALIFY ROW_NUMBER()` to deduplicate before the final SELECT +4. If the issue is in source data, add deduplication in a CTE + +### Schema Changes + +If you add or remove columns from an incremental model: + +```bash +# Full refresh to apply schema changes +dbt build --select model_name --full-refresh +``` + +Without `--full-refresh`, new columns won't appear because the existing table +structure is preserved for incremental loads. + +--- + +## Quick Data Validation + +### `dbt show` — Preview Results + +Preview the output of a model without materializing it: + +```bash +# Show first 5 rows (default) +dbt show --select model_name + +# Show more rows +dbt show --select model_name --limit 20 + +# Show with inline SQL +dbt show --inline "SELECT COUNT(*) FROM {{ ref('model_name') }}" +``` + +### `dbt compile` — Inspect Generated SQL + +See exactly what SQL dbt will execute: + +```bash +dbt compile --select model_name +``` + +The compiled SQL is written to: +`target/compiled/core_warehouse/models/.../model_name.sql` + +Open this file to see the fully-rendered SQL with all Jinja resolved. + +--- + +## Missing Data in Dev + +### Symptom: Model Runs But Returns No Rows + +**Cause:** The `smart_source()` macro limits data to the last 30 days in dev. +If the source table has no recent data, the query returns nothing. + +**Fix:** +1. 
Increase the lookback window: `smart_source('source', 'table', 'date_col', 90)` +2. Or clone production data: + +```bash +dbt run-operation clone_production_tables +dbt build --select +model_name --exclude tag:cloned_data +``` + +### Symptom: Source Table Not Found + +**Cause:** Large tables (Kafka, Salesforce) aren't rebuilt in dev by default. + +**Fix:** Clone production data (see above). The cloned tables/views appear in +your dev schema automatically. + +--- + +## Test Failures + +### Running Tests + +```bash +# Run all tests +dbt test + +# Test a specific model +dbt test --select model_name + +# Test a model and all its dependencies +dbt test --select +model_name +``` + +### Debugging a Failed Test + +1. Read the test failure message — it tells you which test and column failed +2. Check the compiled test SQL: `target/compiled/core_warehouse/tests/...` +3. Run the test query directly in Snowflake to see the offending rows +4. Use `dbt show` to inspect the model output: + +```bash +dbt show --inline " +SELECT column_name, COUNT(*) +FROM {{ ref('model_name') }} +GROUP BY 1 +HAVING COUNT(*) > 1 +" +``` + +### Known Edge Cases + +Some models have intentional test threshold overrides for known data quality +issues: + +```yaml +data_tests: + - unique: + config: + error_if: ">10" + warn_if: ">10" +``` + +If your model has a small number of expected duplicates from upstream data, +use this pattern with a comment explaining why. + +--- + +## dbt Cloud Job Failures + +For failures in dbt Cloud (production or staging jobs), use the +`troubleshooting-dbt-job-errors` skill in the lf-dbt repository's +`.agents/skills/` directory. That skill covers: + +- Reading job run logs via the dbt Cloud Admin API +- Diagnosing intermittent failures +- Checking git history for recent changes that may have caused the failure +- Investigating data issues in source systems + +--- + +## Common Debugging Workflow + +```text +1. 
dbt compile --select model_name + └─ Fix Jinja/SQL syntax errors + +2. sqlfluff lint path/to/model.sql + └─ Fix formatting violations + +3. dbt build --select model_name + └─ Fix Snowflake runtime errors + +4. dbt test --select model_name + └─ Fix data quality issues + +5. dbt show --select model_name --limit 20 + └─ Verify output looks correct +``` diff --git a/skills/lfx-data-engineer/references/getting-started.md b/skills/lfx-data-engineer/references/getting-started.md new file mode 100644 index 0000000..de1fb9e --- /dev/null +++ b/skills/lfx-data-engineer/references/getting-started.md @@ -0,0 +1,222 @@ + + + +# Getting Started with lf-dbt + +## Prerequisites + +| Requirement | Details | +|-------------|---------| +| Python | 3.11+ with virtual environment | +| Snowflake access | Provisioned via `lfx-snowflake-terraform` (see `/lfx-snowflake-access` skill) | +| dbt | Installed via `pip install -r requirements.txt` | +| Environment variables | Configured in `.env` file (see `.env.sample`) | + +## Initial Setup + +```bash +# 1. Clone the repository +git clone https://github.com/linuxfoundation/lf-dbt.git +cd lf-dbt + +# 2. Create and activate a virtual environment +python3 -m venv venv +source venv/bin/activate + +# 3. Install dependencies +pip install -r requirements.txt + +# 4. Configure environment variables +cp .env.sample .env +# Edit .env with your Snowflake credentials + +# 5. Install dbt packages +dbt deps + +# 6. 
Verify your connection +dbt compile +``` + +## Snowflake Connection + +The connection is configured in `profiles.yml` with the `dbt-snowflake` profile: + +| Setting | Source | +|---------|--------| +| Account | `SNOWFLAKE_ACCOUNT` env var | +| User | `DBT_ENV_SECRET_USER` env var | +| Password | `DBT_ENV_SECRET_PASS` env var | +| Role | `DBT_ENV_ROLE` env var | +| Database | `DBT_ENV_DATABASE` env var | +| Warehouse | `DBT_ENV_WAREHOUSE` env var | +| Default schema | `DBT_DEFAULT_SCHEMA` env var | + +For keypair authentication (required for CLI/programmatic access), see the +[lf-dbt README — SnowSQL Keypair Authentication Setup](https://github.com/linuxfoundation/lf-dbt/blob/main/README.md#snowsql-keypair-authentication-setup). + +## Essential dbt Commands + +```bash +# Install package dependencies +dbt deps + +# Compile models without running (validates SQL) +dbt compile + +# Build all models and run tests +dbt build + +# Build excluding cloned production data (use after cloning) +dbt build --exclude tag:cloned_data + +# Build excluding large Kafka tables +dbt build --exclude tag:kafka_crowd_dev + +# Build a specific model and all its upstream dependencies +dbt build --select +model_name + +# Build by layer +dbt build --select tag:bronze +dbt build --select tag:silver +dbt build --select tag:gold +dbt build --select tag:platinum + +# Run tests only +dbt test + +# Preview query results without materializing +dbt show --select model_name + +# Inspect the compiled SQL for a model +dbt compile --select model_name +# Then check target/compiled/core_warehouse/models/... + +# Generate and view documentation +dbt docs generate +dbt docs serve +``` + +## Cloning Production Data for Development + +Some bronze tables (Kafka CDP, Salesforce) are too large to rebuild in dev. 
+Clone production data to your dev schema instead: + +```bash +# Clone tables and create views from production (run weekly) +dbt run-operation clone_production_tables + +# With custom retention if Time Travel is needed +dbt run-operation clone_production_tables --args '{retention_days: 7}' + +# Then exclude cloned data from your builds +dbt build --exclude tag:cloned_data +``` + +This creates 179 objects across 19 schemas: + +- 112 Bronze views across 17 schemas +- 21 Bronze cloned tables across 17 schemas +- 39 Silver Dim cloned tables +- 7 Silver Fact cloned tables + +Cloned tables use 0-day retention by default (no Time Travel history) to +optimize storage costs. + +## Makefile Targets + +The project includes shortcuts for building specific data domains: + +| Command | What it builds | +|---------|---------------| +| `make edx` | EdX course and enrollment data | +| `make easycla` | EasyCLA signature data | +| `make bevy` | Bevy chapter and event data | +| `make events` | Platform event registration data | +| `make ti` | Training Institute data | +| `make webinars` | Webinar attendance data | +| `make individual_memberships` | Individual membership data | +| `make docs` | Generate dbt documentation | + +### Linting + +```bash +# Lint a specific file +sqlfluff lint path/to/file.sql + +# Auto-fix formatting issues +sqlfluff fix path/to/file.sql + +# Lint a specific file via Makefile +make lint-fix file=path/to/file.sql + +# Lint all staged files (before commit) +make lint-staged-files + +# Auto-fix all staged files +make fix-lint-staged-files +``` + +## Schema Organization + +Each layer maps to specific Snowflake schemas. In production, the schema name +is used directly. In dev, it is prefixed with your default schema +(e.g., `your_schema_bronze_fivetran_platform`). 
+ +| Layer | Schema Pattern | Example | +|-------|---------------|---------| +| Bronze | `bronze_*` (per source) | `bronze_fivetran_platform`, `bronze_salesforce` | +| Silver Dim | `silver_dim` | `silver_dim` | +| Silver Fact | `silver_fact` | `silver_fact` | +| Gold | `gold_*` (per domain) | `gold_reporting`, `gold_fact` | +| Platinum | `platinum*` (per product) | `platinum`, `platinum_organization_dashboard` | + +## Project Structure + +```text +lf-dbt/ +├── dbt_project.yml # Main project configuration +├── profiles.yml # Snowflake connection config +├── packages.yml # dbt package dependencies +├── .sqlfluff # SQL linting rules +├── Makefile # Build shortcuts +├── macros/ # Reusable SQL fragments +├── models/ +│ ├── bronze/ # Source-aligned raw data +│ │ ├── fivetran_platform/ +│ │ ├── fivetran_salesforce/ +│ │ ├── kafka_crowd_dev/ +│ │ └── ... +│ ├── silver/ # Business logic layer +│ │ ├── dim/ # Dimensions +│ │ │ └── helper_models/ +│ │ └── fact/ # Facts +│ │ └── helper_models/ +│ ├── gold/ # Aggregated metrics +│ │ ├── fact/ +│ │ ├── reporting/ +│ │ └── ... +│ ├── platinum/ # Pre-computed reports +│ │ ├── individual_dashboard/ +│ │ ├── organization_dashboard/ +│ │ ├── lfx_one/ +│ │ └── ... 
+│ └── semantic/ # Semantic layer definitions +├── data/ # Seed data +├── tests/ # Custom data tests +└── snapshots/ # dbt snapshots +``` + +## Git Workflow + +All commits must be signed and include DCO signoff: + +```bash +git commit -S --signoff -m "Add new bronze model for event registrations" +``` + +Branch naming follows the convention: + +- `feature/{JIRA_TICKET}-{short-description}` +- `bug/{JIRA_TICKET}-{short-description}` + +Example: `feature/DL-123-add-event-registrations-model` diff --git a/skills/lfx-data-engineer/references/key-macros.md b/skills/lfx-data-engineer/references/key-macros.md new file mode 100644 index 0000000..dd4ca4e --- /dev/null +++ b/skills/lfx-data-engineer/references/key-macros.md @@ -0,0 +1,434 @@ + + + +# Key Macros Reference + +The lf-dbt project includes reusable macros in the `macros/` directory. This +reference covers the macros developers use most frequently. + +--- + +## Source and Environment Macros + +### `smart_source(source_name, table_name, timestamp_col, lookback_window)` + +**File:** `macros/smart_source.sql` + +A development-friendly wrapper around `source()` that limits data volume in +non-production environments. + +| Environment | Behavior | +|-------------|----------| +| `no_data` (CI) | Wraps source in `WHERE 1=0` — validates schema only, no data | +| Dev (with `timestamp_col`) | Filters to last N days (default 30) for faster builds | +| `prod` / `stage` | Returns raw `source()` reference — full data | + +**Usage:** + +```sql +-- Bronze model with dev lookback on a timestamp column +FROM {{ smart_source('fivetran_platform', 'event', 'created_date', 30) }} + +-- Without timestamp lookback (full table in all environments except CI) +FROM {{ smart_source('fivetran_platform', 'event') }} +``` + +**When to use:** Bronze models reading from large source tables. Use instead of +raw `source()` when the source has a timestamp column suitable for filtering. 
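The per-environment behavior of `smart_source` can be sketched in Python. This is only an illustration of the behavior table above, not the macro's implementation: the function name `smart_source_sql`, the target names passed in, and the exact SQL text it returns are all assumptions.

```python
from typing import Optional


def smart_source_sql(
    target_name: str,
    relation: str,
    timestamp_col: Optional[str] = None,
    lookback_days: int = 30,
) -> str:
    """Return the SQL fragment a smart_source-style wrapper might emit.

    Hypothetical helper: it mirrors the documented behavior table only.
    The generated SQL strings are assumptions, not the real macro output.
    """
    if target_name == "no_data":
        # CI: keep the schema but return zero rows.
        return f"(SELECT * FROM {relation} WHERE 1=0)"
    if target_name in ("prod", "stage"):
        # Production and stage: plain source reference, full data.
        return relation
    if timestamp_col is not None:
        # Dev with a timestamp column: keep only the lookback window.
        return (
            f"(SELECT * FROM {relation} "
            f"WHERE {timestamp_col} >= "
            f"DATEADD('day', -{lookback_days}, CURRENT_DATE()))"
        )
    # Dev without a timestamp column: nothing to filter on, full table.
    return relation
```

Calling `smart_source_sql("dev", "raw.event", "created_date")` returns a filtered subquery, while the same call with `"prod"` returns the bare relation name.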
+ +--- + +### `get_warehouse(warehouse_type)` + +**File:** `macros/get_environment_warehouse.sql` + +Selects the appropriate Snowflake warehouse based on model size and environment. + +| `warehouse_type` | Production Warehouse | Dev/CI Override | +|-------------------|---------------------|-----------------| +| `'default'` | `DBT_PROD` | `DBT_DEV` (dev), `DBT_STG` (CI) | +| `'hourly'` | `DBT_HOURLY` | `DBT_DEV` (dev), `DBT_STG` (CI) | +| `'medium'` | `DBT_PROD_MED` | `DBT_DEV` (dev), `DBT_STG` (CI) | + +**Convenience macros:** + +- `get_environment_warehouse()` — alias for `get_warehouse('default')` +- `get_hourly_warehouse()` — alias for `get_warehouse('hourly')` +- `get_medium_warehouse()` — alias for `get_warehouse('medium')` + +**Usage:** + +```sql +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +SELECT ... +``` + +**When to use:** Any model that reads from large tables or performs heavy +aggregations. Most bronze and platinum models use `get_warehouse('hourly')`. + +--- + +### `generate_alias_name` / `generate_schema_name` + +**File:** `macros/generate_alias_name.sql`, `macros/generate_schema_name.sql` + +These macros control how dbt resolves table names in Snowflake. + +**`generate_alias_name`** strips the schema prefix from the model name. A model +named `silver_dim_users.sql` configured with `+schema: silver_dim` becomes +table `USERS` (not `SILVER_DIM_USERS`) in the `SILVER_DIM` schema. + +**`generate_schema_name`** handles environment-specific schema naming: +- Production: uses the schema name directly (e.g., `SILVER_DIM`) +- Dev: prepends your personal schema (e.g., `your_schema_SILVER_DIM`) + +These macros run automatically — you do not call them in model code. But +understanding them is important for knowing where your tables will land. 
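To make the naming rules concrete, here is a rough Python sketch of how a model file name and its configured schema could resolve to a Snowflake schema and table. The helper `resolve_relation` is hypothetical, and treating only a target named `prod` as production is an assumption; the real macros are Jinja and may handle more cases.

```python
from typing import Tuple


def resolve_relation(
    model_name: str,
    custom_schema: str,
    target_name: str,
    default_schema: str,
) -> Tuple[str, str]:
    """Return (schema, table) following the documented naming rules.

    Hypothetical helper for illustration; not the project's macro code.
    """
    # generate_alias_name: strip the schema prefix from the model name.
    prefix = custom_schema + "_"
    if model_name.startswith(prefix):
        alias = model_name[len(prefix):]
    else:
        alias = model_name

    # generate_schema_name: production uses the schema directly,
    # dev prepends the developer's personal schema.
    if target_name == "prod":  # assumption: production target is named "prod"
        schema = custom_schema.upper()
    else:
        schema = f"{default_schema}_{custom_schema.upper()}"

    return schema, alias.upper()
```

With these rules, `resolve_relation("silver_dim_users", "silver_dim", "dev", "your_schema")` lands the table `USERS` in `your_schema_SILVER_DIM`.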
+ +--- + +## Timestamp and Date Macros + +### `format_timestamp(original_column_name, target_column_name, data_type, local_tz, source_tz)` + +**File:** `macros/format_timestamp.sql` + +Generates standardized timestamp/date columns with proper naming conventions. + +| `data_type` | Output Columns | +|-------------|---------------| +| `'date'` | `{target_column_name}_date` (via `TO_DATE()`) | +| `'timestamp'` | `{target_column_name}_ts` (UTC) + `{target_column_name}_ts_local` (local timezone) | + +**Usage:** + +```sql +SELECT + {{ format_timestamp('created_at', 'created', 'timestamp', 'America/New_York') }}, + {{ format_timestamp('birth_date', 'birth', 'date', 'UTC') }} +FROM {{ source('my_source', 'my_table') }} +``` + +**Produces:** + +```sql +convert_timezone('UTC', 'UTC', created_at) AS created_ts, +convert_timezone('UTC', 'America/New_York', created_at) AS created_ts_local, +to_date(birth_date) AS birth_date +``` + +**When to use:** Bronze models normalizing timestamps from source systems. + +--- + +### `to_utc_timestamp(local_ts, local_tz)` + +**File:** `macros/format_timestamp.sql` + +Converts a local timestamp to UTC when the timezone is stored in a column +rather than being a constant. + +**Usage:** + +```sql +SELECT + {{ to_utc_timestamp('event_start_time', 'event_timezone') }} AS event_start_ts +FROM {{ ref('bronze_events') }} +``` + +**When to use:** When the timezone varies per row (e.g., events in different +timezones with the timezone stored as a column value). + +--- + +## Date Range Filter Macros + +**File:** `macros/date_range_helpers.sql` + +These macros generate `WHERE` clause conditions for time-windowed filtering. +They are the backbone of platinum models that pre-compute metrics over specific +time periods. 
+ +### Current-Period Macros (Exclude Today by Default) + +| Macro | Window | +|-------|--------| +| `is_last_x_days(date, days)` | Generic N-day lookback | +| `is_last_7_days(date)` | Last 7 days (days -7 to -1) | +| `is_last_14_days(date)` | Last 14 days | +| `is_last_30_days(date)` | Last 30 days | +| `is_last_90_days(date)` | Last 90 days | +| `is_last_6_months(date)` | Last 6 calendar months | +| `is_last_12_months(date)` | Last 12 calendar months | +| `is_last_24_months(date)` | Last 24 calendar months | +| `is_last_48_months(date)` | Last 48 calendar months | +| `is_last_quarter(date)` | Most recently completed quarter | +| `is_year_to_date(date)` | Jan 1 of current year through yesterday | +| `is_current_year(date)` | Full current calendar year | +| `is_specific_year(date, year)` | A specific calendar year | +| `is_alltime(date)` | All dates up to today | +| `is_before_today(date)` | Strictly before today | +| `is_before_or_today(date)` | Up to and including today | + +**Usage:** + +```sql +-- Filter to last 30 days +WHERE {{ is_last_30_days('activity_date') }} + +-- Filter to year-to-date +WHERE {{ is_year_to_date('event_start_date') }} + +-- Generic lookback +WHERE {{ is_last_x_days('created_ts', 60) }} +``` + +### Completed Year Macros + +| Macro | Window | +|-------|--------| +| `is_last_completed_year(date)` | Previous full calendar year | +| `is_prev_completed_year(date)` | 2 years ago (full year) | +| `is_3rd_last_completed_year(date)` | 3 years ago | +| `is_4th_last_completed_year(date)` | 4 years ago | +| `is_5th_last_completed_year(date)` | 5 years ago | + +### Quarter Macros + +| Macro | Window | +|-------|--------| +| `is_last_x_quarters(date, quarters)` | Last N completed quarters | +| `is_x_quarters_ago(date, quarters)` | A single completed quarter N quarters ago | +| `is_current_quarter(date)` | Current calendar quarter (from `date_range_helpers_surveys.sql`) | + +### Cumulative / "Up To" Macros + +| Macro | Window | +|-------|--------| +| 
`is_up_to_year_to_date(date)` | Everything before today | +| `is_up_to_last_completed_year(date)` | Everything through end of last year | +| `is_up_to_prev_completed_year(date)` | Everything through end of 2 years ago | + +--- + +### Previous-Period Macros (for Period-over-Period Comparisons) + +These macros define the period immediately before the corresponding +`is_last_*` window, enabling percent-change and delta calculations. + +| Macro | Window | +|-------|--------| +| `is_prev_7_days(date)` | Days -14 to -8 (the week before `is_last_7_days`) | +| `is_prev_14_days(date)` | Days -28 to -15 | +| `is_prev_30_days(date)` | Days -60 to -31 | +| `is_prev_90_days(date)` | Days -180 to -91 | +| `is_prev_6_months(date)` | Months -12 to -7 | +| `is_prev_12_months(date)` | Months -24 to -13 | +| `is_prev_24_months(date)` | Months -48 to -25 | +| `is_prev_quarter(date)` | The quarter before `is_last_quarter` | +| `is_prev_year_to_date(date)` | Same YTD window, shifted back one year (handles leap years) | + +**Usage:** + +```sql +-- Current period +SUM(CASE WHEN {{ is_last_30_days('activity_date') }} THEN 1 ELSE 0 END) AS last_30_days_count, + +-- Previous period for comparison +SUM(CASE WHEN {{ is_prev_30_days('activity_date') }} THEN 1 ELSE 0 END) AS prev_30_days_count +``` + +--- + +### "Through Today" Variants + +These macros shift the window to include today. Used primarily by social +listening models. The day count stays the same but the window slides forward +by one day. + +| Macro | Window | +|-------|--------| +| `is_last_7_days_through_today(date)` | Days -6 to 0 (includes today) | +| `is_last_30_days_through_today(date)` | Days -29 to 0 | +| `is_last_90_days_through_today(date)` | Days -89 to 0 | +| `is_last_12_months_through_today(date)` | 12 months back through today | +| `is_year_to_date_through_today(date)` | Jan 1 through today | + +Matching previous-period macros exist: +`is_prev_7_days_through_today(date)`, `is_prev_30_days_through_today(date)`, etc. 
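The relationship between the current, previous, and "through today" day windows can be checked with a small Python sketch. The function names here are hypothetical, and the offsets are inferred from the window tables in this section (for example, previous 30 days as days -60 to -31 and the through-today variant as days -29 to 0), not read from the macro source.

```python
from datetime import date, timedelta
from typing import Tuple

Window = Tuple[date, date]  # inclusive (start, end)


def last_n_days(today: date, n: int) -> Window:
    """Current period, excluding today: days -n to -1."""
    return today - timedelta(days=n), today - timedelta(days=1)


def prev_n_days(today: date, n: int) -> Window:
    """Period immediately before last_n_days: days -2n to -(n + 1)."""
    return today - timedelta(days=2 * n), today - timedelta(days=n + 1)


def last_n_days_through_today(today: date, n: int) -> Window:
    """Same length, shifted forward one day: days -(n - 1) to 0."""
    return today - timedelta(days=n - 1), today


def window_length(window: Window) -> int:
    """Number of calendar days covered by an inclusive window."""
    start, end = window
    return (end - start).days + 1
```

Under these assumptions the previous window ends the day before the current window starts, so the two periods never overlap and always cover the same number of days, which is what a percent-change comparison needs.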
+ +--- + +### Month-Overlap Macros + +For monthly-grain data where you need to check if a month falls within a window: + +| Macro | Purpose | +|-------|---------| +| `month_overlaps_last_x_days(date, days)` | Does the month containing `date` overlap the last N days? | +| `month_overlaps_last_x_months(date, months)` | Does the month containing `date` overlap the last N months? | + +--- + +### Unified Time Range Filter + +```sql +-- Filters based on a time_range_name column +WHERE {{ time_range_filter('date_column', 'time_range_column') }} +``` + +Supports `'past_365_days'`, `'past_2_years'`, and `'alltime'` values. Used by +ecosystem influence models. + +--- + +## Date/Time Formatting Macros + +**File:** `macros/format_helpers.sql` + +### `get_short_month(date)` + +Returns 3-letter month abbreviation: `'Jan'`, `'Feb'`, ..., `'Dec'` + +### `get_month(date)` + +Returns full month name: `'January'`, `'February'`, ..., `'December'` + +### `get_quarter(date)` + +Returns quarter label: `'Q1'`, `'Q2'`, `'Q3'`, `'Q4'` + +**Usage:** + +```sql +SELECT + {{ get_month('event_start_date') }} AS event_month, + {{ get_quarter('event_start_date') }} AS event_quarter, + {{ get_short_month('event_start_date') }} AS event_month_short +FROM {{ ref('silver_dim_events') }} +``` + +--- + +## Delta / Period-over-Period Comparison Macros + +**File:** `macros/delta_helpers.sql` + +### `add_delta_columns(metrics)` + +Generates `_prev`, `_diff`, and `_delta` (percent change) columns for a list +of metric names. Expects the query to have `curr.*` and `prev.*` aliases. 
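As a rough illustration of what the generated columns mean for a single metric, the arithmetic can be written out in Python. The helper `delta_columns` is hypothetical; the 100% fallback for a zero previous value follows the "Produces" list in this section, and the handling of missing (`NULL`) previous values is not modeled here.

```python
from typing import Dict


def delta_columns(metric: str, curr: float, prev: float) -> Dict[str, float]:
    """Compute the current, _prev, _diff, and _delta values for one metric.

    Hypothetical sketch of the column semantics, not the macro itself.
    """
    diff = curr - prev
    if prev == 0:
        # Documented fallback: report 100% when the previous value was 0.
        delta = 100.0
    else:
        # Percent change relative to the previous period.
        delta = diff / prev * 100
    return {
        metric: curr,
        f"{metric}_prev": prev,
        f"{metric}_diff": diff,
        f"{metric}_delta": delta,
    }
```

For example, a project with 150 commits this period and 100 in the previous one would get `total_commits_diff` of 50 and `total_commits_delta` of 50.0.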
+ +**Usage:** + +```sql +SELECT + curr.project_id + {{ add_delta_columns(['total_commits', 'total_contributors', 'total_prs']) }} +FROM current_period curr +LEFT JOIN previous_period prev + ON curr.project_id = prev.project_id +``` + +**Produces** (for each metric): +- `total_commits` — current value +- `total_commits_prev` — previous period value +- `total_commits_diff` — absolute difference +- `total_commits_delta` — percent change (100% if previous was 0) + +### `add_share_of_total(metrics)` + +Generates `_share` (percent of total) and `_total_delta` columns. + +--- + +## Data Quality and Filtering Macros + +### `gdpr_filter_email(email_field)` + +**File:** `macros/gdpr_filter.sql` + +Excludes rows where the email matches a GDPR suppression or deletion request. + +```sql +WHERE {{ gdpr_filter_email('u.email') }} +``` + +### `gdpr_filter_email_list(email_list_field, delimiter)` + +Filters rows where any email in a delimited list matches a GDPR request. +Supports `;`, `,`, `:`, `|` delimiters. 
+ +```sql +WHERE {{ gdpr_filter_email_list('cc_emails', ';') }} +``` + +--- + +### Email Validation Macros + +**File:** `macros/email_validation.sql` + +| Macro | Purpose | +|-------|---------| +| `is_valid_email(email_field)` | Regex validation of email format | +| `email_filter_clause(email_field)` | Not null + not empty + valid format | +| `exclude_test_emails(email_field)` | Excludes test, example, noreply, retired addresses | +| `comprehensive_email_filter(email_field)` | Combines `email_filter_clause` + `exclude_test_emails` | + +```sql +-- Full email validation +WHERE {{ comprehensive_email_filter('email') }} + +-- Just format check +WHERE {{ is_valid_email('email') }} +``` + +--- + +### Common Filters + +**File:** `macros/common_filters.sql` + +| Macro | Purpose | +|-------|---------| +| `filter_code_contributions_non_bot(table_alias)` | Excludes bot contributions from code activity data | +| `exclude_individual_account(account)` | Filters out individual/no-account Salesforce records | +| `is_organization_domain(domain)` | Checks that an email domain is not a consumer provider (gmail, yahoo, etc.) | + +```sql +-- Filter to human code contributions only +WHERE {{ filter_code_contributions_non_bot('c') }} + +-- Exclude individual Salesforce accounts +WHERE {{ exclude_individual_account('account_id') }} +``` + +--- + +### Formatting and Cleanup Macros + +**File:** `macros/format_helpers.sql` + +| Macro | Purpose | +|-------|---------| +| `format_country(country)` | Normalizes messy country names to canonical values (handles US/USA/U.S.A., UK variants, etc.) | +| `clean_name_field(field)` | Cleans garbage/placeholder values from name fields (null, unknown, test, N/A, etc.) 
| +| `format_repository_url(repository_url)` | Lowercases and strips `.git` suffix | +| `email_to_domain(email)` | Extracts domain from an email address | +| `extract_repo_name(url_column)` | Extracts repository name from a git URL | +| `format_commit_url(repository_url, commit_id)` | Generates a clickable commit URL for GitHub, GitLab, Bitbucket, or kernel.org | +| `parse_github_username(field)` | Extracts a GitHub username from a URL or raw value | +| `parse_linkedin_username(field)` | Extracts a LinkedIn username from a URL or raw value | +| `is_apac_country(billing_country_column)` | Checks if a country is in the APAC region (China, HK, Taiwan, Macao) | + +```sql +SELECT + {{ format_country('raw_country') }} AS country, + {{ clean_name_field('first_name') }} AS first_name, + {{ email_to_domain('email') }} AS email_domain +FROM {{ ref('bronze_source') }} +``` diff --git a/skills/lfx-data-engineer/references/medallion-architecture.md b/skills/lfx-data-engineer/references/medallion-architecture.md new file mode 100644 index 0000000..f4d0642 --- /dev/null +++ b/skills/lfx-data-engineer/references/medallion-architecture.md @@ -0,0 +1,433 @@ + + + +# Medallion Architecture Guide + +The lf-dbt project follows a four-layer medallion architecture. Each layer has +a specific purpose, materialization strategy, and set of conventions. 
+ +## Layer Overview + +```text +┌──────────────────────────────────────────────────────────────────┐ +│ Platinum │ Pre-computed reports with time windows │ +│ │ Dashboard-ready data (PCC, ID, OD, Insights) │ +├─────────────┼────────────────────────────────────────────────────┤ +│ Gold │ Aggregated metrics for specific business cases │ +│ │ Code contributions by org, enrollment counts │ +├─────────────┼────────────────────────────────────────────────────┤ +│ Silver │ Business logic, joins, transformations │ +│ │ Reusable objects: Users, Projects, Activities │ +├─────────────┼────────────────────────────────────────────────────┤ +│ Bronze │ 1:1 with source data │ +│ │ Column renames, type casting, delete filtering │ +└─────────────┴────────────────────────────────────────────────────┘ + ▲ ▲ ▲ ▲ + Raw Sources source() ref() ref() +``` + +--- + +## Bronze Layer + +### Purpose + +Bronze models provide a clean, renamed view of raw source data. They are the +only layer that reads from `source()` — all other layers use `ref()`. + +### Conventions + +- **Materialization:** `view` (default) +- **Schema:** `bronze_*` per source system (e.g., `bronze_fivetran_platform`) +- **One model per source table** — no joins +- **No business logic** — only column renames, type casting, and filtering + +### What Belongs Here + +- Column renames from source naming to snake_case business names +- Type casting (e.g., string to date) +- Filtering deleted records (`_fivetran_deleted`) +- Filtering test data (`is_test`) +- Timestamp normalization to UTC using `format_timestamp()` + +### Example: Bronze Event Model + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. 
+-- SPDX-License-Identifier: MIT + +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +SELECT + id AS event_id, + event_start_date, + event_end_date, + event_title AS event_name, + currency, + project_id, + salesforce_id AS salesforce_event_id, + event_location, + IFF(event_status_name = 'Complete', 'Completed', event_status_name) AS event_status, + city AS event_city, + country AS event_country, + created_date AS event_created_ts, + event_category, + event_code, + account_stub AS event_account_stub, + source, + lastmodified_date AS updated_at + +FROM {{ source('fivetran_platform', 'event') }} +WHERE + NOT _fivetran_deleted + AND NOT is_test +``` + +### Key Patterns + +- `source('schema_name', 'table_name')` or `smart_source()` for dev lookback +- `get_warehouse('hourly')` for large source tables +- Column naming: `_ts` for timestamps, `_date` for dates, `is_`/`has_` for booleans +- Filter `_fivetran_deleted` when the source has Fivetran soft deletes + +### File Naming + +`bronze_{source_system}_{table_name}.sql` + +Examples: +- `bronze_fivetran_platform_events.sql` +- `bronze_fivetran_salesforce_projects.sql` +- `bronze_kafka_crowd_dev_activities.sql` + +--- + +## Silver Layer + +### Purpose + +Silver models apply business logic, join multiple bronze models, and create +reusable business objects. They are divided into two subfolders: + +- **`dim/`** — Dimensions: slowly-changing attributes (users, projects, organizations) +- **`fact/`** — Facts: events and transactions (activities, registrations, contributions) + +### Conventions + +- **Materialization:** `table` +- **Schema:** `silver_dim` or `silver_fact` +- **Table naming:** The `generate_alias_name` macro strips the schema prefix. + A model named `silver_dim_users.sql` becomes table `USERS` in the + `SILVER_DIM` schema (not `SILVER_DIM_USERS`). 
+- **Block comment** at the top explaining purpose, questions answered, and data sources + +### What Belongs Here + +- Joins across multiple bronze models +- Business rules and transformations +- Deduplication logic +- Enrichment from reference data +- Reusable objects that serve multiple downstream use cases + +### Example: Silver Dimension Model + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT + +/* +This model creates a standardized dimension table for projects. + +## Purpose: +- Provides a comprehensive view of projects with all relevant attributes + +## Questions this model can help answer: +1. What is the hierarchical structure of projects? +2. Which projects belong to specific foundations? +3. What is the current health score of a project? + +## Data sources: +- bronze_fivetran_salesforce_projects +- silver_fact_crowd_dev_project_health_metrics +*/ + +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +WITH latest_health_metrics AS ( + SELECT + project_slug, + metric_date AS health_metric_date, + health_score, + health_score_category + FROM {{ ref('silver_fact_crowd_dev_project_health_metrics') }} + QUALIFY ROW_NUMBER() OVER ( + PARTITION BY project_slug + ORDER BY metric_date DESC + ) = 1 +), + +projects AS ( + SELECT + project_id, + project_name, + project_slug, + project_status + FROM {{ ref('bronze_fivetran_salesforce_projects') }} +) + +SELECT + p.project_id, + p.project_name, + p.project_slug, + p.project_status, + h.health_score, + h.health_score_category, + h.health_metric_date +FROM projects p +LEFT JOIN latest_health_metrics h + ON p.project_slug = h.project_slug +``` + +### Helper Models + +Silver includes `helper_models/` subfolders for reusable SQL fragments. These +files start with a `_` prefix (e.g., `_silver_dim_project_spine.sql`) and are +not full models — they serve as building blocks for other models. 
+ +The `_silver_dim_project_spine.sql` helper is particularly important: it fans +out projects to their parent hierarchy for downstream aggregation. + +### File Naming + +- Dimensions: `silver_dim_{entity}.sql` (e.g., `silver_dim_users.sql`) +- Facts: `silver_fact_{domain}_{entity}.sql` (e.g., `silver_fact_event_registrations.sql`) +- Helpers: `_silver_{dim|fact}_{name}.sql` (e.g., `_silver_dim_project_spine.sql`) + +--- + +## Gold Layer + +### Purpose + +Gold models combine silver models into purpose-built datasets for specific +business use cases. They answer specific analytical questions without requiring +additional joins. + +### Conventions + +- **Materialization:** `table` +- **Schema:** `gold_*` per domain (e.g., `gold_fact`, `gold_reporting`) +- **Surrogate keys** via `dbt_utils.generate_surrogate_key()` for composite primary keys +- **`unique_key`** in config for incremental models + +### What Belongs Here + +- Aggregated metrics (code contributions by org, enrollment counts) +- Purpose-built datasets that downstream consumers query directly +- Fan-out logic using the project spine helper + +### Example: Gold Fact Model + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. 
+-- SPDX-License-Identifier: MIT + +{{ config(unique_key=["_key", "project_id"]) }} + +SELECT + ({{ dbt_utils.generate_surrogate_key(["c._key", "p.mapped_project_id"]) }}) AS activity_project_id, + c._key, + c.activity_id, + c.activity_ts, + c.activity_type, + c.activity_category, + c.member_id, + c.github_username, + c.repository_url, + p.mapped_project_id AS project_id, + p.mapped_project_slug AS project_slug, + p.mapped_project_name AS project_name, + c.additions, + c.deletions, + COALESCE(c.is_pr_approved, FALSE) AS is_pr_approved, + c.is_org_contribution, + c.member_is_bot, + ( + ROW_NUMBER() OVER ( + PARTITION BY p.mapped_project_id, c.member_id + ORDER BY c.activity_ts + ) = 1 + ) AS is_members_first_project_contribution + +FROM {{ ref("silver_fact_crowd_dev_activities") }} c +LEFT JOIN {{ ref("_silver_dim_project_spine") }} p + ON c.project_id = p.base_project_id +WHERE + p.mapped_project_id IS NOT NULL + AND {{ filter_code_contributions_non_bot('c') }} +``` + +### File Naming + +`gold_fact_{domain}.sql` or `gold_{purpose}_{entity}.sql` + +Examples: +- `gold_fact_code_contributions.sql` +- `gold_fact_enrollments.sql` +- `gold_fact_course_purchases.sql` + +--- + +## Platinum Layer + +### Purpose + +Platinum models produce dashboard-ready data with pre-computed time windows. +Consumers query platinum tables directly without needing date range filters. 
+ +### Conventions + +- **Materialization:** `table` +- **Schema:** `platinum*` per product (e.g., `platinum_organization_dashboard`) +- **Date range macros** for time-windowed aggregations +- **`get_warehouse()`** for resource-intensive computations +- **`GROUP BY ALL`** is acceptable for complex aggregations +- **`QUALIFY`** with `ROW_NUMBER()` for deduplication + +### What Belongs Here + +- Pre-computed metrics by time period (last 7 days, last 30 days, YTD) +- Period-over-period comparisons (current vs previous period) +- Dashboard-specific data shapes +- Delta calculations using `add_delta_columns()` + +### Example: Platinum Dashboard Model + +```sql +-- Copyright The Linux Foundation and each contributor to LFX. +-- SPDX-License-Identifier: MIT + +{% set warehouse = get_warehouse('hourly') %} + +{{ config(snowflake_warehouse=warehouse) }} + +WITH sponsors AS ( + SELECT + event_id, + contact_id + FROM {{ ref('silver_fact_event_sponsorships') }} + GROUP BY ALL +), + +event_registrations AS ( + SELECT + er.registration_id, + mu.user_id, + mu.user_name, + er.event_id, + er.event_name, + er.event_start_date, + er.event_end_date, + er.project_id, + er.user_attended, + er.registration_status, + CASE + WHEN sp.contact_id IS NOT NULL THEN 'Sponsor' + WHEN er.is_event_speaker THEN 'Speaker' + WHEN er.user_attended = TRUE THEN 'Attendee' + ELSE 'Registered' + END AS user_role + FROM {{ ref('silver_fact_event_registrations') }} er + INNER JOIN {{ ref('bronze_fivetran_salesforce_merged_user') }} mu + ON er.user_id = mu.user_id + LEFT JOIN sponsors sp + ON mu.user_id = sp.contact_id + AND er.event_id = sp.event_id + WHERE + er.event_name IS NOT NULL + AND er.event_start_date IS NOT NULL + GROUP BY ALL +) + +SELECT + ({{ dbt_utils.generate_surrogate_key(['user_id', 'event_id']) }}) AS _key, + registration_id, + user_id, + user_name, + event_id, + event_name, + event_start_date, + event_end_date, + project_id, + user_attended, + user_role, + registration_status +FROM 
event_registrations +QUALIFY ROW_NUMBER() OVER ( + PARTITION BY user_id, event_id + ORDER BY event_start_date +) = 1 +``` + +### Product Folders + +Platinum models are organized by dashboard/product: + +| Folder | Dashboard | +|--------|-----------| +| `individual_dashboard/` | Individual Dashboard (ID) | +| `organization_dashboard/` | Organization Dashboard (OD) | +| `lfx_one/` | LFX One platform | +| `events/` | Events metrics | +| `code_contributions/` | Code contribution analytics | +| `enrollments/` | Training enrollment reports | +| `membership/` | Membership metrics | +| `marketing/` | Marketing analytics | +| `sales_metrics/` | Sales pipeline reports | + +### File Naming + +`platinum_{product}_{entity}.sql` + +Examples: +- `platinum_individual_dashboard_event_registrations.sql` +- `platinum_organization_dashboard_overview.sql` +- `platinum_lfx_one_project_code_commits.sql` + +--- + +## Decision Tree: Which Layer? + +```text +Is this reading directly from a raw source table? + └─ YES → Bronze + └─ NO → Does it create a reusable business object (users, projects, activities)? + └─ YES → Silver (dim/ for attributes, fact/ for events) + └─ NO → Does it aggregate metrics for a specific use case? + └─ YES → Is it pre-computed with time windows for a dashboard? 
+ └─ YES → Platinum + └─ NO → Gold + └─ NO → Silver (it's probably a helper or intermediate model) +``` + +## Schema Mapping Reference + +| Layer + Folder | Snowflake Schema (Production) | +|---------------|-------------------------------| +| `bronze/fivetran_platform/` | `BRONZE_FIVETRAN_PLATFORM` | +| `bronze/fivetran_salesforce/` | `BRONZE_SALESFORCE` | +| `bronze/kafka_crowd_dev/` | `BRONZE_KAFKA_CROWD_DEV` | +| `bronze/stripe/` | `BRONZE_STRIPE` | +| `silver/dim/` | `SILVER_DIM` | +| `silver/fact/` | `SILVER_FACT` | +| `gold/fact/` | `GOLD_FACT` | +| `gold/reporting/` | `GOLD_REPORTING` | +| `platinum/individual_dashboard/` | `PLATINUM_INDIVIDUAL_DASHBOARD` | +| `platinum/organization_dashboard/` | `PLATINUM_ORGANIZATION_DASHBOARD` | +| `platinum/lfx_one/` | `PLATINUM_LFX_ONE` | + +In dev, schemas are prefixed with your personal schema: +`{your_schema}_BRONZE_FIVETRAN_PLATFORM`, etc. diff --git a/skills/lfx-data-engineer/references/sql-style-guide.md b/skills/lfx-data-engineer/references/sql-style-guide.md new file mode 100644 index 0000000..f5bb3ef --- /dev/null +++ b/skills/lfx-data-engineer/references/sql-style-guide.md @@ -0,0 +1,289 @@ + + + +# SQL Style Guide + +This guide consolidates the formatting rules enforced by `.sqlfluff` and the +project's coding standards. All SQL files must pass `sqlfluff lint` before +being committed. 
+ +## Keyword and Identifier Casing + +| Element | Casing | Example | +|---------|--------|---------| +| SQL keywords | UPPERCASE | `SELECT`, `FROM`, `WHERE`, `LEFT JOIN`, `GROUP BY` | +| Column names | lowercase | `event_id`, `project_name`, `created_ts` | +| Table aliases | lowercase | `FROM users u`, `JOIN projects p` | +| Functions | UPPERCASE | `SUM()`, `COUNT()`, `COALESCE()`, `ROW_NUMBER()` | +| Literals | UPPERCASE | `TRUE`, `FALSE`, `NULL` | +| Type casts | lowercase shorthand | `::int`, `::string`, `::date` (not `CAST()`) | + +## Indentation + +- Use **4 spaces** (not tabs) +- Do not right-align aliases +- Use **trailing commas** in SELECT statements + +```sql +-- CORRECT +SELECT + user_id, + user_name, + email, + created_ts + +-- WRONG (leading commas) +SELECT + user_id + , user_name + , email + , created_ts + +-- WRONG (right-aligned aliases) +SELECT + userId as user_id, + convert_timezone('UTC', createdDate) as created_date +``` + +## SELECT Statements + +- Fields should be stated before aggregates and window functions +- Group-by columns are always listed first in the SELECT +- Final SELECT must explicitly list all columns — no `SELECT *` +- `SELECT DISTINCT` is not allowed (requires architect approval) +- Use `GROUP BY` or `QUALIFY ROW_NUMBER()` instead of `DISTINCT` + +```sql +-- CORRECT: explicit columns, group-by fields first +SELECT + project_id, + project_name, + COUNT(*) AS total_events, + SUM(revenue) AS total_revenue +FROM events +GROUP BY 1, 2 + +-- WRONG: SELECT * +SELECT * FROM events +``` + +## GROUP BY and ORDER BY + +- Prefer ordering and grouping **by number**: `GROUP BY 1, 2` +- If grouping by more than a few columns, reconsider the model design +- `GROUP BY ALL` is acceptable in platinum models for complex aggregations + +```sql +-- CORRECT +GROUP BY 1, 2, 3 + +-- ACCEPTABLE in platinum models +GROUP BY ALL +``` + +## JOINs + +- **Default to INNER JOIN** — use LEFT JOIN only when the right side may have + no matches and you still 
want rows from the left +- **RIGHT JOIN is not allowed** — rewrite as LEFT JOIN +- Specify join keys explicitly — **do not use `USING`** (Snowflake has + inconsistencies with `USING` results) +- When joining two or more tables, always **prefix columns with the table alias** +- Pre-filter complex conditions in a CTE before the join +- Do **not** filter on the right side of a LEFT JOIN in the `WHERE` clause + (this negates the LEFT JOIN). Filter in the `ON` clause or in a CTE. + +```sql +-- CORRECT: filter in ON clause +SELECT + l.user_id, + r.event_name +FROM users l +LEFT JOIN events r + ON l.user_id = r.user_id + AND r.event_status = 'Active' + +-- WRONG: filtering right side in WHERE (turns LEFT JOIN into INNER JOIN) +SELECT + l.user_id, + r.event_name +FROM users l +LEFT JOIN events r + ON l.user_id = r.user_id +WHERE + r.event_status = 'Active' + +-- WRONG: using USING +FROM users u +JOIN events e USING (user_id) +``` + +## CTEs (Common Table Expressions) + +- Use CTEs instead of subqueries in `FROM` or `JOIN` clauses (enforced by + sqlfluff rule `ST05`) +- Each CTE should perform a **single, logical unit of work** +- CTE names should be **verbose** enough to convey what they do +- CTEs with confusing or notable logic should have a comment +- CTEs duplicated across models should be pulled into their own models or macros + +```sql +-- CORRECT: CTEs for logical units +WITH active_events AS ( + SELECT + event_id, + event_name, + event_start_date + FROM {{ ref('bronze_fivetran_platform_events') }} + WHERE event_status = 'Active' +), + +event_registrations AS ( + SELECT + event_id, + COUNT(*) AS registration_count + FROM {{ ref('silver_fact_event_registrations') }} + GROUP BY 1 +) + +SELECT + e.event_id, + e.event_name, + e.event_start_date, + COALESCE(r.registration_count, 0) AS registration_count +FROM active_events e +LEFT JOIN event_registrations r + ON e.event_id = r.event_id + +-- WRONG: subquery in FROM +SELECT * +FROM ( + SELECT event_id, event_name + FROM 
events + WHERE event_status = 'Active' +) e +``` + +## Table Aliasing + +- Use the `AS` keyword when aliasing columns +- Table aliases do not require `AS` (implicit aliasing is allowed) +- When selecting from a single table, do **not** prefix columns with the alias + +```sql +-- CORRECT: single table, no prefix +SELECT + user_id, + user_name, + email +FROM users + +-- CORRECT: multiple tables, always prefix +SELECT + u.user_id, + u.user_name, + e.event_name +FROM users u +INNER JOIN events e + ON u.user_id = e.organizer_id +``` + +## CASE Statements + +- `CASE` and `END` on their own lines +- Conditions indented inside the block +- Multiple boolean conditions on separate lines + +```sql +-- CORRECT +CASE + WHEN status = 'Active' + AND is_verified = TRUE + THEN 'Active Verified' + WHEN status = 'Inactive' + THEN 'Inactive' + ELSE 'Unknown' +END AS status_label, +``` + +## WHERE Clauses + +- Single conditions can be inline: `WHERE event_status = 'Active'` +- Multiple conditions on separate lines, indented +- `OR` conditions enclosed in parentheses + +```sql +-- CORRECT: multiple conditions +WHERE + event_status = 'Active' + AND event_start_date >= CURRENT_DATE() + AND ( + event_type = 'Conference' + OR event_type = 'Meetup' + ) +``` + +## Data Types + +The project normalizes data types. These names are **blocked** by sqlfluff: + +| Blocked Type | Use Instead | +|-------------|-------------| +| `NUMBER`, `NUMERIC` | `DECIMAL` | +| `INTEGER`, `BIGINT`, `SMALLINT`, `TINYINT`, `BYTEINT` | `INT` | +| `DOUBLE`, `REAL` | `FLOAT` | +| `CHARACTER` | `CHAR` | +| `DATETIME` | `TIMESTAMP_NTZ` | + +If an exception is required, add `-- noqa: L062` with a comment explaining why. 
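As a rough pre-check before running `sqlfluff lint`, the blocked names in the table above can be searched for with plain `grep`. This is a minimal sketch: `find_blocked_types` is a hypothetical helper, not something that exists in lf-dbt, and sqlfluff remains the authoritative check. It also reports matches inside comments and string literals.

```bash
#!/usr/bin/env bash
# Sketch only: find_blocked_types is a hypothetical helper, not part of lf-dbt.
# It greps one SQL file for the blocked type names listed in the table above.
find_blocked_types() {
    local file="$1"
    # -w matches whole words only, so ROW_NUMBER() is not reported as NUMBER.
    # -i ignores case, -n prints line numbers, -E enables alternation.
    grep -nwiE 'NUMBER|NUMERIC|INTEGER|BIGINT|SMALLINT|TINYINT|BYTEINT|DOUBLE|REAL|CHARACTER|DATETIME' "$file"
}
```

Calling `find_blocked_types models/bronze/fivetran_platform/bronze_fivetran_platform_events.sql` prints any matching lines with their line numbers. Following `grep`'s convention, the exit status is 0 when a blocked name was found and 1 when none was.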
+ +## Type Casting + +Use shorthand casting (enforced by sqlfluff): + +```sql +-- CORRECT +column_name::int +column_name::date +column_name::string + +-- WRONG +CAST(column_name AS INT) +CONVERT(INT, column_name) +``` + +## Newlines and Readability + +**DO NOT OPTIMIZE FOR A SMALLER NUMBER OF LINES OF CODE.** +Newlines are cheap; brain time is expensive. + +- Long lines should be broken up if it improves readability +- Any clause with more than one item should be listed on new lines, indented +- Conform to the existing style in a file, even if it contradicts this guide + +## Running sqlfluff + +```bash +# Lint a specific file +sqlfluff lint models/bronze/fivetran_platform/bronze_fivetran_platform_events.sql + +# Auto-fix formatting issues +sqlfluff fix models/bronze/fivetran_platform/bronze_fivetran_platform_events.sql + +# Lint via Makefile +make lint-fix file=models/bronze/fivetran_platform/bronze_fivetran_platform_events.sql + +# Lint all staged files before commit +make lint-staged-files + +# Auto-fix all staged files +make fix-lint-staged-files +``` + +sqlfluff uses the `.sqlfluff` configuration at the repo root. Key settings: + +- Dialect: Snowflake +- Templater: dbt (understands `ref()`, `source()`, Jinja) +- No max line length +- Macros loaded from `macros/` directory +- Subqueries forbidden in `FROM` and `JOIN` (use CTEs) diff --git a/skills/lfx-data-engineer/references/testing-patterns.md b/skills/lfx-data-engineer/references/testing-patterns.md new file mode 100644 index 0000000..cfd733a --- /dev/null +++ b/skills/lfx-data-engineer/references/testing-patterns.md @@ -0,0 +1,355 @@ + + + +# dbt Testing Patterns + +This guide covers the test conventions for the lf-dbt project, aligned with +dbt v1.10.5+. All models must have corresponding tests in a `*_tests.yml` file +co-located in the same directory as the model. + +## Test File Structure + +Test files use `version: 2` and the `models:` key. 
Each model entry includes +a description and column definitions with data types and tests. + +```yaml +# Copyright The Linux Foundation and each contributor to LFX. +# SPDX-License-Identifier: MIT + +version: 2 +models: + - name: bronze_fivetran_platform_events + description: "Event data from the Fivetran Platform source." + config: + tags: + - "events" + columns: + - name: event_id + description: "Unique identifier for the event." + data_type: string + data_tests: + - unique + - not_null + + - name: event_name + description: "The name of the event." + data_type: string + data_tests: + - not_null + - dbt_utils.not_empty_string + + - name: event_start_date + description: "The start date of the event." + data_type: timestamp_tz + data_tests: + - not_null +``` + +--- + +## Key Rules + +### Use `data_tests:` (not `tests:`) + +The `tests:` key is deprecated in dbt v1.10.5+. Always use `data_tests:`. + +```yaml +# CORRECT +columns: + - name: event_id + data_tests: + - unique + - not_null + +# WRONG (deprecated) +columns: + - name: event_id + tests: + - unique + - not_null +``` + +### Use `arguments:` for Parameterized Tests + +Tests that accept parameters (like `accepted_values`, `relationships`) must +wrap their arguments under the `arguments:` property. + +```yaml +# CORRECT +columns: + - name: status + data_tests: + - accepted_values: + arguments: + values: ["active", "inactive", "pending"] + + - name: project_id + data_tests: + - relationships: + arguments: + to: ref('silver_dim_projects') + field: project_id + +# WRONG (missing arguments: wrapper) +columns: + - name: status + data_tests: + - accepted_values: + values: ["active", "inactive", "pending"] +``` + +Simple tests without arguments (`unique`, `not_null`, `dbt_utils.not_empty_string`) +do NOT need the `arguments:` wrapper. + +--- + +## Primary Key Tests + +Every column named `_key` or `_pk` must have these three tests: + +```yaml +columns: + - name: _key + description: "The unique primary key for the table." 
+ data_tests: + - unique + - not_null + - dbt_utils.not_empty_string +``` + +This pattern is enforced across all layers. + +--- + +## PII Tagging + +Columns containing personally identifiable information must be tagged using +`config.meta`. Do NOT put `meta` at the top level — it must be nested inside +`config`. + +```yaml +# CORRECT +columns: + - name: email + description: "User email address" + data_type: string + config: + meta: + contains_pii: true + data_retention: "undefined" + +# WRONG (meta at top level — triggers deprecation warnings) +columns: + - name: email + description: "User email address" + meta: + contains_pii: true + data_retention: "undefined" +``` + +Always include `data_retention: "undefined"` when adding a `contains_pii` tag. + +Do NOT duplicate PII information across `tags` and `meta`: + +```yaml +# WRONG (redundant — tags and meta both indicate PII) +columns: + - name: email + config: + tags: + - "contains_pii" + meta: + contains_pii: true + data_retention: "undefined" + +# CORRECT (meta is the single source of truth) +columns: + - name: email + config: + meta: + contains_pii: true + data_retention: "undefined" +``` + +### What Counts as PII + +- Full, first, middle, or last name +- Email addresses +- Phone numbers +- Physical addresses +- Government IDs (SSN, passport numbers) +- Financial information + +--- + +## Model-Level Configuration + +Tags and meta at the model level also go under `config:`: + +```yaml +models: + - name: my_model + description: "Model description" + config: + tags: + - "events" + meta: + contains_pii: false + data_retention: "undefined" + columns: + - name: _key + data_tests: + - unique + - not_null +``` + +Never define `config:` twice in the same block: + +```yaml +# WRONG (duplicate config key) +models: + - name: my_model + config: + tags: + - "events" + config: + contract: { enforced: true } + +# CORRECT (single config block) +models: + - name: my_model + config: + tags: + - "events" + contract: { enforced: true 
} +``` + +--- + +## Test Configuration + +Custom keys must go in `config.meta`. Native test-level settings like `where`, +`severity`, and error thresholds go directly under `config:`: + +```yaml +columns: + - name: order_id + data_tests: + - unique: + config: + error_if: ">10" + warn_if: ">10" + - not_null + - accepted_values: + arguments: + values: ["placed", "shipped", "completed", "returned"] + config: + where: "order_date >= CURRENT_DATE - INTERVAL '30 days'" + severity: warn +``` + +--- + +## Common Test Types + +### Simple Tests (no arguments needed) + +```yaml +data_tests: + - unique + - not_null + - dbt_utils.not_empty_string +``` + +### Accepted Values + +```yaml +data_tests: + - accepted_values: + arguments: + values: ["Active", "Completed", "Cancelled", "Pending"] +``` + +### Relationships (Foreign Keys) + +```yaml +data_tests: + - relationships: + arguments: + to: ref('silver_dim_projects') + field: project_id +``` + +### Custom Error Thresholds + +For known edge cases where a few duplicates are expected: + +```yaml +data_tests: + - unique: + config: + error_if: ">10" + warn_if: ">10" +``` + +--- + +## Unit Tests + +For unit tests, use the `unit_tests:` key. Custom keys like `severity` must +go in `config.meta`: + +```yaml +unit_tests: + - name: test_my_model_logic + model: my_model + config: + meta: + severity: warn + given: + - input: ref('source_model') + rows: + - { id: "123", status: "active" } + - { id: "456", status: "inactive" } + expect: + rows: + - { id: "123", status: "active" } +``` + +For detailed unit test patterns, see the `adding-dbt-unit-test` skill in the +lf-dbt repository's `.agents/skills/` directory.
+ +--- + +## Test File Organization + +Test files are co-located with models and follow this naming convention: + +| Layer | Test File | +|-------|-----------| +| Bronze | `models/bronze/{source}/bronze_{source}_tests.yml` | +| Silver | `models/silver/dim/silver_dim_tests.yml` or `models/silver/fact/silver_fact_tests.yml` | +| Gold | `models/gold/fact/gold_fact_tests.yml` | +| Platinum | `models/platinum/platinum_tests.yml` or per-folder | + +Some layers use a single consolidated test file (like `silver_dim_tests.yml`), +while others have per-source test files. Check the existing pattern in the +target directory and follow it. + +--- + +## Checklist for New Tests + +- [ ] License header at top of YML file +- [ ] `version: 2` declared +- [ ] `data_tests:` used (not deprecated `tests:`) +- [ ] `arguments:` wrapper on parameterized tests +- [ ] Primary key columns have `unique`, `not_null`, `dbt_utils.not_empty_string` +- [ ] PII columns tagged with `config.meta.contains_pii: true` +- [ ] `data_retention: "undefined"` included with PII tags +- [ ] `meta` and `tags` nested under `config:` (not at top level) +- [ ] No duplicate `config:` keys in the same block +- [ ] Custom keys nested in `config.meta` (not directly in `config`) +- [ ] Column `data_type` specified for key columns +- [ ] Descriptions provided for all columns
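
A few of the checklist items above can be checked mechanically before opening a PR. The sketch below is an assumption-laden illustration: `check_tests_yml` is a hypothetical helper that does not exist in lf-dbt, and it only covers the license header, the `version: 2` declaration, and the deprecated `tests:` key. The remaining items still need a manual review.

```bash
#!/usr/bin/env bash
# Sketch only: check_tests_yml is a hypothetical helper, not part of lf-dbt.
# It covers three checklist items that can be verified with plain grep.
check_tests_yml() {
    local file="$1"
    local status=0

    # License header: the SPDX line should appear near the top of the file.
    if ! head -n 5 "$file" | grep -q 'SPDX-License-Identifier'; then
        echo "missing license header: $file"
        status=1
    fi

    # The guide expects `version: 2` to be declared.
    if ! grep -qE '^version:[[:space:]]*2[[:space:]]*$' "$file"; then
        echo "missing 'version: 2': $file"
        status=1
    fi

    # Deprecated `tests:` key. `data_tests:` and `unit_tests:` do not match,
    # because only whitespace or a list dash may precede `tests:`.
    if grep -nE '^[[:space:]]*(-[[:space:]]+)?tests:' "$file"; then
        echo "deprecated 'tests:' key found: $file"
        status=1
    fi

    return "$status"
}
```

For example, `check_tests_yml models/silver/dim/silver_dim_tests.yml` prints one message per failed check and returns a non-zero status if any check failed, so it can gate a commit hook or CI step.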