diff --git a/presets/data-warehouse/commands/speckit.constitution.md b/presets/data-warehouse/commands/speckit.constitution.md new file mode 100644 index 000000000..e8ddacc0c --- /dev/null +++ b/presets/data-warehouse/commands/speckit.constitution.md @@ -0,0 +1,107 @@ +--- +description: Create or update the data warehouse project constitution in .specify/memory/constitution.md. +--- + +## User Input + +```text +$ARGUMENTS +``` + +## Outline + +### 1. Collect Project Context + +Ask the user (or infer from the current project if files exist) for: + +**Warehouse Platform** +- What database/platform? (Snowflake, BigQuery, Redshift, Databricks, Postgres+dbt, etc.) +- Are there multiple environments? (dev, staging, prod) + +**Modeling Philosophy** +- Default modeling pattern? (Star schema preferred / Snowflake schema allowed / Data Vault required) +- Are conformed dimensions required? Or is each domain fully autonomous? + +**Naming Conventions** +- Any project-specific deviations from the standard `fct_/dim_/stg_` convention? +- Schema/layer names? (raw/staging/mart vs. bronze/silver/gold vs. custom) +- Any column naming conventions beyond the defaults? + +**SCD Standards** +- Default SCD strategy when unspecified in a feature spec? (Type 1 is the safest default) +- Is historical preservation required for any dimension class by policy (e.g., customer data must always be Type 2)? + +**Data Quality** +- Which DQ tool is used? (dbt tests, Great Expectations, Soda, custom framework) +- What is the default row count variance threshold for the mandatory check? (suggest ±10%) +- Is there a centralized DQ results table that all pipelines must write to? + +**Performance Standards** +- SLA window for nightly batch? (e.g., "All marts available by 06:00 local business timezone") +- Maximum acceptable incremental load duration? +- Are there budget/cost constraints on compute? + +**Security & Compliance** +- Which columns are always PII and must be treated as Type 1 (always current)? 
+- Are there data retention requirements?
+- Which teams have access to which layers?
+
+**Orchestration**
+- What orchestration tool is used? (Airflow, dbt Cloud, Prefect, Azure Data Factory, none)
+- Are there required pipeline metadata/audit patterns?
+
+### 2. Draft the Constitution
+
+Write `.specify/memory/constitution.md` using the DW constitution template. Fill every section with the project's specific answers. Where the user did not specify, use the proven defaults from the template and note them as defaults.
+
+The constitution MUST include all eight standard sections:
+1. Dimensional Modeling Standards
+2. Naming Conventions (with the full table for table prefixes, column conventions, and schema/layer names)
+3. SCD Standards
+4. Data Quality Gates
+5. Pipeline Idempotency
+6. Data Lineage & Documentation
+7. Performance Standards
+8. Security & Compliance
+
+Plus a **Governance** section (version, ratification date, amendment process).
+
+If a section doesn't apply to this project (e.g., no PII data), keep the section heading and note "Not applicable — [reason]" rather than deleting it.
+
+### 3. Validate the Constitution
+
+Verify the drafted constitution is self-consistent:
+
+- Naming conventions don't conflict with each other
+- SCD defaults are compatible with stated PII policy (PII must be Type 1)
+- DQ gate thresholds are achievable given the stated SLA window
+- Performance standards are specific (times, not vague "fast")
+
+Surface any contradictions to the user before writing.
+
+### 4. Write the Constitution File
+
+Write the complete constitution to `.specify/memory/constitution.md`.
+
+If a constitution file already exists at that path:
+- Show the user a diff of what will change
+- Ask for confirmation before overwriting
+- Alternatively, create `.specify/memory/constitution-draft.md` for the user to review first
+
+### 5. 
Report Completion + +Report to the user: +- Location: `.specify/memory/constitution.md` +- Summary of the key decisions captured (modeling pattern, naming scheme, DQ tool, SLA) +- Any defaults applied (list them explicitly) +- How to amend: "Edit `.specify/memory/constitution.md` directly; update the Version and Last Amended date; add a note in the Governance section about what changed and why" + +--- + +## DW Constitution Quick Guidelines + +- **Be specific, not aspirational** — "queries should be fast" is not a standard; "P1 consumer queries must return in < 5 seconds" is +- **Every naming convention needs an example** — abstract rules get misinterpreted; concrete examples prevent it +- **SCD policy belongs in the constitution** — individual feature specs should reference the constitution's default, not re-derive it each time +- **DQ gate defaults should be conservative** — it's easier to relax a threshold later than to explain why bad data reached consumers +- **The constitution is a living document** — date it, version it, and establish the amendment process so it gets updated rather than ignored diff --git a/presets/data-warehouse/commands/speckit.plan.md b/presets/data-warehouse/commands/speckit.plan.md new file mode 100644 index 000000000..1b90d2328 --- /dev/null +++ b/presets/data-warehouse/commands/speckit.plan.md @@ -0,0 +1,178 @@ +--- +description: Create a data warehouse implementation plan and store it in plan.md. +handoffs: + - label: Generate DW Task List + agent: speckit.tasks + prompt: Generate the implementation task list for this data warehouse plan +--- + +## User Input + +```text +$ARGUMENTS +``` + +## Outline + +### 1. Load Context + +- Read `.specify/feature.json` to get the feature directory path +- Read `/spec.md` — the authoritative source of truth +- Read `.specify/memory/constitution.md` — project standards that govern every decision +- Read `/data-contracts/` if present — source system schemas and SLAs + +### 2. 
Constitution Check + +Before any design work, verify the spec is compatible with the project constitution. Populate the Constitution Check table in `plan.md`: + +| Principle | Status | Notes | +|-----------|--------|-------| +| Dimensional Modeling Standards | Pass / Violation | Grain defined? Conformed dims identified? | +| Naming Conventions | Pass / Violation | fct_/dim_/stg_ prefixes will be used? | +| SCD Standards | Pass / Violation | SCD type documented per dim attribute? | +| DQ Gates | Pass / Violation | Mandatory checks defined before serving? | +| Pipeline Idempotency | Pass / Violation | Re-run strategy clear? | + +Record any violations in the Complexity Tracking section with justification. If a violation cannot be justified, surface it to the user before continuing. + +### 3. Schema Design + +Design the physical schema using the spec's Dimensional Model section: + +**Fact Tables**: For each fact in the spec: +- Define the physical grain columns (which columns together form the grain) +- Choose Fact Type (Transaction / Periodic Snapshot / Accumulating Snapshot) +- List all foreign keys (to which dimension) and degenerate dimensions +- List all measures with additive / semi-additive / non-additive classification +- Choose Load Type: Incremental (with watermark column) / Full Refresh / Append-Only +- Choose Partition column (almost always `date_key` or equivalent date column) +- Choose Clustering/Sorting columns (FK columns used in common filters) +- Add standard audit columns: `load_id`, `loaded_at`, `source_system` + +**Dimension Tables**: For each dimension in the spec: +- Define business key (from source; immutable) +- Define surrogate key (system-generated BIGINT) +- Document SCD strategy per attribute (use the spec's SCD column): + - Type 1 (overwrite): UPDATE in place + - Type 2 (history): new row on change with `valid_from / valid_to / is_current` + - Type 3 (prior+current): add `prior_[attribute]` column, UPDATE in place + - Static: no SCD logic 
needed +- List all descriptive attributes +- Add standard audit columns + +**Bridge Tables** (if many-to-many relationships exist): Define weighting scheme and validity window. + +### 4. ETL/ELT Architecture + +Design the pipeline layer by layer: + +**Layer Definitions**: Document schema names, contents, load type, retention policy, and access controls for each layer (Raw, Staging, Mart) per the constitution's layer naming convention. + +**Load Strategy per Table**: For every table, specify: +- Load type (append, truncate-reload, incremental MERGE, incremental DELETE+INSERT) +- Watermark column (for incremental loads) +- Deduplication strategy (dedup key, hash-based, MERGE key) +- Estimated duration at expected volume + +**Idempotency Design**: For each layer, document the re-run strategy that prevents duplicate data. No pipeline may silently create duplicates on re-execution. + +### 5. Data Quality Implementation + +Map each DQ check from the spec to a concrete implementation: + +For each DQ-### item in the spec: +- Which layer does this check run in? (Staging or Mart) +- How is it implemented? (dbt test, custom macro, Great Expectations suite, Python assertion) +- What exactly triggers failure? (NULL count > 0, duplicate count > 0, row count delta > X%) +- Failure action: abort pipeline and quarantine, or alert and continue? + +**Quarantine Schema**: Define the `quarantine.[table]_rejected` DDL (mirrors source columns + `rejection_reason`, `rejected_at`, `source_run_id`, `is_reprocessed`). + +**Observability**: Define the `pipeline_runs` audit table schema and any alerting integrations (Slack, PagerDuty, email). + +### 6. 
Source-to-Target Field Mapping + +Create `/data-lineage.md` with a column-level mapping table: + +| Target Table | Target Column | Source System | Source Table | Source Column | Transformation Rule | +|-------------|--------------|--------------|-------------|--------------|---------------------| +| fct_[name] | net_revenue | [CRM] | orders | amount | `amount - discount - tax` | +| dim_customer | customer_segment | [CRM] | customers | spend_tier_code | `'A' → 'Premium'; 'B' → 'Standard'` | + +### 7. Source System Contracts + +For each source system in the spec, create `/data-contracts/[source_name]-contract.md`: + +```markdown +# Data Contract: [Source System Name] + +**Version**: 1.0 **Owner**: [Team] **Effective**: [DATE] + +## Delivery +- Format: [REST API / flat file / DB snapshot / CDC stream] +- Schedule: [Nightly at HH:MM UTC / real-time / on-demand] +- Location: [S3 path / API endpoint / JDBC connection name] + +## Schema Guarantee +[List columns the source guarantees will always be present and non-null] + +## Known Limitations +[List known data quality issues, schema instability risks, or latency variances] + +## Change Notification +[How source team will communicate schema changes — SLA for advance notice] +``` + +### 8. Quickstart Validation Document + +Create `/quickstart.md` with sign-off queries for each consumer use case: + +```markdown +# Quickstart Validation: [FEATURE NAME] + +## Use Case 1 Sign-Off Query (P1) + +[SQL query that a business stakeholder can run to verify the P1 use case] + +Expected result: [Describe what correct output looks like; include row count or totals if known] + +## Use Case 2 Sign-Off Query (P2) + +[SQL query for P2 validation] + +Expected result: [Description] + +## Reconciliation Query + +[Query to compare total measures in the warehouse against a known source system total] +``` + +### 9. 
Project Structure Decision + +Based on the technical context (platform, framework, team conventions), select and document the source code structure. Show only the chosen option — remove unused options from plan.md. + +### 10. Write plan.md + +Write the complete plan to `/plan.md` using the DW plan template structure. Every section must be concrete — no placeholders left in the final output. + +### 11. Report Completion + +Report to the user: +- `plan.md` location +- Schema design summary (list fact and dimension tables) +- SCD strategy summary (dimension → SCD type mapping) +- Load strategy summary (table → load type) +- Constitution check results +- Artifacts created: `data-lineage.md`, `data-contracts/`, `quickstart.md` +- Next step: `/speckit.tasks` + +--- + +## DW Planning Quick Guidelines + +- **Schema decisions are implementation; SCD decisions are business** — the spec owns SCD type; the plan owns the SQL implementation +- **Never leave load type as "TBD"** — every table must have a documented strategy before tasks can be written +- **Grain integrity is paramount** — verify the chosen fact load strategy cannot produce duplicate rows at the grain +- **Dimension surrogate key stability matters** — if a dimension's surrogate key changes on reload, all fact foreign keys are broken +- **Always prefer star schema** over snowflake unless the constitution explicitly allows snowflake and there's a documented reason +- **Document the idempotency strategy** explicitly for every table — "use MERGE" is not enough; specify the MERGE key diff --git a/presets/data-warehouse/commands/speckit.specify.md b/presets/data-warehouse/commands/speckit.specify.md new file mode 100644 index 000000000..5a0b4f6f3 --- /dev/null +++ b/presets/data-warehouse/commands/speckit.specify.md @@ -0,0 +1,182 @@ +--- +description: Create a data warehouse feature specification and store it in spec.md. 
+handoffs: + - label: Build DW Technical Plan + agent: speckit.plan + prompt: Create a data warehouse implementation plan for this spec. Platform is... + - label: Clarify DW Requirements + agent: speckit.clarify + prompt: Clarify data warehouse specification requirements + send: true +--- + +## User Input + +```text +$ARGUMENTS +``` + +You **MUST** consider the user input before proceeding (if not empty). + +## Pre-Execution Checks + +Check for extension hooks (`before_specify`) by reading `.specify/extensions.yml` if it exists, following the standard hook execution rules. If not present, skip silently. + +## Outline + +Given the user's feature description, do the following: + +### 1. Generate a Feature Short Name + +Create a 2–4 word slug for the feature (action-noun, lowercase, hyphenated). + +Examples: `customer-dim-scd2`, `sales-fact-daily`, `product-etl-pipeline`, `revenue-mart-rebuild` + +### 2. Create the Spec Directory and File + +- Determine the feature directory following the standard resolution order: + 1. Explicit `SPECIFY_FEATURE_DIRECTORY` if provided + 2. Auto-generate under `specs/` using sequential (`NNN`) or timestamp prefix per `.specify/init-options.json` +- `mkdir -p SPECIFY_FEATURE_DIRECTORY` +- Copy `templates/spec-template.md` to `SPECIFY_FEATURE_DIRECTORY/spec.md` +- Write `.specify/feature.json`: + ```json + { "feature_directory": "" } + ``` + +### 3. Collect and Fill DW-Specific Information + +Parse the feature description to extract — and make informed defaults for anything unspecified: + +**Data Sources** +- What source systems feed this feature? (CRM, ERP, event stream, flat files, APIs?) +- What is the refresh frequency and estimated row volume? +- Are there known data quality issues in the source? + +**Dimensional Model** +- What is the **grain**? (one row per what, per what time period?) — this is the most critical question; if unclear, propose the most natural grain and flag it +- What fact table(s) are needed? 
What type (transaction, periodic snapshot, accumulating snapshot)? +- What dimension tables are needed? Which are conformed (shared) vs. local? +- What SCD strategy applies to each dimension? (Type 1 overwrite / Type 2 history / Type 3 prior+current / static) + +**Consumer Use Cases** +- Who will query this data? (BI team in Tableau, data scientists, application APIs?) +- What business questions does it answer? +- What is the priority order of consumers/use cases? + +**Business Rules** +- How are key measures calculated? (formulas, exclusions, rounding rules) +- Are there filter or exclusion rules? (test data, cancelled records, etc.) +- Are there derived dimension attributes? (segments, tiers, regions mapped from codes?) + +**Data Quality** +- What mandatory checks should abort the pipeline on failure? +- What warning checks should alert without blocking? +- How should rejected records be quarantined? + +**Freshness & SLA** +- What time must data be available by each day? +- What is the acceptable load duration? + +Mark critical unknowns with `[NEEDS CLARIFICATION: specific question]`. Limit to **3 markers maximum** — make informed defaults for everything else and document them in the Assumptions section. + +**Prioritize clarifications by impact**: grain definition > source availability > consumer SLA > business rule ambiguity. + +**Common DW defaults** (do not ask, just apply): +- Date dimension: assume a shared `dim_date` already exists unless stated otherwise +- Surrogate keys: system-generated BIGINT +- Audit columns: `created_at`, `updated_at`, `source_system`, `load_id` on every table +- Default SCD for customer/product dimensions with no stated strategy: Type 1 (overwrite) +- Row count variance threshold: ±10% of prior load + +### 4. Write the Specification + +Fill `spec.md` using the DW spec template structure. Replace all placeholders with concrete content derived from the feature description. Preserve all section headings. 
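+
+For example, a concretely filled grain statement and DQ/SLA fragment might read as follows (hypothetical feature and values, shown only to illustrate the expected level of specificity; the real content must come from the user's feature description):
+
+```markdown
+## Grain Statement
+
+One row per order line per calendar day, as of the end of the source system's business day.
+
+## Data Quality Requirements
+
+- **DQ-001 (mandatory)**: `order_line_id` is non-null and unique at the grain; on failure, abort the pipeline and quarantine the failing records.
+- **DQ-002 (warning)**: the NULL rate in `customer_segment` stays below 5%; on breach, alert but continue.
+
+## SLA
+
+| Milestone | Target |
+|-----------|--------|
+| Mart refreshed and available | 06:00 UTC each day |
+```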
+ +Ensure every section is actionable: +- **Grain Statement**: must be a single unambiguous sentence +- **Consumer Use Cases**: each must have an independent test described +- **Business Rules**: every measure formula must be written out explicitly +- **DQ Requirements**: numbered DQ-001, DQ-002, … with clear pass/fail criteria +- **SLA table**: concrete times, not vague "daily" + +### 5. Specification Quality Validation + +After writing the spec, validate against these DW-specific criteria: + +**Create** `SPECIFY_FEATURE_DIRECTORY/checklists/requirements.md`: + +```markdown +# DW Specification Quality Checklist: [FEATURE NAME] + +**Purpose**: Validate completeness before planning begins +**Created**: [DATE] +**Feature**: [Link to spec.md] + +## Dimensional Modeling Completeness +- [ ] Grain statement is present and unambiguous (one sentence) +- [ ] All fact tables have a Fact Type defined (transaction / snapshot / accumulating) +- [ ] All dimension tables have an SCD strategy documented +- [ ] Conformed dimensions are identified (shared vs. local) +- [ ] Semi-additive or non-additive measures are called out explicitly + +## Data Source Completeness +- [ ] All source systems are listed with format, frequency, and estimated volume +- [ ] Source data quality baseline is documented (known issues, latency, history availability) + +## Consumer Use Cases +- [ ] Each use case has a priority (P1, P2, ...) +- [ ] Each use case has an independent test described +- [ ] At least one illustrative query pattern is included per use case + +## Business Rules +- [ ] Every measure formula is written out explicitly (no "standard calculation" placeholders) +- [ ] All filter and exclusion rules are documented +- [ ] Derived dimension attributes have their derivation logic stated + +## Data Quality & SLA +- [ ] Mandatory DQ checks are numbered (DQ-001, ...) 
with clear pass/fail criteria +- [ ] Warning checks are distinguished from mandatory (abort) checks +- [ ] Quarantine behavior is described +- [ ] SLA table has concrete times (not vague "daily") + +## General Quality +- [ ] No implementation details (no SQL, no framework names, no column types) +- [ ] No [NEEDS CLARIFICATION] markers remain +- [ ] Assumptions section covers all defaults applied +- [ ] Success criteria are measurable and technology-agnostic + +## Notes + +[Document any remaining issues or items requiring stakeholder input] +``` + +**Run validation**: Check the spec against every checklist item. + +- If all pass: mark checklist complete +- If items fail: fix the spec and re-validate (up to 3 iterations) +- If `[NEEDS CLARIFICATION]` markers remain: present up to 3 questions to the user using the standard Q&A format (options A/B/C/Custom table), wait for answers, update spec, re-validate + +### 6. Report Completion + +Report to the user: +- `SPECIFY_FEATURE_DIRECTORY` — the feature directory path +- `spec.md` location +- Grain statement (repeat it explicitly — this is the most important output) +- Checklist results summary +- Any assumptions made (list them) +- Next step: `/speckit.plan` + +### 7. Post-Execution Hooks + +Check `.specify/extensions.yml` for `after_specify` hooks and execute per standard hook rules. + +--- + +## DW Specification Quick Guidelines + +- Focus on **WHAT** data is needed, **WHY** consumers need it, and **HOW FRESH** it must be +- **Never** include implementation details: no SQL dialects, no framework names (dbt, Spark, Airflow), no column data types, no index strategies +- The grain statement is the most critical artifact — it determines everything else. If it isn't clear from the input, derive the most natural grain and mark it for confirmation. +- Business rules belong here. If a measure formula can't be written unambiguously, it's a clarification item. 
+- SCD strategy decisions belong in the spec (business decision), not the plan (implementation detail) diff --git a/presets/data-warehouse/commands/speckit.tasks.md b/presets/data-warehouse/commands/speckit.tasks.md new file mode 100644 index 000000000..bb65d8241 --- /dev/null +++ b/presets/data-warehouse/commands/speckit.tasks.md @@ -0,0 +1,107 @@ +--- +description: Generate the data warehouse implementation task list and store it in tasks.md. +--- + +## User Input + +```text +$ARGUMENTS +``` + +## Outline + +### 1. Load Context + +- Read `.specify/feature.json` to get the feature directory path +- Read `/spec.md` — consumer use cases (with priorities), DQ requirements, SLAs +- Read `/plan.md` — schema design, load strategies, SCD decisions, project structure +- Read `/data-lineage.md` — field-level source-to-target mappings +- Read `/data-contracts/` — source system contracts (if present) +- Read `.specify/memory/constitution.md` — to ensure tasks enforce constitutional requirements + +### 2. Understand the Scope + +Before writing any tasks, extract and confirm: + +**From spec.md**: +- Consumer use cases and their priorities (P1, P2, P3 …) +- DQ check identifiers (DQ-001, DQ-002 …) and their failure actions +- SLA and refresh schedule + +**From plan.md**: +- Every table: fact and dimension tables with their load type and idempotency strategy +- SCD type per dimension attribute +- Data quality tool (dbt tests / Great Expectations / custom macros / etc.) +- Project structure (dbt models path / Python pipeline path / etc.) +- Orchestration tool (Airflow DAG / dbt Cloud job / etc.) + +**From data-lineage.md**: +- Source tables that need raw landing tables +- Complex transformations that need intermediate staging models + +### 3. Generate Tasks + +Create `/tasks.md` using the DW tasks template structure. Replace all sample tasks with actual tasks derived from the spec and plan. 
Follow these rules: + +**Task ID format**: Sequential integers, zero-padded to 3 digits (T001, T002, …) + +**Layer prefix in every task**: +- `[RAW]` — raw/bronze extraction tasks +- `[STG]` — staging/silver cleaning tasks +- `[DIM]` — dimension table tasks +- `[MART]` — fact table and mart tasks +- `[DQ]` — data quality test writing tasks +- `[OPS]` — orchestration, monitoring, documentation tasks + +**Every task must include the exact file path** of the model, script, test file, or config being created or modified. + +**DQ test tasks come BEFORE implementation tasks** within each phase. Tests must be written first and verified as failing before implementation begins. + +**Phase structure** (replicate from tasks template): + +1. **Phase 1 — Environment Setup**: Schemas, project scaffold, audit table, quarantine DDL, CI/CD +2. **Phase 2 — Raw/Bronze**: One extraction task per source table from `data-lineage.md`; unit test for each extractor; row count validation +3. **Phase 3 — Staging/Silver**: One staging model per source entity from `plan.md`; deduplication; DQ tests for structural checks; quarantine routing; row count reconciliation +4. **Phase 4 — Dimensions**: DQ tests first (write and confirm failure); then one implementation block per dimension from `plan.md`; SCD logic per constitution; audit columns; SCD correctness test scenario +5. **Phase 5 — Facts / Consumer Use Case 1 (P1)**: DQ tests first; grain join; measure derivations per spec business rules; partitioning/clustering; idempotent load; quarantine routing; audit table emit — **checkpoint: P1 sign-off query passes** +6. **Phase 6+ — Facts / Consumer Use Cases 2, 3, ... (P2, P3, …)**: One phase per consumer use case, following the same DQ-first pattern +7. 
**Phase N — Operations**: DAG/scheduler, failure alerting, warning alerting, audit table, data catalog, data lineage finalization, quickstart sign-off, runbook
+
+**Parallelism marking**: Mark `[P]` on any task that can run concurrently with other tasks in the same phase — specifically tasks that operate on different tables/files with no shared dependencies.
+
+**Typical parallel opportunities**:
+- Multiple raw source tables: all parallel
+- Multiple staging models for different entities: all parallel
+- Multiple dimensions with no FK dependency between them: all parallel
+- DQ test authoring for different tables: all parallel
+- Multiple use case fact tasks (after dimensions complete): parallelizable across developers
+
+### 4. Write tasks.md
+
+Write all tasks to `/tasks.md`. The generated file MUST:
+- Contain only real tasks (no sample/template placeholders)
+- Include concrete file paths for every task
+- Have a clear checkpoint after each phase
+- List the Dependencies & Execution Order section showing phase dependencies
+- Include the DQ Validation Checklist section
+- Include the Rollback Checklist section
+
+### 5. Report Completion
+
+Report to the user:
+- `tasks.md` location
+- Task count per phase
+- Which tasks are marked `[P]` (parallelizable)
+- P1 MVP scope: exactly which tasks deliver the first independently testable use case
+- Suggested starting point: "Complete Phase 1 (Setup) first, then run the Phase 2 (Raw) [P] tasks in parallel"
+
+---
+
+## DW Task Generation Quick Guidelines
+
+- **DQ tests before implementation** — this is non-negotiable per the constitution. If a phase has no DQ tests, add them.
+- **Dimensions before facts** — always. Fact tasks must be in a later phase than the dimensions they reference.
+- **Exact file paths** — every task must name the exact model file, test file, or config that will be created or modified. "Create a staging model" is insufficient; "Create `models/staging/salesforce/stg_salesforce__accounts.sql`" is correct. 
+- **Idempotency task is mandatory** — every fact table implementation block must include a task that verifies re-running the pipeline for the same date produces no duplicates. +- **Checkpoint queries** — every use-case phase must end with a task to run the corresponding sign-off query from `quickstart.md`. +- **Rollback and runbook tasks belong in the Operations phase** — do not skip them. diff --git a/presets/data-warehouse/preset.yml b/presets/data-warehouse/preset.yml new file mode 100644 index 000000000..512606657 --- /dev/null +++ b/presets/data-warehouse/preset.yml @@ -0,0 +1,70 @@ +schema_version: "1.0" + +preset: + id: "data-warehouse" + name: "Data Warehouse Development Kit" + version: "1.0.0" + description: "Spec-driven development kit for data warehousing projects. Covers dimensional modeling, ETL/ELT pipelines, SCD handling, data quality gates, and warehouse-native workflows." + author: "branky" + repository: "https://github.com/branky/spec-kit" + license: "MIT" + +requires: + speckit_version: ">=0.6.0" + +provides: + templates: + - type: "template" + name: "spec-template" + file: "templates/spec-template.md" + description: "DW feature spec template — data sources, dimensional model, business rules, data quality, and SLAs" + replaces: "spec-template" + + - type: "template" + name: "plan-template" + file: "templates/plan-template.md" + description: "DW implementation plan — schema design, ETL architecture, SCD strategy, DQ framework, and performance" + replaces: "plan-template" + + - type: "template" + name: "constitution-template" + file: "templates/constitution-template.md" + description: "DW project constitution — modeling standards, naming conventions, data quality mandates, and governance" + replaces: "constitution-template" + + - type: "template" + name: "tasks-template" + file: "templates/tasks-template.md" + description: "DW task list template organized by pipeline layer (raw → staging → dimensions → facts) and consumer use case" + replaces: 
"tasks-template" + + - type: "command" + name: "speckit.specify" + file: "commands/speckit.specify.md" + description: "DW-aware specify — captures data sources, consumers, dimensional model requirements, DQ rules, and SLAs" + replaces: "speckit.specify" + + - type: "command" + name: "speckit.plan" + file: "commands/speckit.plan.md" + description: "DW-aware plan — designs schema, ETL/ELT strategy, SCD types, DQ implementation, and partitioning" + replaces: "speckit.plan" + + - type: "command" + name: "speckit.tasks" + file: "commands/speckit.tasks.md" + description: "DW-aware tasks — generates pipeline-phase task list (raw, staging, dimensions, facts, ops) with DQ gates" + replaces: "speckit.tasks" + + - type: "command" + name: "speckit.constitution" + file: "commands/speckit.constitution.md" + description: "DW constitution — defines modeling standards, naming conventions, SCD rules, and quality gates for the project" + replaces: "speckit.constitution" + +tags: + - "data-warehouse" + - "etl" + - "dimensional-modeling" + - "analytics" + - "data-engineering" diff --git a/presets/data-warehouse/templates/constitution-template.md b/presets/data-warehouse/templates/constitution-template.md new file mode 100644 index 000000000..b2bb5133c --- /dev/null +++ b/presets/data-warehouse/templates/constitution-template.md @@ -0,0 +1,136 @@ +# [PROJECT_NAME] Data Warehouse Constitution + + + +--- + +## I. 
Dimensional Modeling Standards (NON-NEGOTIABLE) + +All warehouse models MUST follow dimensional modeling principles: + +- Every fact table MUST have a **grain statement** documented in its feature spec before implementation begins +- Fact tables contain **measures** (additive, semi-additive, or non-additive) plus **foreign keys** to dimension tables — no business logic columns +- Dimension tables contain **descriptive attributes** plus a system-generated surrogate key and the source business key +- Raw and staging data MUST NOT be exposed to BI consumers — only the mart/serving layer is consumer-facing +- **Conformed dimensions** MUST be shared across fact tables — never duplicated with different definitions +- The modeling pattern (star, snowflake, data vault) MUST be justified in the feature plan + +--- + +## II. Naming Conventions (NON-NEGOTIABLE) + +**Table Prefixes**: + +| Object Type | Prefix | Example | +|-------------|--------|---------| +| Fact tables | `fct_` | `fct_orders`, `fct_web_sessions` | +| Dimension tables | `dim_` | `dim_customer`, `dim_product` | +| Staging models | `stg_[source]__[entity]` | `stg_salesforce__accounts` (double underscore) | +| Intermediate models | `int_[domain]__[transform]` | `int_finance__revenue_allocation` | +| Aggregate tables | `agg_[fact]_[grain]` | `agg_orders_monthly` | +| Bridge tables | `bridge_[a]_[b]` | `bridge_customer_account` | +| Quarantine tables | `quarantine.[table]_rejected` | `quarantine.fct_orders_rejected` | + +**Column Conventions**: + +| Column Type | Convention | Example | +|------------|-----------|---------| +| Surrogate keys | `[entity]_key` (BIGINT) | `customer_key`, `product_key` | +| Business/natural keys | `[entity]_id` | `customer_id`, `order_id` | +| SCD Type 2 control | `valid_from`, `valid_to`, `is_current` | — | +| Audit columns (all tables) | `created_at`, `updated_at`, `source_system`, `load_id` | — | +| All column names | snake_case, lowercase | `net_revenue`, `order_line_id` | + 
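+The prefixes and column conventions above combine as in the following illustrative DDL (a generic SQL sketch, not platform-specific; adapt types and syntax to the chosen warehouse):
+
+```sql
+-- Illustrative only: a dimension table that follows every convention in this section
+CREATE TABLE mart.dim_customer (
+    customer_key     BIGINT       NOT NULL, -- surrogate key, system-generated
+    customer_id      VARCHAR(64)  NOT NULL, -- business key from the source system
+    customer_name    VARCHAR(255),
+    customer_segment VARCHAR(50),
+    -- SCD Type 2 control columns
+    valid_from       TIMESTAMP    NOT NULL,
+    valid_to         TIMESTAMP,             -- NULL = currently active row
+    is_current       BOOLEAN      NOT NULL,
+    -- Audit columns required on all tables
+    created_at       TIMESTAMP    NOT NULL,
+    updated_at       TIMESTAMP    NOT NULL,
+    source_system    VARCHAR(50)  NOT NULL,
+    load_id          VARCHAR(64)  NOT NULL
+);
+```
+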
+**Schema/Layer Names**: + +| Layer | Name | +|-------|------| +| Raw ingestion | `raw` or `bronze` | +| Cleaned/typed | `staging` or `silver` | +| Consumer-facing | `mart` or `gold` | + +--- + +## III. SCD (Slowly Changing Dimension) Standards (NON-NEGOTIABLE) + +- The SCD type for **every dimension attribute** MUST be documented in the feature spec before any dimension is built +- **SCD Type 2** MUST use these control columns: `valid_from TIMESTAMP NOT NULL`, `valid_to TIMESTAMP` (NULL = currently active row), `is_current BOOLEAN NOT NULL` +- **Business keys are immutable** — they MUST never be updated; only descriptive attributes may change +- **Surrogate keys are system-generated** — they carry no business meaning and MUST never be exposed to source systems +- Historical rows in SCD Type 2 dimensions MUST be **preserved permanently** — never deleted during normal operation + +--- + +## IV. Data Quality Gates (NON-NEGOTIABLE) + +All pipelines MUST enforce these gates before data reaches the serving layer: + +1. **No NULLs** in surrogate keys or business key columns +2. **No duplicates** at the defined grain of every fact table +3. **Referential integrity** — every fact foreign key must resolve to a dimension surrogate key +4. **Row count plausibility** — new load row count must be within an expected variance of the prior load + +**Failure policy**: Pipelines that fail a mandatory DQ check MUST abort and route failing records to the quarantine table. Silent continuation is prohibited. + +**Warning checks** (NULL rate thresholds, measure anomaly detection) MUST alert but MUST NOT block the pipeline. + +--- + +## V. 
Pipeline Idempotency (NON-NEGOTIABLE) + +All ETL/ELT pipelines MUST be idempotent: + +- Re-running the pipeline for the **same time window** MUST produce **identical results** — no data duplication +- Each pipeline run MUST be identifiable via a unique `load_id` (run identifier) +- **Full reloads** MUST be executable without manual table manipulation +- The idempotency strategy (MERGE, DELETE+INSERT by partition, or truncate-reload) MUST be documented in the feature plan for every table + +--- + +## VI. Data Lineage & Documentation + +- Every model MUST have a **source-to-target field mapping** (`data-lineage.md`) completed before implementation +- **Business rule derivations** MUST be documented in the feature spec — not only in code comments +- All mart tables MUST include **column-level descriptions** in the schema metadata (dbt `schema.yml` or equivalent) +- Each feature MUST include a `quickstart.md` with sign-off queries and expected results for each consumer use case + +--- + +## VII. Performance Standards + +- Consumer-facing fact tables MUST be **partitioned by date** unless explicitly justified otherwise +- **Query patterns known at design time** MUST be supported by clustering, sorting, or indexing documented in the plan +- Incremental loads MUST complete within the SLA window defined in the feature spec +- **Unbounded full-table scans** on mart tables (for regular consumer queries) MUST be justified in the plan's Complexity Tracking section + +--- + +## VIII. 
Security & Compliance + +[Define project-specific rules — examples below; replace or remove as appropriate] + +- Columns containing PII MUST be identified in the feature spec and treated as SCD Type 1 (always current, no history) +- PII columns MUST NOT appear in quarantine tables in plaintext — use tokenization or hashing +- Data retention periods MUST be defined per table in the feature spec and enforced by automated purge jobs +- Access to raw and staging layers is restricted to the data engineering team; mart access is role-controlled per consumer group + +--- + +## [Project-Specific Standards] + +[Add additional standards relevant to your organization: compliance frameworks, specific platform rules, approved technology list, etc.] + +--- + +## Governance + +- This constitution **supersedes** all local conventions, individual preferences, and prior practices +- Amendments require: written rationale, review by at least two senior team members, and a migration plan for any existing models affected +- All feature plans MUST include a **Constitution Check table** verifying compliance before implementation begins +- Constitution violations MUST be documented and justified in the plan's Complexity Tracking section — unexplained violations block code review approval + +**Version**: 1.0.0 | **Ratified**: [DATE] | **Last Amended**: [DATE] diff --git a/presets/data-warehouse/templates/plan-template.md b/presets/data-warehouse/templates/plan-template.md new file mode 100644 index 000000000..766936c23 --- /dev/null +++ b/presets/data-warehouse/templates/plan-template.md @@ -0,0 +1,291 @@ +# Data Warehouse Implementation Plan: [FEATURE] + +**Branch**: `[###-feature-name]` | **Date**: [DATE] | **Spec**: [link to spec.md] +**Input**: Feature specification from `/specs/[###-feature-name]/spec.md` + +> **Note**: This template is filled by the `/speckit.plan` command. See `.specify/templates/plan-template.md` for the execution workflow. 
+
+---
+
+## Summary
+
+[Extract from spec: what data is being warehoused, core modeling decision, and ETL approach — 2–4 sentences]
+
+---
+
+## Technical Context
+
+**Warehouse Platform**: [e.g., Snowflake, BigQuery, Redshift, Databricks Delta Lake, dbt + Postgres]
+**Orchestration**: [e.g., Apache Airflow, dbt Cloud, Azure Data Factory, Prefect, no orchestrator]
+**ETL/ELT Framework**: [e.g., dbt Core, PySpark, AWS Glue, custom Python, Dataform]
+**Source Connectivity**: [e.g., Fivetran, Airbyte, custom extractor, JDBC, S3 drop zone]
+**Data Quality Tool**: [e.g., dbt tests, Great Expectations, Soda Core, custom framework]
+**Schema Layer Convention**: [e.g., Raw → Staging → Mart; Medallion Bronze/Silver/Gold]
+**Testing**: [e.g., dbt tests + pytest, Great Expectations suites, Soda scans]
+**Performance Goals**: [e.g., "Incremental load < 30 min on 5M rows/day; P1 queries < 5 sec"]
+**Constraints**: [e.g., "Budget cap $X/month on warehouse compute; no PII in unencrypted columns"]
+**Scale/Scope**: [e.g., "500M-row fact table; 10M new rows/day; 5-year retention"]
+
+---
+
+## Constitution Check
+
+*GATE: Must pass before schema design. Re-check after ETL design.*
+
+| Constitution Principle | Status | Notes |
+|------------------------|--------|-------|
+| Dimensional Modeling Standards | [ ] Pass / [ ] Violation | [e.g., grain defined, conformed dims reused] |
+| Naming Conventions | [ ] Pass / [ ] Violation | [e.g., fct_/dim_/stg_ prefixes applied] |
+| SCD Standards | [ ] Pass / [ ] Violation | [e.g., SCD type documented per dim attribute] |
+| DQ Gates | [ ] Pass / [ ] Violation | [e.g., mandatory checks defined before serving] |
+| Pipeline Idempotency | [ ] Pass / [ ] Violation | [e.g., re-run strategy defined] |
+
+> Violations must be justified in the Complexity Tracking section below.
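The gate above lends itself to automation: once the Status column is parsed out of `plan.md`, the rule "every Violation needs a Complexity Tracking entry" is a one-line check. A hedged sketch (the `gate` helper and its input shape are assumptions for illustration, not part of the template contract):

```python
def gate(check_rows: list[tuple[str, str]], justified: set[str]) -> list[str]:
    """Return principles marked Violation that have no matching
    justification row in the Complexity Tracking section."""
    return [principle for principle, status in check_rows
            if status == "Violation" and principle not in justified]

# Parsed from the Constitution Check table (parsing itself omitted here).
rows = [
    ("Dimensional Modeling Standards", "Pass"),
    ("Naming Conventions", "Pass"),
    ("SCD Standards", "Violation"),
]

# Principles that appear in the Complexity Tracking table.
unjustified = gate(rows, justified={"SCD Standards"})
assert not unjustified, f"Unjustified constitution violations: {unjustified}"
```

Wired into CI, a failed assertion here blocks the plan from advancing to schema design, which is exactly what the gate intends.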
+ +--- + +## Schema Design + +### Modeling Approach + +**Pattern Selected**: [Star Schema / Snowflake Schema / Data Vault / Wide/Flat Table / Hybrid] +**Rationale**: [Why this pattern fits the grain statement and consumer query patterns from the spec] + +### Fact Table Design + +```text +[fct_table_name] + Grain: One row per [unit] per [time period] + Fact Type: Transaction / Periodic Snapshot / Accumulating Snapshot + Load Type: Incremental (watermark: [column]) / Full Refresh / Append-Only + Partition: [date_key / load_date / month_key] + Cluster/Sort:[high-cardinality FK columns used in common filters] + + Surrogate Key: + [fact_id] BIGINT IDENTITY — system-generated, no business meaning + + Foreign Keys (→ dimension surrogate keys): + [customer_key] → dim_customer.[customer_key] + [product_key] → dim_product.[product_key] + [date_key] → dim_date.[date_key] + + Degenerate Dimensions (in fact, no supporting dim table): + [order_number] — source order identifier + + Additive Measures: + [gross_amount] DECIMAL(18,2) + [net_revenue] DECIMAL(18,2) — derived: gross − discount − tax + [quantity] INTEGER + + Semi/Non-additive Measures: + [account_balance] DECIMAL(18,2) — semi-additive (sum across products; not time) + + Audit Columns: + [load_id] VARCHAR — pipeline run identifier + [loaded_at] TIMESTAMP — warehouse load timestamp + [source_system] VARCHAR — originating system name +``` + +### Dimension Table Designs + +```text +[dim_customer] + Business Key: customer_id (from source; never updated) + Surrogate Key: customer_key BIGINT IDENTITY + SCD Strategy: Mixed (see SCD table below) + + Attributes (Type 2 — track history): + customer_segment, billing_address, tier + + Attributes (Type 1 — overwrite, always current): + email, phone_number [PII — always show current value] + + SCD Control Columns (Type 2 only): + valid_from TIMESTAMP NOT NULL + valid_to TIMESTAMP — NULL = currently active row + is_current BOOLEAN NOT NULL DEFAULT TRUE + + Audit Columns: + created_at, 
updated_at, source_system, load_id +``` + +```text +[dim_product] + Business Key: sku (from source) + Surrogate Key: product_key BIGINT IDENTITY + SCD Strategy: Type 1 — overwrite all attributes on change + + Key Attributes: product_name, category, subcategory, brand, list_price + Audit Columns: created_at, updated_at, source_system, load_id +``` + +### SCD Implementation Strategy + +| Dimension | Attribute(s) | SCD Type | Implementation | +|-----------|-------------|----------|----------------| +| dim_customer | customer_segment, address, tier | Type 2 | INSERT new row; set `valid_to` + `is_current=FALSE` on prior | +| dim_customer | email, phone | Type 1 | UPDATE in place on change | +| dim_product | all attributes | Type 1 | MERGE/UPSERT on `sku` | +| dim_geography | region_name | Type 3 | `current_region` + `prior_region` columns; no extra row | + +### Bridge Tables *(if applicable)* + +```text +[bridge_customer_account] + Purpose: Resolve many-to-many between customers and accounts + Weighting: equal / proportional by [allocation_factor column] + Columns: customer_key, account_key, allocation_factor, valid_from, valid_to +``` + +--- + +## ETL/ELT Architecture + +### Layer Definitions + +```text +Layer 1 — RAW / BRONZE + Schema: raw (or bronze) + Contents: Exact replica of source; zero transformations + Load Type: Append-only with load_timestamp + source_file metadata + Retention: [90 days / indefinite] + Access: Data engineering only (not exposed to BI consumers) + +Layer 2 — STAGING / SILVER + Schema: staging (or silver) + Contents: Typed, cleaned, deduplicated, schema-validated data + Load Type: Truncate-reload per batch OR incremental by watermark + Transformations: type casting, NULL handling, deduplication, rejection routing + Access: Data engineering only + +Layer 3 — SERVING / MART / GOLD + Schema: mart (or gold) + Contents: Dimensional model — fact and dimension tables + Load Type: Incremental MERGE OR DELETE+INSERT by partition + Access: BI tools, 
analysts, data science, application APIs +``` + +### Load Strategy per Table + +| Table | Load Type | Watermark Column | Dedup Strategy | Est. Duration | +|-------|-----------|-----------------|----------------|--------------| +| `raw.[source_table]` | Append | `load_timestamp` | None (raw = unmodified) | ~[X] min | +| `staging.[entity]` | Truncate-Reload | N/A | Dedupe on `[business_key]` | ~[X] min | +| `dim_customer` | SCD Merge | `customer_id` | Merge on business key | ~[X] min | +| `fct_sales` | Incremental Merge | `updated_at` | Merge on `[grain_cols]` | ~[X] min | +| `dim_date` | Static | N/A | Pre-populated; skip | N/A | + +### Idempotency Design + +| Layer | Re-run Strategy | +|-------|----------------| +| Raw | Append with `(source_file, record_hash)` dedup; idempotent via dedup key | +| Staging | Truncate before reload; or `DELETE WHERE load_date = :run_date` then INSERT | +| Dimensions | MERGE on business key; SCD2 hash comparison prevents duplicate row inserts | +| Facts | `DELETE WHERE [partition_key] = :run_date` then INSERT; or MERGE on grain | + +--- + +## Data Quality Implementation + +### DQ Check Design + +| Check ID | Layer | Check Type | Implementation | Failure Action | +|----------|-------|-----------|----------------|---------------| +| DQ-001 | Staging | NOT NULL on `[pk_col]` | `dbt not_null` test | Abort pipeline; quarantine rows | +| DQ-002 | Mart | Referential integrity | `dbt relationships` test | Abort pipeline | +| DQ-003 | Mart | Duplicate grain | `dbt unique` on `[grain_cols]` | Abort pipeline | +| DQ-004 | Mart | Row count variance | Custom macro: ±[X]% vs prior run | Abort pipeline | +| DQ-005 | Staging | NULL rate threshold | Custom macro: NULL% on `[col]` | Alert only | +| DQ-006 | Mart | Measure anomaly | Custom macro: 7-day rolling avg | Alert only | + +### Quarantine Schema + +```text +quarantine.[table_name]_rejected + Mirrors all columns of the source table, plus: + rejection_reason VARCHAR — which DQ check failed and why 
+ rejected_at TIMESTAMP — when the record was quarantined + source_run_id VARCHAR — pipeline run that rejected this row + is_reprocessed BOOLEAN DEFAULT FALSE +``` + +### Alerting & Observability + +- **Run audit table**: `[warehouse].[pipeline_runs]` + Tracks: `run_id`, `pipeline_name`, `started_at`, `finished_at`, `rows_ingested`, `rows_rejected`, `dq_checks_passed`, `dq_checks_failed`, `status` +- **Failure alerting**: [e.g., Slack webhook on DQ-001/002/003 abort; PagerDuty on SLA breach] +- **Health dashboard**: [e.g., pipeline health view in Tableau / Grafana / Metabase] + +--- + +## Project Structure + +### Documentation (this feature) + +```text +specs/[###-feature]/ +├── spec.md # Feature specification (source of truth) +├── plan.md # This file +├── data-lineage.md # Source-to-target field mapping (column level) +├── data-contracts/ # Source system contracts +│ └── [source_name]-contract.md +├── quickstart.md # Sign-off queries + expected results for each use case +└── tasks.md # Implementation tasks (created by /speckit.tasks) +``` + +### Source Code Structure + +```text +# Option A: dbt project +models/ +├── staging/ +│ └── [source_name]/ +│ ├── stg_[source]__[entity].sql # One staging model per source table +│ └── schema.yml # dbt tests + column docs +├── intermediate/ # Complex multi-source joins (optional) +│ └── int_[domain]__[transform].sql +└── marts/ + └── [domain]/ + ├── fct_[business_process].sql # Fact table + ├── dim_[entity].sql # Dimension tables + └── schema.yml # dbt tests + consumer-facing docs + +# Option B: Spark / Python ETL +pipelines/ +├── extract/ +│ └── [source_name]_extractor.py +├── transform/ +│ ├── staging/ +│ │ └── [source]_staging.py +│ └── marts/ +│ ├── [fact_table]_transform.py +│ └── [dimension]_transform.py +├── load/ +│ └── warehouse_loader.py +├── quality/ +│ └── dq_checks.py +└── orchestration/ + └── [pipeline_name]_dag.py + +tests/ +├── unit/ +│ └── test_[transform]_business_rules.py +├── integration/ +│ └── 
test_[pipeline]_end_to_end.py +└── data_quality/ + └── [source_name]_dq_suite.py +``` + +**Structure Decision**: [Document which option was selected and why] + +--- + +## Complexity Tracking + +> **Fill ONLY when Constitution Check has violations that require justification** + +| Violation | Why Needed | Simpler Alternative Rejected Because | +|-----------|------------|--------------------------------------| +| [e.g., Snowflake schema vs. star] | [specific query pattern prevents star] | [star schema caused fan-out join issues] | +| [e.g., Data Vault instead of direct mart] | [full audit trail required by compliance] | [direct mart cannot preserve source provenance] | diff --git a/presets/data-warehouse/templates/spec-template.md b/presets/data-warehouse/templates/spec-template.md new file mode 100644 index 000000000..e61c1a9c6 --- /dev/null +++ b/presets/data-warehouse/templates/spec-template.md @@ -0,0 +1,249 @@ +# Data Warehouse Feature Specification: [FEATURE NAME] + +**Feature Branch**: `[###-feature-name]` +**Created**: [DATE] +**Status**: Draft +**Grain**: [One row per ... per ...] 
*(define before filling any other section)* +**Input**: User description: "$ARGUMENTS" + +## Overview + +[2–3 sentences: what data is being warehoused, for whom, and what business questions it answers] + +--- + +## Data Sources + + + +| Source System | Format | Refresh Frequency | Estimated Volume | Owner / Contact | +|---------------|--------|-------------------|------------------|-----------------| +| [e.g., Salesforce CRM] | [REST API / DB snapshot / CSV] | [Daily / Hourly / Real-time] | [~5M rows/day] | [team or person] | +| [e.g., ERP system] | [JDBC / flat files] | [Nightly batch] | [~500K rows/day] | [team or person] | + +### Source Data Quality Baseline + +- **Known Issues**: [e.g., "NULL customer_id on guest-checkout orders, ~2% of rows"] +- **Latency**: [e.g., "Source data available by 03:00 UTC after nightly close"] +- **Historical Availability**: [e.g., "Clean history from 2020-01-01; earlier data is unreliable"] + +--- + +## Consumer Use Cases + + + +### Use Case 1 — [Brief Title] (Priority: P1) + +[Plain-language description of the analytics or reporting need] + +**Consumer**: [e.g., "Finance team in Tableau", "Data science team via Python"] + +**Why this priority**: [Business value and rationale] + +**Independent Test**: [e.g., "Run the reconciliation query in quickstart.md; totals must match source system report"] + +**Typical Query Pattern**: + +```sql +-- Illustrative query this use case drives (not a spec requirement) +SELECT [dimension], SUM([measure]) +FROM [fact_table] +WHERE [filter] +GROUP BY [dimension] +``` + +**Acceptance Scenarios**: + +1. **Given** [source data state], **When** [pipeline runs / query executes], **Then** [expected result] +2. 
**Given** [a data quality issue], **When** [pipeline encounters it], **Then** [quarantine / alert / reject behavior] + +--- + +### Use Case 2 — [Brief Title] (Priority: P2) + +[Description of the analytics need] + +**Consumer**: [Who uses this] + +**Why this priority**: [Rationale] + +**Independent Test**: [How to validate independently] + +**Acceptance Scenarios**: + +1. **Given** [state], **When** [action], **Then** [outcome] + +--- + +[Add more use cases as needed, each with an assigned priority] + +### Edge Cases + +- What happens when source data arrives outside the SLA window? +- How does the pipeline handle duplicate records from the source system? +- What is the behavior when a fact row references a dimension key that doesn't exist? +- How are soft-deletes or hard-deletes in the source handled? +- What happens on full reload vs. incremental refresh? +- How are late-arriving facts processed? + +--- + +## Dimensional Model + + + +### Grain Statement + +> **"This fact table contains one row per [unit of measure] per [time granularity]."** + +*Example: "One row per sales order line item per calendar day."* + +### Fact Table(s) + +| Fact Table | Grain | Fact Type | Additive Measures | +|------------|-------|-----------|-------------------| +| `[fct_sales]` | [order line / day] | [Transaction / Periodic Snapshot / Accumulating Snapshot] | [revenue, quantity, discount_amount] | + +**Semi-additive or Non-additive Measures** *(if any)*: + +- [e.g., "`account_balance` is semi-additive — sum across products OK, NOT across time periods"] + +### Dimension Tables + +| Dimension | Business Key | Conformed? 
| SCD Strategy | +|-----------|-------------|------------|-------------| +| `[dim_customer]` | `[customer_id]` | [Yes / No] | [Type 2 — preserve full history] | +| `[dim_product]` | `[sku]` | [Yes / No] | [Type 1 — overwrite on change] | +| `[dim_date]` | `[date_key]` | [Yes — shared] | [Static — no SCD needed] | +| `[dim_geography]` | `[geo_code]` | [Yes / No] | [Type 3 — current + prior region] | + +### Relationships + +```text +[fct_sales] + → dim_customer (many-to-one) + → dim_product (many-to-one) + → dim_date (many-to-one) + → dim_geography (many-to-one) +``` + +--- + +## Business Rules & Transformations + + + +### Measure Definitions + +- **`[measure_name]`**: [Full business definition and formula, e.g., "`net_revenue = gross_amount − discount_amount − tax_amount`. Excludes shipping fees."] +- **`[kpi_name]`**: [Definition, e.g., "`customer_lifetime_value = SUM(net_revenue)` for customers with `tenure_days ≥ 365`"] + +### Dimension Attribute Rules + +- **`[attribute]`**: [Derivation rule, e.g., "`customer_segment`: annual_spend > $10K → 'Premium'; else → 'Standard'"] +- **`[status_field]`**: [Source-to-warehouse mapping, e.g., "Source codes 'A','B' → 'Active'; 'C' → 'Inactive'"] + +### Filter & Exclusion Rules + +- [e.g., "Exclude orders where `customer_id` starts with `'TEST-'`"] +- [e.g., "Exclude rows where `cancellation_date < order_date` — indicates a source data integrity error"] + +--- + +## Data Quality Requirements + + + +### Mandatory Checks *(Pipeline aborts on failure)* + +- **DQ-001**: No NULL values in `[primary_key_column]` +- **DQ-002**: Referential integrity — every `fact.[dimension_key]` must exist in the corresponding dimension table +- **DQ-003**: No duplicate rows at the defined grain +- **DQ-004**: Row count within ±[X]% of the prior successful load + +### Warning Checks *(Alert, do not fail)* + +- **DQ-005**: NULL rate in `[optional_field]` exceeds [Y]% — alert data engineering team +- **DQ-006**: Measure `[revenue]` deviates more than 
[Z]% from 7-day rolling average — alert analytics team + +### Quarantine Rules + +- Records failing mandatory DQ checks MUST be routed to `quarantine.[table_name]_rejected` with a `rejection_reason` and `rejected_at` timestamp +- Quarantined records MUST be reprocessable without triggering a full reload + +--- + +## Freshness & SLA Requirements + +| Requirement | Target | Hard Limit | Escalation | +|-------------|--------|------------|------------| +| Data available by | [06:00 local business timezone] | [08:00] | [Page on-call data engineer] | +| Incremental load runtime | [< 30 minutes] | [60 minutes] | [Auto-retry + alert] | +| Full reload runtime | [< 4 hours] | [8 hours] | [Notify stakeholders] | + +**Refresh Schedule**: [e.g., "Nightly at 02:00 UTC, triggered after source system close"] + +--- + +## Requirements + + + +### Functional Requirements + +- **FR-001**: Pipeline MUST ingest all source records within the defined SLA window +- **FR-002**: Pipeline MUST apply all business rules in this spec before writing to the serving layer +- **FR-003**: Pipeline MUST enforce all mandatory DQ checks before promoting data to consumers +- **FR-004**: Dimension tables MUST implement the specified SCD strategy per the Dimensional Model section +- **FR-005**: Fact table MUST maintain grain integrity — zero duplicates at the defined grain +- **FR-006**: Pipeline MUST be idempotent — re-running for the same period produces identical results +- **FR-007**: All rejected records MUST be quarantined with `rejection_reason` and `rejected_at` +- **FR-008**: Each pipeline run MUST emit row counts, runtime, and DQ check results to a run log + +### Key Entities + +- **[FactEntity]**: [Business event it records; grain; key relationships to dimensions] +- **[DimensionEntity]**: [Business concept it describes; business key; history strategy] + +--- + +## Success Criteria + +### Measurable Outcomes + +- **SC-001**: All P1 consumer use cases return correct results as verified by 
business sign-off queries in `quickstart.md` +- **SC-002**: Pipeline completes within the defined SLA on at least 95% of scheduled runs over the first 30 days +- **SC-003**: Mandatory DQ check pass rate ≥ 99.5% over a 30-day production period +- **SC-004**: Zero consumer-reported data accuracy incidents in the first 30 days post-launch +- **SC-005**: Full reload completes within the hard limit at production-scale data volume + +--- + +## Assumptions + +- [e.g., "Source schema will not change without advance notice to the data engineering team"] +- [e.g., "Historical backfill is required from [START_DATE]; data quality before that date is not guaranteed"] +- [e.g., "A shared `dim_date` table already exists in the warehouse and will be reused"] +- [e.g., "Source system communicates deletes as soft-delete flags — not physical row removal"] +- [e.g., "Downstream BI tools support the proposed schema without additional semantic layer changes"] diff --git a/presets/data-warehouse/templates/tasks-template.md b/presets/data-warehouse/templates/tasks-template.md new file mode 100644 index 000000000..6fcd939e9 --- /dev/null +++ b/presets/data-warehouse/templates/tasks-template.md @@ -0,0 +1,230 @@ +--- +description: "Data warehouse task list — organized by pipeline layer and consumer use case" +--- + +# DW Tasks: [FEATURE NAME] + +**Input**: Design documents from `/specs/[###-feature-name]/` +**Prerequisites**: `plan.md` (required), `spec.md` (required), `data-lineage.md` (required), `data-contracts/` (if available) + +**Organization**: Tasks flow through pipeline layers (Raw → Staging → Dimensions → Facts), then consumer use cases within the serving layer, then operations. Each phase has a checkpoint. + +## Format: `[ID] [P?] 
[Layer/Story] Description — file path` + +- **[P]**: Safe to run in parallel (no file or table dependencies) +- **[Layer]**: `RAW`, `STG`, `DIM`, `MART`, `DQ`, `OPS` +- **[US#]**: Which consumer use case this task enables +- Include **exact model or file paths** in every task description + + + +--- + +## Phase 1: Environment Setup + +**Purpose**: Infrastructure, schemas, project scaffolding, audit plumbing + +- [ ] T001 Create warehouse schemas: `raw`, `staging`, `mart` (adjust names per plan.md layer convention) +- [ ] T002 [P] Configure source system connection credentials and test connectivity +- [ ] T003 [P] Initialize ETL project structure per plan.md source code layout +- [ ] T004 [P] Create pipeline audit table `[warehouse].pipeline_runs` per plan.md observability design +- [ ] T005 [P] Create quarantine schema and base DDL template per plan.md quarantine schema +- [ ] T006 Configure CI/CD pipeline: lint → test → deploy steps for ETL models + +**Checkpoint**: Infrastructure ready; source connectivity confirmed; audit and quarantine tables exist + +--- + +## Phase 2: Raw / Bronze Layer (Source Extraction) + +**Purpose**: Land source data with zero transformation. Append-only. Exact replica of source. + +**Rule**: No business logic. No type coercion beyond minimal landing schema. Every row gets `load_timestamp` and `source_file` or `source_run_id`. 
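The landing rule above, metadata added and nothing else touched, might be sketched like this (the `land_raw` helper, file name, and columns are hypothetical placeholders for the real extractor):

```python
from datetime import datetime, timezone

def land_raw(source_file: str, rows: list[dict], load_id: str) -> list[dict]:
    """Append-only raw landing: copy source rows untouched, adding only the
    load metadata this layer requires. No typing, no cleaning, no dedup."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{**row,
             "load_timestamp": loaded_at,
             "source_file": source_file,
             "load_id": load_id}
            for row in rows]

batch = land_raw("orders_2024-01-01.csv",
                 [{"order_id": "A1", "amount": "19.99"}],  # values stay raw strings
                 load_id="run-0001")
assert batch[0]["amount"] == "19.99"   # no type coercion at the raw layer
```

Keeping the raw layer this dumb is what makes T011's reconciliation meaningful: if row counts diverge from the source, the extractor is at fault, not a transformation.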
+ +- [ ] T007 [RAW] Create raw landing DDL for `raw.[source_table_1]` — file: `models/staging/[source]/schema.yml` or DDL script +- [ ] T008 [P] [RAW] Create raw landing DDL for `raw.[source_table_2]` +- [ ] T009 [RAW] Implement extractor: `[source_system]` → `raw.[source_table_1]` — file: `pipelines/extract/[source]_extractor.py` +- [ ] T010 [P] [RAW] Write unit test: extractor preserves all source columns without modification — `tests/unit/test_[source]_extractor.py` +- [ ] T011 [RAW] Validate raw row counts match source system record counts for a known batch + +**Checkpoint**: Raw tables populated; row counts reconcile to source; no transformations applied + +--- + +## Phase 3: Staging / Silver Layer (Clean, Type, Validate) + +**Purpose**: Apply structural cleaning only — type casting, deduplication, NULL handling, schema enforcement. No business metric derivation. + +**Rule**: Staging must quarantine records failing structural validation before any business-layer model runs. + +- [ ] T012 [STG] Create staging model `stg_[source]__[entity1].sql` — `models/staging/[source]/stg_[source]__[entity1].sql` +- [ ] T013 [P] [STG] Create staging model `stg_[source]__[entity2].sql` — `models/staging/[source]/stg_[source]__[entity2].sql` +- [ ] T014 [STG] Implement deduplication logic on `[business_key]` in `stg_[source]__[entity1]` +- [ ] T015 [P] [STG] Add dbt schema tests: `not_null`, `unique`, `accepted_values` — `models/staging/[source]/schema.yml` +- [ ] T016 [STG] Implement quarantine routing: records failing structural checks → `quarantine.[entity1]_rejected` +- [ ] T017 [STG] Validate staging row count vs. raw; document expected attrition from deduplication + +**Checkpoint**: Staging layer clean and tested; quarantine routing verified; ready for dimensional modeling + +--- + +## Phase 4: Dimension Tables *(Foundational — blocks all fact loading)* + +**Purpose**: Build dimension tables with stable surrogate keys. 
Facts cannot load until all referenced dimensions exist and are validated. + +**Rule**: Write DQ tests first. Tests MUST fail before dimension implementation begins. + +### Data Quality Tests — Write First, Verify They Fail + +- [ ] T018 [DQ] Write `dbt unique` test on `[dim_entity1].[entity_key]` — `models/marts/schema.yml` +- [ ] T019 [P] [DQ] Write `dbt not_null` test on `[dim_entity1].[business_key]` — `models/marts/schema.yml` +- [ ] T020 [P] [DQ] Write `dbt unique` test on `[dim_entity1].[business_key]` among `is_current=TRUE` rows (SCD2 guard) + +### Dimension: [dim_entity1] + +- [ ] T021 [DIM] Create dimension model `dim_[entity1].sql` — `models/marts/[domain]/dim_[entity1].sql` +- [ ] T022 [DIM] Implement SCD Type [1/2/3] logic for `dim_[entity1]` per plan.md SCD table +- [ ] T023 [DIM] Add audit columns: `valid_from`, `valid_to`, `is_current`, `load_id` (SCD2 only) +- [ ] T024 [DIM] Verify historical rows are preserved correctly with SCD Type 2 test scenario + +### Dimension: [dim_entity2] + +- [ ] T025 [P] [DIM] Create dimension model `dim_[entity2].sql` — `models/marts/[domain]/dim_[entity2].sql` +- [ ] T026 [P] [DIM] Implement SCD logic for `dim_[entity2]` +- [ ] T027 [P] [DIM] Add dbt tests for `dim_[entity2]` + +**Checkpoint**: All dimensions loaded; surrogate keys stable; SCD logic validated; dbt tests pass — fact loading can begin + +--- + +## Phase 5: Fact Table — Consumer Use Case 1 (Priority: P1) 🎯 MVP + +**Goal**: [What business questions P1 consumers can now answer] + +**Independent Test**: Run the P1 sign-off query from `quickstart.md`; totals must match source system reconciliation report. 
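The sign-off reconciliation can be prototyped before the fact model exists. A minimal sketch using an in-memory database (table and column names are hypothetical stand-ins for the real source and mart):

```python
import sqlite3

# Illustrative reconciliation in the spirit of the P1 sign-off query:
# the mart total must equal the source system total for the same period.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_orders (order_id TEXT, amount REAL);
    CREATE TABLE fct_orders    (order_key INTEGER, order_id TEXT, net_amount REAL);
    INSERT INTO source_orders VALUES ('A1', 10.0), ('A2', 25.5);
    INSERT INTO fct_orders    VALUES (1, 'A1', 10.0), (2, 'A2', 25.5);
""")
src_total  = con.execute("SELECT SUM(amount)     FROM source_orders").fetchone()[0]
mart_total = con.execute("SELECT SUM(net_amount) FROM fct_orders").fetchone()[0]
assert abs(src_total - mart_total) < 0.01, "sign-off failed: totals diverge"
print("reconciliation passed")
```

The production version belongs in `quickstart.md` as SQL against the real warehouse; the point here is that the comparison is scripted and binary, not eyeballed.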
+ +### Data Quality Tests — Write First, Verify They Fail + +- [ ] T028 [DQ] [US1] Write `dbt unique` test on grain columns `([col1], [col2], [col3])` — `models/marts/schema.yml` +- [ ] T029 [P] [DQ] [US1] Write `dbt not_null` test on all foreign key columns +- [ ] T030 [P] [DQ] [US1] Write `dbt relationships` test: `fct_[name].[customer_key]` → `dim_customer.[customer_key]` (repeat per FK) +- [ ] T031 [P] [DQ] [US1] Write row count variance check macro (DQ-004: ±[X]% vs prior run) + +### Implementation + +- [ ] T032 [MART] [US1] Create fact model `fct_[business_process].sql` — `models/marts/[domain]/fct_[business_process].sql` +- [ ] T033 [MART] [US1] Implement grain join: fact source rows → dimension surrogate key lookups +- [ ] T034 [MART] [US1] Implement measure derivations per business rules in `spec.md` (e.g., `net_revenue`, `discount_pct`) +- [ ] T035 [MART] [US1] Add partitioning and clustering per plan.md performance section +- [ ] T036 [MART] [US1] Implement idempotent load logic: MERGE or DELETE+INSERT per plan.md idempotency design +- [ ] T037 [MART] [US1] Implement quarantine routing for mandatory DQ check failures (DQ-001/002/003) +- [ ] T038 [MART] [US1] Emit row counts and DQ results to `pipeline_runs` audit table + +**Checkpoint**: Fact table loads; all mandatory DQ checks pass; P1 sign-off query returns correct results independently + +--- + +## Phase 6: Fact Table — Consumer Use Case 2 (Priority: P2) + +**Goal**: [What P2 consumers gain; may extend the existing fact or add a separate model] + +**Independent Test**: Run P2 sign-off query from `quickstart.md`; P1 and P2 results remain correct independently. 
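The row count variance gate (DQ-004, authored as a macro in T031 and reused for any new P2 tables) reduces to a few lines of logic. A sketch, assuming the ±10% default suggested by the constitution:

```python
def row_count_ok(current: int, prior: int, tolerance_pct: float = 10.0) -> bool:
    """DQ-004 sketch: the new load's row count must be within ±tolerance
    of the prior successful load. A prior run of zero rows always fails,
    since no plausibility baseline exists."""
    if prior <= 0:
        return False
    variance_pct = abs(current - prior) / prior * 100
    return variance_pct <= tolerance_pct

assert row_count_ok(1_050_000, 1_000_000)        # +5%: within tolerance
assert not row_count_ok(1_300_000, 1_000_000)    # +30%: abort and investigate
```

In dbt this would typically read `prior` from the `pipeline_runs` audit table; the threshold itself should come from the feature spec, not be hard-coded.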
+ +### Data Quality Tests — Write First + +- [ ] T039 [DQ] [US2] Add DQ tests for any new P2 tables or extensions to `schema.yml` + +### Implementation + +- [ ] T040 [MART] [US2] [Implement P2-specific fact extension, new aggregate, or separate fact table] +- [ ] T041 [P] [MART] [US2] Validate P2 query patterns against `quickstart.md` + +**Checkpoint**: P1 AND P2 use cases independently validated; neither breaks the other + +--- + +[Add more use case phases following the same pattern] + +--- + +## Phase N: Orchestration, Monitoring & Operations + +**Purpose**: Production readiness — scheduling, alerting, documentation, sign-off + +- [ ] TXXX [OPS] Create or update DAG / workflow trigger for scheduled execution at defined refresh window +- [ ] TXXX [OPS] Configure failure alerting: pipeline abort notification → `[Slack channel / PagerDuty]` +- [ ] TXXX [P] [OPS] Configure warning check alerting: DQ-005/006 → `[Slack channel / email]` +- [ ] TXXX [OPS] Verify pipeline run metadata writes correctly to `pipeline_runs` audit table +- [ ] TXXX [P] [OPS] Create pipeline health dashboard in `[BI tool]` +- [ ] TXXX [P] [OPS] Update data catalog / data dictionary with new table and column definitions +- [ ] TXXX [OPS] Finalize `data-lineage.md` with confirmed source-to-target field mappings +- [ ] TXXX [OPS] Complete `quickstart.md` with final sign-off queries and expected row counts +- [ ] TXXX [OPS] Conduct business stakeholder sign-off using `quickstart.md` queries +- [ ] TXXX [OPS] Write runbook: re-run procedure, quarantine investigation, full reload steps + +**Checkpoint**: Pipeline running on schedule; alerts wired; docs complete; stakeholder sign-off obtained + +--- + +## Dependencies & Execution Order + +### Phase Dependencies + +- **Setup (Phase 1)**: No dependencies — start immediately +- **Raw/Bronze (Phase 2)**: Requires Phase 1 complete (schemas and infrastructure exist) +- **Staging/Silver (Phase 3)**: Requires Phase 2 (raw data available to transform) +- 
**Dimensions (Phase 4)**: Requires Phase 3 (staging clean and validated) — **BLOCKS all fact loading** +- **Facts (Phase 5+)**: Requires Phase 4 (dimension surrogate keys stable) +- **Operations (Phase N)**: Requires all fact and dimension phases complete + +### Within Each Phase + +- DQ tests MUST be written and confirmed failing before any implementation begins +- Dimension tables MUST be fully loaded and validated before facts reference them +- Staging must be clean and tested before dimensions can build from it + +### Parallel Opportunities + +- Multiple raw source tables: all `[P]` extraction tasks can run concurrently +- Multiple staging models for different source entities: all `[P]` +- Multiple dimensions with no shared surrogate key dependency: all `[P]` +- Multiple consumer use cases (separate fact tables): can parallelize across team members once dimensions are complete +- DQ test authoring for multiple tables: all `[P]` + +--- + +## Data Quality Validation Checklist + +Before declaring any phase complete: + +- [ ] All mandatory DQ checks PASS (not just run without error) +- [ ] Quarantine table inspected; rejection rate is within acceptable bounds; `rejection_reason` values are understood +- [ ] Row counts reconciled against source system or prior pipeline layer +- [ ] Grain integrity confirmed: zero duplicate rows at the defined grain +- [ ] Sample queries from `quickstart.md` return expected results + +--- + +## Rollback Checklist + +Before first production run: + +- [ ] Rollback procedure documented in runbook: which tables to truncate/restore for a failed load +- [ ] Quarantine table verified: captured records include enough metadata to identify and reprocess the source rows +- [ ] Idempotency verified: pipeline executed twice for the same date → no duplicate rows in fact or dimension tables +- [ ] Full reload tested end-to-end in a non-production environment
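The idempotency item above can be verified with a double-run smoke test. A minimal sketch using an in-memory database and the DELETE+INSERT-by-partition strategy named in the plan (table and column names are hypothetical):

```python
import sqlite3

# Idempotency smoke test: run the same load twice for the same date and
# assert the fact table ends up with no duplicate rows at the grain.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_sales (order_id TEXT, date_key TEXT, amount REAL)")

def load_partition(run_date: str, rows: list[tuple]) -> None:
    """DELETE+INSERT by partition: wipe the run's slice, then reload it."""
    con.execute("DELETE FROM fct_sales WHERE date_key = ?", (run_date,))
    con.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)", rows)
    con.commit()

batch = [("A1", "2024-01-01", 10.0), ("A2", "2024-01-01", 25.5)]
load_partition("2024-01-01", batch)
load_partition("2024-01-01", batch)   # re-run for the same window

dupes = con.execute("""
    SELECT order_id, date_key, COUNT(*) FROM fct_sales
    GROUP BY order_id, date_key HAVING COUNT(*) > 1
""").fetchall()
total = con.execute("SELECT COUNT(*) FROM fct_sales").fetchone()[0]
assert dupes == [] and total == 2, "pipeline is not idempotent"
```

The same shape of test works for MERGE-based loads; only the body of `load_partition` changes, which is why the plan requires the strategy to be documented per table.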