Skip to content

[BUG] Satellite inserting duplicates when granularity of source table and target satellite differ #427

@felix-pagel

Description

@felix-pagel

Describe the bug

We are currently facing the issue that in satellites where we deviate from a direct 1:1 mapping of source table columns to satellite columns, the sat_v0 macro will keep inserting on incremental loads the same records with every load even if there is no change in the hashdiff.

Environment

  • dbt version: core 1.11.2
  • datavault4dbt version: 1.12.2
  • Database/Platform: Snowflake

To Reproduce

Example source table:

product_hk payload_col_sat1 payload_col_sat2 ldts
A Value B Value C 22-02-26 09:00:14:00
A Value B Value F 22-02-26 09:00:14:00
A Value B Value G 22-02-26 09:00:14:00

This is a simplified example input where the granularity of the source table doesn't match the granularity of the target satellite (in this case the problem is in sat1). The HK and HD and ldts for all three records in this example will be the same on every load.

The behavior happens because the deduplication step in the macro has the logic for incremental loads to assign a row number that will later be used in the WHERE NOT EXISTS clause to only compare the latest record per HK of the batch with the target satellite.

The problem is that this row number is not aligned with the ordering of the deduplication function since on similar ldts the ordering of the two window functions will be non-deterministic.

SELECT
{{ parent_hashkey }},
{{ ns.hdiff_alias }},
{{- "\n\n    " ~ datavault4dbt.print_list(datavault4dbt.escape_column_names(source_cols)) if source_cols else " *" }}
{% if is_incremental() -%}
, ROW_NUMBER() OVER(PARTITION BY {{ parent_hashkey }} ORDER BY {{ src_ldts }}) as rn
{%- endif %}
FROM source_data
QUALIFY
    CASE
        WHEN {{ ns.hdiff_alias }} = LAG({{ ns.hdiff_alias }}) OVER(PARTITION BY {{ parent_hashkey|lower }} ORDER BY {{ src_ldts }}) THEN FALSE
        ELSE TRUE
    END

The result is that whenever there are complete duplicates (HK, HD, LDTS all same, as is the case when reading records from a table with a higher granularity than the attributes for the satellite) row number 1 will not match with the QUALIFY=TRUE, and therefore records will be inserted again.

This could be easily fixed by breaking the CTE into two parts, the first one assigning the row number and then the second one deduplicating with the QUALIFY statement but ordering on the row number from the first CTE. This way both steps would be aligned and would not end up inserting the same records again.

Expected behavior

The macro should be able to handle cases when the source table has a higher granularity than the attributes tracked in the satellite and therefore will be presented with duplicates in each load. It shouldn't insert the same values with each when there is no change in the HD.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions