Describe the bug
We are currently facing the issue that in satellites where we deviate from a direct 1:1 mapping of source table columns to satellite columns, the sat_v0 macro keeps inserting the same records on every incremental load, even when there is no change in the hashdiff.
Environment
- dbt version: core 1.11.2
- datavault4dbt version: 1.12.2
- Database/Platform: Snowflake
To Reproduce
Example source table:

| product_hk | payload_col_sat1 | payload_col_sat2 | ldts |
|---|---|---|---|
| A | Value B | Value C | 22-02-26 09:00:14:00 |
| A | Value B | Value F | 22-02-26 09:00:14:00 |
| A | Value B | Value G | 22-02-26 09:00:14:00 |
This is a simplified example input where the granularity of the source table doesn't match the granularity of the target satellite (in this case the problem is in sat1). The HK, HD, and ldts for all three records in this example will be identical on every load.
The behavior happens because, on incremental loads, the deduplication step in the macro assigns a row number that is later used in the WHERE NOT EXISTS clause to compare only the latest record per HK of the batch against the target satellite.
The problem is that this row number is not aligned with the ordering of the deduplication function: when ldts values are tied, the ordering of the two window functions is non-deterministic.
```sql
SELECT
    {{ parent_hashkey }},
    {{ ns.hdiff_alias }},
    {{- "\n\n " ~ datavault4dbt.print_list(datavault4dbt.escape_column_names(source_cols)) if source_cols else " *" }}
    {% if is_incremental() -%}
    , ROW_NUMBER() OVER(PARTITION BY {{ parent_hashkey }} ORDER BY {{ src_ldts }}) as rn
    {%- endif %}
FROM source_data
QUALIFY
    CASE
        WHEN {{ ns.hdiff_alias }} = LAG({{ ns.hdiff_alias }}) OVER(PARTITION BY {{ parent_hashkey|lower }} ORDER BY {{ src_ldts }}) THEN FALSE
        ELSE TRUE
    END
```
The result is that whenever there are complete duplicates (HK, HD, and LDTS all the same, as is the case when reading records from a table with a higher granularity than the attributes of the satellite), the row with rn = 1 is not necessarily the row for which the QUALIFY condition evaluates to TRUE, and therefore the records are inserted again.
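To illustrate the misalignment, here is a minimal Python sketch (with hypothetical data, standing in for the database's window functions) in which the engine happens to pick different physical orders for the two windows over the same tied ldts values:

```python
# Three complete duplicates, as produced by a source table with higher
# granularity than the satellite's payload (hypothetical values).
rows = [
    {"hk": "A", "hdiff": "H1", "ldts": "2022-02-26 09:00:14"},
    {"hk": "A", "hdiff": "H1", "ldts": "2022-02-26 09:00:14"},
    {"hk": "A", "hdiff": "H1", "ldts": "2022-02-26 09:00:14"},
]

# ROW_NUMBER() OVER (... ORDER BY ldts): all ldts are tied, so the
# engine may use any order. Here it happens to use the original order.
for rn, row in enumerate(rows, start=1):
    row["rn"] = rn

# LAG(hdiff) OVER (... ORDER BY ldts) in the QUALIFY: same tied keys,
# but the engine may just as well scan the rows in reverse order.
qualify_order = list(reversed(rows))
survivors = [
    row for i, row in enumerate(qualify_order)
    if i == 0 or row["hdiff"] != qualify_order[i - 1]["hdiff"]
]

# Exactly one row survives the dedup, but it carries rn = 3, not rn = 1,
# so a later step that compares only rn = 1 against the target misses it.
print(len(survivors), survivors[0]["rn"])  # -> 1 3
```

The two orderings are both legal answers to `ORDER BY ldts` when all ldts are equal, which is exactly why the surviving record and the rn = 1 record can diverge.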
This could be fixed by splitting the CTE into two parts: the first assigns the row number, and the second deduplicates with the QUALIFY statement but orders on the row number from the first CTE. Since the row number is unique within each partition, both steps would be aligned and the same records would no longer be inserted again.
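A sketch of what that could look like, with the Jinja variables replaced by placeholder column names (product_hk, hashdiff, ldts) for readability; the actual macro would of course keep its templated columns and the is_incremental() guard:

```sql
WITH numbered AS (

    SELECT
        product_hk,
        hashdiff,
        payload_col_sat1,
        ldts,
        ROW_NUMBER() OVER (PARTITION BY product_hk ORDER BY ldts) AS rn
    FROM source_data

),

deduplicated AS (

    SELECT *
    FROM numbered
    QUALIFY
        CASE
            WHEN hashdiff = LAG(hashdiff) OVER (PARTITION BY product_hk ORDER BY rn) THEN FALSE
            ELSE TRUE
        END

)

SELECT * FROM deduplicated
```

Because rn is unique within each partition, the ordering inside the QUALIFY window is deterministic and aligned with the row numbering, so complete duplicates always collapse onto the rn = 1 record.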
Expected behavior
The macro should be able to handle cases when the source table has a higher granularity than the attributes tracked in the satellite and is therefore presented with duplicates in each load. It shouldn't insert the same values with each load when there is no change in the HD.