feat(lineage): resolve struct-field lineage through scope recursion #4
Open
treff7es wants to merge 4 commits into
`sqlglot.lineage` collapsed struct/JSON field access to the bare column: `SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl` produced a single leaf `tbl.widget`, losing both `.metric.a` and `.metric.b`. Two causes:

1. `set(find_all_in_scope(select, exp.Column))` deduped the two value-equal `widget` columns into one before the leaf loop ran.
2. The leaf was named from `c.sql()`, which renders only the column and drops the `exp.Dot` ancestor chain holding the subfield path.

Introduce `_struct_access_root`, which climbs the left spine of the Dot chain to the maximal access expression, and use it both as the dedup key (so distinct subfield accesses stay distinct) and as the leaf name (so the full path `tbl.widget.metric.a` is preserved). Plain columns are unchanged.

Resolves the leaf-level cases in tobymao#7604. Field access via CTE/derived-table recursion (tobymao#6258) is unaffected and still requires threading the path through `to_node` recursion.
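The two causes and the fix can be illustrated with a toy model. This is a minimal sketch, not the PR's code: `Column`, `Dot`, and `struct_access_root` below are hypothetical stand-ins that only assume columns compare equal by value and that each node keeps a link to its parent.

```python
class Column:
    """Toy stand-in for a column node; equality is by rendered SQL (assumption)."""

    def __init__(self, table: str, name: str) -> None:
        self.table, self.name = table, name
        self.parent = None  # set when the column is wrapped in a Dot

    def sql(self) -> str:
        return f"{self.table}.{self.name}"

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Column) and self.sql() == other.sql()

    def __hash__(self) -> int:
        return hash(self.sql())


class Dot:
    """Toy stand-in for a field access `this.field`."""

    def __init__(self, this, field: str) -> None:
        self.this, self.field = this, field
        self.parent = None
        this.parent = self  # link the child back to its parent

    def sql(self) -> str:
        return f"{self.this.sql()}.{self.field}"


def struct_access_root(column):
    """Climb while the current node is the left operand of a Dot parent."""
    node = column
    while isinstance(node.parent, Dot) and node.parent.this is node:
        node = node.parent
    return node


# widget.metric.a and widget.metric.b: two distinct column nodes, equal by value.
col_a, col_b = Column("tbl", "widget"), Column("tbl", "widget")
Dot(Dot(col_a, "metric"), "a")
Dot(Dot(col_b, "metric"), "b")

# Deduping the bare columns in a set collapses both accesses into one leaf.
collapsed = {c.sql() for c in {col_a, col_b}}
# Keying and naming by the access root keeps both full paths.
preserved = sorted(struct_access_root(c).sql() for c in (col_a, col_b))
print(collapsed)  # {'tbl.widget'}
print(preserved)  # ['tbl.widget.metric.a', 'tbl.widget.metric.b']
```

A plain column has no `Dot` parent, so the climb returns the column itself and its leaf name is unchanged.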
Add focused tests for the struct-field lineage fix: two subfields of the same column staying distinct with full paths, single and deeply-nested field access, and a regression guard that a plain column is unchanged.
Use builtin `dict` instead of `t.Dict` (ruff UP006, valid under `from __future__ import annotations`) and rename the source-column loop variable so it no longer shadows the `column` parameter of `to_node`, which mypy resolved to the parameter's `str | int` type.
Builds on the leaf-level struct-path fix to attribute lineage to the exact struct field even when access happens across CTEs, derived tables, nested struct constructors, and UNNEST of array-of-structs.

Previously a struct field accessed through a CTE recursed with only the bare column name, so `sc.a` over `STRUCT(t.a AS a, t.b AS b)` attributed to both `t.a` and `t.b`. UNNEST attributed to the whole array column, dropping the accessed field.

Thread a `subfield` path through `to_node` (and the cache key). At each scope:

- Struct constructor: `_narrow_struct_field` matches the next path segment against the field names and descends, consuming segments through nested structs, so only the selected field's columns are attributed.
- Passthrough column / UNNEST: the residual path rides along and is appended to the resulting leaf (`t.s.a`, `t.arr.a`).
- Unmatched field: degrades to whole-struct attribution, never raises.

Narrowing is purely structural, so it works with or without a schema. Plain columns carry an empty subfield and are unaffected.

Covers the CTE case in tobymao#6258; complements the leaf-level fix for tobymao#7604.
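The commit does not spell out why the subfield joins the cache key. A plausible reading, sketched below with a toy memoized resolver, is that two accesses to different fields of the same column would otherwise share one cache entry. `SCOPE`, `resolve`, and the `key_includes_subfield` flag are hypothetical names for illustration and are not part of sqlglot.

```python
from __future__ import annotations

# Toy scope: output column `sc` is built from STRUCT(t.a AS a, t.b AS b).
SCOPE = {"sc": {"a": "t.a", "b": "t.b"}}


def resolve(
    column: str,
    subfield: tuple[str, ...],
    cache: dict,
    key_includes_subfield: bool,
) -> list[str]:
    """Resolve `column` (optionally narrowed to one field) with memoization."""
    key = (column, subfield) if key_includes_subfield else column
    if key not in cache:
        fields = SCOPE[column]
        if subfield:
            cache[key] = [fields[subfield[0]]]  # narrowed to the accessed field
        else:
            cache[key] = sorted(fields.values())  # whole-struct attribution
    return cache[key]


# Keyed by column only: the second lookup hits the entry computed for sc.a.
naive: dict = {}
first_naive = resolve("sc", ("a",), naive, key_includes_subfield=False)
second_naive = resolve("sc", ("b",), naive, key_includes_subfield=False)

# Keyed by (column, subfield): each field gets its own entry.
keyed: dict = {}
first_keyed = resolve("sc", ("a",), keyed, key_includes_subfield=True)
second_keyed = resolve("sc", ("b",), keyed, key_includes_subfield=True)

print(second_naive)  # ['t.a'], a stale hit for sc.b
print(second_keyed)  # ['t.b']
```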
Summary
Builds on #3 (leaf-level struct-path preservation) to attribute lineage to the exact struct field even when access happens across CTEs, derived tables, nested struct constructors, and `UNNEST` of array-of-structs.

Problem

Previously a struct field accessed through a CTE recursed with only the bare column name, so `sc.a` over `STRUCT(t.a AS a, t.b AS b)` attributed to both `t.a` and `t.b` (STRUCT lineage - More granular support tobymao/sqlglot#6258, Example 2). `UNNEST` attributed to the whole array column, dropping the accessed field.

Change

Thread a `subfield` path through `to_node` (and the memo cache key). At each scope:

- Struct constructor: `_narrow_struct_field` matches the next path segment against the field names and descends, consuming segments through nested structs, so only the selected field's columns are attributed.
- Passthrough column / `UNNEST`: the residual path rides along and is appended to the resulting leaf (`t.s.a`, `t.arr.a`).
- Unmatched field: degrades to whole-struct attribution, never raises.

Narrowing is purely structural (matches constructor field aliases), so it works with or without a schema. Plain columns carry an empty subfield and are unaffected.
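The narrowing rules can be sketched on a toy representation. This is an illustration under stated assumptions, not the PR's `_narrow_struct_field`: a struct constructor is modeled as a dict from field alias to either a source column string or a nested dict, and `all_columns`, `narrow`, and `leaves` are hypothetical helper names.

```python
from __future__ import annotations


def all_columns(expr) -> list[str]:
    """Collect every source column under a (possibly nested) struct."""
    if isinstance(expr, str):
        return [expr]
    cols: list[str] = []
    for value in expr.values():
        cols.extend(all_columns(value))
    return cols


def narrow(expr, path: list[str]):
    """Consume path segments through nested structs.

    Returns the narrowed expression and the residual (unconsumed) path.
    """
    remaining = list(path)
    while remaining and isinstance(expr, dict):
        if remaining[0] not in expr:
            return expr, []  # unmatched field: whole-struct attribution
        expr = expr[remaining.pop(0)]
    return expr, remaining


def leaves(expr, path: list[str]) -> list[str]:
    """Leaf names for accessing `path` on `expr`."""
    narrowed, residual = narrow(expr, path)
    suffix = "".join(f".{segment}" for segment in residual)
    return [col + suffix for col in all_columns(narrowed)]


struct = {"a": "t.a", "b": "t.b"}
nested = {"outer": {"inner": "t.x", "other": "t.y"}}

print(leaves(struct, ["a"]))                # ['t.a']
print(leaves("t.s", ["a"]))                 # ['t.s.a'] (passthrough column)
print(leaves(nested, ["outer", "inner"]))   # ['t.x']
print(leaves(struct, ["missing"]))          # ['t.a', 't.b']
```

Because the match is on constructor field aliases alone, nothing in the sketch consults a schema, which mirrors the "purely structural" claim above.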
Results
| Query | Before | After |
| --- | --- | --- |
| `WITH r AS (SELECT STRUCT(t.a AS a, t.b AS b) AS sc FROM t) SELECT sc.a FROM r` | `['t.a','t.b']` | `['t.a']` |
| `SELECT t.s AS sc … sc.a` | `['t.s']` | `['t.s.a']` |
| `SELECT u.a FROM t, UNNEST(t.arr) AS u` | `['t.arr']` | `['t.arr.a']` |
| `sc.outer.inner` through CTE | | `['t.x']` |
| `sc.missing` | | `['t.a','t.b']` (graceful) |

Scope / known limitation
Resolves the CTE case in tobymao#6258. The inline-struct construction error (tobymao#6258, Example 1: `SELECT struct(a AS x) AS s`, then lineage of `s.x` raising `Cannot find column`) is a separate column-resolution path and is not addressed here. Struct representations other than `exp.Struct` constructors (e.g. dialect-specific JSON/`GET_PATH` access) fall back to whole-struct attribution rather than erroring.

`tests/test_lineage.py`: 52 pass (incl. 6 new for this PR). `ruff` + `mypy` clean.