feat(lineage): preserve struct-field path on leaf nodes#2

Closed
treff7es wants to merge 2 commits into aviraj-gour:main from treff7es:feat/lineage-struct-subfield-path

@treff7es

Summary

sqlglot.lineage collapses struct/JSON field access down to the bare column. For

SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl

lineage returns a single leaf tbl.widget — both .metric.a and .metric.b are lost. This affects Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex types — anywhere users access struct fields.

There are two root causes in to_node:

  1. set(find_all_in_scope(select, exp.Column)) dedupes the two value-equal widget columns into one before the leaf loop runs.
  2. The leaf is named from c.sql(), which renders only the column reference and drops the exp.Dot ancestor chain that holds the subfield path.
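The two causes can be reproduced in a toy model that does not use sqlglot at all. This is only a sketch: the `Column` dataclass below is a hypothetical stand-in for `exp.Column`, assuming (as the description states) that two occurrences of the same column compare equal by value and that `sql()` renders only the column reference.

```python
from dataclasses import dataclass


# Hypothetical stand-in for exp.Column: frozen, so instances hash and
# compare by value, which is the property that triggers the dedup.
@dataclass(frozen=True)
class Column:
    table: str
    name: str

    def sql(self) -> str:
        # Renders only the column reference; knows nothing about .metric.a
        return f"{self.table}.{self.name}"


# Two separate occurrences of `widget`: one sits under `.metric.a`,
# the other under `.metric.b` in the original query.
occurrences = [Column("tbl", "widget"), Column("tbl", "widget")]

# Cause 1: the set collapses value-equal columns into a single entry.
unique = set(occurrences)

# Cause 2: naming the leaf from the column alone drops the subfield path.
leaf_names = sorted(c.sql() for c in unique)
print(leaf_names)
```

In this model both accesses end up as one `tbl.widget` leaf, which mirrors the "Before" behaviour described above.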

Change

  • Add _struct_access_root(column) — climbs the left spine of the exp.Dot chain to the maximal access expression (returns the column unchanged when there is no Dot parent).
  • Use it as the dedup key (so t.s.a and t.s.b stay distinct) and as the leaf name (so tbl.widget.metric.a is preserved).
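A minimal sketch of the idea, again on a toy tree rather than on sqlglot's real classes. `Expr`, `Column`, `Dot`, `struct_access_root`, and `leaf_names` are hypothetical names for illustration, and keying the dedup on the rendered path is an assumption of this sketch, not necessarily how the PR implements it.

```python
from typing import Dict, List, Optional


class Expr:
    """Toy expression node with a parent pointer (illustrative only)."""
    parent: Optional["Expr"] = None

    def sql(self) -> str:
        raise NotImplementedError


class Column(Expr):
    def __init__(self, table: str, name: str) -> None:
        self.table = table
        self.name = name

    def sql(self) -> str:
        return f"{self.table}.{self.name}"


class Dot(Expr):
    """Field access: `this.field`, where `this` is the left operand."""

    def __init__(self, this: Expr, field: str) -> None:
        self.this = this
        self.field = field
        this.parent = self

    def sql(self) -> str:
        return f"{self.this.sql()}.{self.field}"


def struct_access_root(column: Expr) -> Expr:
    """Climb the left spine of the Dot chain to the maximal access expression."""
    node = column
    while isinstance(node.parent, Dot) and node.parent.this is node:
        node = node.parent
    return node  # the column itself when it has no Dot parent


def leaf_names(columns: List[Expr]) -> List[str]:
    """Dedupe on the full access path instead of on the bare column."""
    seen: Dict[str, Expr] = {}
    for column in columns:
        root = struct_access_root(column)
        seen.setdefault(root.sql(), root)
    return list(seen)


# widget.metric.a and widget.metric.b, plus a plain column `a`
w1 = Column("tbl", "widget")
Dot(Dot(w1, "metric"), "a")
w2 = Column("tbl", "widget")
Dot(Dot(w2, "metric"), "b")
plain = Column("tbl", "a")

names = leaf_names([w1, w2, plain])
print(names)
```

The loop only follows a parent when the current node is that parent's left operand, so a column that appears elsewhere under a `Dot` would not be mistaken for the start of an access chain.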

This is the "first-class" representation — the path lives in Node.name, with no new dataclass field and no extra callback.

Results

| Query | Before | After |
| --- | --- | --- |
| MIN(widget.metric.a, widget.metric.b) | ['tbl.widget'] | ['tbl.widget.metric.a', 'tbl.widget.metric.b'] |
| t.s.a | ['t.s'] | ['t.s.a'] |
| COALESCE(t.s.a, t.s.b) | ['t.s'] | ['t.s.a', 't.s.b'] |
| t.s.a.b.c | ['t.s'] | ['t.s.a.b.c'] |
| plain column a | ['tbl.a'] | ['tbl.a'] (unchanged) |

  • tests/test_lineage.py: 43/43 pass, no regressions.

Scope / known limitation

Resolves the leaf-level cases (sqlglot#7604). Struct-field access resolved through a CTE / derived table (sqlglot#6258, Example 2) is unaffected — that path recurses with the bare column name and still needs the subfield threaded through to_node recursion. The inline struct(...) construction error (sqlglot#6258, Example 1) is a separate code path and not addressed here.

Refs: tobymao#7604, tobymao#6258

treff7es added 2 commits June 16, 2026 10:21
sqlglot.lineage collapsed struct/JSON field access to the bare column:
`SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl` produced a single
leaf `tbl.widget`, losing both `.metric.a` and `.metric.b`. Two causes:

1. `set(find_all_in_scope(select, exp.Column))` deduped the two value-equal
   `widget` columns into one before the leaf loop ran.
2. The leaf was named from `c.sql()`, which renders only the column and
   drops the `exp.Dot` ancestor chain holding the subfield path.

Introduce `_struct_access_root`, which climbs the left spine of the Dot
chain to the maximal access expression, and use it both as the dedup key
(so distinct subfield accesses stay distinct) and as the leaf name (so the
full path `tbl.widget.metric.a` is preserved). Plain columns are unchanged.

Resolves the leaf-level cases in tobymao#7604. Field access via CTE/derived-table
recursion (tobymao#6258) is unaffected and still requires threading the path
through to_node recursion.

Add focused tests for the struct-field lineage fix: two subfields of the
same column staying distinct with full paths, single and deeply-nested
field access, and a regression guard that a plain column is unchanged.
@treff7es

Superseded by the two-PR split: #3 (leaf-level struct-path preservation) and #4 (struct-field resolution through CTEs / nested constructors / UNNEST, stacked on #3). Review #3 first.

@treff7es treff7es closed this Jun 16, 2026