feat(lineage): preserve struct-field path on leaf nodes #3

Open
treff7es wants to merge 3 commits into aviraj-gour:main from treff7es:feat/lineage-struct-subfield-leaf

@treff7es

Summary

sqlglot.lineage collapses struct/JSON field access down to the bare column. For

SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl

lineage returns a single leaf tbl.widget — both .metric.a and .metric.b are lost. Affects Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex types.

Two root causes in to_node:

  1. set(find_all_in_scope(select, exp.Column)) dedupes the two value-equal widget columns into one before the leaf loop runs.
  2. The leaf is named from c.sql(), which renders only the column and drops the exp.Dot ancestor chain holding the subfield path.
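
The two causes can be reproduced without sqlglot. The sketch below uses toy `Column` and `Dot` classes that are hypothetical stand-ins, not sqlglot's real `exp.Column` and `exp.Dot`; the only assumption carried over is that a column's equality and hash ignore its parent.

```python
class Column:
    """Toy column node. Equality looks only at table and name, not the parent."""

    def __init__(self, table, name):
        self.table = table
        self.name = name
        self.parent = None  # filled in when a Dot wraps this column

    def __eq__(self, other):
        return isinstance(other, Column) and (
            (self.table, self.name) == (other.table, other.name)
        )

    def __hash__(self):
        return hash((self.table, self.name))

    def sql(self):
        return f"{self.table}.{self.name}"


class Dot:
    """Toy field-access node, read as `this.field`."""

    def __init__(self, this, field):
        self.this = this
        self.field = field
        self.parent = None
        this.parent = self  # link the child back to its wrapper

    def sql(self):
        return f"{self.this.sql()}.{self.field}"


# widget.metric.a and widget.metric.b each own a separate Column node.
col_a = Column("tbl", "widget")
col_b = Column("tbl", "widget")
access_a = Dot(Dot(col_a, "metric"), "a")
access_b = Dot(Dot(col_b, "metric"), "b")

# Cause 1: the two columns compare equal, so a set keeps only one of them.
deduped = set([col_a, col_b])

# Cause 2: rendering the column alone never looks at its Dot ancestors.
leaf_names = [c.sql() for c in deduped]
print(leaf_names)  # ['tbl.widget']
```

Both subfield paths are still reachable from each column through `parent`, which is what the fix relies on.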

Change

  • Add _struct_access_root(column) — climbs the left spine of the exp.Dot chain to the maximal access expression (returns the column unchanged when there is no Dot parent).
  • Use it as the dedup key (so t.s.a and t.s.b stay distinct) and as the leaf name (so tbl.widget.metric.a is preserved).

This is the "first-class" representation — the path lives in Node.name, no new dataclass field, no extra callback.
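
A minimal sketch of the climb-and-dedup idea, again on toy classes rather than sqlglot's own nodes. `struct_access_root` here mirrors the description of `_struct_access_root` above but is not the PR's code, and keying the dedup on the rendered path (in the hypothetical helper `leaf_names`) is an assumption made for illustration.

```python
class Column:
    """Toy column node (stand-in, not sqlglot's exp.Column)."""

    def __init__(self, table, name):
        self.table = table
        self.name = name
        self.parent = None

    def sql(self):
        return f"{self.table}.{self.name}"


class Dot:
    """Toy field-access node, read as `this.field` (stand-in for exp.Dot)."""

    def __init__(self, this, field):
        self.this = this
        self.field = field
        self.parent = None
        this.parent = self

    def sql(self):
        return f"{self.this.sql()}.{self.field}"


def struct_access_root(column):
    """Climb while the node is the left operand of a Dot parent."""
    node = column
    while isinstance(node.parent, Dot) and node.parent.this is node:
        node = node.parent
    return node  # the column itself when there is no Dot parent


def leaf_names(columns):
    """Dedup on the full access path instead of on the bare column."""
    roots = {}
    for column in columns:
        root = struct_access_root(column)
        roots.setdefault(root.sql(), root)
    return list(roots)  # dicts preserve insertion order


col_a = Column("tbl", "widget")
col_b = Column("tbl", "widget")
Dot(Dot(col_a, "metric"), "a")  # tbl.widget.metric.a
Dot(Dot(col_b, "metric"), "b")  # tbl.widget.metric.b
plain = Column("tbl", "a")      # no struct access

print(leaf_names([col_a, col_b, plain]))
# ['tbl.widget.metric.a', 'tbl.widget.metric.b', 'tbl.a']
```

Two accesses to the same subfield still collapse to one leaf, because they render to the same path.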

Behavior change (scoped)

Node.name changes only for leaves that access a struct field: tbl.widget → tbl.widget.metric.a. Plain columns are unchanged (verified by a regression test). Consumers that read Node.name for struct queries will see the fuller path.

Results

| Query | Before | After |
| --- | --- | --- |
| MIN(widget.metric.a, widget.metric.b) | ['tbl.widget'] | ['tbl.widget.metric.a', 'tbl.widget.metric.b'] |
| t.s.a | ['t.s'] | ['t.s.a'] |
| COALESCE(t.s.a, t.s.b) | ['t.s'] | ['t.s.a', 't.s.b'] |
| t.s.a.b.c | ['t.s'] | ['t.s.a.b.c'] |
| plain column a | ['tbl.a'] | ['tbl.a'] (unchanged) |

tests/test_lineage.py: all pass, no regressions. ruff + mypy clean.

Refs: tobymao#7604. A follow-up PR builds on this to resolve struct fields through CTEs, nested constructors, and UNNEST (tobymao#6258).

treff7es added 3 commits June 16, 2026 10:21
sqlglot.lineage collapsed struct/JSON field access to the bare column:
`SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl` produced a single
leaf `tbl.widget`, losing both `.metric.a` and `.metric.b`. Two causes:

1. `set(find_all_in_scope(select, exp.Column))` deduped the two value-equal
   `widget` columns into one before the leaf loop ran.
2. The leaf was named from `c.sql()`, which renders only the column and
   drops the `exp.Dot` ancestor chain holding the subfield path.

Introduce `_struct_access_root`, which climbs the left spine of the Dot
chain to the maximal access expression, and use it both as the dedup key
(so distinct subfield accesses stay distinct) and as the leaf name (so the
full path `tbl.widget.metric.a` is preserved). Plain columns are unchanged.

Resolves the leaf-level cases in tobymao#7604. Field access via CTE/derived-table
recursion (tobymao#6258) is unaffected and still requires threading the path
through to_node recursion.

Add focused tests for the struct-field lineage fix: two subfields of the
same column staying distinct with full paths, single and deeply-nested
field access, and a regression guard that a plain column is unchanged.

Use builtin `dict` instead of `t.Dict` (ruff UP006, valid under
`from __future__ import annotations`) and rename the source-column loop
variable so it no longer shadows the `column` parameter of `to_node`,
which mypy resolved to the parameter's `str | int` type.