feat(lineage): resolve struct-field lineage through scope recursion #4

Open
treff7es wants to merge 4 commits into aviraj-gour:main from treff7es:feat/lineage-struct-subfield-recursion

@treff7es

Summary

Builds on #3 (leaf-level struct-path preservation) to attribute lineage to the exact struct field even when access happens across CTEs, derived tables, nested struct constructors, and UNNEST of array-of-structs.

Stacked on #3. Until #3 merges, this PR's diff includes its commits. Review/merge #3 first, then rebase this onto main.

Problem

  • Resolving a struct field accessed through a CTE recursed with only the bare column name, so `sc.a` over `STRUCT(t.a AS a, t.b AS b)` was attributed to both `t.a` and `t.b` (tobymao/sqlglot#6258, "STRUCT lineage - More granular support", Example 2).
  • UNNEST lineage was attributed to the whole array column, dropping the accessed field.

Change

Thread a subfield path through to_node (and the memo cache key). At each scope:

  • Struct constructor: _narrow_struct_field matches the next path segment against the field names and descends, consuming segments through nested structs — only the selected field's columns are attributed.
  • Passthrough column / UNNEST: the residual path rides along and is appended to the resulting leaf (t.s.a, t.arr.a).
  • Unmatched field: degrades to whole-struct attribution, never raises.
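
The three cases above can be sketched with a toy model. This is not the PR's code: `Col`, `Struct`, `narrow`, and `leaves` are hypothetical stand-ins (plain Python, no sqlglot types), and the choice to fall back to the deepest struct that did match is an assumption made for the illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Col:
    """Toy source column such as "t.a" (hypothetical, not sqlglot's exp.Column)."""
    name: str


@dataclass
class Struct:
    """Toy struct constructor: field alias -> column or nested struct."""
    fields: dict[str, Col | Struct]


def narrow(expr: Col | Struct, path: tuple[str, ...]) -> tuple[Col | Struct, tuple[str, ...]]:
    """Consume leading path segments that match constructor field aliases."""
    consumed = 0
    while isinstance(expr, Struct) and consumed < len(path):
        child = expr.fields.get(path[consumed])
        if child is None:
            break  # unmatched field: stop narrowing instead of raising
        expr = child
        consumed += 1
    return expr, path[consumed:]


def leaves(expr: Col | Struct, path: tuple[str, ...] = ()) -> list[str]:
    """Return the source leaves for accessing `path` on `expr`."""
    expr, residual = narrow(expr, path)
    if isinstance(expr, Struct):
        # No path left, or an unmatched segment: attribute the whole struct.
        out: list[str] = []
        for child in expr.fields.values():
            out.extend(leaves(child))
        return out
    # Passthrough column: the residual path rides along onto the leaf name.
    return [".".join((expr.name, *residual))]


sc = Struct({"a": Col("t.a"), "b": Col("t.b")})
print(leaves(sc, ("a",)))         # ['t.a']
print(leaves(sc, ("missing",)))   # ['t.a', 't.b']
print(leaves(Col("t.s"), ("a",))) # ['t.s.a']
```

Because the matching uses only the aliases written in the constructor, nothing in the sketch needs a schema.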

Narrowing is purely structural (matches constructor field aliases), so it works with or without a schema. Plain columns carry an empty subfield and are unaffected.
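
Why the subfield path also has to be part of the memo cache key can be shown with a small, self-contained sketch. Everything here is hypothetical (`STRUCTS`, `make_resolver`, and the key shape are illustrative only; the real key in `to_node` presumably carries more than the column and path): a cache keyed on the column alone hands the result computed for `sc.a` back to a later lookup of `sc.b`.

```python
from __future__ import annotations

# Toy catalogue: struct column -> field alias -> source leaves (hypothetical data).
STRUCTS = {"sc": {"a": ["t.a"], "b": ["t.b"]}}


def make_resolver(key_includes_subfield: bool):
    """Build a memoised resolver; the flag controls what the cache key contains."""
    cache: dict[object, list[str]] = {}

    def resolve(column: str, subfield: tuple[str, ...] = ()) -> list[str]:
        key = (column, subfield) if key_includes_subfield else column
        if key not in cache:
            fields = STRUCTS[column]
            if subfield and subfield[0] in fields:
                cache[key] = fields[subfield[0]]  # narrowed to one field
            else:
                # Empty or unmatched subfield: whole-struct attribution.
                cache[key] = [leaf for group in fields.values() for leaf in group]
        return cache[key]

    return resolve


good = make_resolver(key_includes_subfield=True)
print(good("sc", ("a",)), good("sc", ("b",)))  # ['t.a'] ['t.b']

stale = make_resolver(key_includes_subfield=False)
print(stale("sc", ("a",)), stale("sc", ("b",)))  # ['t.a'] ['t.a']  (second result is stale)
```

A plain column uses the empty tuple as its subfield, so its key changes shape but not identity, which is consistent with plain columns being unaffected.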

Results

| Query | Before | After |
| --- | --- | --- |
| `WITH r AS (SELECT STRUCT(t.a AS a, t.b AS b) AS sc FROM t) SELECT sc.a FROM r` | `['t.a','t.b']` | `['t.a']` |
| struct passthrough `SELECT t.s AS sc … sc.a` | `['t.s']` | `['t.s.a']` |
| `SELECT u.a FROM t, UNNEST(t.arr) AS u` | `['t.arr']` | `['t.arr.a']` |
| nested `sc.outer.inner` through CTE | both leaves | `['t.x']` |
| unmatched `sc.missing` | | `['t.a','t.b']` (graceful) |

Scope / known limitation

Resolves the CTE case in tobymao#6258. The inline-struct construction error (tobymao#6258, Example 1: `SELECT struct(a AS x) AS s`, where lineage of `s.x` raises `Cannot find column`) goes through a separate column-resolution path and is not addressed here. Struct representations other than `exp.Struct` constructors (e.g. dialect-specific JSON/GET_PATH access) fall back to whole-struct attribution rather than erroring.

tests/test_lineage.py: 52 pass (incl. 6 new for this PR). ruff + mypy clean.

treff7es added 4 commits June 16, 2026 10:21
sqlglot.lineage collapsed struct/JSON field access to the bare column:
`SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl` produced a single
leaf `tbl.widget`, losing both `.metric.a` and `.metric.b`. Two causes:

1. `set(find_all_in_scope(select, exp.Column))` deduped the two value-equal
   `widget` columns into one before the leaf loop ran.
2. The leaf was named from `c.sql()`, which renders only the column and
   drops the `exp.Dot` ancestor chain holding the subfield path.

Introduce `_struct_access_root`, which climbs the left spine of the Dot
chain to the maximal access expression, and use it both as the dedup key
(so distinct subfield accesses stay distinct) and as the leaf name (so the
full path `tbl.widget.metric.a` is preserved). Plain columns are unchanged.
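
A rough illustration of the two causes and the fix, using toy node classes rather than sqlglot's own (`Column`, `Dot`, and `struct_access_root` below are hypothetical stand-ins, and using the rendered path string as the dedup key is a simplification chosen for the sketch):

```python
from __future__ import annotations


class Column:
    """Toy column node; value-equal columns hash alike, which is what lets a set merge them."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Dot | None = None

    def sql(self) -> str:
        return self.name

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Column) and self.name == other.name

    def __hash__(self) -> int:
        return hash(("Column", self.name))


class Dot:
    """Toy field access `this.field_name`; links the left operand back to itself."""

    def __init__(self, this: Column | Dot, field_name: str) -> None:
        self.this = this
        self.field_name = field_name
        self.parent: Dot | None = None
        this.parent = self

    def sql(self) -> str:
        return f"{self.this.sql()}.{self.field_name}"


def struct_access_root(column: Column) -> Column | Dot:
    """Climb the left spine of the Dot chain to the outermost access expression."""
    node: Column | Dot = column
    while isinstance(node.parent, Dot) and node.parent.this is node:
        node = node.parent
    return node


# Two accesses on the same column: tbl.widget.metric.a and tbl.widget.metric.b.
a = Column("tbl.widget")
Dot(Dot(a, "metric"), "a")
b = Column("tbl.widget")
Dot(Dot(b, "metric"), "b")

print(len({a, b}))  # 1 (cause 1: the set merges the two columns)
print(sorted(struct_access_root(c).sql() for c in (a, b)))
# ['tbl.widget.metric.a', 'tbl.widget.metric.b']
```

A column with no `Dot` parent is its own root, so plain columns keep their existing name.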

Resolves the leaf-level cases in tobymao#7604. Field access via CTE/derived-table
recursion (tobymao#6258) is unaffected and still requires threading the path
through to_node recursion.

Add focused tests for the struct-field lineage fix: two subfields of the
same column staying distinct with full paths, single and deeply-nested
field access, and a regression guard that a plain column is unchanged.

Use builtin `dict` instead of `t.Dict` (ruff UP006, valid under
`from __future__ import annotations`) and rename the source-column loop
variable so it no longer shadows the `column` parameter of `to_node`,
which mypy resolved to the parameter's `str | int` type.

Builds on the leaf-level struct-path fix to attribute lineage to the
exact struct field even when access happens across CTEs, derived tables,
nested struct constructors, and UNNEST of array-of-structs.

Previously a struct field accessed through a CTE recursed with only the
bare column name, so `sc.a` over `STRUCT(t.a AS a, t.b AS b)` attributed
to both `t.a` and `t.b`. UNNEST attributed to the whole array column,
dropping the accessed field.

Thread a `subfield` path through `to_node` (and the cache key). At each
scope:
- Struct constructor: `_narrow_struct_field` matches the next path
  segment against the field names and descends, consuming segments
  through nested structs, so only the selected field's columns are
  attributed.
- Passthrough column / UNNEST: the residual path rides along and is
  appended to the resulting leaf (`t.s.a`, `t.arr.a`).
- Unmatched field: degrades to whole-struct attribution, never raises.

Narrowing is purely structural, so it works with or without a schema.
Plain columns carry an empty subfield and are unaffected.

Covers the CTE case in tobymao#6258; complements the leaf-level fix for tobymao#7604.