Skip to content

Why does ColumnLineageVisitor use mutable global state instead of a typed return value? #2

@nickBes

Description

@nickBes

Background

ColumnLineageVisitor extends SqlBaseVisitor<void> and accumulates results by mutating several
private instance fields throughout the traversal:

// All visit methods return void — results are side-effects on these
private readonly queryScopeStack: QueryScope[] = [];
private readonly cteScopeStack:   CteScope[]   = [];
private readonly resultEntries    = new Map<NormalizedIdBrand, ResultEntry>();
private readonly unresolvedColumns = new HashSet<...>(...);

The final answer is extracted imperatively after visiting the tree:

const visitor = new ColumnLineageVisitor(metadata);
visitor.visit(tree);
return visitor.getResult(); // pulls from accumulated mutable state

ANTLR visitors are generic: SqlBaseVisitor<T> allows every visit* method to return a value of
type T, letting the tree calculate its result bottom-up through return values instead of
side-effects — the same pattern described in
Listeners and Visitors (tomassetti.me):

"The visitor pattern allows you to return values from each visitor function … A typical example
would be to use it to calculate and pass around the results of sub-expressions for creating a
calculator."

A result-typed visitor would look roughly like:

class ColumnLineageVisitor extends SqlBaseVisitor<PartialResult> {
  visitQuerySpecification(ctx): PartialResult {
    const fromResult = this.visit(ctx.from());
    const selectResult = ctx.selectItem().map(i => this.visit(i));
    return merge(fromResult, ...selectResult);
  }
  // ...
}

The question

Was the mutable-state approach a deliberate design decision? If so, what drove it? If not, is there
appetite to refactor toward a result-type approach?


Pros of the current approach (mutable global state / void visitor)

# Pro Notes
1 Less boilerplate for aggregateResult SqlBaseVisitor<T> requires overriding defaultResult() and aggregateResult() to combine child results. For a complex type like PartialResult neither has a sensible default, so you'd be fighting the base class.
2 Explicit control over traversal order The implementation already deliberately skips visitChildren in many places and orchestrates traversal manually (e.g. customVisitSpecInternals, processTermAndOrderBy). The void return means there's no pressure to conform to ANTLR's built-in aggregation pipeline.
3 Easy correlated-subquery support The scope stacks (queryScopeStack, cteScopeStack) need to survive across many unrelated visit* calls so that inner scopes can see outer aliases. A result-typed approach would need to smuggle this context through every return value or keep it as state anyway.
4 Familiar imperative style Most ANTLR tutorials and generated base classes default to this style, so contributors familiar with ANTLR will recognise it immediately without needing to understand a custom result-composition algebra.
5 Simpler error bookkeeping unresolvedColumns is a flat append-only set. Merging partial unresolved sets across the return-value chain requires every caller to thread that accumulator through, adding noise to the type.

Cons of the current approach (mutable global state / void visitor)

# Con Notes
1 Non-reentrant / not thread-safe A single ColumnLineageVisitor instance cannot safely analyse two queries concurrently—or even sequentially without a reset—because all state is shared.
2 Hidden data-flow Reading a visit* method in isolation gives no indication of what it produces or consumes. The side-effects on resultEntries/unresolvedColumns must be traced manually, making the code harder to reason about.
3 Difficult to unit-test sub-visitors To test how, say, a JOIN … USING clause is resolved, a test must construct a full visitor with metadata and examine global state after the call—not just inspect a returned value.
4 Scope-stack push/pop bugs are silent If a code path forgets to pop (or pops too many times), nothing throws immediately; the bug only surfaces as wrong lineage output, often far from the faulty call. A return-value design inherently ties scope lifetime to the call stack.
5 Violates the stated intent of the visitor interface SqlBaseVisitor<T> is generic precisely to support result-propagation. Using void means the type system cannot enforce that every relevant node contributes to the output.
6 Extra getResult() ceremony Callers must remember to call getResult() after visit(). A result-typed visitor would return the answer directly from the top-level visitQuery() call.
7 Scope stacks duplicate what the call stack already models In a result-typed design, pushing/popping scope corresponds naturally to entering/returning from a recursive visit* call — no parallel stack data structure needed.

Possible middle-ground

Rather than a fully pure functional visitor, the scope management could stay as instance state
(since it genuinely needs to be visible across call boundaries for correlated subqueries) while the
result accumulation (resultEntries, unresolvedColumns) is converted to return values:

// visit* methods return a PartialResult; scopes remain as instance state
interface PartialResult {
  resolved:   Map<NormalizedIdBrand, ResultEntry>;
  unresolved: Set<UnresolvedColumnReference>;
}

This would eliminate the silent push/pop risk and restore testability of individual node handlers
without requiring a fundamental redesign of the correlated-subquery logic.


Related

  • Listeners and Visitors — tomassetti.me
    (especially the "You Control the Visitor" and "When to Use Neither" sections)
  • packages/core/src/column-lineage.ts — current implementation
  • SqlBaseVisitor<T> in packages/core/src/generated/official/SqlBaseVisitor.ts

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions