[Refactor] Query Rewriter V2#106
Conversation
There was a problem hiding this comment.
Pull request overview
Migrates query rewriting from the legacy JSON-rule pipeline to an AST-based “Query Rewriter V2” that matches and rewrites SQL via RuleParserV2, including support for compound queries (e.g., UNION) and partial logical matches.
Changes:
- Added
core/query_rewriter_v2.pyimplementing BFS AST matching, variable binding, partial-match carry-through, and rewrite application. - Added
data.rules.get_rule_v2()to load v2 rules (pattern/rewrite AST + mapping) while reusing v1 action parsing. - Updated parsing/formatting to better round-trip compound queries and logical operators; added comprehensive v2 rewrite tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_query_rewriter_v2.py |
New end-to-end tests for v2 match/replace/rewrite behavior across many rules. |
data/rules.py |
Adds v2 rule loader (get_rule_v2) returning ASTs/mapping/actions for QueryRewriterV2. |
core/rule_parser_v2.py |
Extends v2 rule parsing to support CompoundQueryNode and placeholder substitution in compound queries. |
core/query_rewriter_v2.py |
Introduces the new AST-based query rewriter, including partial matching and action support. |
core/query_parser.py |
Parses qualified * (e.g. t.*) into ColumnNode('*', parent_alias='t'). |
core/query_formatter.py |
Flattens AND/OR operator trees to mo_sql_parsing list form; minor formatting/documentation tweaks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
| Function | What it does | When it's used |
|---|---|---|
_flatten_logical(node, op_name) |
Recursively unwraps a binary AND/OR tree into a flat list[Node]. e.g. AND(AND(a,b),c) → [a, b, c] |
Match: before comparing AND/OR clauses unordered. Replace: before rebuilding partial-match results. |
_unflatten_logical(nodes, op_name) |
Inverse of the above: folds a flat list back into a left-associative binary tree. [a,b,c] → AND(AND(a,b),c) |
Replace: after substitution, to reconstruct the AND/OR operator in the output AST. |
_get_clause(query, clause_type) |
Returns the first child of a QueryNode whose .type matches the requested clause (SELECT, FROM, WHERE, …). Returns None if absent. |
Match: in _match_query_node. Replace: in replace() and _merge_extra_clauses. |
_CLAUSE_TYPES |
A constant ordered list of all NodeType clause types. |
Match + Replace: iterating all clauses in a structured way. |
Step 1 — Match
Goal: decide whether a rule's pattern_ast appears somewhere inside query_ast, and if so, record the variable bindings in memo.
Memo helpers (used only during matching)
| Function | What it does | When it's used |
|---|---|---|
_memo_snapshot(memo) |
Context manager. Snapshots the memo dict on entry; if commit() is never called before the with block exits, it rolls back all changes. |
Every place matching does backtracking — _match_logical_list_unordered, _partial_match_in_logical, _match_children_list. Replaces the old manual "snapshot keys, then delete on failure" pattern. |
_bind(var_name, value, memo) |
Stores var_name → value in memo. If the name is already bound, checks consistency (returns False on conflict). Handles the special case where a TableNode and a plain string refer to the same table via alias. |
Called at every point a variable is successfully matched against a query subtree. |
_is_var_name(s, mapping) |
Returns True if string s appears as a key in the rule's mapping dict (i.e., it is an external variable name like "x" or "y"). |
Used inside _match_node for TableNode, ColumnNode, and LiteralNode — these nodes can have variable names as their .name / .alias / .parent_alias fields. |
Core matching functions
| Function | What it does | When it's used |
|---|---|---|
_match_node(q, p, memo, mode, mapping) |
Main recursive matching dispatcher. Checks the runtime type of p and dispatches to the right sub-case: variable nodes → _bind, LiteralNode → value compare, ColumnNode/TableNode → field-by-field compare with variable awareness, OperatorNode → _match_and_or or _partial_match_in_logical, etc. |
Called everywhere a single node-to-node comparison is needed. The "spine" of the match step. |
_match_and_or(q, p, op_name, memo, mode, mapping) |
Both q and p are the same AND/OR operator. Flattens both into lists, calls _match_logical_list_unordered. If the query has leftover clauses not in the pattern and mode is ALLOW_PARTIAL, stores them in memo["_partial_remaining"]. |
Called from _match_node when both sides are AND/AND or OR/OR. |
_partial_match_in_logical(q, p, op_name, memo, mode, mapping) |
Query q is AND/OR but pattern p is a different kind of node. Flattens q's clauses and tries _match_node(clause, p) for each one. On success, stores the remaining unmatched clauses plus positional info (_partial_matched_idx, _partial_flat_list) so replace() can reconstruct the clause list in the original order. |
Called from _match_node when mode == ALLOW_PARTIAL and the query is AND/OR but the pattern is not. |
_match_logical_list_unordered(q_flat, p_flat, memo, mode, mapping) |
Matches two flat clause lists without caring about order. Handles SetVariableNode (it absorbs all unmatched elements). For fixed elements, does backtracking combinatorial search using _memo_snapshot. Returns the list of unmatched query elements on success, or None on failure. |
Called by _match_and_or. |
_match_query_node(q, p, memo, mode, mapping) |
Matches a QueryNode against a pattern QueryNode clause by clause (SELECT vs SELECT, FROM vs FROM, etc.). Query clauses that have no counterpart in the pattern are collected into memo["_extra_clauses"] for carry-through. |
Called from _match_node when both q and p are QueryNode. |
_match_children_list(q_list, p_list, memo, mode, mapping) |
Matches an ordered list of child nodes (e.g., SELECT items, function arguments). SetVariableNode in the pattern acts as a wildcard that absorbs the contiguous slice of query children not consumed by the surrounding fixed elements. |
Called for SelectNode, FromNode, WhereNode, GroupByNode, FunctionNode, ListNode, etc. |
Public match entry points (class methods)
| Method | What it does |
|---|---|
QueryRewriterV2.match(query_ast, rule, memo, matching_mode) |
BFS over the entire query_ast. At each node, calls _match_node with a fresh attempt_memo. On the first success, copies the attempt memo into the caller's memo (including _rule_node) and returns True. |
QueryRewriterV2.match_node(query_node, pattern_node, rule, memo, mode) |
Thin public wrapper around _match_node. Lets callers outside the module invoke node-level matching without importing the private function. |
Step 2 — Replace
Goal: take the matched bindings in memo and produce the rewritten AST. This is split into two sub-steps:
2a. Substitute — fill in variable placeholders in rewrite_ast with their bound values.
2b. Graft — splice the substituted subtree back into the original query_ast at exactly the location that was matched.
2a — Substitute into rewrite AST
| Function | What it does | When it's used |
|---|---|---|
_subst(node, memo) |
Main substitution dispatcher. Walks rewrite_ast depth-first. ElementVariableNode → looks up the bound Node in memo. SetVariableNode → delegated to caller (see _subst_list). All other node types → rebuild with recursively substituted children. For LiteralNode strings: substitutes any variable name embedded inside the string (e.g. '%x%' → '%iphone%'). For AND/OR OperatorNode: flattens, expands SetVariableNodes, rebuilds. |
Core of the replace step; called by QueryRewriterV2.replace(). |
_subst_str(s, memo) |
Substitutes a single string field (like a node's .name or .alias) if that string is a variable name bound in memo. Extracts the canonical string from the bound value (TableNode → alias or name, ColumnNode → name). |
Called inside _subst for TableNode, ColumnNode, FunctionNode string fields. |
_subst_list(items, memo) |
Substitutes a list of nodes. Handles SetVariableNode specially: expands it into its bound list in-place. ElementVariableNode also gets expanded if bound to a list. |
Called for any child-list: SELECT items, FROM children, AND/OR clause lists, function args. |
2b — Graft replacement into the original tree
| Function | What it does | When it's used |
|---|---|---|
_replace_in_tree(tree, target_id, replacement) |
Immutable tree walk using Python's id() as the identity check. At every node, if id(node) == target_id, returns replacement; otherwise, rebuilds the node with recursively replaced children. Never mutates the original tree. |
Called by QueryRewriterV2.replace() when the matched node is not the root of query_ast. |
Public replace entry point
| Method | What it does |
|---|---|
QueryRewriterV2.replace(query_ast, rule, memo) |
Orchestrates the full replace step: (1) calls _subst(rewrite_ast, memo) to build the substituted rewrite; (2) if _extra_clauses in memo, calls _merge_extra_clauses to carry extra query clauses into the rewrite; (3) if _partial_remaining in memo, wraps the rewrite back into the AND/OR tree (with positional reconstruction for non-QueryNode rewrites); (4) grafts via _replace_in_tree if the matched node is not the root, or returns the rewrite directly if it is. |
_merge_extra_clauses(rewrite, extra) |
Adds query clauses that were present in the query but absent in the rule pattern (e.g., a GROUP BY when the rule only mentions SELECT/WHERE) into the rewritten QueryNode. Prevents those clauses from being silently dropped. |
Step 3 — Action (take_actions)
Goal: execute any side-effects declared in rule["actions_json"] — currently only the SUBSTITUTE action, which replaces one subtree inside a variable's bound value.
| Function | What it does | When it's used |
|---|---|---|
QueryRewriterV2.take_actions(query_ast, rule, memo) |
Iterates actions_json. For SUBSTITUTE(scope, source, target): resolves the three names through _resolve_action_var, looks up the bound values, then calls _node_subst(memo[scope], memo[source], memo[target]) and writes the result back into memo[scope]. Returns query_ast unchanged (it only mutates memo). |
Called before replace() in the rewrite loop — modifies memo bindings so that replace uses the post-substitution values. |
_node_subst(tree, src, tgt) |
Structural node substitution. Walks a subtree and replaces every occurrence of src value with tgt value. Handles aliased tables and qualified columns. Used for alias renaming (e.g. "replace all references to table alias e2 with e1"). |
Called by take_actions. Also exposed publicly as QueryRewriterV2.substitute(). |
_subst_val(field, src, tgt) |
Substitutes a single field (string or Node) against src → tgt. Handles three cases: string-to-string, TableNode-to-string (extracts alias), and node equality. |
Helper called only inside _node_subst for .name, .alias, .parent_alias fields. |
QueryRewriterV2.substitute(tree, source, target) |
Public wrapper around _node_subst. Exposed for direct use in tests. |
Step 4 — Rewrite loop (rewrite)
Goal: repeatedly apply rules until no rule fires or a cycle is detected.
| Function | What it does | When it's used |
|---|---|---|
QueryRewriterV2.rewrite(query, rules, iterate) |
Main entry point. Parses the input SQL, then loops: (1) call _pick_applicable_rule to find a firing rule; (2) call take_actions then replace; (3) format back to SQL and re-parse; (4) repeat if iterate=True and no cycle. If a rule application throws, that rule is added to the exclusion set and the loop tries the next rule on the same query. |
Public API — called by callers with a query string and a list of rule dicts. |
_pick_applicable_rule(query_ast, rules, excluded) |
Runs two passes: first looks for a full root match (FULL_ONLY + memo["_rule_node"] is query_ast); then if none found, looks for any partial match (ALLOW_PARTIAL). Skips rules in the excluded set. Calls _should_skip_partial_and_application as a guardrail. Returns (rule, memo) or (None, {}). |
Called from the inner while True loop inside rewrite(). |
_should_skip_partial_and_application(rule, memo) |
Heuristic guardrail. Returns True (skip this match) if: the match is partial-AND, the rule's rewrite introduces a JOIN, and the rule's WHERE pattern is a single non-AND predicate. This prevents order-sensitive mis-rewrites where a JOIN-introducing rule fires on one clause of a multi-clause AND prematurely. |
Called inside _pick_applicable_rule after a successful match is found. |
_rule_key(rule) |
Returns the rule's "key" field if present, else its "id". Used as the stable identifier for the exclusion set. |
Helper used only inside _pick_applicable_rule. |
QueryRewriterV2.reformat(query) |
Round-trips query through QueryParser + QueryFormatter to produce a canonical SQL string. |
Utility; not part of the rewrite loop itself but used by callers. |
QueryRewriterV2.beautify(query) |
Runs sqlparse.format(query, reindent=True) for human-readable output. |
Utility. |
Summary diagram
rewrite(query, rules)
│
├─ QueryParser.parse(query) → query_ast
│
└─ loop:
├─ _pick_applicable_rule(query_ast, rules, excluded)
│ ├─ QueryRewriterV2.match() ← Step 1: BFS over query_ast
│ │ └─ _match_node() ← dispatches by node type
│ │ ├─ _bind() ← records variable bindings in memo
│ │ ├─ _memo_snapshot() ← rollback on backtrack
│ │ ├─ _match_and_or()
│ │ │ └─ _match_logical_list_unordered()
│ │ ├─ _partial_match_in_logical()
│ │ ├─ _match_query_node() ← clause-by-clause for QueryNode
│ │ └─ _match_children_list() ← ordered list for SELECT/FROM/etc.
│ └─ _should_skip_partial_and_application() ← guardrail
│
├─ take_actions(query_ast, rule, memo) ← Step 3: mutates memo bindings
│ └─ _node_subst() /_subst_val()
│
├─ replace(query_ast, rule, memo) ← Step 2: build rewritten AST
│ ├─ _subst(rewrite_ast, memo) ← 2a: fill in variables
│ │ ├─ _subst_str()
│ │ └─ _subst_list()
│ ├─ _merge_extra_clauses() ← carry through extra clauses
│ ├─ _unflatten_logical() ← rebuild AND/OR if partial match
│ └─ _replace_in_tree() ← 2b: graft into original tree
│
└─ QueryFormatter.format(query_ast) → SQL string → re-parse → repeat
Overview
Migrate from QueryRewriter to QueryRewriterV2 using our AST rule parsing/matching/replacement pipeline.
Code Changes
query_rewriter_v2.pyand migrate logic over._partial_remaining.rules.py, addsget_rule_v2()to return pattern_ast / rewrite_ast and mapping.rule_parser_v2.pyupdates: supportCompoundQueryNode.test_query_rewriter_v2.py.Test
All test passed in
test_query_rewriter_v2.py.