fix: render Jinja blocks before splitting on semicolons (#2650)#2782
fix: render Jinja blocks before splitting on semicolons (#2650)#2782sfc-gh-moczko merged 8 commits intomainfrom
Conversation
de27892 to
f6d865b
Compare
225511d to
1f5d863
Compare
1f5d863 to
4f42dd4
Compare
sfc-gh-olorek
left a comment
There was a problem hiding this comment.
Left some comments, I am a bit scarred on changing the order of the rendering pipeline, as it may produce some unwanted side effects.
| content = f.read() | ||
| if pre_render: | ||
| content = pre_render(content) | ||
| stmts = split_statements(io.StringIO(content), remove_comments) |
There was a problem hiding this comment.
This runs before split_statements removes comments, so Jinja-like syntax inside SQL comments (e.g. -- {% if debug %}) would be resolved and could cause template errors.
I think sth like this will cause errors:
@mock.patch("snowflake.cli._plugins.sql.manager.SqlManager._execute_string")
def test_jinja_syntax_in_sql_comment_causes_error(mock_execute_query):
"""Jinja-like syntax in SQL comments is resolved by pre-render
because pre-render runs before comment removal."""
manager = SqlManager()
query = "-- {% if debug %}enable tracing{% endif %}\nselect 1;"
with pytest.raises(Exception, match="undefined"):
_, results = manager.execute(
query=query,
files=None,
std_in=False,
data={},
template_syntax_config=SQLTemplateSyntaxConfig(
enable_legacy_syntax=False,
enable_standard_syntax=False,
enable_jinja_syntax=True,
),
)
list(results)
There was a problem hiding this comment.
Fixed. Added _strip_sql_comments() — a state machine that strips -- and /* */ comments while preserving single-quoted strings and dollar-quoted strings. It runs as the first step inside _jinja_pre_render, before Jinja sees the content. test_jinja_template_syntax_in_sql_comment_is_ignored verifies that -- {{ undefined_var }} does not cause an UndefinedError.
There was a problem hiding this comment.
Thanks for the fix. I have a concern about the approach: _strip_sql_comments permanently removes all SQL comments when Jinja is enabled, even when --retain-comments is used. This silently changes behavior for users who rely on comments reaching Snowflake.
I, for now, don't know a clean way around it, as for example escaping approach by preprocessing jinja semicolons by replacing ; inside {% %} with some placeholder also have some corner cases, and has a lot of unnecessary complexity.
This may result with a different approach for the whole pr, that I don't know yet. Maybe you have some ideas?
There was a problem hiding this comment.
Good catch, and agreed this was a real regression. I've pushed a fix in the latest commit.
The approach: instead of stripping comments before Jinja renders, we now conditionally wrap them in {% raw %}...{% endraw %}. Jinja passes {% raw %} blocks through unchanged, so all comment content (including any {{ }} or {% %} syntax inside) is neutralised without actually removing the text.
Concretely, in _jinja_pre_render:
--retain-comments→_wrap_comments_in_jinja_raw()— comments become{% raw %}-- hint{% endraw %}before Jinja runs, then the raw tags are consumed and the original comment text survives into the output SQL.- default (strip) →
_strip_sql_comments()— existing behaviour unchanged.
Both functions share a new _tokenize_sql() generator that handles string literal detection once, so the logic can't drift between the two paths.
The one theoretical edge case: a comment containing the literal string {% endraw %} would break the wrapping. In practice this can't appear in any real SQL (it's a Jinja-specific string with no SQL meaning), and the token is short enough that it's implausible as part of an optimizer hint. If it ever becomes a concern, the placeholder/nonce approach is the robust fallback.
There was a problem hiding this comment.
Thanks for the improvement. I still think that it has some corner cases:
-
you acknowledged that a comment containing
{% endraw %}is a limitation itself, but you said that it is practically impossible to hit. I disagree, as someone can document jinja escaping itself in a SQL command. I think we cannot proceed with this limitation forward -
Snowflake supports nested comments but tokenizer ends preemptively. for example
/* outer /* inner */ {{ boom }} */ SELECT 1;should be interpreted but with the current approach it fails. -
_tokenize_sql- is very hacky and impossible to review by a human eye. Making the code not maintainable -
can be resolved by escaping
{% endraw %}, but this increases the complexity of the change even more -
can be cleanly resolved by counting the comment depth, and only ending the when the depth reaches 0
to resolve 3. I think we need completely another approach to the problem. relying on a regex.
I have played with the code for a little while and found the following outline to be the simplest and solving all corner cases that I could identify:
# === statement_reader.py — additions (after ASYNC_SUFFIX) ===
_SQL_TOKEN_RE = re.compile(
r"""
'(?:[^'\\]|\\.| '')*' # single-quoted string (skip)
| \$\$.*?\$\$ # $$-dollar-quoted string (skip)
| (?P<block_start> /\* ) # block comment opening (nested scan)
| (?P<line_comment> --[^\n]* ) # line comment (capture)
""",
re.VERBOSE | re.DOTALL,
)
@dataclass
class SavedComments:
_comments: list = field(default_factory=list)
_nonce: str = field(default_factory=lambda: uuid.uuid4().hex[:12])
def add(self, text: str) -> str:
idx = len(self._comments)
self._comments.append(text)
return f"__SQLC_{self._nonce}_{idx}__"
def restore(self, sql: str) -> str:
for i, comment in enumerate(self._comments):
sql = sql.replace(f"__SQLC_{self._nonce}_{i}__", comment)
return sql
def _protect_sql_comments(sql: str) -> tuple[str, SavedComments]:
saved = SavedComments()
result: list[str] = []
pos = 0
while pos < len(sql):
m = _SQL_TOKEN_RE.search(sql, pos)
if m is None:
break
if m.group("line_comment"):
result.append(sql[pos : m.start()])
result.append(saved.add(m.group()))
pos = m.end()
elif m.group("block_start"):
depth = 1
j = m.end()
while j < len(sql) and depth > 0:
if sql[j : j + 2] == "/*":
depth += 1; j += 2
elif sql[j : j + 2] == "*/":
depth -= 1; j += 2
else:
j += 1
result.append(sql[pos : m.start()])
result.append(saved.add(sql[m.start() : j]))
pos = j
else:
# String literal — emit as-is, advance past it
result.append(sql[pos : m.end()])
pos = m.end()
result.append(sql[pos:])
return "".join(result), saved
# === statement_reader.py — signature changes only ===
def query_reader(source, operators, remove_comments=False, pre_render=None):
if pre_render:
source = pre_render(source)
stmts = split_statements(io.StringIO(source), remove_comments)
yield from recursive_statement_reader(stmts, [], operators, remove_comments, pre_render)
def files_reader(paths, operators, remove_comments=False, pre_render=None):
for path in paths:
with path.open(read_file_limit_mb=UNLIMITED) as f:
content = f.read()
if pre_render:
content = pre_render(content)
stmts = split_statements(io.StringIO(content), remove_comments)
yield from recursive_statement_reader(
stmts, [path.as_posix()], operators, remove_comments, pre_render,
)
def recursive_statement_reader(source, seen_files, operators, remove_comments, pre_render=None):
# ... same logic, but in the !source/!load branch:
content = parsed_source.statement.read()
if pre_render:
content = pre_render(content)
yield from recursive_statement_reader(
split_statements(io.StringIO(content), remove_comments),
seen_files, operators, remove_comments, pre_render,
)
# === manager.py — execute() body ===
stmt_operators = []
jinja_pre_render = None
if template_syntax_config.enable_jinja_syntax:
def _jinja_pre_render(content: str) -> str:
content, saved = _protect_sql_comments(content)
if template_syntax_config.enable_legacy_syntax:
content = transpile_snowsql_templates(content)
content = snowflake_sql_jinja_render(
content,
template_syntax_config=SQLTemplateSyntaxConfig(
enable_legacy_syntax=template_syntax_config.enable_legacy_syntax,
enable_standard_syntax=template_syntax_config.enable_standard_syntax,
enable_jinja_syntax=True,
),
data=data,
)
return saved.restore(content)
jinja_pre_render = _jinja_pre_render
else:
if template_syntax_config.enable_legacy_syntax:
stmt_operators.append(transpile_snowsql_templates)
stmt_operators.append(partial(snowflake_sql_jinja_render, ...))
# Pass jinja_pre_render to readers:
stmt_reader = query_reader(query, stmt_operators, remove_comments, jinja_pre_render)@sfc-gh-moczko what do you think about changing the PR with the following regex based aproach?
There was a problem hiding this comment.
All three points are valid — switching to the regex placeholder approach as you outlined.
The implementation follows your sketch closely:
_SQL_TOKEN_REregex skips single-quoted strings and$$-dollar-quoted strings, matches line comments (captured as a group), and matches block comment openings (handled separately)._SavedCommentsdataclass stores original comment text keyed by nonce-prefixed index;restore()does a plain string replace.- Block comments use a depth-counting loop (not the regex) so nested
/* ... /* ... */ ... */is captured as a single placeholder. _jinja_pre_renderalways protects then restores — noretain_commentsconditional needed there.split_statements(remove_comments=…)remains the sole decision point for stripping vs. keeping.
{% raw %}, _tokenize_sql, _strip_sql_comments, and _wrap_comments_in_jinja_raw are all gone. Tests updated accordingly, including an explicit case for {% endraw %} inside a comment and for nested block comments.
There was a problem hiding this comment.
Two minor follow-up cleanups pushed alongside the implementation:
test_jinja_template_syntax_in_sql_comment_is_ignoredhad a stale inline comment saying "SQL comments should be stripped first" — updated to say they are replaced with inert placeholders before Jinja runs.test_retain_comments_with_jinja_enableddocstring still referenced_strip_sql_commentswhich no longer exists — updated to describe the actual mechanism.
No functional changes.
8ff37ef to
2d28299
Compare
93bbb4c to
0669a17
Compare
|
@sfc-gh-moczko lgtm please rebase on top of 0fd2247 to fix the integration tests |
`split_statements()` splits SQL on `;` before template rendering,
breaking Jinja block statements (`{% if %}`, `{% for %}`, etc.) that
contain semicolons. This moves Jinja rendering to a pre-render step
on the whole content before splitting, while legacy/standard syntax
continues to render per-statement.
Closes #2650
.... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code)
Co-Authored-By: Cortex Code <noreply@snowflake.com>
The for-loop renders to 3 separate SQL statements, which correctly produces nested result sets consistent with all other multi-statement output in the CLI. .... Generated with [Cortex Code](https://docs.snowflake.com/en/user-guide/cortex-code/cortex-code) Co-Authored-By: Cortex Code <noreply@snowflake.com>
… to !source files - Add _strip_sql_comments() state-machine to statement_reader.py that preserves string literals and dollar-quoted strings, so template-like syntax inside SQL comments is not evaluated by Jinja. - When enable_jinja_syntax=True, consolidate all rendering (legacy transpile, standard/<% %> variables, Jinja blocks) into a single pre-render pass applied before split_statements. This preserves the correct standard->Jinja ordering and fixes Jinja blocks that span multiple semicolon-separated statements. - Propagate pre_render through recursive_statement_reader so files loaded via !source receive the same pre-render treatment as the top-level query. - Improve the misleading empty-output error message from "Use either query, filename or input option." to "No SQL statements found to execute." - Add unit tests for _strip_sql_comments, Jinja comment safety, updated error message, and !source pre_render propagation. Fixes: #2650 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…enabled
When Jinja syntax is enabled, _jinja_pre_render was calling _strip_sql_comments
unconditionally, silently discarding all comments regardless of the
--retain-comments flag.
Fix by introducing _wrap_comments_in_jinja_raw(): instead of stripping comments,
it wraps them in {% raw %}...{% endraw %} so Jinja passes them through unchanged.
_jinja_pre_render now chooses the right strategy based on retain_comments.
Both functions share a new _tokenize_sql() generator that handles string literal
detection in one place, eliminating the duplication risk.
…protection
Three issues with the previous approach were identified in review:
1. A comment containing {% endraw %} (e.g. documenting Jinja escaping) would
break the wrapping — not a theoretical edge case.
2. Snowflake supports nested block comments; the character-by-character tokenizer
exited at the first */ leaving outer comment content exposed to Jinja.
3. _tokenize_sql was considered too complex and unmaintainable.
New approach (_protect_sql_comments):
- Regex finds string literals (skip), line comments, and block comment openings.
- Line comments are replaced with nonce-keyed placeholders.
- Block comments use a depth-counting loop to handle nesting correctly.
- Placeholders are restored after Jinja renders, then split_statements handles
the retain/strip decision via remove_comments as before.
- No dependency on Jinja internals ({% raw %} removed entirely).
0669a17 to
fb1a519
Compare
|
CI update: all checks pass except the 4 integration-trusted jobs (ubuntu/windows × tests-trusted/tests-trusted-ng), which are timing out. This is the known SNOWCLI_IT warehouse credit limit issue — not related to this PR's changes. |
Summary
{% if %},{% for %}, etc.) containing semicolons were broken becausesplit_statements()splits on;before Jinja rendering, producing invalid template fragments.split_statements(), while legacy (&{ }) and standard (<% %>) syntax continues to render per-statement (they are variable-only, no blocks).Root Cause
The SQL execution pipeline called
split_statements()then per-statement template rendering. Sincesplit_statementsis unaware of Jinja block syntax, a query like:was split into two broken fragments:
{% if var == 'Jinja' %}select 1;-- unclosedifblock --> TemplateSyntaxError{% endif %}-- orphanedendif--> TemplateSyntaxErrorFix
Jinja block rendering now happens on the whole content before splitting. The per-statement pipeline still handles legacy/standard variable substitution (which never had block syntax). This is safe because Jinja
{{ }}/{% %}syntax is orthogonal to<% %>/&{ }-- rendering one leaves the other untouched.Changes
statement_reader.pypre_renderparameter toquery_reader()andfiles_reader()manager.pytests/sql/test_sql.pyTest plan