fix(sqlite): non-ASCII in SQL comments corrupts generated queries #4373

Open
exploded wants to merge 1 commit into sqlc-dev:main from exploded:fix/sqlite-unicode-comment-corruption

@exploded

Summary

  • Root cause: ANTLR4's InputStream operates on characters (runes), so token positions are character indices. source.Pluck() slices Go strings with byte offsets. When multi-byte UTF-8 characters (e.g. em-dash U+2014, 3 bytes) appear in SQL comments, the offset mismatch causes query text to be extracted at wrong positions -- truncating ? parameter placeholders and leaking comment text into generated Go code.
  • Fix: Build a rune-to-byte offset lookup table in the SQLite parser and convert ANTLR character indices to byte offsets before storing StmtLocation/StmtLen.
  • Regression test: New end-to-end test sqlite_unicode_comment reproducing the exact scenario from the issue (em-dash in a comment between two queries).
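
The mismatch described in the root cause can be reproduced in a few lines. The following is a minimal sketch, not code from this PR: `sliceAsBytes` and `runeIndex` are hypothetical helpers, and the SQL text is an illustrative stand-in for the scenario in the issue. It treats a character index as a byte offset, as the summary says happened, and shows the resulting slice.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sliceAsBytes slices s with [start, stop) interpreted as byte offsets,
// which is what a plain Go string slice does.
func sliceAsBytes(s string, start, stop int) string {
	return s[start:stop]
}

// runeIndex returns the character (rune) index of the first occurrence of
// sub in s, or -1 if sub is absent. This stands in for the kind of position
// a character-based lexer would report.
func runeIndex(s, sub string) int {
	b := strings.Index(s, sub)
	if b < 0 {
		return -1
	}
	return utf8.RuneCountInString(s[:b])
}

func main() {
	// Two queries separated by a comment containing an em-dash (3 bytes in UTF-8).
	src := "SELECT 1;\n-- a \u2014 b\nSELECT ?;"

	start := runeIndex(src, "SELECT ?")  // character index of the second query
	stop := utf8.RuneCountInString(src) // character index just past the end

	fmt.Println(len(src), stop) // byte length and rune count differ by 2

	// Using character indices as byte offsets starts two bytes early and
	// ends two bytes early: comment text leaks in and the "?" is cut off.
	fmt.Printf("%q\n", sliceAsBytes(src, start, stop)) // "b\nSELECT "
}
```

Each multi-byte character before the statement shifts the extracted window further to the left, so the damage grows with the number of non-ASCII characters in preceding comments.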

Fixes #4372

Test plan

  • New end-to-end test sqlite_unicode_comment passes
  • All existing SQLite end-to-end tests pass (sqlite_skip_todo, sqlite_table_options)
  • SQLite engine unit tests pass
  • Full TestReplay/base suite passes (only pre-existing wasm_plugin_sqlc_gen_unsafe_paths failure, unrelated)

🤖 Generated with Claude Code

…ce extraction

ANTLR's InputStream operates on characters (runes), so token positions
returned by GetStop() are character indices. However, source.Pluck()
slices Go strings using byte offsets. When multi-byte UTF-8 characters
(e.g. em-dash U+2014) appear in SQL comments, this mismatch causes
queries to be extracted at wrong positions -- truncating parameter
placeholders and leaking comment text into generated Go code.

Build a rune-to-byte offset lookup table and use it to translate ANTLR
positions before storing StmtLocation and StmtLen.
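
A rough sketch of the translation step described above, under stated assumptions: `buildRuneToByte` and `byteSpan` are hypothetical names rather than the functions added by this commit, and the stop index is assumed to be inclusive (the index of the last character of the statement).

```go
package main

import "fmt"

// buildRuneToByte returns a table t where t[i] is the byte offset of the
// i-th rune of s. A final entry equal to len(s) is appended so that an
// index one past the last rune can also be translated.
func buildRuneToByte(s string) []int {
	table := make([]int, 0, len(s)+1)
	for byteIdx := range s { // range over a string yields the byte offset of each rune
		table = append(table, byteIdx)
	}
	return append(table, len(s))
}

// byteSpan converts a rune span with an inclusive stop index (an assumption
// of this sketch) into a byte location and a byte length.
func byteSpan(table []int, startRune, stopRune int) (loc, length int) {
	loc = table[startRune]
	length = table[stopRune+1] - loc
	return loc, length
}

func main() {
	src := "SELECT 1;\n-- a \u2014 b\nSELECT ?;"
	table := buildRuneToByte(src)

	// The second statement occupies rune indices 19 through 27 inclusive.
	loc, length := byteSpan(table, 19, 27)
	fmt.Println(loc, length)                // 21 9
	fmt.Printf("%q\n", src[loc:loc+length]) // "SELECT ?;"
}
```

Building the table once per input keeps each lookup constant-time, at the cost of one int per rune; for ASCII-only input the table is the identity mapping, so existing behavior is unchanged.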

Fixes sqlc-dev#4372

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dosubot dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Apr 14, 2026

Development

Successfully merging this pull request may close these issues.

SQLite: non-ASCII character (em-dash) in SQL comment corrupts generated queries
