feat(generators): token-level string interpolation metadata + string fixes #9

Merged
johnsoncodehk merged 1 commit into
johnsoncodehk:master from
dmno-dev:dmno/backtick-string-delims
Jun 6, 2026

@theoephraim
Contributor

@theoephraim theoephraim commented Jun 5, 2026

Apologies in advance for the AI-authored PR. This came up while wiring up a parser for varlock. The language is called "@env-spec"; it is a small DSL on top of familiar dotenv syntax that adds decorator-style comments and function calls.

From what I understand, the generated TextMate grammar did not handle backtick quotes correctly, and had similar trouble with string-template-style regions.

Maintainer note — rebuilt on current master. The original commits were written against the pre-IR master and conflicted across the generators, so this PR is reimplemented on the current token-pattern-IR codebase, with the two regression tests kept as the contract. Two differences from the original sketch: (1) the token API is now IR-based (not RegExp); (2) after a design review, the interpolation begin/end are literal delimiters rather than regex source, and interpolation is highlight-only by design — per varlock's docs, env-spec interpolation (${VAR}, op(...)) is runtime-evaluated, not parsed into the AST, so the parser leaves values as strings and this is purely cosmetic highlighting (a parser-level approach that re-enters the grammar would be the wrong tool here). The code samples below are updated to match.

Why this exists

This PR documents and locks down specific behavior needed by env-spec-style DSL grammars. The implementation may be replaced; the important part is preserving these scenarios (the tests).

Behavior locked down (must-pass)

  1. TextMate backtick delimiter inference is correct for escaped backtick strings:

    token(seq(lit('`'), star(alt(seq(lit('\\'), anyChar()), noneOf(oneOf('`', '\\')))), lit('`')),
          { string: true, escape: seq(lit('\\'), anyChar()) })

    The TM region uses the backtick as its begin/end delimiter (begin = the backtick, end = the backtick followed by |$), with no fallback to a double-quote delimiter.

  2. interpolation metadata is first-class on string tokens and propagates to all three highlighters. begin/end are literal delimiters:

    token(seq(lit('"'), star(alt(seq(lit('\\'), anyChar()), noneOf(oneOf('"', '\\')))), lit('"')), {
      string: true,
      escape: seq(lit('\\'), anyChar()),
      interpolation: [
        { begin: '${', end: '}', beginScope: '...', endScope: '...', contentScope: '...' },
        { begin: '$(', end: ')' },
      ],
    })
    • TextMate emits nested interpolation regions
    • Monarch emits interpolation begin rules + interpolation states
    • tree-sitter re-emits the string as a rule + an external <tok>_chars scanner + @punctuation.special highlight captures
  3. YAML quoted-scalar continuation gating: for indentation grammars without indent.blockScalar, inline multiline quoted values such as KEY="line1\nline2" must parse (no YAML continuation indentation error).
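A minimal sketch of what behavior 1 asks for, assuming a simplified pattern IR. The `Pattern` type, the helper constructors, and `inferDelimiters` are hypothetical stand-ins written for illustration; they are not Monogram's actual IR or generator code. The point is only that the delimiters are read off the token's own first and last literal, with no default quote character to fall back on:

```typescript
// Hypothetical pattern IR, loosely modeled on the seq/lit/star/alt calls above.
// These types are assumptions for illustration, not the project's real IR.
type Pattern =
  | { kind: "lit"; value: string }
  | { kind: "seq"; items: Pattern[] }
  | { kind: "alt"; items: Pattern[] }
  | { kind: "star"; item: Pattern }
  | { kind: "any" }
  | { kind: "noneOf"; chars: string[] };

const lit = (value: string): Pattern => ({ kind: "lit", value });
const seq = (...items: Pattern[]): Pattern => ({ kind: "seq", items });
const alt = (...items: Pattern[]): Pattern => ({ kind: "alt", items });
const star = (item: Pattern): Pattern => ({ kind: "star", item });
const anyChar = (): Pattern => ({ kind: "any" });
const noneOf = (...chars: string[]): Pattern => ({ kind: "noneOf", chars });

// Infer begin/end delimiters from the shape seq(lit(open), ..., lit(close)).
// Returns null when the pattern does not have that shape, instead of
// guessing a default such as a double quote.
function inferDelimiters(p: Pattern): { begin: string; end: string } | null {
  if (p.kind !== "seq" || p.items.length < 2) return null;
  const first = p.items[0];
  const last = p.items[p.items.length - 1];
  if (!first || !last) return null;
  if (first.kind !== "lit" || last.kind !== "lit") return null;
  return { begin: first.value, end: last.value };
}

// The escaped backtick string from item 1, written with the stand-in helpers.
const backtickString = seq(
  lit("`"),
  star(alt(seq(lit("\\"), anyChar()), noneOf("`", "\\"))),
  lit("`"),
);

const delims = inferDelimiters(backtickString);
```

Under these assumptions `delims` holds the backtick for both `begin` and `end`, and a pattern that is not a literal-delimited sequence yields `null` rather than a quote.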

Tests (the contract)

  • test/env-spec-regressions.ts — backtick delimiter regression + block-scalar overreach regression
  • test/interpolation-metadata.ts — interpolation metadata propagation across TextMate / Monarch / tree-sitter (incl. a real tree-sitter generate + parse)

If this PR is replaced with a cleaner implementation, keeping these tests (or equivalent) preserves the same user-facing behavior.

Summary of changes

  • fix TextMate delimiter inference for escaped backtick string tokens (generic delimiter scope; no " fallback)
  • keep existing single/double-quote behavior unchanged
  • gate YAML-style multiline quoted-scalar indentation enforcement behind indent.blockScalar
  • add a token-level string interpolation option (highlight-only; literal begin/end delimiters)
  • consume interpolation in TextMate, Monarch, and tree-sitter generation
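The gating item in the list above can be pictured roughly as follows. This is a sketch under stated assumptions: `IndentConfig`, `checkQuotedContinuation`, and the exact indentation rule are hypothetical and simplified, and the real lexer check may look quite different. What it illustrates is that the YAML-style continuation rule only applies when the grammar opts in via `indent.blockScalar`:

```typescript
// Hypothetical slice of a grammar's indent configuration.
interface IndentConfig {
  blockScalar?: boolean;
}

// Returns an error message when a continuation line of a multiline quoted
// value is not indented past its key, or null when the value is accepted.
// The check is skipped entirely unless blockScalar is enabled.
function checkQuotedContinuation(
  indent: IndentConfig,
  keyIndent: number,
  continuationIndent: number,
): string | null {
  // Plain indentation grammar: no YAML continuation rule applies.
  if (!indent.blockScalar) return null;
  return continuationIndent > keyIndent
    ? null
    : "continuation line must be indented past its key";
}

// A quoted value whose second line starts at column 0, like KEY="line1\nline2".
const plainGrammar = checkQuotedContinuation({}, 0, 0);
const yamlLikeGrammar = checkQuotedContinuation({ blockScalar: true }, 0, 0);
```

With these inputs the plain grammar accepts the value (`null`), while the grammar that enables `blockScalar` still reports the under-indented continuation.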

interpolation token option

A string token may declare highlight-only interpolation regions. begin/end are literal delimiters (each generator escapes/uses them — TextMate/Monarch escape them into their regex dialect, tree-sitter uses them as literals):

interpolation: [
  {
    begin: '${',
    end: '}',
    beginScope: 'punctuation.definition.interpolation.begin',
    endScope: 'punctuation.definition.interpolation.end',
    contentScope: 'meta.embedded.expression',
    include: '$self',
  },
]
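To make "literal delimiters" concrete, here is a rough sketch of how one entry could become a TextMate begin/end rule. `InterpolationRegion` mirrors the option shown above; `escapeRegexLiteral`, `TextMateRule`, and `toTextMateRegion` are names invented for this example, and the project's own `escapeRegex` and generator output may differ. The rule keys themselves (`begin`, `end`, `beginCaptures`, `endCaptures`, `contentName`, `patterns`) are standard TextMate grammar fields:

```typescript
// Shape of one interpolation entry, mirroring the option shown above.
interface InterpolationRegion {
  begin: string;
  end: string;
  beginScope?: string;
  endScope?: string;
  contentScope?: string;
  include?: string;
}

// Subset of a TextMate begin/end rule used by this sketch.
interface TextMateRule {
  begin: string;
  end: string;
  beginCaptures?: { "0": { name: string } };
  endCaptures?: { "0": { name: string } };
  contentName?: string;
  patterns?: { include: string }[];
}

// Escape a literal so it matches itself in a regex. Local helper written
// for this sketch; not the project's escapeRegex.
function escapeRegexLiteral(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the nested region for one interpolation entry.
function toTextMateRegion(r: InterpolationRegion): TextMateRule {
  const rule: TextMateRule = {
    begin: escapeRegexLiteral(r.begin),
    end: escapeRegexLiteral(r.end),
  };
  if (r.beginScope) rule.beginCaptures = { "0": { name: r.beginScope } };
  if (r.endScope) rule.endCaptures = { "0": { name: r.endScope } };
  if (r.contentScope) rule.contentName = r.contentScope;
  if (r.include) rule.patterns = [{ include: r.include }];
  return rule;
}

const region = toTextMateRegion({
  begin: "${",
  end: "}",
  beginScope: "punctuation.definition.interpolation.begin",
  contentScope: "meta.embedded.expression",
  include: "$self",
});
```

Because the author writes `'${'` rather than a regex source, the escaping happens in one place per backend, and a delimiter such as `$(` cannot accidentally be read as an unterminated group.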

Validation

  • node test/env-spec-regressions.ts — 4/4
  • node test/interpolation-metadata.ts — 19/19 (includes a real tree-sitter generate + parse)
  • all 7 existing grammars regenerate byte-identically (npm run gen)
  • TS conformance unchanged (5386/5659), npm test 15/15, agnostic 9/9, gate:treesitter 96.0% (beats official)

@theoephraim theoephraim changed the title from "fix(tm): handle escaped backtick string tokens in delimiter inference" to "fix(tm+lexer): backtick escaped strings and block-scalar guard" Jun 5, 2026
@theoephraim theoephraim changed the title from "fix(tm+lexer): backtick escaped strings and block-scalar guard" to "feat(tm): token-level string interpolation metadata + string fixes" Jun 5, 2026
@theoephraim theoephraim changed the title from "feat(tm): token-level string interpolation metadata + string fixes" to "feat(generators): token-level string interpolation metadata + string fixes" Jun 5, 2026
@johnsoncodehk
Owner

johnsoncodehk commented Jun 5, 2026

Hi @theoephraim, could you provide a code example and Monogram configuration that I can use to test this change? Never mind, I saw that tests were already done in the latest commit. :)

@theoephraim
Contributor Author

yes - apologies when I saw what was first created, I had it cranking away to make it more clear :)

@johnsoncodehk
Owner

The main branch has undergone some major restructuring. I don't have permission to push changes to this PR. Could you resolve the merge conflicts or grant me push permission?

@theoephraim
Contributor Author

@johnsoncodehk - hm strange. everything seems correct over here 🤷 I added you as a maintainer as well to my fork just in case...

@johnsoncodehk johnsoncodehk force-pushed the dmno/backtick-string-delims branch from 8b53b48 to 717167f on June 6, 2026 21:16
@johnsoncodehk
Owner

Rebuilt on top of the current master, which has since moved tokens to a pattern algebra/IR — the original commits conflicted across the generators and couldn't be merged as-is. The three behaviors are reimplemented and your two test files are kept as the contract (ported to the IR token() API):

  • backtick delimiters: gen-tm now infers string-region delimiters generically, so an escaped backtick string keeps backtick delimiters (no " fallback)
  • blockScalar gating: the YAML multiline quoted-scalar continuation check is gated behind indent.blockScalar, so a plain indentation grammar accepts KEY="a\nb"
  • interpolation metadata: string tokens can declare interpolation regions, consumed by TextMate (nested regions), Monarch (interpolation states), and tree-sitter (a rule + an external <tok>_chars scanner + highlight captures)
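For the Monarch side, "interpolation states" could look roughly like the sketch below. The state naming scheme, the `delimiter` and `identifier` token names, and `toMonarchStates` itself are assumptions made for illustration, not the output of gen-monarch. The `[regex, action]` rule shape and the `next: "@pop"` transition are ordinary Monarch tokenizer features:

```typescript
interface Region {
  begin: string;
  end: string;
}

// Simplified Monarch rule: a regex plus either a token name or an action
// that also switches tokenizer state.
type MonarchAction = string | { token: string; next: string };
type MonarchRule = [RegExp, MonarchAction];

// Local helper for this sketch; not the project's escapeRegex.
function escapeRegexLiteral(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// For each region, emit one rule that enters a dedicated state (to be placed
// in the string's own state) and that state's rules: pop on the end
// delimiter, otherwise consume one character of content.
function toMonarchStates(tokenName: string, regions: Region[]) {
  const beginRules: MonarchRule[] = [];
  const states: Record<string, MonarchRule[]> = {};
  regions.forEach((r, i) => {
    const state = `${tokenName}_interp_${i}`; // hypothetical naming scheme
    beginRules.push([
      new RegExp(escapeRegexLiteral(r.begin)),
      { token: "delimiter", next: `@${state}` },
    ]);
    states[state] = [
      [new RegExp(escapeRegexLiteral(r.end)), { token: "delimiter", next: "@pop" }],
      [/./, "identifier"],
    ];
  });
  return { beginRules, states };
}

const monarch = toMonarchStates("dq", [
  { begin: "${", end: "}" },
  { begin: "$(", end: ")" },
]);
```

The end-delimiter rule is listed first in each state because Monarch tries rules in order, so the catch-all content rule only fires when the region is not closing.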

Verification: env-spec-regressions 4/4, interpolation-metadata 19/19 (incl. a real tree-sitter generate + parse), all seven existing grammars regenerate byte-identically, and the TS conformance / agnostic / tree-sitter gates are unchanged.

Thanks for the clear reimplementation contract — the test files made this straightforward. Credited you as co-author on the commit.

The original PR was written against the pre-IR master (escape: RegExp,
pattern.source); the codebase has since moved tokens to an algebra/IR, so this
reimplements the same three behaviors on current master and keeps the regression
tests as the contract.

- gen-tm: infer string-region delimiters generically, so an escaped backtick
  string keeps backtick delimiters instead of falling back to `"`
- gen-lexer: gate the YAML multiline quoted-scalar continuation check behind
  `indent.blockScalar`, so a plain indentation grammar accepts `KEY="a\nb"`
- types/api: a `string` token may declare `interpolation` regions
- gen-tm / gen-monarch / gen-treesitter: consume `interpolation` — nested
  TextMate regions, Monarch interpolation states, and a tree-sitter rule + an
  external `<tok>_chars` scanner + highlight captures
- tests: env-spec-regressions + interpolation-metadata, ported to the IR API

All seven existing grammars regenerate byte-identically; TS conformance, the
agnostic gate, and the tree-sitter accuracy gate are unchanged.

Co-authored-by: Theo Ephraim <theoephraim@users.noreply.github.com>
@johnsoncodehk johnsoncodehk force-pushed the dmno/backtick-string-delims branch from 717167f to 25ee63a on June 6, 2026 21:51
@johnsoncodehk
Owner

Follow-up after a design review of the interpolation API:

  • interpolation begin/end are now literal delimiters ('${', '}', …) instead of regex source. Each generator escapes/uses the literal itself (TextMate/Monarch via escapeRegex, tree-sitter as a literal — the earlier regex→literal decode heuristic is gone), so the three backends are consistent and nothing hand-writes regex.
  • Kept it highlight-only by design: per varlock's docs, env-spec interpolation (${VAR}, op(...)) is runtime-evaluated, not parsed into the AST — so the parser correctly leaves values as strings and this is purely cosmetic highlighting. A parser-level approach (generalizing the template-literal machinery so holes re-enter the grammar) would be the wrong tool here; that's for languages whose interpolation holes are parse-time syntax.

Verification unchanged: env-spec-regressions 4/4, interpolation-metadata 19/19 (incl. a real tree-sitter generate + parse), all 7 existing grammars byte-identical, TS conformance == baseline, tree-sitter gate 96.0%.

@johnsoncodehk johnsoncodehk merged commit ad0415c into johnsoncodehk:master Jun 6, 2026
2 checks passed
@johnsoncodehk johnsoncodehk deleted the dmno/backtick-string-delims branch June 6, 2026 22:07