canonicalize.ts: hardening roadmap (H1–H8) — prose/script robustness

Tracking issue for the remaining `canonicalize.ts` hardenings identified in the
2026-04-25 cross-project audit. Safe fixes already shipped in 9be870b (BOM
strip, backtick/ZWSP as whitespace, `clear`/`unset`/`reset-counters`/`reset-counters-all`
added to `GENERAL_COMMANDS`). This issue tracks the larger items that need design
discussion before landing.

## Background

Audit was driven from `tikoci/lsp-routeros-ts` while evaluating whether to vendor
`canonicalize.ts` for prose-text command extraction (chat input, MCP tool input,
markdown snippets). Full audit doc with methodology, findings, and proposed
hardenings:

https://github.com/tikoci/lsp-routeros-ts/blob/main/docs/canonicalize-audit.md

50-input probe across 6 categories (prose, comment/quote, bracket/brace,
scripting constructs, identifier-shaped values, markdown). **Zero crashes** —
the parser is robust as a parser. **12 wrong-but-quiet correctness gaps** when
fed anything other than clean CLI input.

Test surface lives at `src/canonicalize.fuzz.test.ts` — anchor tests document
current behaviour, `test.todo` markers list the unshipped hardenings below.

## Hardening roadmap

### Already shipped (commit 9be870b)

- ✅ **H7** — BOM strip + zero-width space as whitespace
- ✅ **Backticks as whitespace** (in both outer + word loops)
- ✅ **Universal verb expansion** — `clear`, `unset`, `reset-counters`,
  `reset-counters-all` added to `GENERAL_COMMANDS` after DB cross-check

### Still on the books

- ⬜ **H1: `mode: 'strict' | 'lenient'` parameter.** Today a leading word
  like `Run /ip/address/print` becomes `{path: "/Run/ip/address", verb: "print"}`
  because a mid-line slash never restarts path context. Lenient mode should
  drop leading prose and start a new command on stray slashes. Strict mode
  preserves today's behaviour. Anchor test:
  `describe('finding #1 — mid-line slash does not restart path (anchor)')`.

- ⬜ **H2: `Tok.Var` for `$identifier`.** Variables are tolerated in args
  position (most common case works) but treated as path segments when they
  appear after a path-shaped run with no recognised verb. Adding a dedicated
  token type makes them never-a-segment.

- ⬜ **H3: paren `(…)` expression scope.** `:if ($x = 1) do={ /log/info "yes" }`
  produces ZERO commands today: parens aren't tokenized, so the entire
  expression is consumed as garbage tokens that swallow the do-block.
  Inconsistently, `:while ($i < 10) do={ /log/info $i }` extracts the
  inner command. Recommend tokenizing `(` and `)` and skipping their
  contents (or recursing for any nested `[…]` subshells).

- ⬜ **H4: pluggable `isVerb` resolver.** **The most impactful one, and
  it specifically benefits rosetta.** The expanded universal verb set
  closed the easy half of finding #4, but `info`/`warning`/`error`/`debug`
  can't go in the universal set because they're path-context-dependent
  (`info` is a verb at `/log`, a dir at `/interface/wireless`).

  Proposed API:

  ```ts
  export interface CanonicalizeOptions {
    cwd?: string;
    mode?: 'strict' | 'lenient';
    /** Optional path-context-aware verb classifier. Called when the parser
     *  encounters a token that *could* be either a path segment or a verb.
     *  Return true to treat it as a verb. Falls back to GENERAL_COMMANDS /
     *  EXTRA_VERBS when not provided. MUST be synchronous. */
    isVerb?: (token: string, parentPath: string) => boolean;
  }
  ```

  Wiring:

  - **rosetta** (TUI / MCP / DB-backed callers): pass a callback that does
    `SELECT 1 FROM commands WHERE name=? AND parent_path=? AND type='cmd'`
    (sub-millisecond per call, version-aware, path-aware).
  - **lsp-routeros-ts**: pass a callback backed by a static `verbs.json`
    (see #4 for the parallel docs-links artifact pattern), augmented at
    runtime by classifications observed in `/console/inspect highlight`
    responses.
  - **standalone / no-deps callers**: omit the option, get today's universal
    set behaviour.

  This keeps the module pure and dependency-free while letting each consumer
  pick its precision/cost tradeoff. **Most of `EXTRA_VERBS` could be
  retired** once rosetta's TUI/MCP wires the DB resolver, since the data
  already has every cmd verb at every menu.

  Pairs with: a CI-published `verbs.json` artifact (analogous to the
  `routeros-docs-links.json` ask in #4) for callers that don't want to
  bundle the full DB.

- ⬜ **H5: `{` after `key=` is a literal block value.**
  `/system/script/add name=foo source={ /ip/address/print }` extracts the
  inner `/ip/address/print` and **drops the outer add command** because
  `{...}` is treated as a scope to recurse into rather than a quoted
  block-value. May need the H4 resolver to know that `add` itself doesn't
  introduce a block.

- ⬜ **H6: `extractMentions(input)` for navigation-only references.**
  `extractPaths('/ip/firewall/filter')` returns `[]` because no verb is
  attached. For "what does this text reference?" use cases (rosetta's
  classifier, LSP hover, MCP context-feeders) we want a separate function
  that surfaces every path the text mentions, with or without a verb.

- ⬜ **H8: confidence flag on each `CanonicalCommand`.**
  `'high' | 'medium' | 'low'` for well-formed CLI / relative-with-cwd /
  prose-extracted input respectively. Lets consumers filter
  (LSP could ignore `'low'` for hover, accept all for "what's this
  doing?" queries).
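
One way the H8 confidence flag might look from a consumer's side is sketched below. This is a hypothetical shape, not the module's API: `Confidence`, `CanonicalCommandSketch`, `RANK`, and `atLeast` are names invented for illustration, and only the three levels and their meanings come from the H8 bullet.

```ts
// Hypothetical sketch of the H8 confidence flag; none of these names exist
// in canonicalize.ts today.
type Confidence = 'high' | 'medium' | 'low';

interface CanonicalCommandSketch {
  path: string;
  verb: string;
  /** 'high' = well-formed CLI, 'medium' = relative-with-cwd, 'low' = prose-extracted. */
  confidence: Confidence;
}

// Rank the levels so consumers can express "at least this confident".
const RANK: Record<Confidence, number> = { low: 0, medium: 1, high: 2 };

function atLeast(
  commands: CanonicalCommandSketch[],
  min: Confidence,
): CanonicalCommandSketch[] {
  return commands.filter((c) => RANK[c.confidence] >= RANK[min]);
}

// Sample data for illustration only.
const extracted: CanonicalCommandSketch[] = [
  { path: '/ip/address', verb: 'print', confidence: 'high' },
  { path: '/ip/address', verb: 'add', confidence: 'medium' },
  { path: '/log', verb: 'info', confidence: 'low' },
];

// An LSP hover provider could ignore prose-extracted commands...
const forHover = atLeast(extracted, 'medium');
// ...while a "what's this doing?" query accepts everything.
const forExplain = atLeast(extracted, 'low');
```

Ranking the levels numerically keeps the filter a one-liner and lets each consumer choose its own threshold without the parser needing to know who is calling.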

## Why this matters across tikoci

Robust extraction of RouterOS commands from arbitrary strings is
high-value across the org:

- **lsp-routeros-ts** — hover, document-link provider, prose-aware features
- **tikbook** — RouterOS in markdown fences and notebook cells
- **rosetta** — classifier (`src/classify.ts`) overlaps with this; the TUI
  and MCP both feed user input through canonicalize before DB lookup
- **Future Copilot / agent tooling** — any feature that takes free-form
  text and needs to find the RouterOS commands inside

Today most of these treat the parser as a black box. With the H4 resolver
landed, each surface gets the same parser with a backend appropriate to
its constraints.
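
To make the "same parser, different backend" idea concrete, here is a minimal sketch of an `isVerb` callback backed by a static table, roughly what a `verbs.json`-based consumer might pass. The callback signature is the one proposed under H4. Everything else is an assumption for illustration: `VerbTable`, `makeStaticIsVerb`, the sample table contents, and the stand-in fallback set are hypothetical, and the real `verbs.json` format has not been defined.

```ts
// Signature as proposed under H4.
type IsVerb = (token: string, parentPath: string) => boolean;

// Hypothetical table shape: menu path -> verbs valid at that menu.
type VerbTable = Record<string, readonly string[]>;

// Sample data only; a real table would be generated from the commands DB.
const sampleTable: VerbTable = {
  '/log': ['info', 'warning', 'error', 'debug', 'print'],
  '/interface/wireless': ['print', 'add', 'set'],
};

// Stand-in for the universal-set fallback (GENERAL_COMMANDS in the real module).
const fallbackVerbs = new Set(['print', 'add', 'set', 'remove']);

function makeStaticIsVerb(table: VerbTable): IsVerb {
  // Pre-index into Sets so each lookup is O(1) and synchronous.
  const index = new Map<string, Set<string>>();
  for (const [path, verbs] of Object.entries(table)) {
    index.set(path, new Set(verbs));
  }
  return (token, parentPath) => {
    const known = index.get(parentPath);
    // Unknown menu: fall back to the universal set rather than guess.
    return known ? known.has(token) : fallbackVerbs.has(token);
  };
}

const isVerb = makeStaticIsVerb(sampleTable);
const atLog = isVerb('info', '/log');                     // table says verb
const atWireless = isVerb('info', '/interface/wireless'); // table says not a verb
const atUnknown = isVerb('print', '/ip/address');         // menu not in table, uses fallback
```

A DB-backed consumer would return the same function type but answer from a prepared query instead of the in-memory index, which is why the callback has to stay synchronous.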

## How to pick this up

1. Read this issue + the audit doc linked at the top.
2. Open `src/canonicalize.fuzz.test.ts`. The `test.todo` markers map 1:1
   to H1–H8 here. Promoting one to `test(...)` forces the implementation.
3. Suggested order: **H4 first** (largest payoff, unblocks rosetta itself),
   then H1 (lenient mode, unblocks LSP), then H5, then the smaller ones.
4. Each landing should:
   - flip the corresponding `test.todo` to `test(...)` with concrete
     assertions
   - update the corresponding "anchor" test in the same file (the one
     documenting today's wrong behaviour) to assert the *new* correct
     behaviour
   - add a CHANGELOG bullet under `[Unreleased]` → `Fixed` or `Added`
   - cross-check verb additions / API changes against the `commands`
     table to avoid path collisions

## Cross-references

- Audit doc: https://github.com/tikoci/lsp-routeros-ts/blob/main/docs/canonicalize-audit.md
- Shipped fixes: commit 9be870b
- Test surface: `src/canonicalize.fuzz.test.ts`
- Related: rosetta issue #4 (publish artifacts for downstream consumers; `verbs.json` would join `routeros-docs-links.json`)
- Related: rosetta \`BACKLOG.md\` § "LSP integration"
