canonicalize.ts: hardening roadmap (H1–H8) — prose/script robustness #5

@mobileskyfi

Tracking issue for the remaining canonicalize.ts hardenings identified in the
2026-04-25 cross-project audit. Safe fixes already shipped in 9be870b (BOM
strip, backtick/ZWSP as whitespace, clear/unset/reset-counters/reset-counters-all
added to GENERAL_COMMANDS). This issue tracks the larger items that need design
discussion before landing.

Background

Audit was driven from tikoci/lsp-routeros-ts while evaluating whether to vendor
canonicalize.ts for prose-text command extraction (chat input, MCP tool input,
markdown snippets). Full audit doc with methodology, findings, and proposed
hardenings:

https://github.com/tikoci/lsp-routeros-ts/blob/main/docs/canonicalize-audit.md

50-input probe across 6 categories (prose, comment/quote, bracket/brace,
scripting constructs, identifier-shaped values, markdown). Zero crashes:
the parser is robust as a parser. 12 wrong-but-quiet correctness gaps when
fed anything other than clean CLI input.

Test surface lives at src/canonicalize.fuzz.test.ts — anchor tests document
current behaviour, test.todo markers list the unshipped hardenings below.

Hardening roadmap

Already shipped (commit 9be870b)

  • H7 — BOM strip + zero-width space as whitespace
  • Backticks as whitespace (in both outer + word loops)
  • Universal verb expansion: clear, unset, reset-counters,
    reset-counters-all added to GENERAL_COMMANDS after DB cross-check
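
A minimal sketch of the input normalisation these shipped fixes describe. `normalizeInput` is a hypothetical name used only for illustration; per the bullets above, the real change treats backticks as whitespace inside the tokenizer's outer and word loops rather than in a separate pass.

```typescript
// Hypothetical helper, not the shipped code in canonicalize.ts.
const BOM = "\uFEFF";

function normalizeInput(input: string): string {
  // Drop a single leading byte-order mark, if present.
  const noBom = input.startsWith(BOM) ? input.slice(1) : input;
  // Treat zero-width space (U+200B) and backticks as ordinary whitespace.
  return noBom.replace(/[\u200B`]/g, " ");
}
```

With this in place, a markdown-quoted command such as `` `/ip/address/print` `` reaches the tokenizer with plain spaces where the backticks were.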

Still on the books

  • H1: mode: 'strict' | 'lenient' parameter. Today a leading word
    like `Run /ip/address/print` becomes `{path: "/Run/ip/address", verb: "print"}`
    because mid-line slash never restarts path context. Lenient mode should
    drop leading prose and start a new command on stray slashes. Strict mode
    preserves today's behaviour. Anchor test:
    `describe('finding #1 — mid-line slash does not restart path (anchor)')`.
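
    One way the lenient behaviour could be approximated is a pre-pass that
    discards words before the first slash-led token. This is a sketch under
    assumptions: `trimLeadingProse` is a hypothetical name, and splitting on
    whitespace ignores quoting, which a real implementation inside the
    tokenizer would have to respect.

    ```typescript
    type Mode = "strict" | "lenient";

    // Hypothetical pre-pass: in lenient mode, drop words that precede the
    // first token starting with "/" so prose is not folded into the path.
    function trimLeadingProse(line: string, mode: Mode = "strict"): string {
      if (mode === "strict") return line; // strict keeps today's behaviour
      const words = line.trim().split(/\s+/);
      const start = words.findIndex((w) => w.startsWith("/"));
      // No slash-led token at all: leave the input untouched.
      return start === -1 ? line : words.slice(start).join(" ");
    }
    ```

    In strict mode the input is returned unchanged, so the existing anchor
    test would keep passing until a caller opts in.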

  • H2: Tok.Var for $identifier. Variables are tolerated in args
    position (most common case works) but treated as path segments when they
    appear after a path-shaped run with no recognised verb. Adding a dedicated
    token type makes them never-a-segment.
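
    A rough sketch of what a dedicated variable token could look like. The
    token model here is hypothetical and does not quote the real tokenizer;
    the variable pattern (a `$` followed by letters, digits or `_`) is also
    an assumption and ignores quoted variable names.

    ```typescript
    // Hypothetical token model, standing in for the real Tok type.
    type TokKind = "Word" | "Var";
    interface Token { kind: TokKind; text: string; }

    // Assumption: a variable reference is "$" plus letters, digits or "_".
    const VAR_RE = /^\$[A-Za-z0-9_]+$/;

    function classify(text: string): Token {
      return { kind: VAR_RE.test(text) ? "Var" : "Word", text };
    }

    // A Var token is never eligible to become a path segment.
    function canBePathSegment(tok: Token): boolean {
      return tok.kind !== "Var";
    }
    ```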

  • H3: paren (…) expression scope. `:if ($x = 1) do={ /log/info "yes" }`
    produces ZERO commands today — parens aren't tokenized; the entire
    expression is consumed as garbage tokens that swallow the do-block.
    Inconsistently, `:while ($i < 10) do={ /log/info $i }` extracts the
    inner command. Recommend tokenizing `(` and `)` and skipping their
    contents (or recursing for any nested `[…]` subshells).
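
    The "skip their contents" option can be sketched as a scanner that jumps
    from an opening paren to just past its match. `skipParenGroup` is a
    hypothetical helper; it assumes double-quoted strings with backslash
    escapes and does not attempt the `[…]` recursion mentioned above.

    ```typescript
    // Hypothetical helper: given the index of "(", return the index just
    // past its matching ")", or src.length if the group is never closed.
    function skipParenGroup(src: string, open: number): number {
      let depth = 0;
      let inString = false;
      for (let i = open; i < src.length; i++) {
        const ch = src[i];
        if (inString) {
          if (ch === "\\") i++;                 // skip the escaped character
          else if (ch === '"') inString = false;
        } else if (ch === '"') {
          inString = true;                      // parens inside strings are ignored
        } else if (ch === "(") {
          depth++;
        } else if (ch === ")" && --depth === 0) {
          return i + 1;
        }
      }
      return src.length;
    }
    ```

    Applied to the `:if` example, scanning resumes at ` do={ … }`, so the
    do-block is no longer swallowed along with the condition.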

  • H4: pluggable isVerb resolver. The most impactful one, and
    it specifically benefits rosetta. The expanded universal verb set
    closed the easy half of finding #4, but `info`/`warning`/`error`/`debug`
    can't go in the universal set because they're path-context-dependent
    (`info` is a verb at `/log`, a dir at `/interface/wireless`).

    Proposed API:

    ```ts
    export interface CanonicalizeOptions {
      cwd?: string;
      mode?: 'strict' | 'lenient';
      /**
       * Optional path-context-aware verb classifier. Called when the parser
       * encounters a token that could be either a path segment or a verb.
       * Return true to treat it as a verb. Falls back to GENERAL_COMMANDS /
       * EXTRA_VERBS when not provided. MUST be synchronous.
       */
      isVerb?: (token: string, parentPath: string) => boolean;
    }
    ```

    Wiring:

    • rosetta (TUI / MCP / DB-backed callers): pass a callback that does
      `SELECT 1 FROM commands WHERE name=? AND parent_path=? AND type='cmd'`
      — sub-millisecond per call, version-aware, path-aware.
    • lsp-routeros-ts: pass a callback backed by a static `verbs.json`
      (see issue #4, "Export routeros-docs-links.json as a CI artifact for
      LSP consumers", for the parallel docs-links artifact pattern),
      augmented at runtime by classifications observed in
      `/console/inspect highlight` responses.
    • standalone / no-deps callers: omit the option, get today's universal
      set behaviour.

    This keeps the module pure and dependency-free while letting each consumer
    pick its precision/cost tradeoff. Most of `EXTRA_VERBS` could be retired
    once rosetta's TUI/MCP wires the DB resolver, since the data already
    has every cmd verb at every menu.

    Pairs with: a CI-published `verbs.json` artifact (analogous to the
    `routeros-docs-links.json` ask in issue #4) for callers that don't want
    to bundle the full DB.
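
    As an illustration of the static-table wiring, a resolver could be built
    from a path-to-verbs map. Everything here is a sketch: `makeIsVerb` and
    the `VerbTable` shape are hypothetical (no `verbs.json` format has been
    defined yet), and the sample data is illustrative rather than taken from
    the real `commands` table.

    ```typescript
    type IsVerb = (token: string, parentPath: string) => boolean;

    // Assumed shape for a static verbs.json: menu path -> verbs valid there.
    type VerbTable = Record<string, string[]>;

    function makeIsVerb(table: VerbTable, universal: ReadonlySet<string>): IsVerb {
      const byPath = new Map<string, Set<string>>();
      for (const [path, verbs] of Object.entries(table)) {
        byPath.set(path, new Set(verbs));
      }
      // Synchronous lookup: universal verbs first, then the per-menu table.
      return (token, parentPath) =>
        universal.has(token) || (byPath.get(parentPath)?.has(token) ?? false);
    }

    // Illustrative data only.
    const isVerb = makeIsVerb(
      { "/log": ["info", "warning", "error", "debug"] },
      new Set(["print", "clear", "unset"]),
    );
    ```

    With this table, `isVerb("info", "/log")` is true while
    `isVerb("info", "/interface/wireless")` is false, which is the
    path-dependent distinction the universal set cannot express.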

  • H5: { after key= is a literal block value.
    `/system/script/add name=foo source={ /ip/address/print }` extracts the
    inner `/ip/address/print` and drops the outer add command because
    `{...}` is treated as a scope to recurse into rather than a quoted
    block-value. May need the H4 resolver to know that `add` itself doesn't
    introduce a block.

  • H6: extractMentions(input) for navigation-only references.
    `extractPaths('/ip/firewall/filter')` returns `[]` because no verb is
    attached. For "what does this text reference?" use cases (rosetta's
    classifier, LSP hover, MCP context-feeders) we want a separate function
    that surfaces every path the text mentions, with or without a verb.
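
    A deliberately crude sketch of the idea, using a regular expression
    instead of the real tokenizer. The segment pattern (lowercase letters,
    digits, hyphens) is an assumption, and this version would also return a
    trailing verb as part of the path; a real `extractMentions` would
    presumably reuse the parser's own path logic.

    ```typescript
    // Hypothetical regex-based stand-in for the proposed extractMentions.
    // A path run is "/" + segment, repeated, not preceded by a word char or "/".
    const PATH_RUN = /(?<![\w/])\/[a-z0-9][a-z0-9-]*(?:\/[a-z0-9][a-z0-9-]*)*/g;

    function extractMentions(input: string): string[] {
      // De-duplicate while keeping first-seen order.
      return [...new Set(input.match(PATH_RUN) ?? [])];
    }
    ```

    The lookbehind keeps prose like "and/or" and the tail of a URL from being
    reported as paths.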

  • H8: confidence flag on each CanonicalCommand.
    `'high' | 'medium' | 'low'` — well-formed CLI / relative-with-cwd /
    prose-extracted respectively. Lets consumers filter
    (LSP could ignore `'low'` for hover, accept all for "what's this
    doing?" queries).
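
A short sketch of how a consumer might filter on the H8 flag. The `CanonicalCommand` shape below is a trimmed-down stand-in (only `path` and `verb` appear in this issue), and `atLeast` is a hypothetical helper.

```typescript
type Confidence = "high" | "medium" | "low";

// Stand-in shape; the real CanonicalCommand likely has more fields.
interface CanonicalCommand {
  path: string;
  verb: string;
  confidence: Confidence;
}

const RANK: Record<Confidence, number> = { low: 0, medium: 1, high: 2 };

// Keep only commands at or above the requested confidence level.
function atLeast(cmds: CanonicalCommand[], min: Confidence): CanonicalCommand[] {
  return cmds.filter((c) => RANK[c.confidence] >= RANK[min]);
}
```

An LSP hover provider could call `atLeast(cmds, "medium")` to ignore prose-extracted results, while a "what's this doing?" query would pass `"low"` to accept everything.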

Why this matters across tikoci

Robust extraction of RouterOS commands from arbitrary strings is
high-value across the org:

  • lsp-routeros-ts — hover, document-link provider, prose-aware features
  • tikbook — RouterOS in markdown fences and notebook cells
  • rosetta — classifier (`src/classify.ts`) overlaps with this; the TUI
    and MCP both feed user input through canonicalize before DB lookup
  • Future Copilot / agent tooling — any feature that takes free-form
    text and needs to find the RouterOS commands inside

Today most of these treat the parser as a black box. With the H4 resolver
landed, each surface gets the same parser with a backend appropriate to
its constraints.

How to pick this up

  1. Read this issue + the audit doc linked at the top.
  2. Open `src/canonicalize.fuzz.test.ts` — the `test.todo` markers map 1:1
    to H1–H8 here. Promoting one to `test(...)` forces the implementation.
  3. Suggested order: H4 first (largest payoff, unblocks rosetta itself),
    then H1 (lenient mode, unblocks LSP), then H5, then the smaller ones.
  4. Each landing should:
    • flip the corresponding `test.todo` to `test(...)` with concrete
      assertions
    • update the corresponding "anchor" test in the same file (the one
      documenting today's wrong behaviour) to assert the new correct
      behaviour
    • add a CHANGELOG bullet under `[Unreleased]` → `Fixed` or `Added`
    • cross-check verb additions / API changes against the `commands`
      table to avoid path collisions

Cross-references
