Merged
11 changes: 11 additions & 0 deletions .github/workflows/ci.yml
@@ -35,6 +35,17 @@ jobs:
exit 1
fi

- name: Phase-0 manifest contract gate
# Asserts dist/repo.meta.json parses, carries all required fields,
# and every exposes.* path resolves on disk. Cross-repo tier-2
# contract per .github/docs/AI-discoverability-plan.md §3.4.
run: make check-manifest

- name: docs/ prose-only gate
# Cross-repo guardrail: docs/ holds only human-readable prose.
# Non-prose artifacts belong elsewhere.
run: make check-docs-prose

node:
name: node (${{ matrix.os }})
runs-on: ${{ matrix.os }}
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,7 +17,10 @@ _obj/

# Python artifacts
.venv/
-dist/
+dist/*
+# Phase-0 contract: org catalog fetches this by raw URL; track it
+# despite the broader dist/ ignore.
+!dist/repo.meta.json
*.egg-info
*.whl

236 changes: 236 additions & 0 deletions AGENTS.md
@@ -0,0 +1,236 @@
---
# Machine-readable project descriptor — schema v1 (2026-05-05).
name: tree-sitter-m
kind: [parser, grammar, library]
status: active # B5 done; B6 bindings + CI wired; remaining: publish + per-tier coverage gate
languages: [javascript, c, rust, python, go, typescript]

runtime:
  needs:
    - "node>=18 (for tree-sitter CLI + bindings build)"
  optional:
    - "rust toolchain (rust binding)"
    - "python>=3.10 (python binding)"
    - "go (go binding)"
  excludes: []

distribution:
  pypi: null # python binding consumed via wheels attached to GitHub releases
  npm: null # node binding present but not yet on npm
  cargo: null
  github: m-dev-tools/tree-sitter-m

location: ~/m-dev-tools/tree-sitter-m

exposes:
  grammar: "grammar.js (committed; generated from m-standard's grammar-surface.json)"
  parser: "src/parser.c (generated, committed)"
  bindings:
    - "bindings/node/"
    - "bindings/rust/"
    - "bindings/python/"
    - "bindings/go/"
  wasm: "tree-sitter-m.wasm build artefact (consumed by tree-sitter-m-vscode)"
  formats_produced:
    - "AST + grammar-metadata.json (for downstream tools)"

consumes:
  formats: [".m", ".mac", ".int"]
  services: []
  upstream_data:
    - "m-standard/integrated/grammar-surface.json (commands, ISVs, functions, etc.)"

companions:
  - project: m-standard
    relation: "input - grammar-surface.json drives the build-grammar.js code generator; rebuild grammar when m-standard updates"
  - project: m-cli
    relation: "consumer - m-cli's lint/fmt rules walk the tree-sitter AST"
  - project: tree-sitter-m-vscode
    relation: "consumer - loads the WASM build for VS Code syntax highlighting"
  - project: vista-meta
    relation: "primary validation corpus - 39,330 routines at ~/vista-meta/vista/vista-m-host/Packages (99.06% clean as of B5)"
  - project: m-modern-corpus
    relation: "secondary validation corpus - confirms grammar handles modern non-VistA idioms"

incompatibilities:
  - "ObjectScript out of scope - M only."
  - "Generated `src/parser.c` is committed so consumers don't need tree-sitter-cli or m-standard at install time."

docs:
  primary: README.md
  spec: docs/spec.md
---

# Claude project context — tree-sitter-m

## What this is
A tree-sitter grammar for M (MUMPS). The grammar, generated parser, and
bindings are in the repository (see `status` in the front matter). Full
design in `docs/spec.md`.

## Where things live
- `tools/build-grammar.js` — code-generator that consumes
`m-standard/integrated/grammar-surface.json` and emits the
data-driven half of `grammar.js`.
- `grammar.js` — generated tree-sitter grammar definition.
- `src/parser.c` — generated by `tree-sitter generate` from
grammar.js. Both are committed so consumers don't need
tree-sitter-cli or m-standard to install.
- `bindings/` — Node, Rust, Python, Go bindings.
- `test/corpus/` — tree-sitter corpus tests + real M routines from
m-standard's sources.

## Pipeline
```
m-standard/integrated/grammar-surface.json
tools/build-grammar.js (run on m-standard updates)
grammar.js (committed)
tree-sitter generate (run on grammar.js changes)
src/parser.c + src/grammar-metadata.json (committed)
tree-sitter test (corpus tests, per-tier coverage gate)
bindings/{node,rust,python,go} (compiled per platform)
```

## Hard rules
- **Read `docs/tree-sitter-notes.md` before adding grammar rules
that involve overlapping regex tokens, keyword vs identifier
disambiguation, or context-sensitive recognition.** Two
non-obvious tree-sitter constraints have already cost us debugging
time: token precedence takes priority over match length (it is not
a tiebreaker applied after length), and the regex engine has no
look-around. The notes document records the workaround patterns
we've adopted (GLR parser alternatives, external-scanner byte-level
lookahead, disjoint-by-required-structure regexes) and when to pick
which.
- **The parser recognises the union of all sources.** Subsetting
belongs in the linter layer (`tree-sitter-m-lint`), not the
grammar. See `docs/spec.md` AD-01.
- **Hand-code language structure; data-drive the keyword tables.**
Line shape, comments, strings, postconditionals, indirection,
dot-blocks are invariant across sources and live in `grammar.js`.
Commands / functions / ISVs / operators / pattern codes come
from `m-standard`. See AD-02.
- **Stamp `standard_status` on every keyword node.** Per-token
tier metadata is what lets downstream linters work without
re-parsing. See AD-03.
- **Pin the m-standard schema_version.** When m-standard ships a
breaking schema change, tree-sitter-m adopts deliberately and bumps
major version. CI fails if the consumed file's
`schema_version` doesn't match the pin. See AD-04.
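
The schema pin in the last rule can be pictured as a small check. This is a minimal sketch, not the repository's actual CI code: the function name `assertSchemaPin`, the pinned value `"1"`, and the assumption that `schema_version` is a top-level field of `grammar-surface.json` are all hypothetical.

```javascript
// Hypothetical pin; the real value lives wherever the project records it.
const PINNED_SCHEMA_VERSION = "1";

// Throws if the consumed grammar-surface data does not carry the pinned
// schema_version. Assumes schema_version is a top-level field.
function assertSchemaPin(surface, pinned = PINNED_SCHEMA_VERSION) {
  const actual = surface.schema_version;
  if (String(actual) !== String(pinned)) {
    throw new Error(
      `m-standard schema_version mismatch: pinned ${pinned}, got ${actual}`
    );
  }
  return true;
}

// Example with in-memory data instead of the real file.
assertSchemaPin({ schema_version: "1", commands: [] }); // matches the pin
try {
  assertSchemaPin({ schema_version: "2", commands: [] });
} catch (err) {
  console.log(err.message); // a CI step would exit non-zero here
}
```

Failing loudly on any mismatch, including a missing field, keeps a breaking upstream change from silently reshaping the generated grammar.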

## Source for grammar data
**Always** `~/projects/m-standard/integrated/grammar-surface.json`,
read at build time only. Never the per-source TSVs, never the
pragmatic / SAC / operational standards. Those serve different
consumers (the linter, not the parser).
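
To illustrate what "data-driving the keyword tables" from this file could look like, here is a hedged sketch. The shape of `grammar-surface.json` is not shown in this document, so the `commands` array, the `name` / `abbreviation` fields, and the `standard_status` values below are assumptions made up for the example; `buildKeywordTable` is likewise a hypothetical helper, not the API of `tools/build-grammar.js`.

```javascript
// Hypothetical slice of grammar-surface.json; the real field names may differ.
const surface = {
  commands: [
    { name: "SET", abbreviation: "S", standard_status: "standard" },
    { name: "WRITE", abbreviation: "W", standard_status: "standard" },
    { name: "ZWRITE", abbreviation: "ZWR", standard_status: "extension" },
  ],
};

// Map every accepted spelling (full name and abbreviation, upper-cased)
// to its canonical name and tier, so the tier can be attached to a
// keyword without consulting m-standard again.
function buildKeywordTable(entries) {
  const table = new Map();
  for (const { name, abbreviation, standard_status } of entries) {
    for (const spelling of [name, abbreviation]) {
      if (!spelling) continue; // tolerate entries without an abbreviation
      table.set(spelling.toUpperCase(), { canonical: name, standard_status });
    }
  }
  return table;
}

const commands = buildKeywordTable(surface.commands);
console.log(commands.get("W"));
console.log(commands.get("ZWR").standard_status);
```

Because the table is built once at generation time, the shipped parser carries the tier data itself and needs no m-standard checkout at install time.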

## Conventions
- AGPL-3.0 (matches m-standard).
- Generated artifacts (`grammar.js`, `src/parser.c`) ARE committed
so installs don't need build tools.
- Tree-sitter version pinned in `package.json`; no surprise upgrades.
- Bindings use tree-sitter's standard scaffold; no bespoke binding
code.
- Tests use tree-sitter's standard corpus format
(`test/corpus/*.txt` with `===` separators).

## Toolchain
- Node.js ≥ 18 (tree-sitter-cli)
- C compiler (gcc/clang/MSVC) for the generated parser
- No m-standard runtime dependency — tree-sitter-m ships pre-generated
artifacts. m-standard is a build-time data input only.

## What this is NOT
- A linter. That's `tree-sitter-m-lint` (sibling project).
- A formatter. That's a separate downstream consumer.
- A compiler / interpreter. M execution belongs to YottaDB / IRIS.
- A semantic analyzer. The parser produces a parse tree; type
inference, control flow, cross-routine resolution all live above.
- A parser for **InterSystems ObjectScript** (`##class`, `&sql`,
`obj.method()`, `obj.property=val`, `##super`, etc.). ObjectScript
is a separate scripting language layered on top of M's runtime;
if you want to parse it, build a sibling grammar
(`tree-sitter-objectscript`). tree-sitter-m covers M and M dialects
(AnnoStd, YottaDB, IRIS's M layer) only.

## Setup

```bash
npm ci # install tree-sitter-cli + bindings deps
```

Node ≥ 18, plus a C compiler for the generated parser. For the Python
binding specifically, install from a tagged release's prebuilt wheel
(no C toolchain needed) — see `RELEASE.md` for URL templates.

## Test

```bash
make test # tree-sitter corpus tests (test/corpus/*.txt)
make parse-rate-check # VistA-corpus parse-rate gate (≤ 1.0% errors by default)
npm test # full Node side: corpus + lib + coverage gate
```

`make parse-rate-check` requires the VistA corpus at
`$VISTA_DIR/Packages/` (defaults to `~/vista-meta/vista/vista-m-host/Packages/`).
Skip locally if you don't have that checkout — CI runs it.
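
The arithmetic behind the parse-rate gate can be sketched as follows. This is an illustration only: `parseRateGate` is a hypothetical name, and it assumes the gate counts routines that contain at least one parse error, which this document does not spell out.

```javascript
// Returns the error percentage and whether it is within the threshold.
// The 1.0 default mirrors the "≤ 1.0% errors" gate described above.
function parseRateGate(totalRoutines, routinesWithErrors, maxErrorPct = 1.0) {
  if (totalRoutines <= 0) {
    throw new RangeError("totalRoutines must be positive");
  }
  const errorPct = (routinesWithErrors / totalRoutines) * 100;
  return { errorPct, ok: errorPct <= maxErrorPct };
}

// 39,330 routines at 99.06% clean leaves roughly 370 with errors.
const result = parseRateGate(39330, 370);
console.log(result.errorPct.toFixed(2), result.ok);
```

With the corpus figures quoted in the front matter, the error rate lands just under the default threshold, so the gate passes with little headroom.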

## Build / generate

The parser artefacts under `src/` are committed so consumers can install
without tree-sitter-cli or m-standard. Regenerate when grammar or
m-standard changes:

```bash
node tools/build-grammar.js # → grammar.js, keywords.generated.js, src/grammar-metadata.json
# (requires sibling m-standard checkout)
tree-sitter generate # → src/parser.c, src/grammar.json, src/node-types.json
make all # → libtree-sitter-m.{a,so} + pkg-config (C consumers)
```

The `make manifest` target in this Makefile is a no-op pointer: the
payloads exposed through `dist/repo.meta.json` (`src/node-types.json`,
`src/grammar.json`, `src/grammar-metadata.json`) are generated outputs
already gated by `make test`.

## Verify

The `verification_commands` declared in `dist/repo.meta.json`:

```bash
make test # corpus tests pass
make check-manifest # repo.meta.json validates + every exposes.* path exists
```
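
What `make check-manifest` asserts (the file parses, required fields are present, every `exposes.*` path resolves) could be approximated like this. It is a sketch under stated assumptions, not the real target: the `REQUIRED_FIELDS` list, the `exposes` keys in the sample, and the idea that each `exposes` value is a plain relative path are all hypothetical.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical required fields; the real list is defined by the plan doc.
const REQUIRED_FIELDS = ["name", "exposes", "verification_commands"];

// Returns a list of human-readable problems; an empty list means the
// manifest passed. `exists` is injectable so the check can be exercised
// without touching the disk.
function checkManifest(manifest, repoRoot, exists = fs.existsSync) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (!(field in manifest)) problems.push(`missing field: ${field}`);
  }
  for (const [key, relPath] of Object.entries(manifest.exposes ?? {})) {
    if (!exists(path.posix.join(repoRoot, relPath))) {
      problems.push(`exposes.${key} does not resolve: ${relPath}`);
    }
  }
  return problems;
}

// In-memory sample standing in for a parsed dist/repo.meta.json.
const sample = {
  name: "tree-sitter-m",
  exposes: { node_types: "src/node-types.json", grammar: "src/grammar.json" },
  verification_commands: ["make test", "make check-manifest"],
};
const onDisk = new Set(["src/node-types.json"]);
console.log(checkManifest(sample, "", (p) => onDisk.has(p)));
```

Collecting every problem before failing, rather than stopping at the first, gives one CI run enough information to fix the manifest in a single pass.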

Cross-repo guardrail:

```bash
make check-docs-prose # docs/ holds only prose
```
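
A prose-only gate of this kind might reduce to an extension allow-list. The sketch below is an assumption about how such a check could work, not a description of the actual `check-docs-prose` target: `PROSE_EXTENSIONS`, `findNonProse`, and the sample offending path are hypothetical.

```javascript
const path = require("node:path");

// Hypothetical allow-list; the real gate may accept other prose formats.
const PROSE_EXTENSIONS = new Set([".md", ".txt"]);

// Given relative file paths under docs/, return those that are not prose.
function findNonProse(files) {
  return files.filter(
    (file) => !PROSE_EXTENSIONS.has(path.extname(file).toLowerCase())
  );
}

const offenders = findNonProse([
  "docs/spec.md",
  "docs/tree-sitter-notes.md",
  "docs/fixtures/sample.json", // hypothetical stray artifact
]);
console.log(offenders);
```

A non-empty result would fail the gate and name the files that need to move out of `docs/`.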

## Guardrails

- **Do not hand-edit `src/parser.c`, `src/grammar.json`,
`src/node-types.json`, or `keywords.generated.js`.** They are
`tree-sitter generate` and `tools/build-grammar.js` outputs.
- **Do not hand-edit `src/grammar-metadata.json`.** It is the
build-grammar tool's emission, pinned by `schema_version` against
m-standard's `grammar-surface.json`.
- **Generated artefacts are committed.** This is deliberate — consumers
install without needing tree-sitter-cli + Node + an m-standard
sibling checkout. Don't `.gitignore` them.
- **ObjectScript is out of scope.** Don't add `##class` / `&sql` /
`obj.method()` to grammar.js. That belongs in a sibling
`tree-sitter-objectscript` grammar.
- **m-standard is a build-time data input only.** No runtime dep,
no install-time dep, no PyPI/npm dep on it.