Commit 6a3613b

AST canonicalization: omc_code_canonical + omc_code_equivalent
The maximally-useful LLM primitive: semantic-equivalence at the parse level.
Two programs that differ only in whitespace / comments / local variable
names / parameter names / loop variables / catch error vars / lambda
parameters / match-arm binds produce byte-identical canonical output.

Pipeline: parse OMC source → walk AST renaming locals to __v0/__v1/...
(alpha-equivalence) → re-emit via the existing canonical formatter
(strips whitespace + comments + normalizes operator parens).

Preserved (observable API): top-level function names, class names, dict
keys, string literals, global variables, builtin call sites.
Renamed (cosmetic): every binding inside a function or lambda scope.

Two new LLM-facing builtins:
  omc_code_canonical(code) -> string
  omc_code_equivalent(a, b) -> int (1 iff canonicals match)

Combined with omc_code_hash, this gives an LLM a semantic-stable id for
any program region: a memory key that survives reformatting, renaming,
comment edits. The answer to "is this still the function I was editing
10 messages ago?" becomes a single int compare.

Self-tested via /tmp/use_canonical.omc against four realistic LLM-edit
scenarios:
  1. Local var rename:        equivalent ✓
  2. Comment + reflow:        equivalent ✓
  3. Body change (+1 vs +2):  not equiv ✓
  4. Full alpha-rename:       equivalent ✓

Tests: 15 OMC cases + 8 Rust unit tests covering whitespace, blank
lines, comments, param/local/loop/lambda/catch alpha-equivalence,
top-level fn names preserved, structural differences detected, the
omc_code_equivalent shortcut, canonical-hash stability under rename,
builtin names preserved through canonical form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
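The pipeline in the commit message (parse, rename bindings to `__v0`/`__v1`/..., re-emit through a formatter) can be sketched outside OMC. The block below is only an analogy written against Python's standard `ast` module, not the Rust implementation in this commit: `canonical`, `equivalent`, and `memory_key` are hypothetical names, SHA-256 is a stand-in for `omc_code_hash`, and the renamer treats each top-level function as one flat namespace, so it ignores shadowing in nested scopes, `except ... as` names, and nested function names.

```python
import ast
import hashlib


def _rename_locals(fn: ast.FunctionDef) -> None:
    """Rename parameters and assigned names inside one function to __v0, __v1, ..."""
    mapping = {}
    nodes = list(ast.walk(fn))  # fixed traversal order, determined by tree shape only
    # Pass 1: collect binders (parameters, lambda parameters, assignment/loop targets).
    for n in nodes:
        if isinstance(n, ast.arg):
            mapping.setdefault(n.arg, f"__v{len(mapping)}")
        elif isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store):
            mapping.setdefault(n.id, f"__v{len(mapping)}")
    # Pass 2: rewrite every occurrence of a collected name.
    # Names never bound in the function (globals, builtins) are left untouched.
    for n in nodes:
        if isinstance(n, ast.arg):
            n.arg = mapping[n.arg]
        elif isinstance(n, ast.Name) and n.id in mapping:
            n.id = mapping[n.id]


def canonical(code: str) -> str:
    """Parse, alpha-rename locals of top-level functions, and re-emit."""
    tree = ast.parse(code)
    for stmt in tree.body:
        if isinstance(stmt, ast.FunctionDef):
            _rename_locals(stmt)  # the function's own name is preserved
    return ast.unparse(tree)  # re-emitting drops comments and original spacing


def equivalent(a: str, b: str) -> int:
    """Return 1 iff both inputs parse and canonicalize identically, else 0."""
    try:
        return int(canonical(a) == canonical(b))
    except SyntaxError:
        return 0  # malformed input cannot be verified


def memory_key(code: str) -> str:
    """Hash of the canonical form (SHA-256 here is an assumption, not omc_code_hash)."""
    return hashlib.sha256(canonical(code).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    before = "def relu(x):\n    if x > 0:\n        return x\n    return 0\n"
    after = "def relu(inp):  # renamed\n    if inp > 0:\n        return inp\n    return 0\n"
    print(equivalent(before, after))
    print(memory_key(before) == memory_key(after))
```

Because each distinct bound name maps to a distinct `__vN`, this sketch errs toward reporting "not equivalent" when nested scopes shadow a name; a scope-aware renamer, which the commit message describes for lambdas and catch variables, would be needed to close that gap.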
1 parent e36e1ac commit 6a3613b

7 files changed

Lines changed: 678 additions & 3 deletions

OMC_REFERENCE.md

Lines changed: 23 additions & 3 deletions
@@ -2,9 +2,9 @@
 
 Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFERENCE.md` to regenerate.
 
-**Total documented builtins**: 110
+**Total documented builtins**: 112
 
-**OMC-unique**: 22 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
+**OMC-unique**: 24 (no direct Python/NumPy equivalent — these are why you reach for OMC over numpy)
 
 ---
 
@@ -24,7 +24,7 @@ Auto-generated from `omnimcode-core/src/docs.rs`. Run `omc --gen-docs > OMC_REFE
 - [stdlib](#stdlib) (8 builtins)
 - [exceptions](#exceptions) (1 builtins)
 - [introspection](#introspection) (8 builtins)
-- [tokenizer](#tokenizer) (10 builtins)
+- [tokenizer](#tokenizer) (12 builtins)
 
 ---
 
@@ -1186,5 +1186,25 @@ Substrate distance between two programs (|hash_a - hash_b|). Same code → 0; sm
 omc_code_distance("return 1;", "return 2;") // small
 ```
 
+### `omc_code_canonical` 🔱 *OMC-unique*
+
+**Signature**: `(code: string) -> string`
+
+Parse + AST-canonicalize + re-emit. Output is invariant under whitespace/comments/local-var-names/param-names/loop-vars/catch-vars/lambda-params. Top-level fn/class names + globals preserved.
+
+```omc
+omc_code_canonical("fn f(x) { return x; }") == omc_code_canonical("fn f(a) { return a; }")
+```
+
+### `omc_code_equivalent` 🔱 *OMC-unique*
+
+**Signature**: `(code_a: string, code_b: string) -> int`
+
+1 iff the two programs canonicalize identically (semantic alpha-equivalence). LLMs use this as a memory-key check: 'is this still the same function I was editing?'
+
+```omc
+omc_code_equivalent("fn f(x) { return x; }", "fn f(a) { return a; }") // 1
+```
+
 ---
 

examples/tests/test_canonical.omc

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
+# AST canonicalization — the LLM-reach-for semantic-equivalence layer.
+#
+# omc_code_canonical(code) → AST-canonicalized + reformatted string
+# omc_code_equivalent(a, b) → 1 if canonicals match, else 0
+#
+# Two programs that differ only in whitespace / comments / local-var
+# names / param names / loop vars / catch vars / lambda params produce
+# byte-identical canonical output. Top-level fn/class names and global
+# variables are preserved (observable API).
+
+fn assert_eq(actual, expected, msg) {
+    if actual != expected {
+        test_record_failure(msg + ": expected " + to_string(expected) + " got " + to_string(actual));
+    }
+}
+
+fn assert_true(cond, msg) {
+    if !cond { test_record_failure(msg); }
+}
+
+# ---- Whitespace / formatting invariance ----
+
+fn test_whitespace_invariant() {
+    h c1 = omc_code_canonical("fn add(x, y) { return x + y; }");
+    h c2 = omc_code_canonical("fn add(x,y){return x+y;}");
+    assert_eq(c1, c2, "whitespace doesn't change canonical");
+}
+
+fn test_blank_lines_invariant() {
+    h c1 = omc_code_canonical("fn f(x) { return x; }");
+    h c2 = omc_code_canonical("fn f(x) {\n\n return x;\n\n}");
+    assert_eq(c1, c2, "blank lines invariant");
+}
+
+# ---- Comments stripped ----
+
+fn test_comments_stripped() {
+    h c1 = omc_code_canonical("fn f(x) { return x; }");
+    h c2 = omc_code_canonical("# header\nfn f(x) {\n # body\n return x;\n}");
+    assert_eq(c1, c2, "comments don't affect canonical");
+}
+
+# ---- Alpha-equivalence: param renames ----
+
+fn test_param_rename_invariant() {
+    h c1 = omc_code_canonical("fn add(x, y) { return x + y; }");
+    h c2 = omc_code_canonical("fn add(a, b) { return a + b; }");
+    assert_eq(c1, c2, "param names normalize");
+}
+
+# ---- Alpha-equivalence: local var rename ----
+
+fn test_local_var_rename_invariant() {
+    h c1 = omc_code_canonical("fn f(x) { h tmp = x * 2; return tmp; }");
+    h c2 = omc_code_canonical("fn f(x) { h other = x * 2; return other; }");
+    assert_eq(c1, c2, "local var names normalize");
+}
+
+# ---- Top-level fn names PRESERVED ----
+
+fn test_top_level_fn_name_preserved() {
+    h c1 = omc_code_canonical("fn add(x, y) { return x + y; }");
+    h c2 = omc_code_canonical("fn sub(x, y) { return x + y; }");
+    assert_true(c1 != c2, "top-level fn names are observable");
+}
+
+# ---- Structurally different programs differ ----
+
+fn test_different_body_differs() {
+    h c1 = omc_code_canonical("fn f(x) { return x; }");
+    h c2 = omc_code_canonical("fn f(x) { return x + 1; }");
+    assert_true(c1 != c2, "different bodies → different canonical");
+}
+
+# ---- omc_code_equivalent shortcut ----
+
+fn test_equivalent_returns_1_for_equivalents() {
+    assert_eq(
+        omc_code_equivalent(
+            "fn f(x) { h tmp = x * 2; return tmp; }",
+            "fn f(a) { h q = a * 2; return q; }"
+        ),
+        1,
+        "alpha-equivalent → 1"
+    );
+}
+
+fn test_equivalent_returns_0_for_different() {
+    assert_eq(
+        omc_code_equivalent(
+            "fn f(x) { return x; }",
+            "fn f(x) { return x + 1; }"
+        ),
+        0,
+        "different bodies → 0"
+    );
+}
+
+fn test_equivalent_returns_0_for_parse_error() {
+    # Malformed code shouldn't crash; should return 0 (can't verify).
+    assert_eq(
+        omc_code_equivalent("fn f(x) { return", "fn f(x) { return x; }"),
+        0,
+        "parse error → 0"
+    );
+}
+
+# ---- Combined with omc_code_hash: semantic memory key ----
+
+fn test_canonical_hash_stable_across_renames() {
+    h c1 = omc_code_canonical("fn relu(x) { if x > 0 { return x; } return 0; }");
+    h c2 = omc_code_canonical("fn relu(input) { if input > 0 { return input; } return 0; }");
+    h h1 = omc_code_hash(c1);
+    h h2 = omc_code_hash(c2);
+    assert_eq(dict_get(h1, "raw"), dict_get(h2, "raw"),
+              "canonical-hash stable under alpha-rename");
+}
+
+# ---- For-loop variable invariance ----
+
+fn test_for_loop_var_invariant() {
+    h c1 = omc_code_canonical("fn f(xs) { for i in xs { print(i); } }");
+    h c2 = omc_code_canonical("fn f(xs) { for k in xs { print(k); } }");
+    assert_eq(c1, c2, "for-loop variable normalized");
+}
+
+# ---- Lambda param invariance ----
+
+fn test_lambda_param_invariant() {
+    h c1 = omc_code_canonical("fn f(xs) { return arr_map(xs, fn(x) { return x * 2; }); }");
+    h c2 = omc_code_canonical("fn f(xs) { return arr_map(xs, fn(z) { return z * 2; }); }");
+    assert_eq(c1, c2, "lambda param normalized");
+}
+
+# ---- Catch err-var invariance ----
+
+fn test_catch_err_var_invariant() {
+    h c1 = omc_code_canonical("fn f() { try { throw 1; } catch e { return e; } }");
+    h c2 = omc_code_canonical("fn f() { try { throw 1; } catch err { return err; } }");
+    assert_eq(c1, c2, "catch err-var normalized");
+}
+
+# ---- Builtin names are NOT renamed ----
+
+fn test_builtin_names_preserved() {
+    h c1 = omc_code_canonical("fn f(xs) { return arr_softmax(xs); }");
+    h c2 = omc_code_canonical("fn f(ys) { return arr_softmax(ys); }");
+    # Params normalize, but arr_softmax stays.
+    assert_eq(c1, c2, "alpha-equivalent");
+    assert_true(re_match("arr_softmax", c1) == 1, "arr_softmax preserved in output");
+}