Phase V: self-hosting lexer (milestone 1) + 2 bugs it surfaced

RandomCoder-lab · claude · RandomCoder-lab · commit e85bb0192268 · 2026-05-13T21:28:00.000-05:00
examples/self_hosting_lexer.omc — a lexer for a subset of OMNIcode,
written entirely in OMNIcode itself. Runs on the Rust OMC interpreter
and emits tokens for programs the same interpreter could parse. First
milestone toward true self-hosting (lexer → parser → codegen → fixpoint).

Handles: identifiers, integer literals, keywords (h fn if else while for
in return break continue print import and or not res fold true false),
double-quoted string literals, single-char punctuation (= , ; ( ) { } [ ]
+ - * / % < > . : @ & | ^ ~), `#` line comments, and whitespace.

Not yet: multi-char operators (== != <= >= -> << >> &&), float literals,
escape sequences, triple-quoted strings. Milestone 2 fills these in.

Verified output for `h x = 89;`:
  [0] H h    [1] IDENT x    [2] EQ =    [3] NUMBER 89    [4] SEMI ;    [5] EOF

Tree-walk and VM produce IDENTICAL output. That parity is the contract.

## Bugs surfaced by writing real OMC code in OMC

### 1. String equality went through to_int()

`"a" == "b"` was evaluating to TRUE because both strings parse to int 0
via `s.parse::<i64>().unwrap_or(0)`. The Eq/Ne branches in both the
tree-walk interpreter and the VM's cmp_op fell into the int-coercion
path for non-numeric strings.

Fix: detect (Value::String, Value::String) BEFORE the float/int
fallback and compare as strings. The VM's cmp_op now also supports
< <= > >= on strings (lexicographic).

The lexer's `is_alpha`/`is_digit`/`punct_kind` predicates all rely on
char-equality (`c == "h"`). Without this fix, any two non-numeric
characters compared equal, because both sides coerced to 0.
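
A minimal Rust sketch of the before/after comparison logic may make the
failure mode concrete. The `Value` enum, `eq_buggy`, and `eq_fixed` below
are hypothetical, simplified stand-ins, not the interpreter's actual
definitions; only the `parse::<i64>().unwrap_or(0)` coercion and the
`(Value::String, Value::String)` check are taken from the description
above. The int fallback for mixed operands is an assumption.

```rust
// Hypothetical, simplified value type (the real one has more variants).
#[derive(Debug, Clone)]
enum Value {
    Int(i64),
    String(String),
}

impl Value {
    // The coercion described above: non-numeric strings become 0.
    fn to_int(&self) -> i64 {
        match self {
            Value::Int(n) => *n,
            Value::String(s) => s.parse::<i64>().unwrap_or(0),
        }
    }
}

// Pre-fix shape: every equality check goes through to_int().
fn eq_buggy(a: &Value, b: &Value) -> bool {
    a.to_int() == b.to_int()
}

// Post-fix shape: a pair of strings is compared as strings BEFORE the
// numeric fallback. The fallback for other pairs is an assumption here.
fn eq_fixed(a: &Value, b: &Value) -> bool {
    match (a, b) {
        (Value::String(x), Value::String(y)) => x == y,
        _ => a.to_int() == b.to_int(),
    }
}
```

With `a = Value::String("a")` and `b = Value::String("b")`, `eq_buggy`
returns true (both sides coerce to 0) while `eq_fixed` returns false.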

### 2. Mutating built-ins lost their writes on the VM path

The VM's `vm_call_builtin` shim copies each argument into a synthetic
`__vm_arg_0`, `__vm_arg_1`, ... variable before delegating to the tree-
walk dispatch. For mutating built-ins like `arr_push(arr, val)`, the
mutation landed on the synthetic — not the user's array variable —
so the caller never saw the change.

Fix: two new specialized opcodes — `Op::ArrPushNamed(name)` and
`Op::ArrSetNamed(name)`. The compiler detects calls of the form
`arr_push(varname, expr)` / `arr_set(varname, idx, val)` at compile
time and emits the named opcodes. The opcode itself carries the
variable name, so the VM can mutate the user's binding directly.

Disassembler updated to render them as `ARR_PUSH_NAMED tokens`.
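
The difference between the shim path and the named opcodes can be
sketched in Rust as follows. This is an illustration under stated
assumptions, not the VM's code: `Scope`, `push_via_shim`, and `exec` are
hypothetical names, arrays are reduced to `Vec<i64>`, and only the
`__vm_arg_0` copy, the two opcode names, and the pop order (value, then
index) come from this commit.

```rust
use std::collections::HashMap;

// Hypothetical minimal scope: variable name -> array of integers.
type Scope = HashMap<String, Vec<i64>>;

// Pre-fix shape: the shim copies the argument into a synthetic variable
// and the built-in mutates that copy, so `arr_name` never changes.
fn push_via_shim(scope: &mut Scope, arr_name: &str, val: i64) {
    let copy = scope.get(arr_name).cloned().unwrap_or_default();
    scope.insert("__vm_arg_0".to_string(), copy);
    if let Some(arg) = scope.get_mut("__vm_arg_0") {
        arg.push(val);
    }
}

// Post-fix shape: the opcode itself carries the variable name.
enum Op {
    ArrPushNamed(String),
    ArrSetNamed(String),
}

fn exec(op: &Op, stack: &mut Vec<i64>, scope: &mut Scope) -> Result<(), String> {
    match op {
        Op::ArrPushNamed(name) => {
            // Pop the value and append to the user's binding directly.
            let val = stack.pop().ok_or("stack underflow")?;
            let arr = scope
                .get_mut(name)
                .ok_or_else(|| format!("unknown array {}", name))?;
            arr.push(val);
            Ok(())
        }
        Op::ArrSetNamed(name) => {
            // Pop value, then index, matching the opcode's doc comment.
            let val = stack.pop().ok_or("stack underflow")?;
            let idx = stack.pop().ok_or("stack underflow")?;
            if idx < 0 {
                return Err("negative index".to_string());
            }
            let arr = scope
                .get_mut(name)
                .ok_or_else(|| format!("unknown array {}", name))?;
            let slot = arr.get_mut(idx as usize).ok_or("index out of bounds")?;
            *slot = val;
            Ok(())
        }
    }
}
```

In this sketch, calling `push_via_shim` on `tokens` leaves `tokens`
unchanged, whereas executing `Op::ArrPushNamed("tokens")` with the value
on top of the stack appends to `tokens` itself.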

## Why this matters

Self-hosting work is the best stress test for a language. You can't lie
to yourself about features — if `c == "h"` is broken, your lexer's
first branch fails silently. Writing real code in OMC, using OMC as it
exists, surfaces every assumption that was wrong.

This commit fixes those two bugs and adds the lexer as a permanent
demo/test of the language self-applying.

## Tests

141 passing across the workspace. Canonical sweep still 22/30 in both
tree-walk and VM. Lexer demo produces identical output on both paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,31 @@ All notable changes to OMNIcode will be documented in this file.
 
 ## [Unreleased]
 
+### Added (Phase V: self-hosting lexer (milestone 1), 2026-05-13)
+
+`examples/self_hosting_lexer.omc` — a lexer for a subset of OMNIcode, written **entirely in OMNIcode itself**. Runs on the Rust OMC interpreter and emits tokens for programs the same interpreter could parse. **First milestone toward self-hosting.**
+
+The lexer handles: identifiers, integer literals, keywords (`h`, `fn`, `if`, `else`, `while`, `for`, `in`, `return`, `break`, `continue`, `print`, `import`, `and`, `or`, `not`, `res`, `fold`, `true`, `false`), double-quoted string literals, all single-character punctuation, `#` line comments, and whitespace. **Not yet:** multi-char operators (`==`, `<=`, `<<`, etc.), float literals, escape sequences, triple-quoted strings — saved for milestone 2.
+
+**Verified output** on `h x = 89;`:
+```
+[0] H h        [1] IDENT x    [2] EQ =       [3] NUMBER 89    [4] SEMI ;    [5] EOF
+```
+
+On `fn add(a, b) { return a + b; }` — 14 tokens, all correctly classified. Tree-walk and VM produce identical output.
+
+### Fixed (surfaced by Phase V)
+
+The self-hosting work exposed two real bugs that had been silent until now:
+
+**1. String equality went through `to_int()` coercion.** `"a" == "b"` was evaluating to `true` because both strings parsed to integer `0` via `s.parse().unwrap_or(0)`. Fix: in `Expression::Eq` / `Expression::Ne` and the VM's `cmp_op`, check for `(Value::String, Value::String)` and compare as strings directly. The same string ordering now works for `<`, `<=`, `>`, `>=` on the VM path. Tree-walk path was already broken in the same way and is also fixed.
+
+**2. `arr_push` / `arr_set` on the VM path lost mutations.** The VM's `vm_call_builtin` shim copies args into synthetic `__vm_arg_0`, `__vm_arg_1` variables before delegating to the tree-walk dispatch. Mutating built-ins like `arr_push` modified the synthetic — not the user's actual array variable — so the mutation never reached the caller's scope. Fix: two new specialized opcodes `Op::ArrPushNamed(name)` and `Op::ArrSetNamed(name)`. The compiler detects `arr_push(varname, expr)` / `arr_set(varname, idx, val)` at compile time and emits the named opcodes, which take the variable name in the opcode itself and mutate the user's binding directly. The disassembler renders them as `ARR_PUSH_NAMED tokens` for clarity.
+
+Both bugs are tested implicitly through the lexer demo (which exercises hundreds of string comparisons and array mutations across both execution paths).
+
+**Tests:** still 141 passing across the workspace. Canonical sweep still 22/30 in both modes.
+
 ### Added (Phase T: source positions in error messages, 2026-05-13)
 
 Every parser error now reports the precise `line:col` where it occurred. The lexer tracks `line` and `col` as it consumes characters (incrementing line on `\n`, col otherwise). `tokenize_with_pos` returns `Vec<(Token, Pos)>` paired; `Parser` stores them and exposes `current_pos()` to error-reporting sites.
diff --git a/examples/self_hosting_lexer.omc b/examples/self_hosting_lexer.omc
@@ -0,0 +1,304 @@
+# =============================================================================
+# Self-Hosting Lexer (Phase V, milestone 1)
+# =============================================================================
+# A lexer for a subset of OMNIcode, written entirely in OMNIcode itself.
+# Run on the Rust OMC interpreter, this demonstrates that the language is
+# expressive enough to introspect its own source.
+#
+# Pipeline:
+#   string source  -->  this lexer  -->  array of ["KIND", "value"] tokens
+#
+# Scope: identifiers, integer literals, the full keyword table in
+# classify_word, double-quoted string literals, and the single-character
+# punctuation listed in punct_kind. Whitespace and `#` comments are skipped.
+#
+# Not yet handled: float literals, triple-quoted strings, escape
+# sequences, multi-character operators (==, !=, <=, >=, ->, <<, >>).
+# Those are the next milestone.
+# =============================================================================
+
+# ---------------------------------------------------------------------------
+# Character class predicates. Implemented via str_contains over the literal
+# alphabets — no ord() needed (we don't have one). Returns 1 / 0 like
+# canonical Python OMC convention.
+# ---------------------------------------------------------------------------
+fn is_digit(c) {
+    return str_contains("0123456789", c);
+}
+
+fn is_alpha(c) {
+    if str_contains("abcdefghijklmnopqrstuvwxyz", c) == 1 { return 1; }
+    if str_contains("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c) == 1 { return 1; }
+    if c == "_" { return 1; }
+    return 0;
+}
+
+fn is_alnum(c) {
+    if is_alpha(c) == 1 { return 1; }
+    if is_digit(c) == 1 { return 1; }
+    return 0;
+}
+
+fn is_space(c) {
+    if c == " " { return 1; }
+    if c == "\n" { return 1; }
+    if c == "\t" { return 1; }
+    if c == "\r" { return 1; }
+    return 0;
+}
+
+# ---------------------------------------------------------------------------
+# Skip whitespace and `#` comments starting at position `pos`. Returns the
+# new position. This is the only "stateful" helper — we thread the position
+# explicitly because OMC doesn't have mutable references.
+# ---------------------------------------------------------------------------
+fn skip_ws(source, pos) {
+    h n = str_len(source);
+    h p = pos;
+    while p < n {
+        h c = str_slice(source, p, p + 1);
+        if is_space(c) == 1 {
+            p = p + 1;
+        } else {
+            if c == "#" {
+                # Skip to end of line.
+                while p < n {
+                    h cc = str_slice(source, p, p + 1);
+                    if cc == "\n" {
+                        p = p + 1;
+                        break;
+                    }
+                    p = p + 1;
+                }
+            } else {
+                break;
+            }
+        }
+    }
+    return p;
+}
+
+# ---------------------------------------------------------------------------
+# Recognize keywords. Returns the keyword kind string if `word` is a known
+# keyword, or "IDENT" otherwise. Mirrors the OMC lexer's keyword table.
+# ---------------------------------------------------------------------------
+fn classify_word(word) -> string {
+    if word == "h"      { return "H"; }
+    if word == "fn"     { return "FN"; }
+    if word == "if"     { return "IF"; }
+    if word == "else"   { return "ELSE"; }
+    if word == "while"  { return "WHILE"; }
+    if word == "for"    { return "FOR"; }
+    if word == "in"     { return "IN"; }
+    if word == "return" { return "RETURN"; }
+    if word == "break"  { return "BREAK"; }
+    if word == "continue" { return "CONTINUE"; }
+    if word == "print"  { return "PRINT"; }
+    if word == "import" { return "IMPORT"; }
+    if word == "and"    { return "AND"; }
+    if word == "or"     { return "OR"; }
+    if word == "not"    { return "NOT"; }
+    if word == "res"    { return "RES"; }
+    if word == "fold"   { return "FOLD"; }
+    if word == "true"   { return "BOOL"; }
+    if word == "false"  { return "BOOL"; }
+    return "IDENT";
+}
+
+# ---------------------------------------------------------------------------
+# Read an identifier or keyword starting at `pos`. Returns a 3-element
+# array: [kind, value, end_pos].
+# ---------------------------------------------------------------------------
+fn read_ident(source, pos) {
+    h n = str_len(source);
+    h end = pos;
+    while end < n {
+        h c = str_slice(source, end, end + 1);
+        if is_alnum(c) == 1 {
+            end = end + 1;
+        } else {
+            break;
+        }
+    }
+    h word = str_slice(source, pos, end);
+    h kind = classify_word(word);
+    return [kind, word, end];
+}
+
+# ---------------------------------------------------------------------------
+# Read an integer literal. Returns [NUMBER, digits, end_pos].
+# ---------------------------------------------------------------------------
+fn read_number(source, pos) {
+    h n = str_len(source);
+    h end = pos;
+    while end < n {
+        h c = str_slice(source, end, end + 1);
+        if is_digit(c) == 1 {
+            end = end + 1;
+        } else {
+            break;
+        }
+    }
+    h digits = str_slice(source, pos, end);
+    return ["NUMBER", digits, end];
+}
+
+# ---------------------------------------------------------------------------
+# Read a double-quoted string literal. Returns [STRING, content, end_pos].
+# Does NOT handle backslash escapes (next milestone).
+# ---------------------------------------------------------------------------
+fn read_string_literal(source, pos) {
+    h n = str_len(source);
+    h end = pos + 1;     # skip opening quote
+    while end < n {
+        h c = str_slice(source, end, end + 1);
+        if c == "\"" {
+            h content = str_slice(source, pos + 1, end);
+            return ["STRING", content, end + 1];
+        }
+        end = end + 1;
+    }
+    return ["STRING_UNCLOSED", "", end];
+}
+
+# ---------------------------------------------------------------------------
+# Single-character punctuation map. Returns the kind name or empty string.
+# ---------------------------------------------------------------------------
+fn punct_kind(c) -> string {
+    if c == "(" { return "LPAREN"; }
+    if c == ")" { return "RPAREN"; }
+    if c == "{" { return "LBRACE"; }
+    if c == "}" { return "RBRACE"; }
+    if c == "[" { return "LBRACKET"; }
+    if c == "]" { return "RBRACKET"; }
+    if c == ";" { return "SEMI"; }
+    if c == "," { return "COMMA"; }
+    if c == "=" { return "EQ"; }
+    if c == "+" { return "PLUS"; }
+    if c == "-" { return "MINUS"; }
+    if c == "*" { return "STAR"; }
+    if c == "/" { return "SLASH"; }
+    if c == "%" { return "PERCENT"; }
+    if c == "<" { return "LT"; }
+    if c == ">" { return "GT"; }
+    if c == "." { return "DOT"; }
+    if c == ":" { return "COLON"; }
+    if c == "@" { return "AT"; }
+    if c == "&" { return "AMP"; }
+    if c == "|" { return "PIPE"; }
+    if c == "^" { return "CARET"; }
+    if c == "~" { return "TILDE"; }
+    return "";
+}
+
+# ---------------------------------------------------------------------------
+# Tokenize the whole source string. Returns an array of [kind, value]
+# tokens. Position-threading is internal; the caller just gets the
+# completed token stream.
+# ---------------------------------------------------------------------------
+fn tokenize(source) {
+    h n = str_len(source);
+    h tokens = arr_new(0, 0);
+    h pos = 0;
+
+    while pos < n {
+        pos = skip_ws(source, pos);
+        if pos >= n { break; }
+
+        h c = str_slice(source, pos, pos + 1);
+
+        # Identifier or keyword?
+        if is_alpha(c) == 1 {
+            h tok = read_ident(source, pos);
+            arr_push(tokens, [arr_get(tok, 0), arr_get(tok, 1)]);
+            pos = arr_get(tok, 2);
+        } else {
+            # Number?
+            if is_digit(c) == 1 {
+                h tok = read_number(source, pos);
+                arr_push(tokens, [arr_get(tok, 0), arr_get(tok, 1)]);
+                pos = arr_get(tok, 2);
+            } else {
+                # String literal?
+                if c == "\"" {
+                    h tok = read_string_literal(source, pos);
+                    arr_push(tokens, [arr_get(tok, 0), arr_get(tok, 1)]);
+                    pos = arr_get(tok, 2);
+                } else {
+                    # Punctuation.
+                    h kind = punct_kind(c);
+                    if str_len(kind) > 0 {
+                        arr_push(tokens, [kind, c]);
+                        pos = pos + 1;
+                    } else {
+                        # Unknown — emit and skip.
+                        arr_push(tokens, ["UNKNOWN", c]);
+                        pos = pos + 1;
+                    }
+                }
+            }
+        }
+    }
+
+    arr_push(tokens, ["EOF", ""]);
+    return tokens;
+}
+
+# ---------------------------------------------------------------------------
+# Pretty-print a token stream.
+# ---------------------------------------------------------------------------
+fn print_tokens(tokens) {
+    h i = 0;
+    h n = arr_len(tokens);
+    while i < n {
+        h t = arr_get(tokens, i);
+        h kind = arr_get(t, 0);
+        h value = arr_get(t, 1);
+        print(concat_many("  [", i, "] ", kind, " ", value));
+        i = i + 1;
+    }
+}
+
+# ---------------------------------------------------------------------------
+# Drive the lexer on representative inputs.
+# ---------------------------------------------------------------------------
+print("== Self-Hosting Lexer Demo (Phase V, milestone 1) ==");
+print("");
+
+# Test 1: simplest possible OMC program.
+print("--- Input 1: h x = 89; ---");
+h src1 = "h x = 89;";
+h toks1 = tokenize(src1);
+print_tokens(toks1);
+print("");
+
+# Test 2: a function definition with arithmetic.
+print("--- Input 2: fn add(a, b) { return a + b; } ---");
+h src2 = "fn add(a, b) { return a + b; }";
+h toks2 = tokenize(src2);
+print_tokens(toks2);
+print("");
+
+# Test 3: with a comment and a string literal.
+print("--- Input 3: # greet\\nprint(\"hi\"); ---");
+h src3 = "# greet
+print(\"hi\");";
+h toks3 = tokenize(src3);
+print_tokens(toks3);
+print("");
+
+# Test 4: harmonic-flavored — uses res() and a Fibonacci constant.
+print("--- Input 4: h r = res(89); ---");
+h src4 = "h r = res(89);";
+h toks4 = tokenize(src4);
+print_tokens(toks4);
+print("");
+
+print("== Observations ==");
+print("- This lexer runs on the Rust OMC interpreter and emits tokens for");
+print("  programs the SAME interpreter could parse. Self-introspection.");
+print("- Position-threading by return value is verbose but works without");
+print("  mutable references — a real constraint of the language as it stands.");
+print("- Next milestones: multi-char operators (== <= >= !=), float literals,");
+print("  string-escape handling. Then a parser. Then a codegen. Then the");
+print("  fixpoint: OMC-compiled-by-OMC produces the same output as itself.");
diff --git a/omnimcode-core/src/bytecode.rs b/omnimcode-core/src/bytecode.rs
@@ -100,6 +100,15 @@ pub enum Op {
     NewArray(usize),       // pop N items into a new array, push
     ArrayIndex,            // pop index, pop array, push array[index]
     ArrayIndexAssign(String), // pop value, pop index, assign array_var[idx] = value
+    /// Mutating array push: pop one value off the stack and append it
+    /// to the named array variable in the current scope. Emitted by the
+    /// compiler when it sees `arr_push(tokens, expr)` with a literal
+    /// variable as the first argument. Bypasses vm_call_builtin's
+    /// synthetic-arg shim, which would otherwise lose the mutation.
+    ArrPushNamed(String),
+    /// Mutating array store: pop value, pop index, store at named array's
+    /// index. Same rationale as ArrPushNamed.
+    ArrSetNamed(String),
 
     // Special harmonic operations (short-circuit to built-in semantics
     // without the call overhead — these are the hot ones).
diff --git a/omnimcode-core/src/compiler.rs b/omnimcode-core/src/compiler.rs
@@ -324,6 +324,30 @@ impl Compiler {
                 self.emit(Op::Fold1);
             }
             Expression::Call { name, args } => {
+                // Mutating built-ins must be specialized so the VM doesn't
+                // route them through vm_call_builtin's synthetic-arg shim
+                // (which would otherwise lose the mutation — the shim
+                // copies args into __vm_arg_N variables and the built-in
+                // mutates the COPY).
+                if !self.user_fns.contains(name) {
+                    if name == "arr_push" && args.len() == 2 {
+                        if let Expression::Variable(arr_name) = &args[0] {
+                            // value first → on stack; then the named push.
+                            self.compile_expr(&args[1])?;
+                            self.emit(Op::ArrPushNamed(arr_name.clone()));
+                            return Ok(());
+                        }
+                    }
+                    if name == "arr_set" && args.len() == 3 {
+                        if let Expression::Variable(arr_name) = &args[0] {
+                            // index first, then value → stack top is value, index beneath
+                            self.compile_expr(&args[1])?; // index
+                            self.compile_expr(&args[2])?; // value
+                            self.emit(Op::ArrSetNamed(arr_name.clone()));
+                            return Ok(());
+                        }
+                    }
+                }
                 // Fast-path inline for hot harmonic ops — avoids the Call -> bridge
                 // -> stdlib lookup overhead. Only inline when the user HASN'T
                 // redefined the name (preserves recursion-by-shadowing).
diff --git a/omnimcode-core/src/disasm.rs b/omnimcode-core/src/disasm.rs
@@ -73,6 +73,8 @@ fn op_mnemonic(op: &Op, ip: usize, constants: &[Const]) -> String {
         Op::NewArray(n) => format!("NEW_ARRAY    {}", n),
         Op::ArrayIndex => "ARRAY_INDEX".to_string(),
         Op::ArrayIndexAssign(name) => format!("ARRAY_INDEX_ASSIGN {}", name),
+        Op::ArrPushNamed(name) => format!("ARR_PUSH_NAMED  {}", name),
+        Op::ArrSetNamed(name) => format!("ARR_SET_NAMED   {}", name),
 
         Op::Resonance => "RESONANCE".to_string(),
         Op::Fold1 => "FOLD".to_string(),
diff --git a/omnimcode-core/src/interpreter.rs b/omnimcode-core/src/interpreter.rs
diff --git a/omnimcode-core/src/vm.rs b/omnimcode-core/src/vm.rs