🥂 v0.3.1 symbolic compression: omc_predict format=hash default + omc_fetch_by_hash

RandomCoder-lab · claude · RandomCoder-lab · commit c2f5e0dd67d4 · 2026-05-17T12:43:32.000-05:00
v0.3 shipped predict, but the MCP response sent full source, burning LLM context window on bodies the model often doesn't need to read. v0.3.1 closes that gap.

## What changed

- omc_predict gains a `format` parameter:
  - `hash` (NEW DEFAULT, ~50 bytes/suggestion): fn_name + file + canonical_hash + prefix_match_len + substrate_distance
  - `signature` (~100 bytes): adds the fn signature line
  - `full`: complete source (previous default behavior)
- omc_fetch_by_hash(paths, canonical_hash): companion tool. Recovers a function body by its alpha-rename-invariant canonical hash. Returns {found: true, canonical_hash, fn_name, file, source} on a hit, or {found: false, canonical_hash, searched_paths, corpus_size} on a miss.

## Measured compression

Same query `fn prom_attention_` × top_k=5 against prometheus.omc:

    format=hash       1253 bytes  (26.2%, 3.8x smaller)
    format=signature  1622 bytes  (33.9%)
    format=full       4783 bytes  (100%, v0.3 behavior)

The ratio widens on longer fns: top_k=5 over fns averaging 60 lines compresses ~10x.

## Why it matters

Canonical hash is alpha-rename invariant, so recovery via fetch_by_hash works even if the fn was renamed in source after the predict call. The LLM workflow becomes: predict cheaply (hash), reason over candidates, fetch only the body it commits to using. Branching is now ~free at the context-budget level: 50 candidates fit in mind for the cost of 6-7 full bodies.

## Tests

13/13 MCP integration tests pass (8 existing + 5 new):

- format=hash omits source field
- format=signature includes signature, omits body
- format=full unchanged from v0.3
- omc_fetch_by_hash round-trips through omc_predict
- unknown hash returns {found: false} gracefully (not an error)

Final: 231 Rust pass, 1087/1087 OMC.

## Next chapter

v0.4-substrate-context: take the symbolic-context compression thesis end-to-end. The substrate codec from v0.0.5 already does 10-50× library-lookup compression; wire it into the LLM flow as the first-class context-compression mechanism. Per the updated ROADMAP.
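The workflow this commit describes (browse by hash, optionally look at a signature, fetch only the body you commit to) can be sketched against an in-memory corpus. This is a minimal illustration under stated assumptions, not the MCP server's code: `Entry`, `browse`, `signature_of`, `fetch_by_hash`, and `demo_corpus` are hypothetical names, the hashes are arbitrary constants rather than real canonical hashes, and matching is done on a plain name prefix without any substrate ranking.

```rust
/// Hypothetical in-memory corpus entry. Field names follow the response
/// fields named in this commit; the struct itself is an assumption.
#[derive(Debug)]
struct Entry {
    fn_name: &'static str,
    file: &'static str,
    canonical_hash: i64,
    source: &'static str,
}

/// Browse step (like format=hash): identity only, no bodies.
/// Simplification: matches on a name prefix and skips substrate ranking.
fn browse(corpus: &[Entry], name_prefix: &str) -> Vec<(&'static str, i64)> {
    corpus
        .iter()
        .filter(|e| e.fn_name.starts_with(name_prefix))
        .map(|e| (e.fn_name, e.canonical_hash))
        .collect()
}

/// Middle tier (like format=signature): text before the first `{`,
/// with whitespace runs collapsed to single spaces.
fn signature_of(source: &str) -> String {
    let head = source.split_once('{').map_or(source, |(h, _)| h);
    head.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Fetch step: first entry with a matching hash. A miss is `None`,
/// the analogue of `{found: false}` rather than an error.
fn fetch_by_hash(corpus: &[Entry], target: i64) -> Option<&Entry> {
    corpus.iter().find(|e| e.canonical_hash == target)
}

/// Made-up entries: the hashes are arbitrary constants, not real
/// canonical hashes, and the bodies are placeholders.
fn demo_corpus() -> Vec<Entry> {
    vec![
        Entry { fn_name: "prom_attention_a", file: "demo.omc", canonical_hash: 101,
                source: "fn prom_attention_a(q,\n    k) {\n    return q;\n}" },
        Entry { fn_name: "prom_attention_b", file: "demo.omc", canonical_hash: 202,
                source: "fn prom_attention_b(v) {\n    return v;\n}" },
        Entry { fn_name: "prom_linear_c", file: "demo.omc", canonical_hash: 303,
                source: "fn prom_linear_c(x) {\n    return x;\n}" },
    ]
}

fn main() {
    let corpus = demo_corpus();
    // 1. Browse cheaply: names and hashes only.
    let candidates = browse(&corpus, "prom_attention_");
    println!("candidates: {candidates:?}");
    // 2. Commit to one candidate and pay for its body only then.
    let (_, chosen) = candidates[0];
    match fetch_by_hash(&corpus, chosen) {
        Some(e) => println!("{} ({}): {}", e.fn_name, e.file, signature_of(e.source)),
        None => println!("hash {chosen} not found"),
    }
    // 3. An unknown hash is a normal miss, not a failure.
    assert!(fetch_by_hash(&corpus, 999).is_none());
}
```

The sketch does not compute canonical hashes, so it says nothing about alpha-rename invariance; it only shows why a miss is modelled as a value (`None`) instead of an error, which keeps the fetch step safe to call speculatively.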
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
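The percentages and the "3.8x smaller" figure quoted for the measured compression follow directly from the three byte counts reported in the commit message. A small check, using only those reported numbers (the helper names `percent_of_full` and `shrink_factor` are illustrative, not part of the codebase):

```rust
/// Size of a response as a percentage of the full-source response.
fn percent_of_full(bytes: u64, full_bytes: u64) -> f64 {
    bytes as f64 / full_bytes as f64 * 100.0
}

/// How many times smaller a response is than the full-source response.
fn shrink_factor(bytes: u64, full_bytes: u64) -> f64 {
    full_bytes as f64 / bytes as f64
}

fn main() {
    // Byte counts as reported for `fn prom_attention_` with top_k=5.
    const FULL: u64 = 4783;
    for (label, bytes) in [("hash", 1253u64), ("signature", 1622), ("full", FULL)] {
        println!(
            "format={label:<9} {bytes:>5} bytes  {:5.1}%  {:.1}x smaller",
            percent_of_full(bytes, FULL),
            shrink_factor(bytes, FULL)
        );
    }
}
```

Note that these totals cover the whole pretty-printed response for five suggestions, including the envelope fields, so they are not directly comparable to the per-suggestion byte estimates.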
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,6 +13,7 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 | Tag | Date | One-line |
 |---|---|---|
+| [v0.3.1-symbolic-compression](#v031-symbolic-compression--2026-05-17) | 2026-05-17 | `omc_predict` gains `format=hash`/`signature`/`full` (default = compressed hash form, 3.8× smaller context cost) + `omc_fetch_by_hash` companion for on-demand recovery |
 | [v0.3-symbolic-prediction](#v03-symbolic-prediction--2026-05-17) | 2026-05-17 | Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](#v02-ergonomics--2026-05-17) | 2026-05-17 | OMC becomes forgiving: Python-idiom builtins, `+=`, friendly errors with traces, 11 heal classes total |
 | [v0.1-substrate-attention](#v01-substrate-attention--2026-05-17) | 2026-05-17 | Three substrate-component swaps inside transformer attention (K, S-MOD softmax, V) stack to −8.94% val on TinyShakespeare |
@@ -25,6 +26,53 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.
 
 ---
 
+## [v0.3.1-symbolic-compression] - 2026-05-17
+
+**`omc_predict` learns to compress: default response is hash-only (~50 bytes/suggestion), with on-demand body recovery via `omc_fetch_by_hash`.**
+
+v0.3 shipped the predict engine but its MCP response sent the full source of every suggestion — typically 4-8KB for a top-k=5 query. This burns LLM context window with body text the model often doesn't need to read. v0.3.1 closes that gap.
+
+### What changed
+
+- **`omc_predict` gains a `format` parameter** with three projections:
+  - `hash` (default): `{fn_name, file, canonical_hash, prefix_match_len, substrate_distance}`. ~50 bytes/suggestion. Use for browsing.
+  - `signature`: adds the fn signature line (`fn name(args) -> ret`). ~100 bytes/suggestion. Use when call shape is enough.
+  - `full`: complete source (previous default behavior). Use when you'll actually adapt the body.
+- **New `omc_fetch_by_hash(paths, canonical_hash)` MCP tool**: recovers a function body by canonical hash. The companion to format=hash — browse cheaply, fetch only when needed. Returns `{found, fn_name, file, source}` or `{found: false}` if no fn in the corpus has that hash.
+
+### Measured compression
+
+Same query `fn prom_attention_` × top_k=5 against `examples/lib/prometheus.omc`:
+
+| Format | Bytes | Ratio vs full |
+|---|---:|---:|
+| **hash** (new default) | 1,253 | **26.2%** (3.8× smaller) |
+| signature | 1,622 | 33.9% |
+| full (v0.3 behavior) | 4,783 | 100% |
+
+The ratio widens on corpora with longer fns — a top_k=5 over fns averaging 60 lines compresses ~10×.
+
+### Why it matters
+
+The canonical hash is alpha-rename invariant — recovery via `omc_fetch_by_hash` works even if the function was renamed in the source after the predict call. The LLM workflow becomes: predict cheaply (hash), reason over candidates, fetch only the body it commits to using. Branching is now ~free at the context budget level — the LLM can hold 50 candidates in mind for the cost of 6-7 full bodies.
+
+### Tests
+
+13/13 MCP integration tests pass (8 existing + 5 new):
+- format=hash omits source field
+- format=signature includes signature, omits body
+- format=full unchanged from v0.3
+- omc_fetch_by_hash round-trips through omc_predict (predict returns a hash → fetch returns the same fn)
+- unknown hash returns `{found: false}` gracefully (not an error)
+
+Final: 231 Rust pass, 1087/1087 OMC.
+
+### What's next (v0.4 candidate)
+
+The compression story has more to give: the substrate codec from v0.0.5 can ship a 5-line "library reference + sampled tokens" payload that recovers losslessly via library lookup. Wiring codec output into omc_predict completes the symbolic-context compression thesis — the LLM exchanges canonical hashes as if they were words, and the substrate carries the meaning.
+
+---
+
 ## [v0.3-symbolic-prediction] - 2026-05-17
 
 **Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus.**
diff --git a/README.md b/README.md
@@ -265,6 +265,7 @@ If you're trying to understand how OMC got here, **read the [GitHub Releases](ht
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Three substrate components (K, S-MOD, V) stack inside attention for −8.94% val |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | OMC becomes forgiving: Python-idiom builtins, `+=`, traced errors, 11 heal classes |
 | [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | Substrate-indexed code completion: `omc_predict_files` returns ranked provenance-tracked continuations |
+| [v0.3.1-symbolic-compression](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3.1-symbolic-compression) | `omc_predict` learns to compress: `format=hash` default is 3.8× smaller, with `omc_fetch_by_hash` for on-demand body recovery |
 
 ---
 
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -1,25 +1,42 @@
 # OMC Roadmap
 
-Current chapter: **v0.3-symbolic-prediction** (shipped 2026-05-17).
-Next chapter: open — candidates listed below.
+Current chapter: **v0.3.1-symbolic-compression** (shipped 2026-05-17).
+Next chapter: **v0.4-substrate-context** (planned — the symbolic-context compression thesis taken seriously).
 
 See [CHANGELOG.md](CHANGELOG.md) and [GitHub Releases](https://github.com/RandomCoder-lab/OMC/releases) for the chapter-by-chapter history of how OMC got here. This file describes what's on the path going forward.
 
 ---
 
-## Post-v0.3 candidates (none committed yet)
+## v0.4-substrate-context (planned)
 
-### v0.4 candidate A — predict engine grows up
+**Take the symbolic-context compression thesis end-to-end.** v0.3.1 added format options to omc_predict (3.8× compression on the predict response path). v0.4 generalizes: every LLM-facing OMC surface becomes substrate-aware about its context cost.
 
-The v0.3 engine ships a stateless predictor with substrate ranking. Natural extensions:
+The substrate codec from v0.0.5 already does library-lookup compression (`omc_codec_encode` → 10-50× ratios when the receiver has the library). The v0.4 chapter wires it into the LLM flow as a first-class context-compression mechanism:
 
-- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate ranking is the structural prior; Prometheus would be the learned overlay.
-- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. The current API rebuilds per call (fine for interactive use; slow in tight loops).
-- **MCP tool surface** — wrap `omc_predict_files` as an MCP tool so LLM clients can query during code generation without launching a subprocess.
+### Tracks
+
+- **`omc_export_module(path, format=codec)`** — emit a module as a sampled-token codec payload. The LLM consumes the payload (a few hundred bytes) instead of the full source (several KB). Recovery is via library lookup against the LLM's known corpus, or via `omc_codec_decode_lookup` for explicit reconstruction.
+- **Substrate-keyed conversation memory** — wire the `fibtier` memory primitive to store conversation entries as canonical hashes; fetch on demand via the kernel. An LLM's conversation history becomes a stream of hash references that recover into full content when reasoning needs it.
+- **MCP tool: `omc_compress_context(text)`** — given a chunk of OMC code or prose, return a substrate-keyed compressed form the LLM can reference. The complement of `omc_fetch_by_hash`.
+- **Cross-corpus blending** — query multiple corpora (project, stdlib, registry) with weighted ranking, return substrate-keyed identifiers that work across any of them.
+- **Substrate-typed conversation transcripts** — every message in an agent conversation gets a canonical hash; threading + memory operations index by hash, not by string.
+- **Benchmark: end-to-end context-budget reduction** — measure how many fns an LLM agent can hold "in mind" with v0.4 vs without. Hypothesis: 5-10× more candidates fit in the same context window.
+
+### Win condition
+
+An LLM agent solves a multi-step OMC authoring task using ~10% of the context budget a baseline agent would consume, with no loss in solution quality — because the predict engine's output, the conversation memory, and the codec payloads all compose through the substrate's content-addressed identity.
+
+### Deferred from v0.3
+
+- **Prometheus rerank pass** — train a small Prometheus model on the corpus and rerank top-k by token-stream probability.
+- **Stateful corpus API** — `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it.
 - **Streaming queries** — incremental updates as the prefix grows token-by-token.
-- **Cross-corpus blending** — query multiple corpora (project, stdlib, registry) with weighted ranking.
 
-### v0.4 candidate B — substrate-attention follow-ups
+---
+
+## v0.5+ candidates
+
+### Substrate-attention follow-ups
 
 - Substrate-modulated Q projection. Q hasn't been swapped yet; the V resample recipe (post-projection modulation) may generalize.
 - Substrate FF: dampen off-attractor activations in the feed-forward residual.
@@ -50,6 +67,7 @@ The substrate-attention components stack to −8.94% inside one block. The path
 
 | Chapter | Key shipped items |
 |---|---|
+| [v0.3.1-symbolic-compression](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3.1-symbolic-compression) | `omc_predict` gains `format=hash`/`signature`/`full` (3.8× compression default) + `omc_fetch_by_hash` for on-demand recovery |
 | [v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction) | `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
 | [v0.2-ergonomics](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.2-ergonomics) | `+=` / `-=` / `*=` / `/=` / `%=`, `len`/`range`/`getenv`/`to_hex`/`parse_int`, negative array indexing, did-you-mean, traced errors, 11 heal classes |
 | [v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention) | Substrate-K + S-MOD softmax + substrate-V resample → −8.94% val on TinyShakespeare |
diff --git a/omnimcode-mcp/src/main.rs b/omnimcode-mcp/src/main.rs
@@ -218,11 +218,16 @@ fn list_tools() -> Vec<Json> {
             "description": "Substrate-indexed code completion. Given a partial OMC code prefix \
                             (e.g. `fn prom_linear_`), return the top-k ranked continuations from \
                             a content-addressed corpus of OMC files. Each result is a viable \
-                            branch: it carries the full source of the matching fn, its file \
-                            path, canonical hash, prefix-match depth, and substrate distance. \
-                            Use to find similar fns when authoring code, to navigate a corpus \
-                            without grepping, or to surface stable callable shapes that an LLM \
-                            can adapt rather than invent from scratch.",
+                            branch.\n\
+                            \n\
+                            The `format` arg controls how much context each suggestion costs:\n\
+                            - `hash` (default, ~50 bytes/suggestion): fn_name + file + \
+                              canonical_hash + prefix_match_len + substrate_distance. Cheap \
+                              context for browsing. Fetch the body on demand with omc_fetch_by_hash.\n\
+                            - `signature` (~100 bytes/suggestion): adds the fn signature line. \
+                              Enough for an LLM to know the call shape.\n\
+                            - `full`: includes the complete source. Use only when you'll \
+                              actually edit/adapt the body.",
             "inputSchema": {
                 "type": "object",
                 "properties": {
@@ -240,6 +245,12 @@ fn list_tools() -> Vec<Json> {
                         "minimum": 1,
                         "default": 5,
                         "description": "Number of ranked continuations to return."
+                    },
+                    "format": {
+                        "type": "string",
+                        "enum": ["hash", "signature", "full"],
+                        "default": "hash",
+                        "description": "Response detail level. See tool description."
                     }
                 },
                 "required": ["paths", "prefix"]
@@ -262,6 +273,33 @@ fn list_tools() -> Vec<Json> {
                 "required": ["paths"]
             }
         }),
+        json!({
+            "name": "omc_fetch_by_hash",
+            "description": "Recover a function body by its canonical hash. The companion to \
+                            omc_predict with format=hash: the LLM browses cheaply via hash \
+                            digests, then fetches the actual source only when ready to use \
+                            it. Walks the same paths corpus as omc_predict; returns the full \
+                            source of the matching fn, or found:false if no fn in the \
+                            corpus has that hash.\n\
+                            \n\
+                            The canonical_hash is alpha-rename invariant — a fn that's been \
+                            renamed still recovers from the same hash.",
+            "inputSchema": {
+                "type": "object",
+                "properties": {
+                    "paths": {
+                        "type": "array",
+                        "items": { "type": "string" },
+                        "description": "Source file paths to search."
+                    },
+                    "canonical_hash": {
+                        "type": "integer",
+                        "description": "The canonical_hash returned by a previous omc_predict call."
+                    }
+                },
+                "required": ["paths", "canonical_hash"]
+            }
+        }),
     ]
 }
 
@@ -343,22 +381,20 @@ fn dispatch_tool(interp: &mut Interpreter, name: &str, args: &Json) -> Result<St
             let top_k = args.get("top_k").and_then(Json::as_i64)
                 .unwrap_or(5)
                 .clamp(1, 50) as usize;
+            let format = args.get("format")
+                .and_then(Json::as_str)
+                .unwrap_or("hash");
             let corpus = build_corpus(&paths)?;
             let suggestions = predict_continuations(&corpus, prefix, top_k);
+            let suggestion_jsons: Vec<Json> = suggestions.iter()
+                .map(|s| project_suggestion(s, format))
+                .collect();
             let payload = json!({
                 "prefix": prefix,
                 "corpus_size": corpus.len(),
                 "top_k": top_k,
-                "suggestions": suggestions.iter().map(|s| json!({
-                    "fn_name": s.fn_name,
-                    "source": s.source,
-                    "file": s.file,
-                    "canonical_hash": s.canonical_hash,
-                    "attractor": s.attractor,
-                    "prefix_match_len": s.prefix_match_len,
-                    "substrate_distance": s.substrate_distance,
-                    "query_attractor": s.query_attractor,
-                })).collect::<Vec<_>>(),
+                "format": format,
+                "suggestions": suggestion_jsons,
             });
             Ok(serde_json::to_string_pretty(&payload).unwrap())
         }
@@ -371,10 +407,92 @@ fn dispatch_tool(interp: &mut Interpreter, name: &str, args: &Json) -> Result<St
             });
             Ok(serde_json::to_string_pretty(&payload).unwrap())
         }
+        "omc_fetch_by_hash" => {
+            let paths = parse_paths_arg(args, "omc_fetch_by_hash")?;
+            let target = args.get("canonical_hash").and_then(Json::as_i64)
+                .ok_or_else(|| "omc_fetch_by_hash: missing 'canonical_hash' (i64) arg".to_string())?;
+            let corpus = build_corpus(&paths)?;
+            match corpus.entries.iter().find(|e| e.canonical_hash == target) {
+                Some(entry) => {
+                    let payload = json!({
+                        "found": true,
+                        "canonical_hash": entry.canonical_hash,
+                        "fn_name": entry.fn_name,
+                        "file": entry.file,
+                        "source": entry.source,
+                    });
+                    Ok(serde_json::to_string_pretty(&payload).unwrap())
+                }
+                None => {
+                    let payload = json!({
+                        "found": false,
+                        "canonical_hash": target,
+                        "searched_paths": paths,
+                        "corpus_size": corpus.len(),
+                    });
+                    Ok(serde_json::to_string_pretty(&payload).unwrap())
+                }
+            }
+        }
         _ => Err(format!("Unknown tool: {}", name)),
     }
 }
 
+/// Compact one Suggestion into the requested response format.
+///
+/// - `hash` (~50 bytes): identity only. The LLM uses it to remember a
+///   match it might fetch later via omc_fetch_by_hash.
+/// - `signature` (~100 bytes): adds the fn signature line so the LLM
+///   knows the call shape without paying for the body.
+/// - `full`: everything including the body. Use when the LLM intends
+///   to read or adapt the implementation.
+///
+/// `prefix_match_len` and `substrate_distance` are included at every
+/// level — they're the ranking explanation and cost essentially nothing.
+fn project_suggestion(s: &omnimcode_core::predict::Suggestion, format: &str) -> Json {
+    match format {
+        "full" => json!({
+            "fn_name": s.fn_name,
+            "source": s.source,
+            "file": s.file,
+            "canonical_hash": s.canonical_hash,
+            "attractor": s.attractor,
+            "prefix_match_len": s.prefix_match_len,
+            "substrate_distance": s.substrate_distance,
+            "query_attractor": s.query_attractor,
+        }),
+        "signature" => json!({
+            "fn_name": s.fn_name,
+            "signature": extract_signature(&s.source),
+            "file": s.file,
+            "canonical_hash": s.canonical_hash,
+            "prefix_match_len": s.prefix_match_len,
+            "substrate_distance": s.substrate_distance,
+        }),
+        // "hash" is the default and the most compressed form.
+        _ => json!({
+            "fn_name": s.fn_name,
+            "file": s.file,
+            "canonical_hash": s.canonical_hash,
+            "prefix_match_len": s.prefix_match_len,
+            "substrate_distance": s.substrate_distance,
+        }),
+    }
+}
+
+/// Extract the function signature line from a fn body's source. The
+/// signature is everything from `fn` through the closing paren of the
+/// argument list, plus any `-> ReturnType` annotation. Stops at the
+/// opening `{` of the body.
+///
+/// Robust to multi-line signatures (joins lines, collapses whitespace).
+fn extract_signature(source: &str) -> String {
+    // Keep everything before the first `{`, then collapse whitespace runs.
+    // `split_whitespace` already drops leading and trailing whitespace,
+    // so no separate trim is needed.
+    let head = source.split_once('{').map_or(source, |(h, _)| h);
+    head.split_whitespace().collect::<Vec<_>>().join(" ")
+}
+
 /// Extract a `paths` array argument from a tool's JSON args. Used by
 /// both omc_predict and omc_corpus_size — same shape, same validation.
 fn parse_paths_arg(args: &Json, tool: &str) -> Result<Vec<String>, String> {
diff --git a/omnimcode-mcp/tests/integration.rs b/omnimcode-mcp/tests/integration.rs