v0.3 shipped predict but the MCP response sent full source — burning
LLM context window on bodies the model often doesn't need to read.
v0.3.1 closes that gap.
## What changed
- omc_predict gains a `format` parameter:
  - `hash` (NEW DEFAULT, ~50 bytes/suggestion): fn_name + file +
    canonical_hash + prefix_match_len + substrate_distance
  - `signature` (~100 bytes): adds the fn signature line
  - `full`: complete source (previous default behavior)
- omc_fetch_by_hash(paths, canonical_hash) — companion tool. Recovers
a function body by its alpha-rename-invariant canonical hash.
Returns {found, fn_name, file, source} or {found: false}.
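The three projections and the fetch companion can be pictured with a small in-memory sketch. Everything in it is illustrative: `project`, `fetch_by_hash`, the `CORPUS` record, the hash string and the function name are hypothetical stand-ins, not OMC's implementation. Only the field names and the `{found, ...}` response shapes are taken from the notes above, and the sketch assumes that `full` simply returns every field.

```python
# Illustrative sketch only: models the described response shapes with dicts.
# The real tools are omc_predict and omc_fetch_by_hash, served over MCP.

# Hypothetical corpus record; the name, hash and body are made up.
CORPUS = [
    {
        "fn_name": "prom_attention_scores",
        "file": "prometheus.omc",
        "canonical_hash": "a1b2c3",
        "prefix_match_len": 18,
        "substrate_distance": 0,
        "signature": "fn prom_attention_scores(q, k)",
        "source": "fn prom_attention_scores(q, k) { ... }",
    },
]

HASH_FIELDS = ("fn_name", "file", "canonical_hash",
               "prefix_match_len", "substrate_distance")


def project(record, fmt="hash"):
    """Return the view of one suggestion for the requested format."""
    if fmt == "hash":
        return {key: record[key] for key in HASH_FIELDS}
    if fmt == "signature":
        return {key: record[key] for key in HASH_FIELDS + ("signature",)}
    if fmt == "full":
        return dict(record)  # assumption: full carries every field
    raise ValueError(f"unknown format: {fmt!r}")


def fetch_by_hash(corpus, canonical_hash):
    """Recover a body by hash; a miss is a normal result, not an error."""
    for record in corpus:
        if record["canonical_hash"] == canonical_hash:
            return {"found": True, "fn_name": record["fn_name"],
                    "file": record["file"], "source": record["source"]}
    return {"found": False}


# Browse cheaply, then fetch only the body that is actually needed.
suggestion = project(CORPUS[0], "hash")
body = fetch_by_hash(CORPUS, suggestion["canonical_hash"])
miss = fetch_by_hash(CORPUS, "no-such-hash")
```

In this sketch the caller keeps only the hash view while comparing candidates and pays for a body once, at the point it commits to one.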
## Measured compression
Same query `fn prom_attention_` × top_k=5 against prometheus.omc:
- format=hash: 1253 bytes (26.2%, 3.8x smaller)
- format=signature: 1622 bytes (33.9%)
- format=full: 4783 bytes (100%, v0.3 behavior)
The ratio widens on longer fns — top_k=5 over fns averaging 60 lines
compresses ~10x.
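As a quick check, the percentages and the 3.8x factor follow directly from the three byte counts. The short script below recomputes them and also derives an average per-suggestion cost by dividing each whole response by top_k. That last column is a derived figure, not something the measurements report.

```python
# Byte counts reported above for the top_k=5 query.
measured = {"hash": 1253, "signature": 1622, "full": 4783}
top_k = 5
full = measured["full"]

rows = {}
for fmt, size in measured.items():
    share = 100 * size / full       # percent of the full response
    factor = full / size            # how many times smaller than full
    per_suggestion = size / top_k   # average bytes per suggestion
    rows[fmt] = (share, factor, per_suggestion)
    print(f"{fmt:<10} {size:>5} B  {share:5.1f}%  {factor:4.1f}x  "
          f"{per_suggestion:6.1f} B/suggestion")
```

The whole-response averages (about 251 bytes per suggestion for `hash`) are larger than the ~50 bytes/suggestion estimate. One possible reading, which is an assumption here, is that the estimate counts only the hash fields and not the rest of the response.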
## Why it matters
Canonical hash is alpha-rename invariant — recovery via fetch_by_hash
works even if the fn was renamed in source after the predict call.
The LLM workflow becomes: predict cheaply (hash), reason over
candidates, fetch only the body it commits to using. Branching is
now ~free at the context-budget level — 50 candidates fit in mind
for the cost of 6-7 full bodies.
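To make "alpha-rename invariant" concrete, here is a toy sketch of one way such a hash can be built: rename every identifier to a positional placeholder in first-seen order, then hash the result. This is not OMC's canonicalization, which is not shown here. The tiny `fn` syntax, the keyword set and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import re

# Keywords of a toy fn syntax used only in this example (an assumption).
KEYWORDS = {"fn", "return"}
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def canonical_hash(source: str) -> str:
    """Hash source after renaming identifiers to positional placeholders."""
    names = {}

    def rename(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        # First-seen order picks the placeholder, so a consistent rename
        # in the source produces the same canonical text.
        return names.setdefault(word, f"_{len(names)}")

    canonical = IDENT.sub(rename, source)
    canonical = " ".join(canonical.split())  # ignore whitespace differences
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


original = "fn add(a, b) { return a + b; }"
renamed = "fn sum(x, y) { return x + y; }"    # same structure, new names
different = "fn add(a, b) { return a - b; }"  # different body

same_after_rename = canonical_hash(original) == canonical_hash(renamed)
changes_with_body = canonical_hash(original) != canonical_hash(different)
```

Under this scheme a hash captured at predict time still resolves after the function or its parameters are renamed, while a change to the body yields a different hash.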
## Tests
13/13 MCP integration tests pass (8 existing + 5 new):
- format=hash omits source field
- format=signature includes signature, omits body
- format=full unchanged from v0.3
- omc_fetch_by_hash round-trips through omc_predict
- unknown hash returns {found: false} gracefully (not an error)
Final: 231 Rust tests pass, 1087/1087 OMC tests pass.
## Next chapter
v0.4-substrate-context: take the symbolic-context compression thesis
end-to-end. The substrate codec from v0.0.5 already does 10-50×
library-lookup compression; v0.4 wires it into the LLM flow as a
first-class context-compression mechanism, per the updated ROADMAP.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|[v0.3-symbolic-prediction](#v03-symbolic-prediction--2026-05-17)| 2026-05-17 | Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |
|[v0.2-ergonomics](#v02-ergonomics--2026-05-17)| 2026-05-17 | OMC becomes forgiving: Python-idiom builtins, `+=`, friendly errors with traces, 11 heal classes total |
|[v0.1-substrate-attention](#v01-substrate-attention--2026-05-17)| 2026-05-17 | Three substrate-component swaps inside transformer attention (K, S-MOD softmax, V) stack to −8.94% val on TinyShakespeare |
@@ -25,6 +26,53 @@ Read top-to-bottom for the arc; jump to any chapter for the detail.

 ---

+## [v0.3.1-symbolic-compression] - 2026-05-17
+
+**`omc_predict` learns to compress: default response is hash-only (~50 bytes/suggestion), with on-demand body recovery via `omc_fetch_by_hash`.**
+
+v0.3 shipped the predict engine, but its MCP response sent the full source of every suggestion, typically 4-8KB for a top_k=5 query. That spends LLM context window on body text the model often doesn't need to read. v0.3.1 closes that gap.
+
+### What changed
+
+- **`omc_predict` gains a `format` parameter** with three projections:
+  - `hash` (default): `{fn_name, file, canonical_hash, prefix_match_len, substrate_distance}`. ~50 bytes/suggestion. Use for browsing.
+  - `signature`: adds the fn signature line (`fn name(args) -> ret`). ~100 bytes/suggestion. Use when call shape is enough.
+  - `full`: complete source (previous default behavior). Use when you'll actually adapt the body.
+- **New `omc_fetch_by_hash(paths, canonical_hash)` MCP tool**: recovers a function body by canonical hash. It is the companion to `format=hash`: browse cheaply, fetch only when needed. Returns `{found, fn_name, file, source}` or `{found: false}` if no fn in the corpus has that hash.
+
+### Measured compression
+
+Same query `fn prom_attention_` × top_k=5 against `examples/lib/prometheus.omc`:
+
+| Format | Bytes | Ratio vs full |
+|---|---:|---:|
+| **hash** (new default) | 1,253 | **26.2%** (3.8× smaller) |
+| signature | 1,622 | 33.9% |
+| full (v0.3 behavior) | 4,783 | 100% |
+
+The ratio widens on corpora with longer fns: a top_k=5 query over fns averaging 60 lines compresses ~10×.
+
+### Why it matters
+
+The canonical hash is alpha-rename invariant, so recovery via `omc_fetch_by_hash` works even if the function was renamed in the source after the predict call. The LLM workflow becomes: predict cheaply (hash), reason over candidates, fetch only the body it commits to using. Branching is now ~free at the context-budget level: the LLM can hold 50 candidates in mind for the cost of 6-7 full bodies.
+
+### Tests
+
+13/13 MCP integration tests pass (8 existing + 5 new):
+- format=hash omits source field
+- format=signature includes signature, omits body
+- format=full unchanged from v0.3
+- omc_fetch_by_hash round-trips through omc_predict (predict returns a hash → fetch returns the same fn)
+- unknown hash returns `{found: false}` gracefully (not an error)
+
+Final: 231 Rust tests pass, 1087/1087 OMC tests pass.
+
+### What's next (v0.4 candidate)
+
+The compression story has more to give: the substrate codec from v0.0.5 can ship a 5-line "library reference + sampled tokens" payload that recovers losslessly via library lookup. Wiring codec output into `omc_predict` completes the symbolic-context compression thesis: the LLM exchanges canonical hashes as if they were words, and the substrate carries the meaning.
+
+---
+
 ## [v0.3-symbolic-prediction] - 2026-05-17

 **Substrate-indexed code completion: `omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus.**
README.md (1 addition & 0 deletions)
@@ -265,6 +265,7 @@ If you're trying to understand how OMC got here, **read the [GitHub Releases](ht
|[v0.1-substrate-attention](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.1-substrate-attention)| Three substrate components (K, S-MOD, V) stack inside attention for −8.94% val |
|[v0.3.1-symbolic-compression](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3.1-symbolic-compression)|`omc_predict` learns to compress: `format=hash` default is 3.8× smaller, with `omc_fetch_by_hash` for on-demand body recovery |
ROADMAP.md (28 additions & 10 deletions)
@@ -1,25 +1,42 @@
 # OMC Roadmap

-Current chapter: **v0.3-symbolic-prediction** (shipped 2026-05-17).
-Next chapter: open, candidates listed below.
+Current chapter: **v0.3.1-symbolic-compression** (shipped 2026-05-17).
+Next chapter: **v0.4-substrate-context** (planned: the symbolic-context compression thesis taken seriously).

 See [CHANGELOG.md](CHANGELOG.md) and [GitHub Releases](https://github.com/RandomCoder-lab/OMC/releases) for the chapter-by-chapter history of how OMC got here. This file describes what's on the path going forward.

 ---

-## Post-v0.3 candidates (none committed yet)
+## v0.4-substrate-context (planned)

-### v0.4 candidate A: predict engine grows up
+**Take the symbolic-context compression thesis end-to-end.** v0.3.1 added format options to `omc_predict` (3.8× compression on the predict response path). v0.4 generalizes: every LLM-facing OMC surface becomes substrate-aware about its context cost.

-The v0.3 engine ships a stateless predictor with substrate ranking. Natural extensions:
+The substrate codec from v0.0.5 already does library-lookup compression (`omc_codec_encode` → 10-50× ratios when the receiver has the library). The v0.4 chapter wires it into the LLM flow as a first-class context-compression mechanism:

-- **Prometheus rerank pass**: train a small Prometheus model on the corpus and rerank top-k by token-stream probability. Substrate ranking is the structural prior; Prometheus would be the learned overlay.
-- **Stateful corpus API**: `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it. The current API rebuilds per call (fine for interactive use; slow in tight loops).
-- **MCP tool surface**: wrap `omc_predict_files` as an MCP tool so LLM clients can query during code generation without launching a subprocess.
+### Tracks
+
+- **`omc_export_module(path, format=codec)`**: emit a module as a sampled-token codec payload. The LLM consumes the payload (a few hundred bytes) instead of the full source (several KB). Recovery is via library lookup against the LLM's known corpus, or via `omc_codec_decode_lookup` for explicit reconstruction.
+- **Substrate-keyed conversation memory**: wire the `fibtier` memory primitive to store conversation entries as canonical hashes; fetch on demand via the kernel. An LLM's conversation history becomes a stream of hash references that recover into full content when reasoning needs it.
+- **MCP tool: `omc_compress_context(text)`**: given a chunk of OMC code or prose, return a substrate-keyed compressed form the LLM can reference. The complement of `omc_fetch_by_hash`.
+- **Cross-corpus blending**: query multiple corpora (project, stdlib, registry) with weighted ranking, return substrate-keyed identifiers that work across any of them.
+- **Substrate-typed conversation transcripts**: every message in an agent conversation gets a canonical hash; threading + memory operations index by hash, not by string.
+- **Benchmark: end-to-end context-budget reduction**: measure how many fns an LLM agent can hold "in mind" with v0.4 vs without. Hypothesis: 5-10× more candidates fit in the same context window.
+
+### Win condition
+
+An LLM agent solves a multi-step OMC authoring task using ~10% of the context budget a baseline agent would consume, with no loss in solution quality, because the predict engine's output, the conversation memory, and the codec payloads all compose through the substrate's content-addressed identity.
+
+### Deferred from v0.3
+
+- **Prometheus rerank pass**: train a small Prometheus model on the corpus and rerank top-k by token-stream probability.
+- **Stateful corpus API**: `omc_corpus_build` returns a handle, `omc_predict_from(handle, prefix, top_k)` reuses it.
 - **Streaming queries**: incremental updates as the prefix grows token-by-token.
|[v0.3-symbolic-prediction](https://github.com/RandomCoder-lab/OMC/releases/tag/v0.3-symbolic-prediction)|`omc_predict_files(paths, prefix, top_k)` returns ranked provenance-tracked continuations from a content-addressed corpus |