Memscribe architecture

Memscribe is a deterministic, zero-LLM pipeline that turns the transcript logs AI coding agents already write into typed nodes the downstream inference-and-governance layer (MemCortex) can consume. No model is ever called: capture is reading and parsing, never summarizing. The output is an exact function of the input, which is what makes the whole module golden-file, property, and fuzz testable.

It is the bottom of a three-layer stack — Memtrace uses MemCortex, and MemCortex uses Memscribe. The dependency direction is strictly one-way: each layer depends only on the one below it, and memscribe-core depends on nothing else in the workspace. Memscribe never calls upward.


The pipeline

A single, linear, deterministic pipeline. Each stage is a trait, so it can be tested in isolation and swapped. Everything between Source and Sink is a pure, synchronous function of the event stream.

 Source                Adapter           Gate        Segmenter      Binder        NodePrep        Sink
 (memscribe-io)        (memscribe-       (core)      (core)         (core)        (core)          (memscribe-sink)
                        adapters)
 tail JSONL        →   parse one     →   admit?  →   arc / turn  →  decision  →   assemble    →   NDJSON / SQLite
 hook stdin            RawRecord →       commitment  spans;         ↔ edit,       PreparedNode    / MemDB
 OTLP receiver         CaptureEvent[]    markers     elevate gated  PROV          stream
                       (version-                     turns; seed    (t_use
                        tolerant)                     decisions;    ≤ t_gen)
                                                      collect edits
   RawRecord               CaptureEvent      markers    Segmentation   BindingEdge   PreparedNode    (consumer)
   (bytes + provenance)    (normalized)                                              stream
  • Source → Adapter produces the normalized CaptureEvent stream — the system of record. This is the only stage that touches tool-specific formats.
  • Gate → Segmenter → Binder → NodePrep transform that stream into PreparedNodes. Pure and synchronous given the events.
  • An optional redaction pass runs over the prepared nodes before the sink.
  • Sink writes the nodes out. It is the single seam that decouples Memscribe from MemDB.
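The stage-as-trait idea can be sketched in a few lines. Everything here is a simplified, hypothetical stand-in (Event, Node, KeywordGate, ContiguousSegmenter and run are not Memscribe's real types or signatures); the point is only that each stage sits behind a trait and the composition is a pure function of the input events.

```rust
// Hypothetical, simplified stand-ins; not Memscribe's real types.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    seq: u64,
    text: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    first_seq: u64,
    last_seq: u64,
    text: String,
}

/// Gate stage: decides whether an event is admitted.
trait Gate {
    fn admit(&self, event: &Event) -> bool;
}

/// Segmenter stage: groups admitted events into spans.
trait Segmenter {
    fn segment(&self, admitted: &[Event]) -> Vec<Node>;
}

/// Admits an event when its text contains any configured marker.
struct KeywordGate {
    markers: Vec<&'static str>,
}

impl Gate for KeywordGate {
    fn admit(&self, event: &Event) -> bool {
        self.markers.iter().any(|m| event.text.contains(m))
    }
}

/// Merges admitted events with consecutive seq values into one span.
struct ContiguousSegmenter;

impl Segmenter for ContiguousSegmenter {
    fn segment(&self, admitted: &[Event]) -> Vec<Node> {
        let mut nodes: Vec<Node> = Vec::new();
        for ev in admitted {
            match nodes.last_mut() {
                Some(node) if node.last_seq + 1 == ev.seq => {
                    node.last_seq = ev.seq;
                    node.text.push('\n');
                    node.text.push_str(&ev.text);
                }
                _ => nodes.push(Node {
                    first_seq: ev.seq,
                    last_seq: ev.seq,
                    text: ev.text.clone(),
                }),
            }
        }
        nodes
    }
}

/// Pure and synchronous: the same events always yield the same nodes.
fn run(gate: &dyn Gate, segmenter: &dyn Segmenter, events: &[Event]) -> Vec<Node> {
    let admitted: Vec<Event> = events.iter().filter(|e| gate.admit(e)).cloned().collect();
    segmenter.segment(&admitted)
}
```

Because run takes the stages as trait objects, a test can swap in a gate that admits everything and check the segmenter alone, which is the isolation property described above.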

The orchestration lives in memscribe-core::pipeline::DefaultPipeline:

let nodes = DefaultPipeline::new()                 // redaction ON by default
    .run_records(adapter.as_ref(), &records);      // parse → prepare → redact
// or stream straight to a sink:
let n = DefaultPipeline::new()
    .run_to_sink(adapter.as_ref(), &records, &mut sink)?;

DefaultPipeline::prepare_events(&events) is the pure core: its output is an exact function of events. without_redaction() turns the redactor off (golden tests assert on verbatim content), and with_gate(..) / with_redactor(..) swap in config-driven stages.


Crate responsibilities

| Crate | Responsibility |
| --- | --- |
| memscribe-core | The frozen contract: the event model, the prepared-node output types, the TranscriptAdapter and Sink traits, and the deterministic pipeline (gate → segmenter → binder → nodeprep) plus the redact pass. Depends on nothing in the workspace. |
| memscribe-adapters | Per-tool parsers behind feature flags. Each implements TranscriptAdapter. The registry assembles the enabled set (all_adapters) and resolves one by SourceKind (adapter_for). |
| memscribe-io | Generic sources: a notify-based file tailer (offset resume), a hook server, and an OTLP receiver. Turns raw bytes into RawRecords. |
| memscribe-sink | Concrete Sinks: NdjsonSink (canonical default), SqliteSink (feature sqlite), and MemDbSink (feature memdb, off by default). |
| memscribe-cli | The memscribe binary: watch / hook / parse / replay / verify / redact. |
| memscribe-testkit | The harness: parse_events / prepare_nodes, the invariant checks, golden-fixture loaders, and the cross-tool conformance scenario catalog. |

The contract types

All of these live in memscribe-core and are re-exported from its crate root. Do not change their behavior or public shape — the test suite and every consumer depend on exact output.

Input: the normalized event model (model.rs)

CaptureEvent is the system of record produced by adapters. Every field is copied verbatim from the source; none is generated by Memscribe.

pub struct CaptureEvent {
    pub schema_version: u16,        // SCHEMA_VERSION; consumers gate on this
    pub source: SourceKind,         // which tool produced it
    pub session_id: String,         // tool-native session/thread id
    pub seq: u64,                   // monotonic per-session, from file order
    pub event_id: String,           // tool-native id, or blake3(content) fallback
    pub parent_id: Option<String>,  // DAG link where the tool provides one
    pub timestamp: OffsetDateTime,  // RFC3339, verbatim
    pub project: ProjectRef,        // cwd / repo_root / git, from session start
    pub kind: EventKind,            // the payload
    pub provenance: SourceLocation, // pointer back into the source bytes
}

EventKind is the payload enum. EventKind::Unknown is load-bearing: an unrecognized record type or a new field is preserved verbatim and flagged, never discarded — that is how the stream stays lossless across tool-version churn.

| EventKind variant | Meaning |
| --- | --- |
| SessionStart | cwd, git ref, model, tool version |
| UserTurn | a user message (flattened text + structured Parts) |
| AssistantTurn | an assistant message (text, thinking, model, usage, parts) |
| ToolCall | a tool invocation (call_id, name, raw args) |
| ToolResult | a tool result (call_id, ok, raw output) |
| FileEdit | a normalized Diff (from Edit/Write/apply_patch/replace) |
| Compaction | model-side history compaction; flagged, never stored as truth |
| Rewind | a user rewind back to an earlier event |
| SessionEnd | the session ended |
| Unknown | an unrecognized record, preserved verbatim and flagged |
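The role of Unknown is easiest to see in a classifier that matches only on the fields it needs. The sketch below is illustrative only: RawLine is a hypothetical pre-tokenized record, the EventKind here is cut down to three variants, and classify is not a Memscribe function.

```rust
// Reduced, hypothetical version of the payload enum.
#[derive(Debug, Clone, PartialEq)]
enum EventKind {
    UserTurn { text: String },
    SessionEnd,
    /// Unrecognized record, kept exactly as it appeared in the source.
    Unknown { record_type: String, raw: String },
}

/// Hypothetical pre-tokenized record: the type tag, an optional text field,
/// and the untouched source line.
struct RawLine<'a> {
    record_type: &'a str,
    text: Option<&'a str>,
    raw: &'a str,
}

/// Version-tolerant classification: match on what is needed, and route
/// everything else to Unknown instead of dropping it or panicking.
fn classify(line: &RawLine) -> EventKind {
    match (line.record_type, line.text) {
        ("user", Some(text)) => EventKind::UserTurn { text: text.to_string() },
        ("session_end", _) => EventKind::SessionEnd,
        // Covers new record types and known types missing a required field.
        _ => EventKind::Unknown {
            record_type: line.record_type.to_string(),
            raw: line.raw.to_string(),
        },
    }
}
```

A record type introduced by a newer tool version therefore survives in the stream with its raw content intact, and a later parser version can reinterpret it.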

SourceKind enumerates the nine tools plus Unknown; SourceKind::parse maps CLI/--as slugs (tolerant of aliases such as claude / claude-code).
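A minimal sketch of alias-tolerant slug parsing follows. It assumes only two tools, assumes the slugs claude_code and codex, and assumes that parse returns Unknown for unrecognized input; the real enum has more variants and its parse signature may differ.

```rust
// Reduced, hypothetical version of SourceKind with two tools.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceKind {
    ClaudeCode,
    Codex,
    Unknown,
}

impl SourceKind {
    /// Stable snake_case slug (the exact slug strings are assumptions).
    fn as_str(self) -> &'static str {
        match self {
            SourceKind::ClaudeCode => "claude_code",
            SourceKind::Codex => "codex",
            SourceKind::Unknown => "unknown",
        }
    }

    /// Tolerant parse: ignores case, surrounding whitespace, and the
    /// difference between '-' and '_', and accepts short aliases.
    fn parse(slug: &str) -> SourceKind {
        let normalized = slug.trim().to_ascii_lowercase().replace('-', "_");
        match normalized.as_str() {
            "claude" | "claude_code" => SourceKind::ClaudeCode,
            "codex" => SourceKind::Codex,
            _ => SourceKind::Unknown,
        }
    }
}
```

Keeping as_str and parse side by side makes the round trip parse(as_str(k)) == k easy to assert for every variant.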

Output: the prepared-node stream (node.rs)

PreparedNode is the typed data a consumer ingests. It is a tagged enum:

| PreparedNode variant | Payload | Meaning |
| --- | --- | --- |
| Conversation | ConversationSpan | a gated, verbatim dialogue span with the markers that fired |
| Decision | DecisionRecord | a deterministically-parsed decision (IBIS / QOC / MADR / Kruchten shape) |
| Episode | CodeEpisode | a code edit episode: path, Diff, git ref, deterministic episode_id |
| Binding | BindingEdge | a decision/conversation → episode edge carrying a ProvRecord |

Epistemic honesty: FactStatus

Every node and edge carries a FactStatus. Memscribe itself only ever emits the first two, Observed and DeterministicallyDerived; the latter two, StatisticallyRanked and LlmHypothesis, are flags reserved for a downstream inference layer, values Memscribe never computes by guessing. This is the property that keeps the module zero-LLM and its output golden-testable.

| FactStatus | Who sets it |
| --- | --- |
| Observed | Memscribe: verbatim from the source |
| DeterministicallyDerived | Memscribe: a pure function of observed data |
| StatisticallyRanked | downstream: a statistical measure |
| LlmHypothesis | downstream: an LLM hypothesis; Memscribe only flags it |
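The split between Memscribe-emitted and downstream-only statuses can be expressed as a small predicate. The variant names come from the table; is_memscribe_emitted and downstream_only are hypothetical helpers, sketched here to show how a test might assert that a zero-LLM stage never produces a downstream status.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FactStatus {
    Observed,
    DeterministicallyDerived,
    StatisticallyRanked,
    LlmHypothesis,
}

impl FactStatus {
    /// True for the two statuses Memscribe itself emits (hypothetical helper).
    fn is_memscribe_emitted(self) -> bool {
        matches!(
            self,
            FactStatus::Observed | FactStatus::DeterministicallyDerived
        )
    }
}

/// Returns the indices of statuses that only a downstream layer may set.
/// An empty result means the output is consistent with a zero-LLM stage.
fn downstream_only(statuses: &[FactStatus]) -> Vec<usize> {
    statuses
        .iter()
        .enumerate()
        .filter(|(_, s)| !s.is_memscribe_emitted())
        .map(|(i, _)| i)
        .collect()
}
```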

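ProvRecord records the PROV relations used(session, decision) and wasGeneratedBy(diff, session), with the temporal invariant t_use ≤ t_gen: the session must have used the decision no later than it generated the diff. ProvRecord::is_temporally_valid checks this.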
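As a rough illustration of the temporal invariant, the sketch below uses a hypothetical ProvRecord whose fields and integer timestamps (Unix seconds standing in for OffsetDateTime) are assumptions; only the name is_temporally_valid and the rule t_use ≤ t_gen come from the text above.

```rust
/// Hypothetical stand-in for the real ProvRecord.
#[derive(Debug, Clone, PartialEq)]
struct ProvRecord {
    session_id: String,
    decision_id: String, // used(session, decision)
    diff_id: String,     // wasGeneratedBy(diff, session)
    t_use: u64,          // when the session used the decision
    t_gen: u64,          // when the session generated the diff
}

impl ProvRecord {
    /// A decision can only explain an edit made at or after its use.
    fn is_temporally_valid(&self) -> bool {
        self.t_use <= self.t_gen
    }
}

/// Keeps only candidate bindings that satisfy the temporal invariant.
/// (valid_bindings is a hypothetical helper, not part of Memscribe.)
fn valid_bindings(candidates: Vec<ProvRecord>) -> Vec<ProvRecord> {
    candidates
        .into_iter()
        .filter(ProvRecord::is_temporally_valid)
        .collect()
}
```

Equal timestamps pass the check, since the invariant is non-strict.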


How to add a new adapter

Adapters are the volatile part — every tool's format churns — so adding one is a well-trodden, five-step path. The contract: a parser is version-tolerant (it pattern-matches on the fields it needs and routes anything unrecognized to EventKind::Unknown) and must never panic.

  1. Add a SourceKind variant (memscribe-core/src/model.rs). Wire its stable snake_case slug into SourceKind::as_str and into SourceKind::parse (include any aliases). This is the one allowed touch of memscribe-core for a new tool — coordinate it, since the frozen contract is shared.

  2. Add the adapter module (memscribe-adapters/src/<tool>.rs) behind a #[cfg(feature = "<tool>")] gate, with a matching entry in the crate's [features] table. Implement TranscriptAdapter:

    • source_kind() — return your SourceKind.
    • discover(&DiscoverCfg) — locate live & historical transcripts. Honor the per-tool override key in DiscoverCfg.overrides (e.g. CLAUDE_CONFIG_DIR, CODEX_HOME) and fall back to cfg.home_dir(). Return handles in a deterministic (sorted) order.
    • parse(&RawRecord, &mut ParseCtx) — turn ONE record into zero or more CaptureEvents. Use ParseCtx::alloc_seq for the monotonic seq, ParseCtx::first_seen for dedup, and ParseCtx::project_or_default for the project binding. Never panic; route unknowns to EventKind::Unknown.
    • schema_fingerprint(&RawRecord) — return a SchemaVariant so the corpus and runtime can version-gate the parser.
  3. Register it (memscribe-adapters/src/registry.rs). Add the cfg-gated push in all_adapters() and the cfg-gated arm in adapter_for().

  4. Add fixtures under fixtures/<tool>/<version>/<scenario>.jsonl for the canonical scenarios in memscribe-testkit::scenarios::SCENARIOS, and bless the expected outputs under fixtures-expected/<tool>/<version>/ (see CONTRIBUTING.md for the capture → golden → bless flow).

  5. Add tests. Unit-test the parser; run the shared invariant checks from memscribe-testkit::invariants (check_monotonic_seq, check_lossless, check_unique_event_ids, check_determinism); and add a cargo-fuzz target so the never-panic contract is enforced. Verify in isolation: cargo test -p memscribe-adapters --test <your_file_stem>.
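To make the parse contract from step 2 concrete, here is a heavily reduced sketch. All types are local stand-ins: the trait below has only two of the four methods, the signatures of alloc_seq and first_seen are guesses (first_seen is assumed to return true the first time an id appears), and the tab-separated line format is invented for the example.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceKind {
    ExampleTool, // hypothetical tool
}

#[derive(Debug, Clone, PartialEq)]
enum EventKind {
    UserTurn { text: String },
    Unknown { raw: Vec<u8> }, // source bytes kept verbatim
}

#[derive(Debug, Clone, PartialEq)]
struct CaptureEvent {
    source: SourceKind,
    seq: u64,
    event_id: String,
    kind: EventKind,
}

struct RawRecord {
    bytes: Vec<u8>,
}

#[derive(Default)]
struct ParseCtx {
    next_seq: u64,
    seen: HashSet<String>,
}

impl ParseCtx {
    /// Hands out the next monotonic seq.
    fn alloc_seq(&mut self) -> u64 {
        let seq = self.next_seq;
        self.next_seq += 1;
        seq
    }

    /// Assumed semantics: true only the first time an id is seen.
    fn first_seen(&mut self, id: &str) -> bool {
        self.seen.insert(id.to_string())
    }
}

/// Reduced stand-in for the real trait (discover and schema_fingerprint omitted).
trait TranscriptAdapter {
    fn source_kind(&self) -> SourceKind;
    fn parse(&self, record: &RawRecord, ctx: &mut ParseCtx) -> Vec<CaptureEvent>;
}

struct ExampleAdapter;

impl TranscriptAdapter for ExampleAdapter {
    fn source_kind(&self) -> SourceKind {
        SourceKind::ExampleTool
    }

    fn parse(&self, record: &RawRecord, ctx: &mut ParseCtx) -> Vec<CaptureEvent> {
        // Lossy decoding cannot panic, even on invalid UTF-8.
        let line = String::from_utf8_lossy(&record.bytes);
        // Invented line format: "<id>\t<type>\t<text>".
        let mut fields = line.splitn(3, '\t');
        let id = fields.next().unwrap_or("").to_string();
        let kind = match (fields.next(), fields.next()) {
            (Some("user"), Some(text)) => EventKind::UserTurn { text: text.to_string() },
            // Unrecognized or malformed: preserve the original bytes.
            _ => EventKind::Unknown { raw: record.bytes.clone() },
        };
        // Drop duplicates before allocating a seq, so seq has no gaps.
        if !ctx.first_seen(&id) {
            return Vec::new();
        }
        vec![CaptureEvent {
            source: self.source_kind(),
            seq: ctx.alloc_seq(),
            event_id: id,
            kind,
        }]
    }
}
```

The same shape carries over to a real adapter: match on the fields you need, fall through to Unknown with the source bytes intact, and let ParseCtx own ordering and dedup so the parser itself stays stateless.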

The conformance suite then asserts your tool normalizes the canonical scenarios to the same shape as every other tool — that cross-tool equivalence is the point of the thin-waist event model.
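For a sense of what an invariant check from step 5 might look like, the sketch below tests per-session seq ordering. It is not the testkit's check_monotonic_seq: the function name, the (session, seq) tuple input, and the choice of strict increase are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Returns the index of the first event whose seq fails to strictly
/// increase within its own session, or None if the stream is well ordered.
/// Sessions may interleave freely; only per-session order is checked.
fn first_seq_violation(events: &[(&str, u64)]) -> Option<usize> {
    let mut last: HashMap<&str, u64> = HashMap::new();
    for (i, &(session, seq)) in events.iter().enumerate() {
        if let Some(&prev) = last.get(session) {
            if seq <= prev {
                return Some(i);
            }
        }
        last.insert(session, seq);
    }
    None
}
```

Returning the offending index rather than a bare boolean lets a failing fixture point at the exact record that broke the ordering.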