Noise filter misses structural memory contamination at write time (System traces / raw blobs / fragments) #127

@i0ivi0i

Summary

Noise filtering in memory-lancedb-pro appears to be too narrow at write time: it can block greetings / denials / meta-questions, but still allows structurally noisy memories into LanceDB, such as:

  • System: / compaction / model-switch traces
  • long raw quoted conversation blobs
  • concatenated fragment strings / malformed snippets
  • low-value duplicate-ish user utterances that were not distilled into atomic memory entries

This makes the retrieval engine look “healthy” while the memory corpus itself slowly accumulates contaminants.

Observed Symptoms

In real-world usage, recent memories included examples like:

  • quoted raw user text stored nearly verbatim
  • system traces such as compaction/model-switch remnants ending up in memory rows
  • concatenated fragment-like entries such as mixed filename/text shards

These were not simple greeting/boilerplate cases, so the current noise-filter.ts patterns did not catch them.

Why this matters

The plugin’s retrieval stack is strong (hybrid retrieval, rerank, decay, normalization, diversity), but corpus quality still depends on ingress quality.
If ingress filtering is too weak, the plugin can remain operational while recall quality degrades over time because the stored memories are not clean / atomic / semantically stable.

Current likely gap

From reading the code/docs, the likely ingress points are:

  1. src/tools.ts (memory_store)
  2. index.ts auto-capture path (agent_end hook) before final persistence

src/noise-filter.ts currently focuses on:

  • greetings / boilerplate
  • denials
  • meta-questions

But it does not appear to explicitly reject:

  • System:-prefixed traces or internal runtime artifacts
  • compaction / session-management / model-switch messages
  • overly long raw conversation blobs
  • malformed concatenated fragments / accidental shards
  • “not yet distilled” entries that should have been compressed into a short fact/decision/preference memory instead of being stored verbatim

Suggestion

Consider adding a stricter ingress hygiene layer before persistence, applied both to manual tool-store and auto-capture:

1) Source/artifact rejection

Reject entries matching patterns like:

  • ^System:
  • compaction / model switched / session reset / tool transcript artifacts
  • known internal control markers / tags
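A minimal sketch of what such a rejection check could look like. The function name `isSystemArtifact` and the exact patterns are assumptions for illustration; the real trace formats would need to be taken from actual contaminated rows.

```typescript
// Hypothetical patterns: the exact wording of runtime traces is assumed,
// not taken from the plugin's source.
const SYSTEM_ARTIFACT_PATTERNS: RegExp[] = [
  /^\s*System:/i,        // System:-prefixed traces
  /\bcompaction\b/i,     // compaction remnants
  /\bmodel switched\b/i, // model-switch messages
  /\bsession reset\b/i,  // session-management messages
];

// Returns true when the text looks like an internal runtime artifact.
function isSystemArtifact(text: string): boolean {
  return SYSTEM_ARTIFACT_PATTERNS.some((pattern) => pattern.test(text));
}
```

Bare keyword patterns like these are crude and could reject a legitimate memory that merely mentions compaction, so in practice they would probably need to be anchored to the specific trace shapes seen in the field.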

2) Atomicity / length gate

Reject or require transformation when:

  • text is over a configurable character threshold
  • contains multi-sentence raw dialogue / quote blocks
  • contains suspicious concatenation signatures / repeated quote wrapping / filename-shard blobs
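The gate above could be sketched roughly as follows. `checkAtomicity`, `GateResult`, and the individual heuristics (quote-line count, speaker labels, doubled quotes) are all hypothetical, and the thresholds are placeholders rather than tuned values.

```typescript
interface GateResult {
  ok: boolean;
  reason?: "too_long" | "quote_block" | "raw_dialogue" | "repeated_quotes";
}

// Hypothetical write-time gate; maxRawLength mirrors the threshold idea above.
function checkAtomicity(text: string, maxRawLength = 500): GateResult {
  const trimmed = text.trim();

  // 1. Configurable character threshold.
  if (trimmed.length > maxRawLength) return { ok: false, reason: "too_long" };

  // 2. Quote blocks: two or more lines starting with ">".
  const quoteLines = trimmed
    .split("\n")
    .filter((line) => line.trimStart().startsWith(">")).length;
  if (quoteLines >= 2) return { ok: false, reason: "quote_block" };

  // 3. Raw dialogue: two or more speaker-labelled turns (labels are assumed).
  const turns = trimmed.match(/^(user|assistant|human|ai)\s*:/gim) ?? [];
  if (turns.length >= 2) return { ok: false, reason: "raw_dialogue" };

  // 4. Repeated quote wrapping such as ""text"".
  if (/"{2,}/.test(trimmed)) return { ok: false, reason: "repeated_quotes" };

  return { ok: true };
}
```

Returning a reason instead of a bare boolean would let the caller decide between rejecting the entry and sending it to a distillation step.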

3) Distillation gate

For auto-capture especially, require candidate memory items to resemble atomic memory forms, e.g.:

  • preference
  • fact (pitfall/cause/fix/prevention)
  • decision principle
  • entity

Candidates that fit none of these forms would be rejected or distilled first, instead of being carried over near-verbatim from the conversation.
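One possible shape check, as a sketch only. The `CandidateMemory` type, the `category` field, and the 200-character limit are assumptions; the plugin's actual entry schema may differ.

```typescript
// Atomic forms listed above; treated here as a closed set (an assumption).
const ATOMIC_CATEGORIES = ["preference", "fact", "decision", "entity"];

// Hypothetical candidate shape produced by the auto-capture path.
interface CandidateMemory {
  text: string;
  category?: string;
}

// Accepts only categorized, short, single-line candidates.
function isAtomicCandidate(candidate: CandidateMemory, maxLength = 200): boolean {
  if (!candidate.category) return false;
  if (ATOMIC_CATEGORIES.indexOf(candidate.category) === -1) return false;

  const text = candidate.text.trim();
  return text.length > 0 && text.length <= maxLength && text.indexOf("\n") === -1;
}
```

A check like this only verifies the outer shape. Whether the text really is a distilled preference or fact would still depend on the extraction step that assigns the category.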

4) Optional config flags

Example ideas:

  • store.rejectSystemArtifacts: true
  • store.maxRawLength: 500
  • store.requireAtomicMemoryShape: true
  • store.rejectConversationBlob: true
  • store.rejectMalformedFragments: true
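As a sketch, these flags could be typed and merged with defaults like this. The interface name, the helper `resolveStoreHygiene`, and the default values are assumptions that simply mirror the example values above; whether the checks should be on by default is a maintainer decision.

```typescript
// Hypothetical config shape mirroring the example flags.
interface StoreHygieneConfig {
  rejectSystemArtifacts: boolean;
  maxRawLength: number;
  requireAtomicMemoryShape: boolean;
  rejectConversationBlob: boolean;
  rejectMalformedFragments: boolean;
}

// Defaults copied from the example values; not actual plugin defaults.
const DEFAULT_STORE_HYGIENE: StoreHygieneConfig = {
  rejectSystemArtifacts: true,
  maxRawLength: 500,
  requireAtomicMemoryShape: true,
  rejectConversationBlob: true,
  rejectMalformedFragments: true,
};

// Merge user-supplied overrides on top of the defaults.
function resolveStoreHygiene(
  overrides: Partial<StoreHygieneConfig> = {},
): StoreHygieneConfig {
  return { ...DEFAULT_STORE_HYGIENE, ...overrides };
}
```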

5) Post-write verification hook (optional)

An optional callback / validation stage that checks whether the stored item is likely retrievable and non-noisy before accepting it permanently.
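A rough sketch of the write, verify, roll back flow. `MemoryStore`, `PostWriteVerifier`, and `storeWithVerification` are hypothetical names, not the plugin's API, and the example is synchronous for brevity; a LanceDB-backed store would be asynchronous.

```typescript
interface MemoryEntry {
  id: string;
  text: string;
}

// Hypothetical minimal store interface; not the plugin's actual API.
interface MemoryStore {
  add(entry: MemoryEntry): void;
  remove(id: string): void;
}

// Returns true when the stored entry should be kept.
type PostWriteVerifier = (entry: MemoryEntry) => boolean;

// Write the entry, run the verifier, and roll back if it fails.
function storeWithVerification(
  store: MemoryStore,
  entry: MemoryEntry,
  verify: PostWriteVerifier,
): boolean {
  store.add(entry);
  if (verify(entry)) return true;
  store.remove(entry.id);
  return false;
}

// In-memory stand-in used only to demonstrate the flow.
class InMemoryStore implements MemoryStore {
  readonly rows = new Map<string, MemoryEntry>();
  add(entry: MemoryEntry): void {
    this.rows.set(entry.id, entry);
  }
  remove(id: string): void {
    this.rows.delete(id);
  }
}
```

In a real integration the verifier might run a retrieval query for the new entry and check that it ranks for its own key terms, which is one way to read "likely retrievable".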

Key point

This seems less like a retrieval problem and more like an ingress validation / corpus hygiene problem.

The current noise filter is useful, but from field behavior it feels optimized for surface-level low-quality chatter, not structural memory contamination.

If helpful, I can also prepare a concrete patch proposal for:

  • src/tools.ts write-time validation
  • index.ts auto-capture final gate
  • new noise-filter.ts heuristics for system artifacts / raw blob rejection
