Summary
Noise filtering in memory-lancedb-pro appears to be too narrow at write time: it can block greetings / denials / meta-questions, but still allows structurally noisy memories into LanceDB, such as:
- System: / compaction / model-switch traces
- long raw quoted conversation blobs
- concatenated fragment strings / malformed snippets
- low-value duplicate-ish user utterances that were not distilled into atomic memory entries
This makes the retrieval engine look “healthy” while the memory corpus itself slowly accumulates contaminants.
Observed Symptoms
In real-world usage, recent memories included examples like:
- quoted raw user text stored nearly verbatim
- system traces such as compaction/model-switch remnants ending up in memory rows
- concatenated fragment-like entries such as mixed filename/text shards
These were not simple greeting/boilerplate cases, so the current noise-filter.ts patterns did not catch them.
Why this matters
The plugin’s retrieval stack is strong (hybrid retrieval, rerank, decay, normalization, diversity), but corpus quality still depends on ingress quality.
If ingress filtering is too weak, the plugin can remain operational while recall quality degrades over time because the stored memories are not clean / atomic / semantically stable.
Current likely gap
From reading the code/docs, the likely ingress points are:
- src/tools.ts → memory_store
- index.ts auto-capture path (agent_end hook) before final persistence
src/noise-filter.ts currently focuses on:
- greetings / boilerplate
- denials
- meta-questions
But it does not appear to explicitly reject:
- System:-prefixed traces or internal runtime artifacts
- compaction / session-management / model-switch messages
- overly long raw conversation blobs
- malformed concatenated fragments / accidental shards
- “not yet distilled” entries that should have been compressed into a short fact/decision/preference memory instead of being stored verbatim
Suggestion
Consider adding a stricter ingress hygiene layer before persistence, applied both to manual tool-store and auto-capture:
1) Source/artifact rejection
Reject entries matching patterns like:
- ^System:
- compaction / model switched / session reset / tool transcript artifacts
- known internal control markers / tags
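A minimal sketch of what this rejection step could look like. The name `isSystemArtifact` and the exact patterns are assumptions for illustration, not the plugin's existing API; the real trace formats would need to be checked against actual stored rows.

```typescript
// Hypothetical patterns: the exact trace formats are assumptions.
// They are anchored to the start of the text so that a legitimate memory
// that merely mentions "compaction" is not rejected.
const ARTIFACT_PATTERNS: RegExp[] = [
  /^\s*System:/i, // System:-prefixed traces
  /^\s*\[?(?:compaction|model switched|session reset)\b/i, // runtime remnants
];

// Returns true when the text looks like an internal runtime artifact.
function isSystemArtifact(text: string): boolean {
  return ARTIFACT_PATTERNS.some((pattern) => pattern.test(text));
}
```

Anchoring at the start of the text is a deliberate trade-off: it misses artifacts embedded mid-text but avoids rejecting real memories about these topics.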
2) Atomicity / length gate
Reject or require transformation when:
- text is over a configurable character threshold
- contains multi-sentence raw dialogue / quote blocks
- contains suspicious concatenation signatures / repeated quote wrapping / filename-shard blobs
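The three conditions above could be sketched roughly as follows. `checkAtomicity`, the reason strings, the 500-character default, and every heuristic here are illustrative assumptions rather than existing plugin code; the filename-shard check in particular is crude and would need tuning against real data.

```typescript
type GateResult = { ok: true } | { ok: false; reason: string };

// Hypothetical atomicity / length gate.
function checkAtomicity(text: string, maxRawLength = 500): GateResult {
  // 1) Configurable character threshold.
  if (text.length > maxRawLength) {
    return { ok: false, reason: "too_long" };
  }
  // 2) Raw dialogue: two or more lines that look like speaker turns or "> " quotes.
  const dialogueLines = text
    .split(/\r?\n/)
    .filter((line) => /^\s*(?:>|(?:user|assistant|system)\s*:)/i.test(line)).length;
  if (dialogueLines >= 2) {
    return { ok: false, reason: "conversation_blob" };
  }
  // 3a) Repeated quote wrapping, e.g. text starting with "" or '".
  if (/^\s*["'\u201C\u201D]{2,}/.test(text)) {
    return { ok: false, reason: "repeated_quote_wrapping" };
  }
  // 3b) Filename shard: a known extension glued directly onto following text,
  //     e.g. "notes.mdthe user said". Longer extensions are listed first.
  if (/\b[\w-]+\.(?:json|tsx|jsx|ts|js|md|txt)[A-Za-z]{3,}/.test(text)) {
    return { ok: false, reason: "filename_shard" };
  }
  return { ok: true };
}
```

Returning a reason instead of a bare boolean lets the caller decide whether to drop the entry or send it through a distillation step.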
3) Distillation gate
For auto-capture especially, require candidate memory items to resemble atomic memory forms, e.g.:
- preference
- fact (pitfall/cause/fix/prevention)
- decision principle
- entity
The goal is to avoid near-verbatim conversation carryover.
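As a rough illustration of a distillation gate, assuming candidates carry a category label and that "atomic" can be approximated by a small sentence count. The `CandidateMemory` shape, the category names, and `looksAtomic` are all hypothetical and may not match how the plugin represents memories.

```typescript
// Hypothetical candidate shape; the plugin's real entry type may differ.
interface CandidateMemory {
  category?: string;
  text: string;
}

// Categories mirroring the atomic forms listed above (names are assumptions).
const ATOMIC_CATEGORIES: ReadonlySet<string> = new Set([
  "preference",
  "fact",
  "decision",
  "entity",
]);

// Accept only categorized candidates made of at most `maxSentences` sentences.
function looksAtomic(candidate: CandidateMemory, maxSentences = 2): boolean {
  if (!candidate.category || !ATOMIC_CATEGORIES.has(candidate.category)) {
    return false;
  }
  const sentences = candidate.text
    .split(/[.!?]+\s+/)
    .filter((sentence) => sentence.trim().length > 0);
  return sentences.length > 0 && sentences.length <= maxSentences;
}
```

Sentence counting is only a proxy; a stricter version could ask the extraction step to emit the category explicitly and reject anything uncategorized.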
4) Optional config flags
Example ideas:
- store.rejectSystemArtifacts: true
- store.maxRawLength: 500
- store.requireAtomicMemoryShape: true
- store.rejectConversationBlob: true
- store.rejectMalformedFragments: true
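One possible typing for these flags, as a sketch. The interface name, the defaults, and `resolveStoreHygieneConfig` are assumptions; only the flag names and example values come from the list above.

```typescript
// Hypothetical config shape for the proposed flags.
interface StoreHygieneConfig {
  rejectSystemArtifacts: boolean;
  maxRawLength: number;
  requireAtomicMemoryShape: boolean;
  rejectConversationBlob: boolean;
  rejectMalformedFragments: boolean;
}

// Example defaults taken from the values suggested above.
const DEFAULT_STORE_HYGIENE: StoreHygieneConfig = {
  rejectSystemArtifacts: true,
  maxRawLength: 500,
  requireAtomicMemoryShape: true,
  rejectConversationBlob: true,
  rejectMalformedFragments: true,
};

// Merge user overrides onto the defaults so every flag is always defined.
function resolveStoreHygieneConfig(
  overrides: Partial<StoreHygieneConfig> = {},
): StoreHygieneConfig {
  return { ...DEFAULT_STORE_HYGIENE, ...overrides };
}
```

Whether these should default to on or off is an open question; defaulting to off would keep existing installations unchanged.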
5) Post-write verification hook (optional)
An optional callback / validation stage that checks whether the stored item is likely retrievable and non-noisy before accepting it permanently.
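A sketch of the write-then-verify idea, under simplifying assumptions: `MemoryRowStore` is a hypothetical synchronous interface, not the plugin's storage API, and a real store would most likely be asynchronous. The `isClean` callback stands in for whatever validation stage is chosen.

```typescript
// Hypothetical minimal store interface, for illustration only.
interface MemoryRowStore {
  insert(text: string): string; // returns the new row id
  search(query: string): string[]; // returns ids of matching rows
  remove(id: string): void;
}

type PostWriteCheck = (text: string) => boolean;

// Write the row, confirm it can be retrieved and passes the check,
// and roll it back otherwise. Returns the row id, or null if rejected.
function storeWithVerification(
  store: MemoryRowStore,
  text: string,
  isClean: PostWriteCheck,
): string | null {
  const id = store.insert(text);
  const retrievable = store.search(text).includes(id);
  if (!retrievable || !isClean(text)) {
    store.remove(id);
    return null;
  }
  return id;
}
```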
Key point
This seems less like a retrieval problem and more like an ingress validation / corpus hygiene problem.
The current noise filter is useful, but from field behavior it feels optimized for surface-level low-quality chatter, not structural memory contamination.
If helpful, I can also prepare a concrete patch proposal for:
- src/tools.ts write-time validation
- index.ts auto-capture final gate
- new noise-filter.ts heuristics for system artifacts / raw blob rejection