Skip to content

L1 memory extraction pipeline: 4 improvements to reduce LLM dependency #82

@yuanrengu

Description

@yuanrengu

Summary

The L1 memory extraction pipeline (l1-extractor.ts) currently relies entirely on a single LLM call for scene segmentation, memory extraction, priority scoring, and type classification. This makes the quality of extracted memories brittle — LLM hallucination, JSON formatting errors, or low-quality output all go undetected.

This issue proposes 4 targeted improvements, implemented and tested on a fork.


Problem 1: L1 quality gate is neutered

File: src/utils/sanitize.tsshouldExtractL1()

Length filters and prompt-injection detection are commented out:

// const isCJK = /[\u4e00-\u9fff...]/.test(text);
// if (isCJK && text.length < 2) return false;
// if (!isCJK && text.length < 2) return false;
// if (text.length > 5000) return false;
// ...
// if (looksLikePromptInjection(text)) return false;

This means short messages ("好的", "OK", "hi"), prompt-injection payloads, and excessively long pasted logs all enter the LLM pipeline, wasting tokens and risking injection into persistent memory.

Fix: Re-enable filters with appropriate thresholds (CJK ≥ 4 chars, alpha ≥ 10 chars), add conversational filler filter, and enable prompt-injection guard.


Problem 2: No rule-based pre-extraction — everything goes to LLM

Even obviously structured patterns like "我是Python工程师" (identity) or "以后请用中文回复" (instruction) require a full LLM round-trip. This wastes tokens and introduces latency for clear-cut cases.

Fix: Add src/core/record/pre-extractor.ts — a new rule-based extraction layer that runs BEFORE the LLM call:

  • 10 persona patterns: 我喜欢/我是/我的职业是/我擅长/我认为…
  • 8 instruction patterns: 以后/记住/禁止/从现在开始/使用X语言回复…
  • Date + action-verb episodic detection ("2025-03-15 部署了新版本")

HIGH-confidence matches bypass the LLM entirely and are directly merged into results. MEDIUM-confidence matches are passed as hints.

Impact: For common patterns, no LLM call is needed at all. For everything else, the LLM gets higher-quality input.


Problem 3: JSON parse failures are silently discarded

When the LLM returns malformed JSON (missing brackets, wrong structure, control characters), parseExtractionResult() returns [] and all extracted memories are lost. There is no retry or correction mechanism.

Fix: Add self-correction retry in callLlmExtraction() — when the first parse fails with a specific error, the LLM is called once more with a correction prompt containing the exact parse error. If the retry succeeds, the corrected result is used. Only if both fail is the extraction truly discarded.


Problem 4: No post-LLM quality validation

The LLM output goes directly to dedup and storage without any validation:

  • Hallucinated memories (content not traceable to source messages)
  • Type confusion (persona content labeled as instruction)
  • Trivial/vague memories ("用户询问了关于天气的情况")

Fix: Add passesConfidenceCheck() and extractSignificantWords() in l1-extractor.ts:

  1. Minimal content: CJK ≥ 4 chars, alpha ≥ 15 chars
  2. Source traceability: ≥ 30% of CJK bigrams/English words must appear in source messages (prevents pure hallucination)
  3. Type consistency: persona must reference 用户/我; instruction must contain AI/directive keywords; episodic must not be trivial boilerplate

Additional fixes

  • Regex in pre-extractor patterns was greedy (记住.{0,5}(.{1,50})), causing capture groups to be empty. Fixed to non-greedy (记住.{0,5}?(.{1,50})).
  • CJK injection detection patterns were too narrow. Changed from rigid pattern matching to flexible .{0,10} wildcard matching.

Testing

54/55 unit tests pass covering all 4 modules.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions