L1 memory extraction pipeline: 4 improvements to reduce LLM dependency

## Summary

The L1 memory extraction pipeline (`l1-extractor.ts`) currently relies entirely on a single LLM call for scene segmentation, memory extraction, priority scoring, and type classification. This makes the quality of extracted memories brittle: LLM hallucinations, JSON formatting errors, and low-quality output all go undetected.

This issue proposes 4 targeted improvements, implemented and tested on a fork.

---

## Problem 1: L1 quality gate is neutered

**File:** `src/utils/sanitize.ts` → `shouldExtractL1()`

Length filters and prompt-injection detection are commented out:

```typescript
// const isCJK = /[\u4e00-\u9fff...]/.test(text);
// if (isCJK && text.length < 2) return false;
// if (!isCJK && text.length < 2) return false;
// if (text.length > 5000) return false;
// ...
// if (looksLikePromptInjection(text)) return false;
```

This means short messages ("好的" meaning "okay", "OK", "hi"), prompt-injection payloads, and excessively long pasted logs all enter the LLM pipeline, wasting tokens and risking injection into persistent memory.

**Fix:** Re-enable the filters with appropriate thresholds (CJK ≥ 4 chars, alpha ≥ 10 chars), add a conversational-filler filter, and enable the prompt-injection guard.
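A minimal sketch of what the re-enabled gate could look like, assuming the thresholds above. The filler list, the CJK range (the original character class is elided), and the body of `looksLikePromptInjection()` are placeholders for illustration, not the fork's actual code.

```typescript
// Hypothetical sketch of shouldExtractL1(); not the fork's implementation.
// Assumption: only the basic CJK Unified Ideographs range is checked here.
const CJK_RE = /[\u4e00-\u9fff]/;
const MAX_LENGTH = 5000; // upper bound taken from the commented-out filter

// Assumed filler list, for illustration only.
const FILLER = new Set(["好的", "嗯", "谢谢", "ok", "okay", "hi", "hello", "thanks"]);

// Stand-in for the real looksLikePromptInjection(); these patterns are assumptions.
function looksLikePromptInjection(text: string): boolean {
  return (
    /ignore (all )?previous instructions/i.test(text) ||
    /忽略.{0,10}(指令|提示)/.test(text)
  );
}

function shouldExtractL1(text: string): boolean {
  const trimmed = text.trim();
  // Drop pure conversational filler before any length checks.
  if (FILLER.has(trimmed.toLowerCase())) return false;
  const isCJK = CJK_RE.test(trimmed);
  if (isCJK && trimmed.length < 4) return false; // CJK >= 4 chars
  if (!isCJK && trimmed.length < 10) return false; // alpha >= 10 chars
  if (trimmed.length > MAX_LENGTH) return false; // pasted logs and similar
  if (looksLikePromptInjection(trimmed)) return false;
  return true;
}
```

Checking filler first keeps the rejection reason explicit even when a filler phrase would also fail the length threshold.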

---

## Problem 2: No rule-based pre-extraction — everything goes to LLM

Even obviously structured patterns like "我是Python工程师" ("I am a Python engineer", an identity statement) or "以后请用中文回复" ("from now on, please reply in Chinese", an instruction) require a full LLM round-trip. This wastes tokens and introduces latency for clear-cut cases.

**Fix:** Add `src/core/record/pre-extractor.ts` — a new rule-based extraction layer that runs BEFORE the LLM call:
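- 10 persona patterns: 我喜欢 (I like) / 我是 (I am) / 我的职业是 (my profession is) / 我擅长 (I am good at) / 我认为 (I think) …
- 8 instruction patterns: 以后 (from now on) / 记住 (remember) / 禁止 (do not) / 从现在开始 (starting now) / 使用X语言回复 (reply in language X) …
- Date + action-verb episodic detection ("2025-03-15 部署了新版本", i.e. "2025-03-15 deployed the new version")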

HIGH-confidence matches bypass the LLM entirely and are directly merged into results. MEDIUM-confidence matches are passed as hints.

**Impact:** For common patterns, no LLM call is needed at all. For everything else, the LLM gets higher-quality input.
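The routing described above might be sketched as follows. The rule table is a small illustrative subset, and the confidence assigned to each pattern, the type names, and the function names `preExtract` and `splitByConfidence` are all assumptions rather than the contents of `pre-extractor.ts`.

```typescript
// Hypothetical sketch of a rule-based pre-extraction layer.
type MemoryType = "persona" | "instruction" | "episodic";
type Confidence = "HIGH" | "MEDIUM";

interface PreExtracted {
  type: MemoryType;
  content: string;
  confidence: Confidence;
}

interface Rule {
  type: MemoryType;
  confidence: Confidence;
  pattern: RegExp;
}

// Illustrative subset; which patterns count as HIGH vs MEDIUM is assumed.
const RULES: Rule[] = [
  { type: "persona", confidence: "HIGH", pattern: /我是(.{1,50})/ },
  { type: "persona", confidence: "MEDIUM", pattern: /我喜欢(.{1,50})/ },
  { type: "instruction", confidence: "HIGH", pattern: /以后(.{1,50})/ },
  {
    type: "episodic",
    confidence: "MEDIUM",
    // ISO date followed by text containing an action verb.
    pattern: /\d{4}-\d{2}-\d{2}\s*(.{0,20}(?:部署|发布|完成).{0,30})/,
  },
];

// Run every rule against one message and collect the matches.
function preExtract(message: string): PreExtracted[] {
  const results: PreExtracted[] = [];
  for (const rule of RULES) {
    const match = rule.pattern.exec(message);
    if (match) {
      results.push({ type: rule.type, content: match[0].trim(), confidence: rule.confidence });
    }
  }
  return results;
}

// HIGH matches are merged directly; MEDIUM matches become hints for the LLM prompt.
function splitByConfidence(items: PreExtracted[]) {
  return {
    direct: items.filter((i) => i.confidence === "HIGH"),
    hints: items.filter((i) => i.confidence === "MEDIUM"),
  };
}
```

In this sketch the LLM call can be skipped only when every message in the batch is fully covered by `direct` results; how the fork decides that is not shown here.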

---

## Problem 3: JSON parse failures are silently discarded

When the LLM returns malformed JSON (missing brackets, wrong structure, control characters), `parseExtractionResult()` returns `[]` and all extracted memories are lost. There is no retry or correction mechanism.

**Fix:** Add self-correction retry in `callLlmExtraction()` — when the first parse fails with a specific error, the LLM is called once more with a correction prompt containing the exact parse error. If the retry succeeds, the corrected result is used. Only if both fail is the extraction truly discarded.
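One way to picture the retry flow is the sketch below. The `LlmCall` signature, the `tryParse` helper, the shape of `ExtractedMemory`, and the wording of the correction prompt are assumptions; the real `callLlmExtraction()` and `parseExtractionResult()` may differ.

```typescript
// Hypothetical sketch of a single self-correction retry.
type LlmCall = (prompt: string) => Promise<string>;

interface ExtractedMemory {
  type: string;
  content: string;
  priority: number;
}

type ParseResult =
  | { ok: true; memories: ExtractedMemory[] }
  | { ok: false; error: string };

// Parse the raw reply and keep the error message instead of swallowing it.
function tryParse(raw: string): ParseResult {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) {
      return { ok: false, error: "top-level value is not an array" };
    }
    return { ok: true, memories: parsed as ExtractedMemory[] };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}

async function extractWithRetry(callLlm: LlmCall, prompt: string): Promise<ExtractedMemory[]> {
  const firstRaw = await callLlm(prompt);
  const first = tryParse(firstRaw);
  if (first.ok) return first.memories;

  // Exactly one retry, carrying the exact parse error back to the model.
  const correctionPrompt = [
    "Your previous reply could not be parsed as JSON.",
    `Parse error: ${first.error}`,
    "Previous reply:",
    firstRaw,
    "Return only the corrected JSON array.",
  ].join("\n");
  const second = tryParse(await callLlm(correctionPrompt));
  // Only when both attempts fail is the extraction discarded.
  return second.ok ? second.memories : [];
}
```

Capping the loop at one retry bounds the extra token cost to a single additional call per failed extraction.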

---

## Problem 4: No post-LLM quality validation

The LLM output goes directly to dedup and storage without any validation, so the following pass through unchecked:
- Hallucinated memories (content not traceable to source messages)
- Type confusion (persona content labeled as instruction)
- Trivial/vague memories ("用户询问了关于天气的情况", i.e. "the user asked about the weather")

**Fix:** Add `passesConfidenceCheck()` and `extractSignificantWords()` in `l1-extractor.ts`:
1. **Minimal content:** CJK ≥ 4 chars, alpha ≥ 15 chars
2. **Source traceability:** ≥ 30% of CJK bigrams/English words must appear in source messages (prevents pure hallucination)
3. **Type consistency:** persona must reference 用户/我 ("the user"/"I"); instruction must contain AI/directive keywords; episodic must not be trivial boilerplate
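The source-traceability rule (item 2) could be approximated as below. This is a sketch under assumptions: the tokenisation (CJK bigrams plus lowercased English words of three or more letters) and the helper name `isTraceable` are illustrative, and the function shown under the name `extractSignificantWords` is a stand-in rather than the fork's version.

```typescript
// Hypothetical sketch of the traceability check.
// Assumption: CJK runs become overlapping bigrams, English words need >= 3 letters.
function extractSignificantWords(text: string): string[] {
  const tokens: string[] = [];
  for (const run of text.match(/[\u4e00-\u9fff]+/g) ?? []) {
    for (let i = 0; i + 1 < run.length; i++) {
      tokens.push(run.slice(i, i + 2));
    }
  }
  for (const word of text.toLowerCase().match(/[a-z]{3,}/g) ?? []) {
    tokens.push(word);
  }
  return tokens;
}

// A memory is traceable when at least minRatio of its tokens occur in the sources.
function isTraceable(memory: string, sources: string[], minRatio = 0.3): boolean {
  const tokens = extractSignificantWords(memory);
  if (tokens.length === 0) return false; // nothing to verify against
  const haystack = sources.join("\n").toLowerCase();
  const hits = tokens.filter((t) => haystack.includes(t)).length;
  return hits / tokens.length >= minRatio;
}
```

For example, a memory "用户是Python工程师" checked against the source "我是Python工程师" yields the tokens 用户, 户是, 工程, 程师, and python, of which three appear in the source, so it clears the 30% threshold.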

---

## Additional fixes

- The regex prefix in pre-extractor patterns was greedy (`记住.{0,5}(.{1,50})`), so `.{0,5}` swallowed up to five characters of the content and the capture group received only a truncated remainder. Fixed to non-greedy (`记住.{0,5}?(.{1,50})`).
- CJK injection detection patterns were too narrow. Changed from rigid pattern matching to flexible `.{0,10}` wildcard matching.
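The effect of the quantifier change in the first bullet can be reproduced in isolation. The input string here is an illustrative example, not one of the fork's test cases.

```typescript
// Compare the greedy and non-greedy prefix on the same input.
const greedyPattern = /记住.{0,5}(.{1,50})/;
const lazyPattern = /记住.{0,5}?(.{1,50})/;
const sample = "记住用中文回复";

// Greedy: `.{0,5}` first takes all five remaining characters, then backtracks
// by one so the capture group can match a single character.
const greedyCapture = greedyPattern.exec(sample)?.[1]; // "复"

// Non-greedy: `.{0,5}?` takes nothing, so the group captures the full content.
const lazyCapture = lazyPattern.exec(sample)?.[1]; // "用中文回复"

console.log({ greedyCapture, lazyCapture });
```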

---

## Testing

54 of 55 unit tests pass, covering all 4 modules.


