fix(lorebook): prevent substring matching for keywords, check for Asian characters also #347
vire70 wants to merge 3 commits into
Removed the short-circuiting .includes() check that caused keywords (like 'hulk') to falsely trigger on substrings (like 'madhulkman'). Replaced it with a dynamic regex using \b word boundaries for space-separated languages. Added CJK character detection so that Asian language keywords (which don't use spaces) bypass \b and correctly use substring matching, since \b does not work between CJK characters.
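The behaviour described above can be illustrated with a minimal sketch. It is not the code from this PR: `escapeRegex` is written out here as a hypothetical stand-in for the project's helper, and `matchesWithAsciiBoundary` is an illustrative name that does not appear in the source.

```typescript
// Hypothetical helper: escapes regex metacharacters so a keyword is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Illustrative only: wraps the keyword in ASCII \b word boundaries.
function matchesWithAsciiBoundary(searchContent: string, keyword: string): boolean {
  return new RegExp('\\b' + escapeRegex(keyword) + '\\b', 'i').test(searchContent)
}

const english = 'the madhulkman arrived'
// Plain substring matching triggers on the embedded 'hulk'.
const substringHit = english.includes('hulk')
// With \b on both sides the embedded occurrence is rejected...
const boundedHit = matchesWithAsciiBoundary(english, 'hulk')
// ...while a standalone word still matches, regardless of case.
const standaloneHit = matchesWithAsciiBoundary('The Hulk arrived', 'hulk')

const japanese = '私は東京に住んでいます'
// \b needs an ASCII word character on one side, so it never matches inside CJK text.
const cjkBounded = matchesWithAsciiBoundary(japanese, '東京')
// Substring matching is what finds the keyword here.
const cjkSubstring = japanese.includes('東京')
```

Here `substringHit`, `standaloneHit`, and `cjkSubstring` are true, while `boundedHit` and `cjkBounded` are false, which is why CJK keywords have to bypass the boundary regex.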
Walkthrough
textMatches in EntryRetrievalService now special-cases non-space-separated scripts (CJK/Thai/Lao/Khmer/Burmese) and uses plain substring matching; for other scripts it constructs an escaped, Unicode-aware regex with conditional lookarounds at the query edges and tests that against searchContent.
Code Review
This pull request updates the keyword matching logic in EntryRetrievalService to use substring matching for CJK characters and dynamic word boundaries for space-separated languages. However, the implementation relies on ASCII-only boundaries (\b and \w), which fail for non-ASCII space-separated languages like Cyrillic, and it misses other non-space-separated languages like Thai or Lao. It is recommended to adopt Unicode-aware lookarounds and expand the language detection check as suggested.
+ // Check if the keyword contains CJK (Chinese, Japanese, Korean) characters
+ const hasCJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/.test(normalized)
+
+ if (hasCJK) {
+   // Non-space-separated languages must use substring matching
+   return searchContent.includes(normalized)
+ }
+
- // Word boundary match
- const wordPattern = new RegExp(`\\b${escapeRegex(normalized)}\\b`, 'i')
- if (wordPattern.test(searchContent)) {
-   return true
+ // Dynamic word boundary match for space-separated languages
+ let patternStr = escapeRegex(normalized)
+ if (/^\w/.test(normalized)) {
+   patternStr = '\\b' + patternStr
+ }
+ if (/\w$/.test(normalized)) {
+   patternStr = patternStr + '\\b'
  }

- return false
+ const wordPattern = new RegExp(patternStr, 'i')
+ return wordPattern.test(searchContent)
The current implementation of word boundary matching has two significant limitations for non-English languages:
- ASCII-only \b and \w: The regexes /^\w/ and /\w$/ only match ASCII alphanumeric characters. For other space-separated languages (e.g., Cyrillic, Greek, Arabic, Hebrew), no word boundaries will be added, falling back to substring matching and causing false positives (e.g., "кот" matching "рокот"). Furthermore, JavaScript's native \b is ASCII-only, meaning it fails to match boundaries between non-ASCII characters and spaces/punctuation.
- Other non-space-separated languages: Languages like Thai, Lao, Khmer, and Burmese also do not use spaces to separate words. Since they are not CJK, they will go to the space-separated logic, but because they are non-ASCII, they won't get \b boundaries and will fall back to substring matching. However, to be robust, they should be explicitly grouped with CJK to ensure they always use substring matching.
We can solve both issues elegantly by:
- Expanding the non-space-separated check to include Thai, Lao, Khmer, and Burmese.
- Using Unicode-aware lookarounds ((?<![\p{L}\p{N}]) and (?![\p{L}\p{N}])) with the /u flag for space-separated languages.
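The first limitation can be reproduced in isolation. The following is a small sketch rather than project code; the sample sentences and variable names are made up for illustration, and the patterns are built with the RegExp constructor only so the example does not depend on a particular compiler target.

```typescript
const keyword = 'кот'
const contentWithSubstring = 'вдали слышен рокот'
const contentWithWord = 'это кот.'

// The ASCII-only edge test does not recognise Cyrillic letters as word characters,
// so the first-iteration logic would add no boundary at all.
const edgeIsAsciiWord = /^\w/.test(keyword)

// Without a boundary the regex degrades to a substring search: a false positive.
const unboundedMatch = new RegExp(keyword, 'i').test(contentWithSubstring)

// Forcing \b does not help either: it never matches next to non-ASCII letters.
const asciiBoundaryMatch = new RegExp('\\b' + keyword + '\\b', 'i').test(contentWithWord)

// Unicode-aware lookarounds treat any letter or number as part of a word.
const unicodePattern = new RegExp(
  '(?<![\\p{L}\\p{N}])' + keyword + '(?![\\p{L}\\p{N}])',
  'iu'
)
const unicodeRejectsSubstring = unicodePattern.test(contentWithSubstring)
const unicodeAcceptsWord = unicodePattern.test(contentWithWord)
```

In this sketch `unboundedMatch` is the unwanted hit on "рокот", `asciiBoundaryMatch` misses the standalone word, and only the lookaround pattern rejects the substring while accepting the word.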
// Check if the keyword contains characters from non-space-separated languages (CJK, Thai, Lao, Khmer, Burmese)
const isNonSpaceSeparated = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/.test(normalized)
if (isNonSpaceSeparated) {
  // Non-space-separated languages must use substring matching
  return searchContent.includes(normalized)
}
// Dynamic Unicode-aware word boundary match for space-separated languages
let patternStr = escapeRegex(normalized)
if (/^[\p{L}\p{N}]/u.test(normalized)) {
  patternStr = '(?<![\\p{L}\\p{N}])' + patternStr
}
if (/[\p{L}\p{N}]$/u.test(normalized)) {
  patternStr = patternStr + '(?![\\p{L}\\p{N}])'
}
const wordPattern = new RegExp(patternStr, 'iu')
return wordPattern.test(searchContent)
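To try the suggested logic outside the service, it can be wrapped in a standalone function. This is a sketch under assumptions: the real `textMatches` signature in `EntryRetrievalService` is not shown in this thread, so the two-argument form below is a guess, and `escapeRegex` is a hypothetical re-implementation of the project's helper.

```typescript
// Hypothetical helper: escapes regex metacharacters so a keyword is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Assumed signature; both arguments are expected to be normalized by the caller.
function textMatches(searchContent: string, normalized: string): boolean {
  // CJK, Thai, Lao, Khmer, and Burmese do not separate words with spaces.
  const isNonSpaceSeparated =
    /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/.test(
      normalized
    )
  if (isNonSpaceSeparated) {
    return searchContent.includes(normalized)
  }

  // Only add a boundary on an edge that is itself a letter or number.
  const wordChar = '[\\p{L}\\p{N}]'
  let patternStr = escapeRegex(normalized)
  if (new RegExp('^' + wordChar, 'u').test(normalized)) {
    patternStr = '(?<!' + wordChar + ')' + patternStr
  }
  if (new RegExp(wordChar + '$', 'u').test(normalized)) {
    patternStr = patternStr + '(?!' + wordChar + ')'
  }
  return new RegExp(patternStr, 'iu').test(searchContent)
}

const results = {
  embeddedEnglish: textMatches('the madhulkman arrived', 'hulk'),
  standaloneEnglish: textMatches('The Hulk arrived', 'hulk'),
  embeddedCyrillic: textMatches('вдали слышен рокот', 'кот'),
  japanese: textMatches('私は東京に住んでいます', '東京'),
  thai: textMatches('ฉันรักกรุงเทพ', 'กรุงเทพ'),
  punctuationEdge: textMatches('i like c++ a lot', 'c++'),
}
```

The `punctuationEdge` case shows why the lookarounds are conditional: a keyword ending in `+` gets no trailing boundary, so it can still match when followed by a space.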
Well I admit I wasn't considering every alphabet in the world, but you're right that it might as well be fixed all at once.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/services/ai/retrieval/EntryRetrievalService.ts`:
- Around line 672-680: The current textMatches logic uses /^\w/ and /\w$/ and
inserts \b boundaries which are ASCII-only; update it to use Unicode-aware
boundaries by replacing the \b logic with Unicode property lookarounds and the
'u' flag: check the first/last character of normalized using Unicode property
tests (letters/numbers) and, when needed, prefix patternStr with a negative
lookbehind like (?<![\p{L}\p{N}]) and/or suffix with a negative lookahead like
(?![\p{L}\p{N}]); construct the RegExp with the 'iu' flags (case-insensitive +
Unicode) so the final RegExp (created from patternStr and flags) correctly
enforces word boundaries for non-ASCII characters. Ensure you still escape the
user input with escapeRegex and keep using searchContent for testing in
textMatches.
📒 Files selected for processing (1)
src/lib/services/ai/retrieval/EntryRetrievalService.ts
Renamed hasCJK to isNonSpaceSeparated and added Thai, Lao, Khmer, and
Burmese to the non-space-separated language check. Replaced ASCII-only
\w and \b with Unicode property escapes and lookarounds (\p{L}\p{N})
so that Cyrillic, Arabic, Greek, accented Latin, and other non-ASCII
keywords receive proper word boundary matching.
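As a quick sanity check of the expanded script detection, the character class can be exercised against sample words. This is an illustrative sketch: the function name `usesNonSpaceSeparatedScript` and the sample words are assumptions, and only the ranges themselves come from the change.

```typescript
// Ranges from the change: Hiragana/Katakana, CJK Extension A, CJK Unified Ideographs,
// Hangul Syllables, Thai, Lao, Khmer, Myanmar (Burmese).
const NON_SPACE_SEPARATED =
  /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/

// Hypothetical wrapper name, used here only for readability.
function usesNonSpaceSeparatedScript(keyword: string): boolean {
  return NON_SPACE_SEPARATED.test(keyword)
}

// Keywords that should take the substring-matching path.
const substringPath = ['ひらがな', 'カタカナ', '漢字', '한국어', 'ไทย', 'ລາວ', 'ខ្មែរ', 'မြန်မာ']
// Keywords that should take the Unicode word-boundary path.
const boundaryPath = ['hulk', 'café', 'кот', 'Ελληνικά', 'عربي']

const allSubstring = substringPath.every(usesNonSpaceSeparatedScript)
const noneBoundary = boundaryPath.some(usesNonSpaceSeparatedScript)
```

With these samples `allSubstring` is true and `noneBoundary` is false, so Cyrillic, Greek, Arabic, and accented Latin keywords fall through to the lookaround-based regex.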