fix(lorebook): prevent substring matching for keywords, check for Asian characters also #347

Open
vire70 wants to merge 3 commits into AventurasTeam:master from vire70:fix/lorebook-keyword-boundaries
Conversation

@vire70 vire70 commented Jun 11, 2026

Removed the short-circuiting .includes() check that caused keywords (like 'hulk') to match falsely inside longer words (like 'madhulkman').

Replaced it with a dynamic regex using \b word boundaries for space-separated languages. Added CJK character detection so that keywords in Asian scripts written without spaces between words bypass \b and correctly use substring matching, since JavaScript's \b is defined in terms of ASCII word characters and never matches between CJK characters.
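
To make the problem concrete, here is a minimal sketch of the behaviors described above. The `escapeRegex` body is an assumption for illustration (the PR calls a helper of that name, but its implementation is not shown on this page), and the sample strings are invented.

```typescript
// Hypothetical helper: the PR uses escapeRegex, but its body is not shown here.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const englishKeyword = "hulk";

// Old behavior: a plain substring check fires inside a longer word.
const substringHit = "madhulkman".includes(englishKeyword); // true (false positive)

// New behavior for space-separated text: require \b on both sides.
const bounded = new RegExp(`\\b${escapeRegex(englishKeyword)}\\b`, "i");
const boundedMiss = bounded.test("madhulkman"); // false
const boundedHit = bounded.test("The Hulk smashes."); // true

// \b is defined via ASCII \w, so it never fires next to CJK characters.
// A \b-wrapped CJK keyword therefore cannot match at all.
const cjkBoundary = new RegExp("\\b東京\\b").test("東京タワー"); // false
const cjkSubstring = "東京タワー".includes("東京"); // true

console.log({ substringHit, boundedMiss, boundedHit, cjkBoundary, cjkSubstring });
```

The last two lines are why the PR routes CJK keywords to `includes()` instead of wrapping them in `\b`.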

Summary by CodeRabbit

  • Bug Fixes
    • Improved keyword matching for CJK and other non-space-separated scripts (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese) using reliable substring matching.
    • Enhanced boundary-aware matching for space-separated languages to reduce false positives.
    • Refined overall search result relevance across mixed-language queries.


coderabbitai Bot commented Jun 11, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 31b69a66-029a-418b-a117-97a37cd8304f

📥 Commits

Reviewing files that changed from the base of the PR and between faefa81 and f78df9d.

📒 Files selected for processing (1)
  • src/lib/services/ai/retrieval/EntryRetrievalService.ts

📝 Walkthrough

textMatches in EntryRetrievalService now special-cases non‑space‑separated scripts (CJK/Thai/Lao/Khmer/Burmese) and uses plain substring matching; for other scripts it constructs an escaped Unicode-aware regex with conditional lookarounds at the query edges and tests that against searchContent.

Changes

CJK Text Matching Logic

| Layer / File(s) | Summary |
| --- | --- |
| textMatches CJK and word-boundary logic (src/lib/services/ai/retrieval/EntryRetrievalService.ts) | Detects non-space-separated scripts and uses searchContent.includes(normalized) for those; otherwise escapes the normalized query, adds Unicode letter/number-aware lookarounds at start/end when appropriate, and runs regex.test(searchContent). |
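
The behavior summarized above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in EntryRetrievalService.ts: the `escapeRegex` body is hypothetical, the two-argument signature is invented for the example, and both inputs are assumed to be normalized by the caller. The character ranges are the ones quoted in the review suggestion on this page.

```typescript
// Hypothetical helper: the PR uses escapeRegex, but its body is not shown here.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Ranges from the review suggestion: kana, CJK ideographs (incl. Extension A),
// Hangul syllables, Thai, Lao, Khmer, Myanmar (Burmese).
const NON_SPACE_SEPARATED =
  /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/;

// Built with the RegExp constructor so the "u" flag is not subject to
// compile-target checks on regex literals.
const STARTS_WITH_WORD_CHAR = new RegExp("^[\\p{L}\\p{N}]", "u");
const ENDS_WITH_WORD_CHAR = new RegExp("[\\p{L}\\p{N}]$", "u");

// Assumed signature: the real method's parameters are not shown on this page.
function textMatches(normalized: string, searchContent: string): boolean {
  if (NON_SPACE_SEPARATED.test(normalized)) {
    // No reliable word boundaries in these scripts: fall back to substring matching.
    return searchContent.includes(normalized);
  }
  let patternStr = escapeRegex(normalized);
  // Only anchor an edge that is itself a letter or digit, so keywords such as
  // "c++" still match when followed by punctuation or a space.
  if (STARTS_WITH_WORD_CHAR.test(normalized)) {
    patternStr = "(?<![\\p{L}\\p{N}])" + patternStr;
  }
  if (ENDS_WITH_WORD_CHAR.test(normalized)) {
    patternStr = patternStr + "(?![\\p{L}\\p{N}])";
  }
  return new RegExp(patternStr, "iu").test(searchContent);
}

console.log(textMatches("hulk", "madhulkman")); // false
console.log(textMatches("hulk", "The Hulk smashes")); // true
console.log(textMatches("кот", "слышен рокот моря")); // false
console.log(textMatches("東京", "東京タワー")); // true
```

One consequence of this design is that a keyword containing even a single character from the listed ranges takes the substring path as a whole, including any Latin portion of it.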

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hop through text both near and far,
CJK blossoms match by substring star,
While letters wear lookarounds polite,
Regex and includes both do the right,
A tiny rabbit cheers the search tonight.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main changes: preventing substring matching for keywords and adding support for Asian character detection, which aligns with the PR's core objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the keyword matching logic in EntryRetrievalService to use substring matching for CJK characters and dynamic word boundaries for space-separated languages. However, the implementation relies on ASCII-only boundaries (\b and \w), which fail for space-separated languages written in non-ASCII scripts such as Cyrillic, and it misses other non-space-separated languages like Thai or Lao. It is recommended to adopt Unicode-aware lookarounds and expand the language detection check as suggested.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +662 to +680
+     // Check if the keyword contains CJK (Chinese, Japanese, Korean) characters
+     const hasCJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/.test(normalized)
+
+     if (hasCJK) {
+       // Non-space-separated languages must use substring matching
+       return searchContent.includes(normalized)
+     }

-     // Word boundary match
-     const wordPattern = new RegExp(`\\b${escapeRegex(normalized)}\\b`, 'i')
-     if (wordPattern.test(searchContent)) {
-       return true
+     // Dynamic word boundary match for space-separated languages
+     let patternStr = escapeRegex(normalized)
+     if (/^\w/.test(normalized)) {
+       patternStr = '\\b' + patternStr
+     }
+     if (/\w$/.test(normalized)) {
+       patternStr = patternStr + '\\b'
      }

-     return false
+     const wordPattern = new RegExp(patternStr, 'i')
+     return wordPattern.test(searchContent)

high

The current implementation of word boundary matching has two significant limitations for non-English languages:

  1. ASCII-only \b and \w: The regexes /^\w/ and /\w$/ match only ASCII letters, digits, and the underscore. For keywords in other space-separated scripts (e.g., Cyrillic, Greek, Arabic, Hebrew), no word boundaries will be added, so matching falls back to substring matching and causes false positives (e.g., "кот" matching "рокот"). Furthermore, JavaScript's native \b is ASCII-only, meaning it fails to match boundaries between non-ASCII characters and spaces/punctuation.
  2. Other Non-Space-Separated Languages: Languages like Thai, Lao, Khmer, and Burmese also do not use spaces to separate words. Since they are not CJK, they will go to the space-separated logic, but because they are non-ASCII, they won't get \b boundaries and will fall back to substring matching. However, to be robust, they should be explicitly grouped with CJK to ensure they always use substring matching.
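
Both halves of the first limitation can be reproduced in a few lines. This is an illustrative sketch with invented sample sentences; it mirrors the first-commit logic only in spirit and does not use the project's code.

```typescript
const cyrillicKeyword = "кот";
const sampleText = "слышен рокот моря"; // contains "рокот", not the standalone word "кот"

// \w is ASCII-only, so the first-commit check sees no word character at either edge...
const startsWithAsciiWord = /^\w/.test(cyrillicKeyword); // false
const endsWithAsciiWord = /\w$/.test(cyrillicKeyword); // false

// ...and the resulting unanchored pattern behaves like a substring search.
const unanchoredHit = new RegExp(cyrillicKeyword, "i").test(sampleText); // true (false positive)

// Forcing \b does not help: with no ASCII word characters nearby there is no
// position where \b can match, so even a standalone word is missed.
const forcedBoundaryHit = new RegExp(`\\b${cyrillicKeyword}\\b`, "i").test("рыжий кот спит"); // false

console.log({ startsWithAsciiWord, endsWithAsciiWord, unanchoredHit, forcedBoundaryHit });
```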

We can solve both issues elegantly by:

  • Expanding the non-space-separated check to include Thai, Lao, Khmer, and Burmese.
  • Using Unicode-aware lookarounds ((?<![\p{L}\p{N}]) and (?![\p{L}\p{N}])) with the /u flag for space-separated languages.
Suggested change
-     // Check if the keyword contains CJK (Chinese, Japanese, Korean) characters
-     const hasCJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/.test(normalized)
-     if (hasCJK) {
-       // Non-space-separated languages must use substring matching
-       return searchContent.includes(normalized)
-     }
-     // Dynamic word boundary match for space-separated languages
-     let patternStr = escapeRegex(normalized)
-     if (/^\w/.test(normalized)) {
-       patternStr = '\\b' + patternStr
-     }
-     if (/\w$/.test(normalized)) {
-       patternStr = patternStr + '\\b'
-     }
-     const wordPattern = new RegExp(patternStr, 'i')
-     return wordPattern.test(searchContent)
+     // Check if the keyword contains characters from non-space-separated languages (CJK, Thai, Lao, Khmer, Burmese)
+     const isNonSpaceSeparated = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/.test(normalized)
+     if (isNonSpaceSeparated) {
+       // Non-space-separated languages must use substring matching
+       return searchContent.includes(normalized)
+     }
+     // Dynamic Unicode-aware word boundary match for space-separated languages
+     let patternStr = escapeRegex(normalized)
+     if (/^[\p{L}\p{N}]/u.test(normalized)) {
+       patternStr = '(?<![\\p{L}\\p{N}])' + patternStr
+     }
+     if (/[\p{L}\p{N}]$/u.test(normalized)) {
+       patternStr = patternStr + '(?![\\p{L}\\p{N}])'
+     }
+     const wordPattern = new RegExp(patternStr, 'iu')
+     return wordPattern.test(searchContent)

Author

Well I admit I wasn't considering every alphabet in the world, but you're right that it might as well be fixed all at once.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/services/ai/retrieval/EntryRetrievalService.ts`:
- Around line 672-680: The current textMatches logic uses /^\w/ and /\w$/ and
inserts \b boundaries which are ASCII-only; update it to use Unicode-aware
boundaries by replacing the \b logic with Unicode property lookarounds and the
'u' flag: check the first/last character of normalized using Unicode property
tests (letters/numbers) and, when needed, prefix patternStr with a negative
lookbehind like (?<![\p{L}\p{N}]) and/or suffix with a negative lookahead like
(?![\p{L}\p{N}]); construct the RegExp with the 'iu' flags (case-insensitive +
Unicode) so the final RegExp (created from patternStr and flags) correctly
enforces word boundaries for non-ASCII characters. Ensure you still escape the
user input with escapeRegex and keep using searchContent for testing in
textMatches.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6e8180fa-f063-4c0d-ac13-618031558a3c

📥 Commits

Reviewing files that changed from the base of the PR and between 2a41f21 and faefa81.

📒 Files selected for processing (1)
  • src/lib/services/ai/retrieval/EntryRetrievalService.ts

Comment thread src/lib/services/ai/retrieval/EntryRetrievalService.ts Outdated
vire70 added 2 commits June 11, 2026 13:54
Renamed hasCJK to isNonSpaceSeparated and added Thai, Lao, Khmer, and
Burmese to the non-space-separated language check. Replaced ASCII-only
\w and \b with Unicode property escapes and lookarounds (\p{L}\p{N})
so that Cyrillic, Arabic, Greek, accented Latin, and other non-ASCII
keywords receive proper word boundary matching.