fix(lorebook): prevent substring matching for keywords, check for Asian characters also by vire70 · Pull Request #347 · AventurasTeam/Aventuras

vire70 · 2026-06-11T03:15:36Z

Removed the short-circuiting .includes() check that caused keywords (like 'hulk') to falsely trigger on substrings (like 'madhulkman').

Replaced it with a dynamic regex using \b word boundaries for space-separated languages. Added CJK character detection so that Asian language keywords (which don't use spaces) bypass \b and correctly use substring matching, since \b does not work between CJK characters.
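
To make the two failure modes in the description concrete, here is a minimal sketch. The helper names `substringMatch` and `asciiBoundaryMatch` and the sample strings are hypothetical and are not taken from `EntryRetrievalService.ts`; the sketch also assumes keywords without regex metacharacters, so no escaping is done.

```typescript
// Hypothetical helpers for illustration only; not the PR's actual code.

// Old behaviour: a plain substring search.
function substringMatch(haystack: string, keyword: string): boolean {
  return haystack.toLowerCase().includes(keyword.toLowerCase())
}

// ASCII word-boundary search. Assumes the keyword has no regex metacharacters.
function asciiBoundaryMatch(haystack: string, keyword: string): boolean {
  return new RegExp(`\\b${keyword}\\b`, 'i').test(haystack)
}

// 'hulk' is a substring of 'madhulkman', so the substring search fires.
console.log(substringMatch('the madhulkman arrived', 'hulk')) // true (false positive)

// \b needs a transition between a \w and a non-\w character, so this no longer matches.
console.log(asciiBoundaryMatch('the madhulkman arrived', 'hulk')) // false
console.log(asciiBoundaryMatch('The Hulk arrived', 'hulk')) // true

// CJK characters are not \w, so there is never a \b between two of them.
console.log(asciiBoundaryMatch('私は日本人です', '日本')) // false (keyword is missed)
console.log(substringMatch('私は日本人です', '日本')) // true
```

The last two lines are why the PR routes CJK keywords to substring matching instead of the boundary regex.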

Summary by CodeRabbit

Bug Fixes
- Improved keyword matching for CJK and other non-space-separated scripts (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese) using reliable substring matching.
- Enhanced boundary-aware matching for space-separated languages to reduce false positives.
- Refined overall search result relevance across mixed-language queries.

coderabbitai · 2026-06-11T03:15:49Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 31b69a66-029a-418b-a117-97a37cd8304f

📥 Commits

Reviewing files that changed from the base of the PR and between faefa81 and f78df9d.

📒 Files selected for processing (1)

src/lib/services/ai/retrieval/EntryRetrievalService.ts

📝 Walkthrough

textMatches in EntryRetrievalService now special-cases non‑space‑separated scripts (CJK/Thai/Lao/Khmer/Burmese) and uses plain substring matching; for other scripts it constructs an escaped Unicode-aware regex with conditional lookarounds at the query edges and tests that against searchContent.

Changes

CJK Text Matching Logic

| Layer / File(s) | Summary |
| --- | --- |
| textMatches CJK and word-boundary logic `src/lib/services/ai/retrieval/EntryRetrievalService.ts` | Detects non-space-separated scripts and uses `searchContent.includes(normalized)` for those; otherwise escapes the normalized query, adds Unicode letter/number-aware lookarounds at start/end when appropriate, and runs `regex.test(searchContent)`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Lorebook keywords trigger from any substring #346: Both changes address keyword-matching behavior and special-casing non‑spaced scripts vs. Unicode-aware whole-word matching.

Poem

🐰 I hop through text both near and far,
CJK blossoms match by substring star,
While letters wear lookarounds polite,
Regex and includes both do the right,
A tiny rabbit cheers the search tonight.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main changes: preventing substring matching for keywords and adding support for Asian character detection, which aligns with the PR's core objectives. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request updates the keyword matching logic in EntryRetrievalService to use substring matching for CJK characters and dynamic word boundaries for space-separated languages. However, the implementation relies on ASCII-only boundaries (\b and \w), which fail for non-ASCII space-separated languages like Cyrillic, and it misses other non-space-separated languages like Thai or Lao. It is recommended to adopt Unicode-aware lookarounds and expand the language detection check as suggested.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-11T03:17:42Z

+    // Check if the keyword contains CJK (Chinese, Japanese, Korean) characters
+    const hasCJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/.test(normalized)
+
+    if (hasCJK) {
+      // Non-space-separated languages must use substring matching
+      return searchContent.includes(normalized)
    }

-    // Word boundary match
-    const wordPattern = new RegExp(`\\b${escapeRegex(normalized)}\\b`, 'i')
-    if (wordPattern.test(searchContent)) {
-      return true
+    // Dynamic word boundary match for space-separated languages
+    let patternStr = escapeRegex(normalized)
+    if (/^\w/.test(normalized)) {
+      patternStr = '\\b' + patternStr
+    }
+    if (/\w$/.test(normalized)) {
+      patternStr = patternStr + '\\b'
    }

-    return false
+    const wordPattern = new RegExp(patternStr, 'i')
+    return wordPattern.test(searchContent)


The current implementation of word boundary matching has two significant limitations for non-English languages:

1. ASCII-only \b and \w: The regexes /^\w/ and /\w$/ only match ASCII letters, digits, and the underscore. For keywords in other space-separated scripts (e.g., Cyrillic, Greek, Arabic, Hebrew), no word boundaries are added, so matching falls back to substring matching and causes false positives (e.g., "кот" matching "рокот"). Furthermore, JavaScript's native \b is defined in terms of \w, so it is ASCII-only as well and fails to match boundaries between non-ASCII characters and spaces/punctuation.

2. Other non-space-separated languages: Languages like Thai, Lao, Khmer, and Burmese also do not use spaces to separate words. Since they are not CJK, they go through the space-separated logic, but because they are non-ASCII, they get no \b boundaries and fall back to substring matching anyway. To be robust, they should be explicitly grouped with CJK so that they always use substring matching.

We can solve both issues by:

1. Expanding the non-space-separated check to include Thai, Lao, Khmer, and Burmese.

2. Using Unicode-aware lookarounds ((?<![\p{L}\p{N}]) and (?![\p{L}\p{N}])) with the /u flag for space-separated languages.
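
As a rough illustration of the first limitation, the sketch below compares the ASCII-only edge logic with the lookaround-based one on the "кот" / "рокот" example. The function names are hypothetical, keywords are assumed to contain no regex metacharacters (so escaping is skipped), and the Unicode patterns are built with the `RegExp` constructor only to keep the sketch independent of the TypeScript compile target.

```typescript
// Hypothetical comparison helpers; not the PR's actual code.

// First revision's approach: \w is ASCII-only, so Cyrillic keywords get no \b at all.
function asciiEdgeMatch(haystack: string, keyword: string): boolean {
  let pattern = keyword // assumes no regex metacharacters in the keyword
  if (/^\w/.test(keyword)) pattern = '\\b' + pattern
  if (/\w$/.test(keyword)) pattern = pattern + '\\b'
  return new RegExp(pattern, 'i').test(haystack)
}

// Unicode-aware edge tests: any letter or number in any script counts as a word character.
const STARTS_WITH_WORD_CHAR = new RegExp('^[\\p{L}\\p{N}]', 'u')
const ENDS_WITH_WORD_CHAR = new RegExp('[\\p{L}\\p{N}]$', 'u')

function unicodeEdgeMatch(haystack: string, keyword: string): boolean {
  let pattern = keyword // assumes no regex metacharacters in the keyword
  if (STARTS_WITH_WORD_CHAR.test(keyword)) pattern = '(?<![\\p{L}\\p{N}])' + pattern
  if (ENDS_WITH_WORD_CHAR.test(keyword)) pattern = pattern + '(?![\\p{L}\\p{N}])'
  return new RegExp(pattern, 'iu').test(haystack)
}

console.log(asciiEdgeMatch('рокот', 'кот')) // true (false positive: plain substring)
console.log(unicodeEdgeMatch('рокот', 'кот')) // false
console.log(unicodeEdgeMatch('чёрный кот спит', 'кот')) // true
```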

Suggested change

-    // Check if the keyword contains CJK (Chinese, Japanese, Korean) characters
-    const hasCJK = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/.test(normalized)
-
-    if (hasCJK) {
-      // Non-space-separated languages must use substring matching
-      return searchContent.includes(normalized)
-    }
-
-    // Dynamic word boundary match for space-separated languages
-    let patternStr = escapeRegex(normalized)
-    if (/^\w/.test(normalized)) {
-      patternStr = '\\b' + patternStr
-    }
-    if (/\w$/.test(normalized)) {
-      patternStr = patternStr + '\\b'
-    }
-
-    const wordPattern = new RegExp(patternStr, 'i')
-    return wordPattern.test(searchContent)
+    // Check if the keyword contains characters from non-space-separated languages (CJK, Thai, Lao, Khmer, Burmese)
+    const isNonSpaceSeparated = /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/.test(normalized)
+
+    if (isNonSpaceSeparated) {
+      // Non-space-separated languages must use substring matching
+      return searchContent.includes(normalized)
+    }
+
+    // Dynamic Unicode-aware word boundary match for space-separated languages
+    let patternStr = escapeRegex(normalized)
+    if (/^[\p{L}\p{N}]/u.test(normalized)) {
+      patternStr = '(?<![\\p{L}\\p{N}])' + patternStr
+    }
+    if (/[\p{L}\p{N}]$/u.test(normalized)) {
+      patternStr = patternStr + '(?![\\p{L}\\p{N}])'
+    }
+
+    const wordPattern = new RegExp(patternStr, 'iu')
+    return wordPattern.test(searchContent)

Well I admit I wasn't considering every alphabet in the world, but you're right that it might as well be fixed all at once.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/services/ai/retrieval/EntryRetrievalService.ts`:
- Around line 672-680: The current textMatches logic uses /^\w/ and /\w$/ and
inserts \b boundaries which are ASCII-only; update it to use Unicode-aware
boundaries by replacing the \b logic with Unicode property lookarounds and the
'u' flag: check the first/last character of normalized using Unicode property
tests (letters/numbers) and, when needed, prefix patternStr with a negative
lookbehind like (?<![\p{L}\p{N}]) and/or suffix with a negative lookahead like
(?![\p{L}\p{N}]); construct the RegExp with the 'iu' flags (case-insensitive +
Unicode) so the final RegExp (created from patternStr and flags) correctly
enforces word boundaries for non-ASCII characters. Ensure you still escape the
user input with escapeRegex and keep using searchContent for testing in
textMatches.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6e8180fa-f063-4c0d-ac13-618031558a3c

📥 Commits

Reviewing files that changed from the base of the PR and between 2a41f21 and faefa81.

📒 Files selected for processing (1)

src/lib/services/ai/retrieval/EntryRetrievalService.ts

Renamed hasCJK to isNonSpaceSeparated and added Thai, Lao, Khmer, and Burmese to the non-space-separated language check. Replaced ASCII-only \w and \b with Unicode property escapes and lookarounds (\p{L}\p{N}) so that Cyrillic, Arabic, Greek, accented Latin, and other non-ASCII keywords receive proper word boundary matching.
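
For readers who want to try the final behaviour outside the service, the following is a self-contained sketch based on the suggested change above. Several parts are assumptions rather than the PR's code: the function name `textMatchesSketch`, the body of `escapeRegex` (the PR calls it but does not show it), and the trim/lowercase normalization of both inputs. The Unicode patterns use the `RegExp` constructor only so the sketch does not depend on the TypeScript compile target.

```typescript
// Assumed helper: the PR calls escapeRegex but its implementation is not shown.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Ranges copied from the suggested change: kana, CJK Extension A, CJK Unified
// Ideographs, Hangul syllables, Thai, Lao, Khmer, Myanmar (Burmese).
const NON_SPACE_SEPARATED =
  /[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af\u0e00-\u0e7f\u0e80-\u0eff\u1780-\u17ff\u1000-\u109f]/

const STARTS_WITH_LETTER_OR_NUMBER = new RegExp('^[\\p{L}\\p{N}]', 'u')
const ENDS_WITH_LETTER_OR_NUMBER = new RegExp('[\\p{L}\\p{N}]$', 'u')

// Hypothetical stand-in for textMatches; normalization here is an assumption.
function textMatchesSketch(rawContent: string, rawQuery: string): boolean {
  const normalized = rawQuery.trim().toLowerCase()
  if (normalized.length === 0) return false
  const searchContent = rawContent.toLowerCase()

  // Scripts written without spaces between words: plain substring matching.
  if (NON_SPACE_SEPARATED.test(normalized)) {
    return searchContent.includes(normalized)
  }

  // Otherwise add a lookaround only on edges that start/end with a letter or number,
  // so keywords such as 'c++' still match when followed by punctuation or spaces.
  let patternStr = escapeRegex(normalized)
  if (STARTS_WITH_LETTER_OR_NUMBER.test(normalized)) {
    patternStr = '(?<![\\p{L}\\p{N}])' + patternStr
  }
  if (ENDS_WITH_LETTER_OR_NUMBER.test(normalized)) {
    patternStr = patternStr + '(?![\\p{L}\\p{N}])'
  }
  return new RegExp(patternStr, 'iu').test(searchContent)
}

console.log(textMatchesSketch('the madhulkman arrived', 'hulk')) // false
console.log(textMatchesSketch('The Hulk arrived', 'hulk')) // true
console.log(textMatchesSketch('рокот мотора', 'кот')) // false
console.log(textMatchesSketch('私は日本人です', '日本')) // true
console.log(textMatchesSketch('learning c++ today', 'c++')) // true
```

Applying the lookarounds conditionally keeps keywords that begin or end with punctuation matchable, since a boundary assertion on a non-letter edge would otherwise never be satisfied in the intended way.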

gemini-code-assist Bot reviewed Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

Comment thread src/lib/services/ai/retrieval/EntryRetrievalService.ts Outdated

vire70 added 2 commits June 11, 2026 13:54

style: fix prettier formatting for long regex line

f78df9d

failerko approved these changes Jun 12, 2026

vire70 wants to merge 3 commits into AventurasTeam:master from vire70:fix/lorebook-keyword-boundaries
