Skip to content

Comments

fix: use word-boundary regex for geo-tagging keyword matching#330

Open
princelevant wants to merge 2 commits intokoala73:mainfrom
princelevant:fix/geo-tagging-substring-matching
Open

fix: use word-boundary regex for geo-tagging keyword matching#330
princelevant wants to merge 2 commits intokoala73:mainfrom
princelevant:fix/geo-tagging-substring-matching

Conversation

@princelevant
Copy link

Summary

  • Replaced String.includes() with word-boundary regex (\b...\b) across the entire geo-tagging pipeline to prevent substring false positives
  • Replaced the ambiguous "hts" keyword (matched "rights", "fights", etc.) with "tahrir al-sham" / "hayat tahrir"
  • Added 20 regression tests covering false positive prevention, true positive preservation, and edge cases

Problem

When zooming into Syria on the map, unrelated articles (e.g. French politics mentioning "ambassador") appeared at Syria's coordinates. The keyword "assad" matched as a substring inside "ambassador", and "hts" matched inside "rights", "fights", "flights", etc.

Root cause: keywords >= 5 characters used titleLower.includes(keyword) instead of word-boundary regex.

Files changed

File Change
src/services/geo-hub-index.ts Word-boundary regex for all keyword lengths
src/components/DeckGLMap.ts Hotspot keyword matching uses \b regex
src/components/Map.ts Same fix for mobile map
src/App.ts Flash location matching uses \b regex
src/services/entity-index.ts Entity keyword matching uses \b regex
src/services/country-instability.ts Country keyword matching uses \b regex
src/services/story-data.ts Country keyword matching uses \b regex
src/services/related-assets.ts Asset keyword matching uses \b regex
src/utils/analysis-constants.ts includesKeyword() utility uses \b regex
src/config/geo.ts Replaced "hts" with "tahrir al-sham" / "hayat tahrir"
tests/geo-keyword-matching.test.mjs 20 new test cases

Test plan

  • vite build passes clean
  • All 111 existing tests pass (0 regressions)
  • 20 new tests verify: "ambassador" no longer matches Syria, "rights" no longer matches Damascus, genuine Syria/HTS articles still match correctly

Fixes #324

-KT

🤖 Generated with Claude Code

…3#324)

Keyword matching across the geo-tagging pipeline used String.includes()
(substring matching), causing false positives like "assad" matching
inside "ambassador" and tagging unrelated articles to Syria. Replaced
all instances with word-boundary regex (\b...\b) for accurate matching.

Also replaced the ambiguous 3-char "hts" keyword (matched "rights",
"fights", etc.) with unambiguous "tahrir al-sham" / "hayat tahrir".

Fixes koala73#324

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
Copy link

vercel bot commented Feb 24, 2026

@princelevant is attempting to deploy a commit to the Elie Team on Vercel.

A member of the Team first needs to authorize it.

@koala73
Copy link
Owner

koala73 commented Feb 24, 2026

Lovely
Was on my todo
Thank you
Will review

@koala73
Copy link
Owner

koala73 commented Feb 24, 2026

Plan vs Implementation Review

Thanks for tackling #324! The core goal (fixing substring false positives) is right, but the implementation diverges from the approved plan in ways that introduce new issues. Here's a detailed comparison.

Approach Mismatch

The approved plan uses tokenization-based exact word matching (Set.has()), not \b word-boundary regex. Tokenization was chosen because \b still has edge cases with common English words used as keywords (e.g., 'oil', 'fed', 'house'), while tokenization eliminates ALL substring false positives by design:

"ambassador" → tokens: {"ambassador"} → has("assad")? NO ✓
"Assad regime" → tokens: {"assad","regime"} → has("assad")? YES ✓

Issues

# Severity Issue Detail
1 Critical Wrong approach Uses \b regex, not tokenization. \b still has edge cases with common words like 'oil', 'fed', 'house' (still in the DC keyword list at geo.ts:84)
2 Critical Removes 'hts' keyword Replaced with 'tahrir al-sham'/'hayat tahrir' only. Headlines saying just "HTS" (very common: "HTS forces advance") no longer match Damascus. With tokenization, keeping 'hts' is safe since tokens.has('hts')"rights"
3 Critical No regex cache — performance regression new RegExp() created on EVERY call in the hot loop. DeckGLMap: 100 news × 33 hotspots × 8 keywords = 26,400 RegExp allocations per render cycle (runs every few minutes)
4 High Changed shared includesKeyword() Modified analysis-constants.ts:188 which affects analysis-core.ts:313,347 (correlation/signal generation, not geo-tagging). Plan explicitly creates a separate src/utils/keyword-match.ts to avoid regression in non-geo paths
5 High No centralized utility The escape+regex pattern is copy-pasted 10+ times across 8 files. If matching logic changes, every site needs updating again
6 Medium Missing files tech-hub-index.ts:221 and server-side get-risk-scores.ts not updated — false positives persist there
7 Medium 'us ' and 'house' not fixed DC hotspot (geo.ts:84) still has 'us ' (trailing space hack for .includes()) and standalone 'house'. With \b regex, \bus \b may behave unexpectedly
8 Low Tests lack integration coverage All 20 tests are unit tests on the regex function. No tests against actual inferGeoHubsFromTitle(), normalizeCountryName(), or hotspot matching

What the PR Gets Right

  • Correct file coverage for core geo-tagging paths (8 of 10 files)
  • App.ts:findFlashLocation() included
  • Solid test cases for the "ambassador"/"assad" false positive
  • escapeRegex() used consistently for safety
  • Conflict-topic .includes() in DeckGLMap correctly converted

Recommended Changes

Per the approved plan (/plans/dapper-tinkering-engelbart.md):

  1. Create src/utils/keyword-match.ts with tokenizeForMatch() + matchKeyword() — single source of truth, tokenize once per title then O(1) Set lookups
  2. Keep 'hts' in Damascus keywords — tokenization makes it safe (no "rights" false positive)
  3. Tokenize once per title in hot loops, reuse across all hotspot keyword checks (faster than 26K regex allocations)
  4. Don't touch analysis-constants.ts — isolate geo-matching to avoid blast radius in analysis-core
  5. Add tech-hub-index.ts and get-risk-scores.ts to scope
  6. Remove 'us ' and 'house' from DC hotspot keywords
  7. Add integration tests for inferGeoHubsFromTitle() and normalizeCountryName()

The plan file has the full tokenizeForMatch() and matchKeyword() implementation with contiguous phrase matching for multi-word keywords.

…73#324)

Replace word-boundary regex with tokenization + Set lookups per approved plan:
- Create src/utils/keyword-match.ts as single source of truth
- Tokenize titles once, O(1) Set.has() per keyword (no RegExp allocations)
- Restore 'hts' keyword for Damascus (safe with tokenization)
- Revert shared includesKeyword() in analysis-constants.ts
- Remove 'us ' trailing-space hack and bare 'house' from DC keywords
- Add tech-hub-index.ts to scope (was missing)
- Add integration tests for inferGeoHubsFromTitle flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@princelevant
Copy link
Author

Hey @koala73 — this is a great initiative and I'm happy to contribute early on. The impact is huge. Thank you for the quick and prompt responses!

Here's the fix based on your feedback:

Changes in this revision:

  • Tokenization over regex — replaced all \b regex matching with tokenizeForMatch() + Set.has() lookups per the approved plan. Titles are tokenized once, then O(1) keyword checks — zero RegExp allocations in hot loops
  • Centralized utility — new src/utils/keyword-match.ts as single source of truth, all 10 files import from it
  • Restored 'hts' in Damascus keywords — tokenization makes it safe (no more "rights"/"flights" false positives)
  • Reverted analysis-constants.tsincludesKeyword() back to original, geo-matching is now fully isolated
  • Added tech-hub-index.ts to scope (was missing)
  • Removed 'us ' and 'house' from DC hotspot keywords
  • Integration tests added for the full inferGeoHubsFromTitle flow (41 tests, all passing)

Let me know if anything else needs adjusting. Yalla! 🚀

— KT

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Geo-tagging uses substring matching, causing articles to be placed in wrong map regions

2 participants