fix: use word-boundary regex for geo-tagging keyword matching #330
princelevant wants to merge 2 commits into koala73:main
…3#324)
Keyword matching across the geo-tagging pipeline used String.includes() (substring matching), causing false positives like "assad" matching inside "ambassador" and tagging unrelated articles to Syria. Replaced all instances with word-boundary regex (\b...\b) for accurate matching.
Also replaced the ambiguous 3-char "hts" keyword (matched "rights", "fights", etc.) with unambiguous "tahrir al-sham" / "hayat tahrir".
Fixes koala73#324
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
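To illustrate the false positive this commit describes, here is a minimal sketch comparing substring matching with a word-boundary regex. The helper names `escapeRegExp` and `matchesWholeWord` and the sample titles are hypothetical; they are not taken from the PR's code.

```typescript
// Hypothetical helpers for illustration only; not the PR's implementation.

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only when `keyword` appears as a whole word in `title` (case-insensitive).
function matchesWholeWord(title: string, keyword: string): boolean {
  const pattern = new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "i");
  return pattern.test(title);
}

const sampleTitle = "French ambassador defends human rights record";
const titleLower = sampleTitle.toLowerCase();

// Substring matching: both keywords hit inside unrelated words.
console.log(titleLower.includes("assad")); // true ("ambassador")
console.log(titleLower.includes("hts")); // true ("rights")

// Word-boundary matching: neither keyword matches.
console.log(matchesWholeWord(sampleTitle, "assad")); // false
console.log(matchesWholeWord(sampleTitle, "hts")); // false

// A genuine mention still matches.
console.log(matchesWholeWord("Assad meets envoy in Damascus", "assad")); // true
```

Note that this sketch builds a new RegExp on every call, which is one of the costs a tokenization-based approach avoids.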
@princelevant is attempting to deploy a commit to the Elie Team on Vercel. A member of the Team first needs to authorize it.
Lovely
Plan vs Implementation Review
Thanks for tackling #324! The core goal (fixing substring false positives) is right, but the implementation diverges from the approved plan in ways that introduce new issues. Here's a detailed comparison.
Approach Mismatch
The approved plan uses tokenization-based exact word matching (…
Issues
What the PR Gets Right
Recommended Changes
Per the approved plan (…
The plan file has the full …
…73#324)
Replace word-boundary regex with tokenization + Set lookups per approved plan:
- Create src/utils/keyword-match.ts as single source of truth
- Tokenize titles once, O(1) Set.has() per keyword (no RegExp allocations)
- Restore 'hts' keyword for Damascus (safe with tokenization)
- Revert shared includesKeyword() in analysis-constants.ts
- Remove 'us ' trailing-space hack and bare 'house' from DC keywords
- Add tech-hub-index.ts to scope (was missing)
- Add integration tests for inferGeoHubsFromTitle flow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
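The tokenization approach named in this commit can be sketched roughly as follows. This is an assumption-laden illustration, not the contents of src/utils/keyword-match.ts: the names `tokenizeTitle` and `findKeywordMatches` are hypothetical, tokens are limited to ASCII letters and digits, and only single-word keywords are handled.

```typescript
// Hypothetical sketch; the real src/utils/keyword-match.ts may differ.

// Lowercase the title and split on anything that is not an ASCII letter or digit.
function tokenizeTitle(title: string): string[] {
  return title
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Tokenize once, then test each single-word keyword with an O(1) Set lookup.
function findKeywordMatches(title: string, keywords: string[]): string[] {
  const tokens = new Set(tokenizeTitle(title));
  return keywords.filter((keyword) => tokens.has(keyword));
}

const syriaKeywords = ["assad", "hts"];

// "ambassador" and "rights" are whole tokens, so neither keyword matches.
console.log(findKeywordMatches("French ambassador defends human rights", syriaKeywords)); // []

// "hts" as a standalone token matches, which is why the short keyword is safe here.
console.log(findKeywordMatches("HTS seizes town near Damascus", syriaKeywords)); // ["hts"]
```

Because a keyword must equal a whole token, a short keyword such as "hts" no longer matches inside "rights" or "fights". Multi-word keywords would need extra handling (for example, comparing runs of consecutive tokens), which this sketch leaves out.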
Hey @koala73, this is a great initiative and I'm happy to contribute early on. The impact is huge. Thank you for the prompt responses! Here's the fix based on your feedback.
Changes in this revision: …
Let me know if anything else needs adjusting. Yalla! 🚀
-KT
Summary
- Replace String.includes() with word-boundary regex (\b...\b) across the entire geo-tagging pipeline to prevent substring false positives
- Replace the ambiguous "hts" keyword (matched "rights", "fights", etc.) with "tahrir al-sham" / "hayat tahrir"
Problem
When zooming into Syria on the map, unrelated articles (e.g. French politics mentioning "ambassador") appeared at Syria's coordinates. The keyword "assad" matched as a substring inside "ambassador", and "hts" matched inside "rights", "fights", "flights", etc.
Root cause: keywords >= 5 characters used titleLower.includes(keyword) instead of word-boundary regex.
Files changed
- src/services/geo-hub-index.ts
- src/components/DeckGLMap.ts: \b regex
- src/components/Map.ts
- src/App.ts: \b regex
- src/services/entity-index.ts: \b regex
- src/services/country-instability.ts: \b regex
- src/services/story-data.ts: \b regex
- src/services/related-assets.ts: \b regex
- src/utils/analysis-constants.ts: includesKeyword() utility uses \b regex
- src/config/geo.ts: "hts" with "tahrir al-sham" / "hayat tahrir"
- tests/geo-keyword-matching.test.mjs
Test plan
- vite build passes clean
Fixes #324
-KT
🤖 Generated with Claude Code