span_info on combined unicode characters #82

Open
kkaiser wants to merge 49 commits into vi3k6i5:master from kkaiser:master
kkaiser commented May 28, 2019

This fixes issue #81.

Lowercasing a sentence can change its length, because some Unicode characters lowercase to more than one code point. Here 'İ' becomes 'i' followed by a combining dot above:

s = 'İ love Big Apple and Bay Area.'
len(s)  # 30
len(s.lower())  # 31

Keywords and the search sentence are now lowercased one character at a time, so span_info returns offsets that index correctly into the original sentence:

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('İ love')
s = 'İ love Big Apple and Bay Area.'
keywords_found = keyword_processor.extract_keywords(s, span_info=True)
keywords_found
# old: [('İ love', 0, 7), ('New York', 8, 17), ('Bay Area', 22, 30)]
# new: [('İ love', 0, 6), ('New York', 7, 16), ('Bay Area', 21, 29)]
for k in keywords_found:
    # repr() makes stray leading/trailing characters in the slice visible
    print(repr(s[k[1]:k[2]]))
# new: 'İ love'
# old: 'İ love '
# new: 'Big Apple'
# old: 'ig Apple '
# new: 'Bay Area'
# old: 'ay Area.'
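
The per-character idea can be sketched roughly as follows. This is an illustration only, not the code from this PR: `lower_per_char` and `find_keyword` are hypothetical helpers, and the plain linear scan stands in for flashtext's trie traversal. Each original character is lowercased on its own, so list positions stay aligned with indices in the original string.

```python
def lower_per_char(text):
    """Lowercase each character separately.

    Returns one list entry per original character; an entry may hold more
    than one code point (for example U+0130 lowercases to 'i' + U+0307).
    """
    return [ch.lower() for ch in text]


def find_keyword(sentence, keyword):
    """Return (start, end) spans of `keyword` in `sentence`, ignoring case.

    Hypothetical helper: a plain linear scan, not flashtext's trie lookup.
    """
    hay = lower_per_char(sentence)
    needle = lower_per_char(keyword)
    width = len(needle)
    return [
        (start, start + width)
        for start in range(len(hay) - width + 1)
        if hay[start:start + width] == needle
    ]


s = '\u0130 love Big Apple and Bay Area.'  # '\u0130' is 'İ'
spans = find_keyword(s, 'Big Apple')
naive_start = s.lower().find('big apple')  # offset into the lowered copy
print(spans, naive_start)
```

Slicing `s` with a span from `find_keyword` gives back the keyword as written in the original sentence, while the offset found in `s.lower()` sits one position too far to the right.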

vi3k6i5 and others added 29 commits November 10, 2017 20:47
added reference to flashtext paper
Fix typos: `charactes` to `characters`, `explaination` to `explanation`, `matche` to `match`
Fix issue with incomplete keyword at the end of the sentence
Performances improvement for strings manipulations
…a per char basis to return correct span_info
@coveralls

Coverage increased (+0.02%) to 99.327% when pulling 40633c9 on kkaiser:master into 50c45f1 on vi3k6i5:master.

kkaiser mentioned this pull request Jan 31, 2020
vi3k6i5 (Owner) commented May 3, 2020

Moving the check inside the loop adds a little execution overhead. Please share a test case for the change. Thanks.

vi3k6i5 (Owner) commented May 3, 2020

Can you please resolve the conflict?

kkaiser (Author) commented May 15, 2020

Ready for review

laurenegerton commented Nov 16, 2020

This is still happening with flashtext-2.7.
It looks like the fix was never merged into master.

@spencertollefson

@vi3k6i5 - Hi there! Is this ready to merge, and if so, can you complete merging this PR?

HCYT added a commit to termdock/flashtext-i18n that referenced this pull request Jan 13, 2026
Problem:
- Turkish İ lowercases to 'i̇' (2 chars), changing string length
- Lowercasing entire sentence upfront caused span position offsets

Solution:
- Lowercase each character individually during traversal
- This preserves original string indices for span_info

Fixes: #2
Ref: upstream vi3k6i5/flashtext#119, vi3k6i5/flashtext#81, vi3k6i5/flashtext#82
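
To make the offset problem in the commit message concrete, the sketch below inspects what `str.lower()` does to 'İ' and measures how far positions in the lowered copy drift from the original. The helper names `describe_lowercase` and `drift_before` are made up for this illustration and are not part of flashtext or flashtext-i18n.

```python
import unicodedata


def describe_lowercase(ch):
    """Name every code point produced by lowercasing a single character."""
    return [unicodedata.name(c) for c in ch.lower()]


def drift_before(text, index):
    """Count the extra code points str.lower() adds before `index`.

    An offset `index` in the original text corresponds to
    `index + drift_before(text, index)` in `text.lower()`.
    """
    return sum(len(ch.lower()) - 1 for ch in text[:index])


s = '\u0130 love Big Apple and Bay Area.'  # '\u0130' is 'İ'
names = describe_lowercase(s[0])
shift = drift_before(s, 7)  # 7 is where 'Big Apple' starts in the original
print(names, shift)
```

Under these assumptions, every offset after the 'İ' is shifted by one in the lowered copy, which matches the off-by-one spans reported for the upfront-lowercasing approach.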