Releases: termdock/flashtext-i18n
4.0.0a9
Changelog
All notable changes to this project will be documented in this file.
[4.0.0a9] - 2026-01-14
🚀 Major Rewrite (The Rust Era)
- Rust Core: The entire core logic has been rewritten in Rust (flashtext-rs), providing massive performance gains and memory safety.
- Performance: Throughput increased by 3x-4x compared to v3.0 (Python). Match latency is now near-constant regardless of keyword count.
- Drop-in Compatible: 100% API compatibility with the original FlashText and v3.x series.
Added
- True Unicode Boundaries: Fixed the long-standing issue where non-ASCII characters (e.g., é, ß, adjacent CJK) were incorrectly treated as delimiters. Rust's unicode-segmentation now handles word boundaries correctly for all languages.
- Universal Wheels: Pre-compiled binary wheels for macOS (Intel/Apple Silicon), Windows (x64), Linux (x86_64/aarch64), and musl Linux (Alpine). Users do not need a Rust compiler.
- JSON File Loading: Native support for loading keywords from JSON files for faster startup.
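The boundary problem described under True Unicode Boundaries can be illustrated with a small pure-Python sketch. It is not the flashtext-rs implementation (which, per the notes above, relies on Rust's unicode-segmentation); the helper names below are hypothetical, and the ASCII word-character set is an assumption modelled on the original FlashText default.

```python
import string

# Assumption: an ASCII-only set of "word characters", in the style of the
# original FlashText default. Anything outside it counts as a delimiter.
ASCII_WORD_CHARS = set(string.digits + string.ascii_letters + "_")


def is_boundary_ascii(ch: str) -> bool:
    """ASCII-only rule: every non-ASCII character is treated as a delimiter."""
    return ch not in ASCII_WORD_CHARS


def is_boundary_unicode(ch: str) -> bool:
    """Unicode-aware rule: letters and digits of any script are word characters."""
    return not (ch.isalnum() or ch == "_")


def matches_whole_word(text: str, keyword: str, is_boundary) -> bool:
    """True if keyword occurs with a boundary (or the text edge) on both sides."""
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        left_ok = start == 0 or is_boundary(text[start - 1])
        right_ok = end == len(text) or is_boundary(text[end])
        if left_ok and right_ok:
            return True
        start = text.find(keyword, start + 1)
    return False


text = "I ordered a caf\u00e9 latte"  # "café", with a precomposed é

# With the ASCII rule, é is a delimiter, so "caf" wrongly matches inside "café".
ascii_hit = matches_whole_word(text, "caf", is_boundary_ascii)
# With the Unicode-aware rule, é is a letter, so "caf" is not a whole word.
unicode_hit = matches_whole_word(text, "caf", is_boundary_unicode)
```

The per-character check is a simplification; full word segmentation considers more context than a single neighbouring character.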
Changed
- Packaging: Migrated build system to maturin + pyo3.
- Minimum Python: Now requires Python >= 3.8.
3.1.1
[3.1.1] - 2026-01-13
Refactoring (Architecture 3.0)
- Modularization: Split monolithic keyword.py into distinct responsibilities:
  - flashtext/keyword.py: High-level API and facade.
  - flashtext/trie_dict.py: Data structure operations (pure functions).
  - flashtext/utils.py: Algorithms (Levenshtein) and helper utilities.
- Utils: Extracted extract_sentences and levensthein to utils.py to reduce class weight.
Performance
- Loop Optimization: Optimized extract_keywords hot loop by caching member variables and reducing object creation overhead.
- Benchmark: Performance restored to ~0.27s (Case-Sensitive) / 0.29s (Case-Insensitive) on standard corpus.
- Reverted: "Internationalized Word Boundaries" (Issue #4) was reverted because it caused a 3.5x performance regression. The issue has been reopened pending an optimized implementation.
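The "caching member variables" technique mentioned under Loop Optimization can be sketched as follows. This is a generic illustration of the pattern in CPython, not the actual extract_keywords code; TinyMatcher and its methods are hypothetical names.

```python
class TinyMatcher:
    """Hypothetical toy class used only to show the caching pattern."""

    def __init__(self, keywords):
        self.keywords = set(keywords)
        self.hits = []

    def scan_uncached(self, tokens):
        # Every iteration looks up self.keywords and self.hits.append again.
        self.hits = []
        for tok in tokens:
            if tok in self.keywords:
                self.hits.append(tok)
        return self.hits

    def scan_cached(self, tokens):
        # Bind attributes and the bound method to locals once, outside the loop,
        # so the loop body only touches local variables.
        keywords = self.keywords
        hits = []
        append = hits.append
        for tok in tokens:
            if tok in keywords:
                append(tok)
        self.hits = hits
        return hits


matcher = TinyMatcher(["rust", "python"])
tokens = "python calls into rust for the hot path".split()
uncached_result = list(matcher.scan_uncached(tokens))
cached_result = list(matcher.scan_cached(tokens))
```

Both methods return the same matches; the cached variant only reduces per-iteration attribute lookups, and any speedup should be confirmed with a benchmark on the real workload.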
Added
- Mixed Case Support: Added ability to mix case-sensitive and case-insensitive keywords in the same processor.
  - Implemented via Multi-Edge Trie (Space-for-Time), removing runtime lower() calls.
- Fuzzy Matching Support: Added max_cost parameter to support Levenshtein distance matching (including CJK support).
- Keyword Count API: New len(keyword_processor) support to get total unique terms.
- Replacement Metadata: replace_keywords now supports span_info=True to return detailed replacement records.
- Sentence Extraction: New extract_sentences() API to find sentences containing keywords.
- Clean Name Mapping: add_keyword now accepts a list of clean names (Issue #11).
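To make the max_cost idea under Fuzzy Matching Support concrete, here is a minimal standalone sketch of Levenshtein distance with a cost threshold. It does not reproduce the library's trie-based matching; levenshtein and fuzzy_lookup are hypothetical helpers, and only the parameter name max_cost is taken from the notes above. Because Python strings are sequences of code points, the same code handles CJK text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_lookup(word, keywords, max_cost):
    """Return the keywords within max_cost edits of word, closest first."""
    scored = [(levenshtein(word, kw), kw) for kw in keywords]
    return [kw for dist, kw in sorted(scored) if dist <= max_cost]


keywords = ["machine learning", "地中海贫血"]
close_latin = fuzzy_lookup("machne learning", keywords, max_cost=1)  # one missing letter
close_cjk = fuzzy_lookup("地中海貧血", keywords, max_cost=1)          # one differing character
```

A linear scan over all keywords is used here for clarity; a real implementation would typically prune the search while walking the trie.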
Fixed
- CJK Support: Fixed adjacent keyword extraction for Chinese/Japanese/Korean text (Issue #1).
- Unicode Spans: Fixed inaccurate span positions when handling Unicode characters that change length during case folding (Issue #2).
- Edge Cases:
- Platform: Verified Linux aarch64 support (Pure Python).
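The Unicode Spans fix concerns characters whose lowercase form has a different length, such as the Turkish İ (U+0130), which Python lowercases to two code points. The sketch below shows why spans computed on the lowercased text drift, and one way to map them back. It is an illustration under stated assumptions, not the library's code; lower_with_offsets is a hypothetical helper.

```python
def lower_with_offsets(text):
    """Lowercase text one character at a time and record, for each output
    character, the index of the input character that produced it."""
    pieces = []
    offsets = []
    for index, ch in enumerate(text):
        low = ch.lower()
        pieces.append(low)
        offsets.extend([index] * len(low))
    return "".join(pieces), offsets


text = "\u0130stanbul is big"  # starts with Turkish capital dotted I
lowered, offsets = lower_with_offsets(text)

# Find the keyword in the lowercased text.
start_l = lowered.find("big")
end_l = start_l + len("big")

# Naive approach: reuse the lowercased indices on the original text.
naive = text[start_l:end_l]

# Corrected approach: translate both ends back through the offset table.
start, end = offsets[start_l], offsets[end_l - 1] + 1
mapped = text[start:end]
```

Here the lowercased string is one character longer than the original, so the naive slice is shifted by one while the mapped slice recovers the keyword. Lowercasing character by character ignores context-dependent rules (for example the Greek final sigma), which is acceptable for this illustration.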
Documentation
- Added CONTRIBUTING.md with strict performance guidelines.
- Added benchmark.py for standardized performance testing.
- Updated README.md with new features and benchmark results.
v3.0.0 - Internationalization Fixes
🎉 flashtext-i18n v3.0.0
This is the first release of the i18n-focused fork of flashtext, with fixes for internationalization (CJK, Unicode) issues.
✨ Bug Fixes
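- #1 CJK + Numbers: Chinese keywords followed by digits are now extracted correctly.
  - "地中海贫血2" → ["地中海贫血"] ✅
- #2 Unicode Span: Unicode case conversion no longer produces incorrect span positions.
  - Special characters such as the Turkish İ are now handled correctly.
- #3 Adjacent Keywords: replace_keywords now works correctly for adjacent keywords.
  - Rewritten on top of extract_keywords, which is simpler and more reliable.
- #10 Custom Boundaries: Characters removed from non_word_boundaries can now be matched.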
📦 Installation
pip install flashtext-i18n
🔄 Migration from flashtext
# Before
from flashtext import KeywordProcessor
# After (drop-in replacement)
from flashtext import KeywordProcessor # Same import, different package
Breaking changes: none. The API is 100% compatible with flashtext 2.x.
🙏 Credits
Original flashtext by Vikash Singh
Fork maintained by termdock & Huang Chung Yi