Thai text analysis plugin for INFINI Pizza — wraps the core Thai tokenizer with Thai-specific normalization.
The core ThaiTokenizer provides basic script-boundary segmentation (splitting at Thai/non-Thai boundaries). This dedicated crate adds Thai-specific normalizations essential for search quality:
| Normalization | Before | After | Why |
|---|---|---|---|
| Sara Am decomposition | ทำ (ท + ำ) | ท + ◌ํ + า | Match both composed and decomposed forms |
| Thai digit → ASCII | ราคา๑๐๐ | ราคา100 | Match regardless of digit style |
| Zero-width removal | สวัสดี\u200Bครับ | สวัสดีครับ | Strip word-boundary hints |
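The three rewrites in the table can be sketched as one pass over the characters of a token. This is a minimal illustration, not the crate's implementation: the function name `normalize_thai` is hypothetical, and the exact set of zero-width characters the filter strips is an assumption (the table only shows U+200B).

```rust
/// Hypothetical helper (not the crate's API): applies the three
/// normalizations from the table above in a single pass.
pub fn normalize_thai(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            // Sara Am (U+0E33) decomposes to Nikhahit (U+0E4D) + Sara Aa (U+0E32).
            '\u{0E33}' => {
                out.push('\u{0E4D}');
                out.push('\u{0E32}');
            }
            // Thai digits U+0E50..=U+0E59 map to ASCII '0'..='9'.
            '\u{0E50}'..='\u{0E59}' => {
                let digit = (ch as u32 - 0x0E50) as u8;
                out.push(char::from(b'0' + digit));
            }
            // Assumed zero-width set: ZWSP, ZWNJ, ZWJ, and BOM/ZWNBSP are dropped.
            '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}' => {}
            // Everything else passes through unchanged.
            _ => out.push(ch),
        }
    }
    out
}
```

Decomposing Sara Am means a query typed with the composed character and a document stored with the decomposed sequence end up as the same indexed term.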
| Name | Type | Description |
|---|---|---|
| `thai_normalization` | TokenFilter | Sara Am decomposition + Thai digits → ASCII + zero-width removal |
| `thai_stop` | TokenFilter | 112 Thai stop words |
| `thai` | Analyzer | ThaiTokenizer → thai_normalization → lowercase → stop |
Input text
│
▼
ThaiTokenizer (from analysis-core)
│ Script-boundary segmentation
▼
ThaiNormalizationFilter
│ Sara Am decomposition
│ Thai digits → ASCII
│ Zero-width character removal
▼
LowercaseTokenFilter
│ Lowercases any embedded Latin text
▼
ThaiStopFilter
│ Removes 112 common Thai stop words
▼
Output tokens
Input: "ได้ราคา๑๐๐บาท"
Tokens: ["ราคา100", "บาท"]
↑ Thai digits normalized, "ได้" removed as stop word
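The order of the filter stages can be sketched as a chain over tokens. This is only an illustration under stated assumptions: it starts from tokens that are already segmented, the names `normalize_token` and `run_filters` are hypothetical, the normalization step is reduced to digit mapping and zero-width removal for brevity, and the stop set passed in stands in for the crate's 112-word list.

```rust
use std::collections::HashSet;

/// Hypothetical, reduced normalization step: Thai digits to ASCII and
/// removal of U+200B. The real filter also decomposes Sara Am.
fn normalize_token(token: &str) -> String {
    token
        .chars()
        .filter(|&ch| ch != '\u{200B}')
        .map(|ch| match ch {
            '\u{0E50}'..='\u{0E59}' => char::from(b'0' + (ch as u32 - 0x0E50) as u8),
            other => other,
        })
        .collect()
}

/// Hypothetical pipeline: normalization, then lowercasing, then stop-word
/// removal, in the same order as the diagram above.
pub fn run_filters(tokens: &[&str], stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .iter()
        .map(|t| normalize_token(t))
        .map(|t| t.to_lowercase())
        .filter(|t| !t.is_empty() && !stop_words.contains(t.as_str()))
        .collect()
}
```

Running normalization before the stop filter means stop words are matched against their normalized form, so a stop word containing a zero-width character would still be removed.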
Thai script doesn't use spaces between words. The ThaiTokenizer performs script-boundary segmentation only — it splits at Thai/non-Thai transitions but does not do dictionary-based word segmentation within Thai text. For full dictionary-based Thai word segmentation, use the ICU tokenizer.
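To make the limitation concrete, here is a minimal sketch of script-boundary segmentation. It is not the ThaiTokenizer source: the function name `split_script_boundaries` is hypothetical, "Thai" is assumed to mean the Unicode Thai block U+0E00 to U+0E7F, and whitespace is assumed to end a token and be discarded.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Class {
    Thai,
    Other,
    Space,
}

/// Assumption: any code point in the Thai block counts as Thai script.
fn classify(ch: char) -> Class {
    if ch.is_whitespace() {
        Class::Space
    } else if ('\u{0E00}'..='\u{0E7F}').contains(&ch) {
        Class::Thai
    } else {
        Class::Other
    }
}

/// Hypothetical splitter: emits one token per maximal run of same-class
/// characters and drops whitespace runs.
pub fn split_script_boundaries(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start = 0;
    let mut current = Class::Space;
    for (idx, ch) in text.char_indices() {
        let class = classify(ch);
        if class != current {
            // A class change closes the previous run unless it was whitespace.
            if current != Class::Space {
                tokens.push(&text[start..idx]);
            }
            start = idx;
            current = class;
        }
    }
    // Flush the final run.
    if current != Class::Space {
        tokens.push(&text[start..]);
    }
    tokens
}
```

Under these assumptions a run of pure Thai text comes back as a single token however many words it contains, which is why dictionary-based segmentation needs a different tokenizer.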
MIT