pizza-rs/analysis-thai

🇹🇭 pizza-analysis-thai

Thai text analysis plugin for INFINI Pizza — wraps the core Thai tokenizer with Thai-specific normalization.

Why a dedicated Thai analyzer?

The core ThaiTokenizer provides basic script-boundary segmentation (splitting at Thai/non-Thai boundaries). This dedicated crate adds Thai-specific normalizations essential for search quality:

| Normalization | Before | After | Why |
| --- | --- | --- | --- |
| Sara Am decomposition | ทำ (ท + ำ) | ท + ◌ํ + า | Match both composed and decomposed forms |
| Thai digit → ASCII | ราคา๑๐๐ | ราคา100 | Match regardless of digit style |
| Zero-width removal | สวัสดี\u200Bครับ | สวัสดีครับ | Strip word-boundary hints |
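
As a rough illustration of the three normalizations in the table, here is a minimal per-token sketch. The function name `normalize_thai` is hypothetical and is not the crate's API. The set of zero-width characters is an assumption (the table only names U+200B), and the sketch does not reorder tone marks around a decomposed Sara Am.

```rust
/// Hypothetical helper for illustration only; not the crate's actual API.
/// Applies the three normalizations from the table to a single token.
pub fn normalize_thai(token: &str) -> String {
    let mut out = String::with_capacity(token.len());
    for c in token.chars() {
        match c {
            // SARA AM (U+0E33) becomes NIKHAHIT (U+0E4D) + SARA AA (U+0E32).
            '\u{0E33}' => {
                out.push('\u{0E4D}');
                out.push('\u{0E32}');
            }
            // Thai digits U+0E50..=U+0E59 map onto ASCII '0'..='9'.
            '\u{0E50}'..='\u{0E59}' => {
                let d = c as u32 - 0x0E50;
                out.push(char::from(b'0' + d as u8));
            }
            // Assumed zero-width set: ZWSP, ZWNJ, ZWJ. These are dropped.
            '\u{200B}' | '\u{200C}' | '\u{200D}' => {}
            // Everything else passes through unchanged.
            _ => out.push(c),
        }
    }
    out
}
```

Under these assumptions, `normalize_thai("ราคา๑๐๐")` yields `"ราคา100"`, so a query typed with either digit style reaches the same indexed term.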

Components

| Name | Type | Description |
| --- | --- | --- |
| thai_normalization | TokenFilter | Sara Am decomposition + Thai digits → ASCII + zero-width removal |
| thai_stop | TokenFilter | Removes 112 Thai stop words |
| thai | Analyzer | ThaiTokenizer → thai_normalization → lowercase → thai_stop |

Pipeline

Input text
    │
    ▼
ThaiTokenizer (from analysis-core)
    │  Script-boundary segmentation
    ▼
ThaiNormalizationFilter
    │  Sara Am decomposition
    │  Thai digits → ASCII
    │  Zero-width character removal
    ▼
LowercaseTokenFilter
    │  Lowercases any embedded Latin text
    ▼
ThaiStopFilter
    │  Removes 112 common Thai stop words
    ▼
Output tokens
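
The last two stages of the pipeline can be sketched as a plain iterator chain over already-normalized tokens. This is a toy model, not the plugin's filter implementation: `lowercase_and_stop` and `sample_stop_words` are hypothetical names, and the three-word stop list stands in for the real 112-word list (only "ได้" is confirmed by the example in this README).

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's 112-word list.
/// Only "ได้" is taken from this README; the rest are illustrative.
pub fn sample_stop_words() -> HashSet<&'static str> {
    ["ได้", "และ", "ที่"].into_iter().collect()
}

/// Toy model of LowercaseTokenFilter followed by ThaiStopFilter.
/// Lowercasing only affects embedded Latin text; Thai has no case.
pub fn lowercase_and_stop(tokens: Vec<String>, stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !stop_words.contains(t.as_str()))
        .collect()
}
```

Running the stop filter last means stop words are matched against their normalized, lowercased form, so the stop list only needs one spelling per word.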

Example

Input:  "ได้ราคา๑๐๐บาท"
Tokens: ["ราคา100", "บาท"]
         ↑ Thai digits normalized, "ได้" removed as stop word

Note on Thai word segmentation

Thai script doesn't use spaces between words. The ThaiTokenizer performs script-boundary segmentation only — it splits at Thai/non-Thai transitions but does not do dictionary-based word segmentation within Thai text. For full dictionary-based Thai word segmentation, use the ICU tokenizer.
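
To make "script-boundary segmentation only" concrete, here is a minimal sketch of that kind of splitter. It is not the ThaiTokenizer source: the function names are hypothetical, and it assumes that any code point in the Thai block (U+0E00 to U+0E7F, which includes Thai digits) counts as Thai and that whitespace separates tokens.

```rust
/// Assumption: "Thai" means any code point in the Thai block U+0E00..=U+0E7F.
fn is_thai(c: char) -> bool {
    ('\u{0E00}'..='\u{0E7F}').contains(&c)
}

/// Hypothetical splitter: cuts at whitespace and at Thai/non-Thai
/// transitions, and never splits inside a run of Thai characters.
pub fn split_script_boundaries(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut current_is_thai = false;
    for c in text.chars() {
        if c.is_whitespace() {
            // Whitespace ends the current token and is discarded.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            continue;
        }
        let thai = is_thai(c);
        // A change of script closes the token that was being built.
        if !current.is_empty() && thai != current_is_thai {
            tokens.push(std::mem::take(&mut current));
        }
        current_is_thai = thai;
        current.push(c);
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With this sketch, mixed text such as "ราคาiPhoneแพง" splits into three tokens, while a purely Thai phrase such as "สวัสดีครับ" stays a single token, which is the limitation dictionary-based segmentation addresses.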

License

MIT
