🇹🇭 pizza-analysis-thai

Thai text analysis plugin for INFINI Pizza — wraps the core Thai tokenizer with Thai-specific normalization.

Why a dedicated Thai analyzer?

The core ThaiTokenizer provides basic script-boundary segmentation (splitting at Thai/non-Thai boundaries). This dedicated crate adds Thai-specific normalizations essential for search quality:

| Normalization | Before | After | Why |
| --- | --- | --- | --- |
| Sara Am decomposition | ทำ (ท + ำ) | ท + ◌ํ + า | Match both composed and decomposed forms |
| Thai digit → ASCII | ราคา๑๐๐ | ราคา100 | Match regardless of digit style |
| Zero-width removal | สวัสดี\u200Bครับ | สวัสดีครับ | Strip word-boundary hints |
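
The three normalizations above can be sketched as a single character-level pass. This is a minimal illustration, not the crate's actual implementation: the function name `normalize_thai` is hypothetical, and the set of zero-width characters beyond U+200B is an assumption.

```rust
/// Hypothetical helper (not the crate's API): applies the three
/// normalizations from the table to a single token.
fn normalize_thai(token: &str) -> String {
    let mut out = String::with_capacity(token.len());
    for c in token.chars() {
        match c {
            // Sara Am (U+0E33) -> Nikhahit (U+0E4D) + Sara Aa (U+0E32)
            '\u{0E33}' => out.push_str("\u{0E4D}\u{0E32}"),
            // Thai digits U+0E50..=U+0E59 -> ASCII '0'..='9'
            '\u{0E50}'..='\u{0E59}' => {
                let digit = c as u32 - 0x0E50;
                out.push(char::from_digit(digit, 10).expect("digit is in 0..=9"));
            }
            // Zero-width characters are dropped. Assumed set: ZWSP, ZWNJ,
            // ZWJ, and ZWNBSP/BOM; the README only names U+200B.
            '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}' => {}
            // Everything else passes through unchanged.
            _ => out.push(c),
        }
    }
    out
}
```

With this sketch, `normalize_thai("ราคา๑๐๐")` yields `"ราคา100"`, and a token containing U+200B comes back without it.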

Components

| Name | Type | Description |
| --- | --- | --- |
| `thai_normalization` | TokenFilter | Sara Am decomposition + Thai digits → ASCII + zero-width removal |
| `thai_stop` | TokenFilter | 112 Thai stop words |
| `thai` | Analyzer | ThaiTokenizer → thai_normalization → lowercase → stop |

Pipeline

Input text
    │
    ▼
ThaiTokenizer (from analysis-core)
    │  Script-boundary segmentation
    ▼
ThaiNormalizationFilter
    │  Sara Am decomposition
    │  Thai digits → ASCII
    │  Zero-width character removal
    ▼
LowercaseTokenFilter
    │  Lowercases any embedded Latin text
    ▼
ThaiStopFilter
    │  Removes 112 common Thai stop words
    ▼
Output tokens
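
To make the stage ordering concrete, here is a rough sketch of a filter chain in which each stage either rewrites a token or drops it. Everything in it is hypothetical: the `Stage` type, the function names, and the simplified digits-only normalization are stand-ins, not the plugin's API. The input is assumed to be already split into tokens.

```rust
use std::collections::HashSet;

// Hypothetical stage type (not the crate's API): a stage rewrites a token,
// or returns None to drop it from the stream.
type Stage = Box<dyn Fn(String) -> Option<String>>;

// Simplified stand-in for thai_normalization: Thai digits -> ASCII only.
fn digits_stage() -> Stage {
    Box::new(|tok: String| {
        let mapped: String = tok
            .chars()
            .map(|c| match c {
                '\u{0E50}'..='\u{0E59}' => (b'0' + (c as u32 - 0x0E50) as u8) as char,
                other => other,
            })
            .collect();
        Some(mapped)
    })
}

// Lowercases embedded Latin text; Thai characters have no case and are unchanged.
fn lowercase_stage() -> Stage {
    Box::new(|tok: String| Some(tok.to_lowercase()))
}

// Drops any token that appears in the given stop-word list.
fn stop_stage(stop_words: &[&str]) -> Stage {
    let set: HashSet<String> = stop_words.iter().map(|w| w.to_string()).collect();
    Box::new(move |tok: String| if set.contains(&tok) { None } else { Some(tok) })
}

// Runs every token through the stages in order; a token is kept only if
// no stage dropped it.
fn run_pipeline(tokens: Vec<String>, stages: &[Stage]) -> Vec<String> {
    tokens
        .into_iter()
        .filter_map(|tok| stages.iter().try_fold(tok, |t, stage| stage(t)))
        .collect()
}
```

Running the stop stage last means stop words are compared against normalized, lowercased tokens, so the stop list only needs one spelling per word.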

Example

Input:  "ได้ราคา๑๐๐บาท"
Tokens: ["ราคา100", "บาท"]
         ↑ Thai digits normalized, "ได้" removed as stop word

Note on Thai word segmentation

Thai script doesn't use spaces between words. The ThaiTokenizer performs script-boundary segmentation only — it splits at Thai/non-Thai transitions but does not do dictionary-based word segmentation within Thai text. For full dictionary-based Thai word segmentation, use the ICU tokenizer.
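
As an illustration of what script-boundary segmentation means, the sketch below splits text wherever it switches between Thai and non-Thai characters. It is not the ThaiTokenizer source: the function names are hypothetical, "Thai" is assumed to mean the Unicode Thai block (U+0E00 to U+0E7F), and whitespace is assumed to act as a separator.

```rust
// Assumption: a character is "Thai" if it lies in the Unicode Thai block.
fn is_thai(c: char) -> bool {
    ('\u{0E00}'..='\u{0E7F}').contains(&c)
}

// Hypothetical sketch: start a new token whenever the text switches between
// Thai and non-Thai characters; whitespace ends the current token.
fn split_on_script_boundaries(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut current_is_thai = false;
    for c in text.chars() {
        if c.is_whitespace() {
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            continue;
        }
        let thai = is_thai(c);
        // A script change closes the token built so far.
        if !current.is_empty() && thai != current_is_thai {
            tokens.push(std::mem::take(&mut current));
        }
        current_is_thai = thai;
        current.push(c);
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Under these assumptions, `"ซื้อiPhoneใหม่"` splits into `["ซื้อ", "iPhone", "ใหม่"]`, while a run of pure Thai text such as `"กินข้าว"` stays a single token because no dictionary lookup happens inside it.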

License

MIT
