Detects the language of incoming text at analysis time using the
`whatlang` crate, then delegates to the
matching language-specific analyzer already registered in the `AnalysisFactory`.
When detection confidence is below the threshold (0.3 by default), it falls back to the `standard` analyzer.

| Type | Name | Description |
|---|---|---|
| Analyzer | auto | Detects language → delegates to the matching analyzer |
| TokenFilter | language_detect | Passthrough filter (placeholder for custom pipelines) |

The auto analyzer maps 50+ detected languages to Pizza analyzers:

| Language | Code | Analyzer | Language | Code | Analyzer |
|---|---|---|---|---|---|
| English | eng | english | Norwegian | nor | norwegian |
| French | fra | french | Swedish | swe | swedish |
| German | deu | german | Finnish | fin | finnish |
| Spanish | spa | spanish | Hungarian | hun | hungarian |
| Italian | ita | italian | Romanian | ron | romanian |
| Portuguese | por | portuguese | Catalan | cat | catalan |
| Dutch | nld | dutch | Polish | pol | polish |
| Danish | dan | danish | Czech | ces | czech |
| Slovak | slk | slovak | Slovenian | slv | slovenian |
| Croatian | hrv | croatian | Serbian | srp | serbian |
| Bulgarian | bul | bulgarian | Lithuanian | lit | lithuanian |
| Latvian | lav | latvian | Estonian | est | estonian |

| Language | Code | Analyzer |
|---|---|---|
| Russian | rus | russian |
| Ukrainian | ukr | ukrainian |
| Greek | ell | greek |

| Language | Code | Analyzer |
|---|---|---|
| Turkish | tur | turkish |
| Azerbaijani | aze | azerbaijani |

| Language | Code | Analyzer | Notes |
|---|---|---|---|
| Hindi | hin | hindi | |
| Bengali | ben | bengali | |
| Tamil | tam | tamil | Indic norm + Tamil stem |
| Telugu | tel | telugu | Indic norm + Telugu stem |
| Kannada | kan | kannada | Indic norm + Kannada stem |
| Malayalam | mal | malayalam | Indic norm + chillu normalization |
| Marathi | mar | hindi | Devanagari script, close to Hindi |
| Nepali | nep | hindi | Devanagari script, close to Hindi |

| Language | Code | Analyzer |
|---|---|---|
| Indonesian | ind | indonesian |
| Malay | msa | indonesian |
| Vietnamese | vie | vietnamese |
| Thai | tha | thai |

| Language | Code | Analyzer |
|---|---|---|
| Chinese | cmn | ik |
| Japanese | jpn | kuromoji |
| Korean | kor | nori |

| Language | Code | Analyzer |
|---|---|---|
| Arabic | ara | arabic |
| Persian | fas | persian |

Languages not listed above (Gujarati, Punjabi, Hebrew, Khmer, etc.) fall back to
`standard`. You can override any mapping; see below.
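
The lookup described by these tables can be pictured as a simple match on the 3-letter code. The following is only a sketch: `default_analyzer` is a hypothetical helper written for illustration, not a function exported by the crate, and it covers just a few rows of the tables.

```rust
/// Hypothetical helper (not part of the crate): maps a 3-letter language
/// code to the name of the analyzer that would handle it by default.
/// Only a handful of rows from the tables are reproduced here.
pub fn default_analyzer(code: &str) -> &'static str {
    match code {
        "eng" => "english",
        "fra" => "french",
        "deu" => "german",
        "cmn" => "ik",
        "jpn" => "kuromoji",
        "kor" => "nori",
        // Marathi and Nepali share the Devanagari script with Hindi,
        // so all three map to the same analyzer.
        "hin" | "mar" | "nep" => "hindi",
        // Malay reuses the Indonesian analyzer.
        "ind" | "msa" => "indonesian",
        // Anything without a mapping (e.g. Gujarati, Hebrew) falls back.
        _ => "standard",
    }
}
```

Several codes can share one analyzer, and the catch-all arm is what produces the `standard` fallback for unlisted languages.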
Override the default analyzer for any detected language using its 3-letter code:
```rust
// Use jieba for Chinese instead of the default ik
auto_tokenizer.set_override("cmn", "jieba");
// Use cjk bigram for Japanese instead of kuromoji
auto_tokenizer.set_override("jpn", "cjk");
```

The default threshold is 0.3. Adjust it to be stricter or more permissive:
```rust
auto_tokenizer.set_confidence_threshold(0.5); // stricter: require higher confidence
auto_tokenizer.set_confidence_threshold(0.1); // looser: accept weaker detections
```

- Detection: `whatlang` analyzes the input text and returns a language and a confidence score
- Threshold: if confidence ≥ threshold (default 0.3, configurable), the detected language is used; otherwise the analyzer falls back to `standard`
- Override check: if the user configured an override for the detected language, that analyzer is used instead
- Delegation: the `AutoTokenizer` runs the full analysis pipeline (normalizers → tokenizer → token filters) of the matched language analyzer
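
The steps above can be sketched as a single resolution function. This is an illustration under stated assumptions, not the crate's actual code: `Detection` and `resolve_analyzer` are hypothetical names, the detector result is modeled as a plain struct rather than a `whatlang` type, and the final delegation step is reduced to returning the chosen analyzer's name.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a detector result:
/// a 3-letter language code plus a confidence score in [0, 1].
pub struct Detection<'a> {
    pub code: &'a str,
    pub confidence: f64,
}

/// Hypothetical sketch of the resolution order described above.
/// Returns the name of the analyzer that should process the text.
pub fn resolve_analyzer<'a>(
    detection: Option<&Detection<'_>>,
    threshold: f64,
    overrides: &'a HashMap<String, String>,
    defaults: &'a HashMap<String, String>,
) -> &'a str {
    const FALLBACK: &str = "standard";

    // Steps 1-2: no detection, or confidence below the threshold,
    // means the fallback analyzer is used.
    let d = match detection {
        Some(d) if d.confidence >= threshold => d,
        _ => return FALLBACK,
    };

    // Step 3: a user override wins over the default mapping.
    // Step 4: the caller would then run the chosen analyzer's pipeline.
    overrides
        .get(d.code)
        .or_else(|| defaults.get(d.code))
        .map(String::as_str)
        .unwrap_or(FALLBACK)
}
```

With a default of `cmn → ik` and an override of `cmn → jieba`, a confident Chinese detection resolves to `jieba`, while any detection scoring below the threshold resolves to `standard` regardless of overrides.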

```rust
use pizza_engine::analysis::AnalysisFactory;

let mut factory = AnalysisFactory::new();

// Register language analyzers first
pizza_analysis_english::register_all(&mut factory);
pizza_analysis_french::register_all(&mut factory);
pizza_analysis_ik::register_all(&mut factory);
pizza_analysis_kuromoji::register_all(&mut factory);

// Register auto last so that it captures all analyzers above
pizza_analysis_auto::register_all(&mut factory);

let analyzer = factory.get_analyzer("auto").unwrap();
// "Bonjour le monde" → detected: French → uses "french" analyzer
// "こんにちは世界" → detected: Japanese → uses "kuromoji" analyzer
// "Hello world" → detected: English → uses "english" analyzer
// "北京欢迎你" → detected: Chinese → uses "ik" analyzer
```

Add the crate to `Cargo.toml`:

```toml
[dependencies]
pizza-analysis-auto = "0.1"
```

Or via `pizza-analysis-all`:

```toml
[dependencies]
pizza-analysis-all = { version = "0.1", features = ["auto"] }
```

License: MIT