Go package for transliterating text from any script to Latin (ASCII) characters. Supports all 184 officially assigned ISO 639-1 language codes.
go get github.com/censync/go-translit
package main

import (
	"fmt"

	"github.com/censync/go-translit"
)

func main() {
	// Chinese (Pinyin)
	result, _ := translit.Translit("中国人民", "zh")
	fmt.Println(result) // zhong guo ren min

	// Korean (Hangul decomposition)
	result, _ = translit.Translit("한국어", "ko")
	fmt.Println(result) // hangukeo

	// Hindi (Devanagari)
	result, _ = translit.Translit("नमस्ते", "hi")
	fmt.Println(result) // namaste

	// Arabic
	result, _ = translit.Translit("مرحبا", "ar")
	fmt.Println(result) // mrhba

	// German (special rules: ae, oe, ue, ss)
	result, _ = translit.Translit("Straße", "de")
	fmt.Println(result) // Strasse

	// French (diacritics removal)
	result, _ = translit.Translit("café résumé naïve", "fr")
	fmt.Println(result) // cafe resume naive

	// Custom word separator
	result, _ = translit.TranslitWithSeparator("Привет мир", "ru", "-")
	fmt.Println(result) // Privet-mir
	result, _ = translit.TranslitWithSeparator("中国人民", "zh", "_")
	fmt.Println(result) // zhong_guo_ren_min

	// Russian
	result, _ = translit.Translit("Привет мир", "ru")
	fmt.Println(result) // Privet mir
}

Translit transliterates src using the rules for the given ISO 639-1 two-letter language code.
Words are separated by spaces. It returns ErrUnsupportedLanguage if the code is
not recognized.
TranslitWithSeparator works like Translit, but joins words with the given separator instead of spaces.
Allowed separators: " ", "-", "_", ".", "+".
Returns ErrInvalidSeparator if the separator is not in the allowed list.
Returns a sorted list of all supported ISO 639-1 codes.
Returns true if the language code is supported.
| Script | Languages | Codes |
|---|---|---|
| Cyrillic | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik, Tatar, Bashkir, Chechen, Chuvash, Ossetian, Abkhaz, Avar, Komi, Church Slavonic, Mongolian | ru, uk, be, bg, sr, mk, kk, ky, tg, tt, ba, ce, cv, os, ab, av, kv, cu, mn |
| Arabic | Arabic, Persian, Urdu, Pashto, Sindhi, Uyghur, Kashmiri | ar, fa, ur, ps, sd, ug, ks |
| Greek | Greek | el |
| Georgian | Georgian | ka |
| Armenian | Armenian | hy |
| Hebrew | Hebrew, Yiddish | he, yi |
| CJK | Chinese (Pinyin), Japanese (Kana/Kanji), Korean (Hangul) | zh, ja, ko |
| Devanagari | Hindi, Marathi, Nepali, Sanskrit, Pali, Bihari | hi, mr, ne, sa, pi, bh |
| Bengali | Bengali, Assamese | bn, as |
| Gurmukhi | Punjabi | pa |
| Gujarati | Gujarati | gu |
| Odia | Odia | or |
| Tamil | Tamil | ta |
| Telugu | Telugu | te |
| Kannada | Kannada | kn |
| Malayalam | Malayalam | ml |
| Sinhala | Sinhala | si |
| Thai | Thai | th |
| Lao | Lao | lo |
| Myanmar | Myanmar | my |
| Khmer | Khmer | km |
| Tibetan | Tibetan, Dzongkha | bo, dz |
| Ethiopic | Amharic, Tigrinya | am, ti |
| Thaana | Dhivehi | dv |
| Canadian Syllabics | Inuktitut | iu |
| Latin | 97 languages with diacritics normalization | en, fr, es, de, pt, ... |
Total: 184 languages across 25+ scripts.
Each Cyrillic language has its own transliteration table reflecting phonetic
differences. For example, Ukrainian и maps to y (not i as in Russian),
and г maps to h (not g).
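One way to picture per-language tables is a map keyed by language code. The sketch below is an assumption about the general shape, not the library's data structures; `cyrillicTables` and `translitCyrillic` are hypothetical names, and the tables hold only the handful of letters needed for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// cyrillicTables holds tiny illustrative excerpts. Real tables would cover
// the full alphabet of each language.
var cyrillicTables = map[string]map[rune]string{
	"ru": {'г': "g", 'и': "i", 'р': "r", 'а': "a"},
	"uk": {'г': "h", 'и': "y", 'р': "r", 'а': "a"},
}

// translitCyrillic maps each rune through the table for lang and passes
// unknown runes through unchanged.
func translitCyrillic(src, lang string) string {
	table := cyrillicTables[lang] // a nil map is safe to read from
	var b strings.Builder
	for _, r := range src {
		if s, ok := table[r]; ok {
			b.WriteString(s)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// "гира" is an arbitrary letter sequence chosen to exercise г and и.
	fmt.Println(translitCyrillic("гира", "ru")) // gira
	fmt.Println(translitCyrillic("гира", "uk")) // hyra
}
```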
- Chinese: Character-to-Pinyin lookup table (~2000 common characters). Characters not in the table pass through unchanged.
- Japanese: Full hiragana and katakana tables. Kanji falls back to the Chinese Pinyin table.
- Korean: Algorithmic Hangul syllable decomposition (U+AC00-U+D7A3) into initial, medial, and final jamo components with Revised Romanization.
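The Hangul decomposition is plain arithmetic on the code point, so it can be sketched without any lookup of whole syllables. The following is an illustrative version, not the package's implementation: `romanizeHangul` is a hypothetical name, the jamo tables follow Revised Romanization as I understand it, final consonant clusters are reduced to a single letter, and sound changes across syllable boundaries are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// Romanized jamo in Unicode order: 19 initials, 21 medials, 28 finals
// (index 0 of finals means "no final consonant").
var (
	initials = []string{"g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
		"ss", "", "j", "jj", "ch", "k", "t", "p", "h"}
	medials = []string{"a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o",
		"wa", "wae", "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"}
	finals = []string{"", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m",
		"l", "l", "l", "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t",
		"k", "t", "p", "t"}
)

// romanizeHangul decomposes each precomposed syllable (U+AC00-U+D7A3) into
// initial, medial, and final indices and concatenates their romanizations.
func romanizeHangul(src string) string {
	var b strings.Builder
	for _, r := range src {
		if r < 0xAC00 || r > 0xD7A3 {
			b.WriteRune(r) // not a Hangul syllable: pass through
			continue
		}
		s := int(r - 0xAC00)
		initial := s / (21 * 28)
		medial := (s % (21 * 28)) / 28
		final := s % 28
		b.WriteString(initials[initial])
		b.WriteString(medials[medial])
		b.WriteString(finals[final])
	}
	return b.String()
}

func main() {
	fmt.Println(romanizeHangul("한국어")) // hangukeo
}
```

For 한 (U+D55C) the offset is 10588, which gives initial 18 (h), medial 0 (a), and final 4 (n).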
For Indic scripts such as Devanagari, transliteration is context-aware, with
virama (halant) handling. Each consonant carries an inherent vowel a, which is
suppressed when the consonant is followed by a virama and replaced when it is
followed by a vowel sign.
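The inherent-vowel rule can be sketched with a one-rune lookahead. This is a simplified illustration under stated assumptions: `translitDevanagari` is a hypothetical name, the tables cover only a few Devanagari letters, and word-final schwa deletion is not handled.

```go
package main

import (
	"fmt"
	"strings"
)

const virama = '\u094D' // Devanagari sign virama (halant)

// Small illustrative excerpts of the consonant and vowel-sign tables.
var consonants = map[rune]string{
	'\u0928': "n", // na
	'\u092E': "m", // ma
	'\u0938': "s", // sa
	'\u0924': "t", // ta
}

var vowelSigns = map[rune]string{
	'\u093F': "i", // vowel sign i
	'\u0941': "u", // vowel sign u
	'\u0947': "e", // vowel sign e
	'\u094B': "o", // vowel sign o
}

// translitDevanagari emits each consonant, then decides what follows it:
// nothing (virama), an explicit vowel (vowel sign), or the inherent "a".
func translitDevanagari(src string) string {
	runes := []rune(src)
	var b strings.Builder
	for i := 0; i < len(runes); i++ {
		base, ok := consonants[runes[i]]
		if !ok {
			b.WriteRune(runes[i]) // unknown rune: pass through
			continue
		}
		b.WriteString(base)
		if i+1 < len(runes) {
			next := runes[i+1]
			if next == virama {
				i++ // consume the virama; inherent vowel suppressed
				continue
			}
			if v, ok := vowelSigns[next]; ok {
				b.WriteString(v) // vowel sign replaces the inherent vowel
				i++
				continue
			}
		}
		b.WriteString("a") // inherent vowel
	}
	return b.String()
}

func main() {
	fmt.Println(translitDevanagari("नमस्ते")) // namaste
}
```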
German uses the expanded forms for umlauts and ß: ä, ö, ü, ß become ae, oe, ue, ss (not a, o, u, s). Luxembourgish uses the same rules.
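The expanded German forms can be sketched with the standard library's `strings.NewReplacer`. This is only an illustration of the rule, not the package's code; `expandGerman` is a hypothetical name, and the uppercase mappings (Ä to Ae and so on) are an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// germanReplacer expands umlauts and ß into two-letter ASCII forms.
// The capitalized variants are an assumption for this sketch.
var germanReplacer = strings.NewReplacer(
	"ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss",
	"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
)

// expandGerman applies the German expansion rules to src.
func expandGerman(src string) string {
	return germanReplacer.Replace(src)
}

func main() {
	fmt.Println(expandGerman("Straße")) // Strasse
	fmt.Println(expandGerman("Müller")) // Mueller
}
```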
Vietnamese has full support for compound diacritics (e.g. ệ -> e),
covering all precomposed vowel forms.
Languages that use Latin script natively get diacritics stripped to ASCII. Pure ASCII input passes through unchanged.
go test -v ./...
The test suite covers all 184 language codes with native-script input samples for non-Latin scripts and diacritics normalization for Latin-script languages.
License: MIT