go-translit

Go package for transliterating text from any script to Latin (ASCII) characters. Supports all 184 officially assigned ISO 639-1 language codes.

Installation

go get github.com/censync/go-translit

Usage

package main

import (
    "fmt"
    "github.com/censync/go-translit"
)

func main() {
    // Chinese (Pinyin)
    result, _ := translit.Translit("中国人民", "zh")
    fmt.Println(result) // zhong guo ren min

    // Korean (Hangul decomposition)
    result, _ = translit.Translit("한국어", "ko")
    fmt.Println(result) // hangukeo

    // Hindi (Devanagari)
    result, _ = translit.Translit("नमस्ते", "hi")
    fmt.Println(result) // namaste

    // Arabic
    result, _ = translit.Translit("مرحبا", "ar")
    fmt.Println(result) // mrhba

    // German (special rules: ae, oe, ue, ss)
    result, _ = translit.Translit("Straße", "de")
    fmt.Println(result) // Strasse

    // French (diacritics removal)
    result, _ = translit.Translit("café résumé naïve", "fr")
    fmt.Println(result) // cafe resume naive

    // Custom word separator
    result, _ = translit.TranslitWithSeparator("Привет мир", "ru", "-")
    fmt.Println(result) // Privet-mir

    result, _ = translit.TranslitWithSeparator("中国人民", "zh", "_")
    fmt.Println(result) // zhong_guo_ren_min

    // Russian
    result, _ = translit.Translit("Привет мир", "ru")
    fmt.Println(result) // Privet mir

}

API

Translit(src, code string) (string, error)

Transliterates src using rules for the given ISO 639-1 two-letter language code. Words are separated by spaces. Returns ErrUnsupportedLanguage if the code is not recognized.

TranslitWithSeparator(src, code, sep string) (string, error)

Same as Translit, but joins words with the given separator instead of spaces. Allowed separators: " ", "-", "_", ".", "+". Returns ErrInvalidSeparator if the separator is not in the allowed list.
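The separator check described above can be sketched without the package. This is a minimal illustration, not the package's implementation: joinWords and allowedSeparators are hypothetical names, and ErrInvalidSeparator here is a local stand-in for the package's error of the same name.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrInvalidSeparator is a local stand-in for the package's error of the same name.
var ErrInvalidSeparator = errors.New("invalid separator")

// allowedSeparators holds the five separators listed in the API description.
var allowedSeparators = map[string]bool{" ": true, "-": true, "_": true, ".": true, "+": true}

// joinWords (hypothetical helper) rejects separators outside the allowed
// list, then joins the already transliterated words with sep.
func joinWords(words []string, sep string) (string, error) {
	if !allowedSeparators[sep] {
		return "", ErrInvalidSeparator
	}
	return strings.Join(words, sep), nil
}

func main() {
	out, err := joinWords([]string{"Privet", "mir"}, "-")
	fmt.Println(out, err) // Privet-mir <nil>

	_, err = joinWords([]string{"Privet", "mir"}, "/")
	fmt.Println(errors.Is(err, ErrInvalidSeparator)) // true
}
```

Callers of the real API would typically compare the returned error against the documented ErrInvalidSeparator or ErrUnsupportedLanguage values in the same way.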

SupportedLanguages() []string

Returns a sorted list of all supported ISO 639-1 codes.

IsSupported(code string) bool

Returns true if the language code is supported.

Supported Scripts

Script | Languages | Codes
Cyrillic | Russian, Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik, Tatar, Bashkir, Chechen, Chuvash, Ossetian, Abkhaz, Avar, Komi, Church Slavonic, Mongolian | ru, uk, be, bg, sr, mk, kk, ky, tg, tt, ba, ce, cv, os, ab, av, kv, cu, mn
Arabic | Arabic, Persian, Urdu, Pashto, Sindhi, Uyghur, Kashmiri | ar, fa, ur, ps, sd, ug, ks
Greek | Greek | el
Georgian | Georgian | ka
Armenian | Armenian | hy
Hebrew | Hebrew, Yiddish | he, yi
CJK | Chinese (Pinyin), Japanese (Kana/Kanji), Korean (Hangul) | zh, ja, ko
Devanagari | Hindi, Marathi, Nepali, Sanskrit, Pali, Bihari | hi, mr, ne, sa, pi, bh
Bengali | Bengali, Assamese | bn, as
Gurmukhi | Punjabi | pa
Gujarati | Gujarati | gu
Odia | Odia | or
Tamil | Tamil | ta
Telugu | Telugu | te
Kannada | Kannada | kn
Malayalam | Malayalam | ml
Sinhala | Sinhala | si
Thai | Thai | th
Lao | Lao | lo
Myanmar | Myanmar | my
Khmer | Khmer | km
Tibetan | Tibetan, Dzongkha | bo, dz
Ethiopic | Amharic, Tigrinya | am, ti
Thaana | Dhivehi | dv
Canadian Syllabics | Inuktitut | iu
Latin | 97 languages with diacritics normalization | en, fr, es, de, pt, ...

Total: 184 languages across 25+ scripts.

Script-Specific Notes

Cyrillic

Each Cyrillic language has its own transliteration table reflecting phonetic differences. For example, Ukrainian и maps to y (not i as in Russian), and г maps to h (not g).
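A rough sketch of how per-language tables produce different output for the same input. The tables below are deliberately tiny and hypothetical (three letters each); they only encode the Russian/Ukrainian differences mentioned above and are not the package's real data.

```go
package main

import (
	"fmt"
	"strings"
)

// tables maps a language code to its own rune-to-Latin table.
// Illustrative subset only; the real tables cover the full alphabets.
var tables = map[string]map[rune]string{
	"ru": {'м': "m", 'и': "i", 'г': "g"},
	"uk": {'м': "m", 'и': "y", 'г': "h"},
}

// translitCyrillic (hypothetical helper) looks each rune up in the table
// for the given code and passes unmapped runes through unchanged.
func translitCyrillic(src, code string) string {
	table := tables[code]
	var b strings.Builder
	for _, r := range src {
		if latin, ok := table[r]; ok {
			b.WriteString(latin)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(translitCyrillic("миг", "ru")) // mig
	fmt.Println(translitCyrillic("миг", "uk")) // myh
}
```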

CJK

  • Chinese: Character-to-Pinyin lookup table (~2000 common characters). Characters not in the table pass through unchanged.
  • Japanese: Full hiragana and katakana tables. Kanji falls back to the Chinese Pinyin table.
  • Korean: Algorithmic Hangul syllable decomposition (U+AC00-U+D7A3) into initial, medial, and final jamo components with Revised Romanization.
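The Hangul decomposition can be sketched with the standard Unicode arithmetic: each precomposed syllable encodes 19 initials x 21 medials x 28 finals starting at U+AC00. The romanization tables below are a simplified assumption (syllable-final values, no sound changes across syllable boundaries), and romanizeHangul is a hypothetical name, not the package's function.

```go
package main

import (
	"fmt"
	"strings"
)

// Romanizations indexed by jamo position within a syllable (simplified).
var (
	initials = []string{"g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
		"ss", "", "j", "jj", "ch", "k", "t", "p", "h"}
	medials = []string{"a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa",
		"wae", "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"}
	finals = []string{"", "k", "k", "k", "n", "n", "n", "t", "l", "k",
		"m", "l", "l", "l", "p", "l", "m", "p", "p", "t",
		"t", "ng", "t", "t", "k", "t", "p", "t"}
)

// romanizeHangul decomposes each syllable in U+AC00..U+D7A3 into its
// initial, medial, and final indexes; other runes pass through unchanged.
func romanizeHangul(src string) string {
	var b strings.Builder
	for _, r := range src {
		if r < 0xAC00 || r > 0xD7A3 {
			b.WriteRune(r)
			continue
		}
		idx := int(r - 0xAC00)
		b.WriteString(initials[idx/588])     // 588 = 21 medials * 28 finals
		b.WriteString(medials[(idx%588)/28]) // 28 finals per medial
		b.WriteString(finals[idx%28])        // index 0 means no final
	}
	return b.String()
}

func main() {
	fmt.Println(romanizeHangul("한국어")) // hangukeo
}
```

Because each syllable is romanized independently, this sketch yields hangukeo, matching the output shown in the Usage section.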

Indic Scripts

Context-aware transliteration with virama (halant) handling. Consonants include an inherent vowel a which is suppressed when followed by a virama or replaced when followed by a vowel sign.
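The inherent-vowel rule can be sketched with a one-rune lookahead. This is a minimal illustration over a handful of Devanagari letters; the tables and the translitDevanagari name are assumptions for the example, and it does not attempt word-final vowel dropping or other language-specific rules.

```go
package main

import (
	"fmt"
	"strings"
)

const virama = '\u094D' // suppresses the inherent vowel of the preceding consonant

// Tiny illustrative tables; a real table covers the whole script.
var consonants = map[rune]string{'न': "n", 'म': "m", 'स': "s", 'त': "t"}

var vowelSigns = map[rune]string{
	'\u093E': "aa", // vowel sign AA
	'\u093F': "i",  // vowel sign I
	'\u0947': "e",  // vowel sign E
	'\u094B': "o",  // vowel sign O
}

// translitDevanagari emits each consonant, then decides what follows it:
// nothing (virama), the vowel sign's value, or the inherent vowel "a".
func translitDevanagari(src string) string {
	runes := []rune(src)
	var b strings.Builder
	for i := 0; i < len(runes); i++ {
		base, ok := consonants[runes[i]]
		if !ok {
			b.WriteRune(runes[i]) // not in the table: pass through
			continue
		}
		b.WriteString(base)
		if i+1 < len(runes) {
			next := runes[i+1]
			if next == virama {
				i++ // consume the virama, emit no vowel
				continue
			}
			if v, ok := vowelSigns[next]; ok {
				b.WriteString(v)
				i++ // consume the vowel sign
				continue
			}
		}
		b.WriteString("a") // inherent vowel
	}
	return b.String()
}

func main() {
	fmt.Println(translitDevanagari("नमस्ते")) // namaste
}
```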

German

Umlauts and ß use the expanded forms: ä → ae, ö → oe, ü → ue, ß → ss (not a, o, u, s). Luxembourgish uses the same rules.
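The expanded forms (ä to ae, ö to oe, ü to ue, ß to ss) are easy to sketch with the standard library's strings.NewReplacer. The capitalized variants (Ä to Ae and so on) and the expandGerman name are assumptions for this example, and the input is assumed to use precomposed characters.

```go
package main

import (
	"fmt"
	"strings"
)

// germanReplacer expands umlauts and ß instead of stripping the diacritic.
// The uppercase pairs are an assumption made for this sketch.
var germanReplacer = strings.NewReplacer(
	"ä", "ae", "ö", "oe", "ü", "ue", "ß", "ss",
	"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
)

// expandGerman (hypothetical helper) applies the expansions in one pass.
func expandGerman(src string) string {
	return germanReplacer.Replace(src)
}

func main() {
	fmt.Println(expandGerman("Straße"))     // Strasse
	fmt.Println(expandGerman("Müller"))     // Mueller
	fmt.Println(expandGerman("Österreich")) // Oesterreich
}
```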

Vietnamese

Full support for compound diacritics (e.g. ệ → e), covering all precomposed vowel forms.

Latin Script

Languages that use Latin script natively get diacritics stripped to ASCII. Pure ASCII input passes through unchanged.

Testing

go test -v ./...

The test suite covers all 184 language codes with native-script input samples for non-Latin scripts and diacritics normalization for Latin-script languages.

License

MIT
