go-translit

Go package for transliterating text from any script to Latin (ASCII) characters. Supports all 184 officially assigned ISO 639-1 language codes.

Installation

go get github.com/censync/go-translit

Usage

package main

import (
    "fmt"
    "github.com/censync/go-translit"
)

func main() {
    // Chinese (Pinyin)
    result, _ := translit.Translit("中国人民", "zh")
    fmt.Println(result) // zhong guo ren min

    // Korean (Hangul decomposition)
    result, _ = translit.Translit("한국어", "ko")
    fmt.Println(result) // hangukeo

    // Hindi (Devanagari)
    result, _ = translit.Translit("नमस्ते", "hi")
    fmt.Println(result) // namaste

    // Arabic
    result, _ = translit.Translit("مرحبا", "ar")
    fmt.Println(result) // mrhba

    // German (special rules: ae, oe, ue, ss)
    result, _ = translit.Translit("Straße", "de")
    fmt.Println(result) // Strasse

    // French (diacritics removal)
    result, _ = translit.Translit("café résumé naïve", "fr")
    fmt.Println(result) // cafe resume naive

    // Custom word separator
    result, _ = translit.TranslitWithSeparator("Привет мир", "ru", "-")
    fmt.Println(result) // Privet-mir

    result, _ = translit.TranslitWithSeparator("中国人民", "zh", "_")
    fmt.Println(result) // zhong_guo_ren_min

    // Russian
    result, _ = translit.Translit("Привет мир", "ru")
    fmt.Println(result) // Privet mir
}

API

Translit(src, code string) (string, error)

Transliterates src using rules for the given ISO 639-1 two-letter language code. Words are separated by spaces. Returns ErrUnsupportedLanguage if the code is not recognized.

TranslitWithSeparator(src, code, sep string) (string, error)

Same as Translit, but joins words with the given separator instead of spaces. Allowed separators: " ", "-", "_", ".", "+". Returns ErrInvalidSeparator if the separator is not in the allowed list.

SupportedLanguages() []string

Returns a sorted list of all supported ISO 639-1 codes.

IsSupported(code string) bool

Returns true if the language code is supported.
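
One plausible way to back these two functions is a single registry map, sketched below. The registry variable and the lowercase function names are assumptions made for illustration, and the map holds only four codes rather than the full set.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for the package's internal language
// table, reduced to a few codes for illustration.
var registry = map[string]bool{"ru": true, "de": true, "zh": true, "ko": true}

// supportedLanguages returns the registered codes in sorted order.
// Map iteration order is random in Go, so the explicit sort is required.
func supportedLanguages() []string {
	codes := make([]string, 0, len(registry))
	for code := range registry {
		codes = append(codes, code)
	}
	sort.Strings(codes)
	return codes
}

// isSupported reports whether code is present in the registry.
func isSupported(code string) bool {
	return registry[code]
}

func main() {
	fmt.Println(supportedLanguages()) // [de ko ru zh]
	fmt.Println(isSupported("xx"))    // false
}
```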

Supported Scripts

Script	Languages	Codes
Cyrillic	Russian, Ukrainian, Belarusian, Bulgarian, Serbian, Macedonian, Kazakh, Kyrgyz, Tajik, Tatar, Bashkir, Chechen, Chuvash, Ossetian, Abkhaz, Avar, Komi, Church Slavonic, Mongolian	ru, uk, be, bg, sr, mk, kk, ky, tg, tt, ba, ce, cv, os, ab, av, kv, cu, mn
Arabic	Arabic, Persian, Urdu, Pashto, Sindhi, Uyghur, Kashmiri	ar, fa, ur, ps, sd, ug, ks
Greek	Greek	el
Georgian	Georgian	ka
Armenian	Armenian	hy
Hebrew	Hebrew, Yiddish	he, yi
CJK	Chinese (Pinyin), Japanese (Kana/Kanji), Korean (Hangul)	zh, ja, ko
Devanagari	Hindi, Marathi, Nepali, Sanskrit, Pali, Bihari	hi, mr, ne, sa, pi, bh
Bengali	Bengali, Assamese	bn, as
Gurmukhi	Punjabi	pa
Gujarati	Gujarati	gu
Odia	Odia	or
Tamil	Tamil	ta
Telugu	Telugu	te
Kannada	Kannada	kn
Malayalam	Malayalam	ml
Sinhala	Sinhala	si
Thai	Thai	th
Lao	Lao	lo
Myanmar	Myanmar	my
Khmer	Khmer	km
Tibetan	Tibetan, Dzongkha	bo, dz
Ethiopic	Amharic, Tigrinya	am, ti
Thaana	Dhivehi	dv
Canadian Syllabics	Inuktitut	iu
Latin	97 languages with diacritics normalization	en, fr, es, de, pt, ...

Total: 184 languages across 25+ scripts.

Script-Specific Notes

Cyrillic

Each Cyrillic language has its own transliteration table reflecting phonetic differences. For example, Ukrainian и maps to y (not i as in Russian), and г maps to h (not g).
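
The difference can be illustrated with two tiny per-language tables. This is a sketch under stated assumptions: the tables map and translitCyrillic are hypothetical names, and each table contains only three letters, just enough to show the Russian and Ukrainian mappings mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// tables holds tiny excerpts of hypothetical per-language tables.
var tables = map[string]map[rune]string{
	"ru": {'г': "g", 'и': "i", 'р': "r"},
	"uk": {'г': "h", 'и': "y", 'р': "r"},
}

// translitCyrillic maps each rune through the table selected by code.
// Runes missing from the table are copied through unchanged.
func translitCyrillic(src, code string) string {
	table := tables[code]
	var b strings.Builder
	for _, r := range src {
		if lat, ok := table[r]; ok {
			b.WriteString(lat)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(translitCyrillic("гри", "ru")) // gri
	fmt.Println(translitCyrillic("гри", "uk")) // hry
}
```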

CJK

Chinese: Character-to-Pinyin lookup table (~2000 common characters). Characters not in the table pass through unchanged.
Japanese: Full hiragana and katakana tables. Kanji falls back to the Chinese Pinyin table.
Korean: Algorithmic Hangul syllable decomposition (U+AC00-U+D7A3) into initial, medial, and final jamo components with Revised Romanization.
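
The Hangul decomposition can be sketched with the standard Unicode arithmetic: each precomposed syllable encodes 19 initials, 21 medials, and 28 finals (including "no final"). The code below is an illustration, not the package's implementation. The function name and the romanization tables are assumptions, and the finals table is simplified: compound finals are reduced to one representative sound and no cross-syllable sound changes are applied.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	hangulBase  = 0xAC00 // first precomposed syllable
	hangulLast  = 0xD7A3 // last precomposed syllable
	medialCount = 21
	finalCount  = 28
)

// Romanizations indexed by jamo position in the Unicode syllable layout.
var initials = []string{"g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
	"ss", "", "j", "jj", "ch", "k", "t", "p", "h"}
var medials = []string{"a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o",
	"wa", "wae", "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"}
var finals = []string{"", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m",
	"l", "l", "l", "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k",
	"t", "p", "t"}

// romanizeHangul splits each syllable into initial, medial, and final
// indices and concatenates their romanizations. Other runes pass through.
func romanizeHangul(src string) string {
	var b strings.Builder
	for _, r := range src {
		if r < hangulBase || r > hangulLast {
			b.WriteRune(r)
			continue
		}
		idx := int(r - hangulBase)
		b.WriteString(initials[idx/(medialCount*finalCount)])
		b.WriteString(medials[(idx/finalCount)%medialCount])
		b.WriteString(finals[idx%finalCount])
	}
	return b.String()
}

func main() {
	fmt.Println(romanizeHangul("한국어")) // hangukeo
}
```

Because each syllable is romanized in isolation, the final consonant of 국 comes out as k, which matches the hangukeo output shown in Usage.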

Indic Scripts

Context-aware transliteration with virama (halant) handling. Consonants include an inherent vowel a which is suppressed when followed by a virama or replaced when followed by a vowel sign.
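
The inherent-vowel rule can be sketched with a one-rune lookahead. The example below is a minimal illustration for Devanagari only. The function name and the tables are hypothetical and cover just the characters needed for the sample word. It does not attempt word-final schwa deletion or independent vowels.

```go
package main

import (
	"fmt"
	"strings"
)

const virama = '\u094D' // Devanagari sign virama (halant)

// consonants maps a few Devanagari consonants to their base Latin form.
var consonants = map[rune]string{
	'\u0928': "n", // NA
	'\u092E': "m", // MA
	'\u0938': "s", // SA
	'\u0924': "t", // TA
}

// vowelSigns maps dependent vowel signs to the vowel they produce.
var vowelSigns = map[rune]string{
	'\u0947': "e", // vowel sign E
	'\u093F': "i", // vowel sign I
}

// translitDevanagari emits each consonant, then looks one rune ahead:
// a virama suppresses the inherent "a", a vowel sign replaces it,
// and anything else leaves the inherent "a" in place.
func translitDevanagari(src string) string {
	runes := []rune(src)
	var b strings.Builder
	for i := 0; i < len(runes); i++ {
		base, ok := consonants[runes[i]]
		if !ok {
			b.WriteRune(runes[i])
			continue
		}
		b.WriteString(base)
		if i+1 < len(runes) {
			next := runes[i+1]
			if next == virama {
				i++ // consume the virama, emit no vowel
				continue
			}
			if v, found := vowelSigns[next]; found {
				b.WriteString(v)
				i++ // consume the vowel sign
				continue
			}
		}
		b.WriteString("a") // inherent vowel
	}
	return b.String()
}

func main() {
	// NA MA SA VIRAMA TA VOWEL-SIGN-E, the word from the Usage section.
	fmt.Println(translitDevanagari("\u0928\u092E\u0938\u094D\u0924\u0947")) // namaste
}
```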

German

Uses the expanded forms for umlauts and Eszett: ä, ö, ü, and ß become ae, oe, ue, and ss (not a, o, u, s). Luxembourgish uses the same rules.
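
A rule set like this maps naturally onto strings.NewReplacer from the standard library. The sketch below is an assumption about how such rules could be written, not the package's own code. The expandGerman name and the title-case handling of capital umlauts (Ä to Ae) are illustrative choices, and the input is assumed to use precomposed characters.

```go
package main

import (
	"fmt"
	"strings"
)

// germanExpander rewrites umlauts and Eszett to their expanded ASCII forms.
var germanExpander = strings.NewReplacer(
	"ä", "ae", "ö", "oe", "ü", "ue",
	"Ä", "Ae", "Ö", "Oe", "Ü", "Ue",
	"ß", "ss",
)

// expandGerman applies the expansion rules; all other text is unchanged.
func expandGerman(src string) string {
	return germanExpander.Replace(src)
}

func main() {
	fmt.Println(expandGerman("Stra\u00DFe")) // Strasse
	fmt.Println(expandGerman("M\u00FCller")) // Mueller
}
```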

Vietnamese

Full support for compound diacritics (e.g. Vietnamese ệ -> e), covering all precomposed vowel forms.

Latin Script

Languages that use Latin script natively get diacritics stripped to ASCII. Pure ASCII input passes through unchanged.
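
Diacritics stripping of this kind can be sketched as a lookup over precomposed characters with an ASCII fast path. The folds table and foldLatin function below are hypothetical and cover only a handful of characters, including the Vietnamese ệ (U+1EC7) from the previous section. A real table would be far larger.

```go
package main

import (
	"fmt"
	"strings"
)

// folds maps a few precomposed accented letters to ASCII.
var folds = map[rune]string{
	'\u00E9': "e", // e with acute
	'\u00E8': "e", // e with grave
	'\u00EA': "e", // e with circumflex
	'\u00EF': "i", // i with diaeresis
	'\u00E7': "c", // c with cedilla
	'\u00F1': "n", // n with tilde
	'\u1EC7': "e", // e with circumflex and dot below (Vietnamese)
}

// foldLatin copies ASCII runes as-is, replaces known accented letters,
// and passes unknown runes through unchanged.
func foldLatin(src string) string {
	var b strings.Builder
	for _, r := range src {
		if r < 0x80 {
			b.WriteRune(r)
			continue
		}
		if ascii, ok := folds[r]; ok {
			b.WriteString(ascii)
			continue
		}
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Println(foldLatin("caf\u00E9 na\u00EFve")) // cafe naive
	fmt.Println(foldLatin("Vi\u1EC7t"))            // Viet
	fmt.Println(foldLatin("plain ASCII"))          // plain ASCII
}
```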

Testing

go test -v ./...

The test suite covers all 184 language codes with native-script input samples for non-Latin scripts and diacritics normalization for Latin-script languages.

License

MIT
