Translation Toolkit

Translate and maintain software localization files with one CLI and one GUI.

This repository is for software-localization work, not generic document translation. It gives you a shared backend for five jobs:

translate unfinished localization files
revise existing translations with a precise instruction
check translated PO or TS files for QA issues
extract glossary terms with a model
discover glossary candidates locally without any API call

Supported formats:

.po
.ts
.resx
.strings
.txt
Android <resources> XML

Supported providers:

Gemini
OpenAI
Anthropic

What This Project Is

The toolkit is built around three ideas:

One task-oriented CLI: translate, revise, check, extract-terms, extract-terms-local
Shared language resources: vocabulary and rules are loaded by target language
One format backend: supported file types are normalized into a shared entry model and then written back in their native format

That matters because the same vocabulary, rules, runtime controls, batching, and format handling are reused across CLI and GUI instead of being reimplemented per script.

The preferred entry point is:

python translate_cli.py

translate_cli.py is the main CLI surface, and process_gui.py is the GUI entry point.

Task Docs

Detailed task guides live in docs/:

docs/translate.md
docs/check.md
docs/extract.md
docs/extract-local.md
docs/revise.md
docs/extraction-refactor.md

Quick Start

Install dependencies:

pip install -r requirements.txt

Set the API key for the provider you want to use (PowerShell syntax shown):

$env:GOOGLE_API_KEY = "your_google_api_key"
$env:OPENAI_API_KEY = "your_openai_api_key"
$env:ANTHROPIC_API_KEY = "your_anthropic_api_key"

Notes:

You only need the key for the provider you are actually using.
Gemini can run against AI Studio or Vertex API-key mode.
Vertex API-key mode currently supports the global endpoint only.

The Main Workflows

1. Translate a localization file

Translate one file:

python translate_cli.py translate source.po
python translate_cli.py translate source.ts
python translate_cli.py translate source.resx
python translate_cli.py translate source.strings
python translate_cli.py translate source.txt

Translate several files in one run when they are the same format:

python translate_cli.py translate first.po second.po third.po

Choose provider and model explicitly:

python translate_cli.py translate source.po --provider openai --model your-model
python translate_cli.py translate source.po --provider anthropic --model your-model
python translate_cli.py translate source.po --provider gemini --model your-model

Useful controls:

python translate_cli.py translate source.po --target-lang fr
python translate_cli.py translate source.po --thinking-level medium
python translate_cli.py translate source.po --batch-size 100 --parallel-requests 4
python translate_cli.py translate source.po --retranslate-all
python translate_cli.py translate source.po --flex
python translate_cli.py translate source.po --warnings-report

Behavior:

default target language is kk
by default, only unfinished messages are translated
--retranslate-all forces already translated messages through translation again
recursive directory translation skips generated toolkit artifacts such as *.ai-translated.*, *.glossary.po, *.missing-terms.po, and *.prototype-*.po
when the scan root is this toolkit repository itself, recursive translation also skips toolkit-owned directories such as data/, logs/, docs/, tests/, tasks/, and core/
translated output is written as *.ai-translated.<ext>
--warnings-report also writes *.translation-warnings.json with only the messages where the model reported ambiguity, unclear meaning, risky glossary choice, or another review-worthy concern
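The skip rules for generated artifacts can be pictured as a simple glob filter. The following is a minimal sketch, not the toolkit's actual code: the patterns are copied from the list above, while `GENERATED_PATTERNS` and `is_generated_artifact` are hypothetical names used only for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the behavior list above.
# The constant and function names are hypothetical, not toolkit API.
GENERATED_PATTERNS = (
    "*.ai-translated.*",
    "*.glossary.po",
    "*.missing-terms.po",
    "*.prototype-*.po",
)


def is_generated_artifact(filename: str) -> bool:
    """Return True when a file name matches a generated-artifact pattern."""
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatchcase(filename, pattern) for pattern in GENERATED_PATTERNS)
```

A recursive scan could call a filter like this on each file name before queuing the file for translation, so that output from earlier runs is never fed back in as input.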

Warnings sidecar behavior:

warnings are emitted per translated message, not as one batch-level summary
each warning item includes structured issues with code, message, and severity, plus the source text, translated text, and any available context, note, or matched relevant_vocabulary
translation warning codes use the translate.* namespace, for example translate.ambiguous_term
severity is warning for real risk or ambiguity, and info for notable but non-risk notes such as preserved structure
this is a lightweight translator self-report; the dedicated check task remains the real QA pass
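As an illustration of how a warnings sidecar might be consumed, the sketch below keeps only the items that carry at least one issue with severity `warning`. The layout is an assumption: the top-level list and the `source` and `translation` keys are guesses, and the second issue code is made up for the example. Only `issues`, `code`, `message`, `severity`, and `translate.ambiguous_term` come from the description above.

```python
import json

# Hypothetical sidecar content. The top-level list, the "source" and
# "translation" keys, and the code "translate.preserved_structure" are
# assumptions made for this example.
SAMPLE = json.loads("""
[
  {
    "source": "Archive",
    "translation": "Package",
    "issues": [
      {"code": "translate.ambiguous_term",
       "message": "Archive could be a noun or a verb here.",
       "severity": "warning"}
    ]
  },
  {
    "source": "Save <b>now</b>",
    "translation": "Save <b>now</b>",
    "issues": [
      {"code": "translate.preserved_structure",
       "message": "Inline markup kept as in the source.",
       "severity": "info"}
    ]
  }
]
""")


def items_needing_review(items):
    """Keep items that have at least one issue with severity 'warning'."""
    return [
        item
        for item in items
        if any(issue.get("severity") == "warning" for issue in item.get("issues", []))
    ]
```

Filtering on severity first is one way to triage a large report, since `info` items describe notable but non-risk observations.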

2. Translate Android XML with a paired source file

Android translated exports often contain only resource IDs on the target side, so translation uses a paired-source workflow:

python translate_cli.py translate translated.xml --source-file source.xml

The Android XML backend:

supports <string> and <plurals>
pairs <string> items by resource name
pairs <plurals> by resource name plus quantity
preserves inline XML such as <xliff:g>
preserves literal escapes such as \n in source style
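The pairing rules above can be sketched as key construction: a `<string>` is identified by its resource name, and each `<plurals>` item by resource name plus quantity. This is an illustrative sketch using only the standard library; `pairing_keys` is a hypothetical helper, not the backend's real function.

```python
import xml.etree.ElementTree as ET

# Small hypothetical resource file used only for this example.
SAMPLE_XML = """
<resources>
  <string name="app_name">Demo</string>
  <plurals name="files">
    <item quantity="one">%d file</item>
    <item quantity="other">%d files</item>
  </plurals>
</resources>
"""


def pairing_keys(xml_text: str) -> list:
    """Build one pairing key per translatable unit in an Android resources file."""
    root = ET.fromstring(xml_text)
    keys = []
    for element in root:
        name = element.get("name")
        if element.tag == "string":
            # <string> items pair by resource name alone.
            keys.append(("string", name))
        elif element.tag == "plurals":
            # <plurals> items pair by resource name plus quantity.
            for item in element.findall("item"):
                keys.append(("plurals", name, item.get("quantity")))
    return keys
```

Computing the same keys for the source file and the target file gives a straightforward way to match entries across the two, even when the target side holds only resource IDs.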

Current translation constraint:

Android .xml translation currently supports one target file at a time and requires --source-file

3. Revise an existing translation

Use revise when a file is already translated and you want targeted changes rather than full retranslation.

For formats that still contain source and translation together:

python translate_cli.py revise translated.po --instruction "Use a shorter term for Preferences"
python translate_cli.py revise translated.ts --instruction "Replace archive with package where the source says package"

For formats where the translated file no longer carries the original source text, pass the matching source file:

python translate_cli.py revise translated.ai-translated.xml --source-file source.xml --instruction "Use natural confirmation questions and preserve literal \\n escapes"
python translate_cli.py revise translated.ai-translated.strings --source-file source.strings --instruction "Shorten viewer labels where possible"
python translate_cli.py revise translated.ai-translated.resx --source-file source.resx --instruction "Use command bar instead of toolbar"
python translate_cli.py revise translated.txt --source-file source.txt --instruction "Use formal tone for Exit"

Revision behavior:

default output path is <input>.revised.<ext>
--in-place overwrites the translated input file
--dry-run reviews and reports changes without writing output
changed AI-reviewed entries are marked as review-required where the format supports it

4. Check a translated PO or TS file

Use check for QA on an already translated PO or TS file:

python translate_cli.py check translated.po
python translate_cli.py check translated.ts
python translate_cli.py check translated.po --probe 50
python translate_cli.py check translated.po --out report.json --include-ok

The checker combines model findings with deterministic local checks for:

placeholders
tags
accelerators
plural slots
approved vocabulary usage

Default output path:

translated.translation-check.json

Check-report issue shape:

each issue uses a structured code, message, and severity
check issue codes use the check.* namespace, for example check.meaning, check.placeholder, or check.terminology

5. Extract glossary terms with a model

Use extract-terms when you want the model to propose glossary entries:

python translate_cli.py extract-terms source.po
python translate_cli.py extract-terms source.xml

Useful variants:

python translate_cli.py extract-terms source.po --mode missing --vocab data/locales/kk/vocab --out-format po
python translate_cli.py extract-terms source.po --mode missing --out-format json --vocab data/locales/kk/vocab
python translate_cli.py extract-terms source.po --out glossary.po --batch-size 200 --parallel-requests 4

Modes:

all: build a broader glossary from the source content
missing: focus on terms that are not already in your existing vocabulary

Output defaults:

all + po -> <input>.glossary.po
missing + po -> <input>.missing-terms.po
missing + json -> <input>.missing-terms.json

When you run missing-term extraction with --out-format po, the generated PO is designed to go straight back into review and then translation:

known terms from the supplied vocabulary are imported
new missing terms are added as reviewable entries

6. Discover terms locally, without a model

Use extract-terms-local when you want a fast local analysis pass with no API call.

Single-file usage:

python translate_cli.py extract-terms-local source.po
python translate_cli.py extract-terms-local source.xml --mode missing --max-length 1

Directory-tree usage:

python translate_cli.py extract-terms-local C:\path\to\source-tree --also-po

Convert a local JSON report into a PO handoff:

python translate_cli.py extract-terms-local source.prototype-missing-terms.json --to-po

Local extraction behavior:

works on one supported source file or a whole directory tree
deduplicates repeated source messages across files
writes a JSON report with accepted, borderline, and translation-candidate terms
can also write a PO handoff file with --also-po

The local extractor deliberately filters common localization noise before scoring terms, including:

placeholders and variable-like tokens
CLI flags and digit-led labels
mnemonic fragments such as underscore accelerators
URL, tag, and attribute noise such as href, src, domain fragments, and embedded markup payloads
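To make the noise categories above concrete, here is a rough sketch of a token filter. It is an assumption-laden illustration, not the extractor's actual rules: the regular expressions, the whitespace tokenization, and the names `NOISE_PATTERNS`, `is_noise`, and `candidate_tokens` are all hypothetical.

```python
import re

# Illustrative patterns only; the real extractor's rules are not shown in this README.
NOISE_PATTERNS = (
    re.compile(r"^%(?:\d+\$)?[sdif]$"),          # printf-style placeholders: %s, %1$s
    re.compile(r"^\{[^{}]*\}$"),                 # brace placeholders: {name}
    re.compile(r"^--?[A-Za-z][\w-]*$"),          # CLI flags: -v, --verbose
    re.compile(r"^\d"),                          # digit-led labels: 3D
    re.compile(r"_"),                            # underscore accelerators: _File
    re.compile(r"^(?:href|src)$", re.IGNORECASE),  # attribute names
)


def is_noise(token: str) -> bool:
    """Return True when a token matches any of the illustrative noise patterns."""
    return any(pattern.search(token) for pattern in NOISE_PATTERNS)


def candidate_tokens(message: str) -> list:
    """Split a message on whitespace and drop tokens classified as noise."""
    return [token for token in message.split() if not is_noise(token)]
```

Filtering before scoring means noisy tokens never compete with real terms for a place in the report.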

Recommended Workflow

For larger localization work, the recommended flow is:

Run local extraction first.
Translate the resulting glossary PO handoff.
Review and approve that glossary.
Use the approved glossary as the vocabulary base for the main translation.
Review and approve the main translated source file.

In practice, that looks like this:

# 1. Local extraction from one file or a source tree
python translate_cli.py extract-terms-local source.po --mode missing --also-po
python translate_cli.py extract-terms-local C:\path\to\source-tree --mode missing --also-po

# 2. Translate the generated glossary PO handoff
python translate_cli.py translate source.prototype-missing-terms.po --vocab data/locales/kk/vocab

# 3. Review and approve the glossary PO
#    Keep only good terms, fix bad translations, and save the approved glossary.

# 4. Use the approved glossary as the base vocabulary for the main translation
python translate_cli.py translate source.po --vocab approved-glossary.po

# 5. Review and approve the main translated source file

Why this workflow is recommended:

extract-terms-local can deterministically avoid terms already present in your approved vocabulary and skip local noise such as stop words, excluded abbreviations, placeholders, tags, and weak phrase candidates
the glossary is reviewed before bulk translation, so terminology is stabilized early
the main translate task can load the approved glossary PO directly through --vocab
the final source translation still needs review, because approved terminology does not replace full QA

Keep a distinction between:

candidate glossary output from local extraction
approved glossary used as translation input

That approved glossary can stay as a reviewed .po passed with --vocab, or it can be merged into your canonical locale vocabulary under data/locales/<target-lang>/.

Vocabulary And Rules

By default, the toolkit looks up language resources from data/locales/<target-lang>/.

Auto-detected resources:

data/locales/<target-lang>/vocab.txt
data/locales/<target-lang>/vocab/
data/locales/<target-lang>/rules.md

Locale fallback is supported. For example, fr_CA falls back to fr if the region-specific resource is not present.
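The fallback rule can be sketched as a lookup order: try the full locale first, then the base language. This is a minimal sketch that assumes underscore-separated locale codes; `locale_candidates` is a hypothetical helper name.

```python
def locale_candidates(target_lang: str) -> list:
    """Return locale codes to try, most specific first."""
    candidates = [target_lang]
    # Assumption: the region is separated from the language by an underscore.
    base = target_lang.split("_", 1)[0]
    if base != target_lang:
        candidates.append(base)
    return candidates
```

A resource loader could walk this list and use the first `data/locales/<code>/` directory that contains the requested resource.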

You can override both resources per run:

python translate_cli.py translate source.po --vocab custom-vocab.txt --rules custom-rules.md
python translate_cli.py translate source.po --vocab custom-vocab --rules-str "Use concise imperative labels."

--vocab accepts:

a glossary .txt
a glossary .po
a glossary .tbx
a directory containing glossary .txt, .po, and .tbx files

When a vocabulary directory is used:

files are loaded in filename order
later duplicates override earlier ones
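The two directory rules above amount to an ordered merge. The sketch below assumes that "filename order" means plain lexicographic sorting and that each file has already been parsed into a source-to-target mapping; `merge_vocab` is a hypothetical name.

```python
def merge_vocab(files: dict) -> dict:
    """Merge per-file glossaries; entries from later file names win on duplicates.

    `files` maps a file name to a dict of source term -> target term.
    """
    merged = {}
    # Assumption: filename order is plain lexicographic order.
    for filename in sorted(files):
        merged.update(files[filename])
    return merged
```

Under this reading, a file named `zz-overrides.txt` would be a natural place for project-specific terms that should take precedence over shared ones.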

Recommended layout:

data/
  locales/
    kk/
      vocab/
        common.txt
        colors.txt
        media.txt
      rules.md
    fr/
      vocab.txt
      rules.md
  extract/
    common/
      abbreviations.txt
      excluded_terms.txt
    en/
      stopwords.txt
      low_value_words.txt
      fixed_multiword_allowlist.txt

Rich vocabulary entries use this schema:

source_term|target_term|part_of_speech|context_note

Example:

archive|package|noun|software package manager context
save|store|verb|short imperative UI action
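A line in this schema can be parsed with a single split on the pipe character. The sketch below is illustrative only: `VocabEntry` and `parse_vocab_line` are hypothetical names, and treating the last two fields as optional is an assumption rather than documented behavior.

```python
from typing import NamedTuple


class VocabEntry(NamedTuple):
    source_term: str
    target_term: str
    part_of_speech: str = ""   # assumed optional
    context_note: str = ""     # assumed optional


def parse_vocab_line(line: str) -> VocabEntry:
    """Parse one 'source|target|pos|note' line into a VocabEntry."""
    # Split at most three times so a note may itself contain '|'.
    parts = [part.strip() for part in line.split("|", 3)]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected at least source|target, got: {line!r}")
    return VocabEntry(*parts)
```

For the first example line above, this yields an entry whose `source_term` is `archive`, `target_term` is `package`, and `part_of_speech` is `noun`.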

During translation, the toolkit still sends the full vocabulary for compatibility, but it also computes relevant_vocabulary per message so each message sees the subset of glossary entries that actually match it.
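One plausible way to compute a per-message subset is a case-insensitive whole-word match of each source term against the message. The README does not say how the toolkit matches terms, so the matching rule below is an assumption, and the function name simply reuses the `relevant_vocabulary` field name for readability.

```python
import re


def relevant_vocabulary(message: str, vocabulary: dict) -> dict:
    """Return the glossary entries whose source term occurs in the message."""
    selected = {}
    for source_term, target_term in vocabulary.items():
        # Assumption: whole-word, case-insensitive matching.
        pattern = r"\b" + re.escape(source_term) + r"\b"
        if re.search(pattern, message, flags=re.IGNORECASE):
            selected[source_term] = target_term
    return selected
```

For example, with a vocabulary holding `archive`, `save`, and `color`, the message "Save the archive" would select only the first two entries.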

When warnings reporting is enabled, the translation response can also include a per-message warnings field. Those warnings are written to a separate JSON sidecar so you can inspect ambiguous or risky messages without rereading the whole translated file.

Format Behavior At A Glance

| Format | Translate | Revise | Extract Terms | Notes |
| --- | --- | --- | --- | --- |
| `.po` | Yes | Yes | Yes | Source and translation live together |
| `.ts` | Yes | Yes | Yes | Source and translation live together |
| `.resx` | Yes | Yes | Yes | Revision requires `--source-file` |
| `.strings` | Yes | Yes | Yes | Revision requires `--source-file` |
| `.txt` | Yes | Yes | Yes | One line equals one message; revision requires `--source-file` |
| Android `.xml` | Yes | Yes | Yes | Translation and revision use paired-source matching |

Additional notes:

check currently supports translated .po and .ts files
.strings translation treats commented key/value entries as untranslated source entries and uncommented entries as translated entries
.strings output preserves file encoding and common literal escape sequences
.txt output preserves original line order and blank lines

GUI

The Tk desktop UI is available here:

python process_gui.py

The GUI is a frontend over the same backend concepts as the CLI. It includes:

shared provider, model, API-key, thinking, and runtime controls
instruction preview for the resolved system prompt and language rules
a Translate tab with Android Source file support
a Local Extract tab for file, folder, and JSON-to-PO local extraction workflows

Translate-tab note:

the GUI enables the translation warnings JSON sidecar by default
a normal translate run writes the translated output file and a matching *.translation-warnings.json report

Project Layout

The repository is intentionally split between shared mechanics and task-specific logic:

translate_cli.py          unified CLI entry point
process_gui.py            Tk frontend
tasks/                    task-specific contracts and runners
core/                     formats, providers, runtime, resources, shared helpers
data/locales/             per-language vocabulary and rules
data/extract/             local-extraction stop words and filters
tests/                    smoke and regression coverage

If you are changing behavior, the important design line is:

core/ owns shared mechanics
tasks/ owns task-specific prompts, schemas, and result handling
process_gui.py should stay frontend-oriented

Gettext Placeholder Note

If Poedit complains after placeholder order changes, check the format flag on the entry:

#, c-format: reordering is allowed with positional placeholders such as %2$s, %1$s
#, python-format: positional forms such as %2$s are not valid, so plain %s placeholders cannot be safely reordered

For python-format, safe reordering requires named placeholders in the source, for example:

msgid "From %(src)s to %(dst)s"
msgstr "%(dst)s to %(src)s"

Always preserve the same placeholder set and types between source and translation.
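The rule about preserving the placeholder set and types can be checked mechanically. The sketch below is a simplified illustration, not the toolkit's checker: it recognizes only `%s`-style, positional (`%1$s`), and named (`%(name)s`) placeholders without flags or widths, and the names `PLACEHOLDER`, `placeholder_signature`, and `same_placeholders` are hypothetical.

```python
import re
from collections import Counter

# Simplified pattern: optional (name) or position$, then a conversion type.
# Flags, widths, and precisions are deliberately not handled here.
PLACEHOLDER = re.compile(r"%(?:\((\w+)\)|(\d+)\$)?([sdifuxXeEgGc])")


def placeholder_signature(text: str) -> Counter:
    """Count placeholders as (name or position or None, type) pairs."""
    cleaned = text.replace("%%", "")  # drop literal percent signs first
    return Counter(
        (name or position or None, kind)
        for name, position, kind in PLACEHOLDER.findall(cleaned)
    )


def same_placeholders(msgid: str, msgstr: str) -> bool:
    """Return True when both strings use the same placeholders with the same types."""
    return placeholder_signature(msgid) == placeholder_signature(msgstr)
```

Because the comparison ignores order, the named-placeholder example above passes even though `%(dst)s` and `%(src)s` are swapped, while a translation that changes `%d` to `%s` fails.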

Smoke Tests

python -m unittest discover -s tests -p "test_*.py" -v
