Locale-aware spelling normalisation in transcription pipeline #49
Description
Problem
Whisper and Parakeet only support bare en as a language parameter — no locale variants (en-AU, en-GB). Both models are trained predominantly on US English text, so they output US spellings: "specialize", "organize", "analyze", etc.
This is the industry-standard behaviour. Every major STT app (macOS Dictation, Google Speech-to-Text, Otter.ai) solves this with post-processing spelling normalisation for the target locale.
Parakeet already handles some categories correctly (e.g. -our words like "colour"), but -ize/-ise words come through as US spelling.
Solution
Add a locale config option and a built-in spelling normalisation step in the filter pipeline.
Design
- Config: `transcription.locale` field, e.g. `"en-AU"` (default), `"en-US"`, `"en-GB"`, `"none"`
- Pipeline position: After transcription, before AI enhancement, alongside the existing filter steps in `filter.rs`
- Lookup method: `HashMap<&str, &str>` whole-word lookup with word-boundary matching. NOT regex substitution (too dangerous: "capsize", "seize", "prize" must not be converted)
- Inflected forms: For each base word (e.g. organize→organise), generate -izes→-ises, -ized→-ised, -izing→-ising forms automatically
- Performance: HashMap lookup is O(1) per word. A 50-word transcription = ~50 lookups = microseconds. Invisible next to the 400ms+ transcription time.
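
A minimal sketch of how the design points above could fit together, using only the standard library. The names `Locale`, `sample_table`, `normalise` and `flush_word` are hypothetical and not taken from the existing codebase. Treating `en-GB` the same as `en-AU`, and preserving only a leading capital letter, are assumptions of this sketch.

```rust
use std::collections::HashMap;
use std::str::FromStr;

/// Hypothetical enum mirroring the proposed `transcription.locale` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Locale {
    EnAu,
    EnUs,
    EnGb,
    Off, // the "none" setting
}

impl FromStr for Locale {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "en-AU" => Ok(Locale::EnAu),
            "en-US" => Ok(Locale::EnUs),
            "en-GB" => Ok(Locale::EnGb),
            "none" => Ok(Locale::Off),
            other => Err(format!("unsupported locale: {other}")),
        }
    }
}

/// Tiny illustrative table; the real word list would be much larger.
pub fn sample_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("organize", "organise"),
        ("specialize", "specialise"),
        ("analyze", "analyse"),
    ])
}

/// Replace whole words found in `table`; all other text is copied through.
pub fn normalise(text: &str, locale: Locale, table: &HashMap<&str, &str>) -> String {
    // US English and "none" leave the model output untouched.
    if matches!(locale, Locale::EnUs | Locale::Off) {
        return text.to_string();
    }
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphabetic() {
            word.push(ch);
        } else {
            // Any non-letter is a word boundary: look up the finished word.
            flush_word(&mut word, &mut out, table);
            out.push(ch);
        }
    }
    flush_word(&mut word, &mut out, table);
    out
}

/// Append `word` (or its replacement) to `out`, then clear `word`.
fn flush_word(word: &mut String, out: &mut String, table: &HashMap<&str, &str>) {
    if word.is_empty() {
        return;
    }
    let lower = word.to_lowercase();
    match table.get(lower.as_str()) {
        Some(replacement) => {
            // Assumption: only a leading capital is carried over.
            if word.chars().next().map_or(false, |c| c.is_uppercase()) {
                let mut chars = replacement.chars();
                if let Some(first) = chars.next() {
                    out.extend(first.to_uppercase());
                    out.push_str(chars.as_str());
                }
            } else {
                out.push_str(replacement);
            }
        }
        None => out.push_str(word),
    }
    word.clear();
}
```

Because the lookup only fires on complete words, "organizer" or "disorganize" would pass through unchanged unless they are listed explicitly.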
Word categories to normalise (en-AU)
| Category | Example | Count |
|---|---|---|
| -ize → -ise | organize→organise, specialize→specialise | ~120 base words |
| -ization → -isation | organization→organisation | ~60 base words |
| -yze → -yse | analyze→analyse, paralyze→paralyse | ~6 base words |
Categories NOT needed initially (Parakeet handles these, or they're rare):
- -or → -our (color→colour) — already correct
- -er → -re (center→centre) — monitor, add if needed
- -ense → -ence (defense→defence) — rare in dictation
- -og → -ogue (catalog→catalogue) — rare in dictation
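
One possible way to expand the three table categories into lookup entries, including the inflected forms, is sketched below. `build_table` is a hypothetical helper; passing the -ization nouns as a separate list (rather than deriving them from every -ize verb) is an assumption made here because the table gives different counts for the two categories.

```rust
use std::collections::HashMap;

/// Build a US → en-AU lookup table from explicit base-word lists.
/// `verbs` are base forms ending in "-ize" or "-yze";
/// `nouns` are words ending in "-ization".
pub fn build_table(verbs: &[&str], nouns: &[&str]) -> HashMap<String, String> {
    let mut table = HashMap::new();
    for verb in verbs {
        // Split off the ending, remembering whether the vowel is "i" or "y".
        let (stem, vowel) = if let Some(s) = verb.strip_suffix("ize") {
            (s, "i")
        } else if let Some(s) = verb.strip_suffix("yze") {
            (s, "y")
        } else {
            // Not one of the listed patterns: skip rather than guess.
            continue
        };
        // Base form plus the three inflections named in the design.
        for (us_tail, au_tail) in [("ze", "se"), ("zes", "ses"), ("zed", "sed"), ("zing", "sing")] {
            table.insert(
                format!("{stem}{vowel}{us_tail}"),
                format!("{stem}{vowel}{au_tail}"),
            );
        }
    }
    for noun in nouns {
        if let Some(stem) = noun.strip_suffix("ization") {
            table.insert(noun.to_string(), format!("{stem}isation"));
        }
    }
    table
}
```

For example, `build_table(&["organize", "analyze"], &["organization"])` would yield nine entries, among them organizing→organising and analyzed→analysed.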
Important: words to NEVER convert
Some words ending in -ize/-ise are the same in all English variants. The implementation must use an explicit allowlist of convertible words, not a suffix pattern:
- capsize, seize, prize, size (always -ize)
- advertise, advise, supervise, surprise, exercise, compromise, improvise (always -ise)
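
To illustrate why a suffix pattern is unsafe, the sketch below contrasts a naive -ize rule with an allowlist check, and adds a guard that could run in a unit test to catch protected words slipping into the word list. All function and constant names are hypothetical.

```rust
use std::collections::HashSet;

/// Words from the list above that must come through unchanged.
pub const NEVER_CONVERT: [&str; 11] = [
    "capsize", "seize", "prize", "size",
    "advertise", "advise", "supervise", "surprise",
    "exercise", "compromise", "improvise",
];

/// Anti-pattern, shown for contrast only: rewrite any "-ize" ending.
pub fn naive_suffix(word: &str) -> String {
    match word.strip_suffix("ize") {
        Some(stem) => format!("{stem}ise"),
        None => word.to_string(),
    }
}

/// Allowlist approach: convert only words that are explicitly listed.
pub fn allowlisted(word: &str, convertible: &HashSet<&str>) -> String {
    if convertible.contains(word) {
        naive_suffix(word)
    } else {
        word.to_string()
    }
}

/// Guard for the word list: returns any protected word that was
/// mistakenly added to the convertible set (should always be empty).
pub fn protected_words_in(convertible: &HashSet<&str>) -> Vec<&'static str> {
    NEVER_CONVERT
        .iter()
        .copied()
        .filter(|w| convertible.contains(*w))
        .collect()
}
```

With a convertible set containing only "organize", `naive_suffix("capsize")` produces the wrong "capsise", while `allowlisted("capsize", &set)` leaves it alone and `allowlisted("organize", &set)` still gives "organise".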
Files to modify
- `src-tauri/src/config.rs`: add `locale` field to `TranscriptionConfig`
- `src-tauri/src/transcription/filter.rs`: add normalisation step
- New file `src-tauri/src/transcription/locale.rs`: word lists and lookup logic
- `src/lib/components/TranscriptionPane.svelte` (or equivalent settings UI): locale selector