Locale-aware spelling normalisation in transcription pipeline #49

@poodle64

Problem

Whisper and Parakeet only support bare en as a language parameter — no locale variants (en-AU, en-GB). Both models are trained predominantly on US English text, so they output US spellings: "specialize", "organize", "analyze", etc.

This is the industry-standard behaviour. Every major STT app (macOS Dictation, Google Speech-to-Text, Otter.ai) solves this with post-processing spelling normalisation for the target locale.

Parakeet already handles some categories correctly (e.g. -our words like "colour"), but -ize/-ise words come through as US spelling.

Solution

Add a locale config option and a built-in spelling normalisation step in the filter pipeline.

Design

  • Config: transcription.locale field, e.g. "en-AU" (default), "en-US", "en-GB", "none"
  • Pipeline position: After transcription, before AI enhancement — alongside the existing filter steps in filter.rs
  • Lookup method: HashMap<&str, &str> whole-word lookup with word-boundary matching. NOT regex substitution (too dangerous — "capsize", "seize", "prize" must not be converted)
  • Inflected forms: For each base word (e.g. organize→organise), generate -izes→-ises, -ized→-ised, -izing→-ising forms automatically
  • Performance: a HashMap lookup is O(1) on average per word, so a 50-word transcription costs about 50 lookups, which is on the order of microseconds. That is invisible next to the 400ms+ transcription time.
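The design above could be sketched roughly as follows. This is a minimal illustration, not the final locale.rs: the function names (build_table, normalise, flush_word, match_case) are hypothetical, the table uses owned Strings rather than HashMap<&str, &str> so the inflected forms can be generated at runtime, and only a leading capital letter is preserved when a word is replaced.

```rust
use std::collections::HashMap;

/// Builds a US -> AU lookup table from base "-ize" verbs and adds the
/// inflected forms (-izes, -ized, -izing) for each one.
fn build_table(bases: &[&str]) -> HashMap<String, String> {
    let mut table = HashMap::new();
    for base in bases {
        // "organize" -> stem "organ"; entries not ending in "ize" are skipped.
        if let Some(stem) = base.strip_suffix("ize") {
            for (us, au) in [("ize", "ise"), ("izes", "ises"), ("ized", "ised"), ("izing", "ising")] {
                table.insert(format!("{stem}{us}"), format!("{stem}{au}"));
            }
        }
    }
    table
}

/// Copies the leading capital of `original` onto `replacement`.
/// Assumption: other casings (e.g. ALL CAPS) are out of scope for this sketch.
fn match_case(original: &str, replacement: &str) -> String {
    let starts_upper = original.chars().next().map_or(false, char::is_uppercase);
    if !starts_upper {
        return replacement.to_string();
    }
    let mut chars = replacement.chars();
    match chars.next() {
        Some(first) => first.to_uppercase().chain(chars).collect(),
        None => String::new(),
    }
}

/// Emits the buffered word, replaced if it is in the table, then clears the buffer.
fn flush_word(word: &mut String, table: &HashMap<String, String>, out: &mut String) {
    if word.is_empty() {
        return;
    }
    match table.get(&word.to_lowercase()) {
        Some(replacement) => out.push_str(&match_case(word.as_str(), replacement)),
        None => out.push_str(word.as_str()),
    }
    word.clear();
}

/// Whole-word normalisation: runs of alphabetic characters are looked up,
/// everything else (spaces, punctuation, digits) is copied through unchanged.
fn normalise(text: &str, table: &HashMap<String, String>) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_alphabetic() {
            word.push(ch);
        } else {
            flush_word(&mut word, table, &mut out);
            out.push(ch);
        }
    }
    flush_word(&mut word, table, &mut out);
    out
}
```

With a table built from "organize" and "specialize", the input "We Organize and specialized, but capsize stays." comes back with "Organise" and "specialised", while "capsize" is left alone because it was never in the table.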

Word categories to normalise (en-AU)

| Category | Example | Count |
|---|---|---|
| -ize → -ise | organize→organise, specialize→specialise | ~120 base words |
| -ization → -isation | organization→organisation | ~60 base words |
| -yze → -yse | analyze→analyse, paralyze→paralyse | ~6 base words |
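The three categories in the table map onto three suffix pairs. A small sketch of that mapping, assuming a hypothetical helper to_au_suffix that is only ever called on words already known to be convertible (it is not safe to apply to arbitrary input):

```rust
/// (US suffix, AU suffix) pairs, one per category in the table.
const SUFFIX_PAIRS: [(&str, &str); 3] = [
    ("ization", "isation"),
    ("yze", "yse"),
    ("ize", "ise"),
];

/// Rewrites the suffix of a word that has already passed the allowlist check.
/// Returns None if the word matches none of the three categories.
fn to_au_suffix(word: &str) -> Option<String> {
    SUFFIX_PAIRS
        .iter()
        .find_map(|&(us, au)| word.strip_suffix(us).map(|stem| format!("{stem}{au}")))
}
```

A helper like this would be useful for generating table entries from the base word lists, rather than for matching transcribed text directly.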

Categories NOT needed initially (Parakeet handles these, or they're rare):

  • -or → -our (color→colour): Parakeet already outputs these correctly
  • -er → -re (center→centre): keep an eye on the output and add if needed
  • -ense → -ence (defense→defence): rare in dictation
  • -og → -ogue (catalog→catalogue): rare in dictation

Important: words to NEVER convert

Some words ending in -ize/-ise are the same in all English variants. The implementation must use an explicit allowlist of convertible words, not a suffix pattern:

  • capsize, seize, prize, size (always -ize)
  • advertise, advise, supervise, surprise, exercise, compromise, improvise (always -ise)
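To make the failure mode concrete, here is a short sketch contrasting a blanket suffix rule with an allowlist check. Both function names are hypothetical, and the allowlist is passed in as a HashSet purely for illustration.

```rust
use std::collections::HashSet;

/// Blanket suffix rule: rewrites ANY word ending in "ize".
/// Shown only to illustrate why this approach is unsafe.
fn naive_suffix(word: &str) -> String {
    match word.strip_suffix("ize") {
        Some(stem) => format!("{stem}ise"),
        None => word.to_string(),
    }
}

/// Allowlist rule: a word is rewritten only if it is explicitly listed.
fn allowlisted(word: &str, convertible: &HashSet<&str>) -> String {
    if convertible.contains(word) {
        naive_suffix(word)
    } else {
        word.to_string()
    }
}
```

The blanket rule turns "capsize" into "capsise" and "size" into "sise". The allowlist version leaves both untouched unless someone adds them to the list by mistake.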

Files to modify

  • src-tauri/src/config.rs — add locale field to TranscriptionConfig
  • src-tauri/src/transcription/filter.rs — add normalisation step
  • New file: src-tauri/src/transcription/locale.rs — word lists and lookup logic
  • src/lib/components/TranscriptionPane.svelte (or equivalent settings UI) — locale selector
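For the config.rs change, one possible shape for the new field is sketched below. The Locale enum, its variant names, and skips_normalisation are assumptions for illustration. The real TranscriptionConfig has other fields and presumably its own (de)serialisation setup, which is omitted here in favour of a plain FromStr impl.

```rust
use std::str::FromStr;

/// Hypothetical locale setting; the string forms match the values proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Locale {
    #[default]
    EnAu, // "en-AU", the proposed default
    EnUs, // "en-US"
    EnGb, // "en-GB"
    Off,  // "none": normalisation disabled
}

impl FromStr for Locale {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "en-AU" => Ok(Locale::EnAu),
            "en-US" => Ok(Locale::EnUs),
            "en-GB" => Ok(Locale::EnGb),
            "none" => Ok(Locale::Off),
            other => Err(format!("unsupported locale: {other}")),
        }
    }
}

impl Locale {
    /// The models already emit US spellings, so "en-US" and "none"
    /// both mean the normalisation step has nothing to do.
    pub fn skips_normalisation(self) -> bool {
        matches!(self, Locale::EnUs | Locale::Off)
    }
}

/// Stand-in for the existing struct, showing only the new field.
#[derive(Debug, Clone, Default)]
pub struct TranscriptionConfig {
    pub locale: Locale,
}
```

The filter step could then check config.locale.skips_normalisation() and return early before touching the lookup table.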
