fix(translate): sweep common-word wrong-forms from 7 locales' _protected #197
Merged
The #172 sweep removed dangerous common-word wrong-forms from the CJK dictionaries only. The same bug class was still live in es/fr/it/de/pt-BR/ru/vi: everyday words were registered as brand "wrong-forms", so the unanchored replaceAll in restoreProtectedTerms rewrote correct prose into English on virtually every lesson, e.g.

  de  "die Zusammenarbeit im Unternehmen" -> "die Cowork im Enterprise"
  it  "le competenze che svilupperai" -> "le skills che svilupperai"
  fr  "l'extension de navigateur" -> "l'Plugin de navigateur"
  ru  "эти навыки общения" -> "эти Skills общения"
  vi  "phần đầu của bài học" -> "frontmatter của bài học"

Education content uses these words (skills/compétences/competenze/навыки/kỹ năng …) constantly, so 7 of the 11 premium languages were degraded on every page.

Curation rule (same as #172): keep only proper-noun mistranslations, transliterations, and coined anchored phrases (Claudio, Anthropique, Клод Код, Mã Claude, Claude-Code, Schrägstrich-Befehl, sous-agent, …); drop the everyday words/phrases (Enterprise/Unternehmen/Empresa/Impresa, skills-words, Plugin/extension/Erweiterung/complemento, hook/crochet/Haken/gancho/móc, Cowork/Zusammenarbeit/travail collaboratif, Dispatch/envío/envio/Отправка, Computer-Use phrases, frontmatter/préambule/preámbulo/phần đầu, Personal-words). Russian keeps its loanword forms (Плагин, хук, Коворк, Диспетчеризация) matching the Korean precedent (플러그인/후크/코워크). it.json also loses the Spanish leftovers its _protected inherited from the es-derived v1 (Código Claude, habilidades, gancho, …) and the actively wrong "Plugin"->"Plugins" mapping.

Proof: extended the real-dictionary regression test with ordinary-prose sentences for all 7 locales; all 7 failed against the old dictionaries and pass after the sweep (42/42). Plugin terms data regenerated in sync.

Gates (all green): 527 unit tests, full E2E 19/19, validate, glossary, check:plugin/i18n/dict-coverage/locales/dicts, lint, prettier.

Store zips NOT rebuilt (owner builds on instruction).
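
The failure mode described in this commit message can be reproduced with a minimal sketch. This is not the project's restoreProtectedTerms: the function name `restoreTermsSketch`, the flat wrong-form -> canonical map shape, and the "Claude-Code" -> "Claude Code" target are assumptions made for illustration. Only the German before/after sentence comes from the commit message.

```javascript
// Hypothetical sketch, NOT the project's restoreProtectedTerms.
// Assumed dictionary shape: { wrongForm: canonicalTerm }.
function restoreTermsSketch(text, protectedMap) {
  let out = text;
  for (const [wrongForm, canonical] of Object.entries(protectedMap)) {
    // Unanchored: every occurrence is rewritten, whatever the context.
    out = out.replaceAll(wrongForm, canonical);
  }
  return out;
}

// Entries of the kind the sweep removed: everyday German words
// registered as if they were mistranslated brand names.
const beforeSweep = {
  Zusammenarbeit: "Cowork",
  Unternehmen: "Enterprise",
  "Claude-Code": "Claude Code", // assumed canonical target
};

// After curation only the coined, brand-specific wrong-form remains.
const afterSweep = { "Claude-Code": "Claude Code" };

const prose = "die Zusammenarbeit im Unternehmen";
const corrupted = restoreTermsSketch(prose, beforeSweep);
const preserved = restoreTermsSketch(prose, afterSweep);
const brandFixed = restoreTermsSketch("Öffne Claude-Code im Terminal", afterSweep);

console.log(corrupted); // die Cowork im Enterprise
console.log(preserved); // die Zusammenarbeit im Unternehmen
console.log(brandFixed); // Öffne Claude Code im Terminal
```

Under these assumptions, curating the dictionary leaves the replacement logic untouched: ordinary prose survives because no everyday word is a key any more, while the coined wrong-form is still repaired.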
heznpc added a commit that referenced this pull request on Jun 10, 2026:
… + machine-readable) (#203)

Makes the moat visible and the QA state machine-readable instead of claimed:

- _meta gains two QA fields in all 11 dictionaries: lastAudited (stamped by the pre-release LLM audit; 2026-06-10 for the audit that shipped in #197/#199) and nativeReview ("recruiting" -> "reviewed" after a native pass).
- generate-docs.js gains a LOCALE_QA marker: the README per-locale QA table (entries / last curated / last audit / native-review status) is generated from _meta by `npm run docs`, so the public table cannot drift from reality.
- README "Terminology QA" section: the standing pipeline (drift watcher -> same-day dictionary wiring -> CI gates -> real-dictionary regression suite) with the verifiable same-day proof (2026-06-10: #196 detected morning, #201 wired all 11 locales same day) + the generated table + native-reviewer call.
- docs/TRANSLATION_QA.md: the three-layer assurance model, honest about what each layer does NOT catch and why no paid API can sit in CI (free-forever).
- RELEASE_CHECKLIST step 0: the pre-release LLM dictionary audit is now a release convention, with the _meta.lastAudited stamping step.
- CONTRIBUTING "Native language reviewers" section; recruitment umbrella issue #202 (help wanted / good first issue / i18n) with per-locale checklist.

Gates (all green): validate, glossary, check:i18n/dict-coverage/locales/dicts/plugin, 527 unit tests, lint, prettier.
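
The marker-driven table generation mentioned for generate-docs.js could look roughly like the sketch below. It is an illustration only: the HTML-comment marker syntax, the function names, and the `entries` and `lastCurated` field names are assumptions, since the commit message names only `lastAudited` and `nativeReview`.

```javascript
// Hypothetical sketch, NOT the project's generate-docs.js.
const START = "<!-- LOCALE_QA:START -->"; // assumed marker syntax
const END = "<!-- LOCALE_QA:END -->"; // assumed marker syntax

// Build one table row per locale from each dictionary's _meta block.
function renderQaTable(dictionaries) {
  const header = [
    "| Locale | Entries | Last curated | Last audit | Native review |",
    "| --- | --- | --- | --- | --- |",
  ];
  const rows = Object.keys(dictionaries)
    .sort()
    .map((locale) => {
      const meta = dictionaries[locale]._meta;
      // `entries` and `lastCurated` are assumed field names.
      return `| ${locale} | ${meta.entries} | ${meta.lastCurated} | ${meta.lastAudited} | ${meta.nativeReview} |`;
    });
  return [...header, ...rows].join("\n");
}

// Replace whatever sits between the markers with a freshly rendered table.
function injectQaTable(readme, dictionaries) {
  const start = readme.indexOf(START);
  const end = readme.indexOf(END);
  if (start === -1 || end === -1 || end < start) {
    throw new Error("LOCALE_QA markers not found in README");
  }
  const before = readme.slice(0, start + START.length);
  const after = readme.slice(end);
  return `${before}\n${renderQaTable(dictionaries)}\n${after}`;
}
```

Because everything between the markers is overwritten on each run, a hand-edited row would not survive regeneration, which is one way a generated table stays tied to the dictionary metadata.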
Dictionary-accuracy audit follow-up. #172 swept the CJK dictionaries only; the same corruption class was still live in es/fr/it/de/pt-BR/ru/vi: everyday words were registered as brand "wrong-forms", so `restoreProtectedTerms`'s unanchored `replaceAll` rewrote correct prose into English on virtually every lesson. Education content uses these words constantly, so 7 of the 11 premium languages were degraded on every page.

Curation (same rule as #172)
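- Keep only proper-noun mistranslations, transliterations, and coined anchored phrases (`Claudio`, `Anthropique`, `Клод Код`, `Mã Claude`, `Claude-Code`, `Schrägstrich-Befehl`, `sous-agent`, …). Russian keeps its loanwords (`Плагин`, `хук`, `Коворк`) matching the Korean precedent (플러그인/후크/코워크).
- it.json also loses the Spanish leftovers its _protected inherited from the es-derived v1 (`Código Claude`, `habilidades`, `gancho`, …) and the actively wrong `"Plugin"` → `"Plugins"` mapping.

Proof (TDD)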
Extended the real-dictionary regression test with ordinary-prose sentences for all 7 locales — all 7 failed against the old dictionaries and pass after the sweep (42/42). CJK guard tests unchanged.
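
The shape of such an ordinary-prose regression check can be sketched as follows. The harness name `findProseRegressions`, the `restoreStandIn` helper, the flat `_protected` map shape, and the canonical targets for `Claudio` and `Anthropique` are assumptions for illustration; the real suite loads the actual dictionaries and is not shown in this PR text. The Italian and French sentences are the ones quoted in the commit message.

```javascript
// Hypothetical harness, NOT the project's regression suite.
// Returns every sentence that a dictionary rewrote even though it is ordinary prose.
function findProseRegressions(restore, dictionaries, proseByLocale) {
  const failures = [];
  for (const [locale, sentences] of Object.entries(proseByLocale)) {
    const protectedMap = dictionaries[locale]._protected;
    for (const sentence of sentences) {
      const actual = restore(sentence, protectedMap);
      if (actual !== sentence) failures.push({ locale, sentence, actual });
    }
  }
  return failures;
}

// Stand-in for restoreProtectedTerms (assumed signature and behaviour).
const restoreStandIn = (text, map) =>
  Object.entries(map).reduce(
    (out, [wrongForm, canonical]) => out.replaceAll(wrongForm, canonical),
    text
  );

const ordinaryProse = {
  it: ["le competenze che svilupperai"],
  fr: ["l'extension de navigateur"],
};

// Old-style entries: everyday words registered as wrong-forms.
const oldDicts = {
  it: { _protected: { competenze: "skills" } },
  fr: { _protected: { extension: "Plugin" } },
};

// Swept entries: only proper-noun mistranslations remain (targets assumed).
const sweptDicts = {
  it: { _protected: { Claudio: "Claude" } },
  fr: { _protected: { Anthropique: "Anthropic" } },
};

const failuresBefore = findProseRegressions(restoreStandIn, oldDicts, ordinaryProse);
const failuresAfter = findProseRegressions(restoreStandIn, sweptDicts, ordinaryProse);
console.log(failuresBefore.length, failuresAfter.length); // 2 0
```

Asserting that prose passes through unchanged covers what a test of brand-term repair alone would miss: a dictionary entry that matches too much.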
Gates: 527 unit · full E2E 19/19 · validate · glossary · check:* — green. Plugin terms data regenerated in sync. Store zips deliberately not rebuilt (owner builds on instruction).
🤖 Generated with Claude Code