fix(i18n): stop protected-term restoration from corrupting correct CJK prose #172
Merged
…K prose

A service-quality audit confirmed (reproduced from shipped data) that restoreProtectedTerms (an unanchored replaceAll with no CJK word boundary) corrupts CORRECT translations whenever a _protected "wrong form" is itself a common standalone word:

  클라우드 (Cloud) -> Claude: "클라우드 컴퓨팅" -> "Claude 컴퓨팅"
  인류 (humanity) -> Anthropic: "인류의 미래" -> "Anthropic의 미래"
  기업 (enterprise) -> Enterprise: "대기업 환경" -> "대Enterprise 환경"
  협업 (collaborate) -> Cowork, etc.

This hit all four CJK locales (ko/ja/zh-CN/zh-TW) live. It is worse than untranslated text, since it ships confident, brand-correct-looking but semantically wrong output. The code comment already documented the substring class; the data was triggering it broadly.

Fix: removed the dangerous wrong-forms (common words whose meaning differs from the brand) from the _protected sections of all 4 CJK dictionaries, keeping the brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント). Intended restorations still work (클로드->Claude, 서브에이전트->subagent verified); brand-English fidelity that's lost (e.g. Enterprise occasionally shown as 기업) is the Gemini-verify pass's job and is far preferable to corrupting real prose. Regenerated the companion plugin's bundled data accordingly.

Tests: added a regression guard in protected-terms.test.js that runs the REAL src/data/*.json (not a fixture) and asserts ordinary CJK prose passes through untouched; it fails if a dangerous wrong-form is ever re-introduced. 492 tests, eslint, check:plugin, check:locales green.

Note: the src/data diff is large because ja/zh _protected arrays were normalized from multi-line to single-line; the only semantic change is the removed forms.
The CJK _protected fix removed dangerous common-word wrong-forms (인류학적 "anthropological" → Anthropic, etc.). The protected-terms E2E stub deliberately mistranslated "Anthropic" → 인류학적 to prove the restoration mechanism runs; that form is now (correctly) no longer restored, so the test broke. Switched the stub + assertion to 앤스로픽 — a pure transliteration of "Anthropic" that is NOT a real word and so is still (and safely) restored. The mechanism assertion (wrong-form → English brand name in the DOM) is unchanged; only the sample wrong-form moved to one that isn't a corruption hazard. Verified: full local E2E suite (17 specs) green, including this one.
This was referenced Jun 4, 2026
heznpc added a commit that referenced this pull request Jun 9, 2026
…tency) (#181)

A verified readiness audit found the code is done (505 tests, 0 open issues) but front-door docs had drifted. Fixes (all factual/compliance, not the deferred strategy docs):

Factual errors (were misleading users/owner):
- README Installation said the CWS listing "was removed ... not currently available" (full delisting). It is actually live as v1.0.1 in all locales except the US (removed 2026-05-12 over the old icon). Corrected to match POSITIONING (the source of truth).
- RELEASE_CHECKLIST pointed at store-assets/promotion/ drafts that were purged and no longer exist. Removed the dead pointer (drafts are kept off-repo).

Stale (now closed):
- CHANGELOG [Unreleased] was missing #167/#170/#172/#174/#175/#176/#179/#180; added them.
- it.json _meta.translation_provenance + lastUpdated (and the matching constants.js comment, README locale-table cell) still said "v1, Spanish-derived regex", although it was re-translated from English in #166/#167 (overlap now 0.1%). Updated; regenerated plugin data accordingly.
- TESTING.md listed "E2E flows" under "What is NOT tested", but the Playwright E2E suite exists and runs in CI. Reframed to describe what E2E covers.
- PRIVACY_POLICY "Last updated" dateline was April 11 despite June changes.

Gates green: 505 tests, lint, prettier, validate, check:plugin/dicts/locales/i18n/dict-coverage, full E2E (17).

Deferred strategy docs (POSITIONING, quarter-focus) untouched; they are owned by the separate doc-cleanup session.
heznpc added a commit that referenced this pull request Jun 10, 2026
…ted (#197)

The #172 sweep removed dangerous common-word wrong-forms from the CJK dictionaries only. The same bug class was still live in es/fr/it/de/pt-BR/ru/vi: everyday words were registered as brand "wrong-forms", so the unanchored replaceAll in restoreProtectedTerms rewrote correct prose into English on virtually every lesson, e.g.

  de "die Zusammenarbeit im Unternehmen" -> "die Cowork im Enterprise"
  it "le competenze che svilupperai" -> "le skills che svilupperai"
  fr "l'extension de navigateur" -> "l'Plugin de navigateur"
  ru "эти навыки общения" -> "эти Skills общения"
  vi "phần đầu của bài học" -> "frontmatter của bài học"

Education content uses these words (skills/compétences/competenze/навыки/kỹ năng …) constantly, so 7 of the 11 premium languages were degraded on every page.

Curation rule (same as #172): keep only proper-noun mistranslations, transliterations, and coined anchored phrases (Claudio, Anthropique, Клод Код, Mã Claude, Claude-Code, Schrägstrich-Befehl, sous-agent, …); drop the everyday words/phrases (Enterprise/Unternehmen/Empresa/Impresa, skills-words, Plugin/extension/Erweiterung/complemento, hook/crochet/Haken/gancho/móc, Cowork/Zusammenarbeit/travail collaboratif, Dispatch/envío/envio/Отправка, Computer-Use phrases, frontmatter/préambule/preámbulo/phần đầu, Personal-words). Russian keeps its loanword forms (Плагин, хук, Коворк, Диспетчеризация), matching the Korean precedent (플러그인/후크/코워크). it.json also loses the Spanish leftovers its _protected inherited from the es-derived v1 (Código Claude, habilidades, gancho, …) and the actively wrong "Plugin"->"Plugins" mapping.

Proof: extended the real-dictionary regression test with ordinary-prose sentences for all 7 locales. All 7 failed against the old dictionaries and pass after the sweep (42/42). Plugin terms data regenerated in sync.

Gates: 527 unit tests, full E2E 19/19, validate, glossary, check:plugin/i18n/dict-coverage/locales/dicts, lint, prettier: all green.
Store zips NOT rebuilt (owner builds on instruction).
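The curation rule above amounts to asking, for each registered wrong-form, whether it can occur in ordinary prose. A minimal sketch of such an audit helper follows. The function name `firingWrongForms`, the brand-to-wrong-forms dictionary shape, and the `Claude-Code` -> `Claude Code` mapping are assumptions for illustration, not the repository's actual code or data.

```javascript
// Hypothetical audit helper: lists every wrong-form that would fire on a
// sentence of ordinary prose. Not part of the real code base.
function firingWrongForms(protectedTerms, sentence) {
  const hits = [];
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      // Same unanchored substring match that replaceAll would perform.
      if (sentence.includes(wrong)) hits.push({ wrong, brand });
    }
  }
  return hits;
}

// Assumed shape: brand name -> array of wrong-forms (illustrative data only).
const deBefore = {
  "Claude Code": ["Claude-Code"],
  Cowork: ["Zusammenarbeit"],
  Enterprise: ["Unternehmen"],
};
// After curation, only the coined anchored phrase remains.
const deAfter = { "Claude Code": ["Claude-Code"] };

const prose = "die Zusammenarbeit im Unternehmen";
const hitsBefore = firingWrongForms(deBefore, prose);
const hitsAfter = firingWrongForms(deAfter, prose);

console.log(hitsBefore.map((h) => `${h.wrong} -> ${h.brand}`));
console.log(hitsAfter.length);
```

Any entry this helper reports for a sentence that is already correct is a candidate for removal under the curation rule.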
Confirmed bug (reproduced from shipped data)
A service-quality audit found, and I reproduced directly from src/data/*.json, that restoreProtectedTerms (src/lib/protected-terms.js:73, an unanchored replaceAll with no CJK word boundary) corrupts correct translations whenever a _protected "wrong form" is itself a common standalone word:

- 클라우드 (Cloud) → Claude: "클라우드 컴퓨팅" → "Claude 컴퓨팅"
- 인류 (humanity) → Anthropic: "인류의 미래" → "Anthropic의 미래"
- 기업 (enterprise) → Enterprise: "대기업 환경" → "대Enterprise 환경"
- 협업 (collaborate) → Cowork, etc.

This affected all four CJK locales (ko/ja/zh-CN/zh-TW) in the live build. It's worse than untranslated text: it ships confident, brand-correct-looking but semantically wrong output, in a target market. The in-code comment already documented the substring-corruption class; the shipped dictionaries were triggering it widely.
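The failure mode can be reproduced in a few lines. The sketch below is a simplified stand-in, not the real restoreProtectedTerms: the function name `restoreSketch` and the brand-to-wrong-forms dictionary shape are assumptions, and only the Korean examples quoted in this PR are used as data.

```javascript
// Simplified stand-in for restoreProtectedTerms (the real one lives in
// src/lib/protected-terms.js and may differ in signature and details).
function restoreSketch(text, protectedTerms) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      // Unanchored: every occurrence is replaced, even inside a longer word.
      out = out.replaceAll(wrong, brand);
    }
  }
  return out;
}

// Assumed dictionary shape before the fix: brand -> wrong-forms.
const koBefore = {
  Claude: ["클로드", "클라우드"],
  Anthropic: ["인류"],
  Enterprise: ["기업"],
};
// After the fix only the transliteration is kept.
const koAfter = { Claude: ["클로드"] };

console.log(restoreSketch("클라우드 컴퓨팅", koBefore)); // "Claude 컴퓨팅"
console.log(restoreSketch("대기업 환경", koBefore)); // "대Enterprise 환경"
console.log(restoreSketch("클라우드 컴퓨팅", koAfter)); // unchanged
console.log(restoreSketch("클로드 코드", koAfter)); // "Claude 코드"
```

The second call shows why a word boundary alone would not be a full answer for CJK text: 기업 sits inside 대기업 with no whitespace on either side, so pruning the data is the safer lever.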
Fix
Removed the dangerous wrong-forms (common words whose meaning ≠ the brand) from the _protected section of all 4 CJK dictionaries, keeping brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント).

- Intended restorations still work: 클로드→Claude, 서브에이전트→subagent, 플러그인→Plugin.
- Brand-English fidelity that's lost (e.g. Enterprise shown as 기업) is the Gemini-verify pass's job and is far preferable to corrupting real prose.
- Regenerated the companion plugin's bundled _protected data to match.

Test (directly answers "are the tests real?")
The existing tests used a fixture containing the dangerous forms and asserted the mechanism; they never tested the corruption the code comment itself documents. Added a regression guard that runs the real shipped src/data/*.json and asserts ordinary CJK prose passes through untouched; it fails if a dangerous wrong-form is re-introduced.
Verification
492 tests (+4 regression) · eslint · check:plugin (regen in sync) · check:locales · glossary · validate — all green.
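The shape of the regression guard described above can be sketched as follows. This is an illustration only: the real guard in protected-terms.test.js loads the shipped src/data/*.json, whereas this sketch uses a small inline dictionary, and the names `restoreSketch` and `findCorruptions` as well as the dictionary shape are hypothetical.

```javascript
// Simplified stand-in for the restoration step (assumed behaviour).
function restoreSketch(text, protectedTerms) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) out = out.replaceAll(wrong, brand);
  }
  return out;
}

// Returns every ordinary-prose sentence that restoration would change.
function findCorruptions(protectedTerms, sentences) {
  return sentences
    .map((input) => ({ input, output: restoreSketch(input, protectedTerms) }))
    .filter(({ input, output }) => input !== output);
}

// Inline stand-in for the shipped ko dictionary (assumed shape).
const koProtected = {
  Claude: ["클로드"],
  subagent: ["서브에이전트"],
  Plugin: ["플러그인"],
};
// Correct prose that must pass through untouched.
const ordinaryProse = ["클라우드 컴퓨팅", "인류의 미래", "대기업 환경"];

const clean = findCorruptions(koProtected, ordinaryProse);
// Simulate someone re-introducing a dangerous wrong-form.
const regressed = findCorruptions(
  { ...koProtected, Anthropic: ["인류"] },
  ordinaryProse
);

if (clean.length !== 0) throw new Error("ordinary prose was corrupted");
console.log(regressed); // the guard would fail on this entry
```

Asserting on prose that should stay unchanged, rather than on the mechanism alone, is what makes the guard sensitive to dictionary data as well as to code.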
🤖 Generated with Claude Code