fix(i18n): stop protected-term restoration from corrupting correct CJK prose by heznpc · Pull Request #172 · heznpc/skillBridge

heznpc · 2026-06-04T10:48:41Z

Confirmed bug (reproduced from shipped data)

A service-quality audit found — and I reproduced directly from src/data/*.json — that restoreProtectedTerms (src/lib/protected-terms.js:73, an unanchored replaceAll with no CJK word boundary) corrupts correct translations whenever a _protected "wrong form" is itself a common standalone word:

| Correct prose | Renders as | Because _protected maps |
| --- | --- | --- |
| 클라우드 컴퓨팅 (Cloud computing) | Claude 컴퓨팅 | 클라우드 → Claude |
| 인류의 미래 (future of humanity) | Anthropic의 미래 | 인류 → Anthropic |
| 대기업 환경 (large enterprise) | 대Enterprise 환경 | 기업 → Enterprise |
| 우리는 협업합니다 (we collaborate) | 우리는 Cowork합니다 | 협업 → Cowork |

This affected all four CJK locales (ko/ja/zh-CN/zh-TW) in the live build. It's worse than untranslated text — it ships confident, brand-correct-looking but semantically wrong output, in a target market. The in-code comment already documented the substring-corruption class; the shipped dictionaries were triggering it widely.
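
The corruption described above can be reproduced with a minimal sketch. The dictionary shape (English brand term mapped to an array of wrong forms) and the function name `restoreUnanchored` are assumptions made for illustration; the real `restoreProtectedTerms` and the layout of `_protected` in src/data/*.json may differ.

```javascript
// Hypothetical dictionary shape: English brand term -> list of "wrong forms".
// The real _protected layout in src/data/*.json may differ.
const koProtected = {
  Claude: ["클로드", "클라우드"],
  Anthropic: ["앤스로픽", "인류"],
  Enterprise: ["기업"],
  Cowork: ["협업"],
};

// Simplified stand-in for restoreProtectedTerms: a plain replaceAll per
// wrong form, with no word-boundary check of any kind.
function restoreUnanchored(text, protectedTerms) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      out = out.replaceAll(wrong, brand);
    }
  }
  return out;
}

// Intended restoration: a transliteration of the brand name.
console.log(restoreUnanchored("클로드 코드", koProtected)); // "Claude 코드"

// Unintended corruption: ordinary words that happen to be listed.
console.log(restoreUnanchored("클라우드 컴퓨팅", koProtected)); // "Claude 컴퓨팅"
console.log(restoreUnanchored("대기업 환경", koProtected)); // "대Enterprise 환경"
```

Because CJK text has no whitespace-delimited word boundaries to anchor on, the sketch suggests why removing the hazardous entries from the data is the simpler remedy than trying to anchor the match.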

Fix

Removed the dangerous wrong-forms (common words whose meaning ≠ the brand) from the _protected section of all 4 CJK dictionaries, keeping brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント).

Intended restorations still work (verified): 클로드→Claude, 서브에이전트→subagent, 플러그인→Plugin.
Any brand-English fidelity lost as a result (e.g. Enterprise occasionally shown as 기업) is left to the Gemini-verify pass, which is far preferable to corrupting real prose.
Regenerated the companion plugin's bundled _protected data to match.

Test (directly answers "are the tests real?")

The existing tests used a fixture containing the dangerous forms and asserted the mechanism — they never tested the corruption the code comment itself documents. Added a regression guard that runs the real shipped src/data/*.json and asserts ordinary CJK prose passes through untouched; it fails if a dangerous wrong-form is re-introduced.
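
The idea behind the regression guard can be sketched as follows. This is not the test added in the PR: the real guard loads the shipped src/data/*.json, whereas this sketch inlines a hypothetical curated dictionary, and the helper names `restore` and `findCorrupted` are invented for illustration.

```javascript
// Assumed post-fix entries: only transliterations and same-concept forms.
const curatedKo = {
  Claude: ["클로드"],
  subagent: ["서브에이전트"],
  Plugin: ["플러그인"],
};

// Simplified unanchored restoration (stand-in for restoreProtectedTerms).
function restore(text, protectedTerms) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      out = out.replaceAll(wrong, brand);
    }
  }
  return out;
}

// Ordinary prose that must pass through untouched (from the table above).
const ordinaryProse = [
  "클라우드 컴퓨팅",
  "인류의 미래",
  "대기업 환경",
  "우리는 협업합니다",
];

// Returns every sentence the dictionary would rewrite; empty means safe.
function findCorrupted(sentences, protectedTerms) {
  return sentences.filter((s) => restore(s, protectedTerms) !== s);
}

const corrupted = findCorrupted(ordinaryProse, curatedKo);
if (corrupted.length > 0) {
  throw new Error(`ordinary prose was rewritten: ${corrupted.join(", ")}`);
}

// Re-introducing a dangerous wrong-form trips the guard.
const regressed = { ...curatedKo, Enterprise: ["기업"] };
console.log(findCorrupted(ordinaryProse, regressed)); // only "대기업 환경" is reported
```

Asserting on real prose rather than on the mechanism is what makes the guard fail when a hazardous entry returns, regardless of how the restoration itself is implemented.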

Verification

492 tests (+4 regression) · eslint · check:plugin (regen in sync) · check:locales · glossary · validate — all green.

Diff note: the src/data line count is large because ja/zh _protected arrays were normalized multi-line → single-line; the only semantic change is the removed forms.

🤖 Generated with Claude Code

…K prose

A service-quality audit confirmed (reproduced from shipped data) that restoreProtectedTerms, an unanchored replaceAll with no CJK word boundary, corrupts CORRECT translations whenever a _protected "wrong form" is itself a common standalone word:

클라우드(Cloud) -> Claude: "클라우드 컴퓨팅" -> "Claude 컴퓨팅"
인류(humanity) -> Anthropic: "인류의 미래" -> "Anthropic의 미래"
기업(enterprise) -> Enterprise: "대기업 환경" -> "대Enterprise 환경"
협업(collaborate) -> Cowork, etc.

This hit all four CJK locales (ko/ja/zh-CN/zh-TW) live. That is worse than untranslated text, since it ships confident, brand-correct-looking but semantically wrong output. The code comment already documented the substring class; the data was triggering it broadly.

Fix: removed the dangerous wrong-forms (common words whose meaning differs from the brand) from the _protected sections of all 4 CJK dictionaries, keeping the brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント). Intended restorations still work (클로드->Claude, 서브에이전트->subagent verified); brand-English fidelity that's lost (e.g. Enterprise occasionally shown as 기업) is the Gemini-verify pass's job and is far preferable to corrupting real prose. Regenerated the companion plugin's bundled data accordingly.

Tests: added a regression guard in protected-terms.test.js that runs the REAL src/data/*.json (not a fixture) and asserts ordinary CJK prose passes through untouched; it fails if a dangerous wrong-form is ever re-introduced. 492 tests, eslint, check:plugin, check:locales green.

Note: the src/data diff is large because ja/zh _protected arrays were normalized from multi-line to single-line; the only semantic change is the removed forms.

The CJK _protected fix removed dangerous common-word wrong-forms (인류학적 "anthropological" → Anthropic, etc.). The protected-terms E2E stub deliberately mistranslated "Anthropic" → 인류학적 to prove the restoration mechanism runs; that form is now (correctly) no longer restored, so the test broke. Switched the stub + assertion to 앤스로픽 — a pure transliteration of "Anthropic" that is NOT a real word and so is still (and safely) restored. The mechanism assertion (wrong-form → English brand name in the DOM) is unchanged; only the sample wrong-form moved to one that isn't a corruption hazard. Verified: full local E2E suite (17 specs) green, including this one.

…tency) (#181)

A verified readiness audit found the code is done (505 tests, 0 open issues) but front-door docs had drifted. Fixes (all factual/compliance, not the deferred strategy docs):

Factual errors (were misleading users/owner):
- README Installation said the CWS listing "was removed ... not currently available" (full delisting). It is actually live as v1.0.1 in all locales except the US (removed 2026-05-12 over the old icon). Corrected to match POSITIONING (the source of truth).
- RELEASE_CHECKLIST pointed at store-assets/promotion/ drafts that were purged and no longer exist. Removed the dead pointer (drafts are kept off-repo).

Stale (now closed):
- CHANGELOG [Unreleased] was missing #167/#170/#172/#174/#175/#176/#179/#180; added them.
- it.json _meta.translation_provenance + lastUpdated (and the matching constants.js comment, README locale-table cell) still said "v1, Spanish-derived regex", but it was re-translated from English in #166/#167 (overlap now 0.1%). Updated; regenerated plugin data accordingly.
- TESTING.md listed "E2E flows" under "What is NOT tested", but the Playwright E2E suite exists and runs in CI. Reframed to describe what E2E covers.
- PRIVACY_POLICY "Last updated" dateline was April 11 despite June changes.

Gates green: 505 tests, lint, prettier, validate, check:plugin/dicts/locales/i18n/dict-coverage, full E2E (17).

Deferred strategy docs (POSITIONING, quarter-focus) untouched; they are owned by the separate doc-cleanup session.

…ted (#197)

The #172 sweep removed dangerous common-word wrong-forms from the CJK dictionaries only. The same bug class was still live in es/fr/it/de/pt-BR/ru/vi: everyday words were registered as brand "wrong-forms", so the unanchored replaceAll in restoreProtectedTerms rewrote correct prose into English on virtually every lesson, e.g.

de "die Zusammenarbeit im Unternehmen" -> "die Cowork im Enterprise"
it "le competenze che svilupperai" -> "le skills che svilupperai"
fr "l'extension de navigateur" -> "l'Plugin de navigateur"
ru "эти навыки общения" -> "эти Skills общения"
vi "phần đầu của bài học" -> "frontmatter của bài học"

Education content uses these words (skills/compétences/competenze/навыки/kỹ năng …) constantly, so 7 of the 11 premium languages were degraded on every page.

Curation rule (same as #172): keep only proper-noun mistranslations, transliterations, and coined anchored phrases (Claudio, Anthropique, Клод Код, Mã Claude, Claude-Code, Schrägstrich-Befehl, sous-agent, …); drop the everyday words/phrases (Enterprise/Unternehmen/Empresa/Impresa, skills-words, Plugin/extension/Erweiterung/complemento, hook/crochet/Haken/gancho/móc, Cowork/Zusammenarbeit/travail collaboratif, Dispatch/envío/envio/Отправка, Computer-Use phrases, frontmatter/préambule/preámbulo/phần đầu, Personal-words). Russian keeps its loanword forms (Плагин, хук, Коворк, Диспетчеризация) matching the Korean precedent (플러그인/후크/코워크). it.json also loses the Spanish leftovers its _protected inherited from the es-derived v1 (Código Claude, habilidades, gancho, …) and the actively wrong "Plugin"->"Plugins" mapping.

Proof: extended the real-dictionary regression test with ordinary-prose sentences for all 7 locales. All 7 failed against the old dictionaries and pass after the sweep (42/42). Plugin terms data regenerated in sync.

Gates: 527 unit tests, full E2E 19/19, validate, glossary, check:plugin/i18n/dict-coverage/locales/dicts, lint, prettier: green.

Store zips NOT rebuilt (owner builds on instruction).
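
One way to apply the curation rule above is to list which registered wrong-forms occur inside ordinary prose, so that they can be reviewed for removal. The following is a rough sketch only: the German entries are hypothetical pre-sweep data (including the assumed mapping of Claude-Code to "Claude Code"), and `findHazardousForms` is an invented helper, not part of the repository.

```javascript
// Hypothetical pre-sweep German entries; the real de.json may differ.
const deProtected = {
  Cowork: ["Zusammenarbeit"],
  Enterprise: ["Unternehmen"],
  "Claude Code": ["Claude-Code"],
};

// Ordinary prose sample taken from the commit message above.
const ordinaryGerman = ["die Zusammenarbeit im Unternehmen"];

// List every (brand, wrongForm) pair that occurs inside ordinary prose.
// Such pairs are candidates for removal under the curation rule.
function findHazardousForms(protectedTerms, sentences) {
  const hazards = [];
  for (const [brand, wrongForms] of Object.entries(protectedTerms)) {
    for (const wrong of wrongForms) {
      const hits = sentences.filter((s) => s.includes(wrong));
      if (hits.length > 0) hazards.push({ brand, wrong, hits });
    }
  }
  return hazards;
}

const hazards = findHazardousForms(deProtected, ordinaryGerman);
console.log(hazards.map((h) => `${h.wrong} -> ${h.brand}`));
// reports "Zusammenarbeit -> Cowork" and "Unternehmen -> Enterprise";
// the coined form "Claude-Code" does not appear in the prose and is kept.
```

A check like this only finds hazards present in the sample corpus, so its usefulness depends on how representative the ordinary-prose sentences are for each locale.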

heznpc enabled auto-merge (squash) June 4, 2026 10:48

heznpc mentioned this pull request Jun 4, 2026

fix(tutor): propagate chatStream bridge-not-ready as a rejection (stranded spinner) #173

Closed

heznpc merged commit 4a1ce26 into main Jun 4, 2026
9 checks passed

heznpc deleted the fix/protected-terms-cjk-corruption branch June 4, 2026 11:27

This was referenced Jun 4, 2026

fix(tutor): propagate chatStream bridge-not-ready as a rejection (stranded spinner) #174

Merged

docs: fix front-door factual errors + stale claims (pre-deploy) #181

Merged

heznpc mentioned this pull request Jun 10, 2026

fix(translate): sweep common-word wrong-forms from 7 locales' _protected #197

Merged

heznpc merged 2 commits into main from fix/protected-terms-cjk-corruption
