fix(i18n): stop protected-term restoration from corrupting correct CJK prose#172

Merged: heznpc merged 2 commits into main from fix/protected-terms-cjk-corruption on Jun 4, 2026

heznpc (Owner) commented Jun 4, 2026

Confirmed bug (reproduced from shipped data)

A service-quality audit found — and I reproduced directly from src/data/*.json — that restoreProtectedTerms (src/lib/protected-terms.js:73, an unanchored replaceAll with no CJK word boundary) corrupts correct translations whenever a _protected "wrong form" is itself a common standalone word:

| Correct prose | Renders as | Because _protected maps |
| --- | --- | --- |
| 클라우드 컴퓨팅 (Cloud computing) | Claude 컴퓨팅 | 클라우드 → Claude |
| 인류의 미래 (future of humanity) | Anthropic의 미래 | 인류 → Anthropic |
| 대기업 환경 (large enterprise) | 대Enterprise 환경 | 기업 → Enterprise |
| 우리는 협업합니다 (we collaborate) | 우리는 Cowork합니다 | 협업 → Cowork |

This affected all four CJK locales (ko/ja/zh-CN/zh-TW) in the live build. It's worse than untranslated text — it ships confident, brand-correct-looking but semantically wrong output, in a target market. The in-code comment already documented the substring-corruption class; the shipped dictionaries were triggering it widely.
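
To make the failure mode concrete, here is a minimal sketch of an unanchored restoration pass. It is not the code in src/lib/protected-terms.js: the function name `restoreSketch` and the dictionary shape (English brand term mapped to a list of wrong forms) are assumptions made for illustration only.

```javascript
// Hypothetical sketch of an unanchored restoration pass.
// The real restoreProtectedTerms may be structured differently.
function restoreSketch(text, protectedMap) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedMap)) {
    for (const wrongForm of wrongForms) {
      // replaceAll matches the substring anywhere, including inside
      // longer words, because nothing anchors it to a word boundary.
      out = out.replaceAll(wrongForm, brand);
    }
  }
  return out;
}

// Assumed shape: brand term -> wrong forms. Entries taken from the table above.
const riskyKo = {
  Claude: ["클라우드"],
  Anthropic: ["인류"],
  Enterprise: ["기업"],
  Cowork: ["협업"],
};

const samples = ["클라우드 컴퓨팅", "인류의 미래", "대기업 환경", "우리는 협업합니다"];
const corrupted = samples.map((sentence) => restoreSketch(sentence, riskyKo));

for (let i = 0; i < samples.length; i++) {
  console.log(`${samples[i]} -> ${corrupted[i]}`);
}
```

Every sample is correct Korean with no brand reference, yet each one is rewritten, because the wrong form is also an ordinary word or a piece of one.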

Fix

Removed the dangerous wrong-forms (common words whose meaning ≠ the brand) from the _protected section of all 4 CJK dictionaries, keeping brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント).

  • Intended restorations still work (verified): 클로드→Claude, 서브에이전트→subagent, 플러그인→Plugin.
  • Brand-English fidelity is occasionally lost as a result (e.g. Enterprise shown as 기업). Recovering it is the Gemini-verify pass's job, and the loss is far preferable to corrupting real prose.
  • Regenerated the companion plugin's bundled _protected data to match.
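
The effect of the curation can be sketched as follows. The PR edited the JSON dictionaries directly; the helper `dropWrongForms`, the function `restoreSketch`, and the dictionary shape below are hypothetical and exist only to show the before/after behaviour.

```javascript
// Hypothetical helper: remove listed wrong forms from a _protected-style map.
function dropWrongForms(protectedMap, dangerous) {
  const curated = {};
  for (const [brand, wrongForms] of Object.entries(protectedMap)) {
    const kept = wrongForms.filter((form) => !dangerous.has(form));
    // Brands left with no wrong forms are dropped entirely.
    if (kept.length > 0) curated[brand] = kept;
  }
  return curated;
}

// Same assumed unanchored pass as the restoration mechanism described above.
function restoreSketch(text, protectedMap) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedMap)) {
    for (const wrongForm of wrongForms) out = out.replaceAll(wrongForm, brand);
  }
  return out;
}

// Assumed pre-fix entries; the real dictionaries contain more terms.
const before = {
  Claude: ["클로드", "클라우드"],
  Anthropic: ["인류"],
  Plugin: ["플러그인"],
  subagent: ["서브에이전트"],
  Enterprise: ["기업"],
};
const dangerous = new Set(["클라우드", "인류", "기업", "협업"]);
const curatedKo = dropWrongForms(before, dangerous);

// Transliterations are still restored...
const restored = restoreSketch("클로드 플러그인", curatedKo);
// ...while ordinary prose is left alone.
const untouched = restoreSketch("클라우드 컴퓨팅", curatedKo);

console.log(curatedKo, restored, untouched);
```

The trade-off is visible in the curated map: 기업 is no longer restored to Enterprise, which is the fidelity loss the second bullet accepts.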

Test (directly answers "are the tests real?")

The existing tests used a fixture containing the dangerous forms and asserted the mechanism — they never tested the corruption the code comment itself documents. Added a regression guard that runs the real shipped src/data/*.json and asserts ordinary CJK prose passes through untouched; it fails if a dangerous wrong-form is re-introduced.
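
A regression guard of this kind might look roughly like the sketch below. The real test in protected-terms.test.js loads the shipped src/data/*.json; here an inline stand-in dictionary is used so the example runs on its own, and `restoreSketch`, `findCorrupted`, and the dictionary shape are assumptions.

```javascript
// Assumed unanchored restoration pass (stand-in for restoreProtectedTerms).
function restoreSketch(text, protectedMap) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedMap)) {
    for (const wrongForm of wrongForms) out = out.replaceAll(wrongForm, brand);
  }
  return out;
}

// Returns the sentences that restoration altered; an empty list means
// every sentence passed through untouched.
function findCorrupted(protectedMap, sentences) {
  return sentences.filter((s) => restoreSketch(s, protectedMap) !== s);
}

// Ordinary Korean prose that must never be rewritten.
const ordinaryProse = ["클라우드 컴퓨팅", "인류의 미래", "대기업 환경", "우리는 협업합니다"];

// Stand-in for the curated shipped dictionary (the real guard reads the JSON).
const shippedStandIn = {
  Claude: ["클로드"],
  Plugin: ["플러그인"],
  subagent: ["서브에이전트"],
};

const cleanResult = findCorrupted(shippedStandIn, ordinaryProse);
if (cleanResult.length > 0) {
  throw new Error(`ordinary prose was rewritten: ${cleanResult.join(", ")}`);
}

// Re-introducing a dangerous wrong form makes the guard trip.
const reintroduced = { ...shippedStandIn, Enterprise: ["기업"] };
const regressed = findCorrupted(reintroduced, ordinaryProse);
console.log("would fail on:", regressed);
```

Running the check against the data that actually ships, rather than a fixture, is what lets it catch a bad dictionary entry and not only a bad mechanism.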

Verification

492 tests (+4 regression) · eslint · check:plugin (regen in sync) · check:locales · glossary · validate — all green.

Diff note: the src/data line count is large because ja/zh _protected arrays were normalized multi-line → single-line; the only semantic change is the removed forms.

🤖 Generated with Claude Code

…K prose

A service-quality audit confirmed (reproduced from shipped data) that
restoreProtectedTerms — an unanchored replaceAll with no CJK word boundary —
corrupts CORRECT translations whenever a _protected "wrong form" is itself a
common standalone word:
  클라우드(Cloud)  -> Claude          "클라우드 컴퓨팅" -> "Claude 컴퓨팅"
  인류(humanity)   -> Anthropic       "인류의 미래"     -> "Anthropic의 미래"
  기업(enterprise) -> Enterprise      "대기업 환경"     -> "대Enterprise 환경"
  협업(collaborate)-> Cowork, etc.
This hit all four CJK locales (ko/ja/zh-CN/zh-TW) live — worse than untranslated
text, since it ships confident, brand-correct-looking but semantically wrong
output. The code comment already documented the substring class; the data was
triggering it broadly.

Fix: removed the dangerous wrong-forms — common words whose meaning differs from
the brand — from the _protected sections of all 4 CJK dictionaries, keeping the
brand transliterations and same-concept forms (클로드, 플러그인, 技能, サブエージェント).
Intended restorations still work (클로드->Claude, 서브에이전트->subagent verified);
brand-English fidelity that's lost (e.g. Enterprise occasionally shown as 기업) is
the Gemini-verify pass's job and is far preferable to corrupting real prose.
Regenerated the companion plugin's bundled data accordingly.

Tests: added a regression guard in protected-terms.test.js that runs the REAL
src/data/*.json (not a fixture) and asserts ordinary CJK prose passes through
untouched — it fails if a dangerous wrong-form is ever re-introduced. 492 tests,
eslint, check:plugin, check:locales green.

Note: the src/data diff is large because ja/zh _protected arrays were normalized
from multi-line to single-line; the only semantic change is the removed forms.
@heznpc heznpc enabled auto-merge (squash) June 4, 2026 10:48
The CJK _protected fix removed dangerous common-word wrong-forms (인류학적
"anthropological" → Anthropic, etc.). The protected-terms E2E stub deliberately
mistranslated "Anthropic" → 인류학적 to prove the restoration mechanism runs;
that form is now (correctly) no longer restored, so the test broke.

Switched the stub + assertion to 앤스로픽 — a pure transliteration of "Anthropic"
that is NOT a real word and so is still (and safely) restored. The mechanism
assertion (wrong-form → English brand name in the DOM) is unchanged; only the
sample wrong-form moved to one that isn't a corruption hazard. Verified: full
local E2E suite (17 specs) green, including this one.
@heznpc heznpc merged commit 4a1ce26 into main Jun 4, 2026
9 checks passed
@heznpc heznpc deleted the fix/protected-terms-cjk-corruption branch June 4, 2026 11:27
heznpc added a commit that referenced this pull request Jun 9, 2026
…tency) (#181)

A verified readiness audit found the code is done (505 tests, 0 open issues)
but front-door docs had drifted. Fixes (all factual/compliance, not the
deferred strategy docs):

Factual errors (were misleading users/owner):
- README Installation said the CWS listing "was removed ... not currently
  available" (full delisting). It is actually live as v1.0.1 in all locales
  except the US (removed 2026-05-12 over the old icon). Corrected to match
  POSITIONING (the source of truth).
- RELEASE_CHECKLIST pointed at store-assets/promotion/ drafts that were purged
  and no longer exist. Removed the dead pointer (drafts are kept off-repo).

Stale (now closed):
- CHANGELOG [Unreleased] was missing #167/#170/#172/#174/#175/#176/#179/#180;
  added them.
- it.json _meta.translation_provenance + lastUpdated (and the matching
  constants.js comment, README locale-table cell) still said "v1, Spanish-
  derived regex" — it was re-translated from English in #166/#167 (overlap
  now 0.1%). Updated; regenerated plugin data accordingly.
- TESTING.md listed "E2E flows" under "What is NOT tested" — the Playwright
  E2E suite exists and runs in CI. Reframed to describe what E2E covers.
- PRIVACY_POLICY "Last updated" dateline was April 11 despite June changes.

Gates green: 505 tests, lint, prettier, validate, check:plugin/dicts/locales/
i18n/dict-coverage, full E2E (17). Deferred strategy docs (POSITIONING,
quarter-focus) untouched — owned by the separate doc-cleanup session.
heznpc added a commit that referenced this pull request Jun 10, 2026
…ted (#197)

The #172 sweep removed dangerous common-word wrong-forms from the CJK
dictionaries only. The same bug class was still live in es/fr/it/de/pt-BR/
ru/vi: everyday words were registered as brand "wrong-forms", so the
unanchored replaceAll in restoreProtectedTerms rewrote correct prose into
English on virtually every lesson — e.g.

  de  "die Zusammenarbeit im Unternehmen" -> "die Cowork im Enterprise"
  it  "le competenze che svilupperai"     -> "le skills che svilupperai"
  fr  "l'extension de navigateur"         -> "l'Plugin de navigateur"
  ru  "эти навыки общения"                -> "эти Skills общения"
  vi  "phần đầu của bài học"              -> "frontmatter của bài học"

Education content uses these words (skills/compétences/competenze/навыки/
kỹ năng …) constantly, so 7 of the 11 premium languages were degraded on
every page.

Curation rule (same as #172): keep only proper-noun mistranslations,
transliterations, and coined anchored phrases (Claudio, Anthropique,
Клод Код, Mã Claude, Claude-Code, Schrägstrich-Befehl, sous-agent, …);
drop the everyday words/phrases (Enterprise/Unternehmen/Empresa/Impresa,
skills-words, Plugin/extension/Erweiterung/complemento, hook/crochet/
Haken/gancho/móc, Cowork/Zusammenarbeit/travail collaboratif, Dispatch/
envío/envio/Отправка, Computer-Use phrases, frontmatter/préambule/
preámbulo/phần đầu, Personal-words). Russian keeps its loanword forms
(Плагин, хук, Коворк, Диспетчеризация) matching the Korean precedent
(플러그인/후크/코워크). it.json also loses the Spanish leftovers its
_protected inherited from the es-derived v1 (Código Claude, habilidades,
gancho, …) and the actively wrong "Plugin"->"Plugins" mapping.

Proof: extended the real-dictionary regression test with ordinary-prose
sentences for all 7 locales — all 7 failed against the old dictionaries
and pass after the sweep (42/42). Plugin terms data regenerated in sync.

Gates: 527 unit tests, full E2E 19/19, validate, glossary, check:plugin/
i18n/dict-coverage/locales/dicts, lint, prettier — green. Store zips NOT
rebuilt (owner builds on instruction).
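
The per-locale examples in the #197 message can be replayed with a small table-driven sketch. The sentences and rewrites come from that message; `restoreSketch`, the dictionary shape, and the single curated entry shown for German (Claude-Code mapped to "Claude Code") are assumptions, not the shipped data.

```javascript
// Assumed unanchored restoration pass (stand-in for restoreProtectedTerms).
function restoreSketch(text, protectedMap) {
  let out = text;
  for (const [brand, wrongForms] of Object.entries(protectedMap)) {
    for (const wrongForm of wrongForms) out = out.replaceAll(wrongForm, brand);
  }
  return out;
}

// One ordinary-prose sentence per locale, with the pre-sweep wrong forms
// implied by the examples in the commit message.
const cases = [
  { locale: "de", sentence: "die Zusammenarbeit im Unternehmen",
    old: { Cowork: ["Zusammenarbeit"], Enterprise: ["Unternehmen"] } },
  { locale: "it", sentence: "le competenze che svilupperai",
    old: { skills: ["competenze"] } },
  { locale: "fr", sentence: "l'extension de navigateur",
    old: { Plugin: ["extension"] } },
  { locale: "ru", sentence: "эти навыки общения",
    old: { Skills: ["навыки"] } },
  { locale: "vi", sentence: "phần đầu của bài học",
    old: { frontmatter: ["phần đầu"] } },
];

const report = cases.map(({ locale, sentence, old }) => ({
  locale,
  before: sentence,
  after: restoreSketch(sentence, old),
}));
for (const row of report) {
  console.log(`${row.locale}: "${row.before}" -> "${row.after}"`);
}

// After the sweep only coined or transliterated forms remain, so the same
// German sentence passes through unchanged (mapping assumed for illustration).
const curatedDe = { "Claude Code": ["Claude-Code"] };
const deAfterSweep = restoreSketch(cases[0].sentence, curatedDe);
console.log(`de (curated): "${deAfterSweep}"`);
```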