localization: pin UTF-8 explicitly on every read_text/write_text call by CryptoJones · Pull Request #8 · CryptoJones/120xSocrates

CryptoJones · 2026-05-20T23:26:59Z

Path.read_text/Path.write_text default to locale.getpreferredencoding()
when no encoding= is passed. On most modern Linux/macOS that's UTF-8 —
but on Windows it's still cp1252 in many setups (Python 3.15 will
finally default to UTF-8 everywhere via PEP 686; we're not there yet).
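
To see which default a given machine would apply, a small sketch like the one below can be run on the operator's interpreter. The helper name `describe_text_default` is hypothetical and is not part of socrates; it only reports what the standard library exposes.

```python
import locale
import sys


def describe_text_default() -> dict:
    """Report the encoding that text I/O falls back to when encoding= is omitted."""
    return {
        # Locale-derived default; often "cp1252" on Windows, "UTF-8" on Linux/macOS.
        "locale_encoding": locale.getpreferredencoding(False),
        # True when Python UTF-8 mode is on (for example via -X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


info = describe_text_default()
```

If `utf8_mode` is false and `locale_encoding` is not a UTF-8 name, unpinned `read_text`/`write_text` calls on that machine will not use UTF-8.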

Practical impact: an operator running socrates on Windows whose
planning files contain typical non-ASCII content (em-dashes from
copy-paste, smart quotes, a client name like "Café Berlin", a Spanish
or Korean decision text) would get a UnicodeDecodeError or silently
mis-decoded text (mojibake) on read, and a UnicodeEncodeError or
non-UTF-8 bytes on write. Either way the content does not survive a
round-trip through one of socrates' commands.
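
The failure modes can be reproduced on any platform by naming cp1252 explicitly instead of relying on the locale. This is a minimal illustration, not code from the PR; the helper `read_as_cp1252` is hypothetical and only simulates what an unpinned read would do on a cp1252 machine.

```python
def read_as_cp1252(text: str) -> str:
    """Simulate reading a UTF-8 file with a cp1252 default encoding."""
    return text.encode("utf-8").decode("cp1252")


# Silent mojibake: every UTF-8 byte of "é" maps to some cp1252 character.
mojibake = read_as_cp1252("Café Berlin")  # 'CafÃ© Berlin'

# Hard failure: the right double quote (U+201D) ends in byte 0x9D,
# which cp1252 leaves undefined.
try:
    read_as_cp1252("\u201cquoted\u201d")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Hard failure on write: Korean text has no cp1252 representation.
try:
    "결정".encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```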

Pinned encoding="utf-8" on all 128 read_text/write_text call sites
across src/ (49 calls) and tests/ (79 calls). All 147 tests still
pass; ruff + mypy clean; ruff format applied to normalize the
multi-line cases.
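
For anyone who wants to check that no call site was missed, a rough audit can be written with the standard `ast` module. This is a sketch under stated assumptions, not the rewriter used for this PR: the function name `find_unpinned_calls` is hypothetical, it matches any attribute call named `read_text` or `write_text` regardless of the receiver's type, and it treats only a keyword `encoding=` as pinned (a positionally passed encoding would be reported as a false positive).

```python
import ast

TEXT_IO_METHODS = {"read_text", "write_text"}


def find_unpinned_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, method) for read_text/write_text calls without encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in TEXT_IO_METHODS:
            # Only the keyword form counts as pinned in this sketch.
            if not any(kw.arg == "encoding" for kw in node.keywords):
                hits.append((node.lineno, func.attr))
    return sorted(hits)


sample = (
    "from pathlib import Path\n"
    "p = Path('plan.md')\n"
    "p.read_text()\n"
    "p.write_text('x', encoding='utf-8')\n"
)
unpinned = find_unpinned_calls(sample)
```

Running it over each file under src/ and tests/ should yield an empty list once every call is pinned.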

No API change. No behavior change on Linux/macOS where UTF-8 was
already the locale default. Windows operators (and any non-UTF-8
locale) now get deterministic behavior.
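
The pinned pattern looks like the sketch below. The file name `plan.md` and the helper `round_trip` are placeholders for illustration, not names from the socrates codebase.

```python
import tempfile
from pathlib import Path


def round_trip(text: str) -> str:
    """Write and re-read text with UTF-8 pinned on both sides."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "plan.md"
        # Explicit encoding: the result no longer depends on the locale.
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8")


original = "Café Berlin \u2014 \u201c결정\u201d"
restored = round_trip(original)
```

With both calls pinned, `restored` equals `original` on Windows as well as on Linux/macOS.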

Self-review caveat: this PR's diff (414 ins / 222 del) is bloated. After my sed-style rewriter produced malformed multi-line calls, I ran ruff format to normalize. That reformatted some adjacent unrelated lines in test files too. The encoding-pin changes are the load-bearing diff; the rest is whitespace/import-order normalization. Also did NOT set json.dumps(ensure_ascii=False) — non-ASCII in answers.json still becomes \uXXXX escapes (round-trips correctly, just harder to read).
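
The `ensure_ascii` trade-off mentioned in the caveat can be seen in a few lines. The `answers` dictionary here is made-up sample data, not the actual schema of answers.json.

```python
import json

answers = {"client": "Café Berlin"}

# Default behavior (what this PR keeps): non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(answers)

# Readable alternative (not adopted here): keeps the characters as-is,
# which is only safe when the file is written with encoding="utf-8".
readable = json.dumps(answers, ensure_ascii=False)

# Both forms decode back to the same data.
same_data = json.loads(escaped) == json.loads(readable) == answers
```

Here `escaped` contains `Caf\u00e9 Berlin` as literal backslash escapes, while `readable` contains `Café Berlin` directly.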

CryptoJones wants to merge 1 commit into main from localization/explicit-utf8-encoding
