localization: pin UTF-8 explicitly on every read_text/write_text call #8

Open: CryptoJones wants to merge 1 commit into main from localization/explicit-utf8-encoding

@CryptoJones (Owner)

Path.read_text/Path.write_text default to locale.getpreferredencoding()
when no encoding= is passed. On most modern Linux/macOS that's UTF-8 —
but on Windows it's still cp1252 in many setups (Python 3.15 will
finally default to UTF-8 everywhere via PEP 686; we're not there yet).
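To see which encoding an un-pinned call would pick up on a given machine, something like the following can be run there. It is a rough sketch, not part of this PR: `default_text_encoding` is a hypothetical helper name, and it only approximates what the I/O layer does when `encoding=None`.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Approximate the encoding used when encoding= is omitted.

    Hypothetical helper for illustration only; not part of socrates.
    """
    # With UTF-8 mode on (python -X utf8 or PYTHONUTF8=1), text I/O uses UTF-8.
    if sys.flags.utf8_mode:
        return "utf-8"
    # Otherwise the locale's preferred encoding applies (often cp1252 on Windows).
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```

On Python 3.10+, running with `-X warn_default_encoding` (PEP 597) should also emit an `EncodingWarning` wherever a call still relies on the default.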

Practical impact: an operator running socrates on Windows whose
planning files contain typical non-ASCII content (em-dashes from
copy-paste, smart quotes, a client name like "Café Berlin", a Spanish
or Korean decision text) would get UnicodeDecodeError or silent
mojibake on read, and UnicodeEncodeError or a non-UTF-8 file on
write. Either way the file gets corrupted on a round-trip through one
of socrates' commands.
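The failure modes can be reproduced on any platform by passing `encoding="cp1252"` explicitly to stand in for the Windows default. This is an illustrative sketch: the file name `plan.md` and the sample strings are made up, not taken from socrates.

```python
import tempfile
from pathlib import Path

text = "Caf\u00e9 Berlin \u2014 decision"  # "Café Berlin", em-dash, "decision"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "plan.md"  # hypothetical planning file
    path.write_text(text, encoding="utf-8")

    # Reading UTF-8 bytes as cp1252 simulates the un-pinned Windows default.
    garbled = path.read_text(encoding="cp1252")
    # Reading with the pinned encoding returns the original text.
    pinned = path.read_text(encoding="utf-8")

# Some UTF-8 byte sequences contain bytes cp1252 leaves undefined (here 0x9D,
# from a right double quotation mark), so the read fails outright.
try:
    "\u201d".encode("utf-8").decode("cp1252")
    read_failed = False
except UnicodeDecodeError:
    read_failed = True

# Characters outside cp1252 (e.g. Hangul) cannot be written at all.
try:
    "\uacb0\uc815".encode("cp1252")
    write_failed = False
except UnicodeEncodeError:
    write_failed = True

print(repr(garbled), repr(pinned), read_failed, write_failed)
```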

Pinned encoding="utf-8" on all 128 read_text/write_text call sites
across src/ (49 calls) and tests/ (79 calls). All 147 tests still
pass; ruff + mypy clean; ruff format applied to normalize the
multi-line cases.
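One possible way to double-check that no call site was missed is an AST scan for `read_text`/`write_text` calls without an `encoding=` keyword. This is a sketch of such a check, not the rewriter used for this PR; `find_unpinned_calls` is a hypothetical helper.

```python
import ast

TARGETS = {"read_text", "write_text"}


def find_unpinned_calls(source: str) -> list[int]:
    """Return line numbers of .read_text()/.write_text() calls lacking encoding=.

    Hypothetical helper. It matches on the method name only, so it also flags
    same-named methods on non-Path objects and calls that pass the encoding
    positionally.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in TARGETS
            and not any(kw.arg == "encoding" for kw in node.keywords)
        ):
            hits.append(node.lineno)
    return sorted(hits)


sample = (
    "from pathlib import Path\n"
    "p = Path('a.txt')\n"
    "p.read_text()\n"
    "p.write_text('x', encoding='utf-8')\n"
)
print(find_unpinned_calls(sample))
```

Running it over each file under src/ and tests/ and expecting an empty list everywhere would be a cheap regression guard.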

No API change. No behavior change on Linux/macOS where UTF-8 was
already the locale default. Windows operators (and any non-UTF-8
locale) now get deterministic behavior.

Self-review caveat: this PR's diff (414 ins / 222 del) is bloated. After my sed-style rewriter produced malformed multi-line calls, I ran ruff format to normalize. That reformatted some adjacent unrelated lines in test files too. The encoding-pin changes are the load-bearing diff; the rest is whitespace/import-order normalization. Also did NOT set json.dumps(ensure_ascii=False) — non-ASCII in answers.json still becomes \uXXXX escapes (round-trips correctly, just harder to read).
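For reference, the `ensure_ascii` trade-off mentioned in the caveat looks roughly like this. The `answer` dict is a made-up stand-in, not the real answers.json schema.

```python
import json

answer = {"client": "Caf\u00e9 Berlin"}  # hypothetical answers.json content

# Default (ensure_ascii=True): non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(answer)
# ensure_ascii=False keeps the characters readable; the file must then be
# written with encoding="utf-8".
readable = json.dumps(answer, ensure_ascii=False)

# Both forms decode back to the same data.
round_trips = json.loads(escaped) == json.loads(readable) == answer

print(escaped)
print(readable)
print(round_trips)
```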
