localization: pin UTF-8 explicitly on every read_text/write_text call #8
Open
CryptoJones wants to merge 1 commit into
Path.read_text/Path.write_text fall back to locale.getpreferredencoding() when no encoding= is passed. On most modern Linux/macOS systems that is UTF-8, but on Windows it is still cp1252 in many setups. (Python 3.15 will finally default to UTF-8 everywhere via PEP 686; we're not there yet.)

Practical impact: an operator running socrates on Windows whose planning files contain typical non-ASCII content (em-dashes from copy-paste, smart quotes, a client name like "Café Berlin", a Spanish or Korean decision text) would get a UnicodeDecodeError or silent mojibake on read, and a UnicodeEncodeError or a file written in cp1252 rather than UTF-8 on write. Either way the file gets corrupted on a round-trip through one of socrates' commands.

Pinned encoding="utf-8" on all 128 read_text/write_text call sites across src/ (49 calls) and tests/ (79 calls). All 147 tests still pass; ruff + mypy clean; ruff format applied to normalize the multi-line cases.

No API change. No behavior change on Linux/macOS where UTF-8 was already the locale default. Windows operators (and anyone on a non-UTF-8 locale) now get deterministic behavior.
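For reviewers who want to see the failure modes without a Windows machine, here is a minimal sketch. It simulates a cp1252 locale by encoding and decoding bytes directly, then shows the pinned round-trip. The helper names and the planning.md file name are hypothetical and are not taken from the socrates codebase.

```python
import tempfile
from pathlib import Path

SAMPLE = "Caf\u00e9 Berlin"   # "Café Berlin"
CLOSING_QUOTE = "\u201d"      # right double quotation mark (a "smart quote")
KOREAN = "\uacb0\uc815"       # Korean text, not representable in cp1252


def simulate_cp1252_read(text: str) -> str:
    """Decode UTF-8 bytes as cp1252, roughly what an unpinned read_text
    does when the locale encoding is cp1252 (an illustration only)."""
    return text.encode("utf-8").decode("cp1252")


def pinned_round_trip(text: str) -> str:
    """Write and read back with encoding pinned, as this PR does."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "planning.md"  # hypothetical file name
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    # Silent mojibake: the accented letter becomes two unrelated characters.
    print(simulate_cp1252_read(SAMPLE))

    # Hard failure: the UTF-8 bytes of the closing quote include 0x9D,
    # which Python's cp1252 codec leaves undefined.
    try:
        simulate_cp1252_read(CLOSING_QUOTE)
    except UnicodeDecodeError as exc:
        print("read failed:", exc.reason)

    # Write side: cp1252 cannot encode Korean text at all.
    try:
        KOREAN.encode("cp1252")
    except UnicodeEncodeError as exc:
        print("write failed:", exc.reason)

    # With the encoding pinned, the text survives unchanged on any locale.
    everything = SAMPLE + CLOSING_QUOTE + KOREAN
    assert pinned_round_trip(everything) == everything
```

Which failure an operator sees depends on the exact characters in the file, which is part of why pinning the encoding is simpler than reasoning about each case.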
Self-review caveat: this PR's diff (414 ins / 222 del) is bloated. After my sed-style rewriter produced malformed multi-line calls, I ran ruff format to normalize them. That also reformatted some adjacent, unrelated lines in test files. The encoding-pin changes are the load-bearing diff; the rest is whitespace/import-order normalization. I also did NOT set json.dumps(ensure_ascii=False), so non-ASCII in answers.json still becomes \uXXXX escapes (these round-trip correctly; they are just harder to read).
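To make the ensure_ascii point concrete, the sketch below compares the two json.dumps modes on a small dictionary. The key and value are invented for illustration and do not reflect the real answers.json schema.

```python
import json

# Hypothetical payload; the actual answers.json structure may differ.
answers = {"client": "Caf\u00e9 Berlin"}

# Default behaviour (what this PR leaves in place): non-ASCII is escaped.
escaped = json.dumps(answers)

# Alternative not adopted here: keep the characters readable in the file.
readable = json.dumps(answers, ensure_ascii=False)

# Both forms parse back to the same data, so the escapes are lossless.
assert json.loads(escaped) == answers
assert json.loads(readable) == answers

if __name__ == "__main__":
    print(escaped)   # {"client": "Caf\u00e9 Berlin"}
    print(readable)  # {"client": "Café Berlin"}
```

Note that the readable form only stays safe if the file is written with encoding="utf-8", which is what the pinned write_text calls now do.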