reliability(interview): atomic save + graceful recovery from corrupt resume #5
Open
CryptoJones wants to merge 1 commit into
`Interview.save()` previously did `answers_path.write_text(...)`, which opens, truncates, and writes. If the process is killed mid-write (SIGINT from Ctrl-C, OOM, power loss), the file on disk is left in a half-written state. The next `socrates init --resume` then calls `json.loads()` on the partial JSON and crashes with a stack trace:

    json.decoder.JSONDecodeError: Expecting ',' delimiter ...
    (8 lines of traceback)

The operator's only escape is to delete `.socrates-answers.json` and start over, losing every answer that *did* successfully land before the interrupt.

The fix is two-part:

1. Atomic save: write to `<file>.tmp`, then `os.replace` onto the final path. On POSIX (and on Windows, via `os.replace`, available since Python 3.3) the rename is atomic: the file is either fully old or fully new, never partial. Even if the tempfile write is interrupted, the real answers file is untouched. Cleanup unlinks the tempfile on success or failure.
2. Resume recovery: if `--resume` is passed and the file IS corrupt (despite part 1, e.g. a pre-fix file or out-of-process tampering), warn the operator on stderr and start with empty answers instead of crashing. The corrupt file gets overwritten cleanly on the first answer.

Tests added (4):

- `save()` leaves no `.tmp` stranded
- `save()` with simulated rename failure: pre-existing file untouched, no tempfile leaked
- `load()` with corrupt JSON: warns, returns empty answers, no raise
- `load()` with permission-denied `OSError`: same graceful path

151/151 tests pass; ruff + mypy clean.
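For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the two parts described above. It is not the code in this PR: the function names `atomic_write_text` and `load_answers` are hypothetical, and treating a missing file or non-dict JSON the same way as corrupt JSON is an assumption made for the illustration.

```python
import json
import os
import sys
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text so that `path` is either fully old or fully new (hypothetical helper)."""
    # Sibling tempfile: same directory, hence same filesystem, so the rename can be atomic.
    tmp = path.with_name(path.name + ".tmp")
    try:
        tmp.write_text(text, encoding="utf-8")
        os.replace(tmp, path)  # replaces the destination in one step
    finally:
        # After a successful replace the tempfile no longer exists and this is a no-op;
        # after a failed write or rename it removes the leftover.
        tmp.unlink(missing_ok=True)  # missing_ok requires Python 3.8+


def load_answers(path: Path) -> dict:
    """Return saved answers, or {} plus a stderr warning if the file is unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict):
            # Assumption: answers are stored as a JSON object.
            raise ValueError("top-level JSON value is not an object")
        return data
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError subclasses;
        # permission errors and missing files are OSError subclasses.
        print(
            f"warning: could not load {path} ({exc}); starting with empty answers",
            file=sys.stderr,
        )
        return {}
```

A caller would pair the two, for example `atomic_write_text(path, json.dumps(answers))` after each answer and `answers = load_answers(path)` when `--resume` is passed. The sketch does not call `os.fsync`, so it only guarantees that readers never see a partial file; durability across power loss would need additional flushing.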
Self-review caveat: the atomic-write helper here is duplicated in #patterns-cache-atomic-save and superseded by #refactor/shared-atomic-write-and-decide-lock (which moves it to socrates120x/_atomic.py). Cleanest merge order: refactor first, then rebase this + patterns-cache to use the shared helper. Functionally correct either way.