reliability(interview): atomic save + graceful recovery from corrupt resume #5

Open
CryptoJones wants to merge 1 commit into main from reliability/interview-atomic-save-and-resume-recovery

@CryptoJones
Owner

Interview.save() previously called answers_path.write_text(...), which
opens, truncates, and writes the file in place. If the process is killed
mid-write (SIGINT from Ctrl-C, OOM, power loss), the file on disk is left
half-written. The next socrates init --resume then calls
json.loads() on the partial JSON and crashes with a stack trace:

json.decoder.JSONDecodeError: Expecting ',' delimiter ...
(8 lines of traceback)
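
The failure mode can be reproduced without killing a process. The sketch below is illustrative only: the helper name demo_truncated_load and the sample answers are assumptions, and the truncation is simulated by writing half of the serialized payload rather than by an actual interrupt.

```python
import json
import tempfile
from pathlib import Path


def demo_truncated_load() -> str:
    """Write half of a JSON payload, then try to parse it like a resume would."""
    answers = {"project": "demo", "language": "python"}  # made-up sample data
    payload = json.dumps(answers, indent=2)
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / ".socrates-answers.json"
        # Simulate a kill mid-write: only the first half reaches disk.
        path.write_text(payload[: len(payload) // 2], encoding="utf-8")
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            # This is the exception that previously surfaced as a traceback.
            return type(exc).__name__
    return "loaded"
```

Any prefix of a JSON object that stops before the closing brace fails to parse, so a single interrupted write is enough to make the whole file unreadable.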

The operator's only escape is to delete .socrates-answers.json and start
over, losing every answer that did successfully land before the
interrupt.

Fix is two-part:

  1. Atomic save: write to <file>.tmp, then os.replace it onto the final
    path. On POSIX (and on Windows, via os.replace in Python ≥ 3.3) the
    rename is atomic: the file is either fully old or fully new, never
    partial. Even if the tempfile write is interrupted, the real answers
    file is untouched. Cleanup leaves no tempfile behind on success or
    failure.

  2. Resume recovery: if --resume is passed and the file IS corrupt
    (despite part 1, e.g. a pre-fix file or out-of-process tampering),
    warn the operator on stderr and start with empty answers instead
    of crashing. The corrupt file gets overwritten cleanly on the
    first answer.
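
The two parts above can be sketched roughly as follows. This is not the PR's actual code: atomic_write_text and load_answers are hypothetical names, the answers are assumed to be a JSON object, and Path.unlink(missing_ok=True) assumes Python 3.8 or newer.

```python
import json
import os
import sys
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    """Write text to path via a sibling tempfile and os.replace."""
    tmp = path.with_name(path.name + ".tmp")
    try:
        tmp.write_text(text, encoding="utf-8")
        # The real file is only touched here, in a single rename.
        os.replace(tmp, path)
    finally:
        # After a successful replace the tempfile no longer exists, so this
        # only removes leftovers from a failed write or a failed rename.
        tmp.unlink(missing_ok=True)


def load_answers(path: Path) -> dict:
    """Return saved answers, or an empty dict if the file is unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError are ValueError
        # subclasses; OSError covers permission-denied and missing files.
        print(f"warning: could not resume from {path}: {exc}", file=sys.stderr)
        return {}
    # Assumption: anything other than a JSON object is treated as unusable.
    return data if isinstance(data, dict) else {}
```

Keeping the tempfile next to the target means both paths sit on the same filesystem, which is what lets the rename be a single step rather than a copy.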

Tests added (4):

  • save() leaves no .tmp stranded
  • save() with simulated rename failure: pre-existing file untouched,
    no tempfile leaked
  • load() with corrupt JSON: warns, returns empty answers, no raise
  • load() with permission-denied OSError: same graceful path
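
For illustration, the rename-failure case might be exercised along these lines. The helper is a hypothetical stand-in for the PR's save path, and check_rename_failure_keeps_old_file is an invented name; the actual tests in the PR may be structured differently.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def atomic_write_text(path: Path, text: str) -> None:
    # Hypothetical stand-in for the save helper under test.
    tmp = path.with_name(path.name + ".tmp")
    try:
        tmp.write_text(text, encoding="utf-8")
        os.replace(tmp, path)
    finally:
        tmp.unlink(missing_ok=True)


def check_rename_failure_keeps_old_file() -> bool:
    """Force os.replace to fail and verify nothing on disk changed."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / ".socrates-answers.json"
        path.write_text('{"q1": "old"}', encoding="utf-8")
        # Simulate the rename step failing after the tempfile was written.
        with mock.patch("os.replace", side_effect=OSError("simulated")):
            try:
                atomic_write_text(path, '{"q1": "new"}')
            except OSError:
                pass  # the error is expected to propagate to the caller
            else:
                return False
        untouched = path.read_text(encoding="utf-8") == '{"q1": "old"}'
        no_leak = not path.with_name(path.name + ".tmp").exists()
        return untouched and no_leak
```

Patching os.replace rather than the filesystem keeps the check portable, since it does not rely on platform-specific permission behavior.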

151/151 tests pass; ruff + mypy clean.

Self-review caveat: the atomic-write helper here is duplicated in #patterns-cache-atomic-save and superseded by #refactor/shared-atomic-write-and-decide-lock (which moves it to socrates120x/_atomic.py). Cleanest merge order: refactor first, then rebase this + patterns-cache to use the shared helper. Functionally correct either way.
